There is a web service that, given a DOI (https://en.wikipedia.org/wiki/Digital_object_identifier), provides a link that can be used to access the PDF of the associated document. The link has the following structure, see: https://libkey.io/libraries/1420/articles/362897792/full-text-file?utm_source=api_50
If you access the link, you are redirected to the PDF document. This works fine when I open the document in a browser, but if I want to download the PDF programmatically, e.g. in Java, I need the direct link to the PDF.
My question: how can I get direct access to the PDF? Is there a library that can simulate the browser in Java? Do you know of other ways to get to the PDF?
If my problem isn't clear enough, please ask me specific questions!
Thanks a lot
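If the redirect chain consists of plain HTTP 3xx responses (no JavaScript involved), any HTTP client that follows redirects can recover the direct link. A minimal sketch in Node.js, reusing the libkey URL from the question; in Java 11+ the same idea works with HttpClient.newBuilder().followRedirects(HttpClient.Redirect.ALWAYS) and HttpResponse.uri(), which returns the final, post-redirect URI:

    // Sketch: follow the redirect chain and read off the final URL.
    // Assumes the service redirects via HTTP status codes, not via page scripts.
    (async () => {
      const start = 'https://libkey.io/libraries/1420/articles/362897792/full-text-file?utm_source=api_50';
      const response = await fetch(start);  // fetch follows 3xx redirects by default
      console.log('Final URL:', response.url);                     // the direct PDF link
      console.log('Type:', response.headers.get('content-type'));  // expect application/pdf
    })();

If the final hop is performed by page JavaScript rather than an HTTP redirect, a headless browser (see the Puppeteer sketches further down) is the fallback.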
Related
There is a website that contains an HTML <video> element when loaded in a browser, but the element isn't present if I just download the page with wget, so I guess it gets inserted by a script that only runs when the page is opened in a browser. I need the video's direct link, in an automated fashion.
Could you please tell me whether I have the right idea, and whether there is a possible solution? Could I, for example, run a browser from the command line, let it load the page and all of the referenced content, and then save the .html file?
You could use headless Chrome, potentially scripted with Puppeteer, for that.
Though, depending on the details, there may be easier options that would get you what you need. It sounds like you're currently trying to scrape a third-party website using wget. Instead of, or in addition to, requesting the .html content with wget, you could request the relevant JavaScript file and extract the video URL from there.
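If you go the headless-browser route, here is a minimal Puppeteer sketch (npm i puppeteer); the page URL is a placeholder, and it assumes the video element exists once the page's scripts have run:

    // Load the page in headless Chrome, let its scripts run, then read the src.
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://example.com/page-with-video', { waitUntil: 'networkidle0' });
      const videoSrc = await page.evaluate(() => {
        const video = document.querySelector('video');
        return video ? video.src : null;  // the direct link, once the script has set it
      });
      console.log('Video URL:', videoSrc);
      await browser.close();
    })();

If the element only appears later, adding page.waitForSelector('video') before the evaluate call is the usual fix.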
I am developing a small labeling tool that, given a URL, should display the document hosted at that URL and allow a user to choose a label for it.
I want to display the contents of the URL for this purpose. As far as I know, I can either fetch the URL's content, parse it, and display it myself, or use an iframe.
Without using a parser
Iframes are blocked for the target URL whose contents I want to display. Is there any other way to do this in JavaScript without using a parser?
Using a parser
I can crawl the contents of the URL, take everything between the <body> tags, and dump it into the page area.
I'm new to JavaScript and front-end development, so I'm not sure whether these are the only options.
Are there other options?
If a parser is the only option, can I dump the HTML that I get from the remote URL straight into my page? I understand that images and other media referenced by the remote URL won't be displayed. Are there any other caveats to this method? More importantly, is it the best way to do this?
Most sites do it via an iframe, as you mentioned; CodePen, for example.
Also, you can use Puppeteer (a headless browser) for this sort of thing: get the contents by scraping, take a screenshot, or print a PDF (see the sketch after the quoted feature list below). It's a pretty nifty library.
Most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples to get you started:
- Generate screenshots and PDFs of pages.
- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
Hope this helps!
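A minimal sketch of the three tasks mentioned above (rendered HTML, screenshot, and PDF); the URL is a placeholder:

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://example.com', { waitUntil: 'networkidle0' });

      const html = await page.content();                   // fully rendered DOM as a string
      await page.screenshot({ path: 'page.png', fullPage: true });
      await page.pdf({ path: 'page.pdf', format: 'A4' });  // page.pdf() requires headless mode

      await browser.close();
    })();

For the labeling tool, the rendered HTML string could then be injected into the display area, sidestepping both the iframe restriction and hand-rolled parsing.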
For a project I am looking for a good way to display PDF files in a web browser (IE8 and newer). The browsers used in my project have Acrobat Reader installed, so that will be the preferred way to visualize the PDF file.
Is there a way to access the Acrobat Reader instance currently open in a div, to (for instance) switch pages or jump to a given bookmark? Is it also possible to listen for text-selection events?
Thanks in advance!
I'm not sure you can control the PDF from outside the page. However, pdf.js is a PDF renderer written in JavaScript. It allows you to embed a PDF viewer inside a page and fully control it, including flipping pages and the like. It may be just what you're looking for!
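A minimal sketch of rendering one page onto a canvas with pdf.js; the exact entry points have shifted between versions, so this follows the API of the prebuilt pdfjsLib distribution, and the file path and canvas id are assumptions:

    // Requires the pdf.js scripts to be loaded; the prebuilt build also needs
    // pdfjsLib.GlobalWorkerOptions.workerSrc pointed at pdf.worker.js.
    pdfjsLib.getDocument('/docs/sample.pdf').promise.then(async (pdf) => {
      const page = await pdf.getPage(3);                 // jump straight to page 3
      const viewport = page.getViewport({ scale: 1.5 });
      const canvas = document.getElementById('viewer');  // <canvas id="viewer"> in the page
      canvas.width = viewport.width;
      canvas.height = viewport.height;
      await page.render({ canvasContext: canvas.getContext('2d'), viewport }).promise;
    });

Because every page is drawn by your own code, "flipping pages" is just calling getPage() with a different number, and text selection can be observed through the library's text-layer facilities.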
Having worked on this problem for some time, I now use the following solution:
Use PDFObject.js to embed the PDF file in my webpage.
Communicate between the PDF and the HTML page using the HostContainer. The important point here is that you must be able to put some JavaScript inside the PDF file.
Note that this only works with the embedded Acrobat Reader/Pro viewer.
See: http://www.javabeat.net/articles/print.php?article_id=301
Good luck. If you encounter problems, just leave a message; perhaps I can help.
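A sketch of the embedding half with PDFObject, plus the HTML-side message hook described in Adobe's HostContainer documentation; the container id and handler are assumptions, and, as noted above, the messaging only works when Acrobat/Reader renders the embed and the PDF itself contains the posting script:

    // PDFObject.embed() returns the element it inserted (or false on failure).
    const pdfEl = PDFObject.embed('doc.pdf', '#pdf-container');

    if (pdfEl) {
      // Receive messages sent from inside the PDF via this.hostContainer.postMessage([...]).
      pdfEl.messageHandler = {
        onMessage: (msgArray) => console.log('From PDF:', msgArray),
        onError: (error) => console.error('PDF messaging error:', error)
      };
    }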
This won't cover all of your feature requests, but you may want to take a look at the PDF Open Parameters. If you open a PDF with the appropriate hash appended to the URL, you can control the reader's behavior.
For example, the following will open a PDF directly at its third page:
http://example.org/doc.pdf#page=3
Let's say, for example, I have a Google search results page open in a window or tab in Firefox. Is there a way I can retrieve the HTML code of that tab or window using JavaScript?
I suppose the webpage's HTML is saved temporarily somewhere in the computer's memory.
Can I load the webpage using that saved memory address?
Is there a way for JavaScript to read HTML files saved in the same folder as the original? For example, say I have saved the webpage to a folder on my computer. If I create an HTML file inside the same folder, does JavaScript consider the saved webpage to be on the same domain?
No, you most certainly can't do that unless you control both pages. This would be a huge security hole.
There is a Custom Search API which may help if you specifically want to do Google searches. It appears to have a JSONP implementation, which should let you make a cross-domain request, but I haven't tried it out, so I'm not sure how well it works.
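A sketch against the Custom Search JSON API; key and cx are placeholders obtained from the Google developer console:

    const key = 'YOUR_API_KEY';   // placeholder
    const cx = 'YOUR_ENGINE_ID';  // placeholder: the custom search engine id
    const q = encodeURIComponent('some search terms');

    fetch(`https://www.googleapis.com/customsearch/v1?key=${key}&cx=${cx}&q=${q}`)
      .then((res) => res.json())
      .then((data) => (data.items || []).forEach((item) => console.log(item.title, item.link)));

This returns structured JSON result objects rather than Google's HTML, which is far more robust than scraping a rendered results page.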
I am trying to scrape some web pages in Python where the information is generated by JavaScript.
I managed to retrieve the information generated on page load by using a headless browser with PyQt4 (example here: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/)
But now I'm trying to retrieve information that is generated when the user clicks a JavaScript link.
How can I do that?
Thanks
I guess you need the Form Extractor example. The trick is that you can expose any Python object to JavaScript and call its methods. A Pythonic version of this example can be found in the PyQt distribution.
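Concretely, PyQt4's QWebFrame offers two hooks: addToJavaScriptWindowObject() exposes a Python object to the page's scripts, and evaluateJavaScript() runs a script inside the loaded page, which is how the click can be simulated. A sketch of the JavaScript one might evaluate; the selector and the 'bridge' name are hypothetical:

    // Run via QWebFrame.evaluateJavaScript(); 'a.show-more' is a made-up selector
    // for the link whose click handler loads the extra content.
    var link = document.querySelector('a.show-more');
    if (link) {
      // Dispatch a synthetic click so the page's own handler runs,
      // using the older createEvent API that QtWebKit-era engines support.
      var evt = document.createEvent('MouseEvents');
      evt.initMouseEvent('click', true, true, window, 0, 0, 0, 0, 0,
                         false, false, false, false, 0, null);
      link.dispatchEvent(evt);
    }
    // If a Python object was exposed with addToJavaScriptWindowObject('bridge', obj),
    // the page can hand results back, e.g.: bridge.receiveHtml(document.body.innerHTML);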