How to extract the dynamically generated HTML from a website - javascript

Is it possible to extract the HTML of a page as it shows in the HTML panel of Firebug or the Chrome DevTools?
I have to crawl a lot of websites, but sometimes the information is not in the static source code: JavaScript runs after the page loads and creates new HTML content dynamically. If I then extract the source code, that content is not there.
I have a web crawler built in Java to do this, but it's using a lot of old libraries. Therefore, I want to move to a Rails/Ruby solution for learning purposes. I already played a bit with Nokogiri and Mechanize.

If the crawler is able to execute JavaScript, you can simply get the dynamically created HTML structure using document.firstElementChild.outerHTML.
Nokogiri and Mechanize are currently not able to execute JavaScript, so they only ever see the static source. See
"Ruby Nokogiri Javascript Parsing" and "How do I use Mechanize to process JavaScript?" for details.
You will need another tool like Watir or Selenium. Those drive a real web browser and can thus handle any JavaScript.
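As an illustration of the browser-driving approach, here is a minimal sketch using Selenium's JavaScript bindings (the Ruby APIs for Watir and Selenium are analogous). It assumes Node with the selenium-webdriver package plus Chrome and chromedriver installed; the '#dynamic' selector is hypothetical:

```js
const { Builder, By, until } = require('selenium-webdriver');

(async () => {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://example.com');
    // Wait until a piece of dynamically created content exists.
    // '#dynamic' is a hypothetical selector - adjust it for the real page.
    await driver.wait(until.elementLocated(By.css('#dynamic')), 10000);
    // The page source now includes the HTML that JavaScript generated.
    const html = await driver.getPageSource();
    console.log(html);
  } finally {
    await driver.quit();
  }
})();
```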

A crawler that only fetches the static HTML can't see records coming from the database side; it only gets the markup the server initially returns.
The page's JavaScript requests those records after the page loads (typically via an XHR/AJAX call), and that request is never triggered by a crawler that doesn't execute JavaScript.
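That said, you can often find the request the JavaScript makes in the browser's DevTools Network tab and call the same endpoint directly. A minimal sketch under that assumption (the endpoint URL and response shape are hypothetical; Node 18+ provides a global fetch):

```js
(async () => {
  // Hypothetical endpoint discovered in the DevTools Network tab - the real
  // URL, parameters, and response shape depend on the site being crawled.
  const res = await fetch('https://example.com/api/records?page=1');
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  const records = await res.json(); // many such endpoints return JSON
  console.log(records);
})();
```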

Related

What is the best way to display webpage content for a given URL using javascript?

I am developing a small labeling tool that, given a URL, should display the document hosted at that URL and allow a user to choose a label for it.
I want to display the contents of the URL for this purpose. As far as I know, I can either fetch the URL's content, parse it, and display it myself, or use an iframe.
Without using a parser
The target URL whose contents I want to display does not allow iframe embedding. Is there any other way to do this using JavaScript, without using a parser?
Using a parser
I can crawl the contents of the URL, get everything between the <body> tags, and dump it into the webpage area.
I'm new to JavaScript and front-end development, so I am not sure whether these are the only options.
Are there other options to do this?
If the parser is the only option, can I dump the HTML that I get from the remote URL as-is? I understand that images and other media referenced by the remote URL may not be displayed. Is there any other caveat to this method? More importantly, is this the best way to do it?
Most sites do it via an iframe, like you mentioned; CodePen, for example.
Also, you can use Puppeteer (a Node library that drives headless Chrome) to do these sorts of things: get the contents by scraping, take a screenshot, or print a PDF. Pretty nifty library. As its documentation puts it:
Most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples to get you started:
Generate screenshots and PDFs of pages.
Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
Automate form submission, UI testing, keyboard input, etc.
Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
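For the "get the contents" part, a minimal Puppeteer sketch (the URL is a placeholder; installing the puppeteer package downloads a compatible Chromium):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // 'networkidle0' waits until the page has stopped making requests,
  // giving dynamically generated content a chance to render.
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });
  const html = await page.content();           // full rendered HTML
  await page.screenshot({ path: 'page.png' }); // or take a screenshot
  await browser.close();
  console.log(html);
})();
```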
Hope this helps!

How to scrape dynamic website - using python scrapy?

I can scrape static websites using Scrapy; however, this other website that I'm trying to scrape has two sections in its HTML, namely "head" and "body onload", and the information that I need is in the "body onload" part. I believe that content is loaded after the HTML is requested, and thus the website is dynamic. Is this doable using Scrapy? What additional tools do I need?
Check out scrapy-splash; it integrates Scrapy with the Splash rendering service, which will allow you to crawl JavaScript-based web sites.
You can also create your own downloader middleware and use Selenium with PhantomJS (example). The downside of this technique is that you lose the concurrency provided by Scrapy.
Anyway, I think Splash is the best way to do this.
Hope this helps.

Finding text from webpage using javascript

I am trying to make a script but I can't really find a solution.
I'm trying to find a string in a website's HTML. The hard part is that I can't use
document.documentElement.innerHTML.search("string")
Since I can't do it locally, I want to use something like this:
var link = "myweb.com"
link.documentElement.innerHTML.search("string")
At the moment, my script generates the link, opens it, and closes it; I just need to search the webpage for the word "error".
JavaScript running inside a client's browser can't actually retrieve another website's HTML for you (unless it is a different page on your own website). You may want to read about the Same-Origin Policy.
You can, however, use JavaScript as a language to do what you want, just not running inside of a browser. You can use something called Node.js, which is simply a program you can use to run JavaScript outside of a browser.
What it really boils down to is that if you want to scrape another website (which is the term for what you are trying to do), you typically need to make a scraper that runs on a server, and not a browser.
To be complete, a (probably shady) way to scrape another website is to:
Have your server-side code fetch another website's contents
Use AJAX to pass the contents to a client's browser
Have the client do all of the processing
Optionally send the scraped information back to your server
Here is a good article on scraping with Node.js.
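As a minimal sketch of the server-side approach (Node 18+ with its global fetch; the exact page path is a placeholder), fetching a page and searching it for the word "error" looks like:

```js
(async () => {
  // Runs under Node (outside the browser), so the Same-Origin Policy
  // does not apply. The URL path is a placeholder.
  const res = await fetch('https://myweb.com/some-page');
  const html = await res.text();
  if (html.includes('error')) {
    console.log('The page contains the word "error".');
  } else {
    console.log('No "error" found.');
  }
})();
```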
If you need it to work just on your own computer, you can make a userscript that will do this easily. If you want it to work as part of a hosted website, you need a server-side solution.

How to load javascript with html fragments into an existing web app

I'm developing a small app designed to embed HTML + JavaScript (the JavaScript manages the behavior of the HTML) into existing websites. My small app is an ASP.Net MVC 3 app. What's the best approach for delivering JavaScript to the web client? I do not have access to the web clients except for giving them the URL to retrieve the HTML/JavaScript. The web clients will be retrieving the HTML/JavaScript using jQuery. jQuery will then load the results of the ASP.Net MVC 3 app into the DOM. Should the JavaScript that's needed to manage the behavior of the embedded HTML simply be a <script> tag at the end of the HTML fragment? Thanks for your time.
If the loading mechanism is in place, and simply inserts the payload of the HTTP request into the DOM somewhere, then including a <script> as the last tag in the payload is probably the best way to go.
Any DOM elements the script depends on should be ready for use when it is executed, and there isn't anything wrong with that technique that I know of.
You could get more sophisticated, but not without complicating your jQuery loading mechanism.
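Concretely, the fragment your app returns could look like this sketch (the element IDs are hypothetical). jQuery executes the trailing script after inserting the markup, so the elements the script touches already exist:

```html
<!-- Hypothetical payload returned by the MVC action -->
<div id="embedded-widget">
  <button id="widget-button">Click me</button>
</div>
<script>
  // #widget-button is already in the DOM at this point, because this
  // script is the last tag in the inserted fragment.
  document.getElementById('widget-button').addEventListener('click', function () {
    alert('widget clicked');
  });
</script>
```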

PyQt4: Trigger click on a javascript link

I am trying to scrape some web pages in Python where the information is generated by JavaScript.
I managed to retrieve the information generated on page load by using a headless browser with PyQt4 (example here: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/)
But now, I'm trying to retrieve some information that is generated by having the user click on a Javascript link.
How can I do that?
Thanks
I guess you need the Form Extractor example. The trick is that you can expose any Python object to JavaScript and call its methods. A Pythonic version of this example can be found in the PyQt distribution.
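For the click itself, one common trick (an assumption here, not part of the Form Extractor example) is to evaluate a snippet of JavaScript in the loaded frame, e.g. via QWebFrame.evaluateJavaScript in PyQt4. The snippet dispatches a synthetic click on the link (the selector is hypothetical):

```js
// Hypothetical selector - adjust it to match the link that loads the data.
var link = document.querySelector('a.load-more');
// Old WebKit builds may ignore link.click(), so dispatch a real mouse event.
var evt = document.createEvent('MouseEvents');
evt.initEvent('click', true, true); // type, bubbles, cancelable
link.dispatchEvent(evt);
```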
