How to scrape a dynamic website using Python Scrapy?

I can scrape a static website using Scrapy; however, this other website that I'm trying to scrape has two sections in its HTML, namely "head" and "body onload", and the information that I need is in the "body onload" part. I believe that content is loaded after the HTML is requested, and thus the website is dynamic. Is this doable using Scrapy? What additional tools do I need?

Check out scrapy-splash; it integrates the Splash rendering service with Scrapy and will allow you to crawl JavaScript-based web sites.
You can also create your own downloader middleware and use Selenium with PhantomJS (example). The downside of this technique is that you lose the concurrency provided by Scrapy.
Anyway, I think Splash is the best way to do this.
Hope this helps.
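
For illustration, a Splash-backed spider might look like the minimal sketch below. It is only a sketch: it assumes Splash is running locally (for example via its Docker image on port 8050), that scrapy-splash is installed with SPLASH_URL and its middlewares configured in settings.py as described in the scrapy-splash README, and it uses http://example.com as a placeholder URL.

    import scrapy
    from scrapy_splash import SplashRequest


    class DynamicSpider(scrapy.Spider):
        # Hypothetical spider name and URL, used only for illustration.
        name = "dynamic_example"

        def start_requests(self):
            # Route the request through Splash so that the scripts triggered
            # by "body onload" run before the response reaches the spider.
            yield SplashRequest(
                "http://example.com",
                callback=self.parse,
                args={"wait": 2},  # give the onload scripts time to finish
            )

        def parse(self, response):
            # response.text now holds the rendered HTML, so normal CSS/XPath
            # selectors can see the dynamically generated content.
            yield {"title": response.css("title::text").get()}

The wait value is just a guess; tune it (or switch to a Splash Lua script) depending on how long the page takes to render.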

Related

Vanilla JavaScript version of Google's UrlFetchApp

I'm working on an HTML file in a Google Apps Script project right now, and I want it to be able to retrieve the HTML content of a web page, extract a snippet, and paste the snippet into the page. I've tried using fetch() and a couple of its options (mostly CORS), but I either get nothing back or an error that says "No 'Access-Control-Allow-Origin' header is present on the requested resource." A workaround I found was using google.script.run.withSuccessHandler() to return the HTML content via UrlFetchApp.fetch(url).getContentText(). The problem is that this is time consuming, and I want to optimize it as much as possible. Does anyone have any suggestions on how to make this work, or will I be forced to stick with my workaround?
Using "Vanilla JavaScript" in a Google Apps Script web application might not work to
retrieve the html content of a web page, extract a snippet, and paste the snippet to the page.
The above because the client-side code of Google Apps Script web applications is embedded in an iframe tag that can't be modified from server-side code. In other words, Google Apps Script is not universally suitable web development platform. One alternative is to use another platform to create your web application, i.e. GitHub Pages, Firebase, etc.
Related
Where is my iframe in the published web application/sidebar?

I need help creating a Windows 7 Gadget

I need to create a Windows 7 Gadget (or Widget) as a mini project. I know how to create a basic HelloWorld gadget (including the XML manifest and the HTML page), but I do not know how to make a complex one.
My company uses a bug tracking software (say, XYZ). My widget needs to be able to access and display data from XYZ regarding bugs, given a bug ID, or other search criteria.
I currently have the APPGUID and server name for XYZ.
Please help. I do not know where to start.
If your bug tracking software (XYZ) is a web application, then you either need to use its web service or scrape the site to access the data about the bugs. You can scrape the site using the Simple HTML DOM PHP library.
An example can be seen in "PHP Simple HTML DOM Scrape External URL".
The library can be downloaded from http://sourceforge.net/projects/simplehtmldom/files/
You can then scrape the data and display it as normal HTML.
Otherwise, you will have to use the web service provided by the XYZ application.

How to extract the dynamically generated HTML from a website

Is it possible to extract the HTML of a page as it is shown in the HTML panel of Firebug or Chrome DevTools?
I have to crawl a lot of websites but sometimes the information is not in the static source code, a JavaScript runs after the page is loaded and creates some new HTML content dynamically. If I then extract the source code, these contents are not there.
I have a web crawler built in Java to do this, but it's using a lot of old libraries. Therefore, I want to move to a Rails/Ruby solution for learning purposes. I already played a bit with Nokogiri and Mechanize.
If the crawler is able to execute JavaScript, you can simply get the dynamically created HTML structure using document.firstElementChild.outerHTML.
Nokogiri and Mechanize are currently not able to execute JavaScript. See
"Ruby Nokogiri Javascript Parsing" and "How do I use Mechanize to process JavaScript?" for details.
You will need another tool such as Watir or Selenium. These drive a real web browser and can therefore handle any JavaScript.
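
For illustration only, here is a minimal sketch using Selenium's Python bindings (the same WebDriver API is available for Ruby through the selenium-webdriver gem). It assumes a local Chrome/chromedriver install and uses http://example.com as a placeholder URL.

    # Minimal sketch, not production code: drive a real browser so the page's
    # JavaScript runs, then read back the rendered DOM.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes chromedriver is on the PATH
    try:
        driver.get("http://example.com")  # placeholder URL
        # Equivalent to the document.firstElementChild.outerHTML approach above:
        rendered_html = driver.execute_script(
            "return document.firstElementChild.outerHTML;"
        )
        print(rendered_html)
    finally:
        driver.quit()

At that point rendered_html is just static markup, so it can be handed to Nokogiri (or any other parser) as usual.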
You can't fetch the records coming from the database side; you can only fetch the HTML code, which is static.
The JavaScript must be requesting the records from the database with a separate query request, which the crawler can't fetch.

Loading an external div using an iFrame alternative

I'm currently designing an eBay listing template for a client, in which I have managed to load JavaScript using a loader (www.test4.solowebs.co.uk). The JavaScript works in eBay without giving me an error, which is fantastic, as I used a loader rather than a script. What I want is this: at the bottom of the listing I want an external div to load from another site (my own site, test4.solowebs.co.uk/featured.html), which contains two offers. The result is that I can change the offers on that page and they will automatically load into all of my listings, a bit like an iFrame. (The only problem is that eBay does not allow iFrames, and I haven't been able to find a workaround.)
The question: is there a way I can load an iFrame using JavaScript? Or an alternative method that will allow me to load a div from an external page? (As eBay hosts the HTML, it will have to be cross-domain compatible!) I'm an expert in HTML and CSS and OK at JavaScript, so anything beyond that will need detailed implementation instructions.
Thank you in advance!
Use AJAX, or technically speaking: XMLHttpRequest.
Here's an article on SO about an in-line cross-browser XMLHttpRequest script: Cross-Browser XMLHttpRequest
Use it to populate your div asynchronously. Hope this helps.

How to load JavaScript with HTML fragments into an existing web app

I'm developing a small app designed to embed HTML + JavaScript (the JavaScript manages the behavior of the HTML) into existing websites. My small app is an ASP.NET MVC 3 app. What's the best approach for delivering the JavaScript to the web client? I do not have access to the web clients except for giving them the URL to retrieve the HTML/JavaScript. The web clients will retrieve the HTML/JavaScript using jQuery, and jQuery will then load the result of the ASP.NET MVC 3 app into the DOM. Should the JavaScript that's needed to manage the behavior of the embedded HTML simply be a <script> block at the end of the HTML fragment? Thanks for your time.
If the loading mechanism is in place, and simply inserts the payload of the HTTP request into the DOM somewhere, then including a <script> as the last tag in the payload is probably the best way to go.
Any DOM elements the script depends on should be ready for use when it is executed, and there isn't anything wrong with that technique that I know of.
You could get more sophisticated, but not without complicating your jQuery loading mechanism.
