How do I scrape data generated with JavaScript using BeautifulSoup?

I'm trying to migrate some comments from a blog using web scraping with Python and BeautifulSoup. The content I'm looking for isn't in the HTML itself and seems to be generated by a script tag (which I can't find). I've seen some answers about this, but most of them are specific to a certain problem and I can't figure out how to apply them to my site. I'm just trying to scrape comments from pages like this one:
http://www.themasterpiececards.com/famous-paintings-reviewed/bid/92327/famous-paintings-duccio-s-maesta
I've also tried Selenium, but I'm using a Cloud9-based IDE currently and it doesn't seem to support web drivers.
I apologize if I botched any of the lingo, I'm pretty new to programming. If anyone has any tips, that would be helpful. Thanks!

You have several ways to scrape such content. One is to find out how the comments are loaded on this website. A quick look in the Chromium developer tools shows that the comments for the page you mentioned are loaded via a separate API call.
This may not suit you, though, since you would have to work out how to construct that URL for every different page.
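Purely as an illustration of replaying such a call directly, here is a short sketch using the requests library; the endpoint and JSON field names below are hypothetical, so copy the real request URL and payload shape from the Network tab:

import requests

# Hypothetical endpoint -- copy the real request URL from the
# DevTools Network tab (filter by XHR/Fetch) on the page in question.
api_url = "https://example.com/api/comments?post_id=92327"

data = requests.get(api_url).json()

# The field names are guesses; adjust them to match the real payload.
for comment in data.get("comments", []):
    print(comment.get("author"), "-", comment.get("text"))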
Another, more reliable way is to render such JS content in a headless browser. For ease of implementation I would suggest using Scrapy with Splash. Splash is a JavaScript rendering service that executes a page's scripts and returns the fully rendered content for your requests.
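Here is a minimal sketch of that approach, assuming a Splash instance is running locally (e.g. via docker run -p 8050:8050 scrapinghub/splash) and that scrapy-splash is installed and wired into the project settings as its README describes; the CSS selector is a placeholder you would replace after inspecting the rendered page:

import scrapy
from scrapy_splash import SplashRequest

class CommentSpider(scrapy.Spider):
    name = "comments"

    def start_requests(self):
        url = ("http://www.themasterpiececards.com/famous-paintings-reviewed"
               "/bid/92327/famous-paintings-duccio-s-maesta")
        # Ask Splash to render the page (running its JavaScript) first;
        # 'wait' gives the comment script time to finish.
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # ".comment-body" is a guessed selector -- inspect the rendered
        # HTML to find the real one.
        for comment in response.css(".comment-body"):
            yield {"text": " ".join(comment.css("::text").getall()).strip()}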

Related

is it possible to do some simple web scraping in chrome extension?

Thanks in advance, and I'm sorry if this might not be a well-formed question; I am relatively new to CS and Stack Overflow.
I am hoping to make a simple Chrome extension which overrides the new tab page to display some simple data collected from a couple of websites. I am wondering if it is possible to web scrape within basic JS or the Chrome APIs. Any information or guidance would be greatly appreciated; I have been trying to research the subject and haven't found any recent or clear answers.
Thanks for your help!
Here is an older stackoverflow question asking the same question but I wasn't able to make any progress from the answers.
Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)
Absolutely, and not just simple scraping.
If you think about it, using the browser itself is as close as possible to replicating a real user session. You don't have to worry about manually setting cookies or discovering and constructing JSON HTTP requests; the browser does it all for you. After a page has been rendered (with or without JavaScript) you can access the DOM and extract any content you like.
Take a look at https://github.com/get-set-fetch/extension, an open source browser extension that does more than just basic scraping. It supports infinite scrolling, clicking and extracting content from single page javascript apps.
Disclaimer: I'm the extension author.
If you're serious about the subject start by developing a simple chrome extension (from my own experience Chrome throws more verbose extension errors than Firefox): https://developer.chrome.com/extensions/getstarted
Take a look afterwards at the main get-set-fetch background plugins: FetchPlugin (loads a URL in a tab and waits for the DOM to stabilize), ExtractUrlPlugin (identifies additional URLs to scrape from the current URL), ExtractHtmlContentPlugin (actual scraping based on CSS selectors).
There are downsides, though. It's a lot easier to run a scraping script in your favorite language and dump the scraped content into a database than to automatically start the browser, load the extension, control the extension, export the scraped data to a format like CSV, and import that data into a database.
In my opinion, it only makes sense to use a browser extension if you don’t want to automate the data extraction or the page you’re trying to scrape is so javascript heavy it’s easier to automate the extension than writing a scraping script.
I am wondering if it is possible to web scrape within the basic JS or chrome API's?
Yes, use fetch to call REST APIs.

use of google script editor

Hey, I'm currently working on my first personal project, so bear with the questions!
I'm trying to create a JavaScript program that will parse info from Google Forms to produce slides displaying that info. From my research so far, the best way I've found to facilitate this process is Google's Apps Script editor. However, I was wondering if I can run this code by requesting it from a different JavaScript (or maybe even Java) program that I will write in WebStorm. If I can't do this, what is the best way to utilize the Google Apps Script editor?
Thanks!
Google Apps Script is just javascript with extra built-in APIs (like SpreadsheetApp, FormApp, etc.).
It also has a UrlFetchApp API.
So you can run code like this:
// The code below logs the HTML code of the Google home page.
var response = UrlFetchApp.fetch("http://www.google.com/");
Logger.log(response.getContentText());
As such, if you want to provide JavaScript from elsewhere, you could fetch it and then eval it on the Google Apps Script side. (but we all know how tricky eval can get)
One other option is to have your own server side written using Google App Engine (or any other framework), use Google's OAuth, and authorize your app to fetch data from the form.
Slides and Google Apps Script
You might like to take a look at the add-on "Slides Merge" by Bruce McPherson. I've never used it, but it sounds like it might work for you.
Getting information from Google Forms is a snap with Google Apps Script, since you can link the form right up to a spreadsheet. The Google Apps Script documentation is really quite good these days; here's the documentation link. Google Apps Script is loosely based on JavaScript 1.6. If you're already a programmer, my guess is that you'll have few problems learning to use it. In my experience the most difficult thing was dealing with the arrays of arrays produced by the getValues() method of ranges, and I made a short video that might be of some help to you.
I also have a script that I wrote in Google Apps Script that produces a 'sheet show', that is, a slide show inside of a spreadsheet.
I've found that using the Script Editor is pretty easy. There's some documentation in the support section. It can be a bit buggy at times, but overall I think it's a pretty good tool.

How Can I Scrape Data From Websites That Don't Return Simple HTML

I have been using requests and BeautifulSoup in Python to scrape HTML from basic websites, but most modern websites don't just deliver HTML as a result. I believe they run JavaScript or something (I'm not very familiar, sort of a noob here). I was wondering if anyone knows how to, say, search for a flight on Google Flights and scrape the top result, aka the cheapest price?
If this were simple HTML, I could just parse the HTML tree and find the text result, but this does not appear when you view the page source. If you inspect the element in your browser, you can see the price inside HTML tags as if you were looking at the regular page source of a basic website.
What is going on here, such that inspecting the element shows the HTML but the page source doesn't? And does anyone know how to scrape this kind of data?
Thanks so much!
You're spot on: the page markup is being added with JavaScript after the initial server response. I haven't used BeautifulSoup, but from its documentation it looks like it doesn't execute JavaScript, so you're out of luck on that front.
You might try Selenium, which is basically a virtual browser; people use it for front-end testing. It executes JavaScript, so it might be able to give you what you want.
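A minimal Selenium sketch along those lines, assuming Chrome and a matching chromedriver are installed; the URL and CSS selector are placeholders you would replace after inspecting the live results page:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run without a visible browser window

driver = webdriver.Chrome(options=options)
try:
    # Placeholder URL -- use the real flight-search results URL.
    driver.get("https://www.example.com/flights?from=JFK&to=LAX")
    # Placeholder selector -- inspect the rendered page to find the
    # element holding the top (cheapest) price.
    price = driver.find_element(By.CSS_SELECTOR, ".result-price").text
    print(price)
finally:
    driver.quit()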
But if you're specifically looking for Google Flights information, there's an API for that :) https://developers.google.com/qpx-express/v1/
You might consider using Scrapy, which will allow you to scrape a page along with a lot of other spider functionality. Scrapy has a great integration with Splash, a library you can use to execute the JavaScript in a page. Splash can be used stand-alone, or you can drive it from Scrapy with the scrapy-splash plugin.
Note that Splash essentially runs its own server to do the JavaScript execution, so it's something that runs alongside your main script and gets called from it. Scrapy manages this via 'middleware', a set of processes that run on every request: in your case you would fetch the page, run the JavaScript in Splash, and then parse the results.
This may be a slightly lighter-weight option than plugging into Selenium or the like, especially if all you're trying to do is render the page rather than render it and then interact with various parts in an automated fashion.
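To illustrate the stand-alone use mentioned above, here is a small sketch that asks a local Splash instance (assumed to be running, e.g. via docker run -p 8050:8050 scrapinghub/splash) to render a placeholder page and then parses the result with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

# Ask Splash's HTTP API to load the page, execute its JavaScript,
# and return the rendered HTML; 'wait' gives scripts time to finish.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://www.example.com", "wait": 2},
)

soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title)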

Scraping and External Resources

I've just started to learn about scraping and I just had a quick question.
Scraping images and files through the DOM is no problem, but I was curious whether it is possible to scrape external resources linked to a document, such as web fonts (sorry, couldn't think of another example off the top of my head). Things like this are used within the page but not linked through typical means.
Could anyone tell me if such things are possible? I only know Ruby and a bit of JS. Also, if you can give me other examples of resources like web fonts that aren't linked normally, that would be cool too.
Thanks.

can search engines crawl pure javascript apps?

There's a lot of movement in UI toward pure JavaScript front-ends like Backbone.js or JavaScriptMVC. I know Google has some guidelines for adding #hash fragments to your URLs to make them crawlable, but I'm curious whether they can still crawl apps that don't follow this guideline.
I'm debating whether to use a template engine on the server side or just use a pure JavaScript solution with JSON requests to an API. I want people to find pages on my site when searching.
Yes, they can, if you describe for them how to do it. Google has a detailed answer on this.
You may use hashbang URLs with HTML snapshots of every dynamic state.
There are many other ways to get a pure JS site crawled. You only need to choose the one that best fits your needs.
Distal Templates makes a static webpage, which is indexable by Google, into a dynamic page using JSON and templates.
