I'm trying to use Nokogiri to parse this ASCAP website to retrieve some song/artist information. Here's an example of what I'd want to query
https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z
I can't seem to access the DOM properly because the source seems to be hidden behind some kind of JavaScript. I'm pretty new to web scraping so it has been pretty difficult trying to find a way to do this. I tried using Charles to see if data was being drawn from another site, and have been using XHelper to generate accurate XPath queries.
This returns nil, where it should return "1, 2 YA'LL"
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open('https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z'))
puts page.xpath('/html/body/div[@id="desktopSearch"]/div[@id="ace"]/div[@id="aceMain"]/div[@id="aceResults"]/ul[@id="ace_list"]/li[@class="nav"][1]/div[@class="workTitle"]').text
Step #1 when spidering/scraping is to turn off JavaScript in your browser, then look at the page. What you see at that point is what Nokogiri sees. If the data you want is visible, then odds are really good you can get at it with a parser.
At that point, do NOT rely on the XPath or CSS selector a browser shows you when you inspect an element to give you the path to the node(s) you want. Browsers do a lot of fix-ups when displaying a page, and the DOM you see in the inspector reflects those fix-ups, including data retrieved dynamically. In other words, the browser is lying to you about what it originally retrieved from the page. To work around that, use wget, curl or nokogiri http://some_URL at the command line to retrieve the original page, then locate the node you want.
If you don't see the node you want, then you're going to need to use other tools, such as something from the Watir suite, which lets you drive a browser which understands JavaScript. A browser can retrieve a page, interpret the JavaScript, and retrieve any dynamic page content. Then you should be able to get at the markup and pass it to Nokogiri.
I used the Google inspector tools to log the XMLHttpRequests and was easily able to figure out where the data was actually being loaded from. Thanks to @NickVeys!
I am a beginner in Python 3.6 using BeautifulSoup to perform web scraping.
Once I have run requests.get() and prettified the output, I notice that the webpage does not return the values; it seems to be returning code related to the values instead.
Here is the link to the specific webpage:
http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=AngeliqueKerber&f=r1
I am trying to extract which hand the player plays with in tennis.
I would appreciate feedback on the outline of this question; if it is confusing (or non-standard), such feedback will help me ask questions appropriately in the future.
There are two options (mostly).
The first one is easier but slower: browser emulation. You just try to use the site as a normal user would, with a browser. There is a Python module for this task: selenium. It uses a specific webdriver to drive the browser, and there are plenty of webdrivers available (for example, chromedriver for Chrome). There are also headless solutions (PhantomJS, for example).
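For instance, here is a minimal sketch of the browser-emulation route, assuming chromedriver is installed and using the page from the question; the crude sleep is only there to let the page's scripts finish:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

# Drive a real browser so the page's JavaScript actually runs.
driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=AngeliqueKerber&f=r1")
time.sleep(3)  # crude wait; WebDriverWait is the more robust option

# The rendered DOM can now be handed to BeautifulSoup as usual.
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.get_text() if soup.title else "no title found")

driver.quit()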
The other way is smarter and faster: XMLHttpRequests (XHRs). Basically, the site uses some hidden API to fetch info via JavaScript, and you try to find out exactly how. In most cases you can use the Inspect Element toolbox of your browser: switch to its Network tab, clear it, and try to get the results. Then filter it to show only XHRs. Such a request usually returns JSON-based values that are easily converted into a Python dictionary using the json() method of the Response object.
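A rough sketch of the XHR approach; the endpoint and parameters below are placeholders, not the site's real API, and need to be replaced with whatever shows up under the XHR filter of your Network tab:

import requests

# Placeholder endpoint: substitute the real URL and query parameters you see
# in the browser's Network tab (XHR filter).
response = requests.get(
    "https://example.com/api/player-details",
    params={"player": "AngeliqueKerber"},
)
response.raise_for_status()
data = response.json()  # the JSON body becomes plain Python dicts/lists
print(data)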
Here's a really great GitHub repository that someone made for this website; it's practically an API. You can fork it, change/edit a few things, and then use it the way you want to.
HERE
It uses the Selenium webdriver, but it's high quality.
I have been using requests and BeautifulSoup for Python to scrape HTML from basic websites, but most modern websites don't just deliver HTML as the result. I believe they run JavaScript or something (I'm not very familiar with it, sort of a noob here). I was wondering if anyone knows how to, say, search for a flight on Google Flights and scrape the top result, aka the cheapest price?
If this were simple HTML, I could just parse the HTML tree and find the text result, but this price does not appear when you view the "page source". If you inspect the element in your browser, you can see the price inside HTML tags as if you were looking at the regular page source of a basic website.
What is going on here that the inspect element has the html but the page source doesn't? And does anyone know how to scrape this kind of data?
Thanks so much!
You're spot on -- the page markup is getting added with javascript after the initial server response. I haven't used BeautifulSoup, but from its documentation, it looks like it doesn't execute javascript, so you're out of luck on that front.
You might try Selenium, which is basically a virtual browser -- people use it for front-end testing. It executes javascript, so it might be able to give you what you want.
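A rough sketch of that idea; the URL is just the Google Flights landing page and the CSS selector is a made-up placeholder, since the real markup changes often and has to be inspected in the live page:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

# Render the page in a real browser, then parse the resulting HTML.
driver = webdriver.Chrome()  # assumes chromedriver is installed
driver.get("https://www.google.com/flights")  # run your flight search here or use a search URL
time.sleep(5)  # crude wait for the JavaScript to finish; WebDriverWait is more robust

soup = BeautifulSoup(driver.page_source, "html.parser")

# Hypothetical selector: inspect the rendered page to find the element that
# actually holds the cheapest fare.
price = soup.select_one(".flight-price")
print(price.get_text(strip=True) if price else "price element not found")

driver.quit()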
But if you're specifically looking for Google Flights information, there's an API for that :) https://developers.google.com/qpx-express/v1/
You might consider using Scrapy, which will allow you to scrape a page along with a lot of other spider functionality. Scrapy has a great integration with Splash, which is a service you can use to execute the JavaScript in a page. Splash can be used stand-alone, or you can get the Scrapy-Splash plugin.
Note that Splash essentially runs its own server to do the JavaScript execution, so it's something that runs alongside your main script and gets called from it. Scrapy manages this via 'middleware', a set of processes that run on every request: in your case you would fetch the page, run the JavaScript in Splash, and then parse the results.
This may be a slightly lighter-weight option than plugging into Selenium or the like, especially if all you're trying to do is render the page rather than render it and then interact with various parts in an automated fashion.
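A minimal Scrapy + Splash sketch, assuming a Splash instance is running locally on port 8050 (e.g. via Docker); the target URL is a placeholder, and the middleware settings are the ones listed in the scrapy-splash README:

import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class RenderedSpider(scrapy.Spider):
    name = "rendered"

    # Configuration from the scrapy-splash README; SPLASH_URL points at the
    # locally running Splash service.
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        # args={"wait": 2} gives the page's JavaScript roughly two seconds to run in Splash.
        yield SplashRequest(
            "https://example.com/js-heavy-page",  # placeholder URL
            callback=self.parse,
            args={"wait": 2},
        )

    def parse(self, response):
        # response.text is the rendered HTML, so ordinary selectors work here.
        yield {"title": response.css("title::text").get()}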
So I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, it's not the entire page, because some of it is generated via JavaScript. Is there any way to get around this?
There are basically two main options to proceed with:
using browser developer tools, see what AJAX requests are being made to load the page and simulate them in your script; you will probably need to use the json module to load the response JSON string into a Python data structure
use tools like selenium that open up a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS
The first option is more difficult to implement and is, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
The second option is better in the sense that you get what any other real user gets, and you don't need to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page; you may not need BeautifulSoup at all. But, either way, this option is slower than the first one.
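A minimal headless sketch of the second option, using PhantomJS as in the article linked above (the URL is a placeholder; newer Selenium releases prefer headless Chrome or Firefox instead):

from selenium import webdriver
from bs4 import BeautifulSoup

# Assumes the phantomjs binary is on PATH; older Selenium versions ship this driver.
driver = webdriver.PhantomJS()
driver.get("https://example.com/page-built-by-javascript")  # placeholder URL

# The fully rendered DOM can now be parsed with BeautifulSoup, or queried via Selenium itself.
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.get_text()[:200])  # first 200 characters of the rendered text, just to show it worked

driver.quit()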
Hope that helps.
I have been stuck on quite a tricky issue for a couple of days now. I have an auto-generated HTML page stored as a string in a variable in node.js. I need to find the height of some HTML elements and do HTML manipulations (like tag creation, deletion, appending, CSS attribute setting, etc.).
Obviously I need to make a DOM like structure of my HTML page first and then proceed.
While for the HTML manipulations I have many options like cheerio, node.io, jsdom, etc., none of these allow me to find the height of an element on the Node side.
So after wasting quite a lot of time on it, I have decided to look for heavier solutions, something like running a headless browser (PhantomJS, etc.) on the Node side and reading an element's offsetHeight through plain JavaScript.
Can anyone tell me if it is possible to reach my objective like this? What headless browser will be best suited for this task?
If I am going in the wrong direction, can anyone suggest any other working solution?
At this point I am ready to try anything.
Thanks in advance!!
Note: Using JavaScript on the client side has many problems in my particular case, because the contents of the generated HTML page are supposed to be pasted by the client into his website. Leaving behind running JavaScript that restructures the HTML would make things difficult on his end.
Node's server-side HTML libraries (like cheerio and jsdom) are strictly DOM API emulation libraries. They do not attempt to actually render a document, which is necessary to compute element size and position.
If you really need to calculate the size of an element on the server, you need a headless browser like PhantomJS. It is a full WebKit renderer with a JavaScript API. It is entirely separate from node, so you either need to write a utility script using Phantom's API, or use an npm module that lets you control Phantom from node.
After reading the comments under your question, it is pretty clear that you should not be calculating heights on the server. Client-side code is the proper place to do it.
I'm having some trouble figuring out how to make the "page load" architecture of a website.
The basic idea is that I would use XSLT to present it, but instead of doing it the classic way with the XSL tags, I would do it with JavaScript. Each link would therefore refer to a JavaScript function that changes the content and menus of the page.
The reason I want to do it this way is to have the option of letting JavaScript dynamically show each page using the data provided in the first, initial XML file, instead of making a "complete" server request for the specific page, which simply has too many downsides.
The basic problem is that, after searching the web for a solution to access the "underlying" XML of the document with JavaScript, I only find solutions for accessing external XML files.
I could of course just "print" all the XML data into a JavaScript array fully declared in the document header, but I believe this would be a very, very nasty solution. And ugly, for that matter.
My questions therefore are:
Is it even possible to do what I'm thinking of?
Would it be SEO-friendly to have all the website pages' content loaded initially in the XML file?
My alternative would be to dynamically load each specific page's content on demand using AJAX. However, I find it difficult to see how that could be the least bit SEO-friendly. I can't imagine that a search engine would execute any JavaScript.
I'm very sorry if this is unclear, but it's really freaking me out.
Thanks in advance.
Is it even possible to do what I'm thinking of?
Sure.
Would it be SEO-friendly to have all the website pages' content loaded initially in the XML file?
No, it would be total insanity.
I can't imagine that a search engine would execute any JavaScript.
Well, quite. It's also pretty bad for accessibility: non-JS browsers, or browsers with a slight difference in JS implementation (eg new reserved words) that causes your script to have an error and boom! no page. And unless you provide proper navigation through hash links, usability will be terrible too.
All-JavaScript in-page content creation can be useful for raw web applications (infamously, GMail), but for a content-driven site it would be largely disastrous. You'd essentially have to build up the same pages from the client side for JS browsers and the server side for all other agents, at which point you've lost the advantage of doing it all on the client.
Probably better to do it like SO: primarily HTML-based, but with client-side progressive enhancement to do useful tasks like checking the server for updates and showing the "this question has new answers" notice.
Maybe the following scenario works for you:
1. A browser requests your XML file.
2. Once loaded, the XSLT associated with the XML file is executed. Result: your initial HTML is output together with a script tag.
3. In the JavaScript, an AJAX call to the current location is made to get the "underlying" XML DOM. From then on, your JavaScript manages all the XML processing.
4. You make sure that in step 3 the XML is not loaded from the server again but is taken from the browser cache.
That's it.