I am currently using selenium with PhantomJS to scrape javascript generated content from a web page. While this does get me the results I am looking for, it is a slow approach as I need to wait for the page to load before scraping. Is there a way to directly run the javascript that generates the content I am looking for? If there is, will it be a faster approach than I am currently using?
Thanks!
Unfortunately, there is not. I've encountered this problem several times and the only solution I've come up with is to approach the problem the way you are already doing.
Because the content is generated by JavaScript, the only way to fetch it is to render the page in a browser, which means using Selenium with whichever driver you prefer.
So there is a problem with JavaScript and requests (in Python): requests does not execute JavaScript when it fetches a webpage.
The website I'm working with (https://access.paylocity.com/) requires JavaScript and without it, it changes the content of the page to just a text at the top saying, "Please enable JavaScript to view the page content."
(I could be wrong here but) I think one solution is the use of Selenium, but that would replace requests which I'm fine with as long as there are no other ways of fixing/bypassing this JavaScript detection.
(For those wondering, this Python project of mine is supposed to automatically fetch the events on the Paylocity calendar, then port those events to another calendar that I use every day. It's also just intended for myself.)
Edit: Here is the code I have if that will help https://pastecode.io/s/GXTUO1BgtR (I didn't know where to paste my code, so I decided on that website. If I should change it, please comment or say something about it.)
Since the website you're working with loads its content dynamically with JavaScript, as far as I can tell, I think you have no choice but to use Selenium. I had a project of my own a couple of weeks ago and ran into a similar problem, which I also solved using Selenium. But I'm no expert; I'm just sharing my thoughts on this.
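For what it's worth, here is a minimal sketch of what the Selenium route can look like. It assumes Selenium 4 with a local Chrome install (PhantomJS support was removed in Selenium 4, so headless Chrome stands in for it here). The function name `fetch_rendered_html` and the CSS selector you pass in are made up for illustration; you would pick a selector for an element that only exists after the page's JavaScript has run.

```python
def fetch_rendered_html(url, css_selector, timeout=15):
    """Load `url` in headless Chrome and return the HTML once `css_selector` appears."""
    # Imported lazily so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JavaScript-rendered element exists;
        # raises TimeoutException if it never shows up.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait timed out.
        driver.quit()
```

The returned string can be handed to BeautifulSoup or any other parser as usual. Waiting for one specific element instead of sleeping for a fixed time is usually what keeps this approach from being slower than it has to be.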
I have been using requests and BeautifulSoup for python to scrape html from basic websites, but most modern websites don't just deliver html as a result. I believe they run javascript or something (I'm not very familiar, sort of a noob here). I was wondering if anyone knows how to, say, search for a flight on google flights and scrape the top result aka the cheapest price??
If this were simple html, I could just parse the html tree and find the text result, but this does not appear when you view the "page source". If you inspect the element in your browser, you can see the price inside html tags as if you were looking at the regular page source of a basic website.
What is going on here that the inspect element has the html but the page source doesn't? And does anyone know how to scrape this kind of data?
Thanks so much!
You're spot on -- the page markup is getting added with javascript after the initial server response. I haven't used BeautifulSoup, but from its documentation, it looks like it doesn't execute javascript, so you're out of luck on that front.
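To make the difference concrete, here is a small sketch using only the standard library's `html.parser`. The `SERVER_HTML` string is an invented stand-in for what the server sends (what "view source" shows); the `<script>` in it would fill in the price in a real browser, but a static parser never runs it.

```python
from html.parser import HTMLParser

# Hypothetical server response: the price element is empty until the script runs.
SERVER_HTML = """
<html><body>
  <span id="price"></span>
  <script>document.getElementById("price").textContent = "$199";</script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text directly inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the capture (no nested elements).
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


def static_text(html, element_id):
    """Return the text a non-JavaScript parser sees for `element_id`."""
    parser = TextById(element_id)
    parser.feed(html)
    return parser.text.strip()


# The script is treated as plain text and never executed, so the span stays empty.
print(repr(static_text(SERVER_HTML, "price")))
```

"Inspect element" shows the DOM after the script has run, which is why the price is visible there but not in the raw response.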
You might try Selenium, which is basically a virtual browser -- people use it for front-end testing. It executes javascript, so it might be able to give you what you want.
But if you're specifically looking for Google Flights information, there's an API for that :) https://developers.google.com/qpx-express/v1/
You might consider using Scrapy, which will allow you to scrape a page, along with a lot of other spider functionality. Scrapy integrates well with Splash, a rendering service you can use to execute the JavaScript in a page. Splash can be used stand-alone, or you can use it from Scrapy through the Scrapy-Splash plugin.
Note that Splash essentially runs its own server to do the JavaScript execution, so it's something that would run alongside your main script and would be called from it. Scrapy manages this via 'middleware', a set of processes that run on every request: in your case you would fetch the page, run the JavaScript in Splash, and then parse the results.
This may be a slightly lighter-weight option than plugging into Selenium or the like, especially if all you're trying to do is render the page rather than render it and then interact with various parts in an automated fashion.
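As a rough sketch of the stand-alone route, the snippet below calls Splash's `render.html` HTTP endpoint using only the standard library. It assumes a Splash instance is already running and reachable at `http://localhost:8050` (the usual default port); the helper names `splash_render_url` and `fetch_rendered` are my own, not part of Splash.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(splash_base, page_url, wait=0.5):
    """Build a URL for Splash's render.html endpoint.

    `wait` is the number of seconds Splash waits after the page loads,
    giving the page's JavaScript time to run.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


def fetch_rendered(splash_base, page_url, wait=0.5, timeout=30):
    """Ask Splash to load the page, run its JavaScript, and return the HTML."""
    request_url = splash_render_url(splash_base, page_url, wait)
    with urlopen(request_url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call would look like `fetch_rendered("http://localhost:8050", "https://example.com")`, and the result can be parsed like any static page.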
Alright, so I'm in a small pickle. I'm running into issues with JSoup since the page needs JavaScript to finish loading some of its content. I've worked around this in the past by parsing the raw JavaScript code, but it's very tedious. Lately I tried to write a program to log in to a website, but it requires a token from a form element. That form element is not visible unless JavaScript is executed, so it won't show up at all for me to extract. So I decided to look into Selenium.
First question: is this the library I should be looking into? The reason I'm so bent on using HttpClient is that some of these websites are very high in traffic and don't load all the way, BUT I don't need these pages to load all the way. I just need them to load enough that I can retrieve the login token. I prefer to communicate with the web server through raw JSON/POST requests once I discover the methods required, rather than having Selenium automate a click/wait/type sequence.
Basically, I only need selenium to load up 1/4 of the page, just to retrieve request tokens. The rest of my program will send POST methods using HttpClient.
Or should I just let Selenium do all the work? My goal is speed. I need to log in and purchase an item fast.
Edit: Actually, I might go with HtmlUnit because it's very minimal. I only need to scrape information, and I don't want to run Selenium's StandAlone Server. Is this the better approach?
Basically, HtmlUnit is quicker than Selenium, so if you are going for speed you should use that. Keep in mind, though, that Selenium has its own HtmlUnitDriver implementation, so another option is to use Selenium with HtmlUnit. The difference between them is that HtmlUnit is itself a browser without a GUI, whereas Selenium works by driving a browser's features. You may want to take a look at this other question for further details: Selenium vs HtmlUnit?
So I'm using python and beautifulsoup4 (which I'm not tied to) to scrape a website. Problem is when I use urllib to grab the html of a page it's not the entire page, because some of it is generated via javascript. Is there any way to get around this?
There are basically two main options to proceed with:
Using the browser developer tools, see which AJAX requests the page makes to load its content and simulate them in your script. You will probably need the json module to load the JSON response into a Python data structure.
Use a tool like Selenium that drives a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS.
The first option is more difficult to implement and it's, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
The second option is better in that you get what any real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page, so you may not need BeautifulSoup at all. Either way, this option is slower than the first one.
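To give an idea of the first option, here is a minimal sketch using only `urllib` and `json`. The endpoint you pass to `fetch_json` is whatever request you spot in the Network tab of the developer tools; the `items`/`title` shape of the payload is invented for the demonstration and will differ on a real site. Some endpoints also check headers or cookies, which is why `headers` is a parameter.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(endpoint, headers=None, timeout=10):
    """GET a JSON endpoint found in the browser's Network tab and decode it."""
    request = Request(endpoint, headers=headers or {})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload):
    """Pull the fields of interest out of the decoded response (hypothetical shape)."""
    return [item["title"] for item in payload.get("items", [])]


# Offline demonstration with an invented payload, so no network access is needed.
sample = json.loads('{"items": [{"title": "First"}, {"title": "Second"}]}')
print(extract_titles(sample))  # ['First', 'Second']
```

This is where the fragility comes from: the endpoint and the payload shape are internal details of the site and can change without notice.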
Hope that helps.
Many webpages use onload JavaScript to manipulate their DOM. Is there a way I can automate accessing the state of the HTML after these JavaScript operations?
A tool like wget is not useful here because it just downloads the original source.
Is there perhaps a way to use a web browser rendering engine?
Ideally I am after a solution that I can interface with from Python.
thanks!
The only good way I know to do such things is to automate a browser, for example via Selenium RC. If you have no idea how to deduce that the page has finished running the relevant JavaScript, then, just like a real live user visiting that page, you'll have to wait a while, grab a snapshot, wait some more, grab another, and check that there was no change between them to convince yourself that it's really finished.
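The wait-and-compare loop described above could be sketched roughly like this. The helper `wait_until_stable` is hypothetical and deliberately browser-agnostic: `get_snapshot` is any callable that returns the current state of the page. With Selenium WebDriver that could be `lambda: driver.page_source` (an assumption on my part; Selenium RC has a different API).

```python
import time


def wait_until_stable(get_snapshot, interval=1.0, max_wait=30.0, sleep=time.sleep):
    """Poll get_snapshot() until two consecutive results match, then return it.

    Raises TimeoutError if the content is still changing after max_wait seconds.
    The `sleep` parameter exists only so the delay can be replaced in tests.
    """
    previous = get_snapshot()
    waited = 0.0
    while waited < max_wait:
        sleep(interval)
        waited += interval
        current = get_snapshot()
        if current == previous:
            # No change between two snapshots: assume the scripts are done.
            return current
        previous = current
    raise TimeoutError(f"content still changing after {max_wait} seconds")
```

Keep in mind that a page which keeps updating itself (a clock, rotating ads) never becomes stable by this definition, so in that case it is safer to compare only the part of the page you care about.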
Please see related info at stackoverflow:
screen-scraping
Screen Scraping from a web page with a lot of Javascript