I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I GET the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly-blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser. The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this: http://www.awesomium.com/. It works fairly well, but it has no x64 support and is not thread-safe. I'm using it to scan some websites 24x7, and it runs fine for a couple of days at a time, but then it usually crashes.
This question might have been asked in the past, but I cannot seem to find the answer. I would like to navigate between pages without having to redirect to a brand-new xxx.html file. Basically, I want to keep only one HTML file: the index.html.
To understand what I mean, here is a small preview of the functionality I want to achieve.
Preview
As you can see, each piece of clothing does not get its own individual HTML file. What method is used to achieve this?
What you are seeing is called a single-page application (SPA). There are a lot of frameworks with which you can create a page like this. If you go for plain HTML/CSS/JavaScript, it will be a lot harder to do correctly.
What you're seeing here is a dynamic webpage that is taking advantage of client-side technology to create this effect. To help further explain, let's quickly go over some web development terminology:
Client-Side: Code that is executed on the user's computer (in this case in their web browser).
Server-Side: Code that is executed on a server, then a response of some sort is sent to the client.
With server-side code, the value cannot change unless a new call is made to the server to get a new response. This is because the code isn't actually running on the user's computer; it's running on some other computer, possibly thousands of miles away. With client-side code, however, dynamic changes can be made in real time, because the code is running right on the user's machine.
When it comes to server-side code, we as developers have a myriad of options. Any language that can send an HTTP response to a web browser could theoretically be used as a server-side language. In 2018, that's basically every major language in existence! That being said, some popular options today include Python, Ruby, Java, and JavaScript (Node.js).
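To make that concrete, here's a minimal sketch in Python (purely illustrative, using only the standard library) of what "sending an HTTP response" amounts to; anything that can do this much can play the server-side role:

```python
# Minimal illustration: a server-side "application" is anything that
# answers HTTP requests. Python's standard library is enough.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The response is computed here, on the server; the browser
        # only ever receives the finished HTML.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<p>Rendered on the server</p>")

HTTPServer(("localhost", 8000), Handler).serve_forever()
```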
When it comes to client-side code, however, we're limited by what can run in a user's web browser. In general, modern web browsers only understand JavaScript. While the language has gotten better over the years, writing pure JavaScript can sometimes be cumbersome, so there are libraries that help make writing JavaScript easier (such as jQuery), and there are even languages that compile down to JavaScript to add new syntax and functionality (such as TypeScript and CoffeeScript).
If you'd like to start writing dynamic web applications, a good place to start would be to learn the basics of JavaScript. Then, maybe start learning jQuery, or front-end libraries such as Angular or React. Good luck!
You will have to use JavaScript for this. Either you can load all the content at once and just show/hide what you need, or you can use Ajax to fetch the content and render it without a page reload.
So I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, it's not the entire page, because some of it is generated via JavaScript. Is there any way to get around this?
There are basically two main options to proceed with:
using your browser's developer tools, see what Ajax requests are made to load the page, and simulate them in your script; you will probably need the json module to load the response JSON string into a Python data structure (see the first sketch below)
use a tool like Selenium that opens up a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS (a second sketch follows the comparison below)
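A minimal sketch of the first option. The endpoint URL and response shape here are hypothetical; copy the real ones from the network tab of your browser's developer tools:

```python
# Sketch of option 1: replay the ajax call directly.
import json

import requests  # urllib works too; requests just reads nicer

url = "http://example.com/api/items?page=1"  # hypothetical endpoint
response = requests.get(url)
data = json.loads(response.text)  # JSON string -> Python data structure

for item in data["items"]:  # hypothetical response structure
    print(item["name"])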
The first option is more difficult to implement and it's, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
The second option is better in the sense that you get exactly what any other real user gets, and you don't need to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page - you may not need BeautifulSoup at all. But, anyway, this option is slower than the first one.
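If you go the Selenium route, here is a minimal sketch with PhantomJS (as in the linked article); the URL is a placeholder:

```python
# Sketch of option 2: let a headless browser execute the JavaScript,
# then parse the rendered HTML as usual.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()  # headless; use webdriver.Firefox() to watch it work
driver.get("http://example.com/page-with-js")  # placeholder URL

# page_source is the DOM after the scripts have run (for data that loads
# late, wait for a known element with WebDriverWait first)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

print(soup.title.string)
```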
Hope that helps.
I have a little work project I'm trying to solve. It involves automating data entry into a web-based database (an ASP page). Part of the data entry requires clicking a button to show a form. The button makes a call to WebForm_DoPostBackWithOptions(). I've been looking into simulating the form POST with JavaScript, but I'm getting the impression it's a difficult thing to pull off, as I see that you need to supply the hidden VIEWSTATE data in the POST, which just seems like a lot of work for little gain.
Anyway, I'm strictly limited to using IE 8, client-side scripting, and no external libs. There is no API or provision for automated access to the web database. The environment is totally Windows, and I do have .NET available. At this point, it seems the only viable option is to try to use a .NET WebBrowser object from javascript.
Are there any other ways of going about this?
Are you able to use XDomainRequest to post the data to the ASP page, and then handle it there? It's pretty much the same thing as Ajax, except that it's IE's own mechanism for cross-domain requests: some versions of IE (8 included) don't support cross-domain Ajax (CORS) through the standard XMLHttpRequest.
This is part of a project I am working on for work.
I want to automate a Sharepoint site, specifically to pull data out of a database that I and my coworkers only have front-end access to.
I FINALLY managed to get mechanize (in Python) to accomplish this using Python-NTLM, and by patching part of its source code to fix a recurring error.
Now, I am at what I would hope is my final roadblock: part of the form I need to submit seems to be the output of a JavaScript function :| and, lo and behold... mechanize does not support JavaScript. I don't want to emulate the JavaScript functionality myself in Python, because I would ideally like a reusable solution...
So, does anyone know how I could evaluate the javascript on the local html I download from sharepoint? I just want to run the javascript somehow (to complete the loading of the page), but without a browser.
I have already looked into selenium, but it's pretty slow for the amount of work I need to get done... I am currently looking into PyV8 to try and evaluate the javascript myself... but surely there must be an app or library (or anything) that can do this??
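For reference, this is the kind of experiment I've been running with PyV8 (the script string is just a stand-in for the function I actually need to evaluate):

```python
# PyV8 experiment: evaluate JavaScript in V8 from Python. Pure
# computation works, but there is no DOM - document/window don't exist
# unless you implement them yourself, which is the hard part.
import PyV8

ctxt = PyV8.JSContext()
ctxt.enter()
result = ctxt.eval("(function () { return 6 * 7; })()")  # stand-in script
ctxt.leave()
print(result)  # 42
```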
Well, in the end I came down to the following possible solutions:
Run Chrome headless and collect the html output (thanks to koenp for the link!)
Run PhantomJS, a headless browser with a javascript api
Run HTMLUnit; same thing but for Java
Use Ghost.py, a python-based headless browser (that I haven't seen suggested anyyyywhere for some reason!)
Write a DOM-aware JavaScript interpreter on top of PyV8 (Google's V8 JavaScript engine) and add this to my current "half-solution" with mechanize.
For now, I have decided to either use Ghost.py or my own modification of the PySide/PyQt WebKit bindings (which is how Ghost.py works) to evaluate the JavaScript, as apparently they can run quite fast if you optimize them not to download images and disable the GUI.
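For anyone who wants to try the Ghost.py route, a minimal sketch; the constructor options for skipping images and hiding the GUI are the ones I remember from its docs, so treat them as assumptions and check your version:

```python
# Ghost.py sketch: load the page, let its JavaScript finish, then read
# values back out of the completed DOM.
from ghost import Ghost

# download_images/display options assumed from ghost.py's docs - verify
ghost = Ghost(download_images=False, display=False)
page, resources = ghost.open("http://example.com/sharepoint-page")  # placeholder URL

# evaluate() runs JavaScript inside the loaded page and returns the result
result, resources = ghost.evaluate("document.forms[0].innerHTML")
print(result)
```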
Hopefully others will find this list useful!
Well, you will need something that both understands the DOM and understands JavaScript, so that comes down to a headless browser of some sort. Maybe you can take a look at the Selenium WebDriver, but I guess you already did that. I don't think there is an easy way of doing this without running the stuff in an actual browser engine.
I want to analyse some data from a webpage, but here's the problem: the site has more pages, which get loaded via a __doPostBack function.
How can I "simulate" moving one page further so I can analyse that page too, and so on?
At the moment I analyse the data with JSoup in Java, but I'm open to using some other language if necessary.
A postback-based system (.NET, Prado/PHP, etc.) works by keeping a complete snapshot of the browser contents on the server side. This is called a pagestate. Any attempt to manipulate it with a client that is not JavaScript-capable is almost sure to fail.
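To make the fragility concrete: __doPostBack just fills two hidden fields and submits the form, but the server will only accept the POST if the rest of the hidden state is replayed exactly. A rough sketch of what a manual replay looks like in Python (the URL and control IDs are hypothetical; the field names are the standard ASP.NET WebForms ones):

```python
# Rough sketch of manually replaying a __doPostBack with requests.
# __EVENTTARGET/__EVENTARGUMENT are the two fields the JavaScript fills
# in; __VIEWSTATE and __EVENTVALIDATION must be copied from the page you
# were just served, or the server rejects the postback.
import requests
from bs4 import BeautifulSoup

session = requests.Session()
page = session.get("http://example.com/list.aspx")  # hypothetical URL
soup = BeautifulSoup(page.text, "html.parser")

form_data = {
    "__EVENTTARGET": "GridView1",  # hypothetical control ID from the pager link
    "__EVENTARGUMENT": "Page$2",   # hypothetical argument for "page 2"
    "__VIEWSTATE": soup.find(id="__VIEWSTATE")["value"],
    "__EVENTVALIDATION": soup.find(id="__EVENTVALIDATION")["value"],
}
next_page = session.post("http://example.com/list.aspx", data=form_data)
print(next_page.status_code)
```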
What you need is a JavaScript-capable browser. The easiest solution I found is to use the framework Firefox is written in - XUL - to create such a desktop application. What you do is basically create a desktop application with a single browser element in it, which you can then script from the application itself without restrictions of the security container. Alternatively, you could also use the Greasemonkey plugin to do your bidding. The latter is a bit easier to get started with, but it's fairly limited since it's running on a per-page basis.
With both solutions you then have access to the page's DOM to gather data and you can also fire events (like clicking on a button). Unfortunately you have to learn JavaScript for this to work.
I used an automation library called Selenium, which you can use from a lot of languages (C#, Java, Perl, ...).
For more information on how to get started, this link is very helpful: this.
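A minimal sketch of that approach using Selenium's Python bindings (the URL and pager link text are placeholders):

```python
# Drive a real browser so __doPostBack runs exactly as it would for a
# user, then grab each page's HTML after the postback completes.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/list.aspx")  # placeholder URL

# clicking the pager link fires __doPostBack just like a real click
driver.find_element_by_link_text("2").click()  # placeholder link text
html = driver.page_source  # feed this to JSoup/BeautifulSoup as before
driver.quit()
```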
As well as Selenium, you can use http://watin.org/