Problems scraping pages with JavaScript function in python

I'm stuck on a Python problem.
I have to scrape a page that uses JavaScript functions. That by itself isn't the problem; the problem is that the information I need is generated by one of those functions. So I need to run the function so that it builds the HTML I need, and then work on that HTML to get what I want.
To be clear: the JS function builds the HTML, but when I scrape the page I don't get the built HTML, I just get the source of the JS function.
I am using mechanize and BeautifulSoup for the scraping. Does anyone know what I have to do to emulate the JS function and get the HTML I need?
Thanks in advance.

You need a scraping framework that supports JavaScript. Selenium is one of them, and I got good results using it along with BeautifulSoup.
You may want to check out PyVirtualDisplay if you are going to run Selenium with Mozilla Firefox on a machine without a display.
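
A minimal sketch of the Selenium plus BeautifulSoup approach, assuming the selenium and beautifulsoup4 packages and a Firefox driver are installed. The function names rendered_html and scrape_with_firefox are made up for this illustration. The idea is to let a real browser run the page's scripts and only then hand the resulting HTML to BeautifulSoup.

```python
def rendered_html(driver, url):
    """Load url in a browser driver and return the page source after loading."""
    driver.get(url)
    # page_source is read from the live browser, so it reflects the DOM
    # after the page's scripts have had a chance to run.
    return driver.page_source


def scrape_with_firefox(url):
    """Render url in Firefox and return a BeautifulSoup tree of the result."""
    # Imported here so rendered_html() can be used without these packages.
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        html = rendered_html(driver, url)
    finally:
        driver.quit()  # always close the browser, even on errors
    return BeautifulSoup(html, "html.parser")
```

Calling scrape_with_firefox("http://example.com/") would return a tree you can query with find() or select() as usual. If the function you care about fills in the HTML asynchronously, you will probably also need to wait for the target element (for example with Selenium's WebDriverWait) before reading page_source.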

Related

Webscraper in node.js, JS modifies DOM

I'm trying to write a web scraper to get some sales leads. The problem is that in modern web design, most websites use JavaScript to modify the DOM (usually with React, Angular, or even just some jQuery). So if I fetch a site with the request Node.js package and pass the HTML to cheerio, I'm not able to parse out the info I want. Instead, all I can see are some React.js components ¯\_(ツ)_/¯
Any resources on this topic will be helpful, thanks in advance.

This happens because the request package does not execute any of the JavaScript on the page. It just downloads the HTML as is. To see the page the way a browser does, you would have to execute the page's JavaScript yourself, against a DOM, until it reaches the state you want.
Luckily, there are some other options here:
You could open the developer tools on the website you want to scrape and look for the XHR requests that fetch the data you need. Then you can call those URLs directly.
You could use headless browser scraping with tools like PhantomJS or CasperJS. These load the page and run its scripts, so the DOM you read back is as close as possible to what a browser would show.
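
The first option (calling the data endpoint directly) might look roughly like this. The sketch is in Python using only the standard library; the same idea applies with the request package in Node.js. The payload shape {"items": [{"name": ...}]} and the helper names are hypothetical: you would substitute whatever URL and JSON structure you actually see in the developer tools' network tab. It also assumes the response is UTF-8 encoded.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """Request the endpoint the page itself calls and decode the JSON body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        # Assumption: the endpoint answers with UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))


def company_names(payload):
    """Collect the 'name' field of each record in a hypothetical payload."""
    return [item["name"] for item in payload.get("items", [])]
```

With this approach there is no HTML to parse at all: company_names(fetch_json(endpoint_url)) works on the same data the page's own scripts receive.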

Get the HTML source code of a web page on JavaScript

I'm building an application and I need to get the HTML source code of a web page in order to parse it (the page is not on my server).
I'm coding in JavaScript and I can't find a way to do it. I know there is a way to do it in Python (with the requests library) and I want basically the same thing in JavaScript.
Does anyone know how to do this?
Thanks

Try this
document.documentElement.outerHTML

Django - Load template with jQuery into variable

I'm working with a client who has a view that, after a user logs in, loads a template that dynamically draws a canvas with jQuery and generates an image copy of the canvas.
They want to protect the jQuery code, hiding the process in the python code.
I tried using PyExecJS, but it doesn't support jQuery, since there is no DOM.
I've also tried urllib2, mechanize and Selenium, but none worked.
Is there an alternative or have I missed something?
Update/Resolution: In case someone stumbles onto this question: I ended up using Selenium for Python to load the JS function, fed it the necessary data and extracted the image from it. It has a bit of an overhead, but since the main goal was to keep the JS code obfuscated, it worked.
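
The resolution above does not show code, so the following is only a guess at what pulling a canvas image out of Selenium could look like. The helper names canvas_png and decode_data_url and the default "canvas" selector are assumptions for illustration. It relies on the browser's standard canvas.toDataURL() method and on Selenium's execute_script(), which returns the value of the script's return statement.

```python
import base64


def decode_data_url(data_url):
    """Return the raw bytes from a base64 data URL ('data:image/png;base64,...')."""
    header, _, encoded = data_url.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    return base64.b64decode(encoded)


def canvas_png(driver, selector="canvas"):
    """Ask the browser to serialize the first matching canvas as PNG bytes."""
    # arguments[0] is how Selenium passes extra Python arguments to the script.
    script = "return document.querySelector(arguments[0]).toDataURL('image/png');"
    return decode_data_url(driver.execute_script(script, selector))
```

The returned bytes can be written to a file opened in "wb" mode or sent back in an HTTP response.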

If I understand correctly, you are trying to hide jQuery code.
You can't hide jQuery code from the user, because the browser has to download it in order to run it. Django only processes Python code on the server before it serves up the template, so there's no way to protect jQuery code with Python. Really the best thing you can do is to minify and obfuscate the code, but that only makes it harder for a human to read.

How do I scrape something after JS has changed the DOM?

I'm using Mechanize, although I'm open to Nokogiri if Mechanize can't do it.
I'd like to scrape the page after all the scripts have loaded as opposed to beforehand.
How might I do this?

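I think a good option is something like this with Nokogiri, Watir, and PhantomJS:
require 'watir'
require 'nokogiri'

b = Watir::Browser.new(:phantomjs)  # headless browser; needs PhantomJS installed
b.goto URL                          # URL is a placeholder for the page to scrape
doc = Nokogiri::HTML(b.html)        # parse the DOM as it stands after scripts ran
The resulting doc reflects the page after the scripts have run. PhantomJS is nice because it is headless, so there is no need to open a browser window.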

Nokogiri and Mechanize are not full web browsers and do not run JavaScript in a browser-model DOM. You want to use something like Watir or Selenium which allow you to use Ruby to control an actual web browser.

In addition to watir-webdriver and capybara-webkit, Celerity is a good option, although it is JRuby-only.

I don't know anything about Mechanize or Nokogiri, so I can't comment specifically on them. However, I believe the problem of getting the HTML after JavaScript has modified it can only be solved with more JavaScript. To get the newly generated HTML you would need to read the .innerHTML of the document element. This can be tricky, since you would have to inject JS into the page.
The only way I know of to accomplish this is to write a Firefox plugin. With a plugin you can run JavaScript on a page even though it's not your page. Sorry I'm not more help; I hope this puts you on the right path.
If you're interested in plug-ins, this is one place to start: http://anthonystechblog.wordpress.com/category/internet/firefox/
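
Browser automation tools such as Selenium (mentioned in the other answers) can also inject JavaScript into a page without a plugin. A minimal sketch in Python rather than Ruby, assuming a Selenium WebDriver object is passed in; dom_snapshot is a name made up for this example:

```python
def dom_snapshot(driver, url):
    """Load url, then run injected JS that returns the current DOM as a string."""
    driver.get(url)
    # execute_script runs the snippet inside the page and hands back
    # whatever the snippet returns.
    return driver.execute_script("return document.documentElement.outerHTML;")
```

The string that comes back can then be handed to an HTML parser just like a normally downloaded page.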

How can I debug my code using WebMatrix and Visual Studio?

I am currently writing some code in a question here. The code uses the fullcalendar jQuery plugin. I am trying to adapt it to use C# and a SQL Server database as its feed. I have a .cshtml file which reads from the database, creates a JSON object from the data and returns this to the jQuery plugin.
The kicker for me is that I am a Java/PHP programmer who has never worked with JavaScript, C# or Visual Studio before (and probably never will again!), so really I am fighting blind in trying to debug my code. I was hoping someone here could help me to help myself!
At the moment, I am using WebMatrix as an IDE and trying to execute my code after every change. What are better ways to test this code?
Edit: I am now using Firebug, it's great! Does anyone have any more tips?

You can do old-school print statements using JavaScript alerts (http://www.w3schools.com/jsref/met_win_alert.asp), or simply print stuff out as plain text in the HTML.
Inspired by http://saucelabs.com/blog/index.php/2012/05/goodbye-couchdb/ , I am storing raw JSON in the DB.
WebMatrix has a nice quick-reference page: http://www.asp.net/web-pages/overview/more-resources/asp-net-web-pages-api-reference
Mike Brind's book and web page (http://www.mikesdotnetting.com/Article/160/WebMatrix-Working-With-The-JSON-Helper) are also great references.
