Scraping dynamically generated HTML inside an Android app - javascript

I am currently writing an Android app that, among other things, uses text information from websites which I do not own. In addition, some of the pages require authentication.
For some pages I have been able to log in and retrieve the HTML code using BasicNameValuePair objects and an HttpClient with its associated objects.
Unfortunately, these methods retrieve the webpage source without running any JavaScript functions that a browser (even an Android WebView) would normally run. I need the text that some of these scripts are retrieving.
I've done my research, but everything I've found is guesswork and extremely confusing. I'm okay with ignoring pages that require login for now. Also, I am willing to post any code that may be useful for constructing a solution; it is an independent project.
Any concrete solutions for scraping the HTML result from JavaScript calls? An example would be absolutely top-notch.
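For reference, the HttpClient login-and-fetch flow described above looks roughly like this sketch (the URLs and form field names are placeholders for whatever the target site's login form actually uses):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class LoginScraper {

    // Placeholder URLs and form fields -- substitute the target site's real login form.
    public static String fetchAfterLogin(String user, String pass) throws Exception {
        HttpClient client = new DefaultHttpClient(); // keeps session cookies between requests

        HttpPost login = new HttpPost("https://example.com/login");
        List<NameValuePair> form = new ArrayList<NameValuePair>();
        form.add(new BasicNameValuePair("username", user));
        form.add(new BasicNameValuePair("password", pass));
        login.setEntity(new UrlEncodedFormEntity(form));

        // Drain the login response body so the connection is released for reuse.
        HttpResponse loginResponse = client.execute(login);
        if (loginResponse.getEntity() != null) {
            EntityUtils.toString(loginResponse.getEntity());
        }

        // Fetch a protected page; this is the raw source, before any JavaScript runs.
        HttpResponse page = client.execute(new HttpGet("https://example.com/protected"));
        return EntityUtils.toString(page.getEntity());
    }
}
```

This only returns the static source, which is exactly the limitation described above; the page's JavaScript still has to be executed separately.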

Final Success:
Rhino. Used this jar file. (A minimal usage sketch is included after this list.)
Other Things I Tried:
HttpClient provided by Android
Cannot run JavaScript
HtmlUnit
4 hours, no success. Also huge; added 12 MB to my APK.
SL4A
Finally compiled. Used THIS guide to set up. Abandoned as overkill for a simple Rhino jar.
Things That Might Work:
Selenium
Further results will be posted. Other results will be added if posted.
Note: many of the options listed above reference each other. I think Rhino is included in both SL4A and HtmlUnit. Also, I think HtmlUnit contains Selenium.
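Since an example was requested, here is a minimal sketch of the Rhino approach, assuming the Rhino jar is on the app's build path. Rhino is only a JavaScript engine, not a browser: there is no DOM, so whatever script you evaluate has to be self-contained (for example, the data-building functions you extracted from the downloaded page source).

```java
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

public class RhinoRunner {

    /** Evaluates a self-contained JavaScript snippet and returns its result as a string. */
    public static String evaluate(String javascriptSource) {
        Context ctx = Context.enter();
        try {
            // Required on Android: Dalvik/ART cannot load the JVM bytecode produced by
            // Rhino's optimizing compiler, so force interpreted mode.
            ctx.setOptimizationLevel(-1);
            Scriptable scope = ctx.initStandardObjects();
            Object result = ctx.evaluateString(scope, javascriptSource, "script", 1, null);
            return Context.toString(result);
        } finally {
            Context.exit();
        }
    }
}
```

For instance, evaluate("function f(x) { return x * 2; } f(21);") returns "42". Run this off the UI thread, since evaluating scraped scripts can be slow.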

The aforementioned solutions are very slow and restrict you to one URL (well, not really, but I dare you to scrape 10 URLs with Rhino while your user is impatiently waiting for results).
An alternative is to use a cloud scraping solution. You get the benefit of not wasting phone bandwidth on downloading content you won't use.
Try this solution: Bobik Java SDK
It gives you the ability to scrape up to hundreds of sites in a matter of seconds.

Related

Call JavaScript (3rd party library) from Python

I've already searched quite a bit but came to no clear conclusion, as some projects (pyv8) seem to be dead and I'm not sure if that is suitable at all. The 3rd-party lib requires a DOM, e.g. a container element in which it runs. It also uses WebAssembly and in general is pretty heavy.
Not sure if libs like pyv8 would actually be suitable for that. Another approach would be to go with Selenium and headless Chrome or a local Node.js service, but both of these sound very heavy. Oh, and the lib must work on Windows, as that's simply company policy (Windows servers), so PyMiniRacer is out.
What are my other options?
Consider taking a look at this post: How do I call a Javascript function from Python?.
However, if your objective is to access JS code in a webpage for reasons such as web scraping, you could also consider using Selenium WebDriver + Python to do so. Take a look at this medium.com post: How to Run JavaScript in Python | Web Scraping | Web Testing
Other Resources:
https://www.quora.com/How-do-we-use-JavaScript-with-Python
Python to JS: https://pypi.org/project/javascripthon/
P.S.: I am not sure if this would help you. There is another library (PyExecJS) which is no longer maintained, but I think you have looked it up already.

How to download and query html pages where JS processing is necessary?

I often compile informal datasets by running some kind of XPath/XQuery on publicly available web pages. Usually the structure of the HTML is regular enough that useful information can be extracted easily.
But today I've come across tunefind.com. This website makes extensive use of the ReactJS framework, and so most of the structure of the page is constructed client-side by JavaScript. The pages, when initially downloaded, are very basic and missing a lot of information. The pages are populated by a script that uses a hopelessly messy blob of JSON data at the bottom of the page.
The only way I can think of to deal with this would be to use some kind of GUI-based web engine and just not display the GUI part. But that is a preposterous amount of work for these casual little CLI tools that I use to gather information.
Is there any way to perform the JavaScript preprocessing without dealing with unnecessary graphics?
Even if you were to process the page without the graphics, the React JavaScript will be geared towards running in a browser context: at the very least it will expect a functioning DOM to exist, and the application itself may also require clicks/transitions to happen before you can see some data.
Your best bet, then, is to load the page in a browser. To keep this simple, there are plenty of good browser automation frameworks designed for this.
I've used a fair few libraries over the years, including PhantomJS, and recently I've gotten the most mileage out of nightmarejs.
It runs an Electron browser for you and gives you a useful promisified JavaScript API to control it with, which has common browser functions such as clicking, following links, etc.
You can configure it to hide the browser, which is useful for making a CLI tool; however, it's a bit of a pseudo-headless mode and will still require a windowing/graphical context (e.g. an X window).
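nightmarejs is Node-only; if you would rather stay on the JVM, the same load-it-in-a-real-browser idea can be sketched with Selenium and headless Chrome (this assumes chromedriver is installed and on the PATH; the tunefind URL is just the example from the question):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class RenderedPageFetcher {

    /** Loads a page in headless Chrome and returns the DOM after client-side scripts have run. */
    public static String fetchRenderedHtml(String url) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // no window, no X server required
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);
            // For late-loading content you may still need an explicit wait here.
            return driver.getPageSource();
        } finally {
            driver.quit();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchRenderedHtml("https://www.tunefind.com/"));
    }
}
```

The returned string is the rendered document, which you can then feed into your existing XPath/XQuery tooling.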
Hope this helps.
PS - If you're at all used to Docker, it's not hard to make this just a running container!

Creating a printable/downloadable PDF of a web application

I have been searching for an answer to this problem now for several weeks. I also previously tried to research this a few years ago to no avail.
Problem Summary:
My company has developed a web-based data analytics suite for a major beverage distributor. They have recently asked for a feature that allows the user to print or download a visually pleasing version of the rendered app as a PDF. I have had no luck in finding a solid, controllable, or reliable method to do this. I was hoping the stack community might be able to point me in the right direction.
Current Tech Stack:
Plack servers
Perl based on the Dancer framework
Standard web dev front-end: HTML5, CSS3, JavaScript, jQuery/jQuery UI
Client is using IE9/10 and Chrome.
Attempted Solutions Summary:
Obviously I started with window.print() and tried to control what printed using classes and a specialized print.css, but the output was still awful.
I looked into pdfMachine and PDFBox, and even contacted Adobe's Acrobat development team directly to see if they had an out-of-the-box solution our company could purchase. I was informed that such a product would run counter to their desired business model of putting an Acrobat subscription on each client computer rather than selling a single server-side application.
I have extensively searched the stack articles but did not feel that the articles I found covered what I was looking for.
At present, I am all out of ideas and am hoping somebody out there has had better luck at this than I have.
tl;dr = I need a PDF version of the rendered output of a complex reporting app.
Thanks for your time, Stack, I appreciate it.
A solution I have used in the past is PhantomJS running on a server to generate the PDF for download/email. Usually, if the content is sensitive, the server (which handles authentication) would provide a single-use viewing token that is then passed to a PhantomJS process. It loads the URL with the viewing token and then saves it as a PDF.
Further info on PhantomJS's screen capture API can be found here on GitHub.
https://github.com/ariya/phantomjs/wiki/Screen-Capture
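The server-side piece boils down to a subprocess call: PhantomJS ships a rasterize.js script in its examples directory that takes a URL, an output file, and a paper size. A hedged sketch of that call from a JVM process is below (the paths are placeholders; the equivalent call from Perl via system() works just as well):

```java
import java.io.IOException;

public class PdfRenderer {

    // Placeholder paths -- point these at your PhantomJS installation.
    private static final String PHANTOMJS = "/usr/local/bin/phantomjs";
    private static final String RASTERIZE = "/usr/local/share/phantomjs/examples/rasterize.js";

    /** Renders a URL (e.g. a report page with a single-use viewing token appended) to a PDF file. */
    public static void renderPdf(String urlWithToken, String outputPdfPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(PHANTOMJS, RASTERIZE, urlWithToken, outputPdfPath, "A4")
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("phantomjs exited with code " + exitCode);
        }
    }
}
```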
Is it something you could create in Perl using PDF::API2 or PDF::Create? You can load and modify an existing PDF (handy if you want standard headers and footers), and then insert the relevant content. The learning curve can be a bit steep, but simple reports should be easy enough.
See PDF::TextBlock and PDF::Table too - they are great little helpers.
Consider this service: http://pdfmyurl.com/. I tried to use many Perl modules, but they didn't solve my problem.

How does disqus work?

Does anyone know how Disqus works?
It manages comments on a blog, but the comments are all held on a third-party site. Seems like a neat use of cross-site communication.
The general pattern used is JSONP
It's actually implemented in a fairly sophisticated way (at least on the jQuery site)... they defer the loading of the disqus.js and thread.js files until the user scrolls to the comment section.
The thread.js file contains JSON content for the comments, which is rendered into the page after it's loaded.
You have three options when adding Disqus commenting to a site:
Use one of the many integrated solutions (WordPress, Blogger, Tumblr, etc. are supported)
Use the universal JavaScript code
Write your own code to communicate with the Disqus API
The main advantage of the integrated solutions is that they're easy to set up. In the case of WordPress, for example, it's as easy as activating a plug-in.
Having the ability to communicate with the API directly is very useful, and offers two advantages over the other options. First, it gives you as the developer complete control over the markup. Secondly, you're able to process comments server-side, which may be preferable.
It looks like it's using the easyXDM library, which uses the best available way for the current browser to communicate with the other site.
Quoting Anton Kovalyov (a former engineer at Disqus), whose answer to the same question on a different site was really helpful to me:
Disqus is a third-party JavaScript application that runs in your browser and injects itself on publishers' websites. These publishers need to install a small snippet of JavaScript code that makes the first request to our servers and loads initial JavaScript loader. This loader then creates all necessary iframe elements, gets the data from our servers, renders templates and injects the result into some element on the page.
As you can probably guess there are quite a few different technologies supporting what seems like a simple operation. On the back-end you have to run and scale a gigantic web application that serves millions of requests (mostly read). We use Python, Django, PostgreSQL and Redis (for our realtime service).
On the front-end you have to minimize your payload, make sure your app is super fast and that it doesn't break in extremely hostile environments (you will be surprised how screwed up publisher websites can be). Cross-domain communication—ability to send messages from hosting website to your servers—can be tricky as well.
Unfortunately, it is impossible to explain how everything works in a comment on Quora, or even in an article. So if you're interested in the back-end side of Disqus just learn how to write, run and operate highly-scalable websites and you'll be golden. And if you're interested in the front-end side, Ben Vinegar and myself (both front-end engineers at Disqus) wrote a book on the topic called Third-party JavaScript (http://thirdpartyjs.com/).
I'm planning to read the book he mentioned; I guess it will be quite helpful.
Here's also a link to the official answer to this question on the Disqus site.
Short answer? AJAX. You get your own URL, e.g. "site.com/?comments=ID", included via JavaScript... but with real-time updates like that you would need a polling server.
I think they keep the content on their site and your site will only send and receive the data to/from Disqus. Now I wonder what happens if you decide that you want to bring your commenting in-house without losing all existing comments! How easily would you get to your data, I wonder? They claim that the data belongs to you, but they have control over it, and there is not much explanation on their site about this.
I'm always leaving comments on the Disqus platform. Sometimes a comment seems to be removed once you refresh, and sometimes it's not. I think the ones that were removed are held for moderation without it saying so.

JavaScript extractor: extract functions/objects that are really used in a web page from a library

This is a twin question to the following:
JavaScript stripper: remove functions/objects that are not used in a web page
To maximize my chance of getting my problem solved, I'm asking the question in the opposite manner:
All of my web pages use a JavaScript library. To improve the performance of my web pages, I'd like to include only the needed functions/objects from the library on each page. I'm looking for a tool that can do the intelligent extraction automatically.
Thanks for your help,
Yu
Are you sure this is a real problem?
The reason I ask is that it should not be a problem to include the same full JavaScript library on every page. In fact, serving different versions of the library to each page will actually slow down your site.
The reason is that JavaScript is cached by the browser. If each page requests the same library, they will not have to actually download the library from your site after the first time.
The key is to make sure your library is sent with an HTTP Expires header that tells the browser to cache the response.
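The question doesn't say what the server side looks like, so purely as an illustration, here is what that caching policy can look like as a servlet filter (the JsCacheFilter name and the one-year lifetime are my own choices; map the filter to *.js in web.xml):

```java
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

/** Marks matched responses as cacheable for one year so browsers reuse the library across pages. */
public class JsCacheFilter implements Filter {

    private static final long ONE_YEAR_MS = 365L * 24 * 60 * 60 * 1000;

    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        response.setDateHeader("Expires", System.currentTimeMillis() + ONE_YEAR_MS);
        response.setHeader("Cache-Control", "public, max-age=31536000");
        chain.doFilter(req, res);
    }

    public void destroy() {}
}
```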
You are doing it wrong. Separate versions of a JavaScript library for each page are a bad idea, since the library won't be cached but fetched separately for each page. You're better off minifying, concatenating, and gzipping your scripts, and serving the exact same script file for all pages.
However, if you need to know what lines are actually run, you can probably find out using JSCoverage.
