Scraping dynamic page content phantomjs

My company uses a website that hosts all of our FAQ and customer questions. We plan to wipe out all of the old data and enter new content, and the service has no backup or archive option for questions we no longer want to appear.
I've tried to scrape the site using Perl and Mechanize, but I'm missing the customer comments on each page because they are loaded through AJAX. I have looked at PhantomJS and can save pages to an image using an example script, but I'd like a full HTML dump of the page and can't figure out how to get one. I used this example code on our site:
var page = new WebPage();
page.open('http://espn.go.com/nfl/', function (status) {
    // once the page has loaded, include jQuery from the CDN
    page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        // once jQuery has loaded, run some code
        // inserts our custom text into the page
        page.evaluate(function(){$("h2").html('Many NFL Players Scared that Chad Moon Will Enter League');});
        // take a screenshot and exit
        page.render('espn.png');
        phantom.exit();
    });
});
Is there a way to get a full dump of the page with PhantomJS, similar to doing a "view source" in Chrome? I can do this with Perl and Mechanize, but I don't see how to do it with PhantomJS.

You can use page.content to get the full HTML of the DOM as a string.
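As a rough sketch of how that could look in a PhantomJS script: the code below opens a page, waits a moment for AJAX-loaded content, and prints page.content. The URL, the 3-second delay, and the dumpHtml helper name are assumptions for illustration and are not from the question; a fixed delay is a crude way to wait, and the right value depends on the site.

```javascript
// Run with: phantomjs dump.js > page.html
// Placeholder URL (assumption); replace it with the page to archive.
var TARGET_URL = 'http://example.com/faq';
// Assumed delay that lets AJAX-loaded comments arrive; tune it for the real site.
var WAIT_MS = 3000;

// Opens `url` with the given PhantomJS webpage module and passes the
// serialized HTML of the page to `done(err, html)`.
function dumpHtml(webpage, url, waitMs, done) {
    var page = webpage.create();
    page.open(url, function (status) {
        if (status !== 'success') {
            done(new Error('Failed to load ' + url));
            return;
        }
        // Give asynchronous requests time to finish before reading the DOM.
        setTimeout(function () {
            done(null, page.content);
        }, waitMs);
    });
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    dumpHtml(require('webpage'), TARGET_URL, WAIT_MS, function (err, html) {
        if (err) {
            require('system').stderr.writeLine(err.message);
            phantom.exit(1);
        } else {
            console.log(html);
            phantom.exit(0);
        }
    });
}
```

Unlike "view source", which shows the HTML as it was delivered, page.content reflects the DOM at the moment it is read, so content added by scripts before that point is included.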

I would recommend pjscrape http://nrabinowitz.github.com/pjscrape/ if you want to scrape using PhantomJS

Related

djangobyexample book - jquery bookmarklet not working at all

I'm working through Django By Example. In one chapter a jQuery bookmarklet is built within a Django app so that a user can easily save JPG images from a website into their profile area within the app.
I have followed the tutorial's exact instructions. The bookmarklet button appears in my bookmarks bar in Chrome, but nothing happens when I click it while browsing a page with JPG images.
This is my local Django dashboard, where the bookmarklet button is added to the bookmarks bar; this part works fine.
And this is how it should look when the bookmarklet is clicked; this is the part where nothing happens for me.
How can I solve this?
These are the relevant JS files:
https://github.com/davejonesbkk/bookmarks/blob/master/images/templates/bookmarklet_launcher.js
https://github.com/davejonesbkk/bookmarks/blob/master/images/static/js/bookmarklet.js
I believe either the JavaScript launcher is unable to load the JavaScript files or the launcher itself is not being loaded.
The launcher is pulled in through a Django "include" template tag inside the anchor tag.
This is the template:
https://github.com/davejonesbkk/bookmarks/blob/master/account/templates/account/dashboard.html
I tried debugging it in the Ctrl+Shift+I console, which showed that the "include" tag is not working properly.

Your include tag is split over two lines:
images from other websites → <a href="javascript:{% include
"bookmarklet_launcher.js" %}" class="button">Bookmark it</a><p>
Django does not support template tags that span multiple lines. Change it to:
images from other websites → <a href="javascript:{% include "bookmarklet_launcher.js" %}" class="button">Bookmark it</a><p>

I faced a similar error while going through the book.
The bookmark button was not working. When I debugged it in the Chrome debugger, I could see errors at the JS level. I made two changes to resolve them.
1. Error message: net::ERR_ABORTED
Action step:
The book says to place bookmarklet.js in the images application directory, but bookmarklet_launcher.js refers to the path below:
http://127.0.0.1:8000/static/js/bookmarklet.js?r=
So place bookmarklet.js in the /static/js/ directory inside the images application (create the folder structure if it does not exist).
2. Error message: net::ERR_ABORTED
One more file is required, bookmarklet.css, which is referenced on the line below in bookmarklet.js:
href: static_url + 'css/bookmarklet.css?r=' + Math.floor(Math.random()*99999999999999999999)
Action step:
Create a file bookmarklet.css inside the /static/css/ directory and put the CSS code in it. Refer to the link below for the CSS code:
Git Hub link for css code reference
1. After the above steps, restart the development server.
2. Drag the "Bookmark it" button to the bookmarks bar to create a bookmark in the browser.
3. Open any website served over HTTP (not HTTPS) and click the "Bookmark it" bookmark in the browser (not the button on the dashboard).
4. The pop-up below appears.

The problem is that the template doesn't exist, so try the following:
1. Make sure your include tag is on a single line.
2. Make sure a template with the same name exists in your project directory: /images/templates/file_name.js.
3. Or go to the settings and add your templates directory to the TEMPLATES setting.
4. If it loads but shows no images, remember that the bookmarklet only accepts jpeg and jpg images, so try another site such as Wikipedia.

I was able to solve this by making sure that the ngrok tunnel URL in bookmarklet.js and bookmarklet_launcher.js starts with https, not http.
Instead of http://127.0.0.1:8000/static/js/bookmarklet.js?r= it should be
https://127.0.0.1:8000/static/js/bookmarklet.js?r=

After 4 hours of doing everything... googling, deleting code, and rewriting code...
I only had to hit Ctrl+C to stop the server and run it again.
Just take a break and come back to fix it :)
Mine works same as instructed in the book – no changes, no nothing.
Only restarted the server.

Trace URL's that link to my iframe/widget

I have created a widget.html page with, for example, a "Powered by Example.com" box/widget. And I have an HTML iframe that links to that specific page (widget.html) on my site.
<iframe src="http://example.com/widget.html"></iframe>
I share that iframe code with website owners who want to use my widget on their sites.
I want to be able to see every site that uses my iframe. I would prefer code that writes the URLs of all websites using my widget to a text file or a MySQL table.
Basically, I want to track the sites that embed my widget as an iframe. How do I do that? With JavaScript? PHP? MySQL?
P.S. I'm not sure an iframe is the best way to offer widgets from my site, so I'm open to suggestions. Thanks in advance.

Use jQuery.
Load a request page through jQuery, for example: $("#div1").load("demo_test.txt");
and send the request URI to it as a parameter.
That way you get the URL of the page currently using the widget, and you can also read the parameter on the server.
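A minimal sketch of that idea, written in plain JavaScript for a script placed inside widget.html. The track.php endpoint, the site parameter name, and the buildTrackingUrl helper are hypothetical; the server-side script that stores the value in a file or a MySQL table is not shown. Inside an iframe, document.referrer normally holds the URL of the embedding page, although the embedding site's referrer policy can shorten or blank it.

```javascript
// Hypothetical logging endpoint on the widget owner's server (assumption).
var TRACK_ENDPOINT = 'http://example.com/track.php';

// Builds the request URL that reports which page embeds the widget.
function buildTrackingUrl(endpoint, embedderUrl) {
    return endpoint + '?site=' + encodeURIComponent(embedderUrl || 'unknown');
}

// Only runs in a browser. Requesting an image is a simple way to send
// a GET request to the endpoint without any library.
if (typeof document !== 'undefined') {
    new Image().src = buildTrackingUrl(TRACK_ENDPOINT, document.referrer);
}
```

The same URL could be requested with jQuery's $.get() or .load() as in the answer; the image request is used here only to keep the sketch free of dependencies.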

Trouble in scraping from a page

Referring to one of my previous questions, I have to scrape all reviews of a hotel, for example this hotel.
Using BeautifulSoup, I first collect the links to all review pages from the pagination inside the div with class BVRRPager BVRRPageBasedPager, and then scrape the reviews from every page.
The problem with BeautifulSoup is that the content in div.BVRRRatingSummary does not come along (try loading that page with JS disabled).
I have scraped the reviews using Selenium, but my client does not want to use Selenium because it loads the full page with JS and images.
What kind of process might they be using to load the reviews? And is there any way I can scrape the content in div.BVRRRatingSummary with BeautifulSoup?

You could try using firefox with the firebug addon. Open up firebug when loading the webpage and go to Net and then click on XHR. That will show you which json files are being loaded. You can then try to get those files directly and work with those using a library like simplejson.

Changing html content using javascript[Theory]

Suppose I open a website, say stackoverflow.com. It contains a lot of HTML content: text, spans and so on. How can I change that content automatically with a script (jQuery) if I know the classes and ids of the elements?
Say a webpage has this element:
<span class="a">Great</span>
When I open the webpage, I want to change it to:
<span class="a">Not Great</span>
Since I have no control over the server or its files, I can only change it from the developer tools, but that is manual. How can I do it automatically?

You want to create a Bookmarklet.
A bookmark can be more than just a URL -- they can also be snippets of injectable JavaScript.
Bookmarklets can be used to alter or modify any page that you're viewing on the web. They can add additional information about the page, aid the user in reading it, or perform some automated action in a web app.
javascript:(function(){
    if (window.jQuery) alert("jQuery is present");
    else alert("jQuery is NOT present");
})();
Just create a new bookmark, and paste the above as its URL. Click the bookmarklet to execute it on any page (to see if jQuery is present).
You can also use the bookmarklet code as an <a> tag's [href] attribute -- then the link can be right-clicked and bookmarked from there (the easiest way to get a user to install your bookmarklet).
My favorite bookmarklet is H5o, the HTML5 Document Outliner! Use it to check any web page's document outline structure, highly important for good SEO!
H5o Bookmarklet Installation Page
H5o on Google Code
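Applied to the span from the question, a bookmarklet could look roughly like the sketch below. It uses plain DOM calls so it does not depend on the page having jQuery loaded; the replaceText helper name is made up for illustration, and the span.a selector comes from the question's example.

```javascript
// Sets the text of every element matching `selector` and returns
// how many elements were changed.
function replaceText(doc, selector, text) {
    var nodes = doc.querySelectorAll(selector);
    for (var i = 0; i < nodes.length; i++) {
        nodes[i].textContent = text;
    }
    return nodes.length;
}

// In a browser, apply it to the live page.
if (typeof document !== 'undefined') {
    replaceText(document, 'span.a', 'Not Great');
}

// Bookmarklet form: the same logic on one line, pasted as the bookmark URL:
// javascript:(function(){var n=document.querySelectorAll('span.a');for(var i=0;i<n.length;i++){n[i].textContent='Not Great';}})();
```

A bookmarklet still needs one click per page load. It only removes the need to open the developer tools and type the change by hand.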

Very simple. You can use:
$(document).ready(function(){
    $(".a").text("Not Great");
});

Given your example...
jQuery(".a").text("Not Great");
Or, if you want to include tags:
jQuery(".a").html("<strong>Not Great</strong>");

Capturing a part of web page for mobile devices

I have an Android app in which I want to show a page to users inside a WebView. The problem is that I can't use the web page as it is: it is not responsive on mobile devices, so the user has to scroll horizontally and vertically a lot. The web page is:
http://www.ielts.org/test_centre_search/search_results.aspx
I just need the drop-down search functionality from that page. I tried copying the HTML source to a local file to replicate the page, but the form's action has to be http://www.ielts.org/test_centre_search/search_results.aspx to fetch the results. So when I select an option in my local version, it goes to http://www.ielts.org/test_centre_search/search_results.aspx and displays their version of the page from then on.
I came across this page:
http://www.ieltsessentials.com/test_centre_search.aspx
which implements the same functionality. How can I replicate it and add it to a local .html document?

I think the easiest way to implement this is to inject your own CSS into their HTML and hide or restyle the elements that are not responsive. That way you don't have to analyze any of their logic, because everything stays safely at the CSS level.
The only thing you have to figure out is how to re-inject your CSS into the WebView after the page is reloaded. There is a way to do that by simply injecting a JavaScript call into their page, as shown here: https://stackoverflow.com/a/5010864/467198
To detect that the page has been reloaded, I think you can use onPageFinished.
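The JavaScript side of that injection could be sketched as follows. The injectCss helper name and the CSS rules are assumptions for illustration: #header and #footer are placeholder selectors, and the real ones have to be read from the target page's markup.

```javascript
// Appends a <style> element containing `cssText` to the document head
// and returns the new element.
function injectCss(doc, cssText) {
    var style = doc.createElement('style');
    style.appendChild(doc.createTextNode(cssText));
    doc.head.appendChild(style);
    return style;
}

// Placeholder rules (assumption): hide page chrome and let tables use
// the full width of the screen.
var MOBILE_CSS = '#header, #footer { display: none; } ' +
                 'table { width: 100% !important; }';

// Only runs inside a page, for example after the WebView has loaded it.
if (typeof document !== 'undefined') {
    injectCss(document, MOBILE_CSS);
}
```

On the Android side, one option is to pass this script as a "javascript:" URL to WebView.loadUrl() from onPageFinished, so the styles are applied again after every page load.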

You could use ASP to proxy the page you want to cannibalize, then use jQuery to traverse the proxied page, pull out the pieces you need, and build your new, responsive document from the items scraped from the original page.
I'm not an ASP.NET developer, so I've used PHP in my example. Here's a link to an example of how ASP.NET could be used to proxy a page:
Simplest Possible ASP .NET AJAX Proxy Page
<?php echo file_get_contents( $_GET['u'] );
Then in jQuery use $.ajax() to read the proxied page as HTML and scrape it as needed:
<script>
$(function(){
    $.ajax({
        url:'proxy.php?u=http://www.ielts.org/test_centre_search/search_results.aspx',
        dataType:'html',
        success:function(data){
            console.log($('#header',data));
        }
    });
});
</script>
In this example I'm just reading the contents of #header, but you could scrape whatever you need from the original page and then insert it into your target DOM or pass it to a template. To retrieve your drop-down markup, you'd use '#Template_TestCentreSearch1_SearchTable' where I use '#header'.
