Referring to one of my previous questions: I have to scrape all the reviews of a hotel, for example this hotel.
Using BeautifulSoup, what I have done is first collect all the review page links from the pagination inside the div with class BVRRPager BVRRPageBasedPager, and then scrape the reviews from every page.
The problem with BeautifulSoup is that the content in div.BVRRRatingSummary does not come along (try loading that page with JS disabled).
I have scraped the reviews using Selenium, but my client does not want to use Selenium because it loads the full page with JS and images.
I want to know what kind of process they might be using to load the reviews. And is there any way I can scrape the content in div.BVRRRatingSummary with BeautifulSoup?
You could try using Firefox with the Firebug add-on. Open Firebug while loading the web page, go to the Net tab, and click on XHR. That will show you which JSON files are being loaded. You can then try to get those files directly and work with them using a library like simplejson.
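Once Firebug shows you the XHR URL, you can request that JSON directly instead of the rendered page. Here is a minimal sketch of the idea in Node.js; the URL and its query parameters are made-up placeholders that you would replace with the real endpoint copied from the Net panel (from Python the equivalent is an ordinary HTTP request parsed with simplejson):

// Minimal sketch: request the JSON that the review widget loads via XHR.
// The URL below is a hypothetical placeholder -- copy the real one from
// Firebug's Net > XHR panel; it usually carries the hotel/product id and page number.
var https = require('https');

var url = 'https://example.com/reviews.json?productId=HOTEL123&page=1';

https.get(url, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
        var data = JSON.parse(body);   // same role simplejson plays in Python
        console.log(data);             // inspect the structure, then pull out the reviews
    });
}).on('error', function (err) {
    console.error('Request failed:', err.message);
});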
I want to fetch this site:
https://www.film-fish.com/modern-mindless-action
to get the IMDb IDs of all movies listed there.
The problem is that the page only loads all of the listed movies after scrolling down, so a simple wget doesn't work.
Even if I scroll to the bottom of the page and view the source code, I do not see the last movie in the list (Hard Kill (2020)).
So the problem seems to be that the content is being created via JavaScript.
Does anybody have a tip on how to achieve that?
Indeed, executing JavaScript code is beyond the scope of GNU Wget; you would need a browser automation tool. If you know some Node.js or JavaScript, I suggest taking a look at the PhantomJS Quick Start and Page Automation guides. Take a look at the first example in the second link; you should be able to rework it to your needs, i.e. instruct the page to scroll down using JavaScript and then extract what you need, also using JavaScript.
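A rough PhantomJS sketch of that idea follows; the number of scrolls, the delay between them, and the link selector are assumptions you would need to tune against the actual page markup:

// Rough sketch: scroll the page repeatedly so the lazily loaded movies appear,
// then collect the links. The scroll count (20), the 1 s delay, and the plain 'a'
// selector are assumptions -- adjust them to the real page.
var page = require('webpage').create();

page.open('https://www.film-fish.com/modern-mindless-action', function (status) {
    if (status !== 'success') { phantom.exit(1); }

    var scrolls = 0;
    var timer = setInterval(function () {
        page.evaluate(function () {
            window.scrollTo(0, document.body.scrollHeight);  // trigger the infinite scroll
        });
        scrolls += 1;
        if (scrolls >= 20) {                                 // assumed to be enough to reach the end of the list
            clearInterval(timer);
            var links = page.evaluate(function () {
                return Array.prototype.map.call(
                    document.querySelectorAll('a'),          // narrow this down to the movie links
                    function (a) { return a.href; });
            });
            console.log(links.join('\n'));
            phantom.exit();
        }
    }, 1000);                                                // give each batch a second to load
});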
I have set up a dynamic competition page where the query string determines what content you see.
For example, http://nectarfinance.com.au/dc=korinadrogan will show Korina's content, while no query string will show the generic head office content.
The site as it stands loads slowly, and I know it is because of the dynamic Facebook 'like and share' scripts on the page.
I was wondering if there is any way to minify these scripts into one, or any way to improve their load time or reduce their size.
I'm not sure how to work around it as the files are externally hosted by Facebook.
I'll post the GTMetrix report in the answer below, as I can't post two links.
Thanks for your help
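One common mitigation, assuming the buttons are rendered by the Facebook JavaScript SDK, is to make sure the SDK is loaded asynchronously so it does not block the rest of the page. This is the standard async loader pattern, sketched here with a generic locale and no SDK options pinned:

// Standard asynchronous loader for the Facebook JavaScript SDK: the script tag
// is appended after the first <script> on the page and downloads without
// blocking rendering. The locale (en_US) and SDK options are left generic.
(function (d, s, id) {
    var js, fjs = d.getElementsByTagName(s)[0];
    if (d.getElementById(id)) { return; }
    js = d.createElement(s);
    js.id = id;
    js.src = '//connect.facebook.net/en_US/sdk.js#xfbml=1';
    fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));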
I have created a widget.html page with, for example, a "Powered by Example.com" box/widget. And I have an HTML iframe that links to that specific page (widget.html) on my site.
<iframe src="http://example.com/widget.html"></iframe>
I share that iframe code with website owners who want to use my widget on their sites.
I want to be able to see every single site that uses my iframe. I would prefer code that writes the URL of every website using my widget to a txt file, or even a MySQL table.
I basically want to track the sites that use my widget as an iframe. How do I do that? With JavaScript? PHP? MySQL?
P.S. I'm not sure if an iframe is the best way to link widgets off my site, but I'm open to your suggestions. Thanks in advance.
Use jQuery.
Inside the widget, load a page from your own server through jQuery, e.g.: $("#div1").load("demo_test.txt");
and send a request URI parameter along with it.
That way you will find the current URL that is using the widget, and you can also read the parameter on your side.
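A minimal sketch of that idea, placed inside widget.html and assuming jQuery is loaded there as the answer suggests: document.referrer gives the URL of the page embedding the iframe, and 'log.php' is a hypothetical endpoint on your server that would append the reported URL to a txt file or insert it into a MySQL table:

// Inside widget.html: report the embedding page back to your own server.
// document.referrer is the URL of the page that contains the <iframe>;
// log.php is a hypothetical endpoint that stores it (txt file or MySQL).
$(function () {
    var embeddingSite = document.referrer || 'unknown';
    $.get('http://example.com/log.php', { site: embeddingSite });
});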
I have an Android app where I want to show a page to users inside a WebView. The problem I am facing is that I can't use the web page as it is, because the page is not responsive on mobile devices and the user has to scroll horizontally and vertically a lot. The web page is:
http://www.ielts.org/test_centre_search/search_results.aspx
I just need the drop-down search functionality from that page. I tried copying the HTML source locally to replicate the page, but since the HTML form's action has to be http://www.ielts.org/test_centre_search/search_results.aspx for fetching the results, when I select an option on my local version it goes to that URL and displays their version of the page.
I came across this page:
http://www.ieltsessentials.com/test_centre_search.aspx
which implements the same functionality. How can I replicate this and add it to a local .html document?
I think the easiest way to implement this will be to inject your own CSS into their HTML and hide/restyle the elements that are not responsive. That way you don't have to analyze any of the logic they have, as the change stays safely at the CSS level.
The only thing you have to figure out is how to re-inject your CSS into the WebView after the page is reloaded. There is actually a way to do that by simply injecting a JavaScript call into their page, like here: https://stackoverflow.com/a/5010864/467198
To detect that the page has been reloaded, I think you can use onPageFinished.
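As a rough sketch, the JavaScript you inject from onPageFinished (for example via webView.loadUrl("javascript:(function(){...})()")) could add a style tag like this; the selectors and rules are placeholders, with the table id taken from the other answer on this question:

// Sketch of the snippet to inject after the page loads. The selectors and
// rules below are placeholders -- replace them with whatever actually needs
// hiding or restyling on that page.
(function () {
    var style = document.createElement('style');
    style.innerHTML =
        '#header, #footer { display: none; }' +                      // hide non-essential chrome
        '#Template_TestCentreSearch1_SearchTable { width: 100%; }';  // let the search table fit the screen
    document.head.appendChild(style);
})();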
You could use ASP to proxy the page you want to cannibalize, then in jQuery traverse that proxied page, pull out the pieces you want to use, and build your new, responsive document from the items scraped from the original page.
I'm not an ASP.NET developer, so I've used PHP in my example. Here's a link to an example of how ASP.NET could be used to proxy a page:
Simplest Possible ASP .NET AJAX Proxy Page
<?php echo file_get_contents( $_GET['u'] );
Then, in jQuery, use $.ajax() to read the proxied page as HTML and scrape it as needed:
<script>
$(function () {
    $.ajax({
        url: 'proxy.php?u=http://www.ielts.org/test_centre_search/search_results.aspx',
        dataType: 'html',
        success: function (data) {
            console.log($('#header', data));
        }
    });
});
</script>
In this example I'm just reading the contents of #header, but you could scrape whatever you need from the original page and then insert it into your target DOM or pass it to a template. To get what you're looking for, you would use '#Template_TestCentreSearch1_SearchTable' where I use '#header' to retrieve your drop-down markup.
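For instance, a variation of the same call that drops the scraped search table into your own page; the #my-container id is a made-up placeholder for wherever you want the markup to land:

<script>
$(function () {
    $.ajax({
        url: 'proxy.php?u=http://www.ielts.org/test_centre_search/search_results.aspx',
        dataType: 'html',
        success: function (data) {
            // pull the search table out of the proxied HTML and insert it
            // into a (hypothetical) #my-container element in your own page
            var searchTable = $('#Template_TestCentreSearch1_SearchTable', data);
            $('#my-container').empty().append(searchTable);
        }
    });
});
</script>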
My company uses a website that hosts all of our FAQs and customer questions. We have plans to go through, wipe out all of the old data, and enter new data, and the service does not have a backup or archive option for questions we don't want to appear anymore.
I've tried to scrape the site using Perl and Mechanize, but I'm missing the customer comments on the page, as they are loaded through AJAX. I have looked at PhantomJS and can get a page to save to an image using an example page; however, I'd like to get a full HTML dump of the page and can't figure out how. I used this example code on our site:
var page = new WebPage();
page.open('http://espn.go.com/nfl/', function (status) {
    // once the page has loaded, include jQuery from the CDN
    page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
        // once jQuery has loaded, run some code:
        // insert our custom text into the page
        page.evaluate(function () {
            $("h2").html('Many NFL Players Scared that Chad Moon Will Enter League');
        });
        // take a screenshot and exit
        page.render('espn.png');
        phantom.exit();
    });
});
Is there a way, using PhantomJS, to just get a full dump of the page data, similar to doing a View Source in Chrome? I can do this with Perl + Mechanize, but I don't see how to do it with PhantomJS.
You can use page.content to get the full HTML of the rendered DOM.
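For example, adapting the snippet above into a sketch that writes the rendered HTML to disk ('dump.html' is just an arbitrary output filename):

// Sketch: dump the rendered HTML instead of an image. page.content is the
// current markup of the DOM, including anything added by the page's AJAX calls.
var fs = require('fs');
var page = new WebPage();

page.open('http://espn.go.com/nfl/', function (status) {
    fs.write('dump.html', page.content, 'w');
    phantom.exit();
});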
I would recommend pjscrape http://nrabinowitz.github.com/pjscrape/ if you want to scrape using PhantomJS