I'm working on a script that takes a screenshot of a website every day. I have already done this for other sites and it worked correctly, but this time I have a problem: my PhantomJS script captures almost all of the data on the website, but not all of it (in fact, it misses the part that matters most for my case).
Until now I was using this simple adapted script:
var page = require('webpage').create();
page.open('http://www.website.com', function() {
setTimeout(function() {
page.render('render.png');
phantom.exit();
}, 200);
});
But when I run the same script for this site, some data is lost. It takes the screenshot but misses the prices.
Screenshot of the site with phantomjs
After exploring a bit, I saw that if I parse the DOM (for example, using PHP Simple HTML DOM Parser) I can get most of the data, but not the prices.
$html = file_get_html('https://www.falabella.com.ar/falabella-ar/category/cat10178/TV-LED-y-Smart-TV');
$prods = $html->find('div[class=fb-pod-group__item]');
foreach ($prods as $prod) {
// For example i can get the title
$title = $prod->find('h4[class=fb-responsive-hdng-5 fb-pod__product-title]',0)->plaintext;
// But not the price
$price = $prod->find('h4[class=fb-price]',0)->plaintext;
}
Exploring the console log, I found the JavaScript objects where these values are. If I return the object fbra_browseProductListConfig.state.searchItemList.resultList[0].prices[0].originalPrice;
I see the price of the first product, and so on for the others:
Console log of the site
I can also get it with a PhantomJS script like this:
var page = require("webpage").create();
page.open("https://www.falabella.com.ar/falabella-ar/category/cat10122/Cafeteras-express", function(status) {
var price = page.evaluate(function() {
return fbra_browseProductListConfig.state.searchItemList.resultList[0].prices[0].originalPrice;
});
console.log("The price is " + price);
phantom.exit();
});
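The single-product lookup above can be extended to every product on the page. The following is only a sketch: it assumes each entry of resultList has the same prices[0].originalPrice shape as the first one, and the helper name extractPrices and the sample data are hypothetical.

```javascript
// Collect the original price of every product from a config object shaped
// like the fbra_browseProductListConfig global shown above.
// Entries without a price yield null instead of throwing.
function extractPrices(config) {
  var list = (config && config.state && config.state.searchItemList &&
              config.state.searchItemList.resultList) || [];
  return list.map(function (item) {
    var first = item.prices && item.prices[0];
    return first ? first.originalPrice : null;
  });
}

// Stand-in data for illustration only; the real values come from the page.
var sampleConfig = { state: { searchItemList: { resultList: [
  { prices: [{ originalPrice: '1.999' }] },
  { prices: [] }
] } } };

console.log(extractPrices(sampleConfig));
```

Because page.evaluate runs its callback inside the page context, the helper would have to be defined inside that callback rather than in the outer PhantomJS script.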
In other posts (like this) I read about changing the timeout intervals, but that is not working for me (I tried all the scripts shared in the linked post). The problem is not that the website fails to load fully; it seems that this data (the prices) is never written into the DOM. I even downloaded the full site from the terminal with the wget command, and the prices are not there.
Edited
When I execute the script I get the following errors:
./phantomjs fala.js
ReferenceError: Can't find variable: Set
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:22
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:1 in t
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:22
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:1 in t
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:22
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:1 in t
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:1
TypeError: undefined is not an object (evaluating 't.componentDomId')
https://www.falabella.com.ar/static/assets/scripts/react/productListApp.js?vid=111111111:3
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:22
Maybe the problem is there, because the script "productListApp.js" is what renders the prices?
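The first error says the engine has no Set, so any page script that relies on it stops before it can render. One possible workaround, offered only as a sketch, is to provide a stand-in before the page scripts run (in PhantomJS this could be attempted from the page.onInitialized callback, for example with page.injectJs). The SimpleSet name below is hypothetical, and a real script would more likely load a complete polyfill such as core-js, since the page may need more than this minimal version provides.

```javascript
// Minimal, hypothetical stand-in for the ES2015 Set, written in ES5 syntax.
// It covers only add/has/delete/clear/forEach/size and compares with
// indexOf, so unlike a real Set it cannot find NaN.
function SimpleSet(initialValues) {
  this._items = [];
  if (initialValues) {
    for (var i = 0; i < initialValues.length; i++) {
      this.add(initialValues[i]);
    }
  }
}
SimpleSet.prototype.has = function (value) {
  return this._items.indexOf(value) !== -1;
};
SimpleSet.prototype.add = function (value) {
  if (!this.has(value)) {
    this._items.push(value);
  }
  return this; // allow chaining, like the native Set
};
SimpleSet.prototype['delete'] = function (value) {
  var index = this._items.indexOf(value);
  if (index === -1) {
    return false;
  }
  this._items.splice(index, 1);
  return true;
};
SimpleSet.prototype.clear = function () {
  this._items = [];
};
SimpleSet.prototype.forEach = function (callback, thisArg) {
  // Copy first so callbacks may safely modify the set.
  var snapshot = this._items.slice();
  for (var i = 0; i < snapshot.length; i++) {
    callback.call(thisArg, snapshot[i], snapshot[i], this);
  }
};
Object.defineProperty(SimpleSet.prototype, 'size', {
  get: function () { return this._items.length; }
});

// Use the stand-in only when the engine has no Set of its own.
var SetImpl = (typeof Set === 'undefined') ? SimpleSet : Set;
```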
Related
I'm building a Google Chrome extension that looks at a webpage, does some calculations based on features of the page, and then loads an iFrame to display the results. Currently, I am working on a more accessible version for visually impaired users. My options page has a setting that users can select if they want the visually accessible version, and that choice is stored as a boolean in browser storage. The issue is that I have to check for that boolean in storage every single time I load the iFrame (which is every time the page switches or refreshes), and it adds roughly 500ms of latency to the iFrame load.
I have tried using both chrome.storage.sync, and localStorage (from background, with message passing to my content script) to see if the synchronous version would be a bit faster, but they both add roughly 500ms to the running of my content script. Right now I have two different html files, the standard one and the visually accessible one, and the content script chooses which to load based on retrieving the accessibility boolean from storage. If there is a faster way to just programmatically switch the css that the standard html file loads, I could do that as well. The thing is, any way I figure it, I just can't seem to think of a way to avoid having to retrieve the boolean from storage every single time the iFrame loads.
I suppose I'm wondering if there is some other way around this, like if I could somehow direct the extension to just automatically use a certain version of the html based on which option the user selects when they install the extension. Any suggestions would be greatly appreciated.
Here is the function in question (from content script):
function insertFrame(){
var extensionOrigin = 'chrome-extension://' + chrome.runtime.id;
if (!location.ancestorOrigins.contains(extensionOrigin)) {
chrome.runtime.sendMessage({contentScriptQuery: "accessible?"}, function(response){
var accessible = response;
if(accessible === "true"){
//load the accessible frame
var iframe = document.createElement('iframe');
iframe.id = "myFrame";
iframe.src = chrome.runtime.getURL('accessible.html');
document.body.appendChild(iframe);
}else{
//load the regular frame
var iframe = document.createElement('iframe');
iframe.id = "myFrame";
iframe.src = chrome.runtime.getURL('popup.html');
document.body.appendChild(iframe);
}
console.log("Time to run content script:", Date.now() - timer);
});
}
//for testing purposes
// var extensionOrigin = 'chrome-extension://' + chrome.runtime.id;
// if (!location.ancestorOrigins.contains(extensionOrigin)) {
// var iframe = document.createElement('iframe');
// iframe.id = "myFrame";
// iframe.src = chrome.runtime.getURL('popup.html');
// document.body.appendChild(iframe);
// }
// console.log("Time to run content script:", Date.now() - startTime);
}
And in my background page:
chrome.runtime.onMessage.addListener(function (request, sender, sendResponse){
if(request.contentScriptQuery === 'accessible?'){
var a = localStorage.getItem('accessible');
sendResponse(a);
}
return true; //with or without this line, timing is the same
});
I've been testing the content script just by commenting out each half (with and without reading from storage). You can see the line where I am logging how many milliseconds have elapsed since the content script started running. I have also verified this latency by testing load times in the network panel of dev tools. I get an average of 6.33s for load time without reading storage, and 6.72s with reading storage, which confirms the timing discrepancy I am logging in my content script. The only thing I change between tests is which half of the function is commented out.
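One way to avoid hitting storage on every iFrame load is to read the flag once in the background page and keep it in memory. The sketch below is an assumption-laden illustration, not a measured fix: createCachedFlag and readFlag are hypothetical names, and it only helps if the storage read, rather than the message passing itself, is where the time goes.

```javascript
// readFlag is a hypothetical function that performs the (possibly slow)
// storage read, e.g. function () { return localStorage.getItem('accessible'); }
function createCachedFlag(readFlag) {
  var cached;          // last value read from storage
  var loaded = false;  // whether `cached` holds a real value yet
  return {
    // Read from storage on the first call only, then reuse the value.
    get: function () {
      if (!loaded) {
        cached = readFlag();
        loaded = true;
      }
      return cached;
    },
    // Call this when the options page changes the setting.
    set: function (value) {
      cached = value;
      loaded = true;
    }
  };
}

// Demonstration with a fake storage read that counts how often it runs.
var reads = 0;
var flag = createCachedFlag(function () {
  reads += 1;
  return 'true';
});
flag.get();
flag.get();
console.log('storage reads:', reads);
```

In the background listener shown above, sendResponse(flag.get()) would replace the direct localStorage.getItem call, and the options page would need to tell the background page to call flag.set whenever the user changes the option.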
I am preparing JavaScript code that shows a random number to the user as follows: if the user spends more than two minutes before moving to the next web page, or if the current page has the GET parameter "&source", the random number is replaced by a new one. Otherwise, the same random number is displayed on all the web pages.
The problem is that the JavaScript code has to be executed manually from the browser console on each page load: I need to prepare code that can be run on any web page from the console.
Is there any difference from the normal case (including the script with <script></script>)?
Thanks for posting! In future posts, please try to provide some code or an example of something you've tried previously.
Anyways, here is a brief example of a script that will check for an existing number, check to see if there is a &source parameter set, begin the timer if there isn't one, and generate a new number if the timer finishes or the parameter is set.
To save the information between pages, you should consider using window.localStorage. This will allow you to check for and save the number to be used on later loads.
Note that this snippet isn't going to work until you bring it into your own page. Also, as @Sorin-Vladu mentioned, you'll have to use a browser extension if you don't have access to modify the pages you're running the script on.
const timeout = 120000
// This can be replaced by your manual execution
window.onload = () => {
start()
}
function start() {
// Attempt to pull the code from storage
let code = localStorage.getItem('code')
console.log(code)
// Get the URL parameters
let urlParams = new URLSearchParams(window.location.search)
// Check to see if the source parameter exists
if (!urlParams.has('source')) {
// If not, begin the timer
setTimeout(() => {
setCode()
}, timeout)
} else {
setCode()
}
}
function setCode() {
const code = Math.floor(Math.random() * 1000000)
localStorage.setItem('code', code)
console.log(code)
}
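The snippet above starts a timer on each page but does not record how long the user took to move between pages, and it never creates a number when none is stored yet. A possible variation, sketched here under assumptions, stores a timestamp next to the number and decides on each load whether to keep or replace it. The function resolveCode, the stored shape { code, lastSeen }, and the storage key in the comments are all hypothetical.

```javascript
var TIMEOUT_MS = 120000; // two minutes, as in the snippet above

// Decide which number to show on this page load.
// stored    : null, or { code: number, lastSeen: millisecond timestamp }
// hasSource : true when the URL carries the "source" parameter
// now       : current time in milliseconds
// randomFn  : function returning a number in [0, 1), e.g. Math.random
function resolveCode(stored, hasSource, now, randomFn) {
  var expired = stored !== null && (now - stored.lastSeen) > TIMEOUT_MS;
  if (stored === null || hasSource || expired) {
    // First visit, forced refresh, or too slow: generate a new number.
    return { code: Math.floor(randomFn() * 1000000), lastSeen: now };
  }
  // Keep the number but remember when this page was loaded.
  return { code: stored.code, lastSeen: now };
}

// Possible browser usage (not executed here):
//   var raw = localStorage.getItem('codeState');
//   var hasSource = new URLSearchParams(window.location.search).has('source');
//   var next = resolveCode(raw ? JSON.parse(raw) : null, hasSource, Date.now(), Math.random);
//   localStorage.setItem('codeState', JSON.stringify(next));
//   console.log(next.code);
```

Passing the clock and the random source in as arguments keeps the decision itself free of browser globals, so the same function behaves identically whether it is pasted into the console or loaded from a script tag.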
I am scraping web pages, and sometimes the page does not load correctly and this error occurs:
IndexError: list index out of range
This is because, when the page does not load correctly, the list has no element at that index. Reloading the page solves this.
Is there a way to add error handling so that, if the page has not loaded and the error occurs, the page is reloaded?
I have searched the internet and cannot find anything.
for link in links:
    #print('Fetching from link: ' + link)
    browser.get('http://www.racingpost.com' + link)
    time.sleep(5)
    print('http://www.racingpost.com' + link)
    tree = html.fromstring(browser.page_source)
    #print(browser.page_source)
    if count == 0:
        browser.find_element_by_xpath("//*[@id='re_']/div[2]/a[1]").click()
        browser.find_element_by_xpath("//*[@id='re_']/div[2]/a[2]").click()
        count = count + 1
    #first of all pull all the data about the event its self like going distance ect
    title = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/h3/text()[2]')
    title = map(lambda x:x.strip(),title)
    title = [x.strip(' ') for x in title]
    details = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/ul/li[1]/text()[1]')
    prizemoney = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/ul/li[2]/text()[1]')
    setoff = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/h3/span/text()')
    course = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/h1/text()[1]')
    print(course)
    course[0] = course[0].replace('Result', '')
    date = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/h1/text()[2]')
    timeoff = tree.xpath('//div[@class="raceInfo"]/text()[3]')
Above is a code snippet: if browser.get does not fetch the page (the server rejects the request or it times out), then I'd need to retry.
You can add a try/except block that returns some kind of value to signal that the page has not loaded properly; then you can use the JavaScript function location.reload() to reload the page.
For example:
In your Python Script:
try:
    'Your Code Goes Here'
except IndexError:
    return 'e'
And in your Javascript:
if (xml.responseText == 'e') {
    location.reload(true); // pass true if you don't want to load from cache, otherwise leave it blank
}
Let me know if there is any error, as I am also trying to use a Python script with AJAX on my website and I am quite new to it. I would have preferred to comment instead of answering, but my reputation doesn't allow me to do so.
Cheers
I think you need to do a little refactoring.
It should be something like this:
def get_page(link):
    # all the code for fetching the page
    # this code could return either an error code or raise an Exception
    return data
for link in links:
    try:
        result = get_page(link)
        # here you need to add this result
    except IndexError:
        # log this error
        result = get_page(link)  # this is the retry; you can add sleep() here too
This is a quick and dirty example; you can improve it with better retry logging, counting retries for each link globally, and so on.
This has been giving me a lot of trouble and so far every stackoverflow question/answer I've found and every other bit of googling hasn't helped me all that much.
I have two HTML files. One is called cs, and the other is called csi. They both are linked to a common js file where I'm trying to implement the variable that will persist across both of them.
The variable is defined in the cs html by user input, and is then brought up in csi.
Here's what the JavaScript looks like. I have it set to run onlyCS on the cs html on body load, and onlyCSI on the csi html on body load.
The colors persist, and the variable MyApp.str is established in cs, but when csi loads, MyApp.str becomes "undefined".
I figured I would've avoided this because I established MyApp.str = strChar, which is itself established as a bit of user input that's only available in cs.
var MyApp = {}; // Globally scoped object
function onlyCS(){
MyApp.color = 'green';
setInterval(strdefine, 3000)
}
function onlyCSI(){
MyApp.color = 'red';
setInterval(bar, 3000)
}
function strdefine(){
alert(MyApp.color); // Alerts 'green'
strChar = parseInt($('#Xdemo').text(), 10);
$('#result').text(strChar);
MyApp.str = strChar;
alert('the myapp global obj (str) is currently ' + MyApp.str);
}
function bar(){
alert(MyApp.color); // Should alert 'red'
alert('the myapp global obj (str) is currently ' + MyApp.str);
}
If anyone could help me out, I'd really appreciate it.
EDIT: The comments helped me figure out that using localStorage and variables is a good solution for my problem.
Within strdefine I put
strChar = parseInt($('#Xdemo').text(), 10);
localStorage.setItem('str', strChar);
and within bar I put
alert('LS "str" is currently ' + localStorage.str);
var ex = localStorage.getItem('str') || 0;
$('#result').text(ex);
Your choices are:
Persist/read in a cookie
Persist/read in local storage
Persist/read in session storage
Send to the server, then get it back from the server (likely in some session variable)
Just putting a variable in a JavaScript object does NOT persist past a page refresh. You can also use AJAX to read some HTML portion stored in a server-side file and then apply it to your page in some container (not a full page, just a partial such as <div>my new</div>).
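As a rough illustration of the local storage and session storage options, the sketch below wraps reads and writes in two small helpers. The names saveValue, loadValue and createMemoryStorage are hypothetical, and the in-memory object only stands in for window.localStorage or window.sessionStorage so the example can run outside a browser.

```javascript
// Save a value through any object that offers getItem/setItem,
// such as window.localStorage or window.sessionStorage.
function saveValue(storage, key, value) {
  // JSON keeps the type, so a number comes back as a number.
  storage.setItem(key, JSON.stringify(value));
}

// Load a value, returning `fallback` when the key is missing or unreadable.
function loadValue(storage, key, fallback) {
  var raw = storage.getItem(key);
  if (raw === null || raw === undefined) {
    return fallback;
  }
  try {
    return JSON.parse(raw);
  } catch (e) {
    return fallback;
  }
}

// In-memory stand-in with the same two methods (for illustration only).
function createMemoryStorage() {
  var data = {};
  return {
    getItem: function (key) {
      return Object.prototype.hasOwnProperty.call(data, key) ? data[key] : null;
    },
    setItem: function (key, value) {
      data[key] = String(value);
    }
  };
}

var store = createMemoryStorage(); // in a browser: window.localStorage

saveValue(store, 'str', 42);                // what the first page would do
console.log(loadValue(store, 'str', 0));    // what the second page would do
```

Storage only holds strings, which is why a plain setItem with a number returns text on the next page; going through JSON avoids a separate parseInt on the reading side.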
I'm trying to download the HTML of a website that is almost entirely generated by JavaScript. So, I need to simulate browser access and have been playing around with PhantomJS. Problem is, the site uses hashbang URLs and I can't seem to get PhantomJS to process the hashbang -- it just keeps calling up the homepage.
The site is http://www.regulations.gov. The default takes you to #!home. I've tried using the following code (from here) to try and process different hashbangs.
if (phantom.state.length === 0) {
if (phantom.args.length === 0) {
console.log('Usage: loadreg_1.js <some hash>');
phantom.exit();
}
var address = 'http://www.regulations.gov/';
console.log(address);
phantom.state = Date.now().toString();
phantom.open(address);
} else {
var hash = phantom.args[0];
document.location = hash;
console.log(document.location.hash);
var elapsed = Date.now() - new Date().setTime(phantom.state);
if (phantom.loadStatus === 'success') {
if (!first_time) {
var first_time = true;
if (!document.addEventListener) {
console.log('Not SUPPORTED!');
}
phantom.render('result.png');
var markup = document.documentElement.innerHTML;
console.log(markup);
phantom.exit();
}
} else {
console.log('FAIL to load the address');
phantom.exit();
}
}
This code produces the correct hashbang (for instance, I can set the hash to '#!contactus') but it doesn't dynamically generate any different HTML, just the default page. It does, however, correctly output that hash when I call document.location.hash.
I've also tried to set the initial address to the hashbang, but then the script just hangs and doesn't do anything. For example, if I set the url to http://www.regulations.gov/#!searchResults;rpp=10;po=0 the script just hangs after printing the address to the terminal and nothing ever happens.
The issue here is that the content of the page loads asynchronously, but you're expecting it to be available as soon as the page is loaded.
In order to scrape a page that loads content asynchronously, you need to wait to scrape until the content you're interested in has been loaded. Depending on the page, there might be different ways of checking, but the easiest is just to check at regular intervals for something you expect to see, until you find it.
The trick here is figuring out what to look for - you need something that won't be present on the page until your desired content has been loaded. In this case, the easiest option I found for top-level pages is to manually input the H1 tags you expect to see on each page, keying them to the hash:
var titleMap = {
'#!contactUs': 'Contact Us',
'#!aboutUs': 'About Us'
// etc for the other pages
};
Then in your success block, you can set a recurring timeout to look for the title you want in an h1 tag. When it shows up, you know you can render the page:
if (phantom.loadStatus === 'success') {
    // poll every 300 milliseconds
    var timeoutId = window.setInterval(function () {
        // check for the title element you expect to see
        var h1s = document.querySelectorAll('h1');
        // querySelectorAll always returns a node list, so test its length
        if (h1s.length > 0) {
            // h1s is a node list, not an array, hence the
            // weird syntax here
            Array.prototype.forEach.call(h1s, function(h1) {
                if (h1.textContent.trim() === titleMap[hash]) {
                    // we found it!
                    console.log('Found H1: ' + h1.textContent.trim());
                    phantom.render('result.png');
                    console.log("Rendered image.");
                    // stop the cycle
                    window.clearInterval(timeoutId);
                    phantom.exit();
                }
            });
            console.log('Found H1 tags, but not ' + titleMap[hash]);
        } else {
            console.log('No H1 tags found.');
        }
    }, 300);
}
The above code works for me. But it won't work if you need to scrape search results - you'll need to figure out an identifying element or bit of text that you can look for without having to know the title ahead of time.
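The polling idea can also be pulled out into a small reusable helper that gives up after a fixed number of attempts instead of looping forever. This is a sketch only: createPoller and its parameters are hypothetical names, and the check function stands in for whatever test identifies the content you are waiting for.

```javascript
// Repeatedly run `check` until it returns a truthy value or the attempt
// limit is reached.
// check       : function returning a truthy result once the content is ready
// onReady     : called once with (result, attempts) when check succeeds
// onTimeout   : called once with (attempts) when the limit is reached
// maxAttempts : how many checks to allow before giving up
function createPoller(check, onReady, onTimeout, maxAttempts) {
  var attempts = 0;
  var done = false;
  var timerId = null;

  function stop() {
    done = true;
    if (timerId !== null) {
      clearInterval(timerId);
      timerId = null;
    }
  }

  // One polling step; exposed so it can also be driven manually.
  function tick() {
    if (done) {
      return;
    }
    attempts += 1;
    var result = check();
    if (result) {
      stop();
      onReady(result, attempts);
    } else if (attempts >= maxAttempts) {
      stop();
      onTimeout(attempts);
    }
  }

  return {
    tick: tick,
    stop: stop,
    // Run tick every intervalMs milliseconds until finished.
    start: function (intervalMs) {
      timerId = setInterval(tick, intervalMs);
    }
  };
}
```

In the script above, check would look for the expected h1 text, onReady would render and exit, and onTimeout would log a failure and exit, with something like createPoller(check, onReady, onTimeout, 50).start(300) giving roughly fifteen seconds before giving up.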
Edit: Also, it looks like the newest version of PhantomJS now triggers an onResourceReceived event when it gets new data. I haven't looked into this, but you might be able to bind a listener to this event to achieve the same effect.