How to parse a web page that uses JavaScript to load its HTML with Python? - javascript

I'm using Python to parse an auction site.
If I use a browser to open this site, it goes to a loading page, then jumps to the search result page automatically.
If I use urllib2 to open the webpage, the read() method only returns the loading page.
Is there any Python package that can wait until all the contents are loaded and then have read() return the full results?
Thanks.

How does the search page work? If it loads anything using Ajax, you could do some basic reverse engineering and find the URLs involved using Firebug's Net panel or Wireshark and then use urllib2 to load those.
If it's more complicated than that, you could simulate the actions JS performs manually without loading and interpreting JavaScript. It all depends on how the search page works.
Lastly, I know there are ways to run scripting on pages without a browser, since that's what some functional testing suites do, but my guess is that this could be the most complicated approach.
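For example, once Firebug's Net panel (or Wireshark) reveals the endpoint the loading page calls, urllib2 can request it directly. A minimal sketch; the endpoint URL and parameter names below are made up for illustration, not taken from the site:

import urllib
import urllib2

# Hypothetical Ajax endpoint spotted in Firebug's Net panel
search_url = 'http://example.com/ajax/search.php'
params = urllib.urlencode({'k': 'halo reach', 'page': 1})

request = urllib2.Request(search_url + '?' + params)
# Some endpoints check these headers, so mimic what the browser sends
request.add_header('X-Requested-With', 'XMLHttpRequest')
request.add_header('User-Agent', 'Mozilla/5.0')

response = urllib2.urlopen(request)
print response.read()  # the fragment the page would have loaded via Ajax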

After tracing through the auction site's source code, I found that it uses PHP to create the loading page and then redirect to the result page. Reverse engineering to find the true URLs doesn't work because they are the same URL as the loading page.
And @Manoj Govindan, I've tried Mechanize, but even if I add
br.set_handle_refresh(True)
br.set_handle_redirect(True)
it still reads only the loading page.
After hours of searching the web, I found a possible solution: using pywin32.
import win32com.client
import time

url = 'http://search.ruten.com.tw/search/s000.php?searchfrom=headbar&k=halo+reach'
ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = 0
ie.Navigate(url)

while 1:
    state = ie.ReadyState
    if state == 4:
        break
    time.sleep(1)

print ie.Document.body.innerHTML
However, this only works on the win32 platform; I'm looking for a cross-platform solution.
If anyone knows how to deal with this, please tell me.
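One cross-platform option along the same lines as the pywin32 snippet is Selenium, which drives a real Firefox or Chrome instance on Windows, Linux, and Mac. A rough sketch using the Selenium Python bindings; the ready-state polling mirrors the IE loop above:

import time
from selenium import webdriver

url = 'http://search.ruten.com.tw/search/s000.php?searchfrom=headbar&k=halo+reach'

driver = webdriver.Firefox()   # same code runs on any platform Selenium supports
driver.get(url)

# Poll until the document reports it has finished loading, mirroring
# the ReadyState loop used with the InternetExplorer.Application object
while driver.execute_script("return document.readyState") != "complete":
    time.sleep(1)

print(driver.page_source)      # full HTML after the JavaScript redirect
driver.quit()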

Related

Windmill in Python hanging when using getPageText()

I'm trying to write a simple script with Windmill to open a page (which has javascript) and then download the entire html. My code is:
from windmill.authoring import setup_module, WindmillTestClient
from windmill.conf import global_settings
import sys

global_settings.START_FIREFOX = True
setup_module(sys.modules[__name__])

def my_func():
    url = "a certain url"
    client = WindmillTestClient(__name__)
    client.open(url=url)
    html = client.commands.getPageText()
This last line, with getPageText() just seems to hang. Nothing happens and it never returns.
Also, is it necessary for windmill to open up the whole GUI every time? And if it is, is there a function in python to close it when I'm done (a link to any actual documentation would be helpful; all I've found are a few examples)?
Edit: I solved the problem by just using Selenium instead; it took about 15 minutes vs. 3 hours of trying to make Windmill work.
A colleague of mine came up with an alternate solution, which was to actually watch the network traffic coming into the browser and scrape the GET requests. Not totally sure how he did it though.

Scrape sub-elements in tables which cannot be found in the HTML but only in Chrome > F12 > Elements

I have tried to scrape the scoring/event times and player names from http://en.gooooal.com/soccer/analysis/8401/events_840182.html. However, it doesn't work.
require(RCurl);
require(XML);
lnk = "http://en.gooooal.com/soccer/analysis/8401/events_840182.html";
doc = htmlTreeParse(lnk,useInternalNodes=TRUE);
x = unlist(xpathApply(doc, "//table/tr/td"));
The normal HTML page doesn't show the details of the table contents.
The nodes can only be seen via: open Chrome >>> press F12 >>> click Elements.
Can someone help? Thanks a lot.
If you reload the page while Chrome developer tools are active, you can see that real data is fetched via XHR from http://en.gooooal.com/soccer/analysis/8401/goal_840182.js?GmFEjC8MND. This URL contains event id 840182 which you can scrape from the page. The part after ? seems to be just a way to circumvent browser caching. 8401, again, seems to be just first digits of the id.
So, you can load the original page, construct the second URL, and get real data from there.
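A sketch of that approach in Python (the same two steps work from R as well): take the event id from the page URL, build the goal_<id>.js URL that the developer tools showed, and fetch it. The URL pattern comes from the observation above; everything else here is illustrative:

import re
import urllib2

event_url = 'http://en.gooooal.com/soccer/analysis/8401/events_840182.html'

# The event id (840182) is already in the page URL; it could also be
# scraped out of the page's HTML instead.
event_id = re.search(r'events_(\d+)\.html', event_url).group(1)

# Build the XHR URL seen in Chrome's developer tools. The directory part
# seems to be the first digits of the id; the ?... suffix only defeats
# caching, so it can be left off.
data_url = 'http://en.gooooal.com/soccer/analysis/%s/goal_%s.js' % (event_id[:4], event_id)

print urllib2.urlopen(data_url).read()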
Anyway... In most cases it's a morally questionable practice to scrape data from web sites. I hope you know what you're doing :)
It sounds as if the content was inserted asynchronously using javascript, so using Curl won't help you there.
You'll need a headless browser which can actually parse and execute JavaScript (if you know Ruby you could start by looking at the cucumber-selenium-chromedriver combo), or maybe just use your browser with Greasemonkey/Tampermonkey to mimic a real user browsing the site and scrape the scores that way.
The contents are probably generated (by Javascript, like from an ajax call) after loading the (HTML) page. You can check that by loading the page in Chrome after disabling Javascript.
I don't think you can instruct RCurl to execute Javascript...

javascript failing with permission denied error message

I have a classic ASP web page that used to work... but the network guys have made a lot of changes, including moving the app to a Windows 2008 server running IIS 7.5. We also upgraded to IE 9.
I'm getting a Permission denied error message when I try to click on the following link:
<a href=javascript:window.parent.ElementContent('SearchCriteria','OBJECT=321402.EV806','cmboSearchType','D',false)>
But other links like the following one work just fine:
<a href="javascript:ElementContent('SearchCriteria','OBJECT=321402.EV806', 'cmboSearchType','D',false)">
The difference is that the link that is failing is in an iframe. I noticed on other posts, it makes a difference whether or not the iframe content is coming from another domain.
In my case, it's not. But I am getting data from another server by doing the following...
set objhttp = Server.CreateObject("winhttp.winhttprequest.5.1")
objhttp.open "get", strURL
objhttp.send
and then I change the actual HTML that I get back ... add some hyperlinks etc. Then I save it to a file on my local server (saved as *.html files).
Then, when my page is loading, I look for the specific HTML file and load it into the iframe.
I know some group policy options in IE have changed, and I'm looking into those changes, but the fact that one JavaScript link works makes me wonder whether the problem lies somewhere else...???
Any suggestions would be appreciated.
Thanks.
You could try Msxml2.ServerXMLHTTP instead of WinHttp.WinHttpRequest.
See the question "Differences between Msxml2.ServerXMLHTTP and WinHttp.WinHttpRequest?" for a comparison of the two.
On this excellent site about ASP you'll find plenty of code samples on how to use Msxml2.ServerXMLHTTP, which is the more recent of the two:
http://classicasp.aspfaq.com/general/how-do-i-read-the-contents-of-a-remote-web-page.html
About the IE9 issue: connect a PC with an older IE or another browser to test whether the browser is the culprit. Also, in IE9 (or better, in Firefox with Firebug) use the development tools (F12) and watch the console for errors while the contents of the iframe load.
Your method of getting dynamic pages is not efficient, I'm afraid; ASP itself can do that, and you could use e.g. a div instead of an iframe and replace its contents with what you get from the request. I would need to see more code to give better advice.

Take Screenshot of Browser via JavaScript (or something else)

For support reasons I want to be able for a user to take a screenshot of the current browser window as easy as possible and send it over to the server.
Any (crazy) ideas?
That would appear to be a pretty big security hole in JavaScript if you could do this. Imagine a malicious user installing that code on your site with a XSS attack and then screenshotting all of your daily work. Imagine that happening with your online banking...
However, it is possible to do this sort of thing outside of JavaScript. I developed a Swing application that used screen capture code like this which did a great job of sending an email to the helpdesk with an attached screenshot whenever the user encountered a RuntimeException.
I suppose you could experiment with a signed Java applet (shock! horror! noooooo!) that hung around in the corner. If executed with the appropriate security privileges given at installation it might be coerced into executing that kind of screenshot code.
For convenience, here is the code from the site I linked to:
import java.awt.Dimension;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import javax.imageio.ImageIO;
import java.io.File;
...
public void captureScreen(String fileName) throws Exception {
    Dimension screenSize = Toolkit.getDefaultToolkit().getScreenSize();
    Rectangle screenRectangle = new Rectangle(screenSize);
    Robot robot = new Robot();
    BufferedImage image = robot.createScreenCapture(screenRectangle);
    ImageIO.write(image, "png", new File(fileName));
}
...
Please see the answer shared here for a relatively successful implementation of this:
https://stackoverflow.com/a/6678156/291640
Utilizing:
https://github.com/niklasvh/html2canvas
You could try to render the whole page in a canvas and save this image back to the server. Have fun :)
A webpage can't do this (or at least, I would be very surprised if it could, in any browser) but a Firefox extension can. See https://developer.mozilla.org/en/Drawing_Graphics_with_Canvas#Rendering_Web_Content_Into_A_Canvas -- when that page says "Chrome privileges" that means an extension can do it, but a web page can't.
Seems to me that support needs (at least) the answers for two questions:
What does the screen look like? and
Why does it look that way?
A screenshot -- a visual -- is very necessary and answers the first question, but it can't answer the second.
As a first attempt, I'd try to send the entire page up to support. The support tech could display that page in his browser (answers the first question); and could also see the current state of the customer's html (helps to answer the second question).
I'd try to send as much of the page as is available to the client JS by way of AJAX or as the payload of a form. I'd also send info not on the page: anything that affects the state of the page, like cookies or session IDs or whatever.
The cust might have a submit-like button to start the process.
I think that would work. Let's see: it needs some CGI somewhere on the server that catches the incoming user page and makes it available to support, maybe by writing a disk file. Then the support person can load (or have loaded automatically) that same page. All the other info (cookies and so on) can be put into the page that support sees.
PLUS: the client JS that handles the submit-button onclick( ) could also include any useful JS variable values!
Hey, this can work! I'm getting psyched :-)
HTH
-- pete
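To make the server half of that idea concrete, here is a minimal sketch of the endpoint that catches the incoming user page and writes it to a disk file for support to open. It uses Flask purely for illustration; the route and field names are assumptions, not part of the original suggestion:

import time
from flask import Flask, request

app = Flask(__name__)

@app.route('/support/snapshot', methods=['POST'])
def save_snapshot():
    # 'page_html' and 'page_cookies' are hypothetical form fields
    # posted by the client-side JS when the user clicks the submit button
    html = request.form.get('page_html', '')
    cookies = request.form.get('page_cookies', '')

    # Write the page out so a support tech can load exactly what the user saw
    filename = 'snapshot_%d.html' % int(time.time())
    with open(filename, 'w') as f:
        f.write(html)
        f.write('\n<!-- cookies: %s -->\n' % cookies)

    return 'OK'

if __name__ == '__main__':
    app.run()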
I've seen people do this with one of two approaches:
set up a separate server for screenshotting and run a bunch of Firefox instances on it; check out these two gems if you're doing it in Ruby: selenium-webdriver and headless (a rough Python equivalent is sketched below)
use a hosted solution like http://url2png.com (way easier)
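For what it's worth, the same screenshot-server idea is a handful of lines with Selenium's Python bindings as well. A hedged sketch; the function name and URL are placeholders:

from selenium import webdriver

def screenshot(url, out_file):
    # One Firefox instance per capture; a real screenshot server would
    # keep a pool of long-lived browser instances instead.
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        driver.save_screenshot(out_file)   # writes a PNG of the rendered page
    finally:
        driver.quit()

screenshot('http://example.com/', 'example.png')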
You can also do this with the Fireshot plugin. I use the following code (that I extracted from the API code so I don't need to include the API JS) to make a direct call to the Fireshot object:
var element = document.createElement("FireShotDataElement");
element.setAttribute("Entire", true);
element.setAttribute("Action", 1);
element.setAttribute("Key", "");
element.setAttribute("BASE64Content", "");
element.setAttribute("Data", "C:/Users/jagilber/Downloads/whatev.jpg");
if (typeof(CapturedFrameId) != "undefined")
    element.setAttribute("CapturedFrameId", CapturedFrameId);
document.documentElement.appendChild(element);
var evt = document.createEvent("Events");
evt.initEvent("capturePageEvt", true, false);
element.dispatchEvent(evt);
Note: I don't know if this functionality is only available for the paid version or not.
Perhaps http://html2canvas.hertzen.com/ could be used. Then you can capture the display and then process it.
You might try PhantomJS, a headless browsing toolkit.
http://phantomjs.org/
The following Javascript example demonstrates basic screenshot functionality:
var page = require('webpage').create();
page.settings.userAgent = 'UltimateBrowser/100';
page.viewportSize = { width: 1200, height: 1200 };
page.clipRect = { top: 0, left: 0, width: 1200, height: 1200 };
page.open('https://google.com/', function () {
    page.render('output.png');
    phantom.exit();
});
I understand this post is 5 years old, but for the sake of future visits I'll add my own solution here which I think solves the original post's question without any third-party libraries apart from jQuery.
pageClone = $('html').clone();
// Make sure that CSS and images load correctly when opening this clone
pageClone.find('head').append("<base href='" + location.href + "' />");
// OPTIONAL: Remove potentially interfering scripts so the page is totally static
pageClone.find('script').remove();
htmlString = pageClone.html();
You could remove other parts of the DOM you think are unnecessary, such as the support form if it is in a modal window. Or you could choose not to remove scripts if you prefer to maintain some interaction with dynamic controls.
Send that string to the server, either in a hidden field or by AJAX, and then on the server side just attach the whole lot as an HTML file to the support email.
The benefits of this are that you'll get not just a screenshot but the entire scrollable page in its current form, plus you can even inspect and debug the DOM.
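On the server side, attaching that string to the support email takes only a few lines with Python's standard email and smtplib modules. A rough sketch; the addresses and SMTP host are placeholders:

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def mail_page_to_support(html_string):
    msg = MIMEMultipart()
    msg['Subject'] = 'Support request: page snapshot'
    msg['From'] = 'webapp@example.com'
    msg['To'] = 'support@example.com'

    msg.attach(MIMEText('Snapshot of the page the user was viewing.', 'plain'))

    # Attach the captured DOM as an .html file
    attachment = MIMEText(html_string, 'html')
    attachment.add_header('Content-Disposition', 'attachment', filename='page.html')
    msg.attach(attachment)

    server = smtplib.SMTP('localhost')
    server.sendmail(msg['From'], [msg['To']], msg.as_string())
    server.quit()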
Print Screen? Old school and a couple of keypresses, but it works!
This may not work for you, but on IE you can use the snapsie plugin. It doesn't seem to be in development anymore, but the last release is available from the linked site.
I think you need an ActiveX control; without one I can't imagine how. You could force the user to install it first; after the installation, the client-side ActiveX control should work and you can capture.
We are temporarily collecting Ajax states, data in form fields, and session information. Then we re-render it at the support desk. Since we test and integrate for all browsers, there are hardly any support cases for display reasons.
Have a look at the red button at the bottom of holidaycheck.
Alternatively there is html2canvas. But it is only applicable to newer browsers and I've never tried it.
In JavaScript? No. I do work for a security company (sort of NetNanny type stuff) and the only effective way we've found to do screen captures of the user is with a hidden application.

ajax based webpage - good way to do it?

I'm building a website focused on loading only the data that has to be loaded.
I've built an example here and would like to know if this is a good way to build a web page.
There are some problems when building a site like that, e.g.:
bookmarking
going back and forth in history
SEO (since the content is basically not really connected)
so here is the example:
index.html
<html>
    <head>
        <title>Somebodys Website</title>
        <!-- JavaScript -->
        <script type="text/javascript" src="jquery-1.3.2.min.js"></script>
        <script type="text/javascript" src="pagecode.js"></script>
    </head>
    <body>
        <div id="navigation">
            <ul>
                <li><a id="link_Welcome" class="nav">Welcome</a></li>
                <li><a id="link_Page1" class="nav">Page1</a></li>
            </ul>
        </div>
        <div id="content">
        </div>
    </body>
</html>
pagecode.js
var http = null;

$(document).ready(function()
{
    // create XMLHttpRequest
    try {
        http = new XMLHttpRequest();
    }
    catch(e){
        try{
            http = new ActiveXObject("Msxml2.XMLHTTP");
        }
        catch(e){
            http = new ActiveXObject("Microsoft.XMLHTTP");
        }
    }
    // set navigation click events
    $('.nav').click(function(e)
    {
        loadPage(e);
    });
});

function loadPage(e){
    // e.g. "link_Welcome" becomes "Welcome.html"
    var page = e.currentTarget.id.slice(5) + ".html";
    http.open("POST", page);
    http.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
    http.setRequestHeader("Connection", "close");
    http.onreadystatechange = function(){changeContent(e);};
    http.send(null);
}

function changeContent(e){
    if(http.readyState == 4){
        // load page
        var response = http.responseText;
        $('#content')[0].innerHTML = response;
    }
}
Welcome.html
<b>Welcome</b>
<br />
To this website....
So as you can see, I'm loading the content based on the IDs of the links in the navigation section. So to make the "Page1" navigation item linkable, I would have to create a "Page1.html" file with some content in it.
Is this way of loading data for your web page very wrong? and if so, what is a better way to do it?
thanks for your time
EDIT:
this was just a very short example, and I'd like to say that for users with JavaScript disabled it is still possible to provide the whole page (additionally) in static form,
e.g.
<li><a href="Welcome.html">Welcome</a></li>
and this Welcome.html would contain all the overhead of the basic index.html file.
By doing so, the AJAX-using version of the page would be some kind of extra feature, wouldn't it?
No, it isn't a good way to do it.
Ajax is a tool best used with a light touch.
Reinventing frames using it simply recreates all the problems of frames, except that it replaces the issue of orphan pages with complete invisibility to search engines (and other user agents that don't support JS or have it disabled).
By doing so, the ajax using version of the page would be some kind of extra feature, wouldn't it?
No. Users won't notice, and you break bookmarking, linksharing, etc.
It's wrong to use AJAX (or any JavaScript, for that matter) only for the sake of using it (unless you're learning how to use AJAX, which is a different matter).
There are situations where the use of JavaScript is good (mostly when you're building a custom user interface inside your browser window) and where AJAX really shines. But loading static web pages with JavaScript is very wrong: first, you tie yourself to browsers that can run your JS; second, you increase the load on your server and on the client side.
More technical details:
The function loadPage should be rewritten using jQuery's $.post(). This is a random shot, not tested:
function loadPage(e){
    // e.g. "link_Welcome" becomes "Welcome.html"
    var page = e.currentTarget.id.slice(5) + ".html";
    $.post( page, null, function(response){
        $('#content')[0].innerHTML = response;
    } );
}
Be warned, I did not test it, and I might get this function a little wrong. But... dude, you are using jQuery already - now abuse it! :)
When considering implementing an AJAX pattern on a website you should first ask yourself the question: why? There are several good reasons to implement AJAX but also several bad reasons depending on what you're trying to achieve.
For example, if your website is like Facebook, where you want to offer end-users with a rich user interface where you can immediately see responses from friends in chat, notifications when users post something to your wall or tag you in a photo, without having to refresh the entire page, AJAX is GREAT!
However, if you are creating a website where the main content area changes for each of the top-level menu items using AJAX, this is a bad idea for several reasons.
First, and what I consider to be very important, SEO (Search Engine Optimization) is NOT optimized: search engine crawlers do not follow AJAX requests unless they are loaded via the onclick event of an anchor tag. Ultimately, in this example, you are not getting the value out of the rich experience, and you are losing a lot of potential viewers.
Second, users will have trouble bookmarking pages unless you implement a smart way to parse URLs to map to AJAX calls.
Third, users will have problems properly navigating using the back and forward buttons if you have not implemented a custom client-side mechanism to manage history.
Lastly, each browser interprets JavaScript differently, and the more JavaScript you write, the more potential there is for losing cross-browser compatibility unless you implement a framework such as jQuery, Dojo, EXT, or MooTools that handles most of that for you.
gabtub, you are not wrong: you can build AJAX-intensive web sites that are SEO compatible, support bookmarking and the Back/Forward buttons (history navigation in general), keep working with JavaScript disabled (avoiding site duplication), and remain accessible...
There is one catch: you must go back to a server-centric approach.
You can find some "howtos" here.
And take a look at ItsNat.
What about unobtrusiveness (or whatever it should be called)?
If the user has no JavaScript for some reason, they'll only see a list with Welcome and Page1.
Yes it's wrong. What about users without JavaScript? Why not do this sort of work server-side? Why pay the cost of multiple HTTP requests instead of including the files server-side so they can be downloaded in a single fetch? Why pay the cost of non-JavaScript enabled clients not being able to view your stuff (Google's spider being an important user who'll be alienated by this approach)? Why? Why?
