I am trying to parse a sequence of HTML pages using Python, but I am having trouble grabbing the pages iteratively. The web page is linked below.
Milano Library
After looking through the source, I found the function that handles the click event on the button element for the next page.
function SaltaAPagina() {
var CalcPag = VAIAPAGINA.value;
if (CalcPag > 0) {
CalcPag=CalcPag;
}
else {
CalcPag="1";
}
document.location = "/OPACMI01/cat/SDW?W=CODICE_BIBLIO+%3D+%27LO1+01%27+AND+EDITORE+PH+WORDS+%27sonzogno%27+AND+DATA_PUBBLICAZIONE+%3C+1943+ORDER+BY+ORDINAMENTO/Ascend&M=" + CalcPag + "&R=Y";
}
I know that I can encode parameters with the urlencode function in Python's urllib module, but I am not sure what I should be including as a parameter.
lomba_link='http://www.biblioteche.regione.lombardia.it/OPACMI01/cat/SDW?W%3DCODICE_BIBLIO+%3D+%27LO1+01%27+AND+EDITORE+PH+WORDS+%27sonzogno%27+AND+DATA_PUBBLICAZIONE+%3C+1943+ORDER+BY+ORDINAMENTO/Ascend%26M%3D1%26R%3DY'
params = urllib.urlencode([('CalcPag',4)])
# this has not worked.
req = urllib2.Request(lomba_link)
print req
response = urllib2.urlopen(req,params)
html_doc = response.read()
What am I missing here?
Thanks
The JavaScript function you posted passes several parameters to the target page:
document.location = "/OPACMI01/cat/SDW" + // This is the path of the page
"?W=CODICE_BIBLIO+%3D+%27LO1+01%27+AND+EDITORE+PH+WORDS+%27sonzogno%27+AND+DATA_PUBBLICAZIONE+%3C+1943+ORDER+BY+ORDINAMENTO/Ascend" + // The first parameter
"&M=" + CalcPag + // The second parameter
"&R=Y"; // The third parameter
In your code, you've percent-encoded all of the & and = separators in the URL, so you're passing a single, long parameter with no value. Changing those symbols back to what they were in the JavaScript function should do the trick.
lomba_link='http://www.biblioteche.regione.lombardia.it/OPACMI01/cat/SDW'
# Pass the decoded value of W: urlencode() does the percent-encoding itself,
# so an already-encoded string would be encoded a second time.
params = urllib.urlencode([
    ('W', "CODICE_BIBLIO = 'LO1 01' AND EDITORE PH WORDS 'sonzogno' AND DATA_PUBBLICAZIONE < 1943 ORDER BY ORDINAMENTO/Ascend"),
    ('M', 4),
    ('R', 'Y')
])
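For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.parse and urllib.request. This is a minimal sketch, not the original poster's code: the names BASE_URL, QUERY, build_page_url and fetch_page are made up for illustration, and QUERY is the decoded form of the W value from the JavaScript function.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Path of the page, without any query string.
BASE_URL = "http://www.biblioteche.regione.lombardia.it/OPACMI01/cat/SDW"

# Decoded form of the W parameter; urlencode() percent-encodes it for us.
QUERY = ("CODICE_BIBLIO = 'LO1 01' AND EDITORE PH WORDS 'sonzogno' "
         "AND DATA_PUBBLICAZIONE < 1943 ORDER BY ORDINAMENTO/Ascend")


def build_page_url(page):
    """Return the GET URL for one result page (M is the page number)."""
    params = urlencode([("W", QUERY), ("M", page), ("R", "Y")])
    return BASE_URL + "?" + params


def fetch_page(page):
    """Download one result page and return its raw bytes."""
    with urlopen(build_page_url(page)) as response:
        return response.read()


# Usage (performs network requests, so it is left commented out):
# for page in range(1, 5):
#     html_doc = fetch_page(page)
```

Note that urlencode() also encodes the slash in ORDINAMENTO/Ascend as %2F, which decodes to the same value on the server side.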
It's much easier to work with the brilliant requests library, rather than the urllib2 library...
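The second argument to urllib2.urlopen is the request body, and supplying it turns the request into a POST. For a GET request you need to append the query string to the URL yourself.
e.g., with lomba_link holding the URL without its query string:
response = urllib2.urlopen(lomba_link + '?' + params)
With requests, this would be much simpler (w_query here stands for the decoded W value, and M is the page number):
page = requests.get(some_url, params={'W': w_query, 'M': 4, 'R': 'Y'})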
I'm doing a little bit of reverse engineering on the Rapportive API in Gmail.
I make this request
import requests
url ='https://api.linkedin.com/uas/js/xdrpc.html'
r = requests.get(url)
print r.text
The response is an otherwise empty HTML file that has a lot of JavaScript in it. On line 3661, it sets the request header for the subsequent call to Rapportive:
ak.setRequestHeader("oauth_token", ae);
Is there a way I can request that page and then return ae?
I think you can try the following:
Get the page as you already do;
Remove all non-JavaScript elements from the response page;
Prepend a script (described below) to the page's JavaScript to override some code;
Execute it with eval('<code>');
Check whether the token has been set correctly.
I'm proposing the following code to override the XMLHttpRequest.setRequestHeader functionality to be able to get the token:
// this will keep the token
var headerToken;
// create a backup method
XMLHttpRequest.prototype.setRequestHeaderBkp =
XMLHttpRequest.prototype.setRequestHeader;
// override the "setRequestHeader" method
XMLHttpRequest.prototype.setRequestHeader = function(key, val)
{
if ('oauth_token' === key)
headerToken = val;
this.setRequestHeaderBkp(key, val);
}
If you are just interested in retrieving the token, can't you just do a regex match?
var str = '<script>var a = 1;...ak.setRequestHeader("oauth_token", ae);...</script>';
var token = str.match(/setRequestHeader\("oauth_token",\s*([^)]+)/)[1];
Although this assumes ae is the actual string value. If it's a variable this approach wouldn't work as easily.
Edit: If it's a variable, you could do something like this before running the JavaScript returned from the page:
str = str.replace(/\w+\.setRequestHeader\([^,]+,\s*([^)]+)\s*\);/, 'oauthToken = $1;');
The global oauthToken (notice the missing 'var') will then contain the value of the token, assuming the code is evaluated in the same scope as the caller.
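Since the page in the question is fetched from Python with requests, the regex match can also be sketched on the Python side. This is only an illustration: TOKEN_ARG and find_token_argument are hypothetical names, and, as noted above, the match returns whatever source text sits in the second argument, which is only useful if that text is the literal token rather than a variable name.

```python
import re

# Capture the second argument of setRequestHeader("oauth_token", ...).
TOKEN_ARG = re.compile(
    r'setRequestHeader\(\s*"oauth_token"\s*,\s*([^)]+?)\s*\)'
)


def find_token_argument(page_source):
    """Return the source text of the token argument, or None if absent."""
    match = TOKEN_ARG.search(page_source)
    return match.group(1) if match else None


sample = '<script>var a = 1; ak.setRequestHeader("oauth_token", ae); </script>'
print(find_token_argument(sample))  # prints: ae
```

In practice, page_source would be r.text from the requests call in the question.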
This question is simple to understand; I will explain it step by step to make everything clear.
I am using the Google Feed API to load an RSS file into my JavaScript application.
I have a setting to bypass the Google cache, if needed, and I do this by appending a random number at the end of the RSS file link that I send to the Google Feed API.
For example, let's say this is a link to an RSS:
http://example.com/feed.xml
To bypass the cache, I append a random number at the end as a parameter:
http://example.com/feed.xml?0.12345
The whole url to the Google Feed API would look like this, where "q" is the above link:
https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=5&q=http://example.com/feed.xml?0.12345
This bypasses the cache and works well in most cases but there is a problem when the RSS link that I want to use already has parameters. For example, like this:
http://example.com/feed?type=rss
Appending the number at the end like before would give an error and the RSS file would not be returned:
http://example.com/feed?type=rss?0.12345 // ERROR
I have tried using "&" to attach the random number, like so:
http://example.com/feed?type=rss&0.12345
This no longer gives an error and the RSS file is correctly returned. But if I use the above in the Google Feed API url, it no longer bypasses the cache:
https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=5&q=http://example.com/feed.xml&0.1234
This is because "0.1234" is considered a parameter of the whole URL and not a parameter of the "q" URL. Therefore "q" remains only "http://example.com/feed.xml", which is not unique, so the cached version is loaded.
Is there a way to make the number parameter be a part of the "q" url and not a part of the whole url?
You need to use encodeURIComponent like this:
var url = 'http://example.com/feed.xml&0.1234';
document.getElementById('results').innerHTML = 'https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=5&q=' + encodeURIComponent(url);
<pre id="results"></pre>
This escapes the special characters (such as ?, & and =) that would otherwise be treated as part of the outer URL.
To append or create a queryString:
var url = 'http://example.com/feed.xml';
var randomParameter = '0.1234';
var hasQueryString = url.indexOf('?') > -1;
if (hasQueryString) {
    url = url + '&' + randomParameter;
} else {
    url = url + '?' + randomParameter;
}
// url still needs to be escaped with encodeURIComponent before it is used as the "q" value
You need to use encodeURIComponent to do this.
encodeURIComponent('http://example.com/feed.xml&0.1234')
will result in
http%3A%2F%2Fexample.com%2Ffeed.xml%260.1234
and when appended to the end result you'll get
https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=5&q=http%3A%2F%2Fexample.com%2Ffeed.xml%260.1234
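To see why the encoding matters, it can help to parse both variants of the final URL the way a server would. The sketch below uses Python's urllib.parse purely as an illustration; quote(..., safe="") stands in for encodeURIComponent (the two agree for this URL, although they differ on a few characters such as ! ' ( ) *), and the variable names are made up.

```python
from urllib.parse import quote, urlsplit, parse_qs

API = "https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=5&q="
feed_url = "http://example.com/feed?type=rss&0.12345"

# Without encoding, the "&" ends the q parameter early.
unencoded = API + feed_url
# With encoding, the whole feed URL travels inside q.
encoded = API + quote(feed_url, safe="")

q_unencoded = parse_qs(urlsplit(unencoded).query)["q"][0]
q_encoded = parse_qs(urlsplit(encoded).query)["q"][0]

print(q_unencoded)  # http://example.com/feed?type=rss
print(q_encoded)    # http://example.com/feed?type=rss&0.12345
```

Only the encoded variant delivers the random number as part of "q", which is what makes the feed URL unique.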
I have a page with a select box, which fires an onChange event. In this JavaScript snippet, I would like to reload the current page, including the GET and POST parameters that were sent during the request. AFAIK, this can be achieved by using window.location.reload(), or window.location.href = window.location.href when sending POST data is not required.
However, I need to append an additional value (actually, the value of the select element) to the previously sent parameters. I do not care whether the data is sent using POST or GET. Is there a way to achieve the desired behavior?
To accomplish this you are going to have to rebuild the request from scratch. For GET requests, the arguments are easily accessible in the query string, but POST requests are a little trickier. You will need to stash all of that data in hidden input elements (or something similar) so that you can access it.
Then you can try something like this:
var queryString = window.location.search; //args from a previous get
var postArgs = $("#myPostArgsForm").serialize(); //args from a previous post... these need to be saved and added to the html by the server
//your additional data... this part you probably need to adapt
//to fit your specific needs. this is an example
var myNewArgName = encodeURIComponent("newArg");
var myNewArgVal = encodeURIComponent("Hello!");
var myNewArgString = myNewArgName + "=" + myNewArgVal;
//if there is no queryString, begin with ?
if(!queryString) {
queryString = "?"
}
//if there is, then we need an & before the next args
else {
myNewArgString = "&" + myNewArgString;
}
//add your new data
queryString += myNewArgString;
//add anything from a previous post
if(postArgs) {
queryString += "&" + postArgs;
}
window.location.href = window.location.pathname + queryString;
<form id="myPostArgsForm">
<input type="hidden" name="prevQuery" value="whats up?" />
</form>
Pretty simple really; have onChange fire a function that uses getElementById to figure out the selector value and then just use window.location to send the browser to the literal: http://yourdomain.com/yourpage.html?selectval=123
Then, in the body's onload handler, fire another JS function that checks the GET variable, like:
function GetSelector() {
    // getUrlVars() is assumed to be a helper that parses window.location.search into an object
    var TheSelectorWas = getUrlVars()["selectval"];
    alert(TheSelectorWas);
}
and do whatever you need to do in that function (document.write calls, etc.). BTW, posting the actual code you're using is always a good idea.
-Arne
I am used to sending AJAX requests with jQuery. I now find myself having to send them using 'vanilla' JS. Using my limited knowledge, I managed to get everything working except for passing the data along with the request. The two variables that are supposed to be passed along always end up as NULL in my database. Every example I have been able to find on here shows the jQuery way, which I have no problem doing.
Can anyone tell me what I am doing wrong? I assume it has something to do with the format of the data, but I cannot for the life of me figure it out.
Here is the code for the request. The request object is built in the createXMLHttp() function.
var xmlHttp = createXMLHttp();
var data = {Referrer: document.referrer, Path: window.location.pathname};
xmlHttp.open('post', '/PagePilotSiteHits.php', true);
xmlHttp.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
xmlHttp.send(data);
var data = {Referrer: document.referrer, Path: window.location.pathname};
function buildQueryString(data)
{
var dataKeys = Object.keys(data);
var queryString = "";
for (var i = 0; i < dataKeys.length; ++i)
{
queryString += "&" + encodeURIComponent(dataKeys[i]) + "=" + encodeURIComponent(data[dataKeys[i]]);
}
return queryString.substr(1);
}
xmlHttp.send(buildQueryString(data));
This should do it. The data needs to be passed as a query string. This function creates a query string from the data object you've provided and encodes the URI components, as mentioned by @AlexV and @Quentin.
I have an action that takes 2 strings. One of the strings is a big, ugly json string. I suspect that the action will not allow the special characters to be passed because I keep getting a 400 - Bad Request.
Can a serialized json object be passed to an action?
public ActionResult SaveState(string file, string state)
{
string filePath = GetDpFilePath(file);
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.Load(filePath);
HtmlNode stateScriptNode =
htmlDocument.DocumentNode.SelectSingleNode("/html/head/script[@id ='applicationState']");
stateScriptNode.InnerHtml = "var applicationStateJSON =" + state;
htmlDocument.Save(filePath);
return null;
}
ClientScript
'e' is a large json string
$.post('/State/SaveState/' + fileName+'/' + '/' + e + '/');
I am now encoding the text using encodeURIComponent(), but it makes no difference. I don't think that MVC actions allow me to send these special characters by default. Is that true? How do you work around this?
$.post('/State/SaveState/' + encodeURIComponent(fileName) + '/' + '/' + encodeURIComponent(e) + '/');
Sample request:
Request URL:http://localhost:51825/State/SaveState/aa6282.html//%7B%22uid%22%3A%22testUser%22%2C%22a
You need to encode it when the request is made:
$.post('/State/SaveState/' + encodeURIComponent(fileName) + '/' + encodeURIComponent(e));
Yes, a serialized JSON object can be passed to an action method. MVC3 makes this even easier with built-in JSON binding. I use the json2 library to serialize the objects. See this post for more details. It works really well.
http://haacked.com/archive/2010/04/15/sending-json-to-an-asp-net-mvc-action-method-argument.aspx
Because I am sending this data to the server and the string I am sending is large, I really should be sending the data in the POST body.
It seems that there is also a limit on the amount of data that you can send via the query string. I cannot be certain that this was the source of the error message, but it certainly would make sense. In any case, the following post works correctly:
$.post('/State/SaveState/', { file: fileName, state: e });
You probably need to URL-encode e before you add it to the URL. Also, you have an extra / that you don't need.