Phantomjs/Casperjs get url from JS script inside page - javascript

I'm building a scraper with phantom/casper.
At this point, I need to extract a URL that appears in the page only inside a js script.
Example of the page source code :
<script>
queueRequest('URL.aspx?var1='+VAR1+'&var2='+VAR2, getPageMenu');
</script>
I have no problem evaluating VAR1 and VAR2, as they are in the page context, but I need URL, which is hardcoded and has no reference to it. URL is of course different according to the page I'm on and I have no way of guessing it. Any ideas?
My ideas:
1. As the URL is called on page load to fill a div with AJAX, I was thinking of capturing the XHR request, but I don't know how.
2. I managed to get the script element I need using document.getElementsByTagName('script'). That may be one way to go, but how do I get only the line I need out of 200+ lines (the one starting with queueRequest)?
So, to make my question clear:
Which idea is better, 1 or 2?
If 1: how do I capture the request URL with Casper?
If 2: how do I get the right line in my script?

If you want to search your script blocks, you can try something like this:
var found = null;
var scripts = document.getElementsByTagName('script');
for (var i = 0; i < scripts.length; i++)
{
// textContent holds the inline source; [^'?]+ stops at the first "?" or quote
var matches = /queueRequest\('([^'?]+)\?/.exec(scripts[i].textContent);
if (matches)
{
found = matches[1];
break;
}
}
alert(found);
There might be tighter ways to implement the same thing but the regex is roughly what you're after. Note that this will only get you the URL part of the first appearance of queueRequest('something.something?...) in embedded script blocks.
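The same search can be sketched as a small standalone function that works on any script source string. The name extractQueueRequestUrl and the sample text are hypothetical, made up for illustration; inside a CasperJS script the function body would have to live inside the function passed to casper.evaluate, since that code runs in the page context.

```javascript
// Hypothetical helper: pull the path passed to queueRequest out of script source.
function extractQueueRequestUrl(scriptText) {
  // [^'?]+ stops at the first "?" or quote, so a later "?" on the same line
  // cannot stretch the match.
  var match = /queueRequest\(\s*'([^'?]+)\?/.exec(scriptText);
  return match ? match[1] : null;
}

// Sample source modelled on the snippet in the question.
var sample = "var x = 1;\n" +
  "queueRequest('URL.aspx?var1='+VAR1+'&var2='+VAR2, getPageMenu');\n";

console.log(extractQueueRequestUrl(sample)); // URL.aspx
console.log(extractQueueRequestUrl('var y = 2;')); // null
```

Returning null when nothing matches lets the caller distinguish "no queueRequest call on this page" from an empty path.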

Related

Get script content [duplicate]

If I have a script tag like this:
<script
id = "myscript"
src = "http://www.example.com/script.js"
type = "text/javascript">
</script>
I would like to get the content of the "script.js" file. I'm thinking about something like document.getElementById("myscript").text but it doesn't work in this case.
tl;dr script tags are not subject to CORS and same-origin-policy and therefore javascript/DOM cannot offer access to the text content of the resource loaded via a <script> tag, or it would break same-origin-policy.
long version:
Most of the other answers (and the accepted answer) correctly indicate that the "correct" way to get the text content of a JavaScript file loaded into the page via a <script> tag is to use an XMLHttpRequest to perform a separate, additional request for the resource named in the script's src property, which the short JavaScript code below demonstrates. However, I found that the other answers did not address why you have to fetch the file's text content this way: allowing access to the content of a file included via <script src=[url]></script> would break the CORS policies. Modern browsers prevent XHR of resources that do not provide the Access-Control-Allow-Origin header, so browsers do not allow any way of getting the content other than those subject to CORS.
With the following code (the "use XHR/AJAX" approach mentioned in the other answers) it is possible to issue another request for every non-inline script tag in the document.
function printScriptTextContent(script)
{
var xhr = new XMLHttpRequest();
xhr.open("GET", script.src);
xhr.onreadystatechange = function () {
if(xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {
console.log("the script text content is",xhr.responseText);
}
};
xhr.send();
}
Array.prototype.slice.call(document.querySelectorAll("script[src]")).forEach(printScriptTextContent);
The other answers already cover this approach, so I will not repeat it further; what this answer adds is the reason why it is necessary.
Do you want to get the contents of the file http://www.example.com/script.js? If so, you could turn to AJAX methods to fetch its content, assuming it resides on the same server as the page itself.
Update: HTML Imports are now deprecated (alternatives).
---
I know it's a little late, but some browsers support the link tag's rel="import" attribute.
http://www.html5rocks.com/en/tutorials/webcomponents/imports/
<link rel="import" href="/path/to/imports/stuff.html">
For the rest, ajax is still the preferred way.
I don't think the contents will be available via the DOM. You could get the value of the src attribute and use AJAX to request the file from the server.
Yes, Ajax is the way to do it, as in the accepted answer. If you get down to the details, there are many pitfalls. If you use jQuery.load(...), the wrong content type is assumed (HTML instead of application/javascript), which can mess things up by putting unwanted <br> into your (scriptNode).innerText, and things like that. Then, if you use jQuery.getScript(...), the downloaded script is immediately executed, which might not be what you want (it might screw up the order in which you want to load the files, in case you have several of those).
I found it best to use jQuery.ajax with dataType: "text"
I used this Ajax technique in a project with a frameset, where the frameset and/or several frames need the same JavaScript, in order to avoid having the server send that JavaScript multiple times.
Here is code, tested and working:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<script id="scriptData">
var scriptData = [
{ name: "foo" , url: "path/to/foo" },
{ name: "bar" , url: "path/to/bar" }
];
</script>
<script id="scriptLoader">
var LOADER = {
loadedCount: 0,
toBeLoadedCount: 0,
load_jQuery: function (){
var jqNode = document.createElement("script");
jqNode.setAttribute("src", "/path/to/jquery");
jqNode.setAttribute("onload", "LOADER.loadScripts();");
jqNode.setAttribute("id", "jquery");
document.head.appendChild(jqNode);
},
loadScripts: function (){
var scriptDataLookup = this.scriptDataLookup = {};
var scriptNodes = this.scriptNodes = {};
var scriptNodesArr = this.scriptNodesArr = [];
for (var j=0; j<scriptData.length; j++){
var theEntry = scriptData[j];
scriptDataLookup[theEntry.name] = theEntry;
}
//console.log(JSON.stringify(scriptDataLookup, null, 4));
for (var i=0; i<scriptData.length; i++){
var entry = scriptData[i];
var name = entry.name;
var theURL = entry.url;
this.toBeLoadedCount++;
var node = document.createElement("script");
node.setAttribute("id", name);
scriptNodes[name] = node;
scriptNodesArr.push(node);
jQuery.ajax({
method : "GET",
url : theURL,
dataType : "text"
}).done(this.makeHandler(name, node)).fail(this.makeFailHandler(name, node));
}
},
makeFailHandler: function(name, node){
var THIS = this;
return function(xhr, errorName, errorMessage){
console.log(name, "FAIL");
console.log(xhr);
console.log(errorName);
console.log(errorMessage);
debugger;
}
},
makeHandler: function(name, node){
var THIS = this;
return function (fileContents, status, xhr){
THIS.loadedCount++;
//console.log("loaded", name, "content length", fileContents.length, "status", status);
//console.log("loaded:", THIS.loadedCount, "/", THIS.toBeLoadedCount);
THIS.scriptDataLookup[name].fileContents = fileContents;
if (THIS.loadedCount >= THIS.toBeLoadedCount){
THIS.allScriptsLoaded();
}
}
},
allScriptsLoaded: function(){
for (var i=0; i<this.scriptNodesArr.length; i++){
var scriptNode = this.scriptNodesArr[i];
var name = scriptNode.id;
var data = this.scriptDataLookup[name];
var fileContents = data.fileContents;
var textNode = document.createTextNode(fileContents);
scriptNode.appendChild(textNode);
document.head.appendChild(scriptNode); // execution is here
//console.log(scriptNode);
}
// call code to make the frames here
}
};
</script>
</head>
<frameset rows="200,*" onload="LOADER.load_jQuery();">
<frame src="about:blank"></frame>
<frame src="about:blank"></frame>
</frameset>
</html>
related question
.text does get you the contents of the tag; it's just that you have nothing between your opening tag and your closing tag. You can get the src attribute of the element using .src, and if you want the JavaScript file itself, you would follow that link and make an Ajax request for it.
In a comment to my previous answer:
I want to store the content of the script so that I can cache it and use it directly some time later without having to fetch it from the external web server (not on the same server as the page)
In that case you're better off using a server-side script to fetch and cache the script file. Depending on your server setup you could just wget the file (periodically via cron if you expect it to change) or do something similar with a small script in the language of your choice.
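As a rough illustration of such a "small script", here is a minimal Node.js sketch that keeps a local copy of a remote script and only downloads it again once the copy is older than a given age. The names cacheScript and downloadText, the cache path and the age limit are all assumptions made for this example; downloadText relies on the global fetch available in Node 18 and later.

```javascript
// Hypothetical server-side helper (Node.js): keep a local copy of a remote script.
const fs = require('fs');

async function cacheScript(url, cachePath, maxAgeMs, fetchText) {
  // Reuse the cached copy while it is younger than maxAgeMs.
  if (fs.existsSync(cachePath)) {
    const ageMs = Date.now() - fs.statSync(cachePath).mtimeMs;
    if (ageMs < maxAgeMs) {
      return fs.readFileSync(cachePath, 'utf8');
    }
  }
  // Otherwise download it again and overwrite the cache file.
  const text = await fetchText(url);
  fs.writeFileSync(cachePath, text, 'utf8');
  return text;
}

// Default downloader, assuming Node 18+ where fetch is a global.
function downloadText(url) {
  return fetch(url).then(function (res) {
    if (!res.ok) throw new Error('HTTP ' + res.status);
    return res.text();
  });
}
```

A call could look like cacheScript('http://www.example.com/script.js', '/tmp/script.js', 3600000, downloadText). Passing the downloader in as a parameter keeps the caching logic independent of how the file is fetched.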
If you want the contents of the file the src attribute points to, you have to make an Ajax request and look at the responseText. If you were to have the JS between the opening and closing tags, you could access it through innerHTML.
This might be of interest: http://ejohn.org/blog/degrading-script-tags/
I had the same issue, so I solved it this way:
The js file contains something like
window.someVarForReturn = `content for return`
On html
<script src="file.js"></script>
<script>console.log(someVarForReturn)</script>
In my case the content was an HTML template, so I did something like this:
On js file
window.someVarForReturn = `<div>My template</div>`
On html
<script src="file.js"></script>
<script>
new DOMParser().parseFromString(someVarForReturn, 'text/html').body.children[0]
</script>
You cannot directly read what the browser loaded as the content of a specific script tag (that would be a security hazard).
But you can request the same resource (src) again (which will usually succeed immediately thanks to the cache) and read its text:
const scriptSrc = document.querySelector('script#yours').src;
// re-request the same location
const scriptContent = await fetch(scriptSrc).then((res) => res.text());
If you're looking to access the attributes of the <script> tag rather than the contents of script.js, then XPath may well be what you're after.
It will allow you to get each of the script attributes.
If it's the example.js file contents you're after, then you can fire off an AJAX request to fetch it.
It's funny, but we can't; we have to fetch them again over the network.
The browser will likely serve them from its cache, though it may still send a conditional request to revalidate the cached copy.
// Skip inline scripts: their src is empty
[...document.scripts].filter((script) => script.src).forEach((script) => {
fetch(script.src)
.then((response) => response.text())
.then((source) => console.log(source));
});
Using 2008-style DOM-binding it would rather be:
document.getElementById('myscript').getAttribute("src");
document.getElementById('myscript').getAttribute("type");
You want to use the innerHTML property to get the contents of the script tag:
document.getElementById("myscript").innerHTML
But as @olle said in another answer you probably want to have a read of:
http://ejohn.org/blog/degrading-script-tags/
If a src attribute is provided, user agents are required to ignore the content of the element. If you need to access it from the external script, then you are probably doing something wrong.
Update: I see you've added a comment to the effect that you want to cache the script and use it later. To what end? Assuming your HTTP is cache friendly, then your caching needs are likely taken care of by the browser already.
I'd suggest the answer to this question is using the "innerHTML" property of the DOM element. Certainly, if the script has loaded, you do not need to make an Ajax call to get it.
So Sugendran should be correct (not sure why he was voted down without explanation).
var scriptContent = document.getElementById("myscript").innerHTML;
The innerHTML property of the script element should give you the script's content as a string, provided the script element is:
an inline script, or
that the script has loaded (if using the src attribute)
olle also gives the answer, but I think it got 'muddled' by his suggesting it needs to be loaded through Ajax first, and I think he meant "inline" instead of "between":
if you where to have the js between and you could access it through innerHTML.
Regarding the usefulness of this technique:
I've looked to use this technique for client side error logging (of javascript exceptions) after getting "undefined variables" which aren't contained within my own scripts (such as badly injected scripts from toolbars or extensions) - so I don't think it's such a way out idea.
Not sure why you would need to do this?
Another way round would be to hold the script in a hidden element somewhere and use eval to run it. You could then query the object's innerHTML property.

Cache static HTML pages with get variables

I have a website with a lot of iframes like this:
<iframe src="example.com/page.html?var=blabla&id=42" scrolling="no"></iframe>
I have to change var=blabla&id=42 for each iframe. These parameters are used in the JavaScript of the iframe. Is there any way to cache page.html (which is static) once for all variable values, for example by giving hints to the browser?
I have to use an iframe since I want to be able to update this code ( from another server) & to run it in another scope.
No. Anything that changes the query string represents a separate resource to the browser.
However, you may be able to achieve that effect if you can make some slight changes to page.html. If you write it this way:
<iframe src="example.com/page.html#var=blabla&id=42" scrolling="no"></iframe>
Note the use of the # character - that's the key there.
The URL the browser requests becomes simply "page.html", and it will be cached that way. However, the JavaScript of that page will have access to document.location.hash, which will contain "#var=blabla&id=42" (including the leading #). It's a single string, but it shouldn't be difficult to parse. Some libraries even use the fragment to pass parameters in semi-real-time to iframes for IE6 compatibility.
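Parsing that string could look roughly like the sketch below. The function name parseHashParams is hypothetical; inside the iframe it would be called with document.location.hash.

```javascript
// Hypothetical helper: turn "#var=blabla&id=42" into { var: "blabla", id: "42" }.
function parseHashParams(hash) {
  var params = {};
  // Drop the leading "#" if present.
  var text = hash.charAt(0) === '#' ? hash.substring(1) : hash;
  if (text === '') {
    return params;
  }
  var pairs = text.split('&');
  for (var i = 0; i < pairs.length; i++) {
    // Split on the first "=" only, so values may themselves contain "=".
    var eq = pairs[i].indexOf('=');
    var key = eq === -1 ? pairs[i] : pairs[i].substring(0, eq);
    var value = eq === -1 ? '' : pairs[i].substring(eq + 1);
    params[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return params;
}

var params = parseHashParams('#var=blabla&id=42');
console.log(params['var'], params.id); // blabla 42
```

All values come back as strings, so a numeric id still needs parseInt or Number before arithmetic.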
If it's only used in the JavaScript but is really only one page server side, don't use ?, use # instead. The browser will consider it the same page at different anchor points. So if test.com/#foo is cached, then test.com/#bar is too (same page, different anchor points).
You can update the frame URLs from code:
var fr = document.getElementsByTagName('iframe');
var sites = "1.com,2.com".split(",");
// Stop at whichever list is shorter so no frame ends up at "http://undefined"
for (var x = 0; x < fr.length && x < sites.length; x++) {
fr[x].src = "http://" + sites[x];
}

javascript replace string in js file

OK, so here is another issue I've got.
I have a JS file that I need to include on my page (I don't have access to edit this JS file).
Inside that JavaScript there is a function which has a line where I need to edit a variable.
Let's assume:
Code:
var links = [{"offer_id":"1","title":"Text!","url":"http:\/\/www.site.com\/tracker\/1\/?http:\/\/site.com\/click.php?aff=9917&camp=5626&crt=13346&sid=e6a00014f247fe39de1b_1","footer_text":"text","blank_referrer":"1","active":"1","value":"0.40","creation_time":"1327785202"}];
Notice: '&sid=e6a00014f247fe39de1b_1'
I need to add something right after sid=
so that it becomes, for example:
Code:
&sid=AlssfIT_e6a00014f247fe39de1b_1
I added: AlssfIT_
Any ideas how to achieve this?
I tried something like
Code:
str.replace("&sid=","&sid="+kwd);
right after I "include" the JS file, but apparently it is not working.
I think you're going about it the wrong way. If notice is a variable in the global space you can just replace it normally.
window.someObject.notice = window.someObject.notice.replace("&sid=","&sid="+kwd);
This will of course only work if notice is a variable that is reachable from the global namespace and is not inside a closure. It is inside a closure if it is declared with var inside a function() {...}.
But, assuming that there is global access to that variable, that will be your easiest way to achieve this.
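Applied to the data in the question, and assuming the links array from the included file is reachable as a global (the question does not confirm this), the in-place replacement could be sketched like this. The helper name addSidPrefix and the shortened sample URL are made up for illustration.

```javascript
// Hypothetical helper: prefix the sid value in every link's url with kwd.
function addSidPrefix(links, kwd) {
  for (var i = 0; i < links.length; i++) {
    // A replacer function avoids "$" sequences in kwd being treated as
    // special replacement patterns. Only the first "&sid=" is changed.
    links[i].url = links[i].url.replace('&sid=', function () {
      return '&sid=' + kwd;
    });
  }
  return links;
}

// Shortened stand-in for the array defined in the included file.
var links = [{ offer_id: '1', url: 'http://site.com/click.php?aff=9917&sid=e6a0_1' }];
addSidPrefix(links, 'AlssfIT_');
console.log(links[0].url); // http://site.com/click.php?aff=9917&sid=AlssfIT_e6a0_1
```

Note that String.prototype.replace returns a new string, so the result has to be assigned back, which is what the attempt in the question was missing.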
If not, you can try grabbing the contents of the script and executing it, hopefully overwriting the original code. This will only work if your script and the script you are fetching are from the same origin (domain, subdomain, port, protocol, a few other things). It is impossible otherwise due to the Same Origin Policy.
Assuming you are at the same origin, you could do something like this (using jquery for simplicity)
( function() {
// First we need the URL of the script; we can grab it from the element directly or hard-code it
var scriptLocation = $('script#the-id-of-the-script-element').prop('src');
// Fetch the script again as plain text; dataType "text" stops jQuery from
// evaluating the response itself. This will usually not trigger another
// HTTP request since your browser has it cached.
$.ajax({ url: scriptLocation, dataType: 'text' }).done(function(text) {
var newText = text.replace("&sid=","&sid="+kwd);
eval(newText);
});
} )();
Something like this could work.
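Try a regex (not tested):
// Use a regex literal; slashes inside a RegExp string would become part of the pattern
var myregexp = /&sid=/g;
// replace() returns a new string, so assign the result back
str = str.replace(myregexp, "&sid=" + kwd);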

How to get the currently loading script name and the url variable of the script?

I am loading a script using the script embed tag and I want to get the url variables and the loading script name inside the script. Is it possible? For Example,
<script src="test.js?id=12"></script>
Inside test.js is there any way to get the url variable id?
Any help would be appreciated.
Thanks,
Karthik
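Aside from the answers in the linked post, FWIW you can use document.currentScript.src, which returns the full URL, including the query string. When this was written only Firefox 4 supported it; current browsers do as well, with the caveat that it is only set while the script is executing synchronously (not in callbacks or event handlers).
Thanks for all your efforts. I got it working by assigning an id attribute to the script tag and accessing it via jQuery: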
<script src="test.js?id=12" id="myScript"></script>
var currentScript = $("#myScript").attr('src'); //This will give me my script src
Thanks,
Karthik
If you want to get a variable from the current URL you can use this:
function queryParser(url){
this.get=function(p){return this.q[p];};
this.q={};
this.map=function(url){
// Use the part after "?" of the given URL, or fall back to the page's own query string
var search=url ? url.substring(url.indexOf('?')+1) : window.location.search.substring(1);
var pairs=search.split('&');
var part;
for(var i=0;i<pairs.length;i++){
part=pairs[i].split('=');
this.q[part[0]]=part[1];
}
};
this.map(url);
}
var query=new queryParser();
// assuming you have ?test=something
alert(query.get('test'));
I recommend you map the result, so you don't re-parse whenever you want to find a specific element.
I don't really know why you'd pass a query string in a script tag like that, unless you specifically want off-site includes with a simple robust system for various effects. Or are actually using PHP to handle that request.
If you want to "send" a variable to one of your scripts, you can always do:
<script type="text/javascript">
var myVar="I'm in global scope, all scripts can access me";
</script>
<script src="test.js?id=12"></script>
If you really need to get the URL of the currently included script, you can use the code supplied by my peers in the other answers, you can then use:
var query=new queryParser(scriptURL);
alert(query.get('id'));// would alert 12 in your case
Navigating through the link on your comments you can get the proper answer.
Anyway, to make things easier:
var allScripts=document.getElementsByTagName('script');
var indexLastScript= allScripts.length -1;
alert (allScripts[indexLastScript].src);
This will show "test.js?id=12" (resolved to an absolute URL by the .src property) as a regular string, so it's up to you to split it in order to get the param.
Hope it helps, I've tried it on the run over the Chrome Javascript Console. :D.
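Once the script's src string is in hand, by whichever of the approaches above, the standard URL and URLSearchParams APIs can do the splitting. This is only a sketch: the function name getScriptParam is hypothetical, and very old browsers lack the URL constructor.

```javascript
// Hypothetical helper: read one query parameter from a script URL.
function getScriptParam(src, name, base) {
  // base is only needed when src is relative, e.g. "test.js?id=12".
  var url = new URL(src, base);
  // Returns the value as a string, or null if the parameter is absent.
  return url.searchParams.get(name);
}

console.log(getScriptParam('test.js?id=12', 'id', 'http://www.example.com/')); // 12
console.log(getScriptParam('http://www.example.com/test.js?id=12', 'missing')); // null
```

In a page, the .src property of a script element is already absolute, so getScriptParam(scriptElement.src, 'id') needs no base argument.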

Is there a way to refresh just the javascript include while doing development?

While doing development on a .js file I'd like to just refresh that file instead of the entire page to save time. Anyone know of any techniques for this?
Here is a function to create a new script element. It appends an incremented integer to make the URL of the script unique (as Kon suggested) in order to force a download.
var index = 0;
function refreshScript (src) {
var scriptElement = document.createElement('script');
scriptElement.type = 'text/javascript';
scriptElement.src = src + '?' + index++;
document.getElementsByTagName('head')[0].appendChild(scriptElement);
}
Then in the Firebug console, you can call it as:
refreshScript('my_script.js');
You'll need to make sure that the index itself is not part of the script being reloaded!
The Firebug Net panel will help you see whether the script is being downloaded. The response status should be "200 OK" and not "304 Not Modified". Also, you should see the index appended in the query string.
The Firebug HTML panel will help you see whether the script element was appended to the head element.
UPDATE:
Here is a version that uses a timestamp instead of an index variable. As @davyM suggests, it is a more flexible approach:
function refreshScript (src) {
var scriptElement = document.createElement('script');
scriptElement.type = 'text/javascript';
scriptElement.src = src + '?' + (new Date).getTime();
document.getElementsByTagName('head')[0].appendChild(scriptElement);
}
Alexei's points are also well-stated.
I suggest you use Firebug for this purpose.
See this video, it helped me a lot.
http://encosia.com/2009/09/21/updated-see-how-i-used-firebug-to-learn-jquery/
If you're talking about the unfortunate case of client-side/browser caching of your .js file, then you can simply version your .js file. You can:
Rename the .js file itself (not preferred)
Update the include line to reference yourfile.js?1, yourfile.js?2, etc., thus forcing the browser to request the latest version from the server. (preferred)
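The versioned include can be generated rather than edited by hand. A minimal sketch follows; the function name withVersion and the parameter name v are assumptions, and URLs containing a # fragment are not handled.

```javascript
// Hypothetical helper: append a version marker so the browser sees a new URL.
function withVersion(src, version) {
  // Use "&" when the URL already has a query string, "?" otherwise.
  var separator = src.indexOf('?') === -1 ? '?' : '&';
  return src + separator + 'v=' + encodeURIComponent(version);
}

console.log(withVersion('yourfile.js', 2)); // yourfile.js?v=2
console.log(withVersion('yourfile.js?lang=en', 2)); // yourfile.js?lang=en&v=2
```

During development the version could simply be a timestamp, which gives the same effect as the refreshScript function shown earlier.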
Unfortunately, you have to refresh the web page to see edits to your JavaScript take effect. There is no way that I know of to edit JavaScript in "real-time" and see those edits take effect without a refresh.
You can use Firebug to insert new JavaScript, and make real-time changes to DOM objects; but you cannot edit JavaScript that has already been run.
If you are just fed up with refilling forms while developing, use a form recovery extension like this one: https://addons.mozilla.org/ru/firefox/addon/lazarus-form-recovery/
