Is it possible to access other webpages from within another page - javascript

Basically, what I'm trying to do is make a small script that finds the most recent post in a forum and pulls some text or an image out of it. I have this working in Python, using the htmllib module and some regex. But the script still isn't very convenient as is; it would be much nicer if I could somehow put it into an HTML document. It appears that simply embedding Python scripts is not possible, so I'm looking to see if there's a feature similar to Python's htmllib that can be used to access some other webpage and extract some information from it.
(Essentially, if I could get this script going in the form of an html document, I could just open one html document, rather than navigate to several different pages to get the information I want to check)
I'm pretty sure that plain JavaScript doesn't have the functionality I need, but I was wondering about other options such as jQuery, or even something like AJAX?

As Greg mentions, an Ajax solution will not work "out of the box" when trying to load from remote servers.
If, however, you are trying to load from the same server, it should be fairly straightforward. I'm presenting this answer to show how this could be done using jQuery in just a few lines of code.
<div id="placeholder">Please wait, loading...</div>
<script type="text/javascript" src="/path/to/jquery.js">
</script>
<script type="text/javascript>
$(document).ready(function() {
$('#placeholder').load('/path/to/my/locally-served/page.html');
});
</script>
If you are trying to load a resource from a different server than the one you're on, one way around the security limitations would be to offer a proxy script, which could fetch the remote content on the server, and make it seem like it's coming from your own domain.
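For illustration, here is a minimal sketch of such a proxy in Node.js; the choice of Node and the forum URL are assumptions on my part (any server-side language would do):

var http = require('http');

// Minimal proxy sketch: the browser requests /proxy on your own domain,
// the server fetches the remote forum page and relays it back, so the
// response appears to come from your own domain.
http.createServer(function (req, res) {
  if (req.url === '/proxy') {
    http.get('http://forum.example.com/latest', function (remote) {
      var body = '';
      remote.on('data', function (chunk) { body += chunk; });
      remote.on('end', function () {
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.end(body);
      });
    }).on('error', function () {
      res.writeHead(502);
      res.end('Upstream fetch failed');
    });
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);

With something like that in place, the load() call above could simply target '/proxy' instead of the remote page.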
Here are the docs on jQuery's load method: http://docs.jquery.com/Ajax/load
There is one other nice feature to note, which is partial page loading. For example, let's say your remote page is a full HTML document, but you only want the content of a single div in that page. You can pass a selector to the load method along with the URL, as shown below, and this will further simplify your task. For example,
$('#placeholder').load('/path/to/my/locally-served/page.html #someTargetDiv');
Best of luck! - Mike

There are two general approaches:
Modify your Python code so that it runs as a CGI (or WSGI or whatever) module and generates the page of interest by running some server-side code.
Use JavaScript with jQuery to load the content of interest by running some client-side code.
The difference between these two approaches is where the third party server sees the requests coming from. In the first case, it's from your web server. In the second case, it's from the browser of the user accessing your page.
Some browsers may not handle loading content from third party servers very gracefully (that is, they might pop up warning boxes or something).

You can embed Python. The most straightforward way would be to use the cgi module. If the script will be run often and you're using Apache it would be more efficient to use mod_python or mod_wsgi. You could even use a Python framework like Django and code the entire site in Python.
You could also code this in JavaScript, but it would be much trickier. There are a lot of security concerns with cross-site requests (ah, the unsafe internet), and so it tends to be tricky to do through the browser.

Related

copying html from another website in javascript

How can I copy the source code from a website (with JavaScript)? I want to copy the text that is showing the temperature from this website: http://www.accuweather.com/
I want to copy only the number that is displaying the temperature. Is there a way of copying that exact line from the source code of the website? I heard about HTML scraping. If not JavaScript, what would be the simplest way of doing it? Just copying the temperature, and displaying it on my webpage.
A simple way to do something like that is to load the site into a hidden HTML element via AJAX and then search the DOM for the element you want.
There is also a jQuery command that does this directly. It would be something like:
<div id='temp'></div>
<script>
$('div#temp').load('https://www.accuweather.com/ #popular-locations-ul .large-temp');
</script>
#popular-locations-ul .large-temp is a CSS selector for the specific elements that contain the temperature.
However, for some time now the web has had a security feature called CORS (Cross-Origin Resource Sharing). To load something from another site via AJAX, the target site has to explicitly send CORS headers allowing it. In the case of this particular site, those headers aren't present in the site configuration, which means any attempt to load something from it via AJAX won't be allowed.
You can only use a command like the one above against a site you control (and configure to send the appropriate CORS headers) or against a site that already does so.
But as people have told you, that's not a good approach in the first place, due to web sites' impermanent nature. Things change a lot. So even if you could get a value this way from some other site, some time later the site would change and your code would break.
The reason I answered is that you are just learning and need guidance, not trying to do 'serious work'. Serious work would mean using an API, as people told you.
A web API is a special URL you access (something like https://www.accuweather.com:1234/api/temperature/somecity), normally with some kind of security, that responds with the result you need for the function you want. For this kind of service CORS is allowed, because you are accessing it in a secure and 'official' way.
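To make that concrete, here is a minimal sketch of what calling such an API might look like with jQuery; the URL, the city parameter and the response shape are made up for illustration and are not AccuWeather's real API:

// Hypothetical API endpoint and response shape, for illustration only
$.getJSON('https://api.example.com/temperature?city=London', function (data) {
  // e.g. data = { "temperature": 21 }
  $('#temp').text(data.temperature + '°');
});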
Hope I clarified a bit.

Finding text from webpage using javascript

I am trying to make a script but I can't really find a solution.
I'm trying to find a string on a website. The hard part is that I can't use
document.documentElement.innerHTML.search("string")
Since I can't do it locally, I want to use something like this:
var link = "myweb.com"
link.documentElement.innerHTML.search("string")
At the moment, my script generates the link, opens it and closes it: I just need to search the webpage for the word "error."
Javascript run inside of a client's browser won't actually be able to retrieve another website's html for you (unless it is a different page on your own website). You may want to read about the Same-Origin Policy.
You can, however, use javascript as a language to do what you want - just not running inside of a browser. You can use something called Node.js, which is simply a program you can use to run javascript outside of a browser.
What it really boils down to is that if you want to scrape another website (which is the term for what you are trying to do), you typically need to make a scraper that runs on a server, and not a browser.
To be complete, a (probably shady) way to scrape another website is to:
Have your server-side code fetch another website's contents
Use AJAX to pass the contents to a client's browser
Have the client do all of the processing
Optionally send the scraped information back to your server
Here is a good article on scraping with nodeJS.
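As a minimal sketch of that server-side approach (using only Node's core http module; the URL is a placeholder), fetching a page and searching it for the word in question:

var http = require('http');

// Fetch the page server-side and search its HTML for a string
http.get('http://myweb.com/somepage', function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    var found = body.indexOf('error') !== -1;
    console.log(found ? 'Found "error" on the page' : '"error" not found');
  });
}).on('error', function (err) {
  console.error('Request failed:', err.message);
});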
If you need it just to work on your computer, you can make a userscript that will do this easily. If you want it to work as part of a hosted website, you need a server-side solution.

Use Python to open a web browser (on Windows), trigger JavaScript actions, and get the HTML contents?

Yes, that sounds overly complicated.
I am trying to mine data from pages on our intranet. The pages are secure. The connection is refused when I try to get the contents with urllib.urlopen().
So I would like to use python to open a web browser to open the site then click some links that trigger javascript pop ups containing tables of info that I want to collect.
Any suggestions on where to begin?
I know the format of the page. It is something like this:
<div id="list">
<ul id="list item">
<li><a onclick="Openpopup('1');">blah</a></li>
</ul>
<ul></ul>
etc
Then a hidden frame becomes visible and the fields in the table within are filled.
<div>
<table>
<tr><td><span id="info_i_want">...
First off, I suggest that it's better to figure out what the page needs that JS is providing, and fake that - you'll have an easier time scraping the page if a browser isn't involved.
If it's just Javascript making an XMLHttpRequest, you can find the page from which the Javascript fetches the iframe data and connect directly to that.
But in spite of that you may need a library that does Javascript execution (if the reverse-engineering is too hard or it uses challenge tokens). A web-rendering framework like Gecko or WebKit might be appropriate.
Take a good look at Selenium if you insist on using a true web browser or cannot get the programmatic methods to work.
Once you've gotten the page contents via whatever method, you need an HTML parser (such as sgmllib or [almost] xml.dom). I suggest a DOM library. Parse the DOM and extract the contents from the appropriate node in the resulting tree.
"The connection is refused when I try to get the contents with urllib.urlopen()" probably means you have to make a POST request using Python's urllib module. I would suggest you use urllib2. You may also need to handle cookies, the referrer, and the user-agent from your Python code.
To see all the POST requests fired from your browser, use Firefox's Live HTTP Headers extension.
For the JavaScript part, your best bet is to run a headless browser, e.g. PhantomJS, which understands all the intricacies of JavaScript, the DOM, etc. You will have to write your code in JavaScript, but the benefit is that you can do whatever you want.
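A minimal PhantomJS sketch (my own illustration; the URL is a placeholder) that loads a page, lets its JavaScript run, and then pulls text out of the DOM:

var page = require('webpage').create();

page.open('http://intranet.example.com/page', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
  } else {
    // page.evaluate runs inside the loaded page's context
    var text = page.evaluate(function () {
      var span = document.getElementById('info_i_want');
      return span ? span.textContent : null;
    });
    console.log(text);
    phantom.exit();
  }
});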
As @phihag mentioned, Selenium is also a good option.
First of all, you should really find out why the connection is refused when you access the page with Python. Most likely, you'll have to perform HTTP authentication or specify a different User-Agent.
Firing up a browser, navigating, and getting the HTML back is a complex task. Luckily, you can implement it using selenium.
Consider taking a look at splinter which is a simpler webdriver API than Selenium.

How do JavaScript-based modal/popup services like KissInsights and Hello Bar work?

I'm developing a modal/popup system for my users to embed in their sites, along the lines of what KissInsights and Hello Bar (example here and here) do.
What is the best practice for architecting services like this? It looks like users embed a bit of JS, but that code then inserts an additional script tag.
I'm wondering how it communicates with the web service to get the user's content, etc.
TIA
You're right that usually it's simply a script that the customer embeds on their website. However, what comes after that is a bit more complicated.
1. Embed a script
The first step as said is to have a script on the target page.
Essentially this script is just a piece of JavaScript code. It's pretty similar to what you'd have on your own page.
This script should generate the content on the customer's page that you wish to display.
However, there are some things you need to take into account:
You can't use any libraries (or if you do, be very careful what you use): These may conflict with what is already on the page, and break the customer's site. You don't want to do that.
Never override anything, as overriding may break the customer's site: This includes event listeners, native object properties, whatever. For example, always use addEventListener or addEvent with events, because these allow you to have multiple listeners.
You can't trust any styles: All styles of HTML elements you create must be inlined, because the customer's website may have its own CSS styling for them.
You can't add any CSS rules of your own: These may again break the customer's site.
These rules apply to any script or content you run directly on the customer site. If you create an iframe and display your content there, you can ignore these rules in any content that is inside the frame.
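As a minimal sketch of what an embedded script following these rules might look like (the id, styles and text are of course made up):

(function () {
  // Namespaced id and inline styles only, so nothing on the customer's
  // page is affected
  var bar = document.createElement('div');
  bar.id = 'mywidget-bar';
  bar.style.cssText = 'position:fixed;top:0;left:0;width:100%;' +
                      'background:#222;color:#fff;padding:8px;z-index:99999;';
  bar.textContent = 'Hello from the widget!';
  // addEventListener rather than overriding any existing handler
  bar.addEventListener('click', function () {
    bar.parentNode.removeChild(bar);
  }, false);
  document.body.appendChild(bar);
})();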
2. Process script on your server
Your embeddable script should usually be generated by a script on your server. This allows you to include logic such as choosing what to display based on parameters, or data from your application's database.
This can be written in any language you like.
Typically your script URL should include some kind of an identifier so that you know what to display. For example, you can use the ID to tell which customer's site it is or other things like that.
If your application requires users to log in, you can process this just like normal. The fact the server-side script is being called by the other website makes no difference.
3. Communication between the embedded script and your server or frames
There are a few tricks to this as well.
As you may know, XMLHttpRequest does not work across different domains, so you can't use that.
The simplest way to send data over from the other site would be to use an iframe and have the user submit a form inside the iframe (or run an XMLHttpRequest inside the frame, since the iframe's content resides on your own server so there is no cross domain communication)
If your embedded script displays content in an iframe dialog, you may need to be able to tell the script embedded on the customer site when to close the iframe. This can be achieved for example by using window.postMessage
For postMessage, see http://ejohn.org/blog/cross-window-messaging/
For cross-domain communication, see http://softwareas.com/cross-domain-communication-with-iframes
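A minimal postMessage sketch (the id and origin are placeholders), with the dialog inside the iframe asking the embedded script on the customer's page to close it:

// Inside the iframe (served from your own domain): ask the parent to close us
window.parent.postMessage('close-dialog', '*');

// In the embedded script running on the customer's page:
window.addEventListener('message', function (event) {
  // Only accept messages coming from your own (placeholder) origin
  if (event.origin !== 'https://widgets.example.com') { return; }
  if (event.data === 'close-dialog') {
    var frame = document.getElementById('mywidget-frame');
    if (frame) { frame.parentNode.removeChild(frame); }
  }
}, false);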
You could take a look here - it's an example of an API created using my JsApiToolkit, a framework for allowing service providers to easily create and distribute Facebook Connect-like tools to third-party sites.
The library is built on top of easyXDM for Cross Domain Messaging, and facilitates interaction via modal dialogs or via popups.
The code and the readme should be sufficient to explain how things fit together (it's really not too complicated once you abstract away things like the XDM).
About the embedding itself: you can do this directly, but most services use a 'bootstrapping' script that can easily be updated to point to the real files - this small file could be served with a cache pragma that would ensure that it was not cached for too long, while the injected files could be served as long-living files.
This way you only incur the overhead of re-downloading the bootstrapper instead of the entire set of scripts.
Best practice is to put as little code as possible into your code snippet, so you don't ever have to ask the users to update their code. For instance:
<script type="text/javascript" src="http://your.site.com/somecode.js"></script>
Works fine if the author will embed it inside their page. Otherwise, if you need a bookmarklet, you can use this code to load your script on any page:
javascript:(function(){
var e=document.createElement('script');
e.setAttribute('language','javascript');
e.setAttribute('src','http://your.site.com/somecode.js');
document.head.appendChild(e);
})();
Now all your code will live at the above referenced URI, and whenever their page is loaded, a fresh copy of your code will be downloaded and executed. (not taking caching settings into account)
From that script, just make sure that you don't clobber namespaces, and check if a library exists before loading another. Use the safe jQuery object instead of $ if you are using that. And if you want to load more external content (like jQuery, UI stuff, etc.) use the onload handler to detect when they are fully loaded. For example:
function jsLoad(loc, callback){
var e=document.createElement('script');
e.setAttribute('language','javascript');
e.setAttribute('src',loc);
if (callback) e.onload = callback;
document.head.appendChild(e);
}
Then you can simply call this function to load any js file, and execute a callback function.
jsLoad('http://link.to/some.js', function(){
// do some stuff
});
Now, a tricky way to communicate with your domain to retrieve data is to use javascript as the transport. For instance:
jsLoad('http://link.to/someother.js?data=xy&callback=getSome', function(){
var yourData = getSome();
});
Your server will have to dynamically process that route, and return some javascript that has a "getSome" function that does what you want it to. For instance:
function getSome(){
return {'some':'data','more':'data'};
}
That will pretty effectively allow you to communicate with your server and process data from anywhere your server can get it.
You can serve a dynamically generated JS file from your server (using, for example, PHP or Ruby on Rails to generate the file on each request) that is imported from the customer's web site like this:
<script type="text/javascript" src="//www.yourserver.com/dynamic.js"></script>
Then you need to provide a way for your customer to decide what they want the modal/popup to contain (e.g. text, graphics, links etc.). Either you create a simple CMS or you do it manually for each customer.
Your server can see where each request for the JS file is coming from and provide different JS code based on that. The JS code can, for example, insert HTML code into your customer's web site that creates a bar at the top with some text and a link.
If you want to access your customers' visitor info, you probably need to either read it from the HTML code, have your customers provide the information you want in a specific way, or figure out a different way to access it from each customer's web server.

How can I give users a javascript widget to pull content securely from my site

I need to allow content from our site to be embedded in other users' web sites.
The content will be chargeable, so I need to keep it secure, but one of the requirements is that the subscribing web site only needs to drop some JavaScript into their page.
It looks like the only way to secure our content is to check that the URL of the page hosting our JavaScript matches the subscribing site. Is there any other way to do this, given that we don't know the client browsers who will be hitting the subscribing sites?
Is the best way to do this to supply a javascript include file that populates a known page element when the page loads? I'm thinking of using jquery so the include file would first call in jquery (checking if it's already loaded and using some sort of namespace protection), then on page load populate the given element.
I'd like to include a stylesheet as well if possible to style the element but I'm not sure if I can load this along with the javascript.
Does this sound like a reasonable approach? Is there anything else I should consider?
Thanks in advance,
Mike
It looks like the only way to secure our content is to check that the URL of the page hosting our JavaScript matches the subscribing site.
Ah, but in client-side or server-side code?
They both have their disadvantages. Doing it with server-side code is unreliable because some browsers won't be passing a Referer header at all, and if you want to stop caches keeping a copy of the script, preventing the Referer-check from taking place, you have to serve with nocache or Vary: Referer headers, which would harm performance.
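For concreteness, a minimal sketch of that server-side check in Node.js (the choice of Node and the licensed domain are assumptions), including the no-cache header mentioned above:

var http = require('http');
var url = require('url');

http.createServer(function (req, res) {
  // The Referer header may be missing entirely, as noted above
  var referer = req.headers.referer || '';
  var host = url.parse(referer).hostname || '';
  if (host === 'customersite.foo') {
    res.writeHead(200, {
      'Content-Type': 'application/javascript',
      'Cache-Control': 'no-cache' // keep caches from skipping the check
    });
    res.end('/* main body of script */');
  } else {
    res.writeHead(403);
    res.end('// Sorry, this site is not licensed to include this content');
  }
}).listen(8080);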
On the other hand, with client-side checks in the script you return, you can't be sure your environment you're running in hasn't been sabotaged. For example if your inclusion script tag was like:
<script src="http://include.example.com/includescript?myid=123"></script>
and your server-side script looked up 123 as being the ID for a customer using the domain customersite.foo, it might respond with the script:
if (location.host.slice(-16)==='customersite.foo') {
// main body of script
} else {
alert('Sorry, this site is not licensed to include content from example.com');
}
Which seems simple enough, except that the including site might have replaced String.prototype.slice with a function that always returned customersite.foo. Or various other functions used in the body of the script might be suspect.
Including a <script> from another security context cuts both ways: the including-site has to trust the source-site not to do anything bad in their security context like steal end-user passwords or replace the page with a big goatse; but equally, the source-site's code is only a guest in the including-site's potentially-maliciously-customised security context. So a measure of trust must exist between the two parties wherever one site includes script from another; the domain-checking will never be a 100% foolproof security mechanism.
I'd like to include a stylesheet as well if possible to style the element but I'm not sure if I can load this along with the javascript.
You can certainly add stylesheet elements to the document's head element, but you would need some strong namespacing to ensure it didn't interfere with other page styles. You might prefer to use inline styles for simplicity and to avoid specificity-interference from the page's main style sheet.
It depends really whether you want your generated content to be part of the host page (in which case you might prefer to let the including site deal with what styles they wanted for it themselves), or whether you want it to stand alone, unaffected by context (in which case you would probably be better off putting your content in an <iframe> with its own styles).
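If you do add a stylesheet, a minimal sketch (the URL and class names are placeholders) of appending one to the host page's head:

// Append a <link> for your own stylesheet to the host page's <head>
var link = document.createElement('link');
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = 'https://content.example.com/embed.css';
document.getElementsByTagName('head')[0].appendChild(link);
// Inside embed.css, prefix every rule with a namespaced class, e.g.
// ".example-embed .title { ... }", so it cannot collide with the host
// page's own styles.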
I'm thinking of using jquery so the include file would first call in jquery
I would try to avoid pulling jQuery into the host page. Even with noconflict there are ways it can conflict with other scripts that are not expecting it to be present, especially complex scripts like other frameworks. Running two frameworks on the same page is a recipe for weird errors.
(If you took the <iframe> route, on the other hand, you get your own scripting context to play with, so it wouldn't be a problem there.)
You can store the users domain, and a key within your local database. That, or the key can be an encrypted version of the domain to keep you from having to do a database lookup. Either one of these can determine whether you should respond to the request or not.
If the request is valid, you can send your data back out to the user. This data can indeed load in jQuery and an additional CSS reference.
Related:
How to load up CSS files using Javascript?
check if jquery has been loaded, then load it if false
