Looking for a way to scrape HTML with JS - javascript

As the title suggests, I'm looking for a hopefully straightforward way of scraping all of the HTML from a webpage. Storing it in a string perhaps, and then navigating through that string to pull out the desired element.
Specifically, I want to scrape my Twitter page and display my profile picture inside a new div. I know there are several tools for doing just this, but would anyone have some code examples or suggestions for how I might do this myself?
Thanks a lot
UPDATE
After a very helpful response from T.J. Crowder I did some more searching online and found this resource.

In theory, this is easy. You just do an ajax call to get the text of the page, then use jQuery to turn that into a disconnected DOM, and then use all the usual jQuery tools to find and extract what you need.
$.ajax({
    url: "http://example.com/some/path",
    success: function(html) {
        var tree = $(html);
        var imgsrc = tree.find("img.some-class").attr("src");
        if (imgsrc) {
            // ...add the image to your page
        }
    }
});
But (and it's a big one) it's not likely to work, because of the Same Origin Policy, which prevents cross-origin ajax calls. Certain individual sites may have an open CORS policy, but most won't, and of course supporting CORS on IE8 and IE9 requires an extra jQuery plug-in.
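As a rough illustration of what "allowing your origin via CORS" means, here is a simplified model of the check the browser makes against the Access-Control-Allow-Origin response header. It is only a sketch: the function name originAllowedByCors is made up for this example, and it ignores credentials and preflight requests, which the real algorithm also considers.

```javascript
// Simplified model only: the real CORS check also covers credentials,
// preflight requests and allowed methods/headers.
// originAllowedByCors is a hypothetical helper name, not a browser API.
function originAllowedByCors(allowOriginHeader, pageOrigin) {
  if (allowOriginHeader == null) {
    return false; // no header at all: the cross-origin read is blocked
  }
  const value = allowOriginHeader.trim();
  // "*" opens the resource to every origin; otherwise the header must
  // match the requesting page's origin exactly.
  return value === "*" || value === pageOrigin;
}
```

So a site that sends no such header, or one naming a different origin, will not let your page read the response.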
So to do this with sites that don't allow your origin via CORS, there must be a server involved. It can be your server and you can grab the text of the page you want using server-side code and then send it to your page via ajax (or just build the bits you want into your page when you first render it). All of the usual server-side stacks (PHP, Node, ASP.Net, JVM, ...) have the ability to grab web pages. Or, in some cases, you may be able to use YQL as a cross-domain proxy, using their server rather than your own.
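A minimal sketch of the server-side route, assuming Node 18 or later (for the built-in fetch). The helper names findImageSrc and scrapeImageSrc are hypothetical, and the regex-based extraction is deliberately naive; a real HTML parser would be more robust.

```javascript
// Hypothetical helper: return the src of the first <img> whose class list
// contains className, or null if there is none. Only handles quoted
// attribute values.
function findImageSrc(html, className) {
  const imgTags = html.match(/<img\b[^>]*>/gi) || [];
  for (const tag of imgTags) {
    const classMatch = tag.match(/\sclass\s*=\s*["']([^"']*)["']/i);
    const srcMatch = tag.match(/\ssrc\s*=\s*["']([^"']*)["']/i);
    if (!classMatch || !srcMatch) {
      continue;
    }
    const classes = classMatch[1].split(/\s+/);
    if (classes.includes(className)) {
      return srcMatch[1];
    }
  }
  return null;
}

// Runs on the server, so the browser's Same Origin Policy does not apply.
// Assumes a global fetch (Node 18+).
async function scrapeImageSrc(pageUrl, className) {
  const response = await fetch(pageUrl);
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return findImageSrc(await response.text(), className);
}
```

Your page would then ask your own server for the result via ajax instead of calling the remote site directly.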

Related

load external webpage and add custom header and use the data from webpage

I want to load an external webpage on my own server and add my own header. I also need to use data from the external website, such as the URL and content (I need to search for specific data, check whether I already have that data in my system, and show my data in the header). The external webpage needs to keep working (for example, buttons that open other pages should work, without opening new windows).
I know I can use .NET to create software, but I want to create a website that does the trick. Can this be done? PHP + iframe is too simple, I think: it won't give me the data from the external website, and my server won't see changes in the external URL (which is what I need).
If it's supposed to be client-side, then you can acquire the data necessary by using an Ajax request, parsing it in JavaScript and then just inserting it into an element. However you have to take into account that if the host doesn't support cross-origin resource sharing, then you won't be able to do it like this.
Ajax page source request: get full html source code of page through ajax request through javascript
Parsing elements from the source: http://ajaxian.com/archives/html-parser-in-javascript (not sure if useful)
Changing the element body:
// data --> the content you want to display in your element
document.getElementById('yourElement').innerHTML = data;
Another approach (server-side, though) is to "act" like a browser by setting your user-agent to a browser's and then using cURL, for example, to get the source. But you don't want to fake it, because that's not nice and you would feel bad.
Hope it gets you started!

HTML, Javascript: difference between an ajax call and changing the source of an iframe

Context: I'm trying to code a JavaScript function to like a certain post on Tumblr, based on this link. I tried using an ajax call instead of changing the source of an iframe, but it doesn't work. Of course, changing the source of an iframe works.
So, what is the difference that makes this not work?
var $baseUrl = 'http://tumblr.com/like/';
function LikePost($postID, $reblogUrl)
{
    /*
    http://www.tumblr.com/<command>/<oauthId>?id=<postId>
    <command>: like or unlike
    <oauthId>: last eight characters of {ReblogURL}
    <postId>: {PostID}
    Example URL:
    http://www.tumblr.com/like/fGKvAJgQ?id=16664837215
    */
    var $oauthId = $reblogUrl.substring($reblogUrl.length - 8, $reblogUrl.length);
    var $likeUrl = $baseUrl + $oauthId + '?id=' + $postID;
    $.ajax({
        url: $likeUrl,
        type: 'POST'
    });
}
AJAX requests are bound by same domain policy, with some exceptions that aren't worth listing since they don't work unless you control both domains.
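To make "same domain" concrete: two URLs count as the same origin only when scheme, host and port all match. A small sketch using the standard URL class (the function name isSameOrigin is just for illustration):

```javascript
// Two URLs share an origin only if scheme, host and port are all equal.
// URL.origin normalises default ports, so http://example.com and
// http://example.com:80 compare as equal.
function isSameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}
```

A page on your own site and a http://www.tumblr.com/like/... URL differ in host, so a plain AJAX request between them is cross-origin.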
In this case, you're calling a tumblr domain from your website, which you can't do through AJAX. However, iframes, script elements, and img elements can point to any domain, so if the like url isn't returning any content to you, you can use any of those means to record the like.
If you didn't want to use an iframe, the other method you could use would be to make a request to your server via AJAX, then proxy the request to tumblr. Your server can go to any url it wants.
However, the iframe approach is easiest. I suggest going that route since you already got it working. ;)
They are intended for different purposes. As jmort253 noted above, AJAX calls work only for the same domain, whereas Iframes may span different domains. But if you are interested in loading data from the same domain, AJAX may be a better option. Many times, while using IFrame, you will see a loading sign on the tab-bar of the page, showing that something inside it is loading (it's the IFrame page which is loading, not the entire page), which you don't want the user to see, because that is the point of AJAX, loading data seamlessly, giving the user the illusion that the data is coming almost simultaneously. With AJAX, you won't have these problems.
And even if you want to load data from different domains, while JavaScript itself is not up to the task, you can use PHP to do the loading part, then use JavaScript to fetch the data from there.

How do you get content from another domain with .load()?

Requesting data from any location on my domain with .load() (or any jQuery ajax functions) works just fine.
Trying to access a URL in a different domain doesn't work though. How do you do it? The other domain also happens to be mine.
I read about a trick you can do with PHP and making a proxy that gets the content, then you use jQuery's ajax functions, on that php location on your server, but that's still using jQuery ajax on your own server so that doesn't count.
Is there a good plugin?
EDIT: I found a very nice plugin for jQuery that allows you to request content from other pages using any of the jQuery function in just the same way you would a normal ajax request in your own domain.
The post: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
The plugin: https://github.com/jamespadolsey/jQuery-Plugins/tree/master/cross-domain-ajax/
This is because of the cross-domain policy, which, in short, means that using a client-side script (a.k.a. JavaScript...) you cannot request data from another domain. Lucky for us, this restriction does not exist in most server-side scripts.
So...
Javascript:
$("#google-html").load("google-html.php");
PHP in "google-html.php":
echo file_get_contents("http://www.google.com/");
would work.
Different domains = different servers as far as your browser is concerned. Either use JSONP to do the request or use PHP to proxy. You can use jQuery.ajax() to do a cross-domain JSONP request.
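For what it's worth, here is roughly what a JSONP request does under the hood, as a sketch rather than jQuery's actual implementation. It assumes the remote server supports JSONP and reads the callback name from a "callback" query parameter; the function names are made up for the example, and jsonp itself only runs in a browser.

```javascript
// Build the script URL. "callback" is a common parameter name, but the
// remote server decides which name it honours (assumption).
function buildJsonpUrl(baseUrl, callbackName) {
  const url = new URL(baseUrl);
  url.searchParams.set("callback", callbackName);
  return url.toString();
}

// Browser-only: inject a <script> tag and let the response call us back.
// Script tags are not bound by the same-origin restriction on AJAX.
function jsonp(baseUrl, onData) {
  const callbackName = "jsonpCallback" + Date.now();
  const script = document.createElement("script");
  window[callbackName] = function (data) {
    delete window[callbackName]; // clean up the temporary global
    script.remove();
    onData(data);
  };
  script.src = buildJsonpUrl(baseUrl, callbackName);
  document.head.appendChild(script);
}
```

The server has to wrap its JSON in a call to the named function, which is why this only works with services that explicitly offer JSONP.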
One really easy workaround is to use Yahoo's YQL service, which can retrieve content from any external site.
I've successfully done this on a few sites following this example which uses just JavaScript and YQL.
http://icant.co.uk/articles/crossdomain-ajax-with-jquery/using-yql.html
This example is a part of a blog post which outlines a few other solutions as well.
http://www.wait-till-i.com/2010/01/10/loading-external-content-with-ajax-using-jquery-and-yql/
I know of another solution which works.
It does not require that you alter JQuery. It does require that you can stand up an ASP page in your domain. I have used this method myself.
1) Create a proxy.asp page like the one on this page http://www.itbsllc.com/zip/proxyscripts.html
2) You can then do a JQuery load function and feed it proxy.asp?url=.......
there is an example on that link of how exactly to format it.
Anyway, you feed the foreign page URL and your desired mime type as get variables to your local proxy.asp page. The two mime types I have used are text/html and image/jpg.
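Building that proxy URL by hand is easy to get wrong, because the foreign URL has to be encoded before it goes into the query string. A small sketch: the "url" parameter comes from the proxy.asp?url= format above, while the "mime" parameter name and the helper name buildProxyUrl are assumptions, so check them against the proxy script you actually use.

```javascript
// Build the request URL for a local proxy page.
// "mime" is an assumed parameter name; adjust it to match your proxy script.
function buildProxyUrl(proxyPath, targetUrl, mimeType) {
  // URLSearchParams percent-encodes the values, so characters like
  // "?", "&" and "/" in targetUrl cannot break the query string.
  const query = new URLSearchParams({ url: targetUrl, mime: mimeType });
  return proxyPath + "?" + query.toString();
}
```

With jQuery loaded, usage would look something like $("#target").load(buildProxyUrl("proxy.asp", "http://www.google.com/", "text/html")).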
Note, if your target page has images with relative source links those probably won't load.
I hope this helps.

URL masking in JavaScript

I currently have the following JavaScript function that will take current URL and concatenate it to another site URL to route it to the appropriate feedback group:
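function sendFeedback() {
    var url = window.location.href;
    // Encode the current URL so its own query string survives as a single parameter value.
    var newwin = window.open('http://www.anothersite.com/home/feedback/?s=' + encodeURIComponent(url), 'Feedback');
}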
Not sure if this is the proper terminology, but I want to mask the URL in the window.open statement to use the URL from the current window.
How would I be able to mask the window.open URL with the original in JavaScript?
Things you could do:
1) Mask the external site in an HTML frame inside a document from your site (for example www.mysite.com/shortUrl/).
2) Send a Location HTTP header (the real URL will eventually be displayed).
Keep in mind that browsers do their best to show the real address due to phishing concerns.
I wouldn't use JavaScript to mask a URL, even though it would work. You wouldn't get much benefit in that scenario.
The reason is simple:
javascript/jQuery = functions that belong to the client side (browser/your PC/DOM)
links, URLs, HTTP, and headers = functions that belong to Apache.
Apache always sits above the client side. Whenever a link to SampleLink.html is fired, Apache wakes up and reads the file, but the links/URLs are already owned before JavaScript could claim them. So it is kind of pointless to manipulate links in your JavaScript scripts; it works, but it is weak.
I'd point you to this awesome approach: .htaccess, and you will be surprised how powerful it is. If .htaccess is present in the parent folder of SampleLink.html, Apache keeps the DOM engine (your browser) from reading files until Apache has finished reading .htaccess.
In your scenario, .htaccess can do some of the work for you by rewriting links and sending "decoy" links to the DOM engine while keeping the original links/URLs behind the curtain; visitors would reach a 404 page if they tried to break the app or whatever you are concerned about.
This is a bit complicated, but it has never failed me. I use this as my "bible": http://corz.org/serv/tricks/htaccess2.php.

Cross Domain Javascript Bookmarklet

I've been at this for several days, and searches, including here, haven't given me any solutions yet.
I am creating a Bookmarklet which is to interact with a POST API. I've gotten most of it working except the most important part; the sending of data from the iframe (I know horrible! If anyone knows a better solution please let me know) to the javascript on my domain (same domain as API so the communication with the API is no problem).
From the page the user clicks on the bookmarklet I need to get the following data to the javascript that is included in the iFrame.
var title = pageData[0].title;
var address = pageData[0].address;
var lastmodified = pageData[0].lastmodified;
var referralurl = pageData[0].referralurl;
I first fixed it with parsing this data as JSON and sending it through the name="" attribute of the iFrame but realized on about 20% of webpages this breaks. I get an access denied; also it's not a very pretty method.
Does anyone have any idea how I can solve this? I am not looking to use POSTs that redirect; I want it all to be AJAX and as unobtrusive as possible. It's also worth noting that I use the jQuery library.
Thank you very much,
Ice
You should look into easyXDM, it's very easy to use. Check out one of the examples on http://consumer.easyxdm.net/current/example/methods.html
After a lot of work I was able to find a solution using JSONP, which enables cross-domain JavaScript. It's very tricky with the CodeIgniter framework because passing data along in URLs requires a lot of encoding and making sure you don't have illegal characters. Also, I'm still looking into how secure it really is.
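On the encoding point, one way to keep illegal characters out of the URL is to serialise the data as JSON and percent-encode the result as a single value. This is only a sketch of that idea, not the code from the solution above; the helper names and the field values in the test data are made up.

```javascript
// Pack an object into one URL-safe string: JSON first, then percent-encoding.
function encodePageData(pageData) {
  return encodeURIComponent(JSON.stringify(pageData));
}

// Reverse of encodePageData, for the receiving side.
function decodePageData(encoded) {
  return JSON.parse(decodeURIComponent(encoded));
}
```

The encoded string can then be appended as one query parameter value, and the receiving script decodes it back into the original object. Whether the framework on the server accepts percent-encoded characters in URLs is a separate question.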
If I understand your question correctly, you might have some success by looking into using a Script Tag proxy. This is the standard way to do cross domain AJAX in javascript frameworks like jquery and extjs.
See Jquery AJAX Documentation
If you need to pass data to the iframe, and the iframe is actually including another page that is on the same domain (a lot of assumptions, I know), then the main page code can do this:
DATA_FOR_IFRAME = ({'whatever': 'stuff'});
Then the code on the page included by the iframe can do this:
window.parent.DATA_FOR_IFRAME;
to get at the data :)
