Issues in developing web scraper - javascript

I want to develop a platform where users can enter a URL and my website will open that webpage in an iframe. The user can then modify the page by simply right-clicking, and I provide options like "remove this element" and "copy this element". I am almost through. Many websites open perfectly in the iframe, but for a few websites some errors have shown up. I could not identify the reason, so I am asking for your help.
I have solved other issues like XSS problem.
Here is the procedure I have followed:
I use JavaScript to send the request to my Java server, which connects to the URL specified by the user and fetches the HTML. I then use the Jsoup HTML parser to convert relative URLs into absolute URLs and save the HTML to disk on the server. Finally, I render the saved HTML in my iframe.
Is something wrong somewhere?
A few websites are working perfectly but a few are not.
For example:
When I tried to open http://www.snapdeal.com it gave me the
Uncaught TypeError: Cannot read property 'paddingTop' of undefined
error. I don't understand why this is happening.
Update
I really wonder how this is implemented: http://www.proxywebsites.in/browse.php?u=Oi8vd3d3LnNuYXBkZWFsLmNvbQ%3D%3D&b=13&f=norefer

There are 2 issues; pick whichever you like:
1. Your server-side proxy code contains bugs.
2. Plenty of sites have either explicit frame-busting code or at least expect to be the top-level frame.
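For reference, a classic frame-busting snippet looks something like this (a generic sketch, not any particular site's actual code). Scripts like this, or scripts that simply assume they are running in their original page, can throw errors like the paddingTop one above when served from your proxy inside an iframe:
// Classic frame-buster: if the page detects it is not the top-level
// window, it forces itself out of the hosting iframe.
if (window.top !== window.self) {
    window.top.location = window.self.location;
}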

You can try one more thing. In your proxy script you are saving the webpage on your disk and then loading it into an iframe. Instead of loading the saved page in an iframe, try opening that page directly in the browser. All the sites that restrict their pages from being loaded into an iframe will then open without any error.
Try this; I think it can work.

My proxy server-side code:
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// `url` and `log` come from the surrounding servlet code.
DateFormat df = new SimpleDateFormat("ddMMyyyyHHmmss");
String dirName = df.format(new Date());
String dirPath = "C:/apache-tomcat-7.0.23/webapps/offlineWeb/" + dirName;
String serverName = "http://localhost:8080/offlineWeb/" + dirName;
boolean directoryCreated = new File(dirPath).mkdir();
if (!directoryCreated)
    log.error("Error in creating directory");

// Fetch and parse the page. Jsoup.connect(...).get() returns a Document
// whose base URI is the requested URL, so abs: resolution below works
// without re-parsing the HTML string.
Document doc = Jsoup.connect(url.toString()).get();

Elements links = doc.select("link");
Elements scripts = doc.select("script");
Elements images = doc.select("img");

// Rewrite relative URLs to absolute ones. Note: Java strings must be
// compared with equals()/isEmpty(), never with == or !=.
for (Element element : links) {
    String linkHref = element.attr("abs:href");
    if (!linkHref.isEmpty()) {
        element.attr("href", linkHref);
    }
}
for (Element element : scripts) {
    String scriptSrc = element.attr("abs:src");
    if (!scriptSrc.isEmpty()) {
        element.attr("src", scriptSrc);
    }
}
for (Element element : images) {
    String imgSrc = element.attr("abs:src");
    if (!imgSrc.isEmpty()) {
        element.attr("src", imgSrc);
        log.info(imgSrc);
    }
}

// Save the rewritten HTML to disk (the file name here is assumed)
Files.write(Paths.get(dirPath, "index.html"),
        doc.outerHtml().getBytes(StandardCharsets.UTF_8));
And now I just return the path where I saved my HTML file.
That's it for my server code.

Related

How to Detect If JS is Running in Website Builder?

I want to display forums inside websites where my JavaScript (and HTML and CSS) is embedded. But if the JavaScript is running inside a website builder, I just want to show some text telling the user their forums are installed here (in the embedded DIV) and not try to display any forums. My only idea is to look at the URL and, if I see a known website builder, run the website-builder code, but I would need a large list of all website-builder URLs. Does anyone have such a list, or is there a better solution? My current code looks like this:
var hostURL = window.location.href;
if (hostURL == "about:srcdoc") hostURL = window.parent.location.href;
if (hostURL.indexOf("websites.godaddy.com") > -1 ||    // godaddy
    hostURL.indexOf(".preview.editmysite.com") > -1) { // weebly
    displayWebsiteBuilderInfo();
    return;
}
Here's what I did, but I'm not sure if it's a good solution (and it's not a solution for the original question):
In the PHP code that handles the request for the forums data, I read the content at the referer URL (which comes from the client's window.location.href) to see if my JavaScript is there. If it's not there, I assume the request came from a website builder. Then, if isWebsiteBuilder is true back at the client, I call displayWebsiteBuilderInfo().
Here's the PHP code:
// $referer holds the referring page's URL, passed from the client
$siteContent = @file_get_contents($referer); // @ suppresses warnings on failure
$siteContent = htmlspecialchars_decode($siteContent);
$idx = strpos($siteContent, "<script async src=\"https://www.bubblecritic.com/js/embed/the_js.js\"></script>");
if ($idx === false) $isWebsiteBuilder = true;
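For completeness, here is a minimal sketch of the client side of this scheme. The endpoint URL, the response shape, and displayForums() are my assumptions, not part of the original code:
// Hypothetical endpoint that runs the PHP check above and returns
// the flag as JSON, e.g. {"isWebsiteBuilder": true}
var xhr = new XMLHttpRequest();
xhr.open("GET", "https://www.bubblecritic.com/js/embed/check_builder.php");
xhr.onload = function () {
    var res = JSON.parse(xhr.responseText);
    if (res.isWebsiteBuilder) {
        displayWebsiteBuilderInfo(); // show the placeholder text only
    } else {
        displayForums(); // hypothetical: render the real forums
    }
};
xhr.send();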

Add attachment by url to Outlook mail

The context
There is a button on the homepage of each document set in a document library on a SharePoint Online environment. When the button is clicked, an Outlook window should open with the title and body pre-filled and all the files in the document set added as attachments.
The code
Here's the code I have so far:
var olApp = new ActiveXObject("Outlook.Application");
var olNs = olApp.GetNameSpace("MAPI");
var olItem = olApp.CreateItem(0); // 0 = olMailItem

// Grab the default signature before overwriting the body.
var signature = olItem.HTMLBody;

olItem.Importance = 2; // 2 = olImportanceHigh
olItem.To = "";
olItem.Cc = "";
olItem.Bcc = "";
olItem.Subject = "Pre filled title";
olItem.HTMLBody =
    "<span style='font-size:11pt;'>" +
    "<p>Pre filled body</p>" +
    "</span>";
olItem.HTMLBody += signature;
olItem.Display();
olItem.GetInspector.WindowState = 2;

var docUrl = "https://path_to_site/Dossiers/13245_kort titel/New Microsoft Word Document.docx";
olItem.Attachments.Add(docUrl);
The Problem
When I run this code, an Outlook window opens with everything set correctly. But on the line where the attachment is added I get the following very vague error message:
SCRIPT8: The operation failed.
I thought it could be the spaces in the URL, so I replaced them:
docUrl = docUrl.replace(/ /g, "%20");
That also didn't work (same error), and providing all the parameters like this didn't work either:
olItem.Attachments.Add(docUrl, 1, 1, "NewDocument");
Passing a path to a local file (e.g. C:/folder/file.txt) or a publicly available URL to an image does work. So my guess is that it has something to do with permissions or security. Does anybody know how to solve this?
PS: I know using an ActiveX control is not the ideal way of working (browser limitations, security considerations, ...) but the situation is what it is and not in my power to change.
You cannot pass a URL to MailItem.Attachments.Add in OOM (it does work in Redemption - I am its author - for RDOMail.Attachments.Add). The Outlook Object Model only accepts a fully qualified path to a local file or a pointer to another Outlook item (such as a MailItem).
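Given that restriction, one workaround consistent with what you observed (local paths work) is to download the file to a local path first and then attach that. A rough sketch using the same ActiveX approach as your code; whether this is allowed depends on IE's security settings, and the temp path is an assumption:
// Download the document to a local file, then attach the local path.
var xhr = new ActiveXObject("MSXML2.XMLHTTP");
xhr.open("GET", docUrl, false); // synchronous, for brevity
xhr.send();

var stream = new ActiveXObject("ADODB.Stream");
stream.Type = 1;                 // 1 = binary
stream.Open();
stream.Write(xhr.responseBody);  // raw bytes of the response
var localPath = "C:\\Temp\\New Microsoft Word Document.docx"; // assumed writable
stream.SaveToFile(localPath, 2); // 2 = overwrite an existing file
stream.Close();

olItem.Attachments.Add(localPath); // a fully qualified local path is allowed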

Do we need a web server (like Apache) to access a .json file?

I was trying to read an info.json file, using the jQuery API. Please find the code below, which is part of test.html.
$.getJSON('info.json', function(data) {
    var items = [];
    $.each(data, function(key, val) {
        items.push('<li id="' + key + '">' + val + '</li>');
    });
});
The test.html file resides on my local machine, and when I try to open it in the browser, the Ajax call is not triggered and the info.json file is not read.
Is it not working because I don't have a web server? Or am I doing something wrong in the code? (I don't see any errors in the Firebug console, though.)
Thanks in advance.
You will always have to serve your page from a web server if you are making AJAX calls like this. Otherwise it will throw this exception:
origin null is not allowed by access-control-allow-origin
Host your page on a localhost server and I guess everything will work nicely.
While technically you don't need a web server for this, some of the libraries you use to abstract network access may not work with local files and some browsers don't let local files do a lot, so something like a little test web server for static files would be very useful for your development and testing.
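For example, here is a minimal static file server in Node.js, just enough to test the $.getJSON call above (assuming Node is installed; the file names are placeholders):
// serve.js - serve files from the current directory on localhost:8080
var http = require('http');
var fs = require('fs');
var path = require('path');

http.createServer(function (req, res) {
    // Map "/" to test.html, everything else to a file on disk.
    var file = path.join(__dirname, req.url === '/' ? 'test.html' : req.url);
    fs.readFile(file, function (err, data) {
        if (err) {
            res.writeHead(404);
            res.end('Not found');
            return;
        }
        res.writeHead(200);
        res.end(data);
    });
}).listen(8080, function () {
    console.log('Serving on http://localhost:8080');
});
Run it with node serve.js and open http://localhost:8080 instead of the file:// URL; the Ajax call will then be same-origin and should fire.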
Install a small webserver like http://jetty.codehaus.org/jetty/ - easy to install, and a small download ;)
By putting your JSON string into a text file and loading it in an iframe, you can extract the data. (Most browsers can load .txt files in iframes.)
var frame = document.createElement("IFRAME"); // Create new iframe
var body = document.body;
frame.onload = function() { // Extract the JSON once loaded
    data = JSON.parse(frame.contentDocument.documentElement.innerText); // Loads as a global.
    body.removeChild(frame); // Removes the frame once no longer necessary.
};
frame.style.display = "none"; // Because the frame will have to be appended to the body.
body.appendChild(frame);
frame.src = "your JSON.txt"; // Select source after the onload function is set.
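As a quick illustration of the result (the file name and contents here are made up):
// If "your JSON.txt" contains {"name": "test", "count": 3}, then once
// the onload handler above has fired, the global is ready to use:
console.log(data.name);  // "test"
console.log(data.count); // 3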

Loading local files in a WebBrowser Control

I'm trying to load local files in a WebBrowser control on Windows Phone 7.1, but I'm always getting exceptions or a blank page.
I tried with
Stream stream = Application.GetResourceStream(
    new Uri("./Html/par/index.html", UriKind.Relative)).Stream;
using (StreamReader reader = new StreamReader(stream))
{
    // Navigate to HTML document string
    this.webBrowser.NavigateToString(reader.ReadToEnd());
}
This just shows a blank page.
I set index.html and all the files it needs (css/js) to Content, and IsScriptEnabled to true.
Do you have an idea how to solve this problem?
I think the path is incorrect.
Do you have the /Html/par directories in your project? Secondly, is index.html set to Content?
try
var rs = Application.GetResourceStream(new Uri("myFile.html", UriKind.Relative));
using (StreamReader sr = new StreamReader(rs.Stream))
{
    this.webBrowser.NavigateToString(sr.ReadToEnd());
}
This might help:
http://phone7.wordpress.com/2010/08/08/loading-a-local-html-file-in-the-webbrowser-control/
This might help you understand the differences between Resource and Content:
http://invokeit.wordpress.com/2011/09/30/images-and-build-actio-settings-in-wp7/
This link details how to load the file and other linked files:
http://transoceanic.blogspot.co.uk/2011/07/wp7-load-local-html-files-and-all.html

wget + JavaScript?

I have this webpage that uses client-side JavaScript to format data on the page before it's displayed to the user.
Is it possible to somehow use wget to download the page and use some sort of client-side JavaScript engine to format the data as it would be displayed in a browser?
You could probably make that happen with something like PhantomJS
You can write a phantomjs script that will load the page like a browser would, and then either take screenshots or use JS to inspect the page and pull out data.
Here is a simple little phantomjs script that triggers javascript on a webpage and allows you to pull it down locally:
file: get.js
var page = require('webpage').create(),
    system = require('system'),
    address;
address = system.args[1];
page.scrollPosition = { top: 4000, left: 0 };
page.open(address, function(status) {
    if (status !== 'success') {
        console.log('** Error loading url.');
    } else {
        // Dump the DOM after the page's JavaScript has run.
        console.log(page.content);
    }
    phantom.exit();
});
Use it as follows:
$> phantomjs /path/to/get.js "http://www.google.com" > "google.html"
Change /path/to, the URL, and the filename to what you want.
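The script above just dumps the fully rendered HTML. If you would rather pull specific data out of the page (as mentioned above), PhantomJS's page.evaluate runs a function inside the page's context; a minimal sketch, with a placeholder selector:
// Inside page.open's success branch, instead of dumping page.content:
var headline = page.evaluate(function () {
    // This runs in the page's own context, after its JavaScript has executed.
    var el = document.querySelector('h1'); // placeholder selector
    return el ? el.textContent : null;
});
console.log(headline);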
Not with wget alone, as I doubt it includes any form of JavaScript engine. However, you could use WebKit to process the page, and thus produce the output.
Things like this could serve as a base for how to get the content: http://situated.wordpress.com/2008/06/04/take-screenshots-of-a-website-from-the-command-line/
