Finding text from webpage using javascript - javascript

I am trying to make a script, but I can't find a solution.
I'm trying to find a string on a website. The hard part is that I can't use
document.documentElement.innerHTML.search("string")
Since I can't do it locally, I want to use something like this:
var link = "myweb.com"
link.documentElement.innerHTML.search("string")
At the moment, my script generates the link, opens it and closes it: I just need to search the webpage for the word "error."

JavaScript running in a client's browser won't be able to retrieve another website's HTML for you (unless it is a different page on your own website, or the other site explicitly allows it through CORS headers). You may want to read about the Same-Origin Policy.
You can, however, use javascript as a language to do what you want - just not running inside of a browser. You can use something called Node.js, which is simply a program you can use to run javascript outside of a browser.
What it really boils down to is that if you want to scrape another website (which is the term for what you are trying to do), you typically need to make a scraper that runs on a server, and not a browser.
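As a rough illustration of the server-side approach, here is a minimal Node.js sketch that downloads a page and checks it for a word. It assumes Node 18 or later, where fetch is built in; the function names containsText and pageContainsText are made up for this example.

```javascript
// Case-insensitive substring check on HTML that has already been downloaded.
function containsText(html, needle) {
  return html.toLowerCase().includes(needle.toLowerCase());
}

// Download a page and report whether its HTML contains `needle`.
// Assumes Node 18+, where fetch is available globally.
async function pageContainsText(url, needle) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }
  const html = await response.text();
  return containsText(html, needle);
}
```

From a Node script, something like pageContainsText('http://myweb.com/', 'error').then(console.log) should print true or false. Note that this searches the raw markup, so it will also match the word inside tags or scripts.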
To be complete, a (probably shady) way to scrape another website is to:
Have your server-side code fetch another website's contents
Use AJAX to pass the contents to a client's browser
Have the client do all of the processing
Optionally send the scraped information back to your server
Here is a good article on scraping with nodeJS.

If you just need it to work on your own computer, you can make a userscript that will do this easily. If you want it to work as part of a hosted website, you need a server-side solution.

Related

Include Puppeteer (javascript) code inside of Static HTML

I am pretty new to front-end and trying to confirm conceptually first before I start to implement it.
For example, I want to return a static HTML file to the user's rest call request. That way, I can open HTML page on the user's side and get input. Along that line, I want to insert puppeteer code inside of the static HTML file to navigate to certain websites before getting the user's input.
Does it make sense? If not, can you please explain why?
It doesn't make sense to me. Puppeteer is a Node.js library. It's meant to be run on a machine, not in the browser. Based on what you want to do, I would explore using an <iframe> to load the other websites and writing JavaScript to control the <iframe> and get whatever input you need.

Altering a server file through javascript code

So I have a setup of a tablet connected to a Raspberry Pi computer. I want a webpage hosted on the Pi to be able to change the contents of a file also hosted on the Pi (which will be used in a Python script that I have written). I tried having the file inside a hidden iframe, but while my JavaScript ran, it never actually changed the contents.
How can I set up communication between the webpage and the server files? I know nothing about jQuery, but if I have to use it, I will.
While you can actually do something with files in HTML5, you must know that JavaScript in a web page is a client-side script. In other words, JS runs in the person's browser and not on your server.
Languages like PHP actually run on your server, and are therefore able to achieve what you want.
I'm not THE JavaScript expert, and you might even be able to modify a server file from browser JS, but it will be hacky, and you would still need to run some sort of API on your server that actually does the changing.
Save yourself the trouble of doing it like that and pick the right language for the job. I would suggest PHP. It's fairly easy to set up and run the website. PHP has plenty of ways to create, view and modify files on the server itself: http://php.net/manual/en/book.filesystem.php
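For comparison, the "API on your server" idea can also be sketched in JavaScript by running Node.js on the Pi instead of PHP. This is only a sketch under assumptions: the /setting route, the settings.txt file name and the function names are hypothetical, and there is no authentication or input validation.

```javascript
const fs = require('fs');
const http = require('http');
const os = require('os');
const path = require('path');

// Hypothetical location of the file that the Python script reads.
const SETTINGS_FILE = path.join(os.tmpdir(), 'settings.txt');

// Overwrite the file with the text sent by the page.
function writeSetting(filePath, text) {
  fs.writeFileSync(filePath, text, 'utf8');
}

// Accept "POST /setting" and store the request body in the file.
function handleRequest(req, res) {
  if (req.method !== 'POST' || req.url !== '/setting') {
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('Not found');
    return;
  }
  let body = '';
  req.on('data', (chunk) => { body += chunk; });
  req.on('end', () => {
    writeSetting(SETTINGS_FILE, body);
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('Saved');
  });
}

// Build the server without starting it; call .listen(port) to run it.
function createSettingsServer() {
  return http.createServer(handleRequest);
}
```

Running createSettingsServer().listen(8080) would start it. A page served from the same origin could then send fetch('/setting', { method: 'POST', body: 'new contents' }) to change the file.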

Use Python to open a web browser (on Windows), trigger JavaScript actions, and get the HTML contents?

Yes that sounds overly complicated.
I am trying to mine data from pages on our intranet. The pages are secure. The connection is refused when I try to get the contents with urllib.urlopen().
So I would like to use python to open a web browser to open the site then click some links that trigger javascript pop ups containing tables of info that I want to collect.
Any suggestions on where to begin?
I know the format of the page. It is something like this:
<div id="list">
<ul id="list item">
<li><a onclick="Openpopup('1');">blah</a></li>
</ul>
<ul></ul>
etc
Then a hidden frame becomes visible and the fields in the table within are filled.
<div>
<table>
<tr><td><span id="info_i_want">...
First off, I suggest that it's better to figure out what the page needs that JS is providing, and fake that - you'll have an easier time scraping the page if a browser isn't involved.
If it's just Javascript making an XMLHttpRequest, you can find the page from which the Javascript fetches the iframe data and connect directly to that.
But in spite of that you may need a library that does Javascript execution (if the reverse-engineering is too hard or it uses challenge tokens). A web-rendering framework like Gecko or WebKit might be appropriate.
Take a good look at Selenium if you insist on using a true web browser or cannot get the programmatic methods to work.
Once you've gotten the page contents via whatever method, you need an HTML parser (such as sgmllib or [almost] xml.dom). I suggest a DOM library. Parse the DOM and extract the contents from the appropriate node in the resulting tree.
The statement "The connection is refused when I try to get the contents with urllib.urlopen()" probably means you have to make a POST request using Python's urllib module. I would suggest you use urllib2. You may also need to handle cookies, the referrer, and the user agent from your Python code.
To see all the POST requests fired from your browser, use Firefox's live-http-headers.
For the JavaScript part:
Your best bet is to run a headless browser, e.g. PhantomJS, which understands all the intricacies of JavaScript, the DOM, etc. You will have to write your code in JavaScript, but the benefit is that you can do whatever you want.
As @phihag mentioned, Selenium is also a good option.
First of all, you should really find out why the connection is refused when you access the page with Python. Most likely, you'll have to perform HTTP authentication or specify a different User-Agent.
Firing up a browser, navigating, and getting the HTML back is a complex task. Luckily, you can implement it using selenium.
Consider taking a look at splinter which is a simpler webdriver API than Selenium.

How do JavaScript-based modal/popup services like KissInsights and Hello Bar work?

I'm developing a modal/popup system for my users to embed in their sites, along the lines of what KissInsights and Hello Bar (example here and here) do.
What is the best practice for architecting services like this? It looks like users embed a bit of JS but that code then inserts additional script tag.
I'm wondering how it communicates with the web service to get the user's content, etc.
TIA
You're right that usually it's simply a script that the customer embeds on their website. However, what comes after that is a bit more complicated matter.
1. Embed a script
The first step as said is to have a script on the target page.
Essentially this script is just a piece of JavaScript code. It's pretty similar to what you'd have on your own page.
This script should generate the content on the customer's page that you wish to display.
However, there are some things you need to take into account:
You can't use any libraries (or if you do, be very careful what you use): These may conflict with what is already on the page, and break the customer's site. You don't want to do that.
Never override anything, as overriding may break the customer's site: This includes event listeners, native object properties, whatever. For example, always use addEventListener or addEvent with events, because these allow you to have multiple listeners
You can't trust any styles: All styles of HTML elements you create must be inlined, because the customer's website may have its own CSS styling for them.
You can't add any CSS rules of your own: These may again break the customer's site.
These rules apply to any script or content you run directly on the customer site. If you create an iframe and display your content there, you can ignore these rules in any content that is inside the frame.
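As a minimal sketch of these rules in practice, the function below builds a bar using only inline styles and addEventListener. The name createBanner, the message and the specific style values are illustrative assumptions, not part of any real service.

```javascript
// Build a notification bar that relies only on inline styles and
// addEventListener, so it does not touch the host page's CSS or handlers.
// `doc` is normally the global `document`.
function createBanner(doc, message) {
  const bar = doc.createElement('div');
  bar.textContent = message;
  // Inline every style the bar depends on instead of adding CSS rules.
  Object.assign(bar.style, {
    position: 'fixed',
    top: '0',
    left: '0',
    width: '100%',
    padding: '8px',
    background: '#333',
    color: '#fff',
    zIndex: '2147483647',
  });
  // Add a listener rather than assigning bar.onclick, so no existing
  // handler is overridden.
  bar.addEventListener('click', function () {
    bar.style.display = 'none';
  });
  return bar;
}
```

On a real page this would be used as document.body.appendChild(createBanner(document, 'Hello')). Passing the document in as a parameter is a design choice that keeps the function easy to exercise outside a browser.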
2. Process script on your server
Your embeddable script should usually be generated by a script on your server. This allows you to include logic such as choosing what to display based on parameters, or data from your application's database.
This can be written in any language you like.
Typically your script URL should include some kind of an identifier so that you know what to display. For example, you can use the ID to tell which customer's site it is or other things like that.
If your application requires users to log in, you can process this just like normal. The fact the server-side script is being called by the other website makes no difference.
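To make the identifier idea concrete, here is a small server-side sketch in Node.js that reads an id query parameter and returns different JavaScript per customer. The customer table, the id parameter name and the myWidgetMessage variable are all hypothetical.

```javascript
// Hypothetical per-customer settings; a real service would load these
// from its database.
const CUSTOMERS = new Map([
  ['acme', { message: 'Welcome to Acme!' }],
  ['globex', { message: 'Globex spring sale' }],
]);

// Read the `id` query parameter from the script URL and build the
// JavaScript source to send back for that customer.
function renderEmbedScript(requestUrl) {
  // The base is only needed so that relative request paths parse.
  const url = new URL(requestUrl, 'http://your.site.com');
  const customer = CUSTOMERS.get(url.searchParams.get('id'));
  if (!customer) {
    return '/* unknown customer id */';
  }
  // JSON.stringify produces a properly quoted string literal.
  return 'window.myWidgetMessage = ' + JSON.stringify(customer.message) + ';';
}
```

A request handler would send the returned string with a JavaScript content type in response to a request such as /somecode.js?id=acme.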
3. Communication between the embedded script and your server or frames
There are a few tricks to this as well.
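As you may know, XMLHttpRequest is blocked across different domains by default (unless the target server opts in with CORS headers), so you generally can't rely on that from the customer's page.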
The simplest way to send data over from the other site would be to use an iframe and have the user submit a form inside the iframe (or run an XMLHttpRequest inside the frame, since the iframe's content resides on your own server so there is no cross domain communication)
If your embedded script displays content in an iframe dialog, you may need to be able to tell the script embedded on the customer site when to close the iframe. This can be achieved for example by using window.postMessage
For postMessage, see http://ejohn.org/blog/cross-window-messaging/
For cross-domain communication, see http://softwareas.com/cross-domain-communication-with-iframes
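A minimal sketch of the closing step with window.postMessage might look like this on the customer's page. The 'close-dialog' message string, the origin value and the function names are assumptions chosen for illustration.

```javascript
// Origin of the page shown inside the iframe (hypothetical; use your
// own service's origin).
const WIDGET_ORIGIN = 'http://your.site.com';

// Decide whether a message event is a genuine "close" request.
// Checking event.origin prevents other frames from closing the dialog.
function isCloseRequest(event, trustedOrigin) {
  return event.origin === trustedOrigin && event.data === 'close-dialog';
}

// Remove the iframe when the framed page asks for it.
// `win` is normally the global `window`.
function installCloseHandler(win, iframe) {
  win.addEventListener('message', function (event) {
    if (isCloseRequest(event, WIDGET_ORIGIN)) {
      iframe.remove();
    }
  });
}
```

The page inside the iframe would then call window.parent.postMessage('close-dialog', '*') when the user dismisses the dialog. Using '*' as the target origin is tolerable here only because the message carries no sensitive data.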
You could take a look here - it's an example of an API created using my JsApiToolkit, a framework for allowing service providers to easily create and distribute Facebook Connect-like tools to third-party sites.
The library is built on top of easyXDM for Cross Domain Messaging, and facilitates interaction via modal dialogs or via popups.
The code and the readme should be sufficient to explain how things fit together (it's really not too complicated once you abstract away things like the XDM).
About the embedding itself; you can do this directly, but most services use a 'bootstrapping' script that can easily be updated to point to the real files - this small file could be served with a cache pragma that would ensure that it was not cached for too long, while the injected files could be served as long living files.
This way you only incur the overhead of re-downloading the bootstrapper instead of the entire set of scripts.
Best practice is to put as little code as possible into your code snippet, so you don't ever have to ask the users to update their code. For instance:
<script type="text/javascript" src="http://your.site.com/somecode.js"></script>
Works fine if the author will embed it inside their page. Otherwise, if you need a bookmarklet, you can use this code to load your script on any page:
javascript:(function(){
var e=document.createElement('script');
e.setAttribute('language','javascript');
e.setAttribute('src','http://your.site.com/somecode.js');
document.head.appendChild(e);
})();
Now all your code will live at the above referenced URI, and whenever their page is loaded, a fresh copy of your code will be downloaded and executed. (not taking caching settings into account)
From that script, just make sure that you don't clobber namespaces, and check if a library exists before loading another. Use the safe jQuery object instead of $ if you are using that. And if you want to load more external content (like jQuery, UI stuff, etc.) use the onload handler to detect when they are fully loaded. For example:
function jsLoad(loc, callback){
var e=document.createElement('script');
e.setAttribute('language','javascript');
e.setAttribute('src',loc);
if (callback) e.onload = callback;
document.head.appendChild(e);
}
Then you can simply call this function to load any js file, and execute a callback function.
jsLoad('http://link.to/some.js', function(){
// do some stuff
});
Now, a tricky way to communicate with your domain to retrieve data is to use javascript as the transport. For instance:
jsLoad('http://link.to/someother.js?data=xy&callback=getSome', function(){
var yourData = getSome();
});
Your server will have to dynamically process that route, and return some javascript that has a "getSome" function that does what you want it to. For instance:
function getSome(){
return {'some':'data','more':'data'};
}
That will pretty effectively allow you to communicate with your server and process data from anywhere your server can get it.
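On the server side, generating that response could be sketched as follows in Node.js. The function name renderCallbackScript is hypothetical; the important assumption is that the callback name comes from the query string and therefore has to be validated before it is written into the script.

```javascript
// Build the JavaScript source that the server sends back for a request
// such as someother.js?callback=getSome.
function renderCallbackScript(callbackName, data) {
  // Only accept plain identifiers, so the query string cannot inject code.
  if (!/^[A-Za-z_$][A-Za-z0-9_$]*$/.test(callbackName)) {
    throw new Error('Invalid callback name');
  }
  return 'function ' + callbackName + '(){return ' + JSON.stringify(data) + ';}';
}
```

Calling renderCallbackScript('getSome', { some: 'data', more: 'data' }) yields a script that defines getSome just like the hand-written version above.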
You can serve a dynamically generated JS file from your server (using, for example, PHP or Ruby on Rails to generate the file on each request) that is imported from the customer's website like this:
<script type="text/javascript" src="//www.yourserver.com/dynamic.js"></script>
Then you need to provide a way for your customer to decide what they want the modal/popup to contain (e.g. text, graphics, links etc.). Either you create a simple CMS or you do it manually for each customer.
Your server can see where each request for the JS file is coming from and provide different JS code based on that. The JS code can, for example, insert HTML into your customer's website that creates a bar at the top with some text and a link.
If you want to access information about your customers' visitors, you probably need to either read it from the HTML, have your customers provide the information in a specific way, or figure out a different way to access it from each customer's web server.

Is it possible to access other webpages from within another page

Basically, what I'm trying to do is make a small script that finds the most recent post in a forum and pulls some text or an image out of it. I have this working in Python, using the htmllib module and some regex. But the script still isn't very convenient as is; it would be much nicer if I could somehow put it into an HTML document. It appears that simply embedding Python scripts is not possible, so I'm looking to see if there's a feature similar to Python's htmllib that can be used to access some other webpage and extract some information from it.
(Essentially, if I could get this script going in the form of an html document, I could just open one html document, rather than navigate to several different pages to get the information I want to check)
I'm pretty sure that javascript doesn't have the functionality I need, but I was wondering about other languages such as jQuery, or even something like AJAX?
As Greg mentions, an Ajax solution will not work "out of the box" when trying to load from remote servers.
If, however, you are trying to load from the same server, it should be fairly straightforward. I'm presenting this answer to show how this could be done using jQuery in just a few lines of code.
<div id="placeholder">Please wait, loading...</div>
<script type="text/javascript" src="/path/to/jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
$('#placeholder').load('/path/to/my/locally-served/page.html');
});
</script>
If you are trying to load a resource from a different server than the one you're on, one way around the security limitations would be to offer a proxy script, which could fetch the remote content on the server, and make it seem like it's coming from your own domain.
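Such a proxy could be sketched in Node.js roughly as below. This assumes Node 18+ for the built-in fetch; the /proxy route, the url parameter and the allow-listed host forum.example.com are hypothetical. Restricting the proxy to known hosts is a deliberate choice, since an open proxy can be abused.

```javascript
// Hosts the proxy is willing to fetch from (hypothetical allow-list).
const ALLOWED_HOSTS = new Set(['forum.example.com']);

// Turn "/proxy?url=..." into a validated target URL, or null if the
// request is malformed or points at a host that is not allowed.
function resolveProxyTarget(requestUrl) {
  const raw = new URL(requestUrl, 'http://localhost').searchParams.get('url');
  if (!raw) return null;
  let target;
  try {
    target = new URL(raw);
  } catch (err) {
    return null;
  }
  const okProtocol = target.protocol === 'http:' || target.protocol === 'https:';
  return okProtocol && ALLOWED_HOSTS.has(target.hostname) ? target : null;
}

// Request handler suitable for Node's http.createServer.
async function handleProxyRequest(req, res) {
  const target = resolveProxyTarget(req.url);
  if (!target) {
    res.writeHead(400, { 'Content-Type': 'text/plain' });
    res.end('Bad or disallowed url parameter');
    return;
  }
  try {
    // Fetch the remote page on the server and relay it to the browser.
    const upstream = await fetch(target);
    res.writeHead(upstream.status, { 'Content-Type': 'text/html' });
    res.end(await upstream.text());
  } catch (err) {
    res.writeHead(502, { 'Content-Type': 'text/plain' });
    res.end('Upstream request failed');
  }
}
```

If this handler is served from the same origin as the page, the jQuery call becomes $('#placeholder').load('/proxy?url=' + encodeURIComponent('http://forum.example.com/latest')).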
Here are the docs on jQuery's load method : http://docs.jquery.com/Ajax/load
There is one other nice feature to note, which is partial page loading. For example, let's say your remote page is a full HTML document, but you only want the content of a single div in that page. You can pass a selector to the load method after the URL, and this will further simplify your task. For example,
$('#placeholder').load('/path/to/my/locally-served/page.html #someTargetDiv');
Best of luck!-Mike
There are two general approaches:
Modify your Python code so that it runs as a CGI (or WSGI or whatever) module and generate the page of interest by running some server side code.
Use Javascript with jQuery to load the content of interest by running some client side code.
The difference between these two approaches is where the third party server sees the requests coming from. In the first case, it's from your web server. In the second case, it's from the browser of the user accessing your page.
Some browsers may not handle loading content from third party servers very gracefully (that is, they might pop up warning boxes or something).
You can embed Python. The most straightforward way would be to use the cgi module. If the script will be run often and you're using Apache it would be more efficient to use mod_python or mod_wsgi. You could even use a Python framework like Django and code the entire site in Python.
You could also code this in Javascript, but it would be much trickier. There's a lot of security concerns with cross-site requests (ah, the unsafe internet) and so it tends to be a tricky domain when you try to do it through the browser.
