How to look up URL names in JavaScript - javascript

How can you use JavaScript to parse out the URL of a page?

First of all, you have to decide whether you want to do this on the client or on the server. On the server, you can load the XML and use XPath to locate the part of the XML DOM tree that contains the site:
//site/name[text() = 'Blah00']
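For example, here is a minimal sketch of the server-side approach in Node, assuming the xpath and xmldom packages (the sites.xml file name is just for illustration):
var fs = require('fs');
var xpath = require('xpath');
var DOMParser = require('xmldom').DOMParser;

// Load the XML and locate the <name> node for the site 'Blah00'
var doc = new DOMParser().parseFromString(fs.readFileSync('sites.xml', 'utf8'), 'text/xml');
var nodes = xpath.select("//site/name[text() = 'Blah00']", doc);
if (nodes.length > 0) {
  console.log(nodes[0].toString()); // e.g. <name>Blah00</name>
}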
When using JavaScript on the client, a better solution would be to have a server which keeps the current status per site (use a database or some in-memory structure). Then use AJAX requests to ask the server for the information for a certain site. jQuery will make your life much easier.
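For the client-side approach, a minimal jQuery sketch might look like this; the /api/site-status endpoint and the shape of the JSON response are assumptions, not part of the original answer:
// Ask the (hypothetical) server endpoint for the current status of one site.
$.getJSON('/api/site-status', { site: 'Blah00' }, function (data) {
  // Assumed response shape: { name: 'Blah00', status: 'online' }
  console.log('Status for ' + data.name + ': ' + data.status);
});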

I have solved this:
<script>
function mySiteURL() {
  var myURL = window.location.href;
  // Take the part of the URL between the last "-" and the last "."
  var dashIndex = myURL.lastIndexOf("-");
  var dotIndex = myURL.lastIndexOf(".");
  var result = myURL.substring(dashIndex + 1, dotIndex);
  return result;
}
</script>

Related

Can I scrape this site using just Node?

I'm very new to JavaScript, so please be patient.
I've been trying to scrape a site and collect all the product URLs into a list that I will use later in another function, like this:
var url = 'https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx';
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio");

function getURLS(url) {
  request(url, function (err, resp, body) {
    var linklist = [];
    var $ = cheerio.load(body);
    var links = $('#productResults a');
    // Collect each unique href from the product results
    for (var valor in links) {
      if (links[valor].attribs && links[valor].attribs.href && linklist.indexOf(links[valor].attribs.href) == -1) {
        linklist.push(links[valor].attribs.href);
      }
    }
    // Turn the relative links into absolute URLs
    var extended_links = [];
    linklist.forEach(function (link) {
      var extended_link = 'https://www.fromuthtennis.com/frm/' + link;
      extended_links.push(extended_link);
    });
    console.log(extended_links);
  });
}
This does work unless you go to the second page of items like this:
var url = 'https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]';
var http = require('http-get');
var request = require("request");
var cheerio = require("cheerio"); //etc...
As far as I know, this happens because the content on the page is loaded dynamically.
To get the contents of the page I believe I need to use PhantomJS, because that would allow me to get the HTML code after the page has fully loaded, so I installed the phantomjs-node module. I want to use NodeJS to get the URL list because the rest of my code is written in it.
I've been reading a lot about PhantomJS, but using phantomjs-node is tricky, and I still don't understand how I could get the URL list with it because I'm very new to JavaScript and coding in general.
If someone could guide me a little bit, I'd appreciate it a lot.
Yes, you can. That page looks like it implements Google's Ajax Crawling URL.
Basically, it allows websites to generate crawler-friendly content for Google. Whenever you see a URL like this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]
You need to convert it to this:
https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx?_escaped_fragment_=Filter%3D%5Bpagenum%3D2*ava%3D1%5D
The conversion is simple: take the base path, https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx, and add a query parameter _escaped_fragment_ whose value is the URL fragment Filter=[pagenum=2*ava=1] encoded into Filter%3D%5Bpagenum%3D2*ava%3D1%5D using standard URI encoding.
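As a rough sketch, that conversion could be done like this (the convertToEscapedFragment name is just for illustration):
// Turn a #Filter=[...] URL into its _escaped_fragment_ equivalent.
function convertToEscapedFragment(url) {
  var hashIndex = url.indexOf('#');
  if (hashIndex === -1) return url; // nothing to convert
  var base = url.substring(0, hashIndex);
  var fragment = url.substring(hashIndex + 1); // e.g. Filter=[pagenum=2*ava=1]
  return base + '?_escaped_fragment_=' + encodeURIComponent(fragment);
}

console.log(convertToEscapedFragment('https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]'));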
You can read the full specification here: https://developers.google.com/webmasters/ajax-crawling/docs/specification
Note: This does not apply to all websites, only websites that implement Google's Ajax Crawling URL. But you're in luck in this case.
You can see any product you want without relying on dynamic content by using this URL:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID={product_id}
For example to see product 37023:
https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID=37023
All you have to do is for (var productid = 0; productid < 40000; productid++) { request... }.
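A minimal sketch of that idea (the ID range and the title selector are assumptions, and in practice you would want to throttle the requests rather than fire thousands at once):
var request = require("request");
var cheerio = require("cheerio");

// Fetch a small range of product pages and print the title of each one that exists.
for (var productid = 37020; productid < 37030; productid++) {
  (function (id) {
    var productUrl = 'https://www.fromuthtennis.com/frm/showproduct.aspx?ProductID=' + id;
    request(productUrl, function (err, resp, body) {
      if (err || resp.statusCode !== 200) return; // skip missing products
      var $ = cheerio.load(body);
      console.log(id, $('title').text().trim());
    });
  })(productid);
}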
Another approach is to use the phantom module (https://www.npmjs.com/package/phantom). It lets you run PhantomJS commands directly from your NodeJS app.
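For instance, a minimal sketch along the lines of the phantom package's README (assuming a recent version with the Promise-based API):
var phantom = require('phantom');

// Load the page in headless PhantomJS, let it render, then grab the resulting HTML.
phantom.create().then(function (instance) {
  return instance.createPage().then(function (page) {
    return page.open('https://www.fromuthtennis.com/frm/c-10-mens-tops.aspx#Filter=[pagenum=2*ava=1]')
      .then(function () {
        return page.property('content'); // the rendered HTML
      })
      .then(function (html) {
        // The HTML can now be passed to cheerio exactly as in getURLS() above.
        console.log(html.length);
        return instance.exit();
      });
  });
});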

Scraping authenticated website in node.js

I want to scrape my college website (Moodle) with node.js, but I haven't found a headless browser able to do it. I have done it in Python in just 10 lines of code using RoboBrowser:
from robobrowser import RoboBrowser
url = "https://cas.upc.edu/login?service=https%3A%2F%2Fatenea.upc.edu%2Fmoodle%2Flogin%2Findex.php%3FauthCAS%3DCAS"
browser = RoboBrowser()
browser.open(url)
form = browser.get_form()
form['username'] = 'myUserName'
form['password'] = 'myPassword'
browser.submit_form(form)
browser.open("http://atenea.upc.edu/moodle/")
print browser.parsed
The problem is that the website requires authentication. Can you help me? Thanks!
PS: I think this can be useful https://www.npmjs.com/package/form-scraper but I can't get it working.
Assuming you want to read a 3rd party website, and 'scrape' particular pieces of information, you could use a library such as cheerio to achieve this in Node.
Cheerio is a "lean implementation of core jQuery designed specifically for the server". This means that given a String representation of a DOM (or part thereof), cheerio can traverse it in much the same way as jQuery can.
An example from Max Ogden shows how you can use the request module to grab HTML from a remote server and then pass it to cheerio:
var $ = require('cheerio')
var request = require('request')

function gotHTML(err, resp, html) {
  if (err) return console.error(err)
  var parsedHTML = $.load(html)

  // get all anchor tags and keep the ones that link to .png images
  var imageURLs = []
  parsedHTML('a').map(function(i, link) {
    var href = $(link).attr('href')
    if (!href.match('.png')) return
    imageURLs.push(domain + href)
  })
}

var domain = 'http://substack.net/images/'
request(domain, gotHTML)
Selenium has support for multiple languages, platforms, and browsers.

How to edit data of an XML node with JavaScript

I want to write some data into an existing local XML file with JavaScript, using some text from an HTML page. Is it possible to change the content of nodes?
Here is XML sample:
<Notepad>
<Name>Player1</Name>
<Notes>text1</Notes>
</Notepad>
I will get some more text from an input and want to add it after "text1", but I can't find a solution.
function SaveNotes(content, player)
{
  var xml = "serialize.xml";
  var xmlTree = parseXml("<Notepad></Notepad>");
  var str = xmlTree.createElement("Notes");
  $(xmlTree).find("Notepad").find(player).append(str);
  $(xmlTree).find("Notes").find(player).append(content);
  var xmlString = (new XMLSerializer()).serializeToString(xmlTree);
}
Here is the code to manipulate XML content or an XML file:
[Update]
Please check this Fiddle
var parseXml;

parseXml = function(xmlStr) {
  return (new window.DOMParser()).parseFromString(xmlStr, "text/xml");
};

var xmlTree = parseXml("<root></root>");

function add_children(child_name, parent_name) {
  var str = xmlTree.createElement(child_name);
  //strXML = parseXml(str);
  $(xmlTree).find(parent_name).append(str);
  $(xmlTree).find(child_name).append("hello");
  var xmlString = (new XMLSerializer()).serializeToString(xmlTree);
  alert(xmlString);
}

add_children("apple", "root");
add_children("orange", "root");
add_children("lychee", "root");
You can use it for searching in XML as well as adding new nodes with content in them. (Sorry, I don't know how to load XML from the client side and display it.)
But this fiddle demo will be helpful for adding content to XML and searching in it.
Hope it helps :)
If you want to achieve this on the client side, you can parse your XML into a document object.
See
https://developer.mozilla.org/en-US/docs/Web/Guide/Parsing_and_serializing_XML
and
http://www.w3schools.com/xml/tryit.asp?filename=tryxml_parsertest2
You can then manipulate it like you would the DOM of any HTML doc, e.g. createElement, appendChild, etc.
See https://developer.mozilla.org/en-US/docs/Web/API/Document/createElement
Then, to serialize it into a String again, you could use https://developer.mozilla.org/en-US/docs/Web/API/Element/outerHTML
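A minimal sketch of that approach, using the sample XML from the question (the appendNote name is just for illustration):
// Parse the XML string into a Document, append text to <Notes>, and serialize it back.
function appendNote(xmlString, extraText) {
  var doc = new DOMParser().parseFromString(xmlString, "text/xml");
  var notes = doc.getElementsByTagName("Notes")[0];
  notes.appendChild(doc.createTextNode(" " + extraText));
  return new XMLSerializer().serializeToString(doc);
}

var xml = "<Notepad><Name>Player1</Name><Notes>text1</Notes></Notepad>";
console.log(appendNote(xml, "more text from the input"));
// -> <Notepad><Name>Player1</Name><Notes>text1 more text from the input</Notes></Notepad>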
Persisting the data
Writing to a local file is not possible in a cross-browser way. In IE you could use ActiveX to read/write files.
You could use cookies to store data on the client side, if your data keeps small enough.
In HTML5 you could use local storage, see http://www.w3schools.com/html/html5_webstorage.asp
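For example, a small sketch that keeps the serialized XML string in local storage (the 'notepad-xml' key is arbitrary):
var xmlString = "<Notepad><Name>Player1</Name><Notes>text1</Notes></Notepad>";

// Save the string so it survives page reloads.
localStorage.setItem('notepad-xml', xmlString);

// Later, read it back and parse it into a document again.
var saved = localStorage.getItem('notepad-xml');
if (saved) {
  var savedDoc = new DOMParser().parseFromString(saved, "text/xml");
}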
Try using these two packages: one to convert the XML to JSON and, when you are finished, the other to convert it back:
https://www.npmjs.com/package/xml2json
https://www.npmjs.com/package/js2xmlparser
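A rough sketch of that round trip (the exact calls, xml2json.toJson with { object: true } and js2xmlparser.parse, are assumptions based on recent versions of those packages, so check their docs):
var xml2json = require('xml2json');
var js2xmlparser = require('js2xmlparser');

var xml = '<Notepad><Name>Player1</Name><Notes>text1</Notes></Notepad>';

// XML -> plain JavaScript object
var notepad = xml2json.toJson(xml, { object: true }).Notepad;

// Edit the data as an ordinary object
notepad.Notes += ' some new text from the input';

// JavaScript object -> XML again
console.log(js2xmlparser.parse('Notepad', notepad));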

C# Faster way to get JavaScript DOM than EO.WebBrowser

I have code in place where I'm using EO.WebBrowser to get the html from a page using the EO.WebView Request:
var cookie = new EO.WebBrowser.Cookie("cookie", "value");
cookie.Path = path;
cookie.Domain = domain;
var options = new BrowserOptions();
options.EnableWebSecurity = false;
Runtime.SetDefaultOptions(options);
var request = new Request(url);
request.Cookies.Add(cookie);
webView.LoadRequestAndWait(request);
Finally I use the following to get the HTML I need:
webView.GetDOMWindow().document.body.outerHTML
My issue is that this is very slow and although I can get it to run it locally, I can not get it to run on Azure server code. Is there a way to do the same thing using HttpWebRequest?
You can use JavaScript:
var data = (string)webView.EvalScript("document.body.outerHTML");
No, HttpWebRequest (and other similar "get me the HTML response" methods) will only give you the HTML itself and will not run JavaScript on the page.
For server-side processing of dynamic HTML, consider using a proper headless browser instead of trying to convince regular IE to work correctly without a UI.
EO.WebBrowser runs multi-process like Chrome and is unsupported in many cloud service environments.
Just use WebClient, HttpWebRequest, RestSharp, or something similar that can make HTTP requests to get the response HTML.

How to get hyperlink IDs in another HTML file

I am doing a project that requires a web site. On this site I have to draw a state diagram for hyperlinks, that is, how the hyperlinks are attached to one another on the site. I am using HTML. How do I get a hyperlink ID in another HTML file? I know about document.getElementById.
Thanks in advance
That would require a way to access another HTML file through AJAX, which is not possible if it isn't on your domain or if CORS isn't enabled.
There are, however, quite a few things you could do:
Use your own server-side as proxy for fetching the HTML file.
Do the processing on the server-side and let JavaScript plot the data.
Do everything on the server-side.
If you'd like to get the IDs of links, you should use an HTML parser. Modern browsers include one, called DOMParser. You'd do something like this:
var parser = new DOMParser();
var doc = parser.parseFromString(yourHTMLSource, 'text/html');
var links = doc.getElementsByTagName('a');
for (var i = 0, length = links.length; i < length; i++) {
  links[i].getAttribute('id'); // -> Returns the ID of the link, if any
}
As I remember it, IE doesn't support this, but it has its own module for HTML parsing with some different methods that is still relatively easy to use.
