Node.js Scraping ASU Course - javascript

I'm pretty new to Node.js, so apologies in advance if I don't know what I'm talking about.
I'm trying to scrape some courses off ASU's course catalog (https://webapp4.asu.edu/catalog/) and have made numerous attempts using Zombie, Node.IO, and the HTTPS API. In every case I've run into a redirect loop.
I'm wondering if it's because I'm not setting my headers properly?
Below is a sample of the code I used (not Zombie/Node.IO):
var https = require('https');
var options = {
    host: 'webapp4.asu.edu',
    path: '/catalog',
    method: 'GET',
    headers: {
        'set-cookie': 'onlineCampusSelection=C'
    }
};
var req = https.request(options, function(res) {
    console.log("statusCode: ", res.statusCode);
    console.log("headers: ", res.headers);
    res.on('data', function(d) {
        process.stdout.write(d);
    });
});
req.end(); // the request is not sent until end() is called
Just to clarify, I'm not having trouble with scraping with Node.js in general. It is specifically ASU's course catalog that is giving me trouble.
Appreciate any ideas you guys could give me, thanks!
Update: My request successfully went through if I create a cookie with a JSESSIONID I got from Chrome/FF. Is there a way for me to request/create a JSESSIONID?

It looks like the server sets the JSESSIONID cookie and then redirects away, so you need to tell node.js not to follow redirects if you want to grab the cookie. I don't know how to do this with the http or https packages, but there is another package you can get via npm: request, which lets you do it. Here's a sample that should get you started:
var request = require("request");
var options = {
    url: "https://webapp4.asu.edu/catalog/",
    followRedirect: false // option names are case-sensitive
};
request.get(options, function(error, response, body) {
    if (error) { return console.error(error); }
    console.log(response.headers['set-cookie']);
});
Output should look something like this:
[ 'JSESSIONID=B43CC3BB09FFCDE07AE6B3B702717431.catalog1; Path=/catalog; Secure' ]
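Once you have the cookie, the second step is to send it back on the next request. Here is a minimal sketch of both steps using only the core https module, which does not follow redirects on its own. It assumes the server behaves as described above (sets JSESSIONID on the first response and accepts it on the second); cookieHeaderFrom and fetchCatalog are names I made up for illustration.

```javascript
var https = require('https');

// Turn the 'set-cookie' response header (an array of strings) into a value
// for a request's 'Cookie' header: keep only each "name=value" pair.
function cookieHeaderFrom(setCookie) {
  return (setCookie || []).map(function (c) {
    return c.split(';')[0].trim();
  }).join('; ');
}

// Step 1: request the catalog and capture the cookie set on the first response.
// Step 2: repeat the request with that cookie attached.
function fetchCatalog(callback) {
  var base = { host: 'webapp4.asu.edu', path: '/catalog/', method: 'GET' };
  https.request(base, function (first) {
    first.resume(); // discard the body of the first (redirect) response
    var cookie = cookieHeaderFrom(first.headers['set-cookie']);
    var withCookie = {
      host: base.host,
      path: base.path,
      method: 'GET',
      headers: { Cookie: cookie }
    };
    https.request(withCookie, function (second) {
      var body = '';
      second.setEncoding('utf8');
      second.on('data', function (chunk) { body += chunk; });
      second.on('end', function () { callback(null, second.statusCode, body); });
    }).on('error', callback).end();
  }).on('error', callback).end();
}
```

Call it as fetchCatalog(function (err, status, html) { ... }). If the site needs more cookies than the session one (the question mentions onlineCampusSelection), they would have to be appended to the same Cookie header.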

I'd highly recommend using jsdom in conjunction with jQuery (for Node). I've used it many times for scraping, as it makes it super easy.
Here's the example from jsdom's readme:
// Count all of the links from the nodejs build page
var jsdom = require("jsdom");
jsdom.env("http://nodejs.org/dist/", [
'http://code.jquery.com/jquery-1.5.min.js'
],
function(errors, window) {
console.log("there have been", window.$("a").length, "nodejs releases!");
});
Hope that helps, jsdom has made it real easy to hack together scraping experiments (for me at least).

Related

how to: wsse soap request in javascript (node)

I need to communicate with a soap:xml API from a node server on the Wix.com platform. The API requires Soap WSSE authentication.
I can send an authenticated request to the endpoint in SoapUI, but I haven't been able to do this successfully on the Wix node platform.
Wix only has a subset of node packages available for install, and XMLHttpRequest is not available in their environment.
I have tried node-soap but receive errors which indicate the package might be buggy on the Wix node platform.
I've found myself using the node "request" (https://www.npmjs.com/package/request) package and trying to roll my own solution to work around missing node packages and environment restrictions.
Currently I can send a request to the end point however I receive the following response;
<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"http://schemas.xmlsoap.org/soap/envelope/\"><SOAP-ENV:Body><SOAP-ENV:Fault><faultcode>SOAP-ENV:Client</faultcode><faultstring>Access denied</faultstring></SOAP-ENV:Fault></SOAP-ENV:Body></SOAP-ENV:Envelope>\n
This suggests to me that I'm not authenticating correctly.
As I mentioned, I've been able to successfully send requests and receive the expected responses via SoapUI, so the API is functioning and I suspect my implementation is at fault. I'll be honest: I've worked with REST/JSON APIs in the past, but it has been a long time since I've worked with a SOAP API, and I remember it being painful even back then!
My request code:
import request from 'request';
import {wsseHeaderAssoc} from 'backend/wsse';
export function getLocationID() {
let apiUsername = "username";
let apiPassword = "password";
let apiURL = "https://api.serviceprovider.com/wsdl";
// WSSE authentication header vars
    let wsse = wsseHeaderAssoc(apiUsername, apiPassword);
let wsseUsername = wsse["Username"];
let wssePasswordDigest = wsse["PasswordDigest"];
let wsseCreated = wsse["Created"];
let wsseNonce = wsse["Nonce"];
let xml =
`<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:masked:api">`+
`<soapenv:Header>`+
`<wsse:Security xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">`+
`<wsse:UsernameToken wsu:Id="UsernameToken-19834957983507345987345987345">`+
`<wsse:Username>${wsseUsername}</wsse:Username>`+
`<wsse:Password Type="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-username-token-profile-1.0#PasswordDigest">${wssePasswordDigest}</wsse:Password>`+
`<wsse:Nonce EncodingType="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-soap-message-security-1.0#Base64Binary">${wsseNonce}</wsse:Nonce>`+
`<wsu:Created>${wsseCreated}</wsu:Created>`+
`</wsse:UsernameToken>`+
`</wsse:Security>`+
`</soapenv:Header>`+
`<soapenv:Body>`+
...
`</soapenv:Body>`+
`</soapenv:Envelope>`
var options = {
url: apiURL,
method: 'POST',
body: xml,
headers: {
'Content-Type':'text/xml;charset=utf-8',
'Accept-Encoding': 'gzip,deflate',
'Content-Length': Buffer.byteLength(xml), // length in bytes, not characters
'SOAPAction':"https://api.serviceprovider.com/wsdl/service",
'User-Agent':"Apache-HttpClient/4.1.1 (java 1.5)",
'Connection':"Keep-Alive"
}
};
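let callback = (error, response, body) => {
if (!error && response.statusCode == 200) {
console.log('Raw result ', response);
// If you ever get this working, do some mad magic here
return;
}
console.log('Error ', error || response);
};
// Send the request; without this call the callback above never runs
request(options, callback);
}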
I'm using wsse-js (https://github.com/vrruiz/wsse-js/blob/master/wsse.js) to generate the PasswordDigest, Created timestamp and Nonce, as the node wsse package (https://www.npmjs.com/package/wsse) isn't available on Wix. I've read over the code, and based on what I've read elsewhere it looks like a good implementation.
I made one small addition to return the generated details in an assoc array;
export function wsseHeaderAssoc(Username, Password) {
var w = wsse(Password);
var wsseAssoc = {}; // plain object; arrays are not meant for string keys
wsseAssoc["Username"] = Username;
wsseAssoc["PasswordDigest"] = w[2];
wsseAssoc["Created"] = w[1];
wsseAssoc["Nonce"] = w[0];
return wsseAssoc;
}
As stated earlier, I'm receiving this response:
<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"http://schemas.xmlsoap.org/soap/envelope/\"><SOAP-ENV:Body><SOAP-ENV:Fault><faultcode>SOAP-ENV:Client</faultcode><faultstring>Access denied</faultstring></SOAP-ENV:Fault></SOAP-ENV:Body></SOAP-ENV:Envelope>\n
I'm expecting a valid SOAP XML response instead.
I've used the raw XML structure and headers from SoapUI to construct this, and everything looks fine; I really have no idea where I'm going wrong.
I would love any pointers anyone could throw my way. I've lost 2 days trying to brute force this, and I need help.
You can use the WSSecurity class from the soap package. An example from their README:
var options = {
hasNonce: true,
actor: 'actor'
};
var wsSecurity = new soap.WSSecurity('username', 'password', options)
client.setSecurity(wsSecurity);
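If the soap package is not usable on Wix, as the question suggests, the digest can also be computed by hand. A minimal sketch, assuming the service follows the OASIS UsernameToken Profile 1.0, where PasswordDigest = Base64(SHA-1(nonce + created + password)) and the nonce that goes into the hash is the raw bytes, not its Base64 text (mixing those up is a common cause of "Access denied"). wsseToken is a hypothetical helper name, and whether Node's crypto module is available on Wix is an assumption.

```javascript
var crypto = require('crypto');

// Build the UsernameToken fields. nonceBytes (a Buffer) and created (an ISO
// timestamp string) are optional; they are generated when not supplied.
function wsseToken(username, password, nonceBytes, created) {
  nonceBytes = nonceBytes || crypto.randomBytes(16);
  created = created || new Date().toISOString();
  // Hash the raw nonce bytes followed by the created and password strings.
  var digest = crypto.createHash('sha1')
    .update(Buffer.concat([
      nonceBytes,
      Buffer.from(created, 'utf8'),
      Buffer.from(password, 'utf8')
    ]))
    .digest('base64');
  return {
    Username: username,
    PasswordDigest: digest,
    Nonce: nonceBytes.toString('base64'), // the nonce is sent Base64-encoded
    Created: created
  };
}
```

The returned object has the same keys as wsseHeaderAssoc in the question, so it could be dropped into the same XML template. If the server still rejects the token, comparing Created against the server clock is worth a look, since some services reject stale timestamps.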

Node.js / Socket.io Access-Control-Allow-Origin error on both local and remote site

Hello world!
I'm trying to get a JSON feed to my client, through my node/socket.io server.
All I keep getting, locally, or remotely (using nodejitsu) is the following error :
XMLHttpRequest cannot load http://jsonfeed.url. Origin http://example.com is not allowed by Access-Control-Allow-Origin.
From my research and testing, it doesn't depend on my browser, but on my server code. Please enlighten me!
Server code (I can provide more, but tried to be concise):
var fs = require('fs'),
http = require('http');
//use of a fs and node-static to serve the static files to client
function handler(request, response) {
"use strict";
response.writeHead(200, {
'Content-Type': 'text/html',
'Access-Control-Allow-Origin' : '*'});
fileServer.serve(request, response); // this will return the correct file
}
var app = http.createServer(handler),
iosocks = require('socket.io').listen(app),
staticserv = require('node-static'); // for serving files
// This will make all the files in the current folder
// accessible from the web
var fileServer = new staticserv.Server('./');
app.listen(8080);
That's it, I have tried everything... Time to call the geniuses!
OK, how do I say this... First thanks to StackOverflow and to those who tried to answer, it's a bit less cold out there than it once was, keep it going.
THEN...
It's working.
I do not know why, and it's annoying.
All I did was remove:
response.writeHead(200, {
'Content-Type': 'text/html',
'Access-Control-Allow-Origin' : '*'});
Just like that. Like it was at the beginning, when it was NOT working!
Bonus question: can somebody tell me why?

responding: Invalid_request when requesting token from node server

I'm attempting to use Google's authorization services by following this guide.
I'm having trouble trading the code in for a token from the server.
var token_request='?code='+code+
'&client_id='+client_id+
'&client_secret='+client_secret+
'&redirect_uri='+redirect_uri+
'&grant_type=authorization_code';
options = {
host: "accounts.google.com",
path: '/o/oauth2/token'+token_request,
method: "POST"
}
var tokenRequest = https.request(options, function(res){
var resp = "";
res.on('data', function(data){
resp+= data;
})
res.on('end', function(){
console.log(resp);
})
res.on('error', function(err){
console.log("\033[;33mIt's an Error.\033[0;39m");
console.log(err);
})
}).end();
I would say from this site that you should use 'method: "GET"' instead of 'method: "POST"' since your values are in the query string.
EDIT:
According to the comments, I would say that you have to rework your code in order for it to work properly.
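One way that rework might look: OAuth 2.0 token endpoints expect a POST whose parameters are in the request body, encoded as application/x-www-form-urlencoded, rather than in the query string. A minimal sketch with the core https and querystring modules follows. The host and path are copied from the question and may be out of date; buildTokenRequest and requestToken are names invented for this example.

```javascript
var https = require('https');
var querystring = require('querystring');

// Build the request options and the form-encoded body for the token exchange.
function buildTokenRequest(code, client_id, client_secret, redirect_uri) {
  var body = querystring.stringify({
    code: code,
    client_id: client_id,
    client_secret: client_secret,
    redirect_uri: redirect_uri,
    grant_type: 'authorization_code'
  });
  return {
    body: body,
    options: {
      host: 'accounts.google.com', // from the question
      path: '/o/oauth2/token',     // from the question
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Content-Length': Buffer.byteLength(body)
      }
    }
  };
}

// Send the request and pass the raw response text to callback(err, text).
function requestToken(code, client_id, client_secret, redirect_uri, callback) {
  var built = buildTokenRequest(code, client_id, client_secret, redirect_uri);
  var req = https.request(built.options, function (res) {
    var resp = '';
    res.on('data', function (data) { resp += data; });
    res.on('end', function () { callback(null, resp); });
  });
  req.on('error', callback); // connection errors are emitted on the request
  req.write(built.body);
  req.end();
}
```

Note that querystring.stringify also URL-encodes the values, which the string concatenation in the question does not do; an unencoded redirect_uri alone can be enough to get an invalid_request back.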
To be honest, I am trying to do the same thing, with difficulty. Notwithstanding that, it is worth trying googleapis.
You need to use npm to install the google apis
npm install googleapis
see https://npmjs.org/package/googleapis
for the documentation

How to use Google Chrome Remote Debugging Protocol in HTTP?

I referred to http://code.google.com/chrome/devtools/docs/remote-debugging.html.
First, I started a new chrome process.
chrome --remote-debugging-port=9222 --user-data-dir=remote-profile
Then I want to try some options written in http://code.google.com/intl/ja/chrome/devtools/docs/protocol/tot/index.html, but how can I use them?
I already know how to use those methods in WebSocket, but I have to use it in HTTP.
I tried this Node.js code, but it failed.
var http = require('http');
var options = {
host: 'localhost',
port: 9222,
path: '/devtools/page/0',
method: 'POST'
};
var req = http.request(options, function (res) {
console.log(res.headers);
res.on('data', function (chunk) {
console.log(chunk);
});
});
req.on('error', function (e) { console.log('problem' + e.message); });
req.write(JSON.stringify({
'id': 1,
'method': "Page.enable"
}));
req.end();
Is it wrong?
I know this is a fairly old question, but I ran into it when I was trying to do something similar.
There's an npm module called chrome-remote-interface, which makes using the Chrome Remote Debugging API a lot easier: https://github.com/cyrus-and/chrome-remote-interface
npm install chrome-remote-interface
Then you can use the module in your code like this:
var Chrome = require('chrome-remote-interface');
Chrome(function (chrome) {
with (chrome) {
on('Network.requestWillBeSent', function (message) {
console.log(message.request.url);
});
on('Page.loadEventFired', close);
Network.enable();
Page.enable();
Page.navigate({'url': 'https://github.com'});
}
}).on('error', function () {
console.error('Cannot connect to Chrome');
});
I think the documentation says: "Note that we are currently working on exposing an HTTP-based protocol that does not require client WebSocket implementation."
I'm not sure whether that means you can use HTTP instead of WebSocket now.
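For what it's worth, the part of the debugging port that does speak plain HTTP is target discovery: a GET to /json on the port returns a JSON array of open targets, each with a webSocketDebuggerUrl. As far as I know, commands such as Page.enable still have to be sent over that WebSocket URL. A small sketch of the discovery step, assuming Chrome was started with --remote-debugging-port=9222 as in the question; firstPageSocketUrl and listTargets are made-up helper names.

```javascript
var http = require('http');

// Return the WebSocket debugger URL of the first target of type "page", or null.
function firstPageSocketUrl(targets) {
  for (var i = 0; i < targets.length; i++) {
    if (targets[i].type === 'page' && targets[i].webSocketDebuggerUrl) {
      return targets[i].webSocketDebuggerUrl;
    }
  }
  return null;
}

// Ask Chrome for its list of debuggable targets over plain HTTP.
function listTargets(port, callback) {
  http.get({ host: 'localhost', port: port, path: '/json' }, function (res) {
    var text = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { text += chunk; });
    res.on('end', function () {
      var targets;
      try {
        targets = JSON.parse(text);
      } catch (e) {
        return callback(e);
      }
      callback(null, targets);
    });
  }).on('error', callback);
}
```

Usage would be listTargets(9222, function (err, targets) { console.log(firstPageSocketUrl(targets)); }), after which the printed URL is what a WebSocket client (or chrome-remote-interface) connects to.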
There is also a fantastic NPM module called Weinre that allows you to easily use the Chrome debugging/ remote debugging tools. If you have to test cross browser too it allows you to use the Chrome tools even on certain versions of IE. There is some more information on the MSDN blog.

Connect to Cloudant CouchDB with Node.js?

I am trying to connect to my CouchDB database on Cloudant using Node.js.
This worked on the shell:
curl https://weng:password@weng.cloudant.com/my_app/_all_docs
But this node.js code didn't work:
var couchdb = http.createClient(443, 'weng:password@weng.cloudant.com', true);
var request = couchdb.request('GET', '/my_app/_all_docs', {
'Host': 'weng.cloudant.com'
});
request.end();
request.on('response', function (response) {
response.on('data', function (data) {
util.print(data);
});
});
It gave me this data back:
{"error":"unauthorized","reason":"_reader access is required for this request"}
How do I list all my databases with Node.js?
The built-in Node.js http client is pretty low level and doesn't support HTTP Basic auth out of the box. The second argument to http.createClient is just a hostname; it doesn't expect credentials there.
You have two options:
1. Construct the HTTP Basic Authorization header yourself
var Base64 = require('Base64');
var couchdb = http.createClient(443, 'weng.cloudant.com', true);
var request = couchdb.request('GET', '/my_app/_all_docs', {
'Host': 'weng.cloudant.com',
'Authorization': 'Basic ' + Base64.encode('weng:password')
});
request.end();
request.on('response', function (response) {
response.on('data', function (data) {
util.print(data);
});
});
You will need a Base64 lib such as one for node written in C, or a pure-JS one (e.g. the one that CouchDB Futon uses).
2. Use a more high-level Node.js HTTP client
A more featureful HTTP client, like Restler, will make it much easier to do the request above, including credentials:
var restler = require('restler');
restler.get('https://weng.cloudant.com:443/my_app/_all_docs', {
username: 'weng',
password: 'password'
}).on('complete', function (data) {
util.print(data);
});
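On current Node.js versions, option 1 needs neither http.createClient (which has since been removed) nor a separate Base64 library, because Buffer can do the encoding. A minimal sketch under that assumption; basicAuth and cloudantGet are hypothetical helper names, and the account name and credentials are the placeholders from the question.

```javascript
var https = require('https');

// Build the value of an HTTP Basic Authorization header.
function basicAuth(username, password) {
  return 'Basic ' + Buffer.from(username + ':' + password).toString('base64');
}

// GET a path from a Cloudant account and pass the response text to callback(err, text).
function cloudantGet(account, username, password, path, callback) {
  var options = {
    host: account + '.cloudant.com',
    path: path,
    method: 'GET',
    headers: { Authorization: basicAuth(username, password) }
  };
  https.request(options, function (res) {
    var text = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { text += chunk; });
    res.on('end', function () { callback(null, text); });
  }).on('error', callback).end();
}
```

For example, cloudantGet('weng', 'weng', 'password', '/_all_dbs', function (err, text) { console.log(err || text); }) asks CouchDB's /_all_dbs endpoint for the list of databases, and '/my_app/_all_docs' lists the documents in one database.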
There are lots of CouchDB modules for Node.js.
node-couch - a CouchDB connector
node-couchdb - A full API implementation
node-couchdb-min - Light-weight client with low level of abstraction and connection pooling.
cradle - a high-level, caching, CouchDB client
Just wanted to add
nano - minimalistic couchdb driver for node.js
to the list. It is written by Nuno Job, CCO of nodejitsu, and actively maintained.
This answer is looking a bit dated. Here is an updated answer, which I verified using the following Cloudant-supported npm client library for Node:
https://www.npmjs.com/package/cloudant#getting-started
To answer the question of how to list the databases, use the following code.
// Specify your Cloudant database connection URL. For Bluemix the format is: https://username:password@xxxxxxxxx-bluemix.cloudant.com
var dbCredentials_url = "https://username:password@xxxxxxxxx-bluemix.cloudant.com"; // Set this to your own account
// Load the Cloudant library and initialize it with your account.
var cloudant = require('cloudant')(dbCredentials_url);
// List the Cloudant databases
cloudant.db.list(function(err, allDbs) {
    if (err) { return console.error(err); }
    console.log('All my databases: %s', allDbs.join(', '));
});
