Scrape multiple JavaScript-based websites in R

These are my first steps in programming and I'm trying to learn as much as I can before bothering you guys. But right now I'm pretty much stuck after trying several ways (thought of by myself or found online).
What I'm trying to do now is to save multiple complete JavaScript-rendered pages so I can work with them further in R. As far as I understand, this is only possible using PhantomJS. I've managed to write code that loads a single page, but I'm struggling with the loop:
writeLines("var page = new WebPage();
var fs = require('fs');
for (i = 101; i <= 150; i++) {
page.open('http://understat.com/match/' + i, function (status) {
just_wait();
});
function just_wait() {
setTimeout(function() {
fs.write('match' + i + '.html', page.content, 'w');
phantom.exit();
}, 2500);
}
}
", con = "scrape.js")
js_scrape <- function(
    js_path = "scrape.js",
    phantompath = "/Users/Marek/Documents/Programmierung/Startversuche/phantomjs-2.1.1/bin/phantomjs") {
  lines <- readLines(js_path)
  command <- paste(phantompath, js_path, sep = " ")
  system(command)
}
js_scrape()
It only saves the last page of the loop. Reading other threads, I understand that the problem is that PhantomJS is asynchronous and exits before the pages have finished loading, but I could not work out a way around it so that each page gets saved to its own file.
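One common fix (not from the original post) is to open the pages one after another and only call phantom.exit() once the last file has been written; a minimal sketch of scrape.js along those lines, keeping the same URL range, file names, and 2500 ms wait:
// scrape.js - sequential version: open one match page at a time,
// write it to disk, then move on; exit only after the last one.
var page = require('webpage').create();
var fs = require('fs');
var i = 101;
var last = 150;
function scrapeNext() {
    if (i > last) {
        phantom.exit();
        return;
    }
    page.open('http://understat.com/match/' + i, function (status) {
        // give the page's own JavaScript time to render before saving
        setTimeout(function () {
            fs.write('match' + i + '.html', page.content, 'w');
            i++;
            scrapeNext();
        }, 2500);
    });
}
scrapeNext();
The R wrapper that calls PhantomJS can stay as it is.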

Related

Is there a JavaScript/jQuery File Create event?

For some reason I got stuck on some events in jQuery/JS.
function update()
{
    if (scrolling == true) {
        return;
    }
    var count = 0;
    jQuery.get('count.txt', function(data) {
        count = data;
    }).done(function() {
        var countstr = '' + count;
        myImage.src = "latest" + countstr + ".jpg#" + new Date().getTime();
        setTimeout(update, 1000);
    });
}
In my last question I asked about the jQuery "done" function.
Currently I am working with a timeout/timer to update the image every second:
setTimeout(update, 1000);
It works, but I know this can't be the smartest solution. In C# I can use a FileSystemWatcher to get an event when there is a new file in the folder:
FileSystemWatcher watcher = new FileSystemWatcher();
watcher.Path = path;
watcher.NotifyFilter = NotifyFilters.LastAccess | NotifyFilters.LastWrite
| NotifyFilters.FileName | NotifyFilters.DirectoryName;
watcher.Filter = "*.jpg";
watcher.Created += new FileSystemEventHandler(OnChanged);
watcher.EnableRaisingEvents = true;
Is there an API or an event in jQuery/JS to check for that? I was also looking at working with AJAX, but I have no experience with AJAX.
//edit
I know that JS alone is not able to do that, but I was wondering if there is another way to get this kind of event (via AJAX or Node.js, for example).
What am I doing?
I made a piece of software that uploads many images to my FTP server: images0, images1, images2, etc.
The event should detect when another image has been uploaded and show it instead of the old one.
Florian, as was already mentioned, you cannot do it with client-side JS code.
What I would use in this case (I assume it's a fairly universal solution):
Node.js has a file watching API (https://nodejs.org/docs/latest/api/fs.html#fs_class_fs_fswatcher), so you can subscribe to FS events.
You then need to notify the client about these changes. I would use socket.io (https://socket.io/, both the client and the server side).
Using the file watcher and WebSockets you can notify the user about any FS changes. You can upload the files via FTP, an HTTP client, or just create them locally.
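A minimal sketch of that combination, assuming socket.io is installed and the images land in a local ./images folder (the folder name, port, and event name are illustrative assumptions, not part of the answer):
// server.js - watch the image folder and push a "new file" event to the browser.
var fs = require('fs');
var http = require('http');
var server = http.createServer();            // serve your page however you already do
var io = require('socket.io')(server);
fs.watch('./images', function (eventType, filename) {
    // 'rename' fires when a file is created (or deleted) in the watched folder
    if (eventType === 'rename' && filename) {
        io.emit('imageAdded', filename);
    }
});
server.listen(3000);
On the page (with /socket.io/socket.io.js loaded), the timer from the question could then be replaced by:
var socket = io();
socket.on('imageAdded', function (filename) {
    myImage.src = filename + '#' + new Date().getTime();   // cache-bust, as in the question
});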
Client-side/front-end JavaScript is not able to create, edit, or delete a file; it can only read one.
For writing files in Node.js, that is already covered on Stack Overflow; refer to Writing files in Node.js.

Use regular expressions in javascript to manage source files

I am taking my first steps coding with JavaScript and also playing with a WebGL library called Three.js.
After watching some tutorials and doing some experiments I finally made this: https://dl.dropboxusercontent.com/u/15814455/Monogram.html.
As you can see in my code, the object reflects a random group of 6 images out of the 13 images I have in a folder.
var numberOfImages = 13, images = [];
for (var i = 1; i <= numberOfImages; i++) {
    images.push('sources/instagram/image' + i + ".jpg");
}
var urls = images.sort(function() { return .6 - Math.random(); }).slice(0, 6);
var reflectionCube = THREE.ImageUtils.loadTextureCube(urls);
reflectionCube.format = THREE.RGBFormat;
The thing is that each time I upload an Instagram picture, it gets saved in that folder called instagram.
Now my problem is that if I upload, for example, 10 more images to the folder, I have to change this line of code from var numberOfImages = 13 to var numberOfImages = 23.
So I am looking for a way to modify my code so that it doesn't hard-code the number of images; I could then upload images to the folder and automatically see them on my 3D object.
I've been reading on the internet and found that I could use something called regular expressions in my code to solve this problem.
I would like to know if using regular expressions is a real solution. Is it worth learning regular expressions to solve this problem?
Do you have any suggestions? Is there another solution? Maybe it's something simple and I should just write a different line of code, but if it's something more complicated and I need to learn some language, I would like to learn the right one.
First off, if you are going to be programming at any length in pretty much any language, it is worth knowing regular expressions and how/when to use them, so learning them will be useful regardless.
If this were a client/server problem where you controlled the server, the usual way to solve it would be for the server to scan its file system and tell the client how many images to prepare for.
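For illustration, a minimal Node.js sketch of that server side (the endpoint path, folder, and file name pattern are assumptions for this example):
// Hypothetical endpoint: count the image files on disk and report the number.
var http = require('http');
var fs = require('fs');
http.createServer(function (req, res) {
    if (req.url === '/image-count') {
        fs.readdir('sources/instagram', function (err, files) {
            var count = err ? 0 : files.filter(function (f) {
                return /^image\d+\.jpg$/.test(f);    // image1.jpg, image2.jpg, ...
            }).length;
            res.writeHead(200, { 'Content-Type': 'application/json' });
            res.end(JSON.stringify({ count: count }));
        });
    } else {
        res.writeHead(404);
        res.end();
    }
}).listen(3000);
The client would then request /image-count first and build its urls array from the returned count instead of hard-coding numberOfImages.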
If you have to solve this entirely on the client, then you can't directly scan the file system from the client, but you can request increasing file numbers and you can see (asynchronously) when you stop getting successful loading of images. This is not particularly easy to code because the response will be asynchronous, but it could be done.
Here's a scheme for preloading images and finding out where they stopped preloading successfully:
function preloadImages(srcs, callback) {
    var img, imgs = [];
    var remaining = srcs.length;
    var failed = [];
    function checkDone() {
        --remaining;
        if (remaining <= 0) {
            callback(failed);
        }
    }
    for (var i = 0; i < srcs.length; i++) {
        img = new Image();
        img.onload = checkDone;
        img.onerror = img.onabort = function() {
            failed.push(this.src);
            checkDone();
        };
        img.src = srcs[i];
        imgs.push(img);
    }
}
var maxNumberOfImages = 30, images = [];
for (var i = 1; i <= maxNumberOfImages; i++) {
    images.push('sources/instagram/image' + i + ".jpg");
}
preloadImages(images, function(failed) {
    var regex = /(\d+)\.jpg$/;
    // pull the number out of each failed URL and sort numerically
    var nums = failed.map(function(item) {
        var matches = item.match(regex);
        return parseInt(matches[1], 10);
    }).sort(function(a, b) { return a - b; });
    // the lowest failing number minus one is the count of good images
    // (if nothing failed, all maxNumberOfImages loaded)
    var numImages = nums.length ? nums[0] - 1 : maxNumberOfImages;
    // now you know how many good images there are
    // put your code here to use that knowledge
});
Note: This does use a regular expression to parse the image number out of a URL.
It would be possible to code this without a preset limit, but it would be more work. You'd probably request images 10 at a time and if you didn't get any errors, then request the next block of 10 until you found the first failure.
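A rough sketch of that open-ended variant, reusing preloadImages() from above (the block size and the nums[0] - 1 bookkeeping are assumptions, not tested code):
// Probe in blocks of ten until a block contains a missing image.
function countImages(prefix, blockSize, start, callback) {
    var srcs = [];
    for (var i = start; i < start + blockSize; i++) {
        srcs.push(prefix + i + ".jpg");
    }
    preloadImages(srcs, function (failed) {
        if (failed.length === 0) {
            // the whole block loaded - keep probing the next block
            countImages(prefix, blockSize, start + blockSize, callback);
        } else {
            var nums = failed.map(function (item) {
                return parseInt(item.match(/(\d+)\.jpg$/)[1], 10);
            }).sort(function (a, b) { return a - b; });
            callback(nums[0] - 1);   // highest consecutively available image number
        }
    });
}
countImages('sources/instagram/image', 10, 1, function (numImages) {
    // image1.jpg .. image{numImages}.jpg are available
});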

How to get all of the cookies dropped in a session as a txt file?

I'm working on a digital art project that involves gathering cookies from a set of websites that I visit. I'm dabbling in writing some code to help me with this but overall I'm just looking for the easiest/fastest way to gather all of the contents of the cookies dropped in a single visit into a text file for re-use later.
Right now - I'm using this script in a JavaScript bookmarklet which replaces the page I'm on with the contents of the cookies in an array (I'm later putting this array into a python script I wrote...).
The contents of the bookmarklet are below, but the problem right now is that it only returns the cookies from the single domain I'm visiting.
So, for example, if I run this script on the NYTimes.com homepage I get approximately 48 cookies dropped by the domain. But if I look in Chrome I see that all of the 3rd-party tracking scripts have set hundreds of cookies. How do I gather them all, not just the NYTimes.com ones?
This is the current JavaScript code I'm running via a bookmarklet right now:
function get_cookies_array() {
    var cookies = {};
    if (document.cookie && document.cookie != '') {
        var split = document.cookie.split(';');
        for (var i = 0; i < split.length; i++) {
            var name_value = split[i].split("=");
            name_value[0] = name_value[0].replace(/^ /, '');
            cookies[decodeURIComponent(name_value[0])] = decodeURIComponent(name_value[1]);
        }
    }
    return cookies;
}
function quotationsanitize(cookie) {
    if (cookie.indexOf('"') === -1) {
        return cookie;
    } else {
        alert("found a quotation!");
        return encodeURIComponent(cookie);
    }
}
function sanitize(cookie) {
    if (cookie.indexOf(',') === -1) {
        return quotationsanitize(cookie);
    } else {
        alert("found a comma!");
        return quotationsanitize(encodeURIComponent(cookie));
    }
}
function appendCookies() {
    $("body").empty();
    var cookies = get_cookies_array();
    $("body").append("[");
    for (var name in cookies) {
        //$("body").append(name + " : " + cookies[name] + "<br />");
        var cookieinfo = sanitize(cookies[name]);
        $("body").append('"' + cookieinfo + '",<br />');
    }
    $("body").append("]");
}
var js = document.createElement('script');
js.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js";
document.head.appendChild(js);
jqueryTimeout = window.setTimeout(appendCookies, 500);
I'm removing " and , from the output because I'm putting this data into an array in Python by copying and pasting it. I admit that it's a hack. If anyone has any better ideas I'm all ears!
I'd write a simple little HTTP proxy, then set your browser to use the proxy and have it record all the cookies as they pass through.
There's a question about writing a simple proxy here: seriously simple python HTTP proxy?
which might get you started.
You'd need to extend it to read the headers and extract the cookies, but that's relatively easy, and if you're happy in Python, you'll find libraries that do most of what you want already. You would want to record the Referer header too, so you know which cookies came from which page request, but then you could record an entire browsing session quite simply.
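The answer above suggests Python; purely as an illustration of the same idea in JavaScript, here is a rough Node.js sketch of a logging proxy (plain HTTP only, no CONNECT/HTTPS tunnelling, and the output file name is made up):
// proxy.js - forward plain HTTP requests and append cookie headers to a log file.
var http = require('http');
var url = require('url');
var fs = require('fs');
http.createServer(function (clientReq, clientRes) {
    var target = url.parse(clientReq.url);   // the URL is absolute when used as a proxy
    // cookies the browser sends out with this request
    if (clientReq.headers.cookie) {
        fs.appendFileSync('cookies.txt', target.host + ' OUT ' + clientReq.headers.cookie + '\n');
    }
    var proxyReq = http.request({
        hostname: target.hostname,
        port: target.port || 80,
        path: target.path,
        method: clientReq.method,
        headers: clientReq.headers
    }, function (proxyRes) {
        // cookies the server drops in the response
        (proxyRes.headers['set-cookie'] || []).forEach(function (c) {
            fs.appendFileSync('cookies.txt', target.host + ' IN ' + c + '\n');
        });
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });
    clientReq.pipe(proxyReq);
}).listen(8080);
Point the browser's HTTP proxy setting at localhost:8080 and every Cookie / Set-Cookie header that passes through is appended to cookies.txt.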

How to save files in a "while" loop with PhantomJS?

I need to save 4 pages as HTML output.
Here is the code in PhantomJS:
var i = 0;
while (i < 4)
{
    var page = require('webpage').create();
    var fs = {};
    fs = require('fs');
    if (i == 0)
    {
        var url = 'http://www.lamoda.ru/shoes/dutiki-i-lunohody/?sitelink=leftmenu&sf=16&rdr565=1#sf=16';
    } else {
        var url = 'http://www.lamoda.ru/shoes/dutiki-i-lunohody/?sitelink=leftmenu&sf=16&rdr565=1#sf=16&p=' + i;
    }
    page.open(url, function (status) {
        var js = page.evaluate(function () {
            return document;
        });
        console.log(js.all[0].outerHTML);
        page.render('export' + i + '.png');
        fs.write(i + '.html', js.all[0].outerHTML, 'w');
        phantom.exit();
    });
    i++;
}
It seems that I need to change the fs variable, but I don't know how... I don't want to create fs1, fs2, fs3, fs4... I need to find a better solution, hope you will help, thank you)
Is it okay if your requests are serial, so page 2 is not requested until page 1 has returned? If so, I recommend you base your code on this multi-url sample in the documentation.
If you want the requests to run in parallel, then you need to use a JavaScript closure to protect the local variables (see https://stackoverflow.com/a/17619716/841830 for an example of how to do that). Once you are doing that, you can either parse url to find out whether it ends in p=1, p=2, etc., or assign i inside the page object and access it with this.i.
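For the parallel route, a rough, untested sketch of the closure approach applied to the code above (an IIFE pins the value of i for each iteration, and phantom.exit() only runs after the last page has been written):
var fs = require('fs');
var pending = 4;
for (var i = 0; i < 4; i++) {
    (function (i) {
        var page = require('webpage').create();
        var url = 'http://www.lamoda.ru/shoes/dutiki-i-lunohody/?sitelink=leftmenu&sf=16&rdr565=1#sf=16'
                + (i === 0 ? '' : '&p=' + i);
        page.open(url, function (status) {
            page.render('export' + i + '.png');
            fs.write(i + '.html', page.content, 'w');
            page.close();
            if (--pending === 0) {
                phantom.exit();
            }
        });
    })(i);
}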

Photoshop scripting: iterating a list of all layers within a layer set is very slow

The task that I'm trying to implement is very simple:
I need to get a list of all layers (one level deep) within a specified layer set (group), and write this list to a file.
The code is simple (and working) as well:
function indexCurrent(document) {
    var log = new File(indexLocation + document.name + '.js');
    alert("Collecting data");
    var images = document.layerSets.getByName("Images").layers;
    var imagesLength = images.length;
    var layers = [];
    alert("Iterating " + imagesLength + " layers");
    for (var jj = 0, jL = imagesLength; jj < jL; jj++) {
        layers.push('\t\t\'' + images[jj].name + '\'');
    }
    alert("Writing " + layers.length + " layers");
    log.open('w');
    log.write('\n\t\'' + document.name + '\': [\n');
    log.write(layers.join(",\n"));
    log.write('\n\t]\n');
    log.close();
}
This code works, but with 150+ layers it takes hours between the "Iterating" and "Writing" alerts.
I have read all related questions here, but that doesn't help.
I'm sure that there should be much more efficient way for such a simple task.
I'm running Photoshop CS6 on Windows 7.
Thanks.
I suggest you try switching from accessing the layers via the DOM to getting at them via the action manager. I'm pretty sure you'll get better performance that way. I'm terrible with the action manager code so I can't give you a working example - just something to google :)
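For reference, a rough, untested sketch of what the Action Manager route usually looks like (the standard executeActionGet pattern; treat this as a starting point to verify rather than a working example, since it is not from the answer above):
// Read a layer name by its Action Manager index, without touching the slow DOM collection.
function getLayerNameByIndex(index) {
    var ref = new ActionReference();
    ref.putProperty(charIDToTypeID("Prpr"), charIDToTypeID("Nm  "));
    ref.putIndex(charIDToTypeID("Lyr "), index);
    return executeActionGet(ref).getString(charIDToTypeID("Nm  "));
}
// Total number of layers in the active document (this count includes group end markers).
function getLayerCount() {
    var ref = new ActionReference();
    ref.putProperty(charIDToTypeID("Prpr"), stringIDToTypeID("numberOfLayers"));
    ref.putEnumerated(charIDToTypeID("Dcmn"), charIDToTypeID("Ordn"), charIDToTypeID("Trgt"));
    return executeActionGet(ref).getInteger(stringIDToTypeID("numberOfLayers"));
}
You would still need to work out which index range corresponds to the "Images" group, which is the fiddly part of this approach.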
Runs fine and fast for me, but you didn't state how large the source psd is.
Anyhoo, have a look here:
Action Manager layer search
I can't run it as I'm running good old CS2. Ye-har!
