Web scraping images in Python with Selenium and BeautifulSoup from an AJAX website - javascript

I've spent a long time going through the HTML, JavaScript, and network traffic, and learning a lot about JavaScript, blobs, and base64 encoding/decoding of images, but I still can't figure out how to extract the images in these videos from this website: https://www.jamesallen.com/loose-diamonds/all-diamonds/
Here's what I know:
Each video is actually a set of up to 512 images, which are retrieved from a server in files named setX.bin (X is a number). They are then parsed via an int array into a Blob object (there's also some base64 involved, but I forget where), which is somehow converted into an image.
Following the source code is very difficult, as it appears to be deliberately written as spaghetti code.
How can I extract each diamond's images and do so efficiently?
My one solution is:
I can get the setX.bin files very easily, and if I just 'pass' them into the javascript functions somehow then I should be good.
My second solution is:
to rotate each diamond manually and extract the images from the cache or something like that.
I'd like to use python to do this.
EDIT:
I found JavaScript here on SO that gives the 'SecurityError: The operation is not secure' error. Here it is:
function exportCanvasAsPNG(id, fileName) {
    var canvasElement = document.getElementById(id);
    canvasElement.crossOrigin = "anonymous";
    var MIME_TYPE = "image/png";
    var imgURL = canvasElement.toDataURL(MIME_TYPE);
    window.console.log(canvasElement);

    var dlLink = document.createElement('a');
    dlLink.download = fileName;
    dlLink.href = imgURL;
    dlLink.dataset.downloadurl = [MIME_TYPE, dlLink.download, dlLink.href].join(':');

    document.body.appendChild(dlLink);
    dlLink.click();
    document.body.removeChild(dlLink);
}
exportCanvasAsPNG("canvas-key-_w5qzvdqpl",'asdf.png');
I ran it from the Firefox console. When I ran a similar execute_script call from Python, I got the same error.
I want to be able to scrape all 360 degree images for each canvas.
Edit2: To make this question simpler: I know how to get the setX.bin files, but I don't know how to convert this collection of images from .bin to .jpg. Each .bin file contains multiple JPEG files.

The .bin files appear to just contain the JPEGs concatenated together with some leading metadata. You can simply iterate through the bytes of the file looking for the JPEG file signature (0xFFD8) and slice out each image:
JPEG_MAGIC = b"\xff\xd8"

with open("set0.bin", "rb") as f:
    s = f.read()

i = 0
start_index = s.find(JPEG_MAGIC)
while True:
    end_index = s.find(JPEG_MAGIC, start_index + 1)
    if end_index == -1:
        end_index = len(s)
    with open(f"out{i:03}.jpg", "wb") as out:
        out.write(s[start_index:end_index])
    if end_index == len(s):
        break
    start_index = end_index
    i += 1
Result: each extracted frame is written to its own file (out000.jpg, out001.jpg, ...).

Related

Encode / Decode PNGs to base64 strings in JXA/JavaScript

I am trying to write a JXA script in Apple Script Editor that converts PNG files to base64 strings, which can then be added to a JSON object.
I cannot seem to find a JXA method that works for doing the base64 encoding /decoding part.
I came across a droplet which was written using Shell Script that outsources the task to openssl and then outputs a .b64 file:
for f in "$@"
do
    openssl base64 -in "$f" -out "$f.b64"
done
So I was thinking of Frankenstein'ing this up to a method that uses evalAS to run inline AppleScript, per the example:
(() => {
    'use strict';

    // evalAS2 :: String -> IO a
    const evalAS2 = s => {
        const a = Application.currentApplication();
        return (a.includeStandardAdditions = true, a)
            .runScript(s);
    };

    return evalAS2(
        'use scripting additions\n' +
        'for f in ' + '\x22' + file + '\x22\n' +
        'do\n' +
        'openssl base64 -in "$f" -out "$f.b64"\n' +
        'done'
    );
})();
And then re-opening the .b64 file in the script, but this all seems rather long-winded and clunky.
I know that it is possible to use Cocoa in JXA scripts, and I see that there are methods for base64 encoding/decoding in Cocoa...
As well as Objective-C:
NSData *imageData = UIImagePNGRepresentation(myImageView.image);
NSString * base64String = [imageData base64EncodedStringWithOptions:0];
The JXA Cookbook has a whole section going over Syntax for Calling ObjC functions, which I am trying to read over.
From what I understand, it should look something like:
var image_to_convert = $.NSData.alloc.UIImagePNGRepresentation(image)
var image_as_base64 = $.NSString.alloc.base64EncodedStringWithOptions(image_to_convert)
But I just am a total noob to this, so it is still difficult for me to understand it all.
In the speculative code above, I am not sure where I would get the image data from?
I am currently trying:
ObjC.import("Cocoa");
var image = $.NSImage.alloc.initWithContentsOfFile(file)
console.log(image);
var image_to_convert = $.NSData.alloc.UIImagePNGRepresentation(image)
var image_as_base64 = $.NSString.alloc.base64EncodedStringWithOptions(image_to_convert)
But it is resulting in the following errors:
$.NSData.alloc.UIImagePNGRepresentation is not a function. (In
'$.NSData.alloc.UIImagePNGRepresentation(image)',
'$.NSData.alloc.UIImagePNGRepresentation' is undefined)
I am guessing it is because UIImagePNGRepresentation is of the UIKit framework, which is an iOS thing and not OS X?
I came across this post, which suggests this:
NSArray *keys = [NSArray arrayWithObject:@"NSImageCompressionFactor"];
NSArray *objects = [NSArray arrayWithObject:@"1.0"];
NSDictionary *dictionary = [NSDictionary dictionaryWithObjects:objects forKeys:keys];
NSImage *image = [[NSImage alloc] initWithContentsOfFile:[imageField stringValue]];
NSBitmapImageRep *imageRep = [[NSBitmapImageRep alloc] initWithData:[image TIFFRepresentation]];
NSData *tiff_data = [imageRep representationUsingType:NSPNGFileType properties:dictionary];
NSString *base64 = [tiff_data encodeBase64WithNewlines:NO];
But again, I have no idea how this translates to JXA. I just am determined to get something working.
I was hoping that there was some way of just doing it in plain old JavaScript that will work in a JXA script?
I look forward to any answers and/or pointers that you might be able to provide. Thank you all in advance!
I'm sorry, I've never worked with JXA, but I have worked a lot with Objective-C.
I think you are getting the errors because you are always trying to allocate new objects.
I think it should simply be:
ObjC.import("Cocoa");
var imageData = $.NSData.alloc.initWithContentsOfFile(file);
console.log(imageData);
var image_as_base64 = imageData.base64EncodedStringWithOptions(0); // Call method of allocated object
Passing 0 as the options constant just gives back the plain base64 string.
edit:
To make the value visible to JXA:
var theString = ObjC.unwrap(image_as_base64);
Use the code below. Read the file into var file from a jQuery file input element, using FileReader's readAsDataURL(). You will then have your PNG as a string in base64 format.
You may need to split the base64 string on ',' to get the actual data part of the string, which you can include in a JSON payload and send to the backend via an API.
var file = $('#fileUpload').prop('files')[0];
var base64data;
var reader = new FileReader();
reader.readAsDataURL(file);
reader.onload = function() {
    base64data = reader.result;
    var dataUrl = base64data.split(",");
};
Usually the base64 string you get will be in this form:
'data:image/png;base64,STREAM_OF_SOME_CHARACTERS...'
So the STREAM_OF_SOME_CHARACTERS... part (dataUrl[1]) is where the actual image data is.
Furthermore, you can display the image in an HTML page with:
<img src="data:image/png;base64,STREAM_OF_SOME_CHARACTERS...">
(i.e., the full data URL, base64data, goes in the src attribute).
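To make the split concrete, here is a minimal sketch; the payload below is a dummy base64 string (it decodes to "hello"), not a real PNG:

```javascript
// Hypothetical data URL; in practice this comes from reader.result.
// The payload here is just the base64 of "hello", not real image data.
var dataUrl = 'data:image/png;base64,aGVsbG8=';
var parts = dataUrl.split(',');
console.log(parts[0]); // 'data:image/png;base64'
console.log(parts[1]); // 'aGVsbG8=' -- the actual data part
```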

File Uploading ReadAsDataUrl

I have a question about the File API and uploading files in JavaScript and how I should do this.
I have already utilized a file uploader that was quite simple, it simply took the files from an input and made a request to the server, the server then handled the files and uploaded a copy file on the server in an uploads directory.
However, I am trying to give people the option to preview a file before uploading it. So I took advantage of the File API, specifically the new FileReader() and the following readAsDataURL().
The file object has a list of properties such as .size and .lastModifiedDate and I added the readAsDataURL() output to my file object as a property for easy access in my Angular ng-repeat().
My question is: it occurred to me as I was doing this that I could store the data URL in a database rather than upload the actual file. I was unsure if modifying the File data directly, with its data URL as a property, would affect its transfer.
What is the best practice? Is it better to upload a file or can you just store the dataurl and then output that, since that is essentially the file itself? Should I not modify the file object directly?
Thank you.
Edit: I should also note that this is a project for a customer who wants it to be hard for users to simply take uploaded content from the application, save it, and redistribute it. Would saving the files as URLs in a database mitigate right-click-save-as behavior, or not really?
There is more than one way to preview a file. The first is a data URL with FileReader, as you mention, but there is also URL.createObjectURL, which is faster.
Decoding and encoding to and from base64 takes longer and needs more calculation and more CPU/memory than working with the binary format directly, which I can demonstrate below:
var url = 'https://upload.wikimedia.org/wikipedia/commons/c/cc/ESC_large_ISS022_ISS022-E-11387-edit_01.JPG'

fetch(url).then(res => res.blob()).then(blob => {
    // Simulates a file as if you were to upload it through a file input and listen for onchange
    var files = [blob]
    var img = new Image
    var t = performance.now()
    var fr = new FileReader

    img.onload = () => {
        // show it...
        // $('body').append(img)
        var ms = performance.now() - t
        document.body.innerHTML = `it took ${ms.toFixed(0)}ms to load the image with FileReader<br>`

        // Now create an object URL instead of using base64, which takes time to
        // 1. encode the blob to base64
        // 2. decode it back again from base64 to binary
        var t2 = performance.now()
        var img2 = new Image

        img2.onload = () => {
            // show it...
            // $('body').append(img)
            var ms2 = performance.now() - t2
            document.body.innerHTML += `it took ${ms2.toFixed(0)}ms to load the image with URL.createObjectURL<br><br>`
            document.body.innerHTML += `URL.createObjectURL was ${(ms - ms2).toFixed(0)}ms faster`
        }

        img2.src = URL.createObjectURL(files[0])
    }

    fr.onload = () => (img.src = fr.result)
    fr.readAsDataURL(files[0])
})
The base64 form is about a third larger than the raw binary (and roughly 3x larger once held in memory as a UTF-16 JavaScript string). For mobile devices, you would want to save bandwidth and battery.
But then there is also the latency of an extra request; that's where HTTP/2 comes to the rescue.
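The byte overhead is easy to verify in Node.js; a quick sketch, with a zero-filled Buffer standing in for raw file bytes:

```javascript
// Base64 encodes every 3 bytes as 4 characters, so the encoded
// form is ~33% larger than the raw binary.
const raw = Buffer.alloc(300);           // 300 raw bytes
const encoded = raw.toString('base64');
console.log(encoded.length);             // 400 characters
```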

Handling Large stream with Node.js

Here is my attempt to convert an svg string to a png buffer using node and the imagemagick convert tool. The png buffer is then used to draw an image in a pdf using pdfkit.
tl;dr: I have a large SVG string that needs to reach a child process "whole" (i.e. not chunked). How do I do so?
This is an example that works for small files.
var child_process = require('child_process');
var pdfDocument = require('pdfkit');

var convert = child_process.spawn("convert", ["svg:", "png:-"]),
    svgsrc = '<svg><rect height="100" width="100" style="fill:red;"/></svg>';

convert.stdout.on('data', function(data) {
    console.log(data.toString('base64'));
    doc = new pdfDocument();
    doc.image(data);
});

convert.stdin.write(svgsrc);
convert.stdin.end();
This works when the svg string is 'small' (like the one provided in the example) -- I'm not sure where the cut-off from small to large is.
However, when attempting to use a larger svg string (something you might generate using D3) like this [ large string ]. I run into:
Error: Incomplete or corrupt PNG file
So my question is: How do I ensure that the convert child process reads the entire stream before processing it?
A few things are known:
The png buffer is indeed incomplete. I used a diff tool to check the base64 string generated by the app versus the base64 from a png-to-svg converter online. The non-corrupted string is much larger than the corrupted string (sorry I haven't been more specific with file sizes). That is, the convert tool seems to not be reading the entire source at any given time.
The source svg string is not corrupted (as evidenced by the fact that the gist rendered it).
When used on the command line, the convert tool correctly generates a png file from an svg stream with cat large_svg.svg | convert svg:- png:-, so this is not an issue with the convert tool.
This led me down a rabbit hole of looking at node's buffer sizes for writable and readable streams, but to no avail. Maybe someone has worked with larger streams in node and can help get this working.
As @mscdex pointed out, I had to wait for the process to finish before attempting downstream work. All that was needed was to wait for the end event on the convert.stdout stream and concatenate buffers on the data events.
// allocate a buffer of size 0
var graph = Buffer.alloc(0)

// on each 'data' event, concat the incoming chunk onto `graph`
convert.stdout.on('data', function(data) {
    graph = Buffer.concat([graph, data])
})

convert.stdout.on('end', function(signal) {
    // ... draw on pdf
})
EDIT:
Here is a more efficient version of the above, where we use @mscdex's suggestion to do the concatenation in the end callback, keeping a running total size so Buffer.concat can allocate the right amount when joining the chunks.
// collect chunks in an array and track the total size
var graph = [];
var totalsize = 0;

convert.stdout.on('data', function(data) {
    graph.push(data);
    totalsize += data.length;
})

convert.stdout.on('end', function(signal) {
    var image = Buffer.concat(graph, totalsize);
    // ... draw on pdf
})
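Stripped of the child process, the chunk-collecting pattern is just this; a small sketch with hard-coded buffers standing in for 'data' events:

```javascript
// Collect chunks as they arrive, then join them in a single allocation
// by passing the known total length to Buffer.concat.
const chunks = [Buffer.from('he'), Buffer.from('llo')]; // stand-ins for 'data' events
let totalsize = 0;
for (const chunk of chunks) {
    totalsize += chunk.length;
}
const whole = Buffer.concat(chunks, totalsize);
console.log(whole.toString()); // 'hello'
```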

javascript: downloaded file size different from content length

I have a base64 string which I decoded and wish to allow the user to save as a file. In particular, when I check the length of decodedContent, it's 11271 bytes.
var content = messageObj['data'];
var decodedContent = atob(content);
console.log(decodedContent.length);
Then I used
var blob = new Blob([decodedContent], {type: 'application/octet-stream'});
window.open((window.URL || window.webkitURL).createObjectURL(blob));
To prompt the user to save decodedContent. When I check the file size saved, it says 16892 bytes, which is different from what is stated above. Any idea why?
Content is a base64 encoded tar-ball file sent from the server.
for i, l in enumerate(to_download):
    if i == 1:
        break
    last_index = l.rfind('|')
    download_path = l[last_index+1:].strip()
    mda_url = '%s/%s' % (root_url, download_path)
    logger.debug('Downloading file %s/%s at %s', i, len(to_download), mda_url)
    mda_req = urllib2.Request(mda_url)
    mda_response = urllib2.urlopen(mda_req)
    f = StringIO.StringIO(mda_response.read())
    replace_path = mda_url.replace('/', '_')
    ti = TarInfo("%s.txt" % replace_path)
    ti.size = f.len
    tar.addfile(ti, f)

tar.close()
tar_file.close()

with open("/Users/carrier24sg/Desktop/my.tar", 'rb') as f:
    tar_str = f.read()

logger.info("Completed downloading all the requested files..")
return tar_str
UPDATE:
Narrowed down to the problem being with either var decodedContent = atob(content); or var blob = new Blob([decodedContent], {type: 'application/octet-stream'});
Finally I managed to use @Jeremy Bank's answer here. His first answer solves the issue of the content length being different, but when I check the checksum, the content doesn't tally. Only by using the b64toBlob function from his second answer did I resolve this. However, I'm still not sure what was wrong here, so I'm hoping someone can shed some light on this.
I think the problem is that atob() gives back a binary string, one character per byte. When you ask for its length, you get the number of characters it contains.
When you make a Blob from that string, the string is encoded as UTF-8, so characters above 0x7F become two bytes each, which is why the saved file is larger.
The two things differ because file storage size and coded size are not the same thing, and they can differ across platforms as well.
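If the goal is a saved file whose size matches decodedContent.length, one common approach (a sketch, not taken from the answers above) is to copy the binary string into a Uint8Array before constructing the Blob, so no UTF-8 re-encoding happens. The content value below is a stand-in for messageObj['data'], using the base64 of the 5 bytes "hello":

```javascript
// Stand-in for messageObj['data']: base64 of the 5 bytes "hello".
var content = 'aGVsbG8=';
var decodedContent = atob(content);        // binary string, one char per byte
var bytes = new Uint8Array(decodedContent.length);
for (var i = 0; i < decodedContent.length; i++) {
    bytes[i] = decodedContent.charCodeAt(i);
}
// Building the Blob from the byte array (not the string) avoids
// UTF-8 re-encoding, so blob.size matches decodedContent.length.
var blob = new Blob([bytes], { type: 'application/octet-stream' });
console.log(blob.size); // 5
```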

decoding a bin file to mp3 using Node Js

I am encoding a MP3 file to Base64 in Node Js using this method :
encodebase64 = function(mp3file){
var bitmap = fs.readFileSync(mp3file);
var encodedstring = new Buffer(bitmap).toString('base64');
fs.writeFileSync('encodedfile.bin', encodedstring);}
and then I want to reconstruct the MP3 file from the base64 .bin file, but the created file is missing some headers, so obviously there's a problem with the decoding.
the decoding function is :
decodebase64 = function(encodedfile) {
    var bitmap = fs.readFileSync(encodedfile);
    var decodedString = new Buffer(bitmap, 'base64');
    fs.writeFileSync('decodedfile.mp3', decodedString);
}
I wondered if anyone can help
Thanks.
Perhaps it is an issue with the encoding parameter. See this answer for details. Try using utf8 when decoding to see if that makes a difference. What platforms are you running your code on?
@Noah mentioned an answer about base64 decoding using Buffers, but if you use the same code from that answer to create MP3 files, they won't play and their file size will be larger than the originals, just like you experienced in the beginning.
We should write the buffer directly to the MP3 file we want to create, without converting it (the buffer) to an ASCII string:
// const buff = Buffer.from(audioContent, 'base64').toString('ascii'); // don't
const buff = Buffer.from(audioContent, 'base64');
fs.writeFileSync('test2.mp3', buff);
More info about fs.writeFile / fs.writeFileSync
