How to use GCP DLP with a file stream - javascript

I'm working with Node.js and GCP Data Loss Prevention to attempt to redact sensitive data from PDFs before I display them. GCP has great documentation on this here.
Essentially, you pull in the Node.js library and run something like this:
const fileBytes = Buffer.from(fs.readFileSync(filepath)).toString('base64');
// Construct image redaction request
const request = {
  parent: `projects/${projectId}/locations/global`,
  byteItem: {
    type: fileTypeConstant,
    data: fileBytes,
  },
  inspectConfig: {
    minLikelihood: minLikelihood,
    infoTypes: infoTypes,
  },
  imageRedactionConfigs: imageRedactionConfigs,
};
// Run image redaction request
const [response] = await dlp.redactImage(request);
const image = response.redactedImage;
So normally, I'd get the file as a buffer, then pass it to the DLP function as above. But I'm no longer getting our files as buffers. Since many files are very large, we now get them from FilesStorage as streams, like so:
return FilesStorage.getFileStream(metaFileInfo1, metaFileInfo2, metaFileInfo3, fileId)
  .then(stream => {
    return {fileInfo, stream};
  })
The question is, is it possible to perform DLP image redaction on a stream instead of a buffer? If so, how?
I've found some other questions that say you can stream with ByteContentItem, and GCP's own documentation mentions "streams". But I've tried passing the stream returned from .getFileStream into the byteItem['data'] property above, and it doesn't work.

Chunking the stream up into buffers of an appropriate size is going to work best here. There are a number of approaches for building buffers from a stream that you can use, one of which is sketched below.
Potentially relevant: Convert stream into buffer?
(A native stream interface would be a good feature request; it just isn't there yet.)
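For instance, a minimal sketch (with a hypothetical streamToBuffer helper, assuming the collected file still fits comfortably in memory once buffered, and that projectId, fileTypeConstant, etc. come from the question's own snippet) that collects the stream into a single Buffer, base64-encodes it, and hands it to dlp.redactImage:
// Hypothetical helper, not part of the DLP client: collect a Node.js
// readable stream into a single Buffer.
const streamToBuffer = (stream) =>
  new Promise((resolve, reject) => {
    const chunks = [];
    stream.on('data', (chunk) => chunks.push(chunk));
    stream.on('error', reject);
    stream.on('end', () => resolve(Buffer.concat(chunks)));
  });

const stream = await FilesStorage.getFileStream(metaFileInfo1, metaFileInfo2, metaFileInfo3, fileId);

// Base64-encode the collected bytes, exactly as with readFileSync above
const fileBytes = (await streamToBuffer(stream)).toString('base64');

const [response] = await dlp.redactImage({
  parent: `projects/${projectId}/locations/global`,
  byteItem: { type: fileTypeConstant, data: fileBytes },
  inspectConfig: { minLikelihood, infoTypes },
  imageRedactionConfigs,
});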

Related

How can I asynchronously decompress a gzip file in a JavaScript web worker?

I have a vanilla JS script running in-browser (Chrome, and only Chrome needs to be supported - if browser support is of any importance).
I want to offload a 15 MB gzip file to a web worker and unzip that file asynchronously, then return the uncompressed data back to the main thread, in order not to freeze the main application thread during the decompression procedure.
When unzipping in the main thread, I'm using the JSXCompressor library, and that works fine. However, as this library references the window object, which isn't accessible from a worker context, I can't use the same library injected into the worker code (running the decompression raises an exception on the first line of the library mentioning "window", saying it's undefined).
The same is true for other JS libraries I've managed to dig up in an afternoon of googling, like zlib or the more modern Pako. They all in one way or another seem to reference a DOM element, which raises exceptions when used in a web worker context.
So my question is - is anyone aware of a way I can pull this off, either by explaining to me what I seem to be getting wrong, through a hack, or by providing me with a link to a JS library that can function in this use case (I need only decompression, standard gzip)?
Edit: I'm also interested in any hack that can leverage built-in browser capabilities for ungzipping, as is done for HTTP requests.
Thanks a bunch.
I've authored a library, fflate, to accomplish exactly this task. It offers asynchronous versions of every compression/decompression method it supports; rather than blocking the event loop, the library delegates the processing to a separate thread. You don't need to manually create a worker or specify paths to the package's internal workers, since it generates them on the fly.
import { gunzip } from 'fflate';

// Let's suppose you got a File object (from, say, an input)
const reader = new FileReader();
reader.onloadend = () => {
  // reader.result holds the raw, still-compressed bytes of the file
  const typedArrayCompressed = new Uint8Array(reader.result);
  gunzip(typedArrayCompressed, (err, decompressedResult) => {
    // This is a Uint8Array
    console.log('Decompressed output:', decompressedResult);
  });
};
reader.readAsArrayBuffer(fileObject);
Effectively, you need to convert the input to a Uint8Array, then convert the output to whatever format you want to use. For instance, FileReader is the most cross-platform solution for files, while fflate.strToU8 and fflate.strFromU8 handle string conversions.
P.S. This is actually still about as fast as the native CompressionStream solution from my tests, but will work in more browsers. If you want streaming support, use fflate's AsyncGunzip stream class.
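A rough sketch of what that streaming route might look like, under the assumption that fflate's stream classes expose an ondata handler (with an (err, chunk, final) signature for the async variants) and a push(chunk, final) method; double-check the fflate docs for the exact API:
import { AsyncGunzip } from 'fflate';

// Assumption: async stream handlers receive (err, chunk, final)
const gz = new AsyncGunzip();
gz.ondata = (err, chunk, final) => {
  if (err) throw err;
  // chunk is a decompressed Uint8Array; final marks the last one
  console.log('decompressed chunk of', chunk.length, 'bytes', final ? '(last)' : '');
};

// compressedChunk1/compressedChunk2 are placeholder Uint8Arrays of gzipped data;
// feed them in as they arrive and flag the last piece with `true`
gz.push(compressedChunk1);
gz.push(compressedChunk2, true);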
There is a new web API, the Compression Streams proposal, which is already implemented in Chrome and which does exactly this: asynchronously compress/decompress data.
It should support both the deflate and gzip algorithms, and should use native implementations, so it should be faster than any library.
So in Chrome you can simply do:
if( "CompressionStream" in window ) {
(async () => {
// To be able to pass gzipped data in stacksnippet we host it as a data URI
// that we do convert to a Blob.
// The original file is an utf-8 text "Hello world"
// which is way bigger once compressed, but that's an other story ;)
const compressed_blob = await fetch("data:application/octet-stream;base64,H4sIAAAAAAAAE/NIzcnJVyjPL8pJAQBSntaLCwAAAA==")
.then((r) => r.blob());
const decompressor = new DecompressionStream("gzip");
const decompression_stream = compressed_blob.stream().pipeThrough(decompressor);
const decompressed_blob = await new Response(decompression_stream).blob();
console.log("decompressed:", await decompressed_blob.text());
})().catch(console.error);
}
else {
console.error("Your browser doesn't support the Compression API");
}
Obviously, this is also available in Web Workers, but since the API is designed to be entirely asynchronous and makes use of Streams, browsers should theoretically already be able to offload all the hard work to another thread on their own anyway.
Now, this is still a bit of a future solution, and for today you still might want to use a library instead.
However, we don't do library recommendations here, but I should note that I personally use pako in Web Workers on a daily basis with no problem, and I don't see why a compression library would ever need the DOM, so I suspect you are doing something wrong™.
(async() => {
const worker_script = `
importScripts("https://cdnjs.cloudflare.com/ajax/libs/pako/1.0.11/pako_inflate.min.js");
self.onmessage = async (evt) => {
const file = evt.data;
const buf = await file.arrayBuffer();
const decompressed = pako.inflate(buf);
// zero copy
self.postMessage(decompressed, [decompressed.buffer]);
};
`;
const worker_blob = new Blob([worker_script], { type: "application/javascript" });
const worker_url = URL.createObjectURL(worker_blob);
const worker = new Worker(worker_url);
const compressed_blob = await fetch("data:application/octet-stream;base64,H4sIAAAAAAAAE/NIzcnJVyjPL8pJAQBSntaLCwAAAA==")
.then((r) => r.blob());
worker.onmessage = ({ data }) => {
console.log("received from worker:", new TextDecoder().decode(data));
};
worker.postMessage(compressed_blob);
})().catch(console.error);

Send XMLHttpRequest data in chunks or as ReadableStream to reduce memory usage for large data

I've been trying to use JS's XMLHttpRequest Class for file uploading. I initially tried something like this:
const file = thisFunctionReturnsAFileObject();
const request = new XMLHttpRequest();
request.open('POST', '/upload-file');
const rawFileData = await file.arrayBuffer();
request.send(rawFileData);
The above code works (yay!), and sends the raw binary data of the file to my server.
However...... It uses a TON of memory (because the whole file gets stored in memory, and JS isn't particularly memory friendly)... I found out that on my machine (16GB RAM), I couldn't send files larger than ~100MB, because JS would allocate too much memory, and the Chrome tab would crash with a SIGILL code.
So, I thought it would be a good idea to use ReadableStreams here. It has good enough browser compatibility in my case (https://caniuse.com/#search=ReadableStream) and my TypeScript compiler told me that request.send(...) supports ReadableStreams (I later came to the conclusion that this is false). I ended up with code like this:
const file = thisFunctionReturnsAFileObject();
const request = new XMLHttpRequest();
request.open('POST', '/upload-file');
const fileStream = file.stream();
request.send(fileStream);
But my TypeScript compiler betrayed me (which hurt) and I received "[object ReadableStream]" on my server ಠ_ಠ.
I still haven't explored the above method too much, so I'm not sure if there might be a way to do this. I'd also appreciate help on this very much!
Splitting the request into chunks would be an optimal solution, since once a chunk has been sent, we can remove it from memory before the whole request has even been received.
I have searched and searched, but haven't found a way to do this yet (which is why I'm here...). Something like this in pseudocode would be optimal:
const file = thisFunctionReturnsAFileObject();
const request = new XMLHttpRequest();
request.open('POST', '/upload-file');
const fileStream = file.stream();
const fileStreamReader = fileStream.getReader();
const sendNextChunk = async () => {
  const chunk = await fileStreamReader.read();
  if (!chunk.done) { // chunk.done implies that there is no more data to be read
    request.writeToBody(chunk.value); // chunk.value is a Uint8Array
    sendNextChunk(); // keep going until the stream is exhausted
  } else {
    request.end();
  }
};
sendNextChunk();
I'd expect code like this to send the request in chunks and end the request when all chunks are sent.
The most helpful resource I tried, but didn't work:
Method for streaming data from browser to server via HTTP
Didn't work because:
I need the solution to work in a single request
I can't use RTCDataChannel; it must be a plain HTTP request (is there another way to do this than XMLHttpRequest?)
I need it to work in modern Chrome/Firefox/Edge etc. (no IE support is fine)
Edit: I don't want to use multipart-form (FormData Class). I want to send actual binary data read from the filestream in chunks.
You can't do this with XHR afaik. But the more modern fetch API does support passing a ReadableStream for the request body. In your case:
const file = thisFunctionReturnsAFileObject();
const response = await fetch('/upload-file', {
  method: 'POST',
  body: file.stream(),
});
However, I'm not certain whether this will actually use chunked encoding.
You are facing a Chrome bug where they set a hard limit of 256MB on the size of the ArrayBuffer that can be sent.
But anyway, sending an ArrayBuffer will create a copy of the data, so you should rather send your data as a File directly, since this will read the File exactly the way you wanted: as a stream, in small chunks.
So, taking your first code block, that would give:
const file = thisFunctionReturnsAFileObject();
const request = new XMLHttpRequest();
request.open('POST', '/upload-file');
request.send(file);
And this will work in Chrome too, even with files of a few gigabytes. The only limit you would face here would come earlier, from whatever processing you are doing on that File.
Regarding posting ReadableStreams, this will eventually come, but as of today, July 13th 2020, only Chrome has started working on an implementation, we web devs still can't play with it, and the specs are still struggling to settle on something stable.
But that's not a problem for you, since you wouldn't gain anything by doing so anyway. Posting a ReadableStream made from a static File is pointless; both fetch and XHR already do this internally.

node.js Transfer and saving files using TCP Server

I have a lot of devices sending messages to a TCP server written in Node. The main task of the TCP server is to route some of those messages to Redis so they can be processed by another app.
I've written a simple server that does the job quite well. The structure of the code is basically this (not the actual code, details hidden):
const net = require("net");
net.createServer(socket => {
  socket.on("data", buffer => {
    const data = buffer.toString();
    if (shouldRouteMessage(data)) {
      redis.publish(data);
    }
  });
});
Most of the messages are like: {"text":"message body"}, or {"lng":32.45,"lat":12.32}. But sometimes I need to process a message like {"audio":"...encoded audio..."} that spans several "data" events.
What I need in this case is to save the encoded audio into a file and send to redis {"audio":"path/to/audio-file.mp3"} where the route is the file with the audio data received.
One simple option is to store the buffers until I detect the end of the message and then save them all to a file, but that means, among other things, that I must keep the whole file in memory before saving it to disk.
I hope there are better options using streams and pipes. Any suggestions? (Some code examples would be nice.)
Thanks
I finally solved it, so I'm posting the solution here for documentation purposes (and, with some luck, to help others).
The solution was, indeed, quite simple: just open a write stream to a file and write the data packets as they are received. Something like this:
const net = require("net");
const fs = require("fs");

net.createServer(socket => {
  // keep these outside the "data" handler so they persist across packets
  let file = null;
  let filePath = null;

  socket.on("data", buffer => {
    const data = buffer.toString();
    if (shouldRouteMessage(data)) {
      // just publish the message
      redis.publish(data);
    } else if (isAudioStart(data)) {
      // create a write stream to a file and write the first data packet
      filePath = buildFilePath(data);
      file = fs.createWriteStream(filePath);
      file.write(data);
    } else if (isLastFragment(data)) {
      // if this is the last fragment, write it, close the file and publish the result
      file.write(data);
      file.close();
      redis.publish(filePath);
      file = filePath = null;
    } else if (isDataFragment(data)) {
      // just write (stream) it to the file
      file.write(data);
    }
  });
});
Note: shouldRouteMessage, isAudioStart, buildFilePath, isDataFragment, and isLastFragment are custom functions that depend on the kind of data.
In this way, the incoming data is streamed directly to the file, with no need to hold the contents in memory first. Node's streams rock!
As always, the devil is in the details. Some checks are necessary to, for example, ensure there's always an open file when you want to write to it (a minimal guard is sketched below). Remember also to set the proper encoding when converting to a string (for example, buffer.toString('binary') did the trick for me). Depending on your data format, shouldRouteMessage, isAudioStart... and all these custom functions can be more or less complex.
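For instance, a purely illustrative guard (the safeWrite name is my own, not part of the solution above) that drops fragments arriving while no file is open instead of crashing the server:
// Hypothetical helper: only write if a file stream is currently open
function safeWrite(file, data) {
  if (!file) {
    console.warn("dropping fragment received while no file is open");
    return false;
  }
  file.write(data);
  return true;
}
The file.write(data) calls in the fragment branches above would then become safeWrite(file, data).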
Hope it helps.

FFmpeg converting from video to audio missing duration

I'm attempting to load YouTube videos via their direct video URL (retrieved using ytdl-core). I load them using the request library. I then pipe the result to a stream, which is used as the input to ffmpeg (via fluent-ffmpeg). The code looks something like this:
var getAudioStream = function(req, res) {
  var requestUrl = 'http://youtube.com/watch?v=' + req.params.videoId;
  var audioStream = new PassThrough();
  var videoUrl;

  ytdl.getInfo(requestUrl, { downloadURL: true }, function(err, info) {
    res.setHeader('Content-Type', 'audio/x-wav');
    res.setHeader('Accept-Ranges', 'bytes');
    videoUrl = info.formats ? info.formats[0].url : '';
    request(videoUrl).pipe(audioStream);
    ffmpeg()
      .input(audioStream)
      .outputOptions('-map_metadata 0')
      .format('wav')
      .pipe(res);
  });
};
This actually works just fine, and the frontend successfully receives just the audio in WAV format and is playable. However, the audio is missing any information about its size or duration (and all other metadata). This also makes it unseekable.
I'm assuming this is lost somewhere during the ffmpeg stage, because if I load the video directly via the URL passed to request it loads and plays fine, and has a set duration/is seekable. Any ideas?
It isn't possible to know the output size or duration until the encode is finished. FFmpeg cannot know this information ahead of time in most cases. Even if it could, the way you are executing FFmpeg prevents you from accessing that extra information.
Besides, to support seeking you need to support range requests. This isn't possible either, short of encoding the file up to the byte requested and streaming from there on.
Basically, this isn't possible by the nature of what you're doing.

How do I access the data from a stream?

I'm working with this library: mTwitter
My problem is, when I want to use the streaming function:
twit.stream.raw(
  'GET',
  'https://stream.twitter.com/1.1/statuses/sample.json',
  {delimited: 'length'},
  process.stdout
);
I don't know how to access the JSON that gets written to process.stdout.
You could use a writable stream, from stream.Writable.
var stream = require('stream');
var fs = require('fs');

// This is where we will be "writing" the twitter stream to.
var writable = new stream.Writable();

// We listen for when the `pipe` method is called. I'm willing to bet that
// `twit.stream.raw` pipes the stream to a writable stream.
writable.on('pipe', function (src) {
  // We listen for when data is being read.
  src.on('data', function (data) {
    // Everything should be in the `data` parameter.
  });

  // Wrap things up when the reader is done.
  src.on('end', function () {
    // Do stuff when the stream ends.
  });
});

twit.stream.raw(
  'GET',
  'https://stream.twitter.com/1.1/statuses/sample.json',
  {delimited: 'length'},
  // Instead of `process.stdout`, you would pipe to `writable`.
  writable
);
I'm not sure if you really understand what the word streaming means here. In Node.js a stream is basically a file descriptor. The example uses process.stdout, but a TCP socket is also a stream, an open file is also a stream, and a pipe is also a stream.
So a streaming function is designed to pass the received data directly to a stream without needing you to manually copy the data from source to destination. Obviously this means you don't get access to the data. Think of streaming like pipes on unix shells. That piece of code is basically doing this:
twit_get | cat
In fact, in Node you can create virtual streams in pure JS, so it is possible to get the data: you just have to implement a stream (a minimal sketch follows). Look at the Node documentation of the stream API: http://nodejs.org/api/stream.html
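For example, a minimal sketch of such a custom writable stream (CallbackStream and onChunk are illustrative names of mine; parsing the length-delimited JSON frames is left out):
var stream = require('stream');
var util = require('util');

// A tiny writable stream that hands every chunk to a callback.
function CallbackStream(onChunk) {
  stream.Writable.call(this);
  this.onChunk = onChunk;
}
util.inherits(CallbackStream, stream.Writable);

// _write is the one method a Writable implementation must provide.
CallbackStream.prototype._write = function (chunk, encoding, done) {
  this.onChunk(chunk);
  done();
};

// Pass an instance to twit.stream.raw instead of process.stdout:
var sink = new CallbackStream(function (chunk) {
  console.log('got %d bytes', chunk.length);
});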
