How to remove Active Content from uploaded PDF documents? - javascript

I developed a kind of job application website and I only now realized that by allowing the upload of PDF files I'm at risk of receiving PDF documents containing encrypted data, active content (e.g. JavaScript, PostScript), and external references.
What could I use to sanitize or rebuild the content of every PDF file uploaded by users?
I want the companies that will later review the uploaded resumes to be able to open them in their browsers without being put at risk.

The simplest method to flatten or sanitise a PDF, using Ghostscript in safer mode, requires just one pass:
For a Windows user it is as "simple" as running the new 9.55 command:
"c:\path to gs9.55\bin\GSwin64c.exe" -sDEVICE=pdfwrite -dNEWPDF -o "Output.pdf" "Input.pdf"
For other platforms, replace gs9.55\bin\GSwin64c with your version 9.55 gs command.
It is not a fast method: around 40 pages per minute is not uncommon, so 4 pages take roughly 6 seconds to reprint, while a 400-page document could take 10 minutes.
One advantage is that the file size is often smaller once redundant content is removed. Image and font reconstruction may save storage, e.g. a 100 MB file may be reduced to 30 MB, but that is a general bonus, not the aim.
JavaScript actions are usually discarded; however, links such as bookmarks are usually retained, so be cautious, as the result can still contain rogue hyperlinks.
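If your upload handler runs on Node.js, a minimal sketch of shelling out to that one-pass command might look like the following (the gs binary name and paths are assumptions you will need to adapt to your setup):

// Sketch: re-write an uploaded PDF through Ghostscript's pdfwrite device.
// "gs" must be on the PATH; on Windows use the full path to gswin64c.exe.
const { execFile } = require('child_process');

function sanitizePdf(inputPath, outputPath, callback) {
    execFile('gs', [
        '-dSAFER',            // restrict Ghostscript's file system access
        '-sDEVICE=pdfwrite',
        '-dNEWPDF',           // new PDF interpreter flag mentioned above (9.55+)
        '-o', outputPath,
        inputPath
    ], error => callback(error));
}

// Usage: sanitizePdf('Input.pdf', 'Output.pdf', err => { if (err) throw err; });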
The next best suggestion is two passes via PostScript as discussed here https://security.stackexchange.com/questions/103323/effectiveness-of-flattening-a-pdf-to-remove-malware
GS[win64c] -sDEVICE=ps2write -o "%temp%\temp.ps" "Input.pdf"
GS[win64c] -sDEVICE=pdfwrite -o "Output.pdf" "%temp%\temp.ps"
But there is no proof that it is any different or more effective than the one-line approach.
Finally, the strictest method of all is to burst the PDF into image-only pages, then stitch the images back into a single PDF while concurrently running OCR to reconstruct a searchable PDF (this drops bookmarks). That can also be done using Ghostscript built with Tesseract support.
Note: visible external hyperlinks may then still be reactivated by the PDF reader's native link detection.

Related

Benchmarking loading and execution time of iframes from specific initiator/origin

I have a website with a master script X. This script loads external scripts (async) depending on the page type; each of those scripts is isolated in its own iframe, but those scripts may load other scripts. The website has a lot of pages that need to be benchmarked, so the process needs to be automated.
The website itself, the master script X and the iframes can't be changed.
The website loads other scripts/images which are not relevant but influence the loading+execution time of the specific iframes with origin X.
I need to know the loading and execution times of those iframes in absolute and relative time (e.g. master script X loads after 300ms on the page and takes 50ms to execute, then loads iframe1; iframe1 loads after 350ms, takes 100ms to execute and loads another script that loads after 450ms and takes 30ms, so iframe1 starts after 350ms and finishes after 480ms - repeat for every other iframe with origin X).
Is this possible with Node.js / Puppeteer and if so, which functions/libs can I utilize for the task?
You have two options:
Listen to the relevant events, save their timestamps and calculate the results
Use an existing library to generate a HAR file
Option 1: Listen to events
You can listen to events like request and response (or requestfinished) in puppeteer, note their timestamps (e.g. by using Date.now()) and compare them. You have full freedom (and responsibility) regarding which events to listen to.
Example:
page.on('request', request => {
    if (request.url() === 'URL of your frame') {
        const timestamp = Date.now();
        // store time somewhere
    }
});
// listen to other events and store their timings in addition
Depending on the complexity of your page you might want to use arrays to store the data or even a database.
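For example, a minimal (untested) sketch that collects per-URL start and end times into a Map could look like this:

// Record wall-clock start/end times for every request the page makes.
const timings = new Map();

page.on('request', request => {
    timings.set(request.url(), { start: Date.now() });
});

page.on('requestfinished', request => {
    const entry = timings.get(request.url());
    if (entry) entry.end = Date.now();
});

// after page.goto(...) has resolved:
// for (const [url, t] of timings) console.log(url, t.end - t.start, 'ms');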
Option 2: HAR file
Use a library like puppeteer-har or chrome-har to create a HAR file of the page loading process (I have not used any of these myself).
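For instance, a hedged sketch with puppeteer-har, based on its README (adjust to the library version you actually use), might look like this:

// Record the full page load into a HAR file with puppeteer-har.
const puppeteer = require('puppeteer');
const PuppeteerHar = require('puppeteer-har');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const har = new PuppeteerHar(page);

    await har.start({ path: 'page-load.har' }); // the HAR is written when stop() is called
    await page.goto('https://example.com');     // the page you want to benchmark
    await har.stop();

    await browser.close();
})();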
A quote explaining what a HAR file is (source):
The HAR file format is an evolving standard and the information contained within it is both flexible and extensible. You can expect a HAR file to include a breakdown of timings including:
How long it takes to fetch DNS information
How long each object takes to be requested
How long it takes to connect to the server
How long it takes to transfer assets from the server to the browser of each object
The data is stored as a JSON document and extracting meaning from the low-level data is not always easy. But with practice a HAR file can quickly help you identify the key performance problems with a web page, letting you efficiently target your development efforts at areas of your site that will deliver the greatest results.
There are multiple existing tools to visualize HAR files (like this one) and you can even drop a HAR file into a Chrome instance to analyze it.
If you want to automate the process even more, you can also write your own script to parse the HAR file. As it is a JSON file this is easy to do.
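As a hedged sketch, assuming the HAR file has been saved to disk as page-load.har, such a script could be as simple as:

// Read a HAR file and print each request's URL, start time and total duration.
const fs = require('fs');

const har = JSON.parse(fs.readFileSync('page-load.har', 'utf8'));

for (const entry of har.log.entries) {
    console.log(
        entry.request.url,
        entry.startedDateTime, // absolute start time of the request
        entry.time, 'ms'       // total time spent on the request
    );
}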
Firefox and Chrome offer to download the HAR archive of the loaded website. It's a JSON file with all requests, sources, targets and loading times.
I can't put my source code online because I don't own the copyright, but analysing this text file gave me everything I needed (I used recursive looping to handle the unknown depth of initiator requests).

Can I secure resources other than javascript in Nodewebkit?

I read this post
securing the source code in a node-webkit desktop application
I would like to secure my font files and I was thinking this snapshot approach might be a way. So instead of running this
nwsnapshot --extra-code application.js application.bin
Could I run
nwsnapshot --extra-code font_file font_file.bin
Then in package.json add this?
snapshot: 'font_file.bin'
Or would there be an alternative mechanism to reference the binary font? Would it be possible to convert the CSS file referencing the font into binary? Can anything else other than javascript be converted to binary?
One dumb thing you can do is to add your assets to the exe file as stated here:
https://github.com/nwjs/nw.js/wiki/How-to-package-and-distribute-your-apps#step-2a-put-your-app-with-nw-executable
Basically you have to create a zip of your content (including your package.json) and rename it to "package.nw"; then you can "merge" it into the exe file by typing this if you're on Windows (the link explains how to do this on other OSes):
`copy /b nw.exe+app.nw app.exe `
This is not a great security measure (because it can be opened as a zip file) but it is one step further.
Another thing that could add security to your files is to encrypt them and then add them dynamically through JS (decrypting them on the fly); for this you could use the encryption and decryption methods available in Node.
http://lollyrock.com/articles/nodejs-encryption/
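A minimal sketch of that idea, using Node's built-in crypto module (the key derivation and the font-injection snippet below are illustrative assumptions, and anyone who can read your packaged JS can still recover the key):

// Encrypt a font file at build time and decrypt/inject it at runtime.
const crypto = require('crypto');
const fs = require('fs');

const ALGORITHM = 'aes-256-cbc';

function encryptFile(inputPath, outputPath, key) {
    const iv = crypto.randomBytes(16);
    const cipher = crypto.createCipheriv(ALGORITHM, key, iv);
    const encrypted = Buffer.concat([iv, cipher.update(fs.readFileSync(inputPath)), cipher.final()]);
    fs.writeFileSync(outputPath, encrypted);
}

function decryptFile(inputPath, key) {
    const data = fs.readFileSync(inputPath);
    const iv = data.slice(0, 16);
    const decipher = crypto.createDecipheriv(ALGORITHM, key, iv);
    return Buffer.concat([decipher.update(data.slice(16)), decipher.final()]);
}

// At runtime in node-webkit: decrypt the font and inject it via a data URI.
const key = crypto.createHash('sha256').update('some passphrase').digest(); // hypothetical key handling
const fontData = decryptFile('font_file.bin', key).toString('base64');
const style = document.createElement('style');
style.textContent = "@font-face { font-family: 'MyFont'; src: url(data:font/woff2;base64," + fontData + "); }";
document.head.appendChild(style);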
However, the weakest point in the application is still the package.json file, and for this nw provides nothing.
Cheers!

javascript (DOJO) file caching - Client side

I want to implement caching of the JavaScript files (Dojo Toolkit) which are not going to change. Currently my home page takes about 15-17 seconds to load, and upon refresh it takes 5-6 seconds. Is there a way to reuse the cached files when the page is loaded in a new browser session? I do not want the browser to make requests to the server when the application home page loads in a new browser session. Also, is there an option to set the expiry to a certain number of days? I tried with the META tag and it is not helping much; either I'm doing something wrong or I'm not implementing it correctly.
I have implemented the Dojo compression toolkit and see a slight improvement in performance, but nothing significant.
Usually your browser should do that already anyway. Please check that caching is really turned on and not only for the session.
However, creating a custom Dojo build with an app profile that defines layers puts all your code together and bundles it with dojo.js (the files are still available independently). The result is just one HTTP request for all of the code (a larger file, but fetched only once). The speed gained from reduced HTTP requests is much more than a cache could ever provide.
For details refer to the tutorial: http://dojotoolkit.org/documentation/tutorials/1.8/build/
Caching is done by the browser, whose behaviour is influenced by the Cache-Control HTTP header (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html). Normally the browser asks whether a newer version of a resource is available, so you get one short request for each resource anyway.
From my experience, even with very aggressive caching, where the browser is instructed not to ask the server for a new version of a resource for a given period of time, checking the browser cache for such an immense number of resources is a costly process.
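If you do want that kind of expiry, the header has to be set by the server; here is a minimal sketch, assuming an Express-based Node server (adapt the path and duration to your setup):

// Serve the built JS files with a one-week Cache-Control max-age.
// Equivalent header on any other stack: "Cache-Control: public, max-age=604800"
const express = require('express');
const app = express();

app.use('/js', express.static('release/js', { maxAge: '7d' }));

app.listen(8080);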
The real solution is custom builds. You have written something about "dojo compression", so I assume you're acquainted with Dojo build profiles. They are rather poorly documented, but once you're successful with them, you should end up with one or more big layer files in the following format:
require({cache:{
    "name/of/dojo/resource": function() { ... here comes the content of JS file ... },
    ...
}});
It's a multi-definition file that inlines the definitions of all modules within a single layer, so loading such a file loads many modules in a single request. But you must load the layer.
In order to get the layers to run, I had to add an extra require to each of my entry JS files (those referenced in the header of the HTML file):
require(["dojo/domReady!"], function(){
    // load the layers
    require(['dojo/dojo-qm', /*'qmrzsv/qm'*/], function(){
        // here is your normal dojo code, all modules will be loaded from the memory cache
        require(["dojo", "dojo/on", "dojo/dom-attr", "dojo/dom-class"...],
            function(dojo, on, domAttr, domClass...){
                ....
            })
    })
})
It has significantly improved the performance. The bottleneck was loading a large number of little JavaScript modules, not parsing them. Loading and parsing all modules at once is much cheaper than loading hundreds of them on demand.
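For reference, a minimal layer definition in a build profile might look roughly like this (the module names and paths are hypothetical; see the build tutorial linked above for the full set of options):

// profile.js - sketch of a build profile with a single layer bundled into dojo.js
var profile = {
    basePath: "./src",        // hypothetical source directory
    releaseDir: "../release",
    action: "release",
    layers: {
        "dojo/dojo": {
            include: ["dojo/dojo", "app/main"], // "app/main" stands in for your entry module
            boot: true,
            customBase: true
        }
    }
};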

Photoshop on web server for scripts

So, I'm trying to think of the best way to solve a problem I have.
The problem is that I produce many websites for my job, and as CSS3 and HTML5 prove how powerful they are, I want to eliminate almost all images from my websites. For button icons and various other things I have a sprite image with all the icons on it that I just shift around depending on which icon I need. What I need to be able to do is recolour this image dynamically on the web server so that I don't have to open up Photoshop and recolour the icons manually.
I have done some research, and the only thing I've come across that has a chance of working the way I want is a Photoshop JavaScript. My question is: once I've written my script and it recolours my icon image, can it be run on a server so that, when a user clicks a button for example, the image is recoloured and saved to the server?
Would this require Photoshop being installed on the server? Is this even possible?
As you will know, Photoshop is only available for Mac or Windows.
As far as I know you can't install Photoshop on Windows Server (I tried it myself with CS4 - maybe it works with CS6 nowadays). But you could install PS on a Win 7 machine behind a firewall.
If you use a Windows machine you can use COM for automation. I tried it and it worked well.
I have done something similar to what you are thinking of, with two Macs and PS JavaScript (ImageMagick, PIL etc. weren't working for me because the job was too complicated), on a medium-traffic webpage. So I don't agree with Michael's answer.
First thing: think about caching the images and use low-traffic times to compute images that could be needed in the future. This really made things easier for me.
Second thing: experiment with image size, DPI etc. The smaller the images, the faster the process.
My Workflow was:
The web server writes to a database ("Hey, I need a new image with name path/bla.jpg").
An Ajax call checks whether the image is present. If not, a "processing your request" placeholder is shown.
A script running in an infinite loop on the Mac behind the firewall constantly checks whether a new image is needed.
If it finds one, it updates the database ("Mac One will compute this job"). This prevents every Mac from going for the same image.
The script calls Photoshop, and Photoshop computes the image.
The script uploads the image to the web server (I used rsync).
The Ajax call sees the new image and presents it to the user.
The script on the Mac updates the database: "image successfully created".
You will need some error-handling logic etc.; a rough sketch of that worker loop is shown below.
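Every helper, path and command in this sketch is a hypothetical stand-in for your own database, Photoshop and rsync glue code; it only illustrates the shape of the loop described above:

// Hypothetical Node.js worker loop for the Mac behind the firewall.
const { execFileSync } = require('child_process');

function fetchPendingJob() { /* query the database for a job with status = 'needed' */ return null; }
function claimJob(job)     { /* UPDATE jobs SET worker = 'mac-one' WHERE id = ? AND worker IS NULL */ return false; }
function markJobDone(job)  { /* UPDATE jobs SET status = 'done' WHERE id = ? */ }

function processJob(job) {
    // Run an AppleScript that tells Photoshop to execute your recolouring JSX script.
    execFileSync('osascript', ['recolor-job.applescript', job.icon, job.color]);
    // Copy the result to the web server so the Ajax poll can find it.
    execFileSync('rsync', ['-az', job.outputPath, 'webserver:/var/www/icons/']);
}

setInterval(() => {
    const job = fetchPendingJob();
    if (job && claimJob(job)) { // claiming first prevents two Macs taking the same job
        processJob(job);
        markJobDone(job);
    }
}, 2000);                       // poll every two seconds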
This problem has been bothering me for years too. All I ever wished for was a Photoshop server which I could talk to through an API and get things done. Well, I have built something that comes close: using the Generator plugin I can connect through a WebSocket and inject JavaScript into Photoshop. Technically you are able to do anything that can be done with the Photoshop scripting guide (including manipulating existing PSDs).
This library https://github.com/Milewski/generator-exporter exports all layers marked with a special syntax in the desired format.
This code could run on the server, using Node.js:
import { Generator } from 'generator-exporter'
import * as glob from 'glob'
import * as path from 'path'

const files = glob.sync(
    path.resolve(__dirname, '**/*.psd')
);

const generator = new Generator(files, {
    password: '123456',
    generatorOptions: {
        'base-directory': path.resolve(__dirname, 'output')
    }
})

generator.start()
    .then(() => console.log('Here You Could Grab all the generated images and send back to client....'));
However, I wouldn't recommend using this for heavy usage with many concurrent tasks, because it needs Photoshop installed locally and the Photoshop GUI will be initialized; that process is quite slow, so it doesn't really fit a busy workflow.

Javascript FileSaver saves empty files after writing large number of files

Background
I'm working on an internal project that basically can generate a video on the client side, but since there are no JavaScript video encoders I'm aware of, I'm just exporting each frame individually. I need to avoid uploading to the server; this is all happening on the client side.
Implementation
I'm using this FileSaver.js (more specifically, Chrome's webkit FileSystem API) to save a large number of PNGs generated by an HTML5 canvas. I set Chrome to automatically download to a specific folder, so when I hit 'Save' it just takes off and saves something like 20 images per second. This works perfectly for my purposes.
If I could use JSZip to compress all these frames into one file before offering it to the client to save, I would, but I haven't even tried because there's just no way the browser will have enough memory to generate ~8000 640x480 PNGs and then compress them.
Problem
The problem is that after a certain number of images, every file downloaded is empty. Chrome even starts telling me in the download bar that the file is 0 bytes. Repeated on the same project with the same export settings, the empty saves start at exactly the same time. For example, with one project, I can save the first 5494 frames before it chokes. (I know this is an insanely large number, but I can't help that.) I tried setting a 10ms delay between saves, but that didn't have any effect. I haven't tried a larger delay because exporting takes a very long time as it is.
I checked the blob.size and it's never zero. I suspect it's exceeding some quota, but there are no errors thrown; it just silently fails to either write to the sandbox or copy the file to the user-specified location.
Questions
How can I detect these empty saves? Prevent them? Is there a better way to do this? Am I just screwed?
EDIT: Actually, after debugging FileSaver.js, I realized that it's not even using webkitRequestFileSystem; it returns when it gets here:
if (can_use_save_link) {
    object_url = get_object_url(blob);
    save_link.href = object_url;
    save_link.download = name;
    if (click(save_link)) {
        filesaver.readyState = filesaver.DONE;
        dispatch_all();
        return;
    }
}
So, it looks like it's not even using the FileSystem API, and therefore I have no idea how to empty the storage before it's full.
EDIT 2: I tried moving the "if (can_use_save_link)" block to inside the "writer.onwriteend" function, and changing it to this:
if (can_use_save_link) {
    save_link.href = file.toURL();
    save_link.download = name;
    click(save_link);
} else {
    target_view.location.href = file.toURL();
}
The result is that I'm able to save all 8260 files (about 1.5GB total), since it's now using storage with a quota. Before, the files didn't show up in the HTML5 FileSystem, because I assume you didn't need to put them there if the anchor element supported the 'download' attribute.
I was also able to comment out the code that appends ".download" to the filename, and I had to provide an empty anonymous function as an argument to both instances of "file.remove()".
Use JSZip; it won't use too much memory if you disable compression (which is the default). To disable compression explicitly anyway, make sure to pass compression: "STORE" when calling zip.generate().
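A minimal sketch of that, using the older zip.generate() API named above (the frame naming and canvas variable are placeholders):

// Add each frame without compression, then save the archive once at the end.
var zip = new JSZip();

// inside your frame-export loop:
var dataUrl = canvas.toDataURL('image/png');
zip.file('frame-0001.png', dataUrl.split(',')[1], { base64: true });

// once all frames have been added:
var archive = zip.generate({ type: 'blob', compression: 'STORE' });
saveAs(archive, 'frames.zip'); // via FileSaver.js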
I ended up modifying FileSaver.js (see "EDIT 2" in the original post).
