Client-side javascript to extract patterns from online PDF document


I am trying to extract patterns from online PDFs using a client-side script (Tampermonkey / Greasemonkey, in Firefox or Chrome). The implementation can be browser-specific; I would like to get it working in either one.
I am able to use JS to extract the content and match on it manually in Firefox (which loads pdf.js automatically). E.g. on a PDF URL:
var matchList = document.body.innerText.match(/my_regex/gi);
I am now trying to port this into Greasemonkey for a user-script:
// ==UserScript==
// @name MyExtractor
// @version 1
// @grant none
// @include *.pdf
// ==/UserScript==
console.log("User script");
console.log(document.body.innerText); // this JS executed manually logs the PDF to text, but
alert("HI");
The script doesn't load. Is it possible to get a GM script to execute on a PDF URL in Firefox?
In Chrome, the PDF document seems to be embedded, so even with direct console JS I can't seem to get access to the content, e.g.:
> document.getElementsByTagName("embed")[0]
<embed name="some_id" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="some_id">
This is about as far as I have been able to get with Chrome - is there a way to get the PDF object based on the above element and extract text from it?
As for the JS, I do not necessarily need it to run directly on the PDF URL. It could also identify a page that has an anchor whose href points to a PDF, then fetch that file with a request and parse it, if there is some way to fetch and process it with a PDF library.
References used so far:
Execute a Greasemonkey script on every page, regardless of page-type (like foo.com/image.jpg)? - do I need to build an extension for this?
Extract text from pdf file using javascript (and followed some of the links) - specifically, I have tried to follow this: How to extract text from PDF in JavaScript - but I have not been able to create a reference to the PDF source, add the library to GM, and execute it as expected. Is this a good path to follow to solve the problems I am running into?
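One possible direction for the fetch-and-parse idea, as a minimal sketch rather than a tested userscript. It assumes pdf.js has already been made available to the script as an object called pdfjsLib (for example through a @require line, which is an assumption about the setup, and newer pdf.js builds may also need a worker source configured). extractMatches and textContentToString are hypothetical helper names; getDocument, getPage and getTextContent are the pdf.js calls being relied on.

```javascript
// Join the text items of one pdf.js text-content object into a single string.
function textContentToString(textContent) {
  return textContent.items.map(function (item) { return item.str; }).join(' ');
}

// Load a PDF through pdf.js and return every match of `pattern` in its text.
// `pdfjsLib` is passed in as a parameter (assumption: it was loaded elsewhere).
async function extractMatches(pdfjsLib, url, pattern) {
  const pdf = await pdfjsLib.getDocument(url).promise;
  const pages = [];
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n);
    pages.push(textContentToString(await page.getTextContent()));
  }
  // String.prototype.match returns null when nothing matches.
  return pages.join('\n').match(pattern) || [];
}
```

In a userscript this might be called as extractMatches(pdfjsLib, pdfUrl, /my_regex/gi).then(function (m) { console.log(m); }), where pdfUrl is the href taken from the PDF anchor. Cross-origin PDFs would probably need a different fetch path than the one pdf.js uses by default.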

Related

IE extension - Injecting Javascript file

I am developing an IE extension that works on sites opened in Internet Explorer. It is designed to work the same way as a Chrome extension. I am implementing the background function of a Chrome extension in C++, and the content script by injecting JS into the current web page. I am trying to load the content script via IHTMLWindow2::execScript on the document load event. Since I need to inject JS files directly, I tried the following.
I placed the JS file in a folder inside the project destination and tried to inject it using its physical path.
std::wstring filePath(_T("d:/xx/xxx/x/x/Content/myFile.js"));
scriptText = scriptText+ filePath + endScript;
VARIANT vrt = {0};
HRESULT hrexec = ifWnd->execScript(SysAllocString(scriptText.c_str()),L"javascript", &vrt);
The scriptText contains JavaScript code that creates a script element with type and src attributes. The filePath holds the physical path to the JS file. (I also tried a relative path, but it did not work.)
This did not work in IE9 because of a mixed-content issue; my research suggests that IE9 expects the JS file to be retrieved from a server rather than from a local physical path. The console shows the errors below.
SEC7111: HTTPS security is compromised by file:<filepath>
SCRIPT16388: Operation aborted
I am not sure whether there is any workaround for injecting JavaScript into the current DOM from a physical path. Please help me with this.
Also let me know whether there is any other way to inject the JS file from the current working directory into the DOM.

You don't have to inject a <SCRIPT> tag in the DOM.
If your js file contains:
var strHello = "Hello";
function SayHello() { alert( strHello ); }
you may just read the file into memory, construct a BSTR string with it, and pass that string to IHTMLWindow2::execScript.
Later, another call to execScript with the string SayHello(); will pop up the alert box. The code you injected is still there.

can firefox extension modify DOM of HTML document then save as HTML?

I am creating a firefox extension that lets the operator perform various actions that modify the content of the HTML document. The operator does not edit HTML, they take other actions and my extension modifies the document by inserting elements, adding attributes, and so forth.
When the operator is finished, they need to be able to save the HTML document as a file (or have my extension send it to an internet destination, but this is not required since they can email the saved file).
I thought maybe the changes made by the javascript code in my extension would be reflected in the HTML document, but when I ask the firefox browser to "view source" after making modifications, it displays the original HTML text.
My questions are:
#1: What is the easiest way for the operator to save the HTML document with all the changes my extension has made?
#2: What is the easiest way for the javascript code in my extension to process the HTML document contents and write to an HTML file on the local disk?
#3: Is any valid HTML content incapable of accurate representation in the saved file?
#4: Is the TreeWalker part of the solution (see below)?
A couple observations from my research so far:
I've read about the TreeWalker object, which seems to provide a fairly painless way for an extension to walk through everything (?or almost everything?) in the HTML document. But does it expose everything so everything in the original (and my modifications) can be saved without losing anything of importance?
Does the TreeWalker walk through the HTML document in the "correct order" --- the order necessary for my extension to generate the original and/or modified HTML document?
Anything obscure or tricky about these problems?

OK, so I am assuming here that you have access to the page DOM. What you need to do is basically make changes to the DOM, then get all the DOM code and save it as a file. Here is how you can download the page's HTML code. This will create an a tag which the user needs to click for the file to download.
var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
// Percent-encode the markup so characters such as # or % do not truncate the data: URL.
a.setAttribute('href', 'data:text/html;charset=utf-8,' + encodeURIComponent(code));
Now you can insert this a tag anywhere in the DOM and the file will download when the user clicks it.
Note: This is sort of a hack, this injects entire html of the file in the a tag, it should in theory work in any up to date browser (except, surprise, IE). There are more stable and less hacky ways of doing it like storing it in a file system API file and then downloading that file instead.
Edit: The document.querySelectorAll line accesses the page DOM. For it to work the document must be accessible. You say you are modifying DOM so that should already be there. Make sure you are adding the code on the page and not your extension code. This code will be at the same place as your DOM modification code, not your extension pages that can't access the DOM.
And as for the a tag, it will be inserted in the page. I skipped the steps since I assumed you already know how to manipulate DOM and also because I don't know where you would like to add the link. And you can skip the user action of clicking the link too, but it's a hack and only works in modern browsers. You can insert the a tag somewhere in the original page where user won't see it and then call the a.click() function to simulate a click event on the link. But this is not a legit way and I personally only use it on my practice projects to call click event listeners.
I can only test this on chrome not on FF but try this code, this will not require you to even add the a link to DOM. You need to add this next to the DOM manipulation code. This will work if luck is on your side :)
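var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
// Percent-encode the markup so characters such as # or % do not truncate the data: URL.
a.setAttribute('href', 'data:text/html;charset=utf-8,' + encodeURIComponent(code));
a.click();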

There is no easy way to do this with the web API alone, at least when you want a result that does not omit things like the doctype or comments. You could still write a serializer yourself that goes through document.childNodes and serializes each node according to its type (Element.outerHTML, Comment.data and so on).
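A minimal sketch of such a hand-written serializer, under the assumption that only element, comment and doctype nodes appear at the top level of the document. serializeNode and serializeChildNodes are hypothetical names, and the doctype formatting is a simplification.

```javascript
// Serialize one top-level node according to its nodeType.
function serializeNode(node) {
  switch (node.nodeType) {
    case 1: // ELEMENT_NODE
      return node.outerHTML;
    case 8: // COMMENT_NODE
      return '<!--' + node.data + '-->';
    case 10: { // DOCUMENT_TYPE_NODE
      var s = '<!DOCTYPE ' + node.name;
      if (node.publicId) s += ' PUBLIC "' + node.publicId + '"';
      if (node.systemId) s += (node.publicId ? '' : ' SYSTEM') + ' "' + node.systemId + '"';
      return s + '>';
    }
    default: // other node types are skipped in this sketch
      return '';
  }
}

// Walk document.childNodes in order and join the serialized pieces.
function serializeChildNodes(doc) {
  return Array.prototype.map.call(doc.childNodes, serializeNode).join('\n');
}
```

Calling serializeChildNodes(document) in page context would return the doctype, any top-level comments and the html element's outerHTML as one string. Whitespace between top-level nodes is not preserved.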
Luckily, you're writing a Firefox add-on, so you have access to a lot more (powerful) stuff.
While still not 100% perfect, the nsIDocumentEncoder implementations will produce pretty decent results, that should only differ in some whitespace and explicit charset declaration at most (everything else is a bug).
Here is an example on how one might use this component:
function serializeDocument(document) {
    const {
        classes: Cc,
        interfaces: Ci,
        utils: Cu
    } = Components;
    let encoder = Cc['@mozilla.org/layout/documentEncoder;1?type=text/html'].createInstance(Ci.nsIDocumentEncoder);
    encoder.init(document, 'text/html', Ci.nsIDocumentEncoder.OutputLFLineBreak | Ci.nsIDocumentEncoder.OutputRaw);
    encoder.setCharset("utf-8");
    return encoder.encodeToString();
}
If you're writing an SDK add-on, stuff gets more complicated as the SDK abstracts some important stuff away. You'll need to go through the chrome module, and also figure out the active window and tab yourself. Something like Services.wm.getMostRecentWindow("navigator:browser").content.document (Services.jsm) should do the trick.
In XUL overlay add-ons, content.document should suffice to get the document of the currently active tab, and you have Components access already.
Still, you need to let the user choose a file destination, usually through nsIFilePicker and then actually write the file, by using something like a file stream or the fully async OS.File API.

Looks like I get to answer my own question, thanks to someone in mozilla #extdev IRC.
I got totally faked out by "view source". When I didn't see my modifications in the window displayed by "view source", I assumed the browser would not provide the information.
However, guess what? When I "file" ===>> "save page as...", then examine the page contents with a plain text editor... sure enough, that contained the modifications made by my firefox extension! Surprise!

A browser has no direct write access to the local filesystem. The only read access it has is when the user explicitly provides a file:// URL (see note 1 below).
In your case, we are explicitly talking about javascript - which can read and write cookies and local storage. It can also send stuff back to the server and retrieve it, e.g. using AJAX.
Stuff you put in local storage/cookies is effectively not accessible to other programs (such as email clients).
It is possible to create very long mailto: URLs (see note 2), but this only handles inline content in the email, and you're going to run into all sorts of encoding issues that you're not ready to deal with.
Hence I'd recommend pursuing storage serverside via AJAX - and look at local storage once you've got this sorted/working.
Note 1: this is not strictly true. a trusted, signed javascript has access to additional functions which may include direct file access.
Note 2: the limit depends on the browser and the email client (Lotus Notes truncates the content rather a lot).

Windows 8 Metro open Local files (.html)

I want to open a local HTML file from a Windows 8 Metro (JavaScript) app.
I tried doing it this way: http://msdn.microsoft.com/en-us/library/windows/apps/hh701484.aspx . It works fine as long as I give it an actual http address, but as soon as I replace that with my local file path, the success value returned is false every time.
Any help?

You can use the Storage APIs to read all the HTML in a file, then create a DOM element and set its innerHTML. (This is much easier if you use jQuery to manipulate the DOM.)
I've got an example of something similar - where I read files from the app's local storage directory, and show the HTML in a web browser control. The example is in C# / XAML, but a similar logic can be used (without the need for a web browser control - since your app would be running inside a host that can directly show HTML like a browser):
http://krishnanadiminti.blogspot.com.au/2012/09/howto-provide-in-app-help-using-html.html
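The read-then-set-innerHTML idea could look roughly like this in a JavaScript app. This is a sketch under assumptions: showLocalHtml is a hypothetical helper, the WinRT namespace is passed in as a parameter (in a Store app it would be the global Windows object), and the file is assumed to ship inside the app package.

```javascript
// Read an HTML file from the app package and place its markup in `target`.
// `winrt` is the WinRT namespace object (the global `Windows` in a Store app).
function showLocalHtml(winrt, uriString, target) {
  var uri = new winrt.Foundation.Uri(uriString);
  return winrt.Storage.StorageFile.getFileFromApplicationUriAsync(uri)
    .then(function (file) {
      // Read the whole file as text.
      return winrt.Storage.FileIO.readTextAsync(file);
    })
    .then(function (html) {
      target.innerHTML = html;
      return html;
    });
}
```

A call might look like showLocalHtml(Windows, 'ms-appx:///help/index.html', document.getElementById('host')), where both the path and the element id are made up for illustration. Store apps may reject script-bearing markup assigned to innerHTML, so this is likely to suit static content best.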

dynamically create greasemonkey script

I'm trying to create a dynamic GM script. Here's what I thought would do it
win = window.open('myScript.user.js');
win.document.writeln('// ==UserScript==');
win.document.writeln('// @name sample script');
win.document.writeln('// @description alerts hi');
win.document.writeln('// @include http://www.google.com/*');
win.document.writeln('// ==/UserScript==');
win.document.writeln('');
win.document.writeln('(function(){alert("hi");})()');
win.document.close();
Well it doesn't. Anyone have any ideas how to go about doing this?

You cannot dynamically create Greasemonkey scripts with Greasemonkey (alone).
A GM script is not part of the HTML page, so writing GM code to a page will never work. The script needs to be installed into GM's script management system.
A GM script cannot write to the file system, nor access sufficient browser chrome to install a script add-on.
You might be able to write a GM script that posts other scripts to a server, and then sends the browser to that server. GM would then prompt the user to install the new script.
You might be able to write a browser add-on that could write GM scripts, but I suspect that this approach will be difficult.
You probably could write a Python (or C, VB, etc.) program that generates GM scripts for installation. With extra work, such a program could probably automatically install the script, too.
Why do you want to dynamically create Greasemonkey scripts, anyway? There may be a simpler method to accomplish the true goal.
Update for OP comment/clarification:
Re: "I want to be able to have a user select an element to get blocked and then create a script that sets that element's display to none on all sites from that domain"...
One way to do that:
Store domain and selector pairs using GM_setValue().
The script would, first thing, check to see if it had a value stored for the current page's domain or URL (using GM_getValue() or GM_listValues()).
If a match was found, hide the element(s) as specified in the selector.
Note that, depending on the element, the excellent Adblock Plus extension may be able to block the element much more elegantly (saves bandwidth/DL-time too).
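The storage-based steps above can be sketched as follows. This is only an illustration: the storage key and the function names are hypothetical, the rules are kept as one JSON string, and GM_getValue / GM_setValue are passed in as parameters so the helpers do not depend on how the script was granted them.

```javascript
var STORE_KEY = 'hiddenSelectors'; // hypothetical storage key

// Return the stored { domain: [selector, ...] } map.
function loadRules(getValue) {
  return JSON.parse(getValue(STORE_KEY, '{}'));
}

// Remember that `selector` should be hidden on `domain`.
function addRule(getValue, setValue, domain, selector) {
  var rules = loadRules(getValue);
  var list = rules[domain] || (rules[domain] = []);
  if (list.indexOf(selector) === -1) list.push(selector);
  setValue(STORE_KEY, JSON.stringify(rules));
}

// Hide every element that matches a rule stored for `domain`.
// Returns the number of selectors that were applied.
function applyRules(getValue, doc, domain) {
  var selectors = loadRules(getValue)[domain] || [];
  selectors.forEach(function (selector) {
    Array.prototype.forEach.call(doc.querySelectorAll(selector), function (el) {
      el.style.display = 'none';
    });
  });
  return selectors.length;
}
```

In the script itself, the first statement could be applyRules(GM_getValue, document, location.hostname), and the element picker would call addRule(GM_getValue, GM_setValue, location.hostname, selector) once the user has chosen something to block. The script needs the matching @grant lines for both functions.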

Parse Greasemonkey metadata and/or grab comments from within a function

function blah(_x)
{
    console.info(_x.toSource().match(/\/\/\s*@version\s+(.*)\s*\n/i));
}
function foobar()
{
    // ==UserScript==
    // @version 1.2.3.4
    // ==/UserScript==
    blah(arguments.callee);
}
foobar();
Is there any way to do this using JavaScript? I want to detect the version number and other attributes in a Greasemonkey script, but as I understand it, .toSource() and .toString() strip out comments [1].
I don't want to wrap the header block in <><![CDATA[ ]><> if I can avoid it, and I want to avoid having to duplicate the header block outside of the comments if possible.
Is this possible? Are there alternatives to toSource() / .toString() that would make this possible?
[1] - http://isc.sans.edu/diary.html?storyid=3231

There is currently no really good way for a Greasemonkey script to know its own metadata (or comments either). That is why every "autoupdate" script (like this one) requires you to set extra variables so that the script will know its current version.
As aularon said, the only way to get the comments from a JS function is to parse the source HTML of the <script> tag or of the file.
However, there is a trick that might work for you. You can read in your own GM script as a resource and then parse that source.
For example:
Suppose your script was named MyTotallyKickassScript.user.js.
Now add a resource directive to your script's metadata block like so:
// @resource MeMyself MyTotallyKickassScript.user.js
Notice that there is no path information for the file; GM will use a relative path to copy the resource, one time, when the script is first installed.
Then you can access the script's code using GM_getResourceText(), like so:
var ThisFileSource = GM_getResourceText ("MeMyself");
//Optional for Firebug users: console.log (ThisFileSource);
You can parse ThisFileSource to get the comments you want.
A script that parses Greasemonkey metadata from a source file is here. You should be able to adapt it with little effort.
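As a rough idea of what that parsing could look like, here is a minimal sketch. It assumes the usual metadata syntax of one "// @key value" line per entry between the ==UserScript== markers; parseMetadata is a hypothetical name and this is not the linked script.

```javascript
// Parse the ==UserScript== block of a script's source into { key: [values] }.
function parseMetadata(source) {
  var block = /\/\/\s*==UserScript==([\s\S]*?)\/\/\s*==\/UserScript==/.exec(source);
  var meta = Object.create(null); // no inherited keys such as "constructor"
  if (!block) return meta;
  block[1].split(/\r?\n/).forEach(function (line) {
    // Matches lines of the form: // @key value
    var m = /^\s*\/\/\s*@(\S+)\s*(.*?)\s*$/.exec(line);
    if (m) (meta[m[1]] = meta[m[1]] || []).push(m[2]);
  });
  return meta;
}
```

With the resource trick above, parseMetadata(GM_getResourceText("MeMyself")).version[0] would then hold the version string. Values are kept in arrays because keys such as @include can appear more than once.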

The JavaScript engine ignores comments, so the only way to do that is to string-process the <script> element's innerHTML or, if the script is an external file, to string-process the response of an AJAX request that fetches the .js file.
