Javascript query-able indexing of offline HTML docs

I posted a similar question earlier but don't think I explained my requirements very clearly. Basically, I have a .NET application that writes out a bunch of HTML files ... I additionally want this application to index these HTML files for full-text searching in a way that javascript code in the HTML files can query the index (based on search terms input by a user viewing the files offline in a web browser).
The idea is to create all this and then copy to something like a thumb drive or CD-ROM to distribute for viewing on a device that has a web browser but not necessarily internet access.
I used Apache Solr for a proof of concept, but that needs to run a web server.
The closest I've gotten to a viable solution is JSSindex (jssindex.sourceforge.net), which uses Lush, but our users' environment is Windows and we don't want to require them to install Cygwin.

It looks like your main problem is making the index accessible to local HTML. A cheap way to do it: put the index in a JS file and reference it from the HTML pages.
    var index = [ { word: "home", files: ["f.html", "bb.html"] }, /* ... */ ];
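A rough sketch of how a page could then query that index client-side; the element ids and the search helper below are placeholders, not part of the original suggestion:

    // Look up the files containing a search term in the prebuilt index.
    // Assumes the generated index.js (the "var index = ..." above) is loaded via a <script> tag.
    function searchIndex(term) {
      term = term.toLowerCase();
      for (var i = 0; i < index.length; i++) {
        if (index[i].word === term) { return index[i].files; }
      }
      return [];
    }

    // Wire it to a text box on the offline page (ids are made up).
    document.getElementById("search-btn").onclick = function () {
      var files = searchIndex(document.getElementById("search-box").value);
      document.getElementById("results").innerHTML = files
        .map(function (f) { return '<li><a href="' + f + '">' + f + '</a></li>'; })
        .join("");
    };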

Ladders could be a solution, as it provides on-the-spot indexing, but with 1,000 files or more I don't know how well it would scale... Sadly, I'm not sure JS is the answer here. I'd go for a custom (compiled) app that serves as both front-end (HTML display) and back-end (text search and indexing).

Use a trie - they're ridiculously compact and very scalable - dead handy for text matching.
There is a great article covering performance and design strategies. They're slower to boot up than a dictionary, but take up a lot less room, particularly when you're working with larger datasets.
I'd tackle it as follows:
In your .NET code, index all the keywords that are important to you (track their document and offset).
Generate your trie structure using an alpha-sorted list of keywords.
Decorate the terminal nodes with information about the documents in which the words they represent can be found.
    C
     A
      R, T → [{docid, [hit offsets]}, ...]
You don't have to store the offsets, but it would allow you to search for words by proximity or order.
Your .NET guys could build the trie (there is plenty of sample code around for tries).
It will take a while to generate the map, but once it's done and you've serialised it to JSON, your JavaScript application will race through it.
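A minimal sketch of what the serialised trie and the JavaScript lookup might look like; the JSON shape here is an assumption, not something from the original answer:

    // Each node maps a character to a child node; terminal nodes carry a "docs" list.
    var trie = {
      c: { a: {
        r: { docs: [{ docid: "f.html", offsets: [12, 97] }] },
        t: { docs: [{ docid: "bb.html", offsets: [3] }] }
      } }
    };

    function lookup(word) {
      var node = trie;
      for (var i = 0; i < word.length; i++) {
        node = node[word[i]];
        if (!node) { return []; }      // prefix not present in the index
      }
      return node.docs || [];          // only terminal nodes carry document hits
    }

    lookup("car");   // → [{ docid: "f.html", offsets: [12, 97] }]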

Related

OCR a scanned file and retrieve the metadata

I am using Alfresco community 6.1.
I have thousands of invoices to scan, OCR (ideally with near-100% recognition), and then retrieve the needed metadata from (Partner, Invoice Number, Amount, Units, Currency, ...), all of this in Alfresco.
Based on the metadata retrieved, I need to do some operations on the invoices (move them to appropriate folders, apply some workflows, ...).
As a first approach:
For the OCR I used the Alfresco Simple OCR Action, but the result is not very accurate (far from 100%).
For retrieving the results, I convert the OCRed PDF to a plain text file and then search its content using JavaScript with document.content. But since the OCR is not accurate, I can't tell whether this is the best way to search inside the document.
So my questions are:
How can I make the OCR results more accurate?
How do I retrieve the important data from the invoice? Is the method I'm using good enough, or too poor for this kind of processing?
I'm using pdfsandwich, and my alfresco-global.properties is:
ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux
I'm afraid this question is off topic: https://stackoverflow.com/help/on-topic
Some input anyway:
I highly recommend doing all the OCR/classification/extraction outside of / before storing the PDFs in Alfresco.
The technical term for what you're looking for is: Document Capture.
If you really expect to classify your scanned docs and extract the data from inbound documents (whose structure you can't control), the solutions are quite expensive and licensed per page/period. Market leaders in that area are Kofax and Abbyy.
If you can control the document structure / if the structure of the documents is fixed, you could use much cheaper solutions which take something like a dynamic template approach (relying on found anchor points, barcodes, regex matches). We use PDFmdx for this to automate qualified extraction.
Everything depends on the OCR quality. My personal opinion: the free/open source OCR components can't compete with the commercial solutions unless you have the time, expertise and resources to train and optimize them. Abbyy has a quite affordable CLI solution for Linux (ABBYY FineReader Engine CLI for Linux), but I'm sure there are others with similar results.
There is a quite nice and simple solution called AutoOCR, a REST/SOAP service providing a generic, configurable interface for using several OCR engines and configurations as a service. We implemented an Alfresco integration to act as an Alfresco transformer, but since the Alfresco transformer framework is deprecated, I'd recommend doing the whole OCR and recognition work before storing the documents in Alfresco.
Finally: if it is a one-time effort, try to find a service provider to do at least the OCR and maybe also the classification/extraction.
To answer your questions:
To improve OCR results you need to pre-process the image: noise removal, line removal, thresholding, etc. But none of that helps if the engine itself is not precise. Tesseract from version 4.0.0 works well enough for most applications.
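For a quick feel of what a newer Tesseract build gives you from script, something like tesseract.js can be used; this is only a rough sketch, and the file name and the invoice-number pattern are made-up placeholders:

    // Sketch: OCR an invoice image with tesseract.js and pull out one field.
    var Tesseract = require("tesseract.js");

    Tesseract.recognize("invoice-0001.png", "eng").then(function (result) {
      var text = result.data.text;
      // Naive field extraction; real invoices need far more robust rules.
      var match = text.match(/Invoice\s*(?:No\.?|Number)[:\s]*(\S+)/i);
      console.log("Recognized characters:", text.length);
      console.log("Invoice number guess:", match ? match[1] : "(not found)");
    });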
Your approach may work in some cases, but it will not work well on a large set of invoices. I suggest using one of the invoice data extraction services; in that case you don't need to worry about preprocessing and extraction yourself. You could use:
typless
Klippa
Taggun
Using such a service can save you a lot of headaches and time.
Disclaimer: I am one of the creators of typless. Feel free to suggest edits.

Migrating Javascript from the database

I am working on an old enterprise solution with these properties:
The solution has an MVC web application
The solution has a WCF service layer
The solution has javascript in the database, in the form of functions in a database column
The web application retrieves said javascript through the service layer and plugs it into certain pages
My team cannot modify the web application, nor the service layer
My team must write javascript by inserting functions into said database columns
This architecture leads to:
A very inefficient development loop
Very poor source control
I'd like to propose a solution for upgrading this, but here's where I fall a bit short on experience. My suggestion would be:
Migrate the javascript from the database to javascript files
Make some sort of hook in the web application for other teams' javascript files
My questions are:
Has anyone had this kind of problem and how did they solve it?
Is there an effective way to do this kind of javascript migration into files? My idea would be to write a small console program to do the migration
How would they make a hook to import our javascript files? My idea is to make a script bundle with some naming convention, so we can add scripts without them needing to change their code. Are there problems with this approach?
Any kind of input would be invaluable.
Edit:
Additional explanation:
The mechanism maps the javascript function names to certain DOM elements' event attributes and inlines the code right after the element
The functions are standalone functions, depending only on libraries already in the web application
The functions are grouped by a common form
So I suppose it would be better to group them into files bearing the form names; a rough sketch of such a migration script follows.
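A rough sketch of what that migration console program could look like in Node.js with the mssql package; the connection string, table and column names are invented for illustration:

    // Pull the functions out of the database and write one .js file per form.
    var sql = require("mssql");
    var fs = require("fs");

    async function migrate() {
      // Placeholder connection string and schema - adjust to the real database.
      var pool = await sql.connect("mssql://user:pass@dbserver/LegacyApp");
      var result = await pool.request()
        .query("SELECT FormName, FunctionBody FROM ScriptFunctions");

      var byForm = {};
      result.recordset.forEach(function (row) {
        (byForm[row.FormName] = byForm[row.FormName] || []).push(row.FunctionBody);
      });

      Object.keys(byForm).forEach(function (form) {
        fs.writeFileSync(form + ".js", byForm[form].join("\n\n"));
      });

      await sql.close();
    }

    migrate().catch(console.error);

A bundle built with a naming convention would then just pick up every generated per-form .js file.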
If these are just simple, static function definitions being inlined into the web page, then I suppose it might be possible to serialize/aggregate them all into a giant file and run something like prettier on it to make it readable.
That wouldn't be ideal to gain traction in your proposed migration, though. If the code has any volume to it at all, it would be nice to give some structure and order to maintain it.
It's already kind of a huge assumption that this javascript is just pure functions without any complex dependencies on each other, but it's possible that these pieces of Javascript work in isolation already if they are being pulled out of a database. It's hard to know without knowing more context. It seems unlikely that your life will be that easy.
If you managed to extract this monolithic Javascript file, the easiest thing to do would be to include it in a script tag for the entire site and be done with it. This could be a bad idea if the file is approaching ~MB size and slows your initial page load time.
Then again, once you have a bunch of functions in one file, you could probably do a lot there to optimize and reduce duplication of code.
This is still all conjecture because I don't know the mechanism by which your web application imports the javascript once it retrieves it from the database.

Is content in an encrypted Flash container file safer than as plain HTML/Javascript?

The current task is as follows:
It's about publishing spreadsheet tables online and making them accessible only to registered subscribers. The access to these spreadsheets is meant to be a paid service. Subscribers may access them online from wherever they are and do their calculations related to expenses or working hours and so on. These spreadsheets are developed in MS Excel. They are then converted into HTML/Javascript files via a macro app. The resulting Javascript code contains all the important formulas which need to be protected.
I know about Javascript "obfuscation" and "scrambling" but would like to find a better solution, since the two mentioned methods can be reversed.
The idea is to place the spreadsheet tables and the formulas for calculation inside of a Flash container file for protection. This Flash container file is not meant to link to or access any other external sources. The data which the users input into the spreadsheet would be saved in XML format.
Here is one tutorial which explains how to encrypt a Flash container file in order to prevent decompilers from making the content accessible:
http://code.tutsplus.com/tutorials/protect-your-flash-files-from-decompilers-by-using-encryption--active-3115
Here is a tool which claims to do the same, but it may be that it just obfuscates and does not go as far as the process in the tutorial above:
www.amayeta.com/software/swfencrypt/
There are some downsides of using Flash which I know about. I will not list them here; they are discussed in this forum. Consider that in this case the security aspect outweighs the downsides of Flash. The conversion of the HTML/Javascript content into Flash format will add more effort to this project.
I would like to ask these questions to this community:
Is there a converter that could help to translate Javascript into Actionscript?
Would it be necessary to translate the Javascript into PHP in order to use it within Flash?
Would the effort be worth it?
No, this won't be worth the effort, as the client has full control over the Flash runtime. This means it would not be difficult to extract the functions used. If you must protect your formulas, then you should only perform the calculations on your server (or some kind of well-protected cloud, if such a thing exists).
If your code runs fine in Flash or a browser, then it should not be hard to run it on a well-protected backend server.
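A minimal sketch of that server-side approach, assuming a Node.js/Express backend; the endpoint, field names and the formula itself are placeholders standing in for the real spreadsheet logic:

    // Keep the formula on the server; the browser only posts the inputs.
    var express = require("express");
    var app = express();
    app.use(express.json());

    app.post("/calculate", function (req, res) {
      var hours = Number(req.body.hours);
      var rate = Number(req.body.rate);
      // Placeholder formula standing in for the protected spreadsheet logic.
      res.json({ total: hours * rate * 1.21 });
    });

    app.listen(3000);

    // On the client, the spreadsheet page would do something like:
    //   fetch("/calculate", { method: "POST",
    //                         headers: { "Content-Type": "application/json" },
    //                         body: JSON.stringify({ hours: 40, rate: 25 }) })
    //     .then(function (r) { return r.json(); })
    //     .then(function (result) { showTotal(result.total); });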

Get html form from xsd

I have a quite complex XSD file that describes some objects (it's not important, but it's the DATEX II standard).
Do you know if there is an automatic way to create an HTML form that acts like a "wizard" to guide the user in creating XML objects as described in the XSD?
The answer to this depends on the intended user base, how you want your users to access your forms, and the technology stack you have in place already or you're willing to deploy.
If your users are quality control analysts, and the intent is to have them use that generated UI to manage test cases, then a handful of commercial tools have this ability. A quick search on Google for terms such as "generate ui forms from XSD to test web services" should give you the major players in this space on the first page (I won't name names, to avoid conflicts of interest). There are differences in how vendors approach this, which have to do with the time it takes to generate these forms from large bodies of XML Schema, and which in turn translate into different degrees of usability. Given what I see in DATEX, from a complexity perspective, you may have a hard time finding a free tool for this...
If your users are rather data entry specialists, then the above are not the tools you want them to use. Without knowing much about your environment (I see your java-ee tag, but it's still not clear how it relates to this task), one model could be a combination of InfoPath with SharePoint; while the process of generating the form is not fully automatic, it comes close. It is XSD-driven, in the sense that at design time you drag and drop the XSD onto a design form, which allows you to build some really nice UI. Follow their competition on your particular technology stack and you may have your answer. Or you can go to this site that lists XForms implementations; IBM's form designer, much like InfoPath, can use XML Schema for design, etc.
If this is for developers to get some XML, another alternative might be to go with an Excel-based (or SharePoint lists) approach and generate the XML from that data (you trade the cost of acquiring something for the cost of building tooling specific to your requirements, here assuming people who are really familiar with spreadsheets).
Given what the DATEX model looks like, you'll have to do some manual customization anyway, if you plan to use the extensibility model, or if you choose to build different forms for different scenarios, i.e. instead of one big form that gives you all descendants of the abstract payloadPublication in some drop-down, just a specific, simple form, e.g. MeasurementSiteTablePublication.
I know this is an old question, but I ran into this issue myself and couldn't find a satisfying open-source solution. I ended up writing my own implementation, XSD2HTML2XML. To my knowledge it is the most complete tool out there. It supports automatic generation of HTML forms from XSD, including the population of a form with XML data.
If you prefer an out-of-the-box solution rather than writing your own, please see my implementation, XML Schema Form Generator.

Reflective Web Application (WebIDE)

Preamble
So, this question has already been answered, but as it was my first question for this project, I'm going to continue to reference it in the other questions I ask about it.
For anyone who came from another question, here is the basic idea: Create a web app that can make it much easier to create other web applications or websites. To do this, you would basically create a modular site with "widgets" and then combine them into the final display pages. Each widget would likely have its own set of functions combined in a Class if you use Prototype or .prototype.fn otherwise.
Currently
I am working on getting the basics down: editing CSS, creating user JavaScript functions and dynamically finding their names/inputs, and other critical technical aspects of the project. Soon I will create a rough timeline of the features I wish to create, and soon after that I intend to start a blog of sorts to keep everyone informed of the project's status.
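As an illustration of the "dynamically finding their names/inputs" part, here is a small sketch of the kind of introspection that is possible; it parses Function.prototype.toString output and is only a heuristic (it will not handle default or destructured parameters):

    // Discover a user-defined function's name and parameter names from its source.
    function describeFunction(fn) {
      var source = fn.toString();
      var match = source.match(/^function\s*([^(\s]*)\s*\(([^)]*)\)/);
      if (!match) { return null; }
      return {
        name: fn.name || match[1],
        params: match[2].split(",").map(function (p) { return p.trim(); }).filter(Boolean)
      };
    }

    // Example with a made-up user widget handler:
    function onWidgetResize(width, height) { /* ... */ }
    describeFunction(onWidgetResize);
    // → { name: "onWidgetResize", params: ["width", "height"] }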
Original Question
Hello all, I am currently trying to formalize an idea I have for a personal project (which may turn into a professional one later on). The concept is a reflective web application. In other words, a web application that can build other web applications and is actively used to build and improve itself. Think of it as sort of a webapp IDE for creating webapps.
So before I start explaining it further, my question to all of you is this: What do you think would be some of the hardest challenges along the way and where would be the best place to start?
Now let me try to explain some of the aspects of this concept briefly here. I want this application to be as close to a WYSIWYG as possible, in that you have a display area which shows all or part of the website as it would appear. You should be free to browse it to get to the areas you want to work on and use a JavaScript debugger/console to ask "what would happen if...?" questions.
I intend for the webapps to be built up via components. In other words, the result would be a very modular webapp so that you can tweak things on a small or large scale with a fair amount of ease (generally it should be better than hand coding everything in <insert editor of choice>).
Once the website/webapp is done, this webapp should be able to produce all the code necessary to install and run the created website/webapp (so CSS, JavaScript, PHP, and PHP installer for the database).
Here are the few major challenges I've come up with so far:
Changing CSS on the fly (see the sketch after this list)
Implementing reflection in JavaScript
Accurate and brief DOM tree viewer
Allowing users to choose JavaScript libraries (i.e. Prototype, jQuery, Dojo, extJS, etc.)
Any other comments and suggestions are also welcome.
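On the first challenge (changing CSS on the fly), a minimal sketch using the CSSOM; the selector, property values and the element id are placeholders:

    // Keep generated rules in one dedicated <style> element so they are easy to find and undo.
    function setRule(selector, declarations) {
      var styleEl = document.getElementById("webide-overrides");
      if (!styleEl) {
        styleEl = document.createElement("style");
        styleEl.id = "webide-overrides";
        document.head.appendChild(styleEl);
      }
      var sheet = styleEl.sheet;
      sheet.insertRule(selector + " { " + declarations + " }", sheet.cssRules.length);
    }

    // Example: live-tweak a widget's header without reloading the page.
    setRule(".widget-header", "background: #334; color: #fff;");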
Edit 1: I really like the idea of AppJet and I will check it out in detail when I get the time this weekend. However, my only concern is that this is supposed to create code that can go onto other people's web servers, so while AppJet might be a great way for me to develop this app more rapidly, I still think I will have to generate PHP code for my users to put on their servers.
Also, when I feel this is ready for beta testers, I will certainly release it for free for everyone on this site. But I was thinking that out of beta I should follow a scheme similar to that of git: Free for open source apps, costs money for private/proprietary apps.
Conceptually, you would be building widgets, a widget factory, and a factory-making factory.
So, you would have to find all the different types of interactions that could be possible in making a widget, between widgets, within a factory, and between multiple widget-making factories to get an idea.
Something to keep on top of: how far would be too far to abstract?
I think you would need to be able to abstract a few layers completely for the application space itself. Then you'd have to build some management tool for it all, covering the presentation, workflow and data tiers.
Presentation: you are either receiving feedback or putting in input, usually as a result of clicking or entering something. A simple example is making dynamic web forms in a database: what would you have to store in a database about where the data comes from and goes to? This would probably make up the presentation layer, and it would probably be the best exercise to start with to get a feel for what you may need.
Workflow: it would be wise to build a simple workflow engine. I built one modeled on Windows Workflow and had it up and running in two days. It could set the initial event that should be run, and so on. From a designer perspective, I would imagine a Visio-type program to link these events. The events in the workflow would then drive the presentation tier (a bare-bones sketch of such an engine follows the data tier below).
Data: you would have to store data about the application as much as the data in the application. So form, event and data structures could possibly be handled by storing XML docs, depending on whether you need to work with any of the data in the forms or not. The data of the application could also be stored in empty XML templates that you fill in, or in actual tables. At that point you'd have to create a table-creation routine that would maintain a table for an app to the spec. Google has something like this with their Google DB online.
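A bare-bones sketch of the kind of workflow engine described above; the state names, events and handlers are all invented for illustration:

    // Minimal state-machine style workflow: states, events, transitions, actions.
    function Workflow(initialState) {
      this.state = initialState;
      this.transitions = {};                     // state -> event -> { next, action }
    }
    Workflow.prototype.on = function (state, event, next, action) {
      (this.transitions[state] = this.transitions[state] || {})[event] = { next: next, action: action };
      return this;
    };
    Workflow.prototype.fire = function (event, payload) {
      var t = (this.transitions[this.state] || {})[event];
      if (!t) { return false; }                  // event not valid in the current state
      if (t.action) { t.action(payload); }       // this is where the presentation tier gets driven
      this.state = t.next;
      return true;
    };

    // Example: a two-step form workflow.
    var wf = new Workflow("editing")
      .on("editing", "submit", "review", function (data) { console.log("validating", data); })
      .on("review", "approve", "done", function () { console.log("saving"); });
    wf.fire("submit", { name: "test" });         // wf.state is now "review"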
Hope that helps. Share what you end up coming up with.
Why use PHP?
Appjet does something really similar using 100% Javascript on the client and server side with Rhino.
This makes it easier for programmers to use your service, and easier for you to deploy. In fact even their data storage technique uses Javascript (simple native objects), which is a really powerful idea.
