OCR a scanned file and retrieve the metadata


I am using Alfresco Community 6.1.
I have thousands of invoices to scan and OCR (with near-100% recognition), and I need to retrieve the required metadata (Partner, Invoice Number, Amount, Units, Currency, ...), all within Alfresco.
Based on the retrieved metadata, I need to perform some operations on the invoices (move them to the appropriate folders, apply workflows, ...).
As a first approach:
For the OCR I used Alfresco Simple OCR Action, but the result is not very accurate (far from 100%).
To retrieve the results, I convert the OCRed PDF to a plain text file and then search its content using JavaScript with document.content ... But since the OCR is not accurate, I can't tell whether this is the best way to search inside the document.
So my questions are:
How can I make the OCR results more accurate?
How can I retrieve the important data from the invoice? Is the method I'm using good enough, or too poor for this kind of processing?
I'm using pdfsandwich, and my alfresco-global.properties is:
ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux
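For illustration, searching the extracted text for fields could look roughly like the sketch below. The patterns, the field names and the function extractInvoiceFields are all hypothetical and would have to be adapted to the real invoice layouts; the text would come from the plain-text rendition (e.g. document.content). It is written in ES5 style on the assumption that it runs in a repository script.

```javascript
// Hypothetical patterns: real invoices will need their own, per layout.
var FIELD_PATTERNS = {
  invoiceNumber: /invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-\/]+)/i,
  amount: /total\s*(?:amount)?\s*[:\-]?\s*([0-9]+(?:[.,][0-9]{3})*[.,][0-9]{2})/i,
  currency: /\b(EUR|USD|GBP)\b/
};

// Returns an object with one entry per field; null when nothing matched.
function extractInvoiceFields(text) {
  var result = {};
  Object.keys(FIELD_PATTERNS).forEach(function (name) {
    var match = FIELD_PATTERNS[name].exec(text);
    result[name] = match ? match[1] : null;
  });
  return result;
}

// Example with made-up OCR output:
var sample = "ACME Ltd\nInvoice No: INV-2019-0042\nTotal amount: 1,250.00 EUR";
var fields = extractInvoiceFields(sample);
```

A null value is a useful signal here: it tells the script that the OCR text did not contain a recognizable field, so the invoice can be routed to manual review instead of being filed wrongly.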

I'm afraid this question is off topic: https://stackoverflow.com/help/on-topic
Some input anyway:
I highly recommend doing all the OCR/classification/extraction outside of Alfresco, before storing the PDFs there
The technical term for what you're looking for is: Document Capture
If you really expect to classify your scanned docs and to extract the data for inbound documents (which you can't control in structure) the solutions are quite expensive and licensed per pages/period. Market leaders are Kofax and Abbyy in that area.
If you can control the document structure, or if the structure of the document is fixed, you could use considerably cheaper solutions that use something like a dynamic template approach (depending on found anchor points, barcodes, regex matches). We use PDFmdx for this to automate qualified extraction.
Everything depends on the OCR quality. My personal opinion: the free/open source OCR components can't compete with the commercial solutions if you don't have the time, expertise and resources to train and optimize them. Abbyy has a quite affordable CLI solution for Linux (ABBYY FineReader Engine CLI for Linux), but I'm sure there are others with similar results.
There is a quite nice and simple solution called AutoOCR which is a REST-/SOAP-Service providing a generic, configurable interface to use several ocr engines and configurations as a service. We implemented an Alfresco integration to act as an Alfresco Transformer but since the Alfresco Transformer framework is deprecated I'd recommend to do the whole ocr and recognition stuff before storing the documents in Alfresco
Finally: if it is a one time approach: Try to find a service provider doing at least the ocr and maybe also the classification/extraction.

To answer your questions.
To improve OCR results you need to pre-process the image. That includes noise removal, line removal, thresholding, etc. But none of that helps if the engine itself is not precise. Tesseract from version 4.0.0 onwards works well enough for most applications.
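As an illustration of the thresholding step only, here is a minimal sketch that binarizes a grayscale image using Otsu's method to pick the threshold. Otsu's method is one common choice, not something specific to Tesseract, and the function names and the one-byte-per-pixel input format are assumptions made for the example.

```javascript
// Pick a global threshold with Otsu's method: choose the gray level that
// maximizes the between-class variance of "dark" and "light" pixels.
function otsuThreshold(pixels) {
  const hist = new Array(256).fill(0);
  for (const p of pixels) hist[p]++;

  let sumAll = 0;
  for (let t = 0; t < 256; t++) sumAll += t * hist[t];

  let sumDark = 0;
  let countDark = 0;
  let bestVariance = -1;
  let best = 0;
  for (let t = 0; t < 256; t++) {
    countDark += hist[t];
    if (countDark === 0) continue;          // no dark pixels yet
    const countLight = pixels.length - countDark;
    if (countLight === 0) break;            // every pixel is already "dark"
    sumDark += t * hist[t];
    const meanDark = sumDark / countDark;
    const meanLight = (sumAll - sumDark) / countLight;
    const variance = countDark * countLight * (meanDark - meanLight) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      best = t;
    }
  }
  return best;
}

// Pixels above the threshold become white (255), the rest black (0).
function binarize(pixels, threshold) {
  const out = new Uint8Array(pixels.length);
  for (let i = 0; i < pixels.length; i++) {
    out[i] = pixels[i] > threshold ? 255 : 0;
  }
  return out;
}

// Tiny made-up "image": three dark pixels and three light ones.
const gray = Uint8Array.from([10, 12, 11, 200, 210, 205]);
const threshold = otsuThreshold(gray);
const blackAndWhite = binarize(gray, threshold);
```

A single global threshold like this tends to struggle with unevenly lit scans; in that case an adaptive (per-region) threshold is usually the next thing to try.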
Your approach may work in some cases but it will not work great on a large set of invoices. I suggest using some of the invoice data extraction services. In that case, you don't need to worry about preprocessing and extraction itself. You could use:
typless
Klippa
Taggun
Using such a service can save you a lot of headaches and time.
Disclaimer: I am one of the creators of typless. Feel free to suggest edits.

Related

Tesseract in a specific information

I want to scan a Spanish DNI, get some information from it and print it on the screen. A DNI has this form: [image]
And i want to take the fields DNI, Nombre and Apellidos (in the image, it would be 99999999R, CARMEN, ESPAÑOLA ESPAÑOLA).
I thought that the best way is using "cut tool" and use the OCR in the cut images. What do you think? I have to make the project in HTML/JS and I don't really know how to program this.
Thanks.

This is not an easy task and to do it, you need to do the following:
Make sure you "cut" the image precisely around the borders. This method needs to be robust to lighting conditions, low contrast situations, etc. Ideally, it should use advanced computer vision and ML techniques
Then you need to define where the individual fields are. This is also not an easy task, because the sizes and positions of the fields vary between different IDs.
In the final step, you need to have a very reliable OCR tool, one which would give you a low error rate, so that you actually have a benefit of doing this automatically, compared to just retyping all these fields manually. Although OCR seems like an easy problem today, it's still very hard, especially on ID documents which can be worn out and damaged and taken in weird lighting conditions.
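One cheap way to catch OCR errors in the DNI field specifically is the check letter: the letter is determined by the 8-digit number modulo 23. The sketch below is only an illustration of that check (the function name is made up), and it says nothing about the Nombre and Apellidos fields, which have no checksum.

```javascript
// Check letters for Spanish DNI numbers, indexed by (number mod 23).
const DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE";

// Returns true when the text is 8 digits followed by the matching check letter.
function isValidDni(text) {
  const match = /^(\d{8})([A-Z])$/.exec(text.trim().toUpperCase());
  if (!match) return false;
  const expected = DNI_LETTERS[Number(match[1]) % 23];
  return match[2] === expected;
}

// The specimen value from the question passes the check.
const specimenOk = isValidDni("99999999R");
// A single misread character makes it fail.
const misreadOk = isValidDni("99999998R");
```

If the check fails, the OCR result can be rejected and the user asked to rescan or type the value, rather than silently accepting a misread number.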
My company Microblink has spent years working on ID scanning, not just for Spanish DNIs, but also for many other document types (there are more than 5000 different types in the world).
If you are interested in reading how we're doing it, here are some of the materials:
Goodbye Templates
BlinkID v5
From OCR to DeepOCR
As for the "cut tool" - we do have a feature that allows you to automatically capture the image of a document and crop it around the edges of the document. We call it "Document capture" and it's a part of our BlinkID SDK.
As for the HTML/JS - it's not clear what exactly you need, but we do have React Native and Cordova plugins which allow you to build cross-platform mobile apps in JS, and we also have a Frontend SDK and Web API which allow you to scan documents in any browser.

How to implement firebase server side security

I'm currently working on a new google polymer web application and wondered if I should use firebase as the backend/db. I took a look at the project, made some test applications and really liked it! But to fully convince me, that firebase is the way to go I need the following questions answered:
I'm a little bit concerned about security: So, I know, that firebase uses read, write and validate to implement server side security. From the samples, I noticed that the validation basically is a one-line JS script, that represents a 'if'. As I'm planning to build a web e-commerce application I need to validate quite some inputs. Is there a possibility, to outsource the validation in a separate file, to make it more readable? Also I wondered, if there is a possibility, to test these server side validations, with for example unit tests?
I'm not 100% sure at the moment, that firebase can cover all of our use cases. Would it be possible/a good solution to use a "normal" backend for some critical functions and then persist the data from the backend in firebase?
I saw some nice polymer elements for firebase. Is firebase 100% supported in polymer/web components?
Is there an other way (like Java approach) to implement server business logic?
Is there a way, to define update scripts, so that new releases can easily be pushed to production?
Thanks & kind regards
Marc

So, I asked Firebase support and got the following answer:
Great to meet you.
I'm a little bit concerned about security: So, I know, that firebase uses read, write and validate to implement server side security. From the samples, I noticed that the validation basically is a one-line JS script, that represents a 'if'. As I'm planning to build a web e-commerce application I need to validate quite some inputs. Is there a possibility, to outsource the validation in a separate file, to make it more readable? Also I wondered, if there is a possibility, to test these server side validations, with for example unit tests?
You can implement extremely complex and effective rules using our security rules language. You can deploy security rules as part of your hosting deploy process, or via the REST API. It's not possible to break the contents into multiple files on the server, but you could certainly build your own process for merging multiple files into a single JSON result.
I'm not 100% sure at the moment, that firebase can cover all of our use cases. Would it be possible/a good solution to use a "normal" backend for some critical functions and then persist the data from the backend in firebase?
Generally speaking, synchronizing Firebase and a SQL back end is not very practical and they don't translate well. It's probably entirely redundant as well.
I saw some nice polymer elements for firebase. Is firebase 100% supported in polymer/web components?
I don't know what 100% supported means in this context. We offer a JavaScript SDK so they should play fine together.
Is there an other way (like Java approach) to implement server business logic?
We offer official SDKs in Java, Objective-C/Swift, Android, Node.js, JavaScript, and a REST API for use with other languages.
Is there a way, to define update scripts, so that new releases can easily be pushed to production?
I'm not sure what this means. Most likely the answer is no, since we don't provide a build process or any tools to release your software.
I hope that helps!
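For what it's worth, the "merge multiple files into a single JSON result" idea from the first answer could be sketched like this. The fragments are shown as in-memory objects to keep the example self-contained; in a real build step they would presumably be read from separate files. The function names and the example rules are hypothetical.

```javascript
function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Recursively copy `source` into `target`, refusing to overwrite a rule
// that another fragment has already defined differently.
function deepMerge(target, source, path) {
  for (const [key, value] of Object.entries(source)) {
    const here = path + "/" + key;
    if (!Object.prototype.hasOwnProperty.call(target, key)) {
      target[key] = isPlainObject(value) ? deepMerge({}, value, here) : value;
    } else if (isPlainObject(target[key]) && isPlainObject(value)) {
      deepMerge(target[key], value, here);
    } else if (target[key] !== value) {
      throw new Error("Conflicting rule at " + here);
    }
  }
  return target;
}

function mergeRules(...fragments) {
  const merged = {};
  for (const fragment of fragments) deepMerge(merged, fragment, "");
  return merged;
}

// Two hypothetical fragments, each of which could live in its own file.
const userRules = { user: { ".write": "auth != null" } };
const addressRules = {
  user: { address: { ".validate": "newData.isString()" } },
  orders: { ".read": "auth != null" },
};

// Security rules are deployed as one JSON document with a top-level "rules" key.
const rulesJson = JSON.stringify({ rules: mergeRules(userRules, addressRules) }, null, 2);
```

Throwing on conflicts is a deliberate choice in this sketch: two files silently overriding each other's rules is the kind of mistake that is hard to spot in a security configuration.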
I responded:
Thank you for the information, it helped me very much! After reading your response on question number 5 one further question popped into my mind:
…
5. Is there a way, to define update scripts, so that new releases can easily be pushed to production?
I'm not sure what this means. Most likely the answer is no, since we don't provide a build process or any tools to release your software.
Is there like a best practice on how to handle the database schema? I only have one web application (without apps, etc.) in my case... I expect, that the database will change drastically over time and releases. Should I write JS logic, that checks the current database version and update it, if it's necessary? Maybe this would make a nice feature...
For example: I deployed Version 1.0 of my application and everything works fine. After 3 months of programming I notice that the user data needs a further attribute: address, which is a 'not null' attribute. I now deploy Version 2.0 of my application and every newly registered user has an address, but the old users (from Version 1.0) do not have this field or a value.
How should I handle this?
Support responded:
Hi Marc,
There’s no best practice here, but your ideas seem fairly sound. You probably don’t need to check in your JavaScript. You can probably store a version number in the user’s profiles, and when they upgrade to the latest software, you can upgrade that in their profile data.
Then your validation rules could use something like the following:
{
  "user": {
    ".write": "newData.hasChild('address') || newData.child('appVersion').val() < 4",
    "address": {
      ".validate": "newData.isString() && newData.val().length < 1000"
    }
  }
}
So if you are concerned about versioning, this could be used to deal with legacy releases.
Another popular approach I’ve seen from devs is to do intermediate upgrades by duplicating data. Thus, you release an intermediate version that writes to the old path and to the new path with the updated data structure (which keeps the app working for old users till they upgrade). Once a reasonable percent of clients are upgraded, then release a final version that no longer does a dual write to the old structure and newer structure.
Of course, flattening data, while it makes joining and fetching data bit more of a pain, will make upgrades much easier as the modular data structure adapts more easily to changes. And, naturally, a pragmatic design where you wrap the various records in a class (e.g. the UserProfile class with getter/setter methods) makes transitions simpler as you can easily hack in versioning at one place.
Hope this helps someone :)
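To make the last suggestion a bit more concrete, here is a rough sketch of a UserProfile wrapper that keeps the versioning in one place. The version numbers, the MIGRATIONS table and the empty-string placeholder for a missing address are assumptions made for the example, not anything support prescribed.

```javascript
const CURRENT_VERSION = 2;

// One upgrade step per old version; each returns the data at the next version.
// Assumption: legacy users get an empty address until they fill one in.
const MIGRATIONS = {
  1: (data) => ({ ...data, address: data.address || "", appVersion: 2 }),
};

class UserProfile {
  constructor(raw) {
    this.data = UserProfile.upgrade(raw);
  }

  // Profiles without a version number are treated as version 1.
  static upgrade(raw) {
    let data = { ...raw, appVersion: raw.appVersion || 1 };
    while (data.appVersion < CURRENT_VERSION) {
      const step = MIGRATIONS[data.appVersion];
      if (!step) throw new Error("No migration from version " + data.appVersion);
      data = step(data);
    }
    return data;
  }

  get address() {
    return this.data.address;
  }

  // Mirrors the length limit used in the validation rule above.
  set address(value) {
    if (typeof value !== "string" || value.length >= 1000) {
      throw new Error("Invalid address");
    }
    this.data.address = value;
  }
}

// A record written by version 1.0 is upgraded when it is loaded.
const legacyUser = new UserProfile({ name: "Ann" });
// A record written by version 2.0 is left as it is.
const currentUser = new UserProfile({ name: "Bob", appVersion: 2, address: "1 Main St" });
```

The rest of the application only ever talks to UserProfile, so a later schema change means adding one entry to MIGRATIONS and bumping CURRENT_VERSION.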

Javascript query-able indexing of offline HTML docs

I posted a similar question earlier but don't think I explained my requirements very clearly. Basically, I have a .NET application that writes out a bunch of HTML files ... I additionally want this application to index these HTML files for full-text searching in a way that javascript code in the HTML files can query the index (based on search terms input by a user viewing the files offline in a web browser).
The idea is to create all this and then copy to something like a thumb drive or CD-ROM to distribute for viewing on a device that has a web browser but not necessarily internet access.
I used Apache Solr for a proof of concept, but that needs to run a web server.
The closest I've gotten to a viable solution is JSSindex (jssindex.sourceforge.net), which uses Lush, but our users' environment is Windows and we don't want to require them to install Cygwin.

It looks like your main problem is making the index accessible to local HTML. A cheat way to do it: put the index in a JS file and reference it from the HTML pages.
var index=[ {word:"home", files:["f.html", "bb.html"]},....];
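Querying such an index from the page could then look something like the sketch below. The second index entry and the search function are made up for the example; a query with several words returns only the files that contain all of them.

```javascript
// Same shape as the index above; the "garden" entry is invented for the demo.
var index = [
  { word: "home", files: ["f.html", "bb.html"] },
  { word: "garden", files: ["bb.html", "c.html"] },
];

// Returns the files that contain every word of the query.
function search(index, query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];

  // Build a word -> files lookup once, so each term is a constant-time hit.
  const lookup = new Map(index.map((entry) => [entry.word, entry.files]));

  let hits = null;
  for (const term of terms) {
    const files = lookup.get(term) || [];
    hits = hits === null ? files.slice() : hits.filter((f) => files.includes(f));
  }
  return hits;
}

const both = search(index, "home garden");
```

This assumes the .NET side lower-cases the words when it writes the index, since the query is lower-cased before the lookup.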

Ladders could be a solution, as it provides on-the-spot indexing. But with 1,000 files or more, I don't know how well it would scale... Sadly, I am not sure JS is the answer here. I'd go for a custom (compiled) app that serves as both front-end (HTML display) and back-end (text search and indexing).

Use a trie - they're ridiculously compact and very scalable - dead handy for text matching.
There is a great article covering performance and design strategies. They're slower to boot up than a dictionary, but take up a lot less room, particularly when you're working with larger datasets.
I'd tackle it as follows:
in your .net code index all the keywords that are important to you (track their document and offset).
generate your trie structure using an alpha sorted list of keywords,
decorate the terminal nodes with information about the documents the words they represent can be found in.
C
A
R T [{docid,[hit offsets]},...]
You don't have to store the offsets, but it would allow you to search for words by proximity or order.
Your .net guys could build the trie sample code.
It will take a while to generate the map, but once it's done and you've serialised it to JSON your javascript application will race through it.
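On the JavaScript side, the structure described above could be sketched roughly as follows. In practice the trie would be generated by the .NET code and shipped as JSON; it is built in JS here only to keep the example self-contained. The "$" key for terminal data and the docId/offsets field names are assumptions, and keywords are assumed not to contain "$" themselves.

```javascript
// entries: [{ word, docs: [{ docId, offsets }] }]
// Each node is a plain object keyed by character; "$" holds the hit list.
function buildTrie(entries) {
  const root = {};
  for (const { word, docs } of entries) {
    let node = root;
    for (const ch of word.toUpperCase()) {
      node = node[ch] || (node[ch] = {});
    }
    node.$ = docs;
  }
  return root;
}

// Walk the trie one character at a time; return [] when the word is absent.
function lookup(trie, word) {
  let node = trie;
  for (const ch of word.toUpperCase()) {
    node = node[ch];
    if (!node) return [];
  }
  return node.$ || [];
}

// Made-up sample data for the words CAR and CAT.
const trie = buildTrie([
  { word: "car", docs: [{ docId: "a.html", offsets: [12, 87] }] },
  { word: "cat", docs: [{ docId: "b.html", offsets: [3] }] },
]);

// The structure is plain objects and arrays, so it survives a JSON round trip.
const restored = JSON.parse(JSON.stringify(trie));
const carHits = lookup(restored, "Car");
```

CAR and CAT share the C and A nodes, which is where the space saving over a flat dictionary comes from.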

Get html form from xsd

I have a quite complex xsd file that describe some objects (it's not important, but it's the DATEX II standard)
Do you know if there is an automatic way to create an html form that act like a "wizard" to guide the user to create xml object as described in the xsd?

The answer to this depends on the intended user base, how you want your users to access your forms, and the technology stack you have in place already or you're willing to deploy.
If your users are quality control analysts, and so the intent is to have them use that generated UI to manage test cases, then a handful of commercial tools have this ability. A quick search on Google for terms such as "generate ui forms from XSD to test web services" should give you on first page the major players in this space (I won't name names to avoid conflict of interests). There are differences in how vendors approach this, that have to do with the time it takes to generate these forms from large bodies of XML Schema, which in turn translate into different degrees of usability. Given what I see in DATEX, from a complexity perspective, you may have a hard time to find a free tool for this...
If your users are rather data entry specialists, then the above are not the tools you want them to use. Without knowing much about your environment (I see your java-ee tag, but still not clear how it would relate to this task), one model could be a combination of InfoPath with SharePoint; while the process to generate the form is not fully automatic, it comes close to that. It is XSD driven, in the sense that at design time you drag and drop XSD on a design form, that allows you to build some really nice UI. Follow their competition on your particular technology stack and you may have your answer. Or, you can go to this site that lists XForms implementations; IBM's form designer, much like InfoPath, can use XML Schema for design, etc.
If this is for developers to get some XML, another alternative might be an Excel-based (or SharePoint lists) approach, generating the XML from that data (you trade the cost of acquiring a tool for the cost of building tooling specific to your requirements, assuming here people who are really familiar with spreadsheets).
Given what the DATEX model looks like, you'll have to do some manual customizations anyway if you plan to use the extensibility model, or if you choose to build different forms for different scenarios, i.e. instead of one big form that gives you all descendants of the abstract payloadPublication in some drop-down, just a specific, simple form, e.g. MeasurementSiteTablePublication.

I know this is an old question, but I ran into this issue myself and couldn't find a satisfying open-source solution. I ended up writing my own implementation, XSD2HTML2XML. To my knowledge it is the most complete tool out there. It supports automatic generation of HTML forms from XSD, including the population of a form with XML data.
If you prefer an out-of-the-box solution rather than writing your own, please see my implementation, XML Schema Form Generator.

dynamic web forms

I'm developing a web application that allows reports to be written and viewed online. These reports will have the structure of a typical school report or annual employee appraisal report. I would like the user to be able to customise the structure of their report. For example, one school might want a report in the format
Subject    Comment          Score
---------------------------------
English    He sucks         20%
Maths      He rocks         88%
Science    About average    70%
whereas another might want
Subject    Grade
----------------
English    A
Maths      B
Science    C
What I'm looking for is a way for each school to specify the format of their reports - possibly some kind of JavaScript form-building library. Such a library could be used in a page that allows the users to build a form which would be used as a template for their reports.
As I'll need to process each report submitted on the server-side, I'll need to capture some semantics about each field. For example, it would be great if the user could specify whether the answer to each question on the report should be plain text, a numerical score, a checkbox, radio buttons, etc
Any suggestions about useful technologies for handling such "dynamic" forms would be really appreciated. XForms looks like it might be relevant, but I haven't dug into it too deeply yet.
Cheers,
Don

A very nice XForms based form builder, (LGPL) http://www.orbeon.com/
You can check out their form builder demo here: http://www.orbeon.com/ops/fr/orbeon/builder/summary/

I agree with Jeff Beck's comments and also noticed the following.
You said your target audience is non-technical and all of the solutions above are going to involve learning HTML and a complex template language, possibly a non-starter for your audience.
The solutions above also seem to need more complexity than your problem requires. MooTools, Dojo, etc. seem like overkill. XForms and XSLT even more so. Yes they'll work and give you a lot of extra functionality, but do you need the level of complexity and the issues of debugging/maintainability/training that go with those extra features?
Your regular teacher or business user probably has a basic understanding of how to enter and save files in Excel. If you can teach them how to save in CSV format and upload the form, or even better yet install a macro that will save to CSV and post it to your web site, then that's likely the only training they'll need. To get the semantics you can add a bit more training and have the first row of the report be the column names and the 2nd row be the column type. It's not elegant, but it is easy for possibly tech-challenged users to adopt, as Jeff points out.
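A rough sketch of reading such a file, with the first row as column names and the second row as column types, might look like this. The type names ("text", "number", "percent") and the function names are invented for the example, and the small parser handles quoted commas but not line breaks inside quoted cells.

```javascript
// Split one CSV line into cells, honoring double-quoted cells and "" escapes.
function parseCsvLine(line) {
  const cells = [];
  let cell = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') { cell += '"'; i++; }
      else if (ch === '"') inQuotes = false;
      else cell += ch;
    } else if (ch === '"') inQuotes = true;
    else if (ch === ",") { cells.push(cell); cell = ""; }
    else cell += ch;
  }
  cells.push(cell);
  return cells;
}

// Convert a raw cell according to its declared (hypothetical) type.
function convert(value, type) {
  if (value === undefined) return null;
  if (type === "number") return Number(value);
  if (type === "percent") return parseFloat(value);
  return value;
}

// Row 1 = column names, row 2 = column types, remaining rows = data.
function parseReport(csv) {
  const lines = csv.split(/\r?\n/).filter((l) => l.trim() !== "");
  if (lines.length < 2) throw new Error("Expected a name row and a type row");
  const [names, types, ...rows] = lines.map(parseCsvLine);
  const fields = names.map((name, i) => ({
    name: name.trim(),
    type: (types[i] || "text").trim(),
  }));
  const records = rows.map((cells) =>
    Object.fromEntries(fields.map((f, i) => [f.name, convert(cells[i], f.type)]))
  );
  return { fields, records };
}

const report = parseReport(
  'Subject,Comment,Score\n' +
  'text,text,percent\n' +
  'English,"Needs work, but improving",20%\n' +
  'Maths,He rocks,88%\n'
);
```

The fields array is what gives the server the semantics it needs: it can validate each cell against its type and pick a suitable input control when rendering the report for editing.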
On the server side I'd recommend the following stack:
Web server => node.js (perhaps using Chain - github.com/hassox/chain)
Data store => Redis (and node-redis)
Templating => Haml-js (github.com/creationix/haml-js)
CSV parsing => See http://purbayubudi.wordpress.com/2008/11/09/csv-parser-using-javascript/
and make sure to use the fixed version that's in the comments (for quoted commas).
Your more tech savvy users can customize the HAML without you compromising security, and HAML is pretty straightforward with a little training:
this HAML...
%body
  .profile
    .left.column
      #date= print_date()
      #address= current_user.address
    .right.column
      #email= current_user.email
      #bio= current_user.bio
produces...
<div class="profile">
  <div class="left column">
    <div id="date">Thursday, October 8, 2009</div>
    <div id="address">Richardson, TX</div>
  </div>
  <div class="right column">
    <div id="email">tim#creationix.com</div>
    <div id="bio">Experienced software professional...</div>
  </div>
</div>

A pragmatic approach would be to use the forms feature of Google Spreadsheets, or (paid) services such as Wufoo or JotForm.

I would suggest to use:
a wiki engine or a plain-text to HTML converter such as Markdown to allow your users to customize the templates you provide
an HTML templating library to insert the data using the defined template
For the HTML converter, you may use John Gruber's MarkDown (in perl) on the server-side, or a Javascript port by John Fraser, Showdown.
For the HTML templating, there are many Javascript libraries available, depending on your framework of choice:
jTemplates, a plugin for jQuery
the Template API in Prototype
Wojo based on Dojo
tmpl.js for MooTools
I also designed one, bezen.template.js part of the bezen.org Javascript library
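To show what the templating step boils down to, here is a minimal sketch of placeholder substitution with HTML escaping. It is not the API of any of the libraries listed above; the {{name}} placeholder syntax and the function names are made up for the example.

```javascript
// Escape the characters that are significant in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Replace each {{name}} placeholder with the escaped value from `data`;
// unknown placeholders are replaced with an empty string.
function render(template, data) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (whole, key) =>
    Object.prototype.hasOwnProperty.call(data, key) ? escapeHtml(data[key]) : ""
  );
}

// A row template a school might define, filled with one report line.
const rowTemplate = "<tr><td>{{subject}}</td><td>{{ grade }}</td></tr>";
const rowHtml = render(rowTemplate, { subject: "Maths & Stats", grade: "B" });
```

Escaping the values matters here because the report text is typed in by users and ends up in a page that other users view.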

I'm in charge of XSLTForms, and it seems like a good candidate for what you want to do.
The possibilities for XSLTForms are superior to those of XForms 1.1 specification : XSLT at client-side, SVG, and others.
Dynamic forms can be developed with XForms and, in case it would not be enough for your application, XSLTForms could integrate necessary extensions.

Should be easy to build in Smalltalk with Seaside. You have a WATableReport with WATableColumns. Just build a simple editor where each school can define those columns.
I'm not sure what javascript or XForms have to do with it. As far as I know XForms is currently dead unless you can prescribe the browser.

I think if the forms do not change too frequently, you should not provide a system for non-technical users to mess with the reports.
Rather, make your system easy for YOU to add new reports. In this case, the customer sends you a PDF/Excel file showing the format they want, and you can quickly come up with a new report.
That is what I did for our accounting system, which is used by several clinics. We also charge a nominal fee for each report change (to prevent users from mindlessly changing the system).
