PDF to HTML in a Node.js application

I'm trying to extract the content of a PDF to obtain the equivalent HTML.
I'm using Node.js to do so (it's a Telegram bot).
I googled for a while and could only find HTML-to-PDF tools such as Puppeteer. Do you know of something that does the exact reverse?
Thanks in advance.

Have a look at pdfjs-dist.
I haven't used that library, but it seems to be the one that gets you closest to your goal. Also, as you probably already know, a PDF can contain anything: scanned text, photos, drawings, and whatnot.
It is probably impossible to have a library that is able to extract all the info a human can extract from a PDF.
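As a rough sketch of what the pdfjs-dist route could look like (not a tested recipe from the answerer, who says they have not used the library): read the text items of each page and wrap each detected line in a paragraph. The helper names `itemsToHtml` and `pdfToHtml` are made up for illustration, the import path is an assumption that depends on the installed pdfjs-dist version, and only text is recovered, so images, drawings and scanned pages are lost.

```javascript
// Escape characters that are special in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Turn pdf.js text items into <p> elements. Consecutive items whose
// baseline (transform[5]) differs by at most `tolerance` are joined
// into one line. This is a heuristic, not a layout engine.
function itemsToHtml(items, tolerance = 2) {
  const lines = [];
  let current = null;
  for (const item of items) {
    const y = item.transform[5];
    if (current && Math.abs(current.y - y) <= tolerance) {
      current.text += item.str;
    } else {
      current = { y, text: item.str };
      lines.push(current);
    }
  }
  return lines
    .map((line) => line.text.trim())
    .filter((text) => text.length > 0)
    .map((text) => `<p>${escapeHtml(text)}</p>`)
    .join('\n');
}

// Hypothetical wrapper: one <section> per PDF page.
async function pdfToHtml(path) {
  const fs = require('fs/promises');
  // Assumption: pdfjs-dist 4.x layout. Older releases ship
  // 'pdfjs-dist/legacy/build/pdf.js' instead.
  const pdfjs = await import('pdfjs-dist/legacy/build/pdf.mjs');
  const data = new Uint8Array(await fs.readFile(path));
  const pdf = await pdfjs.getDocument({ data }).promise;
  const pages = [];
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    pages.push(`<section>\n${itemsToHtml(content.items)}\n</section>`);
  }
  return pages.join('\n');
}
```

In a bot handler this would be called as `const html = await pdfToHtml('/tmp/upload.pdf')`, where the path is a placeholder.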

Related

What's the best way to create a pdf document with charts and in a decent quality?

First of all, I want to apologize for my English, I know my grammar is not the best, I'm not a native speaker so sometimes I have a hard time finding the right words or trying to explain the things I have in mind. So if you have a problem understanding what I want, please let me know and I'll try to be clearer.
I'm making a web app, the backend is made in PHP.
I need to create some kind of PDF report for the users with charts, tables, and pictures, and it needs to look decent (quality and style).
I don't know if a JavaScript library is the best option, because the libraries I've found don't produce good-looking PDFs.
So I was looking for a JavaScript library, but nothing seems to fit my requirements.
I'll give you some pictures in order for you to have a better idea about what I want:
Sales report: This was the closest I could find to what I want.
Another sales report Something like this
Invoice I need something that allows me to make a really good looking design for the document, similar to this invoice
I'm really getting tired of looking and finding nothing... It's been almost 2 months now, maybe more.
My first idea was to use a JavaScript library that could convert HTML to PDF, but I was naive: I didn't know how hard it is to convert HTML to PDF without messing up the structure of the elements (the visual layout).
I first tried jsPDF by MrRio, but it doesn't support CSS (something I really need) or charts (if the charts are pictures it works fine, but if they are generated with a library like Chart.js or Highcharts it won't work).
I tried some similar plugins, but all of them had similar problems.
Here is the link for jsPDF and their git
Then I found jsreport. I built a really nice report in their playground and it worked perfectly. The problem is that it needs Node.js and my site is made in PHP. I don't know if there is a way to use jsreport with PHP, and if there is, I'm not sure it will work as nicely as it did in their playground.
Something I just thought of: maybe there is a way to use Node.js only for this feature. I don't know whether that is discouraged; I need to find out more about it, and I'd like to know what you think.
jsreport can use multiple recipes, such as PhantomJS, to generate the PDF.
I've found other plugins, but they are also for Node.js only.
Here is jsreport site
Then I tried jsPDF again, this time along with html2canvas, and it worked almost fine, but the quality decreases a lot and the text cannot be selected or copied. I managed to increase the quality a bit, but nothing beyond that.
I have other ideas, something related to the browser's print option, but I haven't found anything good.
The other idea, which I don't know is possible, is to make some kind of template document and then somehow insert the user's data into it, ending up with a PDF that contains that data.
Read this, here is some related info.
The reason I want this feature is that most of the clients who use my web app require it. I can't migrate my app to Node or Rails, because that would take a really long time that I don't have right now, so I don't know what my options are. The solution doesn't need to be free, but it would be great if it were, because I don't have the budget for a really expensive plugin.
Thank you for reading my question. I really need help with this, so let me know everything you have in mind. Every answer is appreciated, even if I've already tried it, because maybe you can come up with a different solution using something I have already used.

Prepare your HTML page with Highcharts or whatever charting library you prefer, then convert it to PDF with wkhtmltopdf.
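A minimal sketch of driving wkhtmltopdf from a script, assuming the `wkhtmltopdf` binary is installed and on the PATH. It is written in Node.js here, but the same command line could be run from PHP with `exec`. The helper names `buildArgs` and `renderPdf`, the URL and the one-second delay are illustrative assumptions; `--javascript-delay` gives chart scripts time to finish drawing before the page is captured.

```javascript
const { execFile } = require('child_process');

// Build the argument list for the wkhtmltopdf command-line tool.
function buildArgs(inputUrl, outputPath, { pageSize = 'A4', jsDelayMs = 1000 } = {}) {
  return [
    '--page-size', pageSize,
    '--javascript-delay', String(jsDelayMs), // wait for charts to render
    '--print-media-type',                    // apply print CSS rules
    inputUrl,
    outputPath,
  ];
}

// Run wkhtmltopdf and resolve with the output path when it succeeds.
function renderPdf(inputUrl, outputPath, options) {
  return new Promise((resolve, reject) => {
    execFile('wkhtmltopdf', buildArgs(inputUrl, outputPath, options), (error) => {
      if (error) reject(error);
      else resolve(outputPath);
    });
  });
}
```

For example, `renderPdf('http://localhost/report.html', 'report.pdf')` would write `report.pdf` in the current directory, where the URL is a placeholder for the prepared report page.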

C#/javascript, Convert Apple's new HEIC image format to JPG

So I know this question has been asked several times, but I can NOT find a clear, definitive answer.
I am trying to use pure HTML5 and AJAX/C# (nothing else if possible) to simply capture a photo from a mobile page (not an app) on iOS/Android. With the code I have so far, everything works fine on Android, but the issue is iOS's new HEIC image format.
My goal is to take the HEIC image captured by the form input and convert it to JPG before sending it to the server.
Could someone please explain how to do this (in a little detail), and also include any dependencies/libraries that are needed to do it?
PS: I am very aware of Nokia's GitHub project (which doesn't work), and also the overly expensive API that does it for you.

I came across your question while looking for answers, but sadly didn't find any :(
However, I have managed to do the conversion using libheif.
I've created this repository, which uses libheif but with a much simpler API. It can only be used in the browser, and the resulting image won't retain any metadata. However, it will work as you would expect.
Here's a demo: https://alexcorvi.github.io/heic2any/#try
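A minimal sketch of how the conversion could be wired into a file input, assuming the heic2any script is loaded on the page. The helper names `isHeic` and `toJpeg` are hypothetical, and the converter is passed in as a parameter so the helper is not tied to a global. Some browsers leave `file.type` empty for HEIC files, so the file extension is checked as well.

```javascript
// True when the picked file looks like HEIC/HEIF.
function isHeic(file) {
  const type = (file.type || '').toLowerCase();
  const name = (file.name || '').toLowerCase();
  return type === 'image/heic' || type === 'image/heif' || /\.(heic|heif)$/.test(name);
}

// Convert HEIC input to a JPEG Blob; pass other formats through untouched.
// `convert` is expected to be the heic2any function.
async function toJpeg(file, convert) {
  if (!isHeic(file)) return file;
  const result = await convert({ blob: file, toType: 'image/jpeg', quality: 0.8 });
  // heic2any can return an array for multi-image files; take the first image.
  return Array.isArray(result) ? result[0] : result;
}
```

In a change handler this could be used as `const jpeg = await toJpeg(input.files[0], heic2any)`, after which the Blob is appended to a `FormData` object and posted to the server.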

How to extract text from an image using JavaScript

I want to extract all the information which I can get from an image so if that image contains:
Name : john doe
Dob : 12/12/2012
After the user uploads that image, I want to extract those two pieces of information into two variables and store them in my database. I have tried Ocrad.js, but that did not work for me :(. Are there other methods to extract text from an image and store it via JavaScript?

Javascript is a whole computer language (despite criticism against it) and since the arrival of NodeJS, simply saying Javascript doesn't communicate to the community whether you're trying to do this in the browser or on your server.
The functionality that you're describing is optical character recognition (OCR). Does Javascript have it built in? No. That's the short answer to your question.
Is it possible to do this using the Javascript language? Yes, but you'll have to work to make it work. As you've already discovered, there are projects like Ocrad.js which implement OCR algorithms in Javascript and run right in your browser. That demo seems to work reasonably well for me. Care to elaborate on the specific issues you encountered?
On the other, more obvious end of the spectrum, if you're running Javascript on your server, then you can use Javascript to write OCR code (much like Ocrad), or you can delegate it to some application you can download and run on your server like OCR4Linux.
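Whichever OCR engine produces the text, pulling the two values out of it is plain string handling. Here is a minimal sketch, assuming the recognized text keeps one "Label : value" pair per line as in the question; the function name `parseFields` is made up for illustration. Real OCR output is often noisier than this, so the result should be validated before it is stored.

```javascript
// Parse "Label : value" lines from OCR output into an object keyed by
// the lower-cased label. Lines without a colon are ignored.
function parseFields(ocrText) {
  const fields = {};
  for (const line of ocrText.split(/\r?\n/)) {
    const match = line.match(/^\s*([^:]+?)\s*:\s*(.+?)\s*$/);
    if (match) fields[match[1].toLowerCase()] = match[2];
  }
  return fields;
}

// Assumption: with Ocrad.js loaded in the browser, the text would come
// from something like `const text = OCRAD(canvas);`.
const { name, dob } = parseFields('Name : john doe\nDob : 12/12/2012');
console.log(name, dob);
```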

pdf.js analog for Word Documents

I am searching for a JavaScript library that is similar to pdf.js but allows the viewing of Word Documents (.doc and .docx)
Are there any?
UPDATE:
There is an interesting library called DOCX.js
But I'm searching for something more advanced.

I doubt it. Behind pdf.js stands Mozilla, so it isn't a weekend project.
There are options to let LibreOffice run in the browser, but I have no first-hand experience with it. Apparently, some cloud projects like NextCloud use it, though.
Then you have Google Docs to import the Word file and let it be displayed there, but there is no way to embed that easily or even host the code yourself. (Also, as I understand it, there are transformations to the Word file on the server involved.)
And after all, if you compare the PDF spec with the OpenXML (aka .docx) spec, it becomes quite clear, that a fully compliant viewer will be a complex beast, to say the least.

I just found ViewerJS, but it only supports OpenDocument formats. It's not what you were looking for, but it may be worth a shot, especially if you can find a way to convert doc to odt (this question might help).

At a glance, it looks like Flexpaper can be used to this effect, but it's effectively using a server-side version of open office to convert the document into images that can be viewed on the web. This'll work in a pinch, but certainly lacks the quality of pdf.js.

You can use ViewerJS and JOD Converter (http://www.artofsolving.com/opensource/jodconverter.html) together to achieve this. First, convert the office documents to OpenDocument or PDF format using the converter. Then show those documents with the help of either pdf.js or ViewerJS.

Native Documents (in which I have an interest) makes an embeddable viewer/editor for Word documents. There's an online demo where you can try your own document.

Saving Div Content As Image On Server

I have been learning a bit of jQuery and .NET in VB. I have created a product customization tool of sorts that basically layers up divs and adds text, images, etc. on top of a t-shirt.
I'm stuck on an important stage!
I need to be able to convert the content of the div that wraps all these divs of text and images into one flat image, taking into account any CSS that has been applied to it.
I have heard of tools that screen-capture the content of a browser on the server, which could work for low-res thumbnails, but it sounds a little troublesome, and it would really be nice to create a high-res image.
I have also heard of converting the HTML to an HTML5 canvas and then writing that out, but it looks too complicated for me to fathom, and browser support is an issue.
Is this possible in .NET?
Perhaps something with javascript could be done?
Any help or guidance in the correct direction would be appreciated!
EDIT:
I'm thinking perhaps I need two solutions for this. Ideally I would end up with a normal-res JPG/PNG for displaying on the website, but a print-ready high-res file would be very desirable as well.
PostScript printer: I have heard of it, but I'm struggling to find a good beginner resource for understanding it (especially with the wiki blackout). Perhaps I could create an HTML page from my div content and print it to an EPS file. Does anyone know any good tutorials for this?

We did this... about 10 years ago. Interestingly, the tech available really hasn't changed too much.
update - Best Answer
Spreadshirt licenses their product: http://blog.spreadshirt.net/uk/2007/11/27/everyones-a-designer-free-designers-for-premium-partners/
Just license it. Don't do this yourself, unless you have real graphics manipulating and print production experience. I'd say in today's world you're looking at somewhere around 4,000 to 5,000 hours of dev time to duplicate what they did... And that's if you have two top tier people working on it.
Short answer: you can't do it in html.
Slightly longer answer:
It doesn't work in part because you can't screen cap the client side and get the level of resolution needed for production type printing. Modern screen resolution is usually on the order of 100 ppi. For a decent print you really need something between 3 and 6 times that density. Otherwise you'll have lots of pixelation and it will generally look like crap when it comes out.
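To put numbers on that, here is a small arithmetic sketch. The 12 x 16 inch print area is a made-up example, and `pixelsForPrint` is a hypothetical helper; 100 ppi stands for the screen and 300 ppi for the low end of the "3 to 6 times" range.

```javascript
// Pixels needed to cover a print area at a given density (pixels per inch).
function pixelsForPrint(widthIn, heightIn, ppi) {
  return {
    width: Math.round(widthIn * ppi),
    height: Math.round(heightIn * ppi),
  };
}

console.log(pixelsForPrint(12, 16, 100)); // { width: 1200, height: 1600 }
console.log(pixelsForPrint(12, 16, 300)); // { width: 3600, height: 4800 }
```

A screen capture only gets you the first figure, which is a ninth of the pixels the 300 ppi print needs.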
A different Answer:
Your best bet is to leverage something like SVG (scalable vector graphics) and provide a type of drawing surface to the browser. There are several ways of doing this, using Flash (Spreadshirt.com uses this) or Silverlight (not recommended). We used Flash and it was pretty good.
You might be able to get away with using HTML 5. Regardless, whatever path you pick is going to be complicated.
Once the user is happy with their drawing and wants to print it out, you create the final file and run a process to convert it to Postscript or whatever format your t-shirt provider needs. The converter (aka RIP software) is going to either take a long time to develop or cost a bunch of money... pick one. (helpful hint: buy it. Back then, we spent around $20k US and it was far cheaper than trying to develop).
Of course, this ignores issues such as color matching and calibration. This was actually our primary problem. Everyone's monitor is slightly different and what looks like red on one machine is pink on another.
And for a little background, we were doing customized wrapping paper. The user added text, selected images from our library or uploaded their own, and picked a pattern. Our prints came out on large-format HP Inkjet printers (36" and 60" wide). Ultimately we spent between $200k and $300k just on dev resources to make it happen... and it did. Unfortunately, the price point we had to sell at was too high for the market.

If you can use a server-side tool, check out PhantomJS. It is a headless WebKit browser (with no GUI) that can take a screenshot of a page and is driven through a JavaScript API. It should do the trick.

Send the whole div with the user-generated content back to the server using an AJAX call.
Generate an HTML document on the server using the 'HtmlTextWriter' class.
Then you can convert that HTML file using external tools like
(1) http://www.officeconvert.com/products_website_to_image.htm#easyhtmlsnapshot
(2) http://html-to-image.acasystems.com/faq-html-to-picture.htm
These are not free tools, but you can use them by starting a new Process on the server.

The best option I came across is wkhtmltopdf. It comes with a tool called wkhtmltoimage. It uses QtWebKit (A Qt port of the WebKit rendering engine) to render a web page, and converts the result to PDF or image format of your choice, all done at server side.
Because it uses WebKit, it renders everything (images, css and even javascript) just like a modern browser does. In my use case, the results have been very satisfying and are almost identical to what browsers would render.
To start, you may want to look at how to run external tools in .NET:
Execute an external EXE with C#.NET
