How can I do text extraction in Node.js with good quality?

I want to extract text from an image. The source was a PDF, which I converted to a TIFF image.
The image is good quality. I want to extract the word GeNeSys-ID: and the number after it.
I tried tesseract.js, but the extraction was poor. Then I tried node-tesseract-ocr; the extraction was better, but still not what I want. I also tried preprocessing, but it didn't help.
This is the code with node-tesseract-ocr:
//const tesseract = require("tesseract.js");
const tesseract = require("node-tesseract-ocr")
const { exec } = require('child_process');

const config = {
  lang: "deu",
  oem: 3, // try different OEMs to see which one produces the best results
  psm: 4, // try different PSMs to see which one produces the best results
}
let moduleInfo = {
  dirPath: '',
  pdf: 'a.pdf'
};

exec("gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r600 -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dColorConversionStrategy=/saturation -dAutoRotatePages=/None -sOutputFile=" + moduleInfo.dirPath + "output.tiff " + moduleInfo.dirPath + moduleInfo.pdf, (err, stdout, stderr) => {
  if (err) {
    console.error(`Error: ${err}`);
    return;
  }
  console.log(stdout);
  tesseract
    .recognize("output.tiff", config)
    .then((text) => {
      console.log("Result:", text)
    })
    .catch((error) => {
      console.log(error.message)
    })
});
These are some example inputs:
input1
input2
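
Once the OCR text is available, picking out the ID can be treated as a separate post-processing step. The following is only a minimal sketch, assuming the ID consists of digits and that OCR may insert stray spaces or misread the colon. `extractGenesysId` is a hypothetical helper, not part of either Tesseract package, and the sample string is made up for illustration.

```javascript
// Hypothetical helper: returns the digits that follow "GeNeSys-ID:" in OCR text,
// or null when the label is not found.
function extractGenesysId(ocrText) {
  // Case-insensitive; allows optional whitespace around the hyphen and a
  // colon that OCR dropped or misread as ";" or ".".
  const pattern = /GeNeSys\s*-?\s*ID\s*[:;.]?\s*(\d(?:[\d ]*\d)?)/i;
  const match = pattern.exec(ocrText);
  if (!match) return null;
  // Remove spaces that OCR may have inserted between digits.
  return match[1].replace(/ /g, "");
}

// Made-up sample text standing in for the OCR result.
console.log(extractGenesysId("Kunde: Muster\nGeNeSys-ID: 123 456\nDatum"));
```

A tolerant pattern like this does not improve recognition itself; it only makes the lookup less sensitive to small OCR mistakes around the label.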

Related

Having problems with `fs.writeFile`: it doesn't create files

I'm trying to run a script that creates a model file in JSON using fs.writeFile. The problem is that when I run the script with node file.js, it is supposed to create a new file, face-expression-model.json, in the directory /models, but it doesn't create anything and doesn't show any errors.
I tried another library, fs-extra, which didn't work either. I tried to make the script create the models directory with fs.WriteDir, which didn't work either. I tried adding process.cwd() to bypass any authorisation issue when creating the file, but that didn't work. I also added a try/catch block to catch all errors, but it doesn't show any. For a moment it looked as if the file had been created, but unfortunately it had not.
Here is the code I'm using.
const axios = require("axios");
const faceapi = require("face-api.js");
const { FaceExpressions } = faceapi.nets;
const fs = require("fs");

async function trainModel(imageUrls) {
  try {
    await FaceExpressions.loadFromUri(process.cwd() + "/models");
    const imageTensors = [];
    for (let i = 0; i < imageUrls.length; i++) {
      const response = await axios.get(imageUrls[i], {
        responseType: "arraybuffer"
      });
      const image = new faceapi.Image();
      image.constructor.loadFromBytes(new Uint8Array(response.data));
      const imageTensor = faceapi.resizeImageToBase64Tensor(image);
      imageTensors.push(imageTensor);
    }
    const model = await faceapi.trainFaceExpressions(imageTensors);
    fs.writeFileSync("./models/face-expression-model.json", JSON.stringify(model), (err) => {
      if (err) throw err;
      console.log("The file has been saved!");
    });
  } catch (error) {
    console.error(error);
  }
}

const imageUrls = [
  // array of image URLs here
];
trainModel(imageUrls);

I don't know exactly why, but I had the same problem a while ago. Try using the `fs.writeFile` method instead. It worked for me.
fs.writeFile("models/face-expression-model.json", JSON.stringify(model), {}, (err) => {
  if (err) throw err;
  console.log("The file has been saved!");
});
Good luck with that!
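
Two details may be relevant here. `fs.writeFileSync` is synchronous and does not take a callback, so the callback in the question's code never runs. A relative path such as "./models/..." is also resolved against the directory the process was started from, and writing fails if that directory does not exist. The sketch below illustrates one way to rule out both, using only Node's built-in modules. `saveJson` is a hypothetical helper name, and this is not necessarily the cause of the original problem.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: writes `data` as JSON to `filePath`, creating missing
// parent directories first. Returns the absolute path that was written.
function saveJson(filePath, data) {
  const absolutePath = path.resolve(filePath);
  // recursive: true creates intermediate folders and does not throw
  // if the directory already exists.
  fs.mkdirSync(path.dirname(absolutePath), { recursive: true });
  // Synchronous write: it throws on failure, so no callback is needed.
  fs.writeFileSync(absolutePath, JSON.stringify(data));
  return absolutePath;
}
```

It could be called as `saveJson(path.join(__dirname, "models", "face-expression-model.json"), model)`, which anchors the output next to the script instead of the current working directory. Logging the returned path shows where the file actually ended up.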

How could I check if a ZIP file is corrupted in Node.js?

I would like to check whether a ZIP file is corrupted in Node.js, using as little CPU and memory as possible.
How to corrupt a ZIP file:
Download a ZIP file.
Open the ZIP file in a text editor such as Notepad++.
Overwrite the header with random characters.
I am trying to reach this goal using the npm library "node-stream-zip":
private async assertZipFileIntegrity(path: string) {
  try {
    const zip = new StreamZip.async({ file: path });
    const stm = await zip.stream(path);
    stm.pipe(process.stdout);
    stm.on('end', () => zip.close());
  } catch (error) {
    throw new Error();
  }
}
However, when I run the unit tests I receive an error inside an array:
Rejected to value: [Error]

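import yauzl from 'yauzl';
import path from 'path';

const invalidZipPath = path.resolve('invalid.zip');
const validZipPath = path.resolve('valid.zip');

// yauzl.open reports the outcome through its callback, so a value returned
// from that callback is lost. Wrapping the call in a Promise hands the
// result back to the caller.
const isValidZipFile = (filePath) =>
  new Promise((resolve) => {
    yauzl.open(filePath, { lazyEntries: true }, (err, zipfile) => {
      if (err) {
        console.log('failed to read', filePath);
        return resolve(false);
      }
      console.log('successfully read', filePath);
      zipfile.close();
      resolve(true);
    });
  });

isValidZipFile(validZipPath).then(console.log);
isValidZipFile(invalidZipPath).then(console.log);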
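
For the specific corruption described in the question (an overwritten header), a cheaper pre-check is possible: read only the first four bytes and compare them with the ZIP signatures. This is a rough sketch, assuming the archive starts at byte 0 of the file, which is not true for self-extracting archives, for example. `hasZipSignature` is a hypothetical helper. It detects a damaged leading header only and says nothing about the rest of the archive, so it complements a library check rather than replacing it.

```javascript
const fs = require("node:fs");

// Hypothetical helper: true when the file starts with a ZIP signature.
function hasZipSignature(filePath) {
  const header = Buffer.alloc(4);
  const fd = fs.openSync(filePath, "r");
  try {
    // Read exactly four bytes from offset 0; the rest of the file is untouched.
    const bytesRead = fs.readSync(fd, header, 0, 4, 0);
    if (bytesRead < 4) return false;
  } finally {
    fs.closeSync(fd);
  }
  const signature = header.readUInt32LE(0);
  // 0x04034b50 = "PK\x03\x04" (local file header),
  // 0x06054b50 = "PK\x05\x06" (end of central directory, i.e. an empty archive).
  return signature === 0x04034b50 || signature === 0x06054b50;
}
```

Because only four bytes are read, the cost stays constant regardless of the archive size.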

Cypress: identify the downloaded file using regex

I have one scenario where I have to verify the downloaded text file's data against an API response.
Below is the code that I have tried.
Test:
const path = require('path')
const downloadsFolder = Cypress.config('downloadsFolder')
cy.task('deleteFolder', downloadsFolder)
const downloadedFilename = path.join(downloadsFolder, 'ABCDEF.txt')//'*.txt'
....
cy.get('#portmemo').its('response.body')
  .then((response) => {
    var json = JSON.parse(response);
    const resCon = json[0].content.replaceAll(/[\n\r]/g, '');
    cy.readFile(downloadedFilename).then((fc) => {
      const formatedfc = fc.replaceAll(/[\n\r]/g, '');
      cy.wrap(formatedfc).should('contains', resCon)
    })
  })
Task in /cypress/plugins/index.js
const { rmdir } = require('fs')

module.exports = (on, config) => {
  console.log("cucumber started")
  on('task', {
    deleteFolder(folderName) {
      return new Promise((resolve, reject) => {
        rmdir(folderName, { maxRetries: 5, recursive: true }, (err) => {
          if (err) {
            console.error(err);
            return reject(err)
          }
          resolve(null)
        })
      })
    },
  })
}
When I set downloadedFilename to 'ABCDEF.txt', it works fine (I have hard-coded it here). But I need help getting the dynamic file name, because it changes every time (e.g. AUADLFA.txt, CIABJPT.txt, SVACJTM.txt, PKPQ1TM.txt, etc.).
I tried to use '*.txt', but I get the error 'Timed out retrying after 4000ms: cy.readFile("C:\Repositories\xyz-testautomation\cypress\downloads/.txt") failed because the file does not exist'.
I referred to this doc as well, but no luck yet.
What is the right way to use a regex to achieve this? Also, is there a way to get the name of the most recently downloaded file?

You can use the task shown in the question "How can I verify if the downloaded file contains name that is dynamic":
/cypress/plugins/index.js
const fs = require('fs');

module.exports = (on, config) => {
  on('task', {
    downloads: (downloadspath) => {
      return fs.readdirSync(downloadspath)
    }
  })
}
This returns a list of the files in the downloads folder.
Ideally you'd make it easy on yourself, and set the trashAssetsBeforeRuns configuration. That way, the array will only contain the one file and there's no need to compare arrays before and after the download.
(Just noticed you have a task for it).
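
If the folder may contain more than one file, the list returned by `readdirSync` can be narrowed with a regex and sorted by modification time. The following is only a sketch using Node's built-in modules. `newestMatchingFile` is a hypothetical helper that would have to run on the Node side, for example inside a task like `downloads` above, and the pattern is expected to be a regex without the `g` flag.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: returns the name of the most recently modified file in
// `folder` whose name matches `pattern`, or null when nothing matches.
function newestMatchingFile(folder, pattern) {
  const candidates = fs
    .readdirSync(folder)
    .filter((name) => pattern.test(name))
    .map((name) => ({
      name,
      mtimeMs: fs.statSync(path.join(folder, name)).mtimeMs,
    }))
    // Newest first.
    .sort((a, b) => b.mtimeMs - a.mtimeMs);
  return candidates.length > 0 ? candidates[0].name : null;
}
```

For the file names in the question, a pattern such as `/^[A-Z0-9]+\.txt$/` would match names like AUADLFA.txt or PKPQ1TM.txt. The name returned by the task can then be joined with the downloads folder and passed to `cy.readFile`.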

How to increase OCR accuracy in Node JS and Tesseract.js?

I use tesseract.js to detect numbers in Node.js.
For example, this is my image:
I run my script and it detects something like this:
289 ,0
Because of noise in the image, it also picks up spaces and other characters such as commas.
Is there any way I can restrict the output to digits only, with no other characters such as spaces and commas?
This is my code:
tesseract.recognize(
  __dirname + '/Captcha.png',
  'eng',
  { logger: m => console.log(m) }
).then(({ data: { text } }) => {
  console.log(text);
});

I don't know the JS Tesseract API, but there is a fairly simple workaround: filter the result afterwards.
tesseract.recognize(
  __dirname + '/Captcha.png',
  'eng',
  { logger: m => console.log(m) }
).then(({ data: { text } }) => {
  const filteredText = Array.from(text.matchAll(/\d/g)).join("")
  console.log(filteredText)
})
Here's a test for just the filtering step:
if (Array.from("209, 1".matchAll(/\d/g)).join("") !== "2091") {
  throw new Error("Not working")
}

I just started learning the internals of tesseract.js for an assignment.
The API documentation explains how you can achieve what you want by setting parameters when launching the job: tessedit_char_whitelist (the result will contain only the whitelisted characters) and preserve_interword_spaces (keeps the spaces between words).
From https://github.com/naptha/tesseract.js/blob/master/docs/examples.md
const { createWorker } = require('tesseract.js');
const worker = createWorker();

(async () => {
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  await worker.setParameters({
    tessedit_char_whitelist: '0123456789',
    preserve_interword_spaces: '0',
  });
  const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(text);
  await worker.terminate();
})();

How to solve timeout problem with electron-html-to node.js

I'm getting this timeout when trying to use Electron to convert an HTML file to PDF. I'm running this JS app through Node.
`{ Error: Worker Timeout, the worker process does not respond after 10000 ms
at Timeout._onTimeout (C:\Users\Owner\Desktop\code\PDF-Profile-Generator\node_modules\electron-workers\lib\ElectronManager.js:377:21)
at ontimeout (timers.js:436:11)
at tryOnTimeout (timers.js:300:5)
at listOnTimeout (timers.js:263:5)
at Timer.processTimers (timers.js:223:10)
workerTimeout: true,
message:
Worker Timeout, the worker process does not respond after 10000 ms,
electronTimeout: true }`
I do not know much about Electron, and I have not been able to do much to debug it. The JS code is meant to generate an HTML file based on user input, pulling data from a GitHub profile. That HTML file then needs to be converted to a PDF file.
My JS code is as follows:
const fs = require("fs")
const convertapi = require('convertapi')('tTi0uXTS08ennqBS');
const path = require("path");
const generate = require("./generateHTML");
const inquirer = require("inquirer");
const axios = require("axios");

const questions = ["What is your Github user name?", "Pick your favorite color?"];

function writeToFile(fileName, data) {
  return fs.writeFileSync(path.join(process.cwd(), fileName), data);
};

function promptUser() {
  return inquirer.prompt([
    {
      type: "input",
      name: "username",
      message: questions[0]
    },
    {
      type: "list",
      name: "colorchoice",
      choices: ["green", "blue", "pink", "red"],
      message: questions[1]
    }
  ])
};

function init() {
  promptUser()
    .then(function ({ username, colorchoice }) {
      const color = colorchoice;
      const queryUrl = `https://api.github.com/users/${username}`;
      let html;
      axios
        .get(queryUrl)
        .then(function (res) {
          res.data.color = color
          const starArray = res.data.starred_url.split(",")
          res.data.stars = starArray.length
          console.log(res)
          html = generate(res.data);
          console.log(html)
          writeToFile("profile.html", html)
        })
      var convertFactory = require('electron-html-to');
      var conversion = convertFactory({
        converterPath: convertFactory.converters.PDF
      });
      conversion({ file: './profile.html' }, function (err, result) {
        if (err) {
          return console.error(err);
        }
        console.log(result.numberOfPages);
        console.log(result.logs);
        result.stream.pipe(fs.createWriteStream(__dirname + '/profile.pdf'));
        conversion.kill(); // necessary if you use the electron-server strategy, see below for details
      });
      // convertapi.convert('pdf', { File: './profile.html' })
      //   .then(function (result) {
      //     // get converted file url
      //     console.log("Converted file url: " + result.file.url);
      //     // save to file
      //     return result.file.save(__dirname + "/profile.pdf");
      //   })
      //   .then(function (file) {
      //     console.log("File saved: " + file);
      //   });
    })
}

init();

I had a similar problem. I had installed multiple versions of Electron (electron, electron-html-to, electron-prebuilt), and the problem was resolved when I deleted the older versions in package.json so only one was left. The assumption is that they were interfering with each other.
So check the installed versions of electron, because the problem might be there rather than your code.
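
One quick way to see which Electron-related packages a project declares is to scan package.json. The following is a small sketch under the assumption that the relevant packages all have names starting with "electron". `listElectronPackages` is a hypothetical helper, and it only reads what is declared, not what is actually installed in node_modules.

```javascript
// Hypothetical helper: lists every declared dependency whose name starts
// with "electron", together with the section it was found in.
function listElectronPackages(packageJson) {
  const sections = ["dependencies", "devDependencies"];
  const found = [];
  for (const section of sections) {
    // A missing section is treated as empty.
    for (const [name, version] of Object.entries(packageJson[section] ?? {})) {
      if (name.startsWith("electron")) {
        found.push({ section, name, version });
      }
    }
  }
  return found;
}
```

It could be called with `JSON.parse(require("fs").readFileSync("package.json", "utf8"))`. If more than one Electron distribution shows up (for example both electron and electron-prebuilt), that matches the situation described in this answer.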
