How to increase OCR accuracy in Node JS and Tesseract.js? - javascript

I use tesseract.js for detecting numbers in Node JS.
For example this is my image :
I run my script and it detects something like this:
289 ,0
Because of noise in the image, it also picks up spaces and other characters such as commas.
Is there any way I can restrict the output to digits only, with no spaces or commas?
Also this is my code:
tesseract.recognize(
__dirname + '/Captcha.png',
'eng',
{ logger: m => console.log(m) }
).then(({ data: { text } }) => {
console.log(text);
});

I don't know the tesseract.js API, but there seems to be a simple workaround: filter the result afterwards.
tesseract.recognize(
__dirname + '/Captcha.png',
'eng',
{ logger: m => console.log(m) }
).then(({ data: { text } }) => {
const filteredText = Array.from(text.matchAll(/\d/g)).join("")
console.log(filteredText)
})
Here's the test for just the filtering function:
if (Array.from("209, 1".matchAll(/\d/g)).join("") !== "2091") {
throw("Not working")
}
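The same filter can also be written with String.prototype.replace, which avoids building an intermediate array of matches. This is only a minimal sketch; the helper name digitsOnly is made up for illustration and is not part of tesseract.js.

```javascript
// Hypothetical helper (not part of tesseract.js): strip every
// character that is not an ASCII digit from the recognized text.
function digitsOnly(text) {
  return text.replace(/\D/g, "");
}

console.log(digitsOnly("289 ,0\n")); // prints "2890"
```

It could be called on text inside the .then callback in place of the matchAll expression.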

I just started learning the internals of tesseract.js for an assignment.
The API documentation explains how to achieve this with parameters set when launching the job: tessedit_char_whitelist (restricts the result to the whitelisted characters) and preserve_interword_spaces (keeps the spaces between words).
From https://github.com/naptha/tesseract.js/blob/master/docs/examples.md:
const { createWorker } = require('tesseract.js');
const worker = createWorker();
(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({
tessedit_char_whitelist: '0123456789',
preserve_interword_spaces: '0',
});
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();

Related

How can I do text extraction in Node.js with good quality?

I want to extract text from an image. It was a PDF that I converted to a TIFF image.
The image is good quality. I want to extract the label GeNeSys-ID: and the number after it.
I tried tesseract.js, but the extraction was poor. Then I tried node-tesseract-ocr, which was better but still not what I want. I also tried preprocessing, but it didn't help.
This is the code with node-tesseract-ocr:
//const tesseract = require("tesseract.js");
const tesseract = require("node-tesseract-ocr")
const { exec } = require('child_process');
const config = {
lang: "deu",
oem: 3, // try different OEMs to see which one produces the best results
psm: 4, // try different PSMs to see which one produces the best results
}
let moduleInfo = {
dirPath: '',
pdf: 'a.pdf'
};
exec("gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r600 -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dColorConversionStrategy=/saturation -dAutoRotatePages=/None -sOutputFile=" + moduleInfo.dirPath + "output.tiff " + moduleInfo.dirPath + moduleInfo.pdf, (err, stdout, stderr) => {
if (err) {
console.error(`Error: ${err}`);
return;
}
console.log(stdout);
tesseract
.recognize("output.tiff", config)
.then((text) => {
console.log("Result:", text)
})
.catch((error) => {
console.log(error.message)
})
});
And these are some input examples:
input1
input2
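Once the OCR text is available, the label and its number can be picked out with a regular expression that tolerates stray whitespace. The sketch below is an assumption-laden illustration: the helper name extractGenesysId is hypothetical, the sample text is invented, and it assumes the ID consists of digits only, which the question does not confirm.

```javascript
// Hypothetical helper: return the value that follows "GeNeSys-ID:" in OCR text.
// Assumption: the ID is made of digits only; adjust the capture group if not.
function extractGenesysId(ocrText) {
  // Allow optional whitespace around the colon, since OCR may insert or drop spaces.
  const match = ocrText.match(/GeNeSys-ID\s*:\s*(\d+)/i);
  return match ? match[1] : null;
}

// Made-up sample text, standing in for the string resolved by tesseract.recognize(...)
const sample = "Some header\nGeNeSys-ID : 4711\nMore text";
console.log(extractGenesysId(sample)); // prints "4711"
```

This does not improve recognition itself; it only makes the lookup less sensitive to small spacing errors in the recognized text.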

How can I read a CSV file from a URL in a Next.js application?

I have a Next.js application here which needs to read a CSV file from a URL in the same repo in multiple places, but I cannot seem to be able to retrieve this data. You can find the relevant file in my repo here.
Note, the URL I'm trying to pull data from is this: https://raw.githubusercontent.com/ivan-rivera/balderdash-next/main/public/test_rare_words.csv
Here is what I've tried so far:
Approach 1: importing the data
let vocab = {};
...
async function buildVocab() {
const words = await import(VOCAB_URL); // this works when I point to a folder in my directory, but it does not work when I deploy this app. If I point at the URL address, I get an error saying that it cannot find the module
for (let i = 0; i < words.length; i++) {
vocab[words[i].word] = words[i].definition;
}
}
Approach 2: papaparse
const papa = require("papaparse");
let vocab = {};
...
export async function buildVocab() {
await papa.parse(
VOCAB_URL,
{
header: true,
download: true,
delimiter: ",",
step: function (row) {
console.log("Row:", row.data); // this prints data correctly
},
complete: function (results) {
console.log(results); // this returns an object with several attributes among which is "data" and "errors" and both are empty
},
}
);
// this does not work because `complete` does not return anything
vocab = Object.assign({}, ...raw.map((e) => ({ [e.word]: e.definition })));
console.log(vocab);
}
Approach 3: needle
const csvParser = require("csv-parser");
const needle = require("needle");
let vocab = {};
...
let result = [];
needle
.get(VOCAB_URL)
.pipe(csvParser())
.on("data", (data) => {
result.push(data);
});
vocab = Object.assign({}, ...result.map((e) => ({ [e.word]: e.definition })));
// This approach also returns nothing, however, I noticed that if I force it to sleep, then I do get the results I want:
setTimeout(() => {
console.log(result);
}, 1000); // now this prints the data I'm looking for
What I cannot figure out is how to force this function to wait for needle to retrieve the data. I've declared it as an async function and I'm calling it with await buildVocab() but it doesn't help.
Any ideas how I can fix this? Sorry, I'm a JS beginner, so it's probably something fundamental that I'm missing :(
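For the stream-based approach, one common way to wait for the data is to wrap the stream in a Promise that resolves on the "end" event, and await that. The sketch below uses Node's built-in Readable.from as a stand-in for needle.get(VOCAB_URL).pipe(csvParser()); the helper names collectRows and toVocab and the sample rows are made up for illustration.

```javascript
const { Readable } = require("stream");

// Resolve with all emitted rows once the stream ends; reject on a stream error.
function collectRows(stream) {
  return new Promise((resolve, reject) => {
    const rows = [];
    stream.on("data", (row) => rows.push(row));
    stream.on("end", () => resolve(rows));
    stream.on("error", reject);
  });
}

// Build a { word: definition } lookup from parsed rows.
function toVocab(rows) {
  return Object.fromEntries(rows.map((r) => [r.word, r.definition]));
}

// Stand-in for the needle + csv-parser pipeline; the rows are invented.
async function demo() {
  const fakeStream = Readable.from([
    { word: "abacus", definition: "a counting frame" },
    { word: "zephyr", definition: "a gentle breeze" },
  ]);
  return toVocab(await collectRows(fakeStream));
}

demo().then((vocab) => console.log(vocab));
```

Declaring a function async does not by itself make it wait for event callbacks; there has to be a promise to await, which is what collectRows provides here.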
After spending hours on this, I think I finally found a solution:
let vocab = {};
export async function buildVocab() {
await fetch(VOCAB_URL)
.then((resp) => resp.text())
.then((text) => {
papa.parse(text, { header: true }).data.forEach((row) => {
vocab[row.word] = row.definition;
});
});
}
The only oddity that I still can't work out is this: I'm calling my buildVocab function inside another async function and I noticed that if I do not include a console.log statement in that function, then the vocab still does not get populated in time. Here is the function:
export async function sampleWord() {
await buildVocab();
const keys = Object.keys(vocab);
const index = Math.floor(Math.random() * keys.length);
console.log(`selected word: ${keys[index]}`); // this is important!
return keys[index];
}
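The snippet alone doesn't show why the console.log would matter. One way to stop depending on a module-level variable being filled in time is to have the loader return the vocabulary and cache the promise, so every caller awaits the same load. This is a sketch under assumptions: makeVocabLoader is a hypothetical helper, the loader passed to it stands in for the fetch + papa.parse step, and the sample rows are invented.

```javascript
// Wrap a row loader so the vocabulary is built once and shared by all callers.
function makeVocabLoader(loadRows) {
  let cached = null; // the in-flight or finished promise
  return function getVocab() {
    if (cached === null) {
      cached = Promise.resolve(loadRows()).then((rows) => {
        const vocab = {};
        for (const row of rows) vocab[row.word] = row.definition;
        return vocab;
      });
    }
    return cached;
  };
}

// Stand-in loader; in the real app this would fetch and parse the CSV.
let loadCount = 0;
const getVocab = makeVocabLoader(async () => {
  loadCount += 1;
  return [
    { word: "abacus", definition: "a counting frame" },
    { word: "zephyr", definition: "a gentle breeze" },
  ];
});

async function sampleWord() {
  const vocab = await getVocab(); // use the returned value, not a global
  const keys = Object.keys(vocab);
  return keys[Math.floor(Math.random() * keys.length)];
}

sampleWord().then((word) => console.log(`selected word: ${word}`));
```

Because sampleWord uses the value it awaited, it cannot observe a half-filled object, and repeated calls do not trigger repeated downloads.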

Have I mismanaged a promise? or mismanaged a data type?

I would love to hear some ideas to solve this problem. Here's a javascript function. Make note of the comments, I'll refer to those specific lines later.
async function getGroupId(name) {
const response = await fetch(
environmentData.localFunctionsUrl + functionPath,
{
body: JSON.stringify({
command: `GetInterests`,
path: `/interest-categories/`
}),
method: `POST`
}
)
const groupId = await response.json().then((json) => {
return json.map((obj) => {
if(obj.name.substring(0, `YYYY-MM-DD`.length) === name.substring(0, `YYYY-MM-DD`.length)) {
return obj.id //LET'S CALL THIS LINE "yellow"
}
})[0] //AND THIS LINE, LET'S CALL IT "yellow-ish"
})
const retObj = { [groupId]: true } //LET'S CALL THIS LINE "orange"
return retObj
}
That function is called in my code like this:
async function registerSubscriber(data) {
const emailToHash = data.paymentIntent.metadata.contact_email
const response = await fetch(
environmentData.localFunctionsUrl + functionPath,
{
body: JSON.stringify({
command: `UpsertMember`,
path: `/members/`+ crypto.createHash(`md5`).update(emailToHash.toLowerCase()).digest(`hex`),
mailchimp_body: {
email_address: emailToHash,
merge_fields: {
COURSEDATE: data.paymentIntent.metadata.courseData_attr_name.substring(
0,
data.paymentIntent.metadata.courseData_attr_name.indexOf(`:`)
),
FNAME: data.paymentIntent.metadata.contact_firstname,
LNAME: data.paymentIntent.metadata.contact_surname
},
// HERE: NEXT LINE
interests: await getGroupId(data.paymentIntent.metadata.courseData_attr_name)
}
}),
method: `PATCH`
}
)
const json = await response.json()
return json
}
So, THIS ALL WORKS in my local environment. This code properly interacts with a lambda function I've created to interact with MailChimp's API.
Here's the problem: in my production environment (Netlify, fyi), line "yellow" is reached (a match is made in the if statement), so presumably there's an element available in line "yellow-ish".
But certainly groupId in line "orange" is undefined.
Theory 1:
My working theory is that it's failing due to a race condition: perhaps line "orange" runs before line "yellow" produces the relevant data. But I'm (almost) certain I've structured the promises correctly with the async and await keywords.
Theory 2:
My other working theory is that I've mismanaged the data types. Is there a reason that {[groupId]:true} might work in a Windows environment but not in a Linux environment (both running the same version of Node)?
Related to this theory: the value returned in line "yellow" sometimes begins with a number, sometimes with a letter. This shouldn't matter, but I mention it because line "orange" will sometimes look like this:
{ '123abc': true }
And sometimes without quotes like this:
{ abc123: true }
I presume this difference in syntax is known behaviour in JavaScript. Is it just how object keys are handled?
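On the quoting: computed property keys like these end up as strings either way, and Node's console.log only adds quotes when the key is not a valid identifier, so both outputs describe the same kind of object. The short sketch below illustrates that with made-up values, and also shows what an undefined groupId turns into.

```javascript
// Keys starting with a digit or a letter are both stored as plain strings.
const a = { ["123abc"]: true };
const b = { ["abc123"]: true };
console.log(Object.keys(a)); // [ '123abc' ]
console.log(Object.keys(b)); // [ 'abc123' ]

// An undefined value used as a computed key is converted to the string "undefined".
const groupId = undefined;
const c = { [groupId]: true };
console.log(Object.keys(c)); // [ 'undefined' ]
```

So an object printed as { undefined: true } points at groupId being undefined before line "orange", rather than at a platform difference in how keys are handled.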
Something that caught my attention was the map inside the following method:
const groupId = await response.json().then((json) => {
return json.map((obj) => {
if(obj.name.substring(0, `YYYY-MM-DD`.length) === name.substring(0, `YYYY-MM-DD`.length)) {
return obj.id //LET'S CALL THIS LINE "yellow"
}
})[0] //AND THIS LINE, LET'S CALL IT "yellow-ish"
})
While this could provide the value you're looking for, it's generally bad practice: map loops through the whole array and returns a new array of the same length, with undefined in every position where the condition doesn't match. So if the value you're looking for is not at index 0, [0] will give you undefined. The method you need is find, which returns the first element that satisfies the condition.
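A small illustration of the difference, using invented data in the shape the question implies (objects with id and name, where name starts with a date):

```javascript
// Made-up sample data; the real response shape is an assumption.
const groups = [
  { id: "a1", name: "2020-01-01: Intro" },
  { id: "b2", name: "2020-02-01: Advanced" },
];
const prefixLength = "YYYY-MM-DD".length;
const target = "2020-02-01: Advanced";
const matches = (obj) =>
  obj.name.substring(0, prefixLength) === target.substring(0, prefixLength);

// map keeps one slot per element, so non-matching slots become undefined.
const viaMap = groups.map((obj) => {
  if (matches(obj)) return obj.id;
})[0];

// find returns the first matching element itself (or undefined if none match).
const viaFind = groups.find(matches);

console.log(viaMap); // undefined, because the match is at index 1
console.log(viaFind.id); // b2
```

If the order of the returned categories differs between environments, this would explain why map(...)[0] works in one and not the other.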
Try the following code:
async function getGroupId(name) {
const response = await fetch(
environmentData.localFunctionsUrl + functionPath,
{
body: JSON.stringify({
command: `GetInterests`,
path: `/interest-categories/`
}),
method: `POST`
}
);
const responseJson = await response.json();
const group = responseJson.find(obj => obj.name.substring(0, `YYYY-MM-DD`.length) === name.substring(0, `YYYY-MM-DD`.length));
return {
[group.id]: true
};
}
async function registerSubscriber(data) {
const {
paymentIntent: {
metadata: {
contact_email,
courseData_attr_name,
contact_firstname,
contact_surname
}
}
} = data;
const interests = await getGroupId(courseData_attr_name);
const response = await fetch(
environmentData.localFunctionsUrl + functionPath,
{
body: JSON.stringify({
command: 'UpsertMember',
path: `/members/${crypto.createHash(`md5`).update(contact_email.toLowerCase()).digest(`hex`)}`,
mailchimp_body: {
email_address: contact_email,
merge_fields: {
COURSEDATE: courseData_attr_name.substring(0, courseData_attr_name.indexOf(`:`)),
FNAME: contact_firstname,
LNAME: contact_surname
},
interests
}
}),
method: 'PATCH'
}
);
const json = await response.json();
return json
}
Unfortunately, I can't say for certain whether your handling of the JSON response is wrong without sample (dummy) data. Note that if the parsed response is an object rather than an array, it has no map method.
Side note: you should try to stick to either async/await or then/catch.
Your code should work as written. If groupId is undefined, then either json.map((obj) => {...}) returned an empty array or its first element did not match the condition.
Try to debug or add console.log:
const groupId = await response.json().then((json) => {
console.log('server response', json);
return json.map((obj) => {
if(obj.name.substring(0, `YYYY-MM-DD`.length) === name.substring(0, `YYYY-MM-DD`.length)) {
return obj.id //LET'S CALL THIS LINE "yellow"
}
})[0] //AND THIS LINE, LET'S CALL IT "yellow-ish"
})
console.log('groupId', groupId);
const retObj = { [groupId]: true } //LET'S CALL THIS LINE "orange"
return retObj
PS: As mentioned in other comments, you should stick to either await or then and not mix them.

Iterating through javascript object to see text (like python requests.text)

We're currently trying to transfer code from a Python web scraper to a Node.js web scraper. The source is the Pastebin API. When scraping, the response is a javascript object like this:
[
{
scrape_url: 'https://scrape.pastebin.com/api_scrape_item.php?i=FD1BhNuR',
full_url: 'https://pastebin.com/FD1BhNuR',
date: '1580299104',
key: 'FD1BhNuR',
size: '19363',
expire: '0',
title: 'Weight Loss',
syntax: 'text',
user: 'loscanary'
}
]
Our Python script uses the requests library to request data from Pastebin's API and to get access to the actual body of the paste, in addition to the parameters above, we loop through the first entry and retrieve its text value. Here is an excerpt:
response = requests.get("https://scrape.pastebin.com/api_scraping.php?limit=1")
parsed_json = response.json()
print(parsed_json)
for individual in parsed_json:
p = requests.get(individual['scrape_url'])
text = p.text
print(text)
This brings back the actual body of the paste(s), which we can then search through to scrape for more keywords.
In Node, I don't know how to retrieve the same text value of the "scrape_url" parameter in the same way as I can with requests.text. I've tried using axios and request but the furthest I can get is accessing the "scrape_url" parameter with something like this:
const scrape = async () => {
try {
const result = await axios.get(pbUrl);
console.log(result.data[0].scrape_url);
} catch (err) {
console.error(err);
}
}
scrape();
How could I get the same result as I can with .text from the Python Requests library and in a loop?
Here is an example of how to do it (as mentioned by Olvin Roght):
const scrape = async () => {
try {
const result = await axios.get(pbUrl);
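// for...of (rather than forEach with an async callback) so that each request
// is awaited and any failure reaches the surrounding try/catch
for (const individual of result.data) {
const scrapeUrl = individual['scrape_url'];
const response = await axios.get(scrapeUrl);
const text = response.data;
console.log("this is the text value from the url:", text);
}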
} catch (err) {
console.error(err);
}
}
scrape();
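If the paste bodies should be fetched concurrently and collected into an array for later keyword searches, Promise.all over a mapped array is one option. This is a sketch, not the answerer's method: fetchPasteTexts and getText are hypothetical names, and a fake getter is used so the example runs without network access.

```javascript
// getText is a hypothetical function (url) => Promise<string>.
// Promise.all preserves the order of the input items in its result.
async function fetchPasteTexts(items, getText) {
  return Promise.all(items.map((item) => getText(item.scrape_url)));
}

// Fake getter standing in for an HTTP request.
const fakeGetText = async (url) => `body of ${url}`;

const sampleItems = [
  { scrape_url: "https://example.invalid/a" },
  { scrape_url: "https://example.invalid/b" },
];

fetchPasteTexts(sampleItems, fakeGetText).then((texts) => console.log(texts));
```

With axios, getText could be something like (url) => axios.get(url).then((res) => res.data), mirroring the response.data access used above.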

Parameterized/Prepared Statements usage pg-promise

I'm using koa v2 with pg-promise. I try to do a simple SELECT 2 + 2; within a Parameterized/Prepared Statement to test my setup:
// http://127.0.0.1:3000/sql/2
router.get('/sql/:id', async (ctx) => {
await db.any({
name: 'addition',
text: 'SELECT 2 + 2;',
})
.then((data) => {
console.log('DATA:', data);
ctx.state = { title: data }; // => I want to return num 4 instead of [Object Object]
})
.catch((error) => {
console.log('ERROR:', error);
ctx.body = '::DATABASE CONNECTION ERROR::';
})
.finally(pgp.end);
await ctx.render('index');
});
Which is rendering [Object Object] in the templates and returning this to the console from pg-monitor:
17:30:54 connect(postgres#postgres)
17:30:54 name="addition", text="SELECT 2 + 2;"
17:30:54 disconnect(postgres#postgres)
DATA: [ anonymous { '?column?': 4 } ]
My problem:
I want to store the result 4 in ctx.state. How can I access it within [ anonymous { '?column?': 4 } ]?
Thank You for your help!
Edit:
I found other recommended(1) ways(2) of dealing with named parameters in the official wiki.
// http://127.0.0.1:3000/sql/2
router.get('/sql/:id', async (ctx) => {
const obj = {
id: parseInt(ctx.params.id, 10),
};
await db.result('SELECT ${id} + ${id}', obj)
.then((data) => {
console.log('DATA:', data.rows[0]['?column?']);
ctx.state = { title: data.rows[0]['?column?'] }; // => 4
})
.catch((error) => {
console.log('ERROR:', error);
ctx.body = '::DATABASE CONNECTION ERROR::';
})
.finally(pgp.end);
await ctx.render('index');
});
I changed the any method to result, which returns the raw result object. Then I access the number 4 like a property of a JavaScript object. Am I doing something wrong? Is there another way to access this value?
What is the recommended, faster, safer way to do this?
Since you are requesting just one value, you should use the method one:
const {value} = await db.one({
name: 'addition',
text: 'SELECT 2 + 2 as value',
}); // value = 4
For an example like this, you cannot use the PreparedStatement or ParameterizedQuery types, because they format the query on the server side, and PostgreSQL does not support syntax like $1 + $1.
The real question is: do you really need those types?
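To make the difference between the unnamed column and the aliased one concrete, here is a plain JavaScript sketch of reading the value from each row shape. The row objects are written by hand to mirror the output logged in the question; no database call is made.

```javascript
// Shape of the result logged in the question: one row, one unnamed column.
const rows = [{ "?column?": 4 }];

// Without an alias, the column name is not a valid identifier,
// so it has to be read with bracket notation.
const viaBracket = rows[0]["?column?"];

// With "SELECT 2 + 2 as value", the single row would look like this instead,
// and destructuring works directly.
const aliasedRow = { value: 4 };
const { value } = aliasedRow;

console.log(viaBracket, value); // prints "4 4"
```

Either value can then be assigned to ctx.state.title as a number instead of an object.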
