I am trying to count the number of pages of a PDF file with my own function, without using the built-in method.
const fs = require("fs");
const pdf = require("pdf-parse");

let data = fs.readFileSync("pdf/book.pdf");

const countpage = (str) => {
  return str.reduce((el) => {
    return el + 1;
  }, 0);
};

pdf(data).then(function (data) {
  console.log(countpage(data.numpages));
});
I want the output to look like "number of pages = 272". When I use data.numpages directly I get the result,
but when I try to do the same thing with my own counting function I don't get the expected result;
the output looks like {numberpage:273}.
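For reference, a minimal sketch of what is likely going wrong here (assuming pdf-parse as in the snippet above): data.numpages is already a plain number holding the page count, not an array, so Array.prototype.reduce cannot be called on it. If the goal is just to print the count, no extra counting function is needed:

const fs = require("fs");
const pdf = require("pdf-parse");

const data = fs.readFileSync("pdf/book.pdf");
pdf(data).then((info) => {
  // info.numpages is the total page count reported by pdf-parse
  console.log(`number of pages = ${info.numpages}`);
});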
I am currently trying to save a JS object that contains some binary data along with other values. The result should look something like this:
{
  "value": "xyz",
  "file1": "[FileContent]",
  "file2": "[LargeFileContent]"
}
Until now I had no binary data, so I saved everything as JSON. With the binary data I am starting to run into problems with large files (>1GB).
I tried this approach:
JSON.stringify or how to serialize binary data as base64 encoded JSON?
That worked for smaller files of around 20MB. However, with these large files the result of the FileReader is always an empty string.
The result would look like this:
{
  "value": "xyz",
  "file1": "[FileContent]",
  "file2": ""
}
The code that is reading the blobs is pretty similar to the one in the other post:
const readFiles = async (measurements: FormData) => {
  setFiles([]); // This is where the result is being stored
  let promises: Array<Promise<string>> = [];
  measurements.forEach((value) => {
    let dataBlob = value as Blob;
    console.log(dataBlob); // Everything is fine here
    promises.push(
      new Promise((resolve, reject) => {
        const reader = new FileReader();
        reader.readAsDataURL(dataBlob);
        reader.onloadend = function () {
          resolve(reader.result as string);
        };
        reader.onerror = function (error) {
          reject(error);
        };
      })
    );
  });
  let result = await Promise.all(promises);
  console.log(result); // large file shows empty
  setFiles(result);
};
Is there something else I can try?
Since you have to share the data with other computers, you will have to generate your own binary format.
Obviously you can design it however you wish, but given your simple case of just storing Blob objects alongside a JSON string, we can come up with a very simple schema: first some metadata about the stored Blobs, then the JSON string in which each Blob has been replaced with a unique placeholder ID.
This works because the limitation you hit is actually the maximum length a string can be, and we can .slice() our binary file to read only part of it. Since we never read the binary data as a string we're fine; the JSON only holds a short ID wherever there was a Blob, so it shouldn't grow much.
Here is one such implementation I made quickly as a proof of concept:
/*
 * Stores JSON data along with Blob objects in a binary file.
 * Schema:
 *   first 4 bytes          = # of Blobs stored in the file
 *   next 4 * (# of Blobs)  = size of each Blob
 *   remaining              = JSON string
 */
const hopefully_unique_id = "_blob_"; // <-- change that

function generateBinary(JSObject) {
  let blobIndex = 0;
  const blobsMap = new Map();
  const JSONString = JSON.stringify(JSObject, (key, value) => {
    if (value instanceof Blob) {
      if (blobsMap.has(value)) {
        return blobsMap.get(value);
      }
      // 1-based placeholder; readBinary() subtracts 1 to get the array index
      blobsMap.set(value, hopefully_unique_id + (++blobIndex));
      return blobsMap.get(value);
    }
    return value;
  });
  const blobsArr = [...blobsMap.keys()];
  const data = [
    new Uint32Array([blobsArr.length]),
    ...blobsArr.map((blob) => new Uint32Array([blob.size])),
    ...blobsArr,
    JSONString
  ];
  return new Blob(data);
}
async function readBinary(bin) {
  const numberOfBlobs = new Uint32Array(await bin.slice(0, 4).arrayBuffer())[0];
  let cursor = 4 * (numberOfBlobs + 1);
  const blobSizes = new Uint32Array(await bin.slice(4, cursor).arrayBuffer());
  const blobs = [];
  for (let i = 0; i < numberOfBlobs; i++) {
    const blobSize = blobSizes[i];
    blobs.push(bin.slice(cursor, cursor += blobSize));
  }
  const pattern = new RegExp(`^${hopefully_unique_id}\\d+$`);
  const JSObject = JSON.parse(
    await bin.slice(cursor).text(),
    (key, value) => {
      if (typeof value !== "string" || !pattern.test(value)) {
        return value;
      }
      const index = +value.replace(hopefully_unique_id, "") - 1;
      return blobs[index];
    }
  );
  return JSObject;
}
// demo usage
(async () => {
  const obj = {
    foo: "bar",
    file1: new Blob(["Let's pretend I'm actually binary data"]),
    // This one is 512MiB, which is bigger than the max string size in Chrome,
    // i.e. it can't be stored in a JSON string in Chrome
    file2: new Blob([Uint8Array.from({ length: 512 * 1024 * 1024 }, () => 255)]),
  };
  const bin = generateBinary(obj);
  console.log("as binary", bin);
  const back = await readBinary(bin);
  console.log({ back });
  console.log("file1 read as text:", await back.file1.text());
})().catch(console.error);
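Not part of the original answer, but for completeness: one simple way to persist the Blob that generateBinary() returns from a browser page is to trigger a download. Only standard DOM APIs are used here, and the file names are just examples:

// Minimal sketch: persist the Blob produced by generateBinary() by
// triggering a browser download.
function downloadBlob(blob, fileName = "backup.bin") {
  const url = URL.createObjectURL(blob);
  const a = document.createElement("a");
  a.href = url;
  a.download = fileName; // suggested file name for the download
  a.click();
  URL.revokeObjectURL(url);
}

// e.g. downloadBlob(generateBinary(obj), "measurements.bin");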
How can I combine these two pieces of code, so that it doesn't just convert CSV to JSON (first code) but also saves the result as a JSON array in a separate file (second code)?
This (first) code converts a CSV file to a JSON array:
const fs = require("fs");
const csvtojson = require("csvtojson");

let fileReadStream = fs.createReadStream("myCsvFile.csv");
let invalidLineCount = 0;

csvtojson({ "delimiter": ";", "fork": true })
  .preFileLine((fileLineString, lineIdx) => {
    let invalidLinePattern = /^['"].*[^"'];/;
    if (invalidLinePattern.test(fileLineString)) {
      console.log(`Line #${lineIdx + 1} is invalid, skipping:`, fileLineString);
      fileLineString = "";
      invalidLineCount++;
    }
    return fileLineString;
  })
  .fromStream(fileReadStream)
  .subscribe((dataObj) => {
    console.log(dataObj);
    // I added the second code here, but it writes only the last object of the array (because of the loop?)
  });
and this (second) code saves the JSON array to an external file:
fs.writeFile('example.json', JSON.stringify(dataObj, null, 4), (err) => {
  if (err) console.log(err);
});
The question is how to combine the second snippet with the first one?
You can use the .on('done', (error) => { ... }) method (csvtojson). Push the data into a variable in the subscribe callback and write it out as JSON in .on('done'). (My test was successful.)
Check it out:
const fs = require("fs");
const csvtojson = require("csvtojson");

let fileReadStream = fs.createReadStream("username-password.csv");
let invalidLineCount = 0;
let data = [];

csvtojson({ "delimiter": ";", "fork": true })
  .preFileLine((fileLineString, lineIdx) => {
    let invalidLinePattern = /^['"].*[^"'];/;
    if (invalidLinePattern.test(fileLineString)) {
      console.log(`Line #${lineIdx + 1} is invalid, skipping:`, fileLineString);
      fileLineString = "";
      invalidLineCount++;
    }
    return fileLineString;
  })
  .fromStream(fileReadStream)
  .subscribe((dataObj) => {
    // console.log(dataObj)
    data.push(dataObj);
  })
  .on('done', (error) => {
    fs.writeFileSync('example.json', JSON.stringify(data, null, 4));
  });
Not sure if you are able to change the library, but I would definitely recommend Papaparse for this - https://www.npmjs.com/package/papaparse
Your code would then look something like this:
const fs = require('fs'), papa = require('papaparse');

var readFile = fs.createReadStream(file); // "file" is your CSV file path
papa.parse(readFile, {
  complete: function (results, file) {
    fs.writeFile('example.json', JSON.stringify(results.data), function (err) {
      if (err) console.log(err);
      // callback etc
    });
  }
});
I am trying to download, unzip, and process a GTFS file in ZIP format. Downloading and unzipping work, but I get an error message when I try to use the txt files with the gtfs-utils module in gtfsFunc(). The output is undefined. The delays are hardcoded just for testing purposes.
// Requires (not shown in the original); presumably something like:
const { DownloaderHelper } = require('node-downloader-helper');
const AdmZip = require('adm-zip');
const readCsv = require('gtfs-utils/read-csv');
const readStops = require('gtfs-utils/read-stops');

const dl = new DownloaderHelper('http://www.bkk.hu/gtfs/budapest_gtfs.zip', __dirname);
dl.on('end', () => console.log('Download Completed'));
dl.start();

let myVar = setTimeout(zipFunc, 30000);
function zipFunc() {
  console.log('Unzipping started...');
  var zip = new AdmZip("./budapest_gtfs.zip");
  var zipEntries = zip.getEntries();
  zip.extractAllTo("./gtfsdata/", true);
}

myVar = setTimeout(gtfsFunc, 40000);
function gtfsFunc() {
  console.log('Processing started...');
  const readFile = name => readCsv('./gtfsdata/' + name + '.txt');
  const filter = t => t.route_id === 'M4';
  readStops(readFile, filter)
    .then((stops) => {
      const someStopId = Object.keys(stops)[0];
      const someStop = stops[someStopId];
      console.log(someStop);
    });
}
As #ChickenSoups said, you are trying to filter the stops file by a route_id field, and that txt file doesn't have this field.
The fields that stops has are:
stop_id, stop_name, stop_lat, stop_lon, stop_code, location_type, parent_station, wheelchair_boarding, stop_direction
Perhaps what you need is to read the trips.txt file instead of stops.txt, as that file does have a route_id field.
You can accomplish this using the readTrips function:
const readTrips = require("gtfs-utils/read-trips");
And your gtfsFunc would be:
function gtfsFunc() {
  console.log("Processing started...");
  const readFile = (name) => {
    return readCsv("./gtfsdata/" + name + ".txt").on("error", console.error);
  };
  // I used 5200 because your trips.txt contains route IDs with this value
  const filterTrips = (t) => t.route_id === "5200";
  readTrips(readFile, filterTrips).then((stops) => {
    console.log("filtered stops", stops);
    const someStopId = Object.keys(stops)[0];
    const someStop = stops[someStopId];
    console.log("someStop", someStop);
  });
}
Or if what you really want is to read Stops.txt, you just need to change your filter
const filter = t => t.route_id === 'M4'
to use some valid field, for example:
const filter = t => t.stop_name=== 'M4'
Stop data doesn't have a route_id field.
You should try other data instead, such as trips or routes.
You can look at the first row of your data file to see which fields it has.
GTFS data structure here
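If it helps, here is a small sketch (plain Node.js, nothing specific to gtfs-utils) for peeking at the header row of one of the extracted GTFS files to see which fields it actually provides; the path is just an example:

const fs = require("fs");
const readline = require("readline");

const rl = readline.createInterface({
  input: fs.createReadStream("./gtfsdata/stops.txt"),
});
rl.once("line", (firstLine) => {
  // the first line of a GTFS txt file lists its column names
  console.log("available fields:", firstLine.split(","));
  rl.close();
});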
I have a CSV file that can contain around a million records. How can I remove the columns whose names start with _ and generate a resulting CSV?
For the sake of simplicity, consider that I have the CSV below:
Sr.No Col1 Col2 _Col3 Col4 _Col5
1 txt png 676766 win 8787
2 jpg pdf 565657 lin 8787
3 pdf jpg 786786 lin 9898
I would want the output to be
Sr.No Col1 Col2 Col4
1 txt png win
2 jpg pdf lin
3 pdf jpg lin
Do I need to read the entire file to achieve this, or is there a better approach?
const csv = require('csv-parser');
const fs = require('fs');

fs.createReadStream('data.csv')
  .pipe(csv())
  .on('data', (row) => {
    // generate a new csv with removing specific column
  })
  .on('end', () => {
    console.log('CSV file successfully processed');
  });
Any help on how I can achieve this would be appreciated.
Thanks.
To anyone who stumbles on this post:
I was able to transform the CSVs using the code below with the fs and csv modules.
await fs.createReadStream(m.path)
  .pipe(csv.parse({ delimiter: '\t', columns: true }))
  .pipe(csv.transform((input) => {
    delete input['_Col3'];
    console.log(input);
    return input;
  }))
  .pipe(csv.stringify({ header: true }))
  .pipe(fs.createWriteStream(transformedPath))
  .on('finish', () => {
    console.log('finish....');
  })
  .on('error', () => {
    console.log('error.....');
  });
Source: https://gist.github.com/donmccurdy/6cbcd8cee74301f92b4400b376efda1d
Actually you can handle that by using two npm packages.
https://www.npmjs.com/package/csvtojson
to convert your CSV to JSON format, and then
https://www.npmjs.com/package/json2csv
to write it back to CSV. If you know exactly which fields you want, you can pass options to select only those fields.
const { Parser } = require('json2csv');

const fields = ['field1', 'field2', 'field3'];
const opts = { fields };

try {
  const parser = new Parser(opts);
  const csv = parser.parse(myData);
  console.log(csv);
} catch (err) {
  console.error(err);
}
Or you can modify the JSON objects manually to drop those columns.
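A rough sketch of that manual approach, combining the two packages named above and dropping every key that starts with an underscore. The file names are placeholders, and the whole file is held in memory here, so for millions of rows a streaming approach like the ones above may be preferable:

const csvtojson = require("csvtojson");
const { Parser } = require("json2csv");
const fs = require("fs");

// Drop every property whose name starts with "_" from a parsed row.
const stripUnderscoreCols = (row) =>
  Object.fromEntries(Object.entries(row).filter(([key]) => !key.startsWith("_")));

csvtojson()
  .fromFile("data.csv") // placeholder input path
  .then((rows) => {
    const cleaned = rows.map(stripUnderscoreCols);
    const csv = new Parser().parse(cleaned);
    fs.writeFileSync("output.csv", csv); // placeholder output path
  })
  .catch(console.error);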
Try this with the csv lib:
const csv = require('csv');
const fs = require('fs');

const csvString = `col1,col2
value1,value2`;

csv.parse(csvString, { columns: true })
  .pipe(csv.transform(({ col1, col2 }) => ({ col1 }))) // remove col2
  .pipe(csv.stringify({ header: true }))
  .pipe(fs.createWriteStream('./file.csv'));
With this function I accomplished removing a column from a CSV:
function removeCol(csv, col) {
  let lines = csv.split("\n");
  let headers = lines[0].split(",");
  let colNameToRemove = headers.find(h => h.trim() === col);
  let index = headers.indexOf(colNameToRemove);
  let newLines = [];
  lines.map((line) => {
    let fields = line.split(",");
    fields.splice(index, 1);
    newLines.push(fields);
  });
  let arrData = '';
  for (let index = 0; index < newLines.length; index++) {
    const element = newLines[index];
    arrData += element.join(',') + '\n';
  }
  return arrData;
}
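For instance, assuming the CSV content has already been read into a string (the file names and the second call are just illustrative):

const fs = require("fs");

const input = fs.readFileSync("data.csv", "utf8");
const output = removeCol(removeCol(input, "_Col3"), "_Col5"); // remove both _-prefixed columns
fs.writeFileSync("data-clean.csv", output);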
I am working with a CSV file (~20MB). It has three columns, as shown below, and the file name represents the timestamp of the first row (in this case the file name is 20200325131010000.csv).
x;y;z
3;-132;976
3;-131;978
3;-130;975
4;-132;975
5;-131;976
3;-132;975
The difference between the timestamps of consecutive rows is 20 ms. How can I efficiently populate the new timestamp column in the existing file? The final CSV file should look like this:
timestamp;x;y;z
20200325131010000;3;-132;976
20200325131010020;3;-131;978
20200325131010040;3;-130;975
20200325131010060;4;-132;975
20200325131010080;5;-131;976
20200325131010100;3;-132;975
So far I have tried the following code:
const csv = require('csv-parser');
const fs = require('fs');
var json2csv = require('json2csv').parse;

var dataArray = [];

fs.createReadStream('C:/Users/Downloads/20200325131010000.csv')
  .pipe(csv())
  .on('data', (row) => {
    row.timestamp = "20200325131010000";
    dataArray.push(row);
  })
  .on('end', () => {
    var result = json2csv({ data: dataArray, fields: Object.keys(dataArray[0]) });
    fs.writeFileSync("test.csv", result);
  });
The above code generates the following output (all timestamps are the same, which is not what I want):
timestamp;x;y;z
20200325131010000;3;-132;976
20200325131010000;3;-131;978
20200325131010000;3;-130;975
20200325131010000;4;-132;975
20200325131010000;5;-131;976
20200325131010000;3;-132;975
The problem with this code is that it adds the same timestamp (i.e. 20200325131010000) to every row. How can I correctly populate the timestamp column?
Kindly suggest an efficient way to achieve this.
I think this code should work and solve your problem.
const csv = require('csv-parser');
const fs = require('fs');
var json2csv = require('json2csv').parse;

const filePath = 'C:/Users/Downloads/20200325131010000.csv';
const fileName = (/^.+\/([^/]+)\.csv$/gi).exec(filePath)[1]; // extract the file name from the path
const initialTimestamp = parseInt(fileName); // convert the string to a number
let i = 0;
var dataArray = [];

fs.createReadStream(filePath)
  .pipe(csv())
  .on('data', (row) => {
    row.timestamp = (initialTimestamp + (i * 20)).toString(); // increase the timestamp by 20 for each row
    dataArray.push(row);
    i++;
  })
  .on('end', () => {
    var result = json2csv({ data: dataArray, fields: Object.keys(dataArray[0]) });
    fs.writeFileSync("test.csv", result);
  });
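One caveat, not part of the original answer: a 17-digit value like 20200325131010000 is larger than Number.MAX_SAFE_INTEGER, so parseInt and plain number arithmetic may lose precision for some inputs. If exact timestamps matter, a hedged variant is to do the arithmetic with BigInt (available in Node.js 10.4+), reusing the fileName, i, and row variables from the code above:

// Variant of the timestamp arithmetic using BigInt to stay exact
// for values beyond Number.MAX_SAFE_INTEGER.
const initialTimestamp = BigInt(fileName);
// inside the 'data' handler:
row.timestamp = (initialTimestamp + BigInt(i) * 20n).toString();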