I have some very large (> 500MB) JSON files that I need to map to a new format and upload to a new DB.
The old format:
{
id: '001',
timestamp: 2016-06-02T14:10:53Z,
contentLength: 123456,
filepath: 'original/...',
size: 'original'
},
{
id: '001',
timestamp: 2016-06-02T14:10:53Z,
contentLength: 24565,
filepath: 'medium/...',
size: 'medium'
},
{
id: '001',
timestamp: 2016-06-02T14:10:53Z,
contentLength: 5464,
filepath: 'small/...',
size: 'small'
}
The new format:
{
Id: '001',
Timestamp: 2016-06-02T14:10:53Z,
OriginalSize: {
ContentLength: 123456,
FilePath: 'original/...'
},
MediumSize: {
ContentLength: 24565,
FilePath: 'medium/...'
},
SmallSize: {
ContentLength: 5464,
FilePath: 'small/...'
}
}
I was achieving this with small datasets using code like the following, processing the 'original' size first:
let out = data.filter(o => o.size === 'original').map(o => {
return {
Id: o.id,
Timestamp: o.timestamp,
OriginalSize: {
ContentLength: o.contentLength,
FilePath: o.filepath
}
};
});
data.filter(o => o.size !== 'original').forEach(o => {
let orig = out.find(function (og) {
return og.Timestamp === o.timestamp;
});
// capitalise the key so that 'medium' becomes 'MediumSize', matching the target format
orig[o.size[0].toUpperCase() + o.size.slice(1) + 'Size'] = {
ContentLength: o.contentLength,
FilePath: o.filepath
};
});
// out now contains the correctly-formatted objects
The problem comes with the very large datasets, where I can't load the hundreds of megabytes of JSON into memory all at once. This seems like a great time to use streams, but of course if I read the file in chunks, running .find() on a small array to find the 'original' size won't work. If I scan through the whole file to find originals and then scan through again to add the other sizes to what I've found, I end up with the whole dataset in memory anyway.
I know of JSONStream, which would be great if I were doing a simple 1-to-1 remapping of my objects.
Surely I can't be the first one to run into this kind of problem. What solutions have been used in the past? How can I approach this?
I think the trick is to update the database on the fly. If the JSON file is too big for memory, then I expect the resulting set of objects (out in your example) is too big for memory too.
In the comments you state the JSON file has one object per line. Therefore, use Node.js's built-in fs.createReadStream and readline to get each line of the text file. Then process each line (a string) into a JSON object, and finally update the database.
parse.js
var readline = require('readline');
var fs = require('fs');
var jsonfile = 'text.json';
var linereader = readline.createInterface({
input: fs.createReadStream(jsonfile)
});
linereader.on('line', function (line) {
var obj = parseJSON(line); // convert line (string) to a JSON object
// check DB for existing id/timestamp
if ( existsInDB({id:obj.id, timestamp:obj.timestamp}) ) {
updateInDB(obj); // already exists, so UPDATE
}
else { insertInDB(obj); } // does not exist, so INSERT
});
// DUMMY functions below, implement according to your needs
function parseJSON (str) {
str = str.replace(/,\s*$/, ""); // lose trailing comma
return eval('(' + str + ')'); // insecure! so no unknown sources
}
function existsInDB (obj) { return true; }
function updateInDB (obj) { console.log(obj); }
function insertInDB (obj) { console.log(obj); }
text.json
{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 123456, filepath: 'original/...', size: 'original' },
{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 24565, filepath: 'medium/...', size: 'medium' },
{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 5464, filepath: 'small/...', size: 'small' }
NOTE: I needed to quote the timestamp value to avoid a syntax error. From your question and example script I expect you either don't have this problem or have already solved it, maybe in another way.
Also, my implementation of parseJSON may be different from how you are parsing the JSON. Plain old JSON.parse failed for me due to the properties not being quoted.
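To illustrate what updateInDB / insertInDB might do while producing the target format directly, here is a minimal sketch that assumes the new DB is MongoDB with the official mongodb driver (the question only says "a new DB", so the driver, the images collection name, and the connection details are assumptions). One upsert per line folds each size variant into the right sub-document, so nothing has to be held in memory:
// Sketch: fold one old-format record into the new nested shape with a single upsert.
// Assumes the official `mongodb` driver and a collection called `images`.
const { MongoClient } = require('mongodb');

const sizeKeys = { original: 'OriginalSize', medium: 'MediumSize', small: 'SmallSize' };

async function upsertRecord(collection, obj) {
  await collection.updateOne(
    { Id: obj.id, Timestamp: obj.timestamp },   // one document per id/timestamp
    {
      $set: {
        [sizeKeys[obj.size]]: {                 // e.g. MediumSize: { ... }
          ContentLength: obj.contentLength,
          FilePath: obj.filepath
        }
      }
    },
    { upsert: true }                            // insert if the document doesn't exist yet
  );
}

// usage inside the 'line' handler:
//   const client = await MongoClient.connect('mongodb://localhost:27017');
//   const collection = client.db('mydb').collection('images');
//   linereader.on('line', line => upsertRecord(collection, parseJSON(line)));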
Set up a DB instance that can store JSON documents, such as MongoDB or PostgreSQL (which recently introduced the jsonb data type for storing JSON documents). Iterate through the old JSON documents and combine them into the new structure, using the DB as the storage, so that you overcome the memory problem.
I'm quite sure there is no way to achieve your goal without either a) drastically compromising the speed of the process or b) creating a poor man's DB from scratch (which seems like a bad thing to do :) )
Related
I have a JavaScript object of the following form:
const data = {
title: "short string",
descriptions: [
"Really long string...",
"Really long string..."
]
}
The long strings need to be excluded from the indexes and, for whatever reason, I can't figure out what format the object needs to be to save it:
const entityToSave = dataToDatastore(data);
datastore.save({
key: datastore.key(["TestEntity"]),
data: entityToSave
})
I simply need to know what entityToSave should look like. I've tried about twenty different shapes, and every attempt that uses excludeFromIndexes has either thrown an error saying the string was too large or ended up as an Entity type instead of an Array type.
I can get it to work via the GCP admin interface so I feel like I'm going crazy.
Edit: For convenience I am including a script that should run as long as you (1) set the PROJECT_ID and (2) add an appropriately long string to the descriptions array.
const { Datastore } = require("#google-cloud/datastore");
const PROJECT_ID = null;
const data = {
title: "short string",
descriptions: [
"Really long string...",
"Really long string...",
]
}
const entityToSave = dataToDatastore(data);
async function save() {
const datastore = new Datastore({
projectId: PROJECT_ID,
});
console.log(entityToSave);
const entity = {
key: datastore.key(["TestEntity"]),
data: entityToSave
};
datastore.save(entity);
}
function dataToDatastore(data) {
return data
}
save();
I simply need to know what dataToDatastore should look like. I've already tried numerous variations based on documentation and discussions from four or five different places and not one has worked.
You have to apply excludeFromIndexes: true to every entry in the array that crosses the 1500-byte limit.
const data = {
title: "short string",
descriptions: [
"Really long string...",//> 1500 bytes
"Really long string...", //> 1500 bytes
]
}
Here is what entityToSave should look like:
const entityToSave = data.descriptions.map(description => ({
value: description,
excludeFromIndexes: true
}));
console.log(entityToSave);
// this will transform data.descriptions to
// [
// { value: 'Really long string...', excludeFromIndexes: true },
// { value: 'Really long string...', excludeFromIndexes: true }
// ]
This will apply `excludeFromIndexes: true` to all of the array entries.
//Then save the entity
datastore.save({
key: datastore.key(["TestEntity"]),
data: entityToSave
})
For more information, check this GitHub issue and Stack Overflow thread.
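As an aside, if only some of the descriptions actually exceed the limit, the same per-value form can decide case by case with a byte-length check (just a sketch, not required for the fix above; 1500 bytes is Datastore's limit for indexed string values):
// Sketch: only exclude the values that would exceed the 1500-byte index limit
const INDEX_LIMIT_BYTES = 1500;

const entityToSave = data.descriptions.map(description => ({
  value: description,
  excludeFromIndexes: Buffer.byteLength(description, 'utf8') > INDEX_LIMIT_BYTES
}));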
I'm working on building a database from a set of CSV files and am running into issues pushing elements to a deeply nested array. I've seen examples for 2D arrays and the use of positional operators but I can't quite figure out how to use them for my situation.
What I am trying to do is read the CSV file, which has columns for the answer_id and the url string for the photo. I want to push the url string to the photos array for the corresponding answer_id. When I try to use the code below, I get a long error message which starts with:
MongoBulkWriteError: Cannot create field 'answers' in element
{
results: [
{
question_id: "1",
_id: ObjectId('6332e0b015c1d1f4eccebf4e'),
answers: [
{
answer_id: "5",
//....
}
],
}
]
}
I formatted the above message to make things easier to read. It may be worth noting that the first row of my CSV file has '5' in the answer_id column, which makes me think things are failing on the first attempt to update my document.
Here is an example of my code:
const ExampleModel = new Schema({
example_id: String,
results: [
{
question_id: String,
answers: [
{
answer_id: String,
photos: [
{ url: String },
]
}
]
}
]
});
// Example Operation
// row.answer_id comes from CSV file
updateOne: {
'filter': {
'results.answers.answer_id': row.answer_id,
},
'update': {
'$push': {
'results.answers.$.photos' : { 'url': 'test'}
}
}
}
I guess my question is: can I update an array this deeply nested using Mongoose?
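For reference, the plain positional $ operator only resolves one array level, which fits the error above; a push this deep usually needs the filtered positional operators instead. Here is a sketch of how the bulk operation might look with arrayFilters (MongoDB 3.6+); the field names follow the schema shown, but treat it as an untested assumption:
// Sketch: push into results[].answers[].photos using filtered positional operators
updateOne: {
  'filter': {
    'results.answers.answer_id': row.answer_id,
  },
  'update': {
    '$push': {
      // $[] walks every element of `results`, $[ans] only the matching answer
      'results.$[].answers.$[ans].photos': { 'url': 'test' }
    }
  },
  'arrayFilters': [
    { 'ans.answer_id': row.answer_id }
  ]
}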
This is TypeScript.
Can anybody help with the following:
I read data from a CSV file
Transform this data on the fly (remove some extra columns)
Then I want the updated CSV from the stream to end up back in a variable in the code.
console.log(updatedCsv) // in the stream - displays what I need
BUT!
When I try to push it into an array, nothing happens, and the variable (into which I pushed the data from the stream) ends up undefined:
import * as fs from "fs";
import * as csv from "csv";
updateCsv(){
fs.createReadStream('allure-report/data/suites.csv')
.pipe(csv.parse({ delimiter: ',', columns: true }))
.pipe(csv.transform((input) => {
console.log(input) // <----- it shows in console data I needed
/* like this:
{
Status: 'passed',
'Start Time': 'Wed Nov 11 17:37:33 EET 2020',
'Stop Time': 'Wed Nov 11 17:37:33 EET 2020',
'Duration in ms': '1',
'Parent Suite': '',
Suite: 'The Internet Guinea Pig Website: As a user, I can log into the secure area',
'Sub Suite': '',
'Test Class': 'The Internet Guinea Pig Website: As a user, I can log into the secure area',
'Test Method': 'Hook',
Name: 'Hook',
Description: ''
}
*/
skipHeaders.forEach((header) => delete input[header]);
this.rowsArray = input // NOTHING HAPPENS; rowsArray is declared as rowsArray: string[] = new Array(). I don't know what type input is, and push doesn't work either - I can't get this data out of the pipe
return input;
}))
.pipe(csv.stringify({ header: true }))
.pipe(fs.createWriteStream(this.path))
}
AND ALSO
As a workaround I wanted to read the newly generated CSV, but that was also unsuccessful; it looks like I need to use promises. I tried some examples from the internet but failed. PLEASE HELP.
For those who are wondering - I was able to achieve my goal using the following approach.
BUT!! I still wonder how to handle this problem via promises / async-await approaches.
import * as fs from "fs";
import * as readline from "readline";
// Note: skipHeaders (the columns to drop) and the removeExtraQuotes() method are not shown here
class CsvFormatter{
pathToNotUpdatedCsv: string
readline: any
readStream: any
headers: any
fieldSchema: string[] = new Array()
rowsArray: string[] = new Array()
constructor(pathToCsv: string, encoding: string) {
this.pathToNotUpdatedCsv = pathToCsv
this.readStream = fs.createReadStream(this.pathToNotUpdatedCsv, encoding = 'utf8');
}
async updateCsv(){
//read all csv lines of not updated file
this.readline = readline.createInterface({
input: this.readStream,
crlfDelay: Infinity
});
//save them to array
for await (const line of this.readline) {
this.rowsArray.push(line)
}
//remove columns in csv and return updated csv array
this.rowsArray = this.getUpdatedRows()
//separating headers and other rows in csv
this.headers = this.rowsArray.shift()
}
getUpdatedRows(){
let headersBeforeUpdate = this.removeExtraQuotes(this.rowsArray[0])
let rowsAfterUpdate = []
let indexesOfColumnToDelete = []
let partOfUpdatedArray = []
//get indexes which will be used for deletion of headers and content rows
skipHeaders.forEach((header) => {
indexesOfColumnToDelete.push(headersBeforeUpdate.indexOf(header))
})
//delete the skipped columns from every row (highest index first, so removing
//one cell does not shift the indexes that are still to be removed)
indexesOfColumnToDelete.sort((a, b) => b - a)
this.rowsArray.forEach(row => {
partOfUpdatedArray = this.removeExtraQuotes(row)
indexesOfColumnToDelete.forEach(index => {
partOfUpdatedArray.splice(index, 1)
})
rowsAfterUpdate.push(partOfUpdatedArray)
})
return rowsAfterUpdate
}
}
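On the promises / async-await question: one way to stay on the original csv stream pipeline is to wrap it in a promise that collects the transformed rows and resolves once the output file has been written. This is only a sketch; the skipHeaders contents and the file paths are placeholders, and it assumes the same csv package as the question:
import * as fs from "fs";
import * as csv from "csv";

const skipHeaders = ["Description", "Test Method"]; // columns to drop (placeholder values)

function updateCsv(inputPath, outputPath) {
  return new Promise((resolve, reject) => {
    const rows = [];
    fs.createReadStream(inputPath)
      .pipe(csv.parse({ delimiter: ',', columns: true }))
      .pipe(csv.transform((record) => {
        skipHeaders.forEach((header) => delete record[header]);
        rows.push(record);   // keep the transformed row so it can leave the pipe
        return record;       // and still pass it on to the stringifier
      }))
      .pipe(csv.stringify({ header: true }))
      .pipe(fs.createWriteStream(outputPath))
      .on('finish', () => resolve(rows))   // writable side flushed: rows are complete
      .on('error', reject);
  });
}

// usage:
// const rows = await updateCsv('allure-report/data/suites.csv', 'allure-report/data/updated.csv');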
I have a large JSON file with all my previous users. I need to prepare them to be imported. I keep getting this error: Error 4 failed to import: FirebaseAuthError: The password hash must be a valid byte buffer.
What is the proper way to store a hashed password as byte buffer in a json?
var jsonFile = require('./users.json');
var fs = require('fs')
let newArr = []
jsonFile.file.slice(0, 5).map( val => {
newArr.push({
uid: val.id,
email: val.email,
passwordHash: Buffer.from(val.password) // val.password is hashed
})
})
let data = JSON.stringify(newArr);
fs.writeFileSync('newArr.json', data);
In my main import file
var jsonFile = require('./newArr.json');
// I was testing it like that and everything is working fine.
const userImportRecords = [
{
uid: '555555555555',
email: 'user#example.com',
passwordHash: Buffer.from('$2a$10$P6TOqRVAXL2FLRzq9Ii6AeGqzV4mX8UNdpHvlLr.4DPxq2Xsd54KK')
}
];
admin.auth().importUsers(jsonFile, {
hash: {
algorithm: 'BCRYPT',
rounds: 10
}
})
Your first code snippet writes Buffer values to the file system. This doesn't work the way you expect. For instance, try running the following example:
const val = {uid: 'test', passwordHash: Buffer.from('test')};
fs.writeFileSync('newArr.json', JSON.stringify(val));
The resulting file will contain the following text:
{"uid":"test","passwordHash":{"type":"Buffer","data":[116,101,115,116]}}
When you require() this file, the passwordHash gets assigned to the object { type: 'Buffer', data: [ 116, 101, 115, 116 ] }. That's not the Buffer type expected by the importUsers() API.
I believe your newArr variable contains the right kind of array that can be passed into importUsers(). But writing it to the file system, and then reloading it changes the type of all Buffer fields.
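If keeping the intermediate JSON file is still wanted, one option (a sketch, not part of the original answer) is to rebuild the Buffers when loading the file, since JSON.stringify serialized them as { type: 'Buffer', data: [...] }:
// Sketch: rebuild the Buffer values after loading the JSON written by the first snippet
var jsonFile = require('./newArr.json');

var userImportRecords = jsonFile.map(function (record) {
  return {
    uid: record.uid,
    email: record.email,
    passwordHash: Buffer.from(record.passwordHash.data) // { type: 'Buffer', data: [...] } -> Buffer
  };
});

admin.auth().importUsers(userImportRecords, {
  hash: {
    algorithm: 'BCRYPT',
    rounds: 10
  }
})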
I found a workaround to this problem. I'm parsing users.json directly inside the file that calls importUsers(). Since I don't have to store the Buffer in a JSON file again, the passwordHash stays a Buffer.
This is the right way to do it:
var jsonFile = require('./users.json'); // parse users.json directly, no intermediate file
let newArr = []
jsonFile.file.slice(0, 5).map( val => {
newArr.push({
uid: val.id,
email: val.email,
passwordHash: Buffer.from(val.password)
})
})
admin.auth().importUsers(newArr, {
hash: {
algorithm: 'BCRYPT',
rounds: 10
}
})
I have been playing around with an npm module called json-query. I was originally able to make the module work with JSON embedded in my JS.
I have spent about two days attempting to make it query JSON that is external and in a JSON file.
The original code that was functioning looked something like this.
var jsonQuery = require('json-query')
var data = {
people: [
{name: 'Matt', country: 'NZ'},
{name: 'Pete', country: 'AU'},
{name: 'Mikey', country: 'NZ'}
]
}
jsonQuery('people[country=NZ].name', {
data: data
}) //=> {value: 'Matt', parents: [...], key: 0} ... etc
I was able to query the internal JSON to find the key I was looking for.
I realized I need the ability to update the JSON while the code is live, so I moved the JSON to its own file.
Currently my main JS file looks like this.
var jsonQuery = require('json-query');
var fs = require('fs');
function querydb(netdomain){
fs.readFile('./querykeys.json', 'utf8', function (err, data) {
if (err){console.log('error');}
var obj = JSON.parse(data);
console.log(jsonQuery('servers[netshare=Dacie2015].netdomain', {
obj: obj
}));
});
}
querydb();
My JSON file that contains the json looks like this.
{
"servers": [
{"netdomain": "google.com", "netshare": "password", "authip":"216.58.203.46"},
{"netdomain": "localhost", "netshare": "localghost", "authip":"127.0.0.1"},
{"netdomain": "facebook.com", "netshare": "timeline", "authip":"31.13.69.228"}
]
}
The issue I have run into is that I am unable to query the JSON anymore. When the querydb() function is run, no matter what I put in the query, I get no response locating my key.
Currently, the response I get when I try to query the JSON file is:
{ value: null,
key: 'netdomain',
references: [],
parents: [ { key: 'servers', value: null }, { key: null, value: null } ] }
To be abundantly clear, I believe my issue is the way I pass my object in; I have played with the structure of the json-query call and have been unable to isolate a key.
Any help on this would be amazing. The module I am working with can be found on npm at https://www.npmjs.com/package/json-query
Thank you
I think this is just a typo. Shouldn't this:
obj: obj
be this?
data: obj
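For completeness, here is the original function with just that change applied (as a sketch; the netshare value is swapped to one that actually exists in the sample querykeys.json so the query returns a match):
var jsonQuery = require('json-query');
var fs = require('fs');

function querydb(netdomain) {
  fs.readFile('./querykeys.json', 'utf8', function (err, data) {
    if (err) { console.log('error'); return; }
    var obj = JSON.parse(data);
    // json-query expects the context object under the `data` key
    console.log(jsonQuery('servers[netshare=localghost].netdomain', {
      data: obj
    }));
  });
}

querydb();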