Javascript serialization and performance with V8 and PostgreSQL

I have been experimenting with PostgreSQL and PL/V8, which embeds the V8 JavaScript engine into PostgreSQL. Using this, I can query into JSON data inside the database, which is rather awesome.
The basic approach is as follows:
CREATE OR REPLACE FUNCTION
json_string(data json, key text) RETURNS TEXT AS $$
  var data = JSON.parse(data);
  return data[key];
$$ LANGUAGE plv8 IMMUTABLE STRICT;
SELECT id, data FROM things WHERE json_string(data,'name') LIKE 'Z%';
Using V8, I can parse the JSON data into JS, then return a field, and I can use this as a regular pg query expression.
BUT
On large datasets, performance can be an issue, as for every row I need to parse the data.
The parser is fast, but it is definitely the slowest part of the process and it has to happen every time.
What I am trying to work out (to finally get to an actual question) is if there is a way to cache or pre-process the JSON ... even storing a binary representation of the JSON in the table that could be used by V8 automatically as a JS object might be a win. I've had a look at using an alternative format such as messagepack or protobuf, but I don't think they will necessarily be as fast as the native JSON parser in any case.
THOUGHT
PG has blobs and binary types, so the data could be stored in binary, then we just need a way to marshall this into V8.

Postgres supports indexes on arbitrary function calls. The following index should do the trick:
CREATE INDEX json_idx ON things (json_string(data, 'name'));
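Note that a plain b-tree index only serves LIKE 'Z%' prefix searches when the database uses the C collation; otherwise a variant of the same index with the text_pattern_ops operator class is needed. A sketch, assuming the question's things.data column:
CREATE INDEX json_name_pattern_idx ON things (json_string(data, 'name') text_pattern_ops);
The function also has to be declared IMMUTABLE to be usable in an index, which the json_string definition above already is.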

The short version appears to be that with Pg's new json support, so far there's no way to store json directly in any form other than serialised json text. (This looks likely to change in 9.4)
You seem to want to store a pre-parsed form that's a serialised representation of how v8 represents the json in memory, and that's not currently supported. It's not even clear that v8 offers any kind of binary serialisation/deserialisation of json structures. If it doesn't do so natively, code would need to be added to Pg to produce such a representation and to turn it back into v8 json data structures.
It also wouldn't necessarily be faster:
If json was stored in a v8 specific binary form, queries that returned the normal json representation to clients would have to format it each time it was returned, incurring CPU cost.
A binary serialised version of json isn't the same thing as storing the v8 json data structures directly in memory. You can't write a data structure that involves any kind of graph of pointers out to disk directly, it has to be serialised. This serialisation and deserialisation has a cost, and it might not even be much faster than parsing the json text representation. It depends a lot on how v8 represents JavaScript objects in memory.
The binary serialised representation could easily be bigger, since most json is text and small numbers, where you don't gain any compactness from a binary representation. Since storage size directly affects the speed of table scans, value fetches from TOAST, decompression time required for TOASTed values, index sizes, etc, you could easily land up with slower queries and bigger tables.
I'd be interested to see whether an optimisation like what you describe is possible, and whether it'd turn out to be an optimisation at all.
To gain the benefits you want when doing table scans, I guess what you really need is a format that can be traversed without having to parse it and turn it into what's probably a malloc()'d graph of javascript objects. You want to be able to give a path expression for a field and grab it out directly from the serialised form where it's been read into a Pg read buffer or into shared_buffers. That'd be a really interesting design project, but I'd be surprised if anything like it existed in v8.
What you really need to do is research how the existing json-based object databases do fast searches for arbitrary json paths and what their on-disk representations are, then report back on pgsql-hackers. Maybe there's something to be learned from people who've already solved this - presuming, of course, that they have.
In the mean time, what I'd want to focus on is what the other answers here are doing: Working around the slow point and finding other ways to do what you need. You could also look into helping to optimise the json parser, but depending on whether the v8 one or some other one is in use that might already be far past the point of diminishing returns.
I guess this is one of the areas where there's a trade-off between speed and flexible data representation.

Perhaps, instead of making the retrieval phase responsible for parsing the data, creating a new data type which could pre-process the JSON data on input might be a better approach?
http://www.postgresql.org/docs/9.2/static/sql-createtype.html
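A minimal, untested sketch of that idea (the composite type, column, and function names here are made up for illustration): pull the fields you query most out of the JSON once at write time with a plv8 function, store them in a composite-typed column, and query that instead of re-parsing the JSON on every row.
CREATE TYPE thing_fields AS (name text);

CREATE OR REPLACE FUNCTION parse_thing(data json) RETURNS thing_fields AS $$
  // parse once at write time, keep only the fields worth pre-extracting
  var obj = JSON.parse(data);
  return { name: obj.name };
$$ LANGUAGE plv8 IMMUTABLE STRICT;

ALTER TABLE things ADD COLUMN parsed thing_fields;
UPDATE things SET parsed = parse_thing(data);
-- later queries read the pre-parsed column, no JSON.parse per row
SELECT id, data FROM things WHERE (parsed).name LIKE 'Z%';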

I don't have any experience with this, but it got me curious so I did some reading.
JSON only
What about something like the following (untested, BTW)? It doesn't address your question about storing a binary representation of the JSON; it's an attempt to parse all of the JSON at once for all of the rows you're checking, in the hope that reducing the per-row processing overhead will yield higher performance. If it succeeds at that, I'm thinking it may come at the cost of higher memory consumption, though.
The CREATE TYPE...set_of_records() stuff is adapted from the example on the wiki where it mentions that "You can also return records with an array of JSON." I guess it really means "an array of objects".
Is the id value from the DB record embedded in the JSON?
Version #1
CREATE TYPE rec AS (id integer, data text, name text);
CREATE FUNCTION set_of_records() RETURNS SETOF rec AS
$$
  var records = plv8.execute( "SELECT id, data FROM things" );
  var data = [];
  // Use a plain for loop instead if it performs better
  records.forEach( function ( rec, i, arr ) {
    data.push( rec.data );
  } );
  // parse all rows' JSON in one go
  data = "[" + data.join( "," ) + "]";
  data = JSON.parse( data );
  records.forEach( function ( rec, i, arr ) {
    rec.name = data[ i ].name;
  } );
  return records;
$$
LANGUAGE plv8;
SELECT id, data FROM set_of_records() WHERE name LIKE 'Z%'
Version #2
This one gets Postgres to aggregate / concatenate some values to cut down on the processing done in JS.
CREATE TYPE rec AS (id integer, data text, name text);
CREATE FUNCTION set_of_records() RETURNS SETOF rec AS
$$
  var cols = plv8.execute(
    "SELECT " +
    "array_agg( id ORDER BY id ) AS id, " +
    "string_agg( data, ',' ORDER BY id ) AS data " +
    "FROM things"
  )[0];
  cols.data = JSON.parse( "[" + cols.data + "]" );
  var records = cols.id;
  // Use a plain for loop instead if it performs better
  records.forEach( function ( id, i, arr ) {
    arr[ i ] = {
      id : id,
      data : cols.data[ i ],
      name : cols.data[ i ].name
    };
  } );
  return records;
$$
LANGUAGE plv8;
SELECT id, data FROM set_of_records() WHERE name LIKE 'Z%'
hstore
How would the performance of this compare: duplicate the JSON data into an hstore column at write time (or, if the performance somehow managed to be good enough, convert the JSON to hstore at select time) and use the hstore in your WHERE clause, e.g.:
SELECT id, data FROM things WHERE hstore_data -> 'name' LIKE 'Z%'
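If you do keep an hstore copy, an expression index on the key you filter by should help the lookup side as well; a minimal sketch, assuming a things.hstore_data column kept in sync at write time:
CREATE INDEX things_hstore_name_idx ON things ((hstore_data -> 'name'));
SELECT id, data FROM things WHERE hstore_data -> 'name' LIKE 'Z%';
(The same LIKE-prefix caveat about collation / text_pattern_ops applies here too.)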
I heard about hstore from here: http://lwn.net/Articles/497069/
The article mentions some other interesting things:
PL/v8 lets you...create expression indexes on specific JSON elements and save them, giving you stored search indexes much like CouchDB's "views".
It doesn't elaborate on that and I don't really know what it's referring to.
There's a comment attributed as "jberkus" that says:
We discussed having a binary JSON type as well, but without a protocol to transmit binary values (BSON isn't at all a standard, and has some serious glitches), there didn't seem to be any point.
If you're interested in working on binary JSON support for PostgreSQL, we'd be interested in having you help out ...

I don't know if it would be useful here, but I came across this: pg-to-json-serializer. It mentions functionality for:
parsing JSON strings and filling postgreSQL records/arrays from it
I don't know if it would offer any performance benefit over what you've been doing so far though, and I don't really even understand their examples.
Just thought it was worth mentioning.

Related

JSON.parse() on a large array of objects is using way more memory than it should

I generate a ~200'000-element array of objects (using object literal notation inside map rather than new Constructor()), and I'm saving a JSON.stringify'd version of it to disk, where it takes up 31 MB, including newlines and one-space-per-indentation level (JSON.stringify(arr, null, 1)).
Then, in a new node process, I read the entire file into a UTF-8 string and pass it to JSON.parse:
var fs = require('fs');
var arr1 = JSON.parse(fs.readFileSync('JMdict-all.json', {encoding : 'utf8'}));
Node memory usage is about 1.05 GB according to Mavericks' Activity Monitor! Even typing into a Terminal feels laggier on my ancient 4 GB RAM machine.
But if, in a new node process, I load the file's contents into a string, chop it up at element boundaries, and JSON.parse each element individually, ostensibly getting the same object array:
var fs = require('fs');
var arr2 = fs.readFileSync('JMdict-all.json', {encoding : 'utf8'}).trim().slice(1,-3).split('\n },').map(function(s) {return JSON.parse(s+'}');});
node is using just ~200 MB of memory, and there is no noticeable system lag. This pattern persists across many restarts of node: JSON.parse-ing the whole array takes a gigabyte of memory, while parsing it element-wise is much more memory-efficient.
Why is there such a huge disparity in memory usage? Is this a problem with JSON.parse preventing efficient hidden class generation in V8? How can I get good memory performance without slicing-and-dicing strings? Must I use a streaming JSON parse 😭?
For ease of experimentation, I've put the JSON file in question in a Gist, please feel free to clone it.
A few points to note:
You've found that, for whatever reason, it's much more efficient to do individual JSON.parse() calls on each element of your array instead of one big JSON.parse().
The data format you're generating is under your control. Unless I misunderstood, the data file as a whole does not have to be valid JSON, as long as you can parse it.
It sounds like the only issue with your second, more efficient method is the fragility of splitting the original generated JSON.
This suggests a simple solution: Instead of generating one giant JSON array, generate an individual JSON string for each element of your array - with no newlines in the JSON string, i.e. just use JSON.stringify(item) with no space argument. Then join those JSON strings with newline (or any character that you know will never appear in your data) and write that data file.
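On the write side that might look something like this (a sketch; arr is your original array of objects and the file name is just borrowed from your example):
var fs = require('fs');
// one JSON.stringify(item) per line, no indentation, joined with newlines
var lines = arr.map(function (item) {
  return JSON.stringify(item);
}).join('\n');
fs.writeFileSync('JMdict-all.json', lines, { encoding: 'utf8' });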
When you read this data, split the incoming data on the newline, then do the JSON.parse() on each of those lines individually. In other words, this step is just like your second solution, but with a straightforward string split instead of having to fiddle with the character counts and curly braces.
Your code might look something like this (really just a simplified version of what you posted):
var fs = require('fs');
var arr2 = fs.readFileSync(
'JMdict-all.json',
{ encoding: 'utf8' }
).trim().split('\n').map( function( line ) {
return JSON.parse( line );
});
As you noted in an edit, you could simplify this code to:
var fs = require('fs');
var arr2 = fs.readFileSync(
'JMdict-all.json',
{ encoding: 'utf8' }
).trim().split('\n').map( JSON.parse );
But I would be careful about this. It does work in this particular case, but there is a potential danger in the more general case.
The JSON.parse function takes two arguments: the JSON text and an optional "reviver" function.
The [].map() function passes three arguments to the function it calls: the item value, array index, and the entire array.
So if you pass JSON.parse directly, it is being called with JSON text as the first argument (as expected), but it is also being passed a number for the "reviver" function. JSON.parse() ignores that second argument because it is not a function reference, so you're OK here. But you can probably imagine other cases where you could get into trouble - so it's always a good idea to triple-check this when you pass an arbitrary function that you didn't write into [].map().
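The classic illustration of that trap is parseInt, which happily treats the array index that [].map() passes as a radix:
['10', '10', '10'].map(parseInt);         // [10, NaN, 2] - the index becomes the radix
['10', '10', '10'].map(function (s) {     // safe: only the string is passed through
  return parseInt(s, 10);
});                                       // [10, 10, 10]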
I think a comment hinted at the answer to this question, but I'll expand on it a little. The 1 GB of memory being used presumably includes a large number of allocations of data that is actually 'dead' (in that it has become unreachable and is therefore not really being used by the program any more) but has not yet been collected by the Garbage Collector.
Almost any algorithm processing a large data set is likely to produce a very large amount of detritus in this manner, when the programming language/technology used is a typical modern one (e.g. Java/JVM, c#/.NET, JavaScript). Eventually the GC removes it.
It is interesting to note that techniques can be used to dramatically reduce the amount of ephemeral memory allocation that certain algorithms incur (by having pointers into the middles of strings), but I think these techniques are hard or impossible to employ in JavaScript.

Storing temporary client side data

I need to store client side data temporarily. The data will be trashed on refresh or redirect. What is the best way to store the data?
using javascript by putting the data inside a variable
var data = {
  a: "longstring",
  b: "longstring",
  c: "longstring"
};
or
putting the data inside html elements (as data-attribute inside div tags)
<ul>
<li data-duri="longstring"></li>
<li data-duri="longstring"></li>
<li data-duri="longstring"></li>
</ul>
The amount of data to store temporarily could get large, because the data I need to store are image dataURIs, and a user that does not refresh for the whole day could stack up maybe 500+ images of 50 KB to 3 MB each. (I am unsure if that much data could crash the app because of too much memory consumption; please correct me if I am wrong.)
What do you guys suggest is the most efficient way to keep the data?
I'd recommend storing the data in JavaScript and only updating the DOM when you actually want to display an image, assuming all the images are not displayed at the same time. Also note the browser will keep its own copy of an image in memory while it is in the DOM.
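A minimal sketch of that approach (the names here are made up for illustration):
// keep the data URIs in plain JavaScript, keyed however suits the app
var imageCache = {};

function storeImage(key, dataUri) {
  imageCache[key] = dataUri;
}

function showImage(key, imgElement) {
  // touch the DOM only when the image actually needs to be displayed
  imgElement.src = imageCache[key];
}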
Update: As comments have been added to the OP I believe you need to go back to customer requirements and design - caching 500 x 3MB images is unworkable - consider thumbnails etc? This answer only focuses on optimal client side caching if you really need to go that way...
Data URI efficiency
Data URIs use base64 which adds an overhead of around 33% representing binary data in ASCII.
Although base64 is required to update the DOM, the overhead can be avoided by storing the data as binary strings and decoding/encoding with atob() and btoa() - as long as you drop references to the original data, allowing it to be garbage collected.
var dataAsBase64 = "iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==";
var dataAsBinary = atob(dataAsBase64);
console.log(dataAsBinary.length + " vs " + dataAsBase64.length);
// use it later
$('.foo').attr("src", "data:image/png;base64," + btoa(dataAsBinary));
String memory efficiency
How much RAM does each character in ECMAScript/JavaScript string consume? suggests they take 2 bytes per character, although this could still be browser dependent.
This could be avoided by using ArrayBuffer for 1-to-1 byte storage.
var arrayBuffer = new Uint8Array(dataAsBinary.length);
for (var i = 0; i < dataAsBinary.length; i++) {
  arrayBuffer[i] = dataAsBinary.charCodeAt(i);
}
// allow garbage collection
dataAsBase64 = undefined;
// use it later
dataAsBase64 = btoa(String.fromCharCode.apply(null, arrayBuffer));
$('.foo').attr("src", "data:image/png;base64," + dataAsBase64);
Disclaimer: Note all of this adds a lot of complexity, and I'd only recommend such optimisation if you actually find a performance problem.
Alternative storage
Instead of using browser memory
local storage - limited, typically 10MB, and certainly won't allow 500 x 3MB without specific browser configuration.
Filesystem API - not yet widely supported, but an ideal solution - can create temp files to offload to disk.
If you really want to lose the data on a refresh, just use a JavaScript hash/object var storage = {} and you have a key->value store. If you would like to keep the data for the duration of the user's visit to the page (until they close the browser window), you could use sessionStorage; to persist the data indefinitely (or until the user deletes it), use localStorage or WebSQL.
Putting data into the DOM (as a data-attribute, hidden fields, etc.) is not a good idea, as the process of JavaScript going into the DOM and pulling that information out is very expensive (crossing the border between the JavaScript world and the DOM world (the website structure) doesn't come cheap).
Using a JavaScript variable is the best way to store your temp data. You may consider storing your data inside a DOM attribute only if the data is related to a specific DOM element.
As for performance, storing your data directly in a JavaScript variable will probably be faster, since storing data in a DOM element also involves JavaScript in addition to the DOM modifications. If the data isn't related to an existing DOM element, you'll also have to create a new element to store that value and make sure it isn't visible to the user.
The OP mentions a requirement for the data to be forcibly transient i.e. (if possible) unable to be saved locally on the client - at least that is how I read it.
If this type of data privacy is a firm requirement for an application, there are multiple considerations when dealing with a browser environment. I am unsure whether the images in question are to be displayed as images to the user, or where the source data of the images is coming from in relation to the client. If the data is coming into the browser over the network, you might do well (or better than the alternative, at least) to use a socket or other raw data connection rather than HTTP requests, and consider something like a "sentinel" value in the stream of bytes to indicate the boundaries of image data.
Once you have the bytes coming in, you could, I believe, (or soon will be able to) pass the data via a generator function into a typedArray via the iterator protocol, see: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8Array
// From an iterable
var iterable = function*(){ yield* [1,2,3]; }();
var uint8 = new Uint8Array(iterable);
// Uint8Array[1, 2, 3]
And then perhaps integrate those arrays as private members of some class you use to manage their lifecycle? see:
https://www.nczonline.net/blog/2014/01/21/private-instance-members-with-weakmaps-in-javascript/
var Person = (function() {
  var privateData = {},
      privateId = 0;

  function Person(name) {
    Object.defineProperty(this, "_id", { value: privateId++ });
    privateData[this._id] = {
      name: name
    };
  }

  Person.prototype.getName = function() {
    return privateData[this._id].name;
  };

  return Person;
}());
I think you should be able to manage the size / wait problem to some extent with the generator method of creating the byte arrays as well, perhaps check for sane lengths, time passed on this iterator, etc.
A general set of ideas more than an answer, and none of which are my own authorship, but this seems to be appropriate to the question.
Why not use @Html.Hidden?
@Html.Hidden("hId", ViewData["name"], new { @id = "hId" })
There are various ways to do this, depending upon your requirement:
1) We can make use of constant variables: create a file Constants.js and use it to store data as "KEY_NAME": "someval",
eg:
var data = {
  a: "longstring",
  b: "longstring",
  c: "longstring"
};
CLIENT_DATA = data;
Careful: This data will be lost if you refresh the screen, as all variable memory is released.
2) Make use of the cookie store, using:
document.cookie = "key=value";
For reference: http://www.w3schools.com/js/tryit.asp?filename=tryjs_cookie_username
Careful: Cookie data has an expiry period and also a storage capacity limit https://stackoverflow.com/a/2096803/1904479.
Use: Consistent long-term storage, but not recommended for storing huge data.
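For example, a cookie with an explicit expiry date (hypothetical key and date):
document.cookie = "username=John; expires=Fri, 31 Dec 2025 23:59:59 GMT; path=/";
// document.cookie returns every cookie as one "k=v; k=v" string, so reading means parsing it yourself
console.log(document.cookie);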
3) Using Local Storage:
localStorage.setItem("key", "value");
localStorage.getItem("key");
Caution: This stores values as key/value pairs of strings, so you will not be able to store JSON arrays or objects without JSON.stringify()-ing them.
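For example (a minimal sketch with a made-up key):
var items = ["a", "b", "c"];
localStorage.setItem("items", JSON.stringify(items));      // arrays/objects must be stringified
var itemsBack = JSON.parse(localStorage.getItem("items")); // and parsed on the way out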
Reference: http://www.w3schools.com/html/tryit.asp?filename=tryhtml5_webstorage_local
4) Another option is to write the data to a file
Reference: Writing a json object to a text file in javascript

How to convert a JSON string to a JSON string with a different structure

I am building an application where data is retrieved from a third party system as a JSON string. I need to convert this JSON string to another JSON string with a different structure such that it can be used with pre-existing functions defined in a internal Javascript library.
Ideally I want to be able to perform this conversion on the client machine using Javascript.
I have looked at JSONT as a means of achieving this but that project does not appear to be actively maintained:
http://goessner.net/articles/jsont/
Is there a de facto way of achieving this? Or do I have to roll my own mapping code?
You shouldn't be passing JSON into an internal JavaScript library. You should parse the JSON into a JS object, then iterate over it, transforming it into the new format.
Example
var json = '[{"a": 1, "b": 2}, {"a": 4, "b": 5}]';
var jsObj = JSON.parse(json);
// Transform property a into aa and property b into bb
var transformed = jsObj.map(function(obj){
  return {
    aa: obj.a,
    bb: obj.b
  };
});
// transformed = [{aa:1, bb:2},{aa:4, bb:5}]
If you really want JSON you'd just call JSON.stringify(transformed)
https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/Array/map
Here's another answer with an even more complicated transformation How to make a jquery Datatable array out of standard json?
From what I can tell from the home page, the JSONT project is about transforming JSON into entirely different formats anyway (i.e. JSON => HTML).
It's going to be a lot simpler to write your own mapping code, possibly just as a from_json() method on the object you're creating (so YourSpecialObject.from_json(input); returns an instance of that object generated from the JSON data).
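A minimal sketch of that shape (YourSpecialObject and the property names are placeholders):
function YourSpecialObject(aa, bb) {
  this.aa = aa;
  this.bb = bb;
}

YourSpecialObject.from_json = function (input) {
  var obj = JSON.parse(input);
  // map the third-party structure onto your own fields here
  return new YourSpecialObject(obj.a, obj.b);
};

var instance = YourSpecialObject.from_json('{"a": 1, "b": 2}');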
From your question, I'm not sure if this fits your use case, but hopefully someone else will have a better answer soon.
Another option is using XSLT. As there are SAX readers and writers for JSON, you can happily use XSLT with JSON. There's no horrific JSON-to-XML-and-back conversion that needs to go on. See: http://www.gerixsoft.com/blog/json/xslt4json
I can definitely see the irony in using an XML-based language to transform JSON, but it seems like a good option.
Otherwise you're probably best of writing your own mapping code.

How should I save the response data in an object (JSON) for better manipulation and performance?

What I am doing is:
1. Get values from the ajax response (which is in JSON format) for listing rows of data:
response = {"categories":[{"name":"General","id":"6305","pop":"show when clicked"},{"name":"Navigation","id":"6043","pop":"show when clicked"},{"name":"New","id":"6051","pop":"show when clicked"},{"name":"Time","id":"6117","pop":"show when clicked"},{"name":"Reesh","id":"6207","pop":"show when clicked"}]}
2. I will parse the JSON and store it in an object like this,
ex:
object= {6305:{"name":"General","id":"6305","pop":"show when clicked"},
6043:{"name":"Navigation","id":"6043","pop":"show when clicked"},
6051:{"name":"New","id":"6051","pop":"show when clicked"},
6117:{"name":"Time","id":"6117","pop":"show when clicked"},
6207:{"name":"Reesh","id":"6207","pop":"show when clicked"}};
Why I am doing this: because I can get the data using the id,
ex: object[6305] will give me the data.
3. So that I can retrieve the data and also make changes to values in the object using the id when changes occur in the db,
ex: object[6305].pop = "changed";
Please tell me:
--> whether this is the correct method, or whether I can do it in a simpler or more efficient way?
--> whether I can store the JSON response as it is and parse data from it directly? If so, please explain with an example.
Yes, of course you would not need to build the object:
function getObject(id) {
  for (var i = 0; i < response.categories.length; i++)
    if (response.categories[i].id == id)
      return response.categories[i];
  return null;
}
However, if you often need to access objects by their ids this function would be slow. Creating the lookup table as you did will not create much memory overhead, but make retrieving data much faster.
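For completeness, building that lookup table is a small reduce over the response (same shape as the object in the question):
var byId = response.categories.reduce(function (acc, cat) {
  acc[cat.id] = cat;
  return acc;
}, {});

byId["6305"].pop = "changed"; // O(1) access by id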
BTW: Your title question "save data as object or json" is confusing. Serializing manipulated objects back to JSON makes no sense, as you always will use the parsed objects. Of course, if you just needed to manipulate a JSON string, and knew exactly what to do, (simple) string manipulation could be faster than parsing, manipulating and stringifying.

Best practice for writing ARRAYS

I've got an array with about 250 entries in it, each of which is its own array of values. Each entry is a point on a map, and each array holds info for:
name,
another array for points this point can connect to,
latitude,
longitude,
short form of name,
a boolean,
and another boolean
The array has been written by another developer in my team, and he has written it as such:
names[0]=new Array;
names[0][0]="Campus Ice Centre";
names[0][1]= new Array(0,1,2);
names[0][2]=43.95081811364498;
names[0][3]=-78.89848709106445;
names[0][4]="CIC";
names[0][5]=false;
names[0][6]=false;
names[1]=new Array;
names[1][0]="Shagwell's";
names[1][1]= new Array(0,1);
names[1][2]=43.95090307839151;
names[1][3]=-78.89815986156464;
names[1][4]="shg";
names[1][5]=false;
names[1][6]=false;
Where I would probably have personally written it like this:
var names = [];
names[0] = new Array("Campus Ice Centre", new Array(0,1,2), 43.95081811364498, -78.89848709106445, "CIC", false, false);
names[1] = new Array("Shagwell's", new Array(0,1), 43.95090307839151, -78.89815986156464, "shg", false, false);
They both work perfectly fine of course, but what I'm wondering is:
1) does one take longer than the other to actually process?
2) am I incorrect in assuming there is a benefit to the compactness of my version of the same thing?
I'm just a little worried about his 3000 lines of code versus my 3-400 to get the same result.
Thanks in advance for any guidance.
What you really want to do here is define a custom data type which represents your data more accurately. I'm not sure what language you are using, so here is some pseudocode:
class Location
{
  double latitude;
  double longitude;
  String Name;
  String Abbreviation;
  bool flag1; // you should use a better name
  bool flag2;
}
Then you can just create an array to hold all the Location objects and it would be much more readable and maintainable.
Locations = new Array;
Locations[0] = new Location("Shagwell's",...);
....
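In JavaScript that pseudocode might translate to something like this (field names borrowed from the question, flag names made up):
function Location(name, connections, latitude, longitude, shortName, flag1, flag2) {
  this.name = name;
  this.connections = connections; // indexes of the points this point can connect to
  this.latitude = latitude;
  this.longitude = longitude;
  this.shortName = shortName;
  this.flag1 = flag1;             // rename to whatever the boolean actually means
  this.flag2 = flag2;
}

var locations = [];
locations.push(new Location("Campus Ice Centre", [0, 1, 2],
  43.95081811364498, -78.89848709106445, "CIC", false, false));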
===EDIT===
Because you said you are using JavaScript, the best practice would probably be to store your data in a JSON text file. This has the benefit of removing the data from the code file and giving you a very easily editable data source if you want to make changes.
Your JSON file would look like this:
[{"lat":"23.2323", "long":"-72.3", "name":"Shagwell's" ...},
{"lat":"26.2323", "long":"-77.3", "name":"loc2" ...},
...]
You could then store the JSON text in an accessible place on your webserver, say "data.json"; then, if you are using jQuery, you can load it in by doing something like this:
$.getJSON("data.json", function(data) { //do something with the data});
With structured data, like your example, both you and your co-worker are relatively "wrong". From the looks of things, you should have implemented an array of structures, assuming of course that the data you are presenting is truly unordered, which I would be willing to guess it probably isn't. Arrays are used too often, because they are amongst the first data structures we learn, but very often aren't the best choice.
As to performance, that more often comes down to the data access code than the data type itself. Frankly too, unless you are dealing with gigantic datasets or literally real time applications, performance should be a non issue.
As to the two examples you have posted, after the compiler is done with them, they will be virtually identical.
I personally find the former much more readable. From a performance perspective, the difference is probably minimal.
Leaving the other answers aside (although I agree with the others that you need structs here), your co-worker's way seems better to me. Like Serapth says, the compiler will optimize away the differences, and the original code has better readability.
