I'm in the throes of re-building something that I built almost a year ago (don't ask where the old version went - it's embarrassing).
The core functionality uses an $.getJSON (ajax-ish) call in javascript that runs a PHP script that runs a PostgreSQL query that builds a JSON object and returns it. (Pause for breath).
The issue is what PostgreSQL spits out when it's its turn to shine.
I'm aware of the json_build_object() and json_build_array() functionality in PostgreSQL 9.4+, but one of the DBs on which this has to run hasn't been upgraded from 9.2 and I don't have time to do so in the next month or so.
For now I am using row_to_json() (and ST_AsGeoJSON() on the geometry) to build my GeoJSON collection, which gets flung back at the client via a callback.
Taking my cue from this very nice post (and staying within very small epsilon of that post's query structure), I run the following query:
select row_to_json(fc)
from (SELECT 'FeatureCollection' As type,
array_to_json(array_agg(f)) As features
from (SELECT 'Feature' as type,
row_to_json((select l from (select $vars) as l)) as properties,
ST_AsGeoJSON(ST_Transform(lg.g1,4326)) as geometry
from $source_table as lg
where g1 && ST_Transform(ST_SetSRID(ST_MakeEnvelope($bounds),4326),4283)
) as f ) as fc;
($vars, $source_table and $bounds are supplied by PHP from POST variables).
When I fetchAll(PDO::FETCH_ASSOC) that query into $result and json_encode($result[0]["row_to_json"]), what comes back to JavaScript can be JSON.parse()'d to give the expected result: an Object containing a FeatureCollection, which in turn contains a bunch of Features, each with a geometry.
So far, so good. And quick - gets the data and is back in a second or so.
The problem is that at the query stage, the array of stuff that relates to the geometry is double-quoted: the relevant segment of the JSON for an individual Feature looks like
{"type":"Feature","geometry":"{\\"type\\":\\"Polygon\\",
\\"coordinates\\":"[[[146.885447408,-36.143199088],
[146.884964384,-36.143136232],
... etc
]]"
}",
"properties":{"address_pfi":"126546461",
"address":"blah blah",
...etc }
}
This is what I get if I COPY the PostgreSQL query result to file: it's before any mishandling of the output.
Note the (double-escaped) double-quotes that only affect attributes (in the non-JSON sense) of the geometry {type, coordinates}: the "geometry" bit looks like
"geometry":"{stuff}"
instead of
"geometry":{stuff}
If the JSON produced by PostgreSQL is put through the parser/checker at GeoJSONLint, it dies in a screaming heap (which it should - it's absolutely not 'spec') - and of course it's never going to render: it spits out 'invalid type' as you might expect.
For the moment I've sorted it out by a kludge (my normal M.O.) - when $.getJSON returns the object, I
turn it into a string, then
.replace(/"{/g, '{') and .replace(/}"/g, '}') and .replace(/\\/g, ''), and then
turn it back into an object and proceed with shenanigans.
This is not good practice (to say the least): it would be far better if the query itself could be encouraged to return valid GeoJSON.
It seems clear that the problem is at the row_to_json() stage: it sees the attribute-set for "geometry" and treats it differently from the attribute-set for "properties" - it (incorrectly) wraps the "geometry" value in quotes (after backslash-escaping all the double-quotes inside it) but (correctly) leaves the "properties" value alone.
So after this book-length prelude... the question.
Is there some nuance about the query that I'm missing or ignoring? I've RTFD for the relevant PostgreSQL commands, and apart from prettification switches there is nothing that I'm aware of.
And of course, if there is a parsimonious way of doing the whole round-trip I would embrace it: the only caveat is that it must retain its 'live-fetch' nature - the $.getJSON runs under a listener that triggers on "idle" in a Google Map, and the source table, variables of interest and zoom (which determines $bounds) are user-determined.
(Think of it as a way to have a map layer that updates with pan and zoom by only fetching ~200-300 simple-ish (cadastre) features at a time - far better that than generating a tile pyramid for an entire state for zooms 10-19. I bet someone has already done such a thing on bl.ocks, but I haven't found it).
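For what it's worth, the listener wiring looks roughly like the following - the endpoint, variable and callback names here are placeholders rather than my actual code:
google.maps.event.addListener(map, 'idle', function () {
    var b = map.getBounds();
    $.getJSON('fetch_features.php', {                       // placeholder endpoint
        source_table: currentTable,                         // user-chosen table
        vars: currentVars,                                   // columns of interest
        bounds: [b.getSouthWest().lng(), b.getSouthWest().lat(),
                 b.getNorthEast().lng(), b.getNorthEast().lat()].join(',')
                                                             // xmin,ymin,xmax,ymax for ST_MakeEnvelope
    }, function (fc) {
        refreshLayer(fc);                                    // placeholder rendering callback
    });
});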
It seems that you are missing the cast to json.
It should be
ST_AsGeoJSON(ST_Transform(lg.g1,4326))::json
Without the cast, ST_AsGeoJSON() returns plain text, so row_to_json() treats it as a string and escapes it - hence the double-encoding.
However, you could also fetch the attributes and the GeoJSON separately, then json_decode the JSON with PHP, create a GeoJSON FeatureCollection array in PHP, and finally json_encode the whole result.
I would like to come straight to the point and show you my sample data, which is around 180,000 lines from a .csv file, so a lot of lines. I am reading in the .csv with papaparse. Then I am saving the data as an array of objects, which looks like this:
I used this picture so you can also see all the properties my objects have, or should have. The data is from Media Transparency Data, which is open source and shows the payments between institutions.
The array of objects is saved using localforage, which is basically IndexedDB or WebSQL behind a localStorage-like API. So the data is never saved on a server! Only in the client!
The Question:
So my question is: the user can add the sourceHash and/or targetHash attributes in a client interface. For example, assume the user loaded the "Energie Steiermark Kunden GmbH" object and now adds the sourceHash "company" to it - basically a tag. This is already reflected and shown in the client, but I also need to get it into localforage, and therefore rewrite the initial array of objects. So I would need to search my huge 180,000-line array for every object that has the name "Energie Steiermark Kunden GmbH" (there can be multiple), set the property sourceHash to "company" on each, and then save it all again in localforage.
The first question would be how to do this most efficiently. I can get the data out of localforage using the following methods, and set it correspondingly.
Get:
localforage.getItem('data').then((value) => {
...
});
Set:
localforage.setItem('data', dataObject);
However, the question is: how do I do this most efficiently? I mean, if the sourceNode starts with "E", for example, we shouldn't need to search all sourceNodes. The same goes, of course, for the targetNode.
Thank you in advance!
UPDATE:
Thanks for the answers already! And how would you do it the most efficient way in JavaScript? I mean, is it possible to do it in a few lines? Assume, for example, that I have the current sourceHash "company" and want to assign it to every node starting with "Energie Steiermark Kunden GmbH" that appears across all timeNodes. These could be 20151, 20152, 20153, 20154 and so on...
Localforage is only a localStorage/sessionStorage-like wrapper over the actual storage engine, and so it only offers you the key-value capabilities of localStorage. In short, there's no more efficient way to do this for arbitrary queries.
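With localforage alone, about the best you can do is a single pass over the whole array followed by a full rewrite - roughly like this sketch, which assumes (from your screenshot) that the matching field is sourceNode:
localforage.getItem('data').then((nodes) => {
    nodes.forEach((node) => {
        if (node.sourceNode === 'Energie Steiermark Kunden GmbH') {
            node.sourceHash = 'company';           // tag every matching object
        }
    });
    return localforage.setItem('data', nodes);     // write the whole array back
});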
This sounds more like a case for IndexedDB, as you can define search indexes over the data, for instance for sourceNodes, and do more efficient queries that way.
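A very rough sketch of what that could look like with raw IndexedDB - the database, store and index names are made up, and error handling is omitted:
const request = indexedDB.open('payments', 1);
request.onupgradeneeded = (event) => {
    const db = event.target.result;
    const store = db.createObjectStore('rows', { autoIncrement: true });
    store.createIndex('bySourceNode', 'sourceNode');        // secondary index on sourceNode
};
request.onsuccess = (event) => {
    const db = event.target.result;
    const tx = db.transaction('rows', 'readwrite');
    const index = tx.objectStore('rows').index('bySourceNode');
    // Walk only the matching records instead of scanning all ~180,000 rows
    index.openCursor(IDBKeyRange.only('Energie Steiermark Kunden GmbH')).onsuccess = (e) => {
        const cursor = e.target.result;
        if (cursor) {
            const row = cursor.value;
            row.sourceHash = 'company';                     // apply the user-supplied tag
            cursor.update(row);                             // persist the change in place
            cursor.continue();
        }
    };
};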
It is my understanding that -- from a performance perspective -- direct assignment is more desirable than .push() when populating an array.
My code is currently as follows:
for each (var e in Collection) {
    do {
        DB_Query().forEach(function(e) { data.push([e.title, e.id]); });
    } while (pageToken);
}
The DB_Query() method runs a Google Drive query and returns a list.
My issue arises because DB_Query() can return a list of variable length. As such, if I construct data = new Array(100), direct assignment has the potential to go out of bounds.
Is there a method by which I could try and catch an Out of Bounds exception to have values directly assigned for the 100 pre-allocated indices, but use .push() for any overflow? The expectation here is that an OOB exception will not occur often.
Also, I'm not sure if it matters, but I am clearing the array after a counter variable is >=100 using the following method:
while(data.length > 0) {data.pop()}
In Javascript, if you set a value at an index bigger than the array length, it'll automatically "stretch" the array. So there's no need to bother with this. If you can make a good guess about your array size, go for it.
About your clearing loop: that's correct, and it seems that pop is indeed the fastest way.
My original suggestion was to set the array length back to zero: data.length = 0;
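A quick illustration of both points (none of this throws):
var data = new Array(100);
data[150] = ['Some title', 'some-id'];   // no out-of-bounds error; length just becomes 151

data.length = 0;                         // empties the array in one statement
// or: while (data.length > 0) { data.pop(); }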
Now a tip that I think really makes a performance difference here: you're worrying about the wrong part!
In Apps Script, what takes long is not resizing arrays dynamically or working over your data - that part is fast. The issue is always the "API calls", that is, using UrlFetch or Spreadsheet.Range.getValue and so on.
You should take care to make the minimum number of API calls possible, and in your case (I'm guessing now, since I haven't seen your whole code) you seem to be doing it wrong. If DB_Query is costly (in API-call terms) you should not have it nested inside two loops. The best solution usually involves figuring out everything you'll need beforehand (do as many loops as you need, as long as they don't call the API anywhere), then passing all the parameters to one bulk operation and gathering everything at once, in a single API call, even if that means fetching more data than you need. Then, with the whole data set at hand, loop through it and transform it as required (that's the fast part) - something like the sketch below.
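A hedged sketch of that gather-once, transform-locally pattern; DB_QueryAll is hypothetical shorthand for a single call that takes all the criteria and returns every matching record in one round-trip, and it assumes Collection is an ordinary array:
var criteria = [];
for (var i = 0; i < Collection.length; i++) {   // plain JS loop: cheap, no API calls
    criteria.push(Collection[i]);
}

var rows = DB_QueryAll(criteria);               // the one (expensive) bulk call

var data = [];
for (var j = 0; j < rows.length; j++) {         // transforming in memory: fast
    data.push([rows[j].title, rows[j].id]);
}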
I have been experimenting with PostgreSQL and PL/V8, which embeds the V8 JavaScript engine into PostgreSQL. Using this, I can query into JSON data inside the database, which is rather awesome.
The basic approach is as follows:
CREATE or REPLACE FUNCTION
json_string(data json, key text) RETURNS TEXT AS $$
var data = JSON.parse(data);
return data[key];
$$ LANGUAGE plv8 IMMUTABLE STRICT;
SELECT id, data FROM things WHERE json_string(data,'name') LIKE 'Z%';
Using V8, I can parse the JSON data into JS, return a field, and use this as a regular pg query expression.
BUT
On large datasets, performance can be an issue, as for every row I need to parse the data.
The parser is fast, but it is definitely the slowest part of the process and it has to happen every time.
What I am trying to work out (to finally get to an actual question) is if there is a way to cache or pre-process the JSON ... even storing a binary representation of the JSON in the table that could be used by V8 automatically as a JS object might be a win. I've had a look at using an alternative format such as messagepack or protobuf, but I don't think they will necessarily be as fast as the native JSON parser in any case.
THOUGHT
PG has blobs and binary types, so the data could be stored in binary, then we just need a way to marshall this into V8.
Postgres supports indexes on arbitrary function calls (the function just needs to be IMMUTABLE, which yours is). The following index should do the trick:
CREATE INDEX json_idx ON things (json_string(data,'name'));
The short version appears to be that with Pg's new json support, so far there's no way to store json directly in any form other than serialised json text. (This looks likely to change in 9.4)
You seem to want to store a pre-parsed form that's a serialised representation of how v8 represents the json in memory, and that's not currently supported. It's not even clear that v8 offers any kind of binary serialisation/deserialisation of json structures. If it doesn't do so natively, code would need to be added to Pg to produce such a representation and to turn it back into v8 json data structures.
It also wouldn't necessarily be faster:
If json was stored in a v8 specific binary form, queries that returned the normal json representation to clients would have to format it each time it was returned, incurring CPU cost.
A binary serialised version of json isn't the same thing as storing the v8 json data structures directly in memory. You can't write a data structure that involves any kind of graph of pointers out to disk directly, it has to be serialised. This serialisation and deserialisation has a cost, and it might not even be much faster than parsing the json text representation. It depends a lot on how v8 represents JavaScript objects in memory.
The binary serialised representation could easily be bigger, since most json is text and small numbers, where you don't gain any compactness from a binary representation. Since storage size directly affects the speed of table scans, value fetches from TOAST, decompression time required for TOASTed values, index sizes, etc, you could easily land up with slower queries and bigger tables.
I'd be interested to see whether an optimisation like what you describe is possible, and whether it'd turn out to be an optimisation at all.
To gain the benefits you want when doing table scans, I guess what you really need is a format that can be traversed without having to parse it and turn it into what's probably a malloc()'d graph of javascript objects. You want to be able to give a path expression for a field and grab it out directly from the serialised form where it's been read into a Pg read buffer or into shared_buffers. That'd be a really interesting design project, but I'd be surprised if anything like it existed in v8.
What you really need to do is research how the existing json-based object databases do fast searches for arbitrary json paths and what their on-disk representations are, then report back on pgsql-hackers. Maybe there's something to be learned from people who've already solved this - presuming, of course, that they have.
In the mean time, what I'd want to focus on is what the other answers here are doing: Working around the slow point and finding other ways to do what you need. You could also look into helping to optimise the json parser, but depending on whether the v8 one or some other one is in use that might already be far past the point of diminishing returns.
I guess this is one of the areas where there's a trade-off between speed and flexible data representation.
Perhaps instead of making the retrieval phase responsible for parsing the data, creating a new data type which could pre-process the JSON data on input might be a better approach?
http://www.postgresql.org/docs/9.2/static/sql-createtype.html
I don't have any experience with this, but it got me curious so I did some reading.
JSON only
What about something like the following (untested, BTW)? It doesn't address your question about storing a binary representation of the JSON; rather, it's an attempt to parse all of the JSON at once, for all of the rows you're checking, in the hope that this yields higher performance by reducing the per-row processing overhead. If it succeeds at that, I'm thinking it may come at the cost of higher memory consumption, though.
The CREATE TYPE...set_of_records() stuff is adapted from the example on the wiki where it mentions that "You can also return records with an array of JSON." I guess it really means "an array of objects".
Is the id value from the DB record embedded in the JSON?
Version #1
CREATE TYPE rec AS (id integer, data text, name text);
CREATE FUNCTION set_of_records() RETURNS SETOF rec AS
$$
var records = plv8.execute( "SELECT id, data FROM things" );
var data = [];
// Use for loop instead if better performance
records.forEach( function ( rec, i, arr ) {
data.push( rec.data );
} );
data = "[" + data.join( "," ) + "]";
data = JSON.parse( data );
records.forEach( function ( rec, i, arr ) {
rec.name = data[ i ].name;
} );
return records;
$$
LANGUAGE plv8;
SELECT id, data FROM set_of_records() WHERE name LIKE 'Z%'
Version #2
This one gets Postgres to aggregate / concatenate some values to cut down on the processing done in JS.
CREATE TYPE rec AS (id integer, data text, name text);
CREATE FUNCTION set_of_records() RETURNS SETOF rec AS
$$
var cols = plv8.execute(
"SELECT" +
"array_agg( id ORDER BY id ) AS id," +
"string_agg( data, ',' ORDER BY id ) AS data" +
"FROM things"
)[0];
cols.data = JSON.parse( "[" + cols.data + "]" );
var records = cols.id;
// Use for loop if better performance
records.forEach( function ( id, i, arr ) {
arr[ i ] = {
id : id,
data : cols.data[ i ],
name : cols.data[ i ].name
};
} );
return records;
$$
LANGUAGE plv8;
SELECT id, data FROM set_of_records() WHERE name LIKE 'Z%'
hstore
How would the performance of this compare: duplicate the JSON data into an hstore column at write time (or, if the performance somehow managed to be good enough, convert the JSON to hstore at select time) and use the hstore in your WHERE, e.g.:
SELECT id, data FROM things WHERE hstore_data -> 'name' LIKE 'Z%'
I heard about hstore from here: http://lwn.net/Articles/497069/
The article mentions some other interesting things:
PL/v8 lets you...create expression indexes on specific JSON elements and save them, giving you stored search indexes much like CouchDB's "views".
It doesn't elaborate on that and I don't really know what it's referring to.
There's a comment attributed as "jberkus" that says:
We discussed having a binary JSON type as well, but without a protocol to transmit binary values (BSON isn't at all a standard, and has some serious glitches), there didn't seem to be any point.
If you're interested in working on binary JSON support for PostgreSQL, we'd be interested in having you help out ...
I don't know if it would be useful here, but I came across this: pg-to-json-serializer. It mentions functionality for:
parsing JSON strings and filling postgreSQL records/arrays from it
I don't know if it would offer any performance benefit over what you've been doing so far though, and I don't really even understand their examples.
Just thought it was worth mentioning.
I have a map system (grid) for my website. I have defined 40,000 'fields' on a grid. Each field has an XY value (x (1-200) and y (1-200)) and a unique identifier: fieldid (1-40000).
I have a viewable area of 16x9 fields. When the user visits website.com/fieldid/422 it displays 16x9 fields starting with fieldid 422 in the upper-left corner. This obviously follows the XY system, which means the field in the second row, right below #422, is #622.
The user should be able to navigate Up, Down, Left and Right (meaning increment/decrement the X or Y value accordingly). I have a function which converts XY values to fieldids and vice-versa.
Everything good so far, I can:
Reload the entire page when a user clicks a navigate button (got this)
Send an ajax-request and get a jsonstring with the new 16x9 fields (got this)
But I want to build in some sort of caching system so that the data sent back from the server can be minimized after the first load. This would probably mean only sending new 'rows' or 'columns' of fields and storing them in some sort of JavaScript multidimensional array bigger than the 16x9 used for displaying. But I can't figure it out. Can somebody assist?
I see two possible solutions.
1 If you use ajax to get new tiles and do not reload the entire page very often, you may just use an object that holds the contents of each tile, using unique tile ids as keys, like:
var mapCache = {
'1' : "tile 1 data",
'2' : "tile 2 data"
//etc.
}
When the user requests new tiles, first check whether you already have them in your object (you know which tiles will be needed for a given area), then download only what you need and add new key/value pairs to the cache - see the sketch below. Obviously all cached data will disappear as soon as the page is reloaded by the user.
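A rough sketch of that check-cache-first flow; the endpoint, parameter name and response shape (an object keyed by fieldid) are invented for illustration, and it assumes jQuery for the ajax call:
function getTiles(neededIds, onReady) {
    var missing = neededIds.filter(function (id) {
        return !(id in mapCache);                      // only ask the server for tiles we don't hold
    });
    if (missing.length === 0) {
        onReady(neededIds.map(function (id) { return mapCache[id]; }));
        return;
    }
    $.getJSON('/getfields', { ids: missing.join(',') }, function (tiles) {
        for (var id in tiles) {                        // assumes { fieldid: data, ... } comes back
            mapCache[id] = tiles[id];
        }
        onReady(neededIds.map(function (id) { return mapCache[id]; }));
    });
}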
2 If you reload the page for each request you might split your tiles into separate javascript "files". It doesn't really matter how this is implemented on the server - static files like tile1.js, tile2.js etc., or a dynamic script (probably with some server-side cache) like tile.php?id=1, tile.php?id=2 etc. What's important is that the server sends proper HTTP headers and makes it possible for the browser to cache these requests. So when a page containing some 144 tiles is requested you have 144 <script /> elements, each one containing the data for one tile, and each one will be stored in the browser's cache. This solution makes sense only if there's a lot of data for each tile and the data doesn't change on the server very often, or/and there's a significant cost to tile generation/transfer.
You could just have an array of 40,000 references. Basically, empty array elements don't take up a lot of room until you actually put something in them (it's one of the advantages of a dynamically typed language). Javascript doesn't know if you are going to put an int or an object into an array element, so it doesn't allocate the elements until you put something in them. So to summarize, just put them in an array - that simple!
Alternatively, if you don't want the interpreter to allocate 40,000 NULLs at the start, you could use a dictionary approach, with the keys being the 1-40,000 array indices. Now the unused elements don't even get allocated. Though if you are eventually going to fill a substantial portion of the map, the dictionary method is much less efficient. Both options are sketched below.
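For illustration only - both of these stay small until you actually store something in them:
var fieldsArr = new Array(40000);     // sparse array: unset slots aren't really allocated
fieldsArr[421] = { x: 22, y: 3 };     // e.g. fieldid 422 stored at index 421

var fieldsObj = {};                   // dictionary keyed by fieldid
fieldsObj[422] = { x: 22, y: 3 };
if (!(423 in fieldsObj)) {
    // field 423 hasn't been loaded yet - fetch it
}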
Have a single associative array, which initially starts out with zero values.
If the user visits, say, grid 32x41y, you set a value for the array like this:
if (!('32' in visitedGrids))
{
    visitedGrids['32'] = {};
}
visitedGrids['32']['41'] = data;
(This is pseudo-code; I haven't checked the syntax.)
Then you can check to see if the user has visited the appropriate grid coordinates by seeing if there is a value in the associative array.
How do you fix a names mismatch problem, if the client-side names are keywords or reserved words in the server-side language you are using?
The DOJO JavaScript toolkit has a QueryReadStore class that you can subclass to submit REST patterned queries to the server. I'm using this in conjunction w/ the FilteringSelect Dijit.
I can subclass the QueryReadStore and specify the parameters and arguments getting passed to the server. But somewhere along the way, a "start" and "count" parameter are being passed from the client to the server. I went into the API and discovered that the QueryReadStore.js is sending those parameter names.
I'm using Fiddler to confirm what's actually being sent and brought back. The server response is telling me I have a parameter names mismatch, because of the "start" and "count" parameters. The problem is, I can't use "start" and "count" in PL/SQL.
Workaround or correct implementation advice would be appreciated...thx.
//I tried putting the code snippet in here, but since it's largely HTML, that didn't work so well.
While it feels like the wrong thing to do, because I'm hacking at a well tested, nicely written JavaScript toolkit, this is how I fixed the problem:
I went into the DOJOX QueryReadStore.js and replaced the "start" and "count" references with acceptable (to the server-side language) parameter names.
I would have liked to handle the issue via my PL/SQL (but I don't know how to get around the reserved words) or via client-side code (subclassing did not do the trick)...without getting into the internals of the library. But it works, and I can move on.
As opposed to removing it from the API, as you mentioned, you can actually create a subclass with your own fetch, and remove start/count parameters (theoretically). Have a look at this URL for guidance:
http://www.sitepen.com/blog/2008/06/25/web-service-data-store/
Start and count are actually very useful because they allow you to pass params for the query that you can use to filter massive data sets, and they help to manage client-side paging. I would try to subclass instead, intercept, and remove.
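Untested sketch of that subclass-and-intercept idea; the p_start/p_count names are just placeholders your PL/SQL can accept, and it leans on the assumption that QueryReadStore only appends its own start/count to the server query when doClientPaging is false:
dojo.require("dojox.data.QueryReadStore");

dojo.declare("my.RenamedPagingStore", dojox.data.QueryReadStore, {
    doClientPaging: true,                    // keep the base class from adding start/count itself
    fetch: function (request) {
        // Pass the paging values under names the server will accept instead.
        var extras = { p_start: request.start || 0 };        // placeholder parameter names
        if (request.count) { extras.p_count = request.count; }
        request.serverQuery = dojo.mixin({}, request.query, extras);
        return this.inherited(arguments);
    }
});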
Is your pl/sql program accessed via a URL and mod_plsql? If so, then you can use "flexible parameter passing" and the variables get assigned to an array of name/value pairs.
Define your package spec like this...
create or replace package pkg_name AS
TYPE plsqltable
IS
TABLE OF VARCHAR2 (32000)
INDEX BY BINARY_INTEGER;
empty plsqltable;
PROCEDURE api (name_array IN plsqltable DEFAULT empty ,
value_array IN plsqltable DEFAULT empty
);
END pkg_name;
Then the body:
CREATE OR REPLACE PACKAGE BODY pkg_name AS
l_count_value number;
l_start_value number;
PROCEDURE api (name_array IN plsqltable DEFAULT empty,
value_array IN plsqltable DEFAULT empty) is
------------
FUNCTION get_value (p_name IN VARCHAR) RETURN VARCHAR2 IS
BEGIN
FOR i IN 1..name_array.COUNT LOOP
IF UPPER(name_array(i)) = UPPER(p_name) THEN
RETURN value_array(i);
END IF;
END LOOP;
RETURN NULL;
END get_value;
----------------------
begin
l_count_value := get_value('count');
l_start_value := get_value('start');
end api;
end pkg_name;
Then you can call pkg_name.api using
http://server/dad/!pkg_name.api?start=3&count=3