SQL on top of apache arrow in-browser?


I have data that is stored on a client's browser in-memory. For example, let's say the dataset is as follows:
"name" (string), "age" (int32), "isAdult" (bool)
"Tom", 29, 1
"Tom", 14, 0
"Dina", 20, 1
I would like to run non-trivial SQL statements on this data in javascript, such as:
SELECT name, GROUP_CONCAT(age ORDER BY age) ages
FROM arrowData a1 JOIN arrowData a2 USING (name)
WHERE a1.isAdult != a2.isAdult
And I would get:
"name" (string), "ages" (string)
"Tom", "14,29"
The data that I have in JavaScript is stored as Apache Arrow (also used in connection with Perspective), and I'd like to execute SQL on that Arrow data. As a last resort, I think it would be possible to use SQLite in wasm, but I'm hoping there is a simpler way to query the Arrow data directly, without having to move it all into a SQLite store first.
Are there any ways to do this?

What you are looking for would be good to have. :) Sadly, because of some trends around 2010, as far as I know there is no actively maintained and supported API for this. But...
If you want full ANSI SQL on the client side in memory and you are willing to populate the database, you could run the SQLite you mentioned. This may be the only option that satisfies you (unless you can drop some of the requirements).
If you can afford to copy the data, check out the AlaSQL project. It supports joins and some ANSI SQL features, but it is not complete and has known disruptive bugs. Its README says:
Please be aware that AlaSQL has bugs. Beside having some bugs, there
are a number of limitations:
AlaSQL has a (long) list of keywords that must be escaped if used for
column names. When selecting a field named key please write SELECT
`key` FROM ... instead. This is also the case for words like value,
read, count, by, top, path, deleted, work and offset.
Please consult the full list of keywords.
It is OK to SELECT 1000000 records or to JOIN two tables with 10000
records in each (You can use streaming functions to work with longer
datasources - see test/test143.js) but be aware that the workload is
multiplied so SELECTing from more than 8 tables with just 100 rows in
each will show bad performance. This is one of our top priorities to
make better.
Limited functionality for transactions (supports only for
localStorage) - Sorry, transactions are limited, because AlaSQL
switched to more complex approach for handling PRIMARY KEYs / FOREIGN
KEYs. Transactions will be fully turned on again in a future version.
A (FULL) OUTER JOIN and RIGHT JOIN of more than 2 tables will not
produce expected results. INNER JOIN and LEFT JOIN are OK.
Please use aliases when you want fields with the same name from
different tables (SELECT a.id AS a_id, b.id AS b_id FROM ?).
At the moment AlaSQL does not work with JSZip 3.0.0 - please use
version 2.x.
JOINing a sub-SELECT does not work. Please use a with structure
(Example here) or fetch the sub-SELECT to a variable and pass it as an
argument (Example here).
AlaSQL uses the FileSaver.js library for saving files locally from the
browser. Please be aware that it does not save files in Safari 8.0.
There are probably many others. Please help us fix them by submitting
an issue. Thank you!
We planned to use it in one project, but introducing it to our stack raised more problems than it solved (for us), so we backed out. I therefore have no production experience with this piece of software...
Years ago I hoped that Google Gears would support something like this, but it was partly replaced by HTML5 client-side storage and sadly the project was discontinued.
The HTML5 Web SQL Database would have been perfect for your use case, but it is sadly deprecated, though most (?) browsers still support it in 2019. You can check some examples here. If you can allow yourself to build on a deprecated API this could be the solution, but I do not really recommend it, as it is not guaranteed to keep working...
When our project ran into the same problems, we ended up using localStorage and programming every "SELECT" manually, which of course was not at all ANSI SQL-like...
Coming back to the original problem, "[SQL] query the Arrow data directly": I have no adapter in mind that would expose it to SQL... These kinds of operations still tend to live on the server side, so that and wasm SQLite are, I think, the options.
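To give a feel for what programming a "SELECT" manually looks like, here is a minimal sketch of the question's self-join written as plain JavaScript. It assumes the rows have already been materialized as plain objects (it does not use the Arrow API), reads the query as grouping by name and concatenating a1.age, and the helper name agesOfMixedNames is invented for illustration.

```javascript
// Sample rows from the question, as plain objects (an assumption: with Arrow
// you would first have to turn the columns into rows like these).
const rows = [
  { name: 'Tom', age: 29, isAdult: 1 },
  { name: 'Tom', age: 14, isAdult: 0 },
  { name: 'Dina', age: 20, isAdult: 1 },
];

// Hypothetical helper that emulates, by hand:
//   SELECT name, GROUP_CONCAT(a1.age ORDER BY a1.age) ages
//   FROM rows a1 JOIN rows a2 USING (name)
//   WHERE a1.isAdult != a2.isAdult
//   GROUP BY name
function agesOfMixedNames(data) {
  const agesByName = new Map();
  // Nested loop = the self-join; the condition covers USING (name) and WHERE.
  for (const a1 of data) {
    for (const a2 of data) {
      if (a1.name === a2.name && a1.isAdult !== a2.isAdult) {
        if (!agesByName.has(a1.name)) agesByName.set(a1.name, []);
        agesByName.get(a1.name).push(a1.age);
      }
    }
  }
  // One output row per name, ages sorted numerically and joined with commas.
  return [...agesByName].map(([name, ages]) => ({
    name,
    ages: ages.sort((x, y) => x - y).join(','),
  }));
}

const manualResult = agesOfMixedNames(rows);
console.log(manualResult);
```

The nested loop is quadratic in the number of rows, which is fine for a handful of records but is exactly the kind of cost a real query engine avoids.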

You can use Alasql to do some of what you want, but it does not support grouping.
var data = [
  { name: 'Tom', age: 29, isAdult: 1 },
  { name: 'Tom', age: 14, isAdult: 0 },
  { name: 'Dina', age: 20, isAdult: 1 }
];
var res = alasql('SELECT name, age from ? a1 JOIN ? a2 WHERE a1.isAdult != a2.isAdult AND a1.name = a2.name', [data, data]);
document.getElementById('result').textContent = JSON.stringify(res);
<script src="https://cdn.jsdelivr.net/alasql/0.2/alasql.min.js"></script>
<span id="result"></span>

There is DuckDB Wasm now which can run SQL on arrow tables.
https://www.npmjs.com/package/@duckdb/duckdb-wasm
https://duckdb.org/2021/10/29/duckdb-wasm.html
DuckDB-Wasm is an in-process analytical SQL database for the browser.
It is powered by WebAssembly, speaks Arrow fluently, reads Parquet,
CSV and JSON files backed by Filesystem APIs or HTTP requests and has
been tested with Chrome, Firefox, Safari and Node.js.
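As a rough sketch of how this could look in code: the helper below assumes you already have an instantiated AsyncDuckDB object (db) from @duckdb/duckdb-wasm and an Apache Arrow Table (arrowTable); setting those up is not shown. The function name runSqlOnArrow and its parameters are made up for illustration, so check the package documentation before relying on the exact calls.

```javascript
// Assumption: `db` is an already-instantiated AsyncDuckDB from
// @duckdb/duckdb-wasm and `arrowTable` is an Apache Arrow Table.
// `runSqlOnArrow` is a hypothetical helper, not part of the library.
async function runSqlOnArrow(db, arrowTable, tableName, sql) {
  const conn = await db.connect();
  try {
    // Register the Arrow table under a name the SQL text can refer to.
    await conn.insertArrowTable(arrowTable, { name: tableName });
    // The query result is itself an Arrow table; toArray() yields its rows.
    const result = await conn.query(sql);
    return result.toArray();
  } finally {
    // Always release the connection, even if the query throws.
    await conn.close();
  }
}
```

It could then be called along the lines of runSqlOnArrow(db, table, 'arrowData', 'SELECT name, age FROM arrowData WHERE age > 18') from inside an async function.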

Related

OData CRM query: Any open opportunities with a given target?

Our organization has a CRM installation on which we've done extensive customization. Right now I'm trying to implement a solution to enforce a business rule: Prevent users from updating a program to inactive when the program is a designation opportunity on an open opportunity.
I know how to prevent the update; return false from OnSave() in the JavaScript. I haven't been able to find out when that's the case. The best idea I've come up with is to make a SOAP call to the OData endpoint in CRM, but I've come across a sticking point at the last step. (If you've got a better idea I'm totally open to it.)
Here's what I've got. I can get the program in question:
programset(guid'thisone')
.../OrganizationData.svc/uwkc_programSet(guid'F4D75E9D-3A79-E611-80DA-C4346BACAAC0')
I can get the associated designations:
programset(guid'thisone')/program-desig
.../OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation
and the associated Opportunities to those:
programset(guid'thisone')/program-desig?$expand=desig-opportunity
...OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation?$expand=uwkc_opportunity_uwkc_opportunitydesignation
... but now I get a little stuck.
I can filter on a primitive value on the Opportunity (link + field)
...$filter=opp-oppdesig/EstimatedCloseDate gt DateTime('2016-07-01')
...OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation?$expand=uwkc_opportunity_uwkc_opportunitydesignation&$filter=uwkc_opportunity_uwkc_opportunitydesignation/EstimatedCloseDate%20gt%20DateTime%272016-07-01%27
and I can filter on a complex value on the Designation (field + value)
...$filter=statecode/Value gt 0
...OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation?$expand=uwkc_opportunity_uwkc_opportunitydesignation&$filter=statecode/Value%20gt%200
but I can't make a filter work on a complex value on the Opportunity (connection + field + value)
...$filter=opp-oppdesig/statecode/Value gt 0
...OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation?$expand=uwkc_opportunity_uwkc_opportunitydesignation&$filter=uwkc_opportunity_uwkc_opportunitydesignation/statecode/Value%20gt%200
No property 'statecode' exists in type 'Microsoft.Xrm.Sdk.Entity' at position 45.
How can I filter on the state of an entity two away from what I'm looking at? Or, if there's a better way, what's the best way to prevent in-use programs from being deactivated?

First issue is that you should be using the schema name of the attribute (StateCode), and not the logical name (statecode).
However, I believe it will then just return another error message:
<error xmlns="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
<code>-2147220989</code>
<message xml:lang="en-US">attributeName</message>
</error>
For some reason, it seems that filtering on complex types in an expanded entity does not work properly for the SOAP endpoint. And from what I have tested, the new Web API does not support this kind of depth in a query either yet.
One solution to your problem is to just fetch all the results, and then perform the filtering manually in your code. This of course works best if you can assume that there are not too many related entities retrieved in this kind of query. Also be sure to use $select to retrieve only the necessary attributes, as it greatly reduces the time it takes for a query to finish.
Another solution is to perform the query using FetchXML instead. This can be done either via the Web API, or as a SOAP request you construct yourself.
A third solution is to split your query into multiple queries, so you don't have to filter on a state two entities away in a query.
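The first workaround (fetch everything, then filter in code) might look roughly like the sketch below. The response shape, the sample data, and the assumption that a StateCode value of 0 means "open" are all illustrative and should be checked against your own CRM instance; hasOpenOpportunity is a hypothetical helper.

```javascript
// Hypothetical shape of the parsed JSON results: one entry per designation,
// each carrying its expanded opportunity. The sample data is invented.
const designations = [
  { uwkc_opportunity_uwkc_opportunitydesignation: { Name: 'A', StateCode: { Value: 1 } } },
  { uwkc_opportunity_uwkc_opportunitydesignation: { Name: 'B', StateCode: { Value: 0 } } },
  { uwkc_opportunity_uwkc_opportunitydesignation: null }, // nothing expanded
];

// Assumption: StateCode 0 means "open" for opportunities.
const OPEN_STATE = 0;

// Returns true if any designation points at an open opportunity.
function hasOpenOpportunity(items) {
  return items.some((item) => {
    const opp = item.uwkc_opportunity_uwkc_opportunitydesignation;
    return Boolean(opp && opp.StateCode && opp.StateCode.Value === OPEN_STATE);
  });
}

// In OnSave() this flag would decide whether to block the deactivation.
const blockDeactivation = hasOpenOpportunity(designations);
console.log(blockDeactivation);
```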

How can I show recent searches done through a textbox using JavaScript/jQuery?

What I want to achieve is this: when I focus on the search bar, it should show a list of recent searches made during this session, and I should be able to select one of them so that it appears in the textbox.
Any help appreciated. If possible, I would like to store these recent searches in the browser cache so that the list is shown whenever I come back to this website.
Thanks in advance.

Assuming you will be using web technologies like HTML and JavaScript, a good start would be to store each search in an array.
Using JavaScript along with the jQuery library, you can easily add an item to an array each time the user clicks a button.
JavaScript:
var myArray = [];
myArray.push($('#yourTextBox').val());
Then you could use jquery's $.each function to display each item in a DOM element.
See the sample below: (I used HTML and javascript with jquery 1.11)
http://jsfiddle.net/1ncf0b6f/3/

TL;DR I've written a library to handle this and its edge cases; see https://github.com/JonasBa/recent-searches#readme for usage.
You should store the recent searches in LocalStorage and retrieve them, then decide how you want to render them. This has some edge cases that you need to consider, which is why I wrote a library to do exactly this; see the examples below.
Examples
Expiration:
Suppose someone searches for iPhone now but searched for repairing iPhone a month ago; that older repairing iPhone query is likely obsolete.
Ranking of recent searches
The same goes for ranking when doing prefix search: if a user searched for "apple television" 3h ago and "television cables" 8h ago, and they now type "television", you probably want a ranking system to order the two.
Safely handling storage
Just writing to LocalStorage will result in a massive JSON that you'll need to parse every time, gradually slowing down your application until you hit the storage limit and lose this functionality.
I've built a recent-searches library which helps you tackle all that. You can use it via npm and find it here. It will help you with all of the above issues and allow you to build recent-searches really quickly!
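For readers who want to see the basic idea without a library, here is a minimal hand-rolled sketch covering two of the points above: expiring old entries and capping the stored list. It is not how the recent-searches library works internally; the key name, the limits, and the helper names are assumptions, and the in-memory storage object only stands in for localStorage so the sketch can run outside a browser.

```javascript
const STORAGE_KEY = 'recentSearches';          // hypothetical key name
const MAX_ITEMS = 5;                           // cap keeps the stored JSON small
const MAX_AGE_MS = 30 * 24 * 60 * 60 * 1000;   // drop entries older than 30 days

// Read the stored list and discard entries that have expired.
function loadSearches(storage, now) {
  const raw = storage.getItem(STORAGE_KEY);
  const items = raw ? JSON.parse(raw) : [];
  return items.filter((item) => now - item.time <= MAX_AGE_MS);
}

// Put the newest query first, remove duplicates, and enforce the cap.
function addSearch(storage, query, now) {
  const kept = loadSearches(storage, now).filter((item) => item.query !== query);
  const next = [{ query, time: now }, ...kept].slice(0, MAX_ITEMS);
  storage.setItem(STORAGE_KEY, JSON.stringify(next));
  return next;
}

// Minimal in-memory stand-in with the same getItem/setItem shape as localStorage.
function createMemoryStorage() {
  const map = new Map();
  return {
    getItem: (key) => (map.has(key) ? map.get(key) : null),
    setItem: (key, value) => { map.set(key, String(value)); },
  };
}
```

In a browser you would pass window.localStorage and Date.now() instead of the stand-in, and render the result of loadSearches when the search box receives focus.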

Performance issues with EmberJS and Rails 4 API

I have an EmberJS application which is powered by a Rails 4 REST API. The application works fine the way it is, however it is becoming very sluggish based on the kind of queries that are being performed.
Currently the API output is as follows:
"projects": [{
"id": 1,
"builds": [1, 2, 3, 4]
}]
The problem arises when a user has lots of projects with lots of builds split between them. EmberJS currently looks at builds key then makes a request to /builds?ids[]=1&ids[]=2 which is the kind of behaviour I want.
This question could have one of two solutions.
Update Rails to load the build_ids more efficiently
Update EmberJS to support different queries for builds
Option 1: Update Rails
I have tried various solutions regarding eager loading and manually grabbing the IDs using custom methods on the serializer. Both of these solutions add a lot of extra code that I'd rather not write, and they still issue individual queries per project.
By default rails also does SELECT * style queries when doing has_many and I can't figure out how to overwrite this at the serializer layer. I also wrote a horrible solution which got the entire thing to one fast query but it involved writing raw SQL which I know isn't the Rails way of doing things and I'd rather not have such a huge complex untestable query as the default scope.
Option 2: Make Ember use different queries
Instead of requesting /builds?ids[]=1&ids[]=2 I would rather not include the builds key at all on the project and make a request to /builds?project_id=1 when I access that variable within Ember. I think I can do this manually on a per field basis by using something similar to this:
builds: function () {
  return this.store.find('builds', { project_id: this.get('id') });
}.property()
instead of the current:
builds: DS.hasMany('build', { async: true })
It's also worth mentioning that this doesn't only apply to "builds". There are 4 other keys on the project object that do the same thing so that's 4 queries per project.

Have you made sure that your database has the proper indexes? Adding an index on project_id in the builds table will make this a lot faster.
Alternatively you should use the links attribute to load your records.
{"projects": [{
"id": 1,
"links": {
"builds": "/projects/1/builds"
}
}]}
This means that the builds table will only be queried when the relationship is accessed.

Things you can try:
Make sure your rails controller only selects the columns needed for JSON serialization.
Ensure you have indexes on the columns present in your where and join clauses unless the column is boolean or has low number of distinct values. Always ensure you have indexes on foreign key columns.
Be VERY VERY careful with how you are using ActiveRecord joins vs includes vs preload vs eager and references. This area is fraught with problems composing scopes together and subtle things can alter the SQL generated and number of queries issued and even what actual results are returned. I noticed differences in minor point releases of AR 4 yielding different query results because of the join strategy AR would choose.
Often you want to aim to reduce the number of SQL's issued to the database but joining tables is not always the best solution. You will need to benchmark and use EXPLAIN to see what works better for your queries. Sometimes sub queries/sub-selects can be more efficient.
Querying by parent_id is a good option if you can get Ember Data to perform the request that way as the database has a simpler query.
You could consider using Ember-Model instead of Ember-Data. I am using it currently as it's much simpler and easier to adapt to my needs, and it supports multi-fetch to avoid 1+N request problems.
You may be able to use embedded models or side-loaded models so your server can reduce the number of web requests AND the number of SQLs and return what the client needs in one request / one SQL. Ember-Model supports both embedded and side-loaded models, so Ember-Data being more ambitious may also.
Although it appears from your question that Ember-Data is doing a multi-fetch, make sure you are doing SQL IN clause for those ID's instead of separate queries.
Make sure that the SQL on your rails side is not fanning out in a 1+N pattern. Using the includes options to effect eager loading on AR relations may help avoid 1+N queries or it may unnecessarily load models depending on the results needed in your response.
I also found that the Ruby JSON serializer libraries are less than optimal. I created a gem ToJson that speeds up JSON serializing many times over the existing solutions. You can try it and benchmark for yourself.
I found that ActiveRecord (including AR 4) didn't work well for me and I moved to Sequel in the end because it gave me so much more control over join types, join conditions, and query composition and tactical eager loading, plus it was just faster, has wider support for standard SQL features and excellent support for postgres features and extensions. These things can make a huge difference to the way you design your database schema and the performance and types of queries you can achieve.
Using Sequel and ToJson I can serve around 30-50 times more requests than I could with ActiveRecord + JBuilder for most of my queries, and in some instances it's hundreds of times better than what I was achieving with AR (especially creates/updates). Besides being faster at instantiating models from the DB, Sequel also has a Postgres streaming adapter which makes it faster again for large results.
Changing your data access/ORM layer and JSON serialisation can achieve 30-50 times faster performance, or alternatively require managing 30-50 times fewer servers for the same load. It's nothing to sneeze at.

How to make a local offline database

I'm making a to-do list application with HTML, CSS, and JavaScript, and I think the best way for me to store the data would be a local database. I know how to use localStorage and sessionStorage, and I also know how to use an online MySQL database. However, this application must be able to run offline and should store its data offline.
Is there a way I could do this with just HTML and JavaScript?
Responding to comments:
"You said you know how to use localStorage... so what seems to be the problem?"
@Lior All I know about localStorage is that you can store a single result as a variable, whereas I wish to store a row with different columns containing different data about the object. However, can localStorage hold an object, and if so, is it referenced with the usual object notation?
Any implementation will probably depend on what browser(s) your users prefer to use.
@paul I think Chrome will be most popular.
Okay, I would like to clarify that what I was asking was indeed How can I do this with JavaScript and HTML rather than Is there a way I could do this with just HTML and JavaScript?. Basically, I wanted a type of SQL database that would save its contents on the user's machine instead of online.
What solved my problem was using WebDB or WebSQL (I think it was called something like that).

I'm about 3 years late in answering this, but considering that there was no actual discussion on the available options at the time, and that the database that OP ended up choosing is now deprecated, I figured I'd throw in my two cents on the matter.
First, one needs to consider whether one actually needs a client-side database. More specifically...
Do you need explicit or implicit relationships between your data items?
How about the ability to query over said items?
Or more than 5 MB in space?
If you answered "no" to all of the above, go with localStorage and save yourself from the headaches that are the WebSQL and IndexedDB APIs. Well, maybe just the latter headache, since the former has, as previously mentioned, been deprecated.
Otherwise, IndexedDB is the only option as far as native client-side databases go, given it is the only one that remains on the W3C standards track.
Check out BakedGoods if you want to utilize any of these facilities, and more, without having to write low-level storage operation code. With it, placing data in the first encountered native database which is supported on a client, for example, is as simple as:
bakedGoods.set({
data: [{key: "key1", value: "val1"}, {key: "key2", value: "val2"}],
storageTypes: ["indexedDB", "webSQL"],
//Will be polyfilled with defaults for equivalent database structures
optionsObj: {conductDisjointly: false},
complete: function(byStorageTypeStoredKeysObj, byStorageTypeErrorObj){}
});
Oh, and for the sake of complete transparency, BakedGoods is maintained by this guy right here :) .
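On the localStorage route, and the question of whether it can hold an object: localStorage only stores strings, so rows with several "columns" are usually serialized with JSON.stringify and parsed back on read. The sketch below shows that round trip for a to-do list; the key name 'todos', the helper names, and the in-memory stand-in (used so the snippet also runs outside a browser) are assumptions for illustration.

```javascript
// Save an array of to-do "rows" as one JSON string.
function saveTodos(storage, todos) {
  storage.setItem('todos', JSON.stringify(todos)); // 'todos' is a hypothetical key
}

// Load the rows back; after JSON.parse they are ordinary objects again.
function loadTodos(storage) {
  const raw = storage.getItem('todos');
  return raw === null ? [] : JSON.parse(raw);
}

// In-memory stand-in with the same getItem/setItem shape as localStorage.
function createMemoryStorage() {
  const map = new Map();
  return {
    getItem: (key) => (map.has(key) ? map.get(key) : null),
    setItem: (key, value) => { map.set(key, String(value)); },
  };
}

const storage = createMemoryStorage(); // in a browser: window.localStorage
saveTodos(storage, [
  { id: 1, title: 'Buy milk', done: false },
  { id: 2, title: 'Write report', done: true },
]);
const todos = loadTodos(storage);
console.log(todos[0].title); // usual object notation works after parsing
```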

Fusion Tables query speed

I'm trying to implement the autocomplete logic available in the Fusion Tables interface using just client-side JavaScript:
So far I found this, which works great: https://developers.google.com/fusiontables/docs/samples/autocomplete
It allows me to retrieve all the values for a property, grouped together, so I can autocomplete them. The issue is that it's extremely slow. The query
"SELECT 'Store Name', COUNT() " +
'FROM ' + tableId + " GROUP BY 'Store Name'"
takes up to 10 seconds to run, each time. This is because my table is quite big with more than 150 thousand rows.
However, the native interface from the screenshot above is dead fast. I tried looking into the code and see what type of query they were making (maybe they have a cache of these results), but I cannot find anything to lead me to a solution.
Any ideas? My thinking is that if the Google native interface is doing it, then there most definitely is a way for me to do it as well... I want to avoid having to use a third party server to cache these results, that would be an easy fix, and it's not the solution to my problem.

I think they use something like a nested set and a trie data structure on the server side. A nested set is fast for queries but not for insertions, and a trie is also fast for text retrieval. I think you can combine the two to make a fast lookup.
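To illustrate the trie half of that idea, here is a minimal sketch of prefix lookup in JavaScript. It says nothing about how Fusion Tables actually works on the server; it only shows why a trie makes autocomplete lookups cheap once the distinct values are known. The class and method names are made up for this example.

```javascript
// Minimal trie for case-insensitive prefix completion (illustrative only).
class Trie {
  constructor() {
    this.root = { children: new Map(), word: undefined };
  }

  // Walk or create one node per character; remember the original spelling.
  insert(word) {
    let node = this.root;
    for (const ch of word.toLowerCase()) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), word: undefined });
      }
      node = node.children.get(ch);
    }
    node.word = word;
  }

  // Follow the prefix, then collect up to `limit` words below that node.
  complete(prefix, limit = 10) {
    let node = this.root;
    for (const ch of prefix.toLowerCase()) {
      node = node.children.get(ch);
      if (!node) return []; // no stored value starts with this prefix
    }
    const out = [];
    const stack = [node];
    while (stack.length > 0 && out.length < limit) {
      const current = stack.pop();
      if (current.word !== undefined) out.push(current.word);
      for (const child of current.children.values()) stack.push(child);
    }
    return out;
  }
}

const trie = new Trie();
['Store A', 'Stop', 'Market'].forEach((name) => trie.insert(name));
const matches = trie.complete('sto');
console.log(matches);
```

The lookup cost depends on the prefix length and the number of matches returned, not on the total number of rows, but the trie still has to be filled once from the distinct values.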
