Working with very large JavaScript arrays or objects - javascript

Currently I'm working on a script where I've got to store more than 1,250,000 objects. I'm using the push() function in a jQuery each() loop like this:
words.push({
    "page": pagenumber,
    "content": word,
    "hpos": $(this).attr("HPOS"),
    "vpos": $(this).attr("VPOS"),
    "width": $(this).attr("WIDTH"),
    "height": $(this).attr("HEIGHT")
});
In Chrome it goes quite fast, between 30 and 40 seconds, but in Internet Explorer it can take up to 360 seconds.
It's for a project where old newspapers are loaded and you can search the text from those newspapers. The newspapers are in a directory and are loaded dynamically. In the test I'm doing, I'm using newspapers from October 1926, containing 308 pages and over 1,250,000 words.
Is there a better/faster way to achieve this?

Is there a better/faster way to achieve this?
Yes: Do it on a server, not in the browser. This also has the advantage that you can do it once and reuse the information.
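For instance, a minimal sketch of what the browser side could then look like (the URL, file layout and startSearch function are hypothetical; the point is that the server parses the page XML once and the browser just loads ready-made JSON):
// Hypothetical: the server has already extracted the word list for this issue
// into a static JSON file, so the browser does no per-word attribute parsing.
$.getJSON("/newspapers/1926-10/words.json", function (words) {
    // words arrives as an array of {page, content, hpos, vpos, width, height}
    startSearch(words); // hypothetical function that wires up the search UI
});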
But assuming for some reason that's not possible:
The first thing you can do is stop making several million unnecessary function calls by only doing $(this) once per loop:
.....each(function () {
    var $this = $(this);
    words.push({
        "page": pagenumber,
        "content": word,
        "hpos": $this.attr("HPOS"),
        "vpos": $this.attr("VPOS"),
        "width": $this.attr("WIDTH"),
        "height": $this.attr("HEIGHT")
    });
});
Normally repeatedly doing that isn't a big deal (though I'd avoid it anyway), but if you're doing this one and a quarter million times...
And if all of those attributes really are attributes, you can avoid the call entirely by cutting jQuery out of the middle:
.....each(function () {
    words.push({
        "page": pagenumber,
        "content": word,
        "hpos": this.getAttribute("HPOS"),
        "vpos": this.getAttribute("VPOS"),
        "width": this.getAttribute("WIDTH"),
        "height": this.getAttribute("HEIGHT")
    });
});
jQuery's attr function is great and smooths over all sorts of cross-browser hassles with certain attributes, but I don't think any of those four needs special handling, not even on IE, so you can just use DOM's getAttribute directly.
The next thing is that some JavaScript engines execute push more slowly than assigning to the end of the array. These two statements do the same thing:
myarray.push(entry);
// and
myarray[myarray.length] = entry;
But other engines process push as fast or faster than the assignment (it is, after all, a ripe target for optimization). So you might look at whether IE does push more slowly and, if so, switch to using assignment instead.
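If you want to check that for the browsers you care about, here is a rough sketch of such a comparison (purely illustrative; run it in each target browser):
// Rough sketch: time push() vs. index assignment in the current browser.
var N = 1000000;

var a = [];
var t0 = new Date().getTime();
for (var i = 0; i < N; i++) {
    a.push({ index: i });
}
console.log("push: " + (new Date().getTime() - t0) + " ms");

var b = [];
var t1 = new Date().getTime();
for (var j = 0; j < N; j++) {
    b[b.length] = { index: j };
}
console.log("assignment: " + (new Date().getTime() - t1) + " ms");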

Related

JavaScript : Fastest way to insert DOM element in sort order

So I've got several (~30) async calls returning with data (~25 records per call), which I want to display in a specific order.
Currently, the page waits for everything to load, sorts a single internal array, then adds the DOM elements (each item of data is applied to an HTML template/string which is effectively concatenated and added once to the parent element's innerHTML).
I'd LIKE the data to be inserted with each dataset (as it comes back)... but that implies that I need a different way to handle the sort/ordering.
Approaches I've considered:
Ideally, mirror the DOM in some sort of B-Tree, so that INSERT operations just traverse the tree looking for the correct element to insertBefore/insertAfter... since I've yet to see any library to address this need, it seems like I'd end up writing a bit of code.
manually iterate the DOM looking for the element to insertAfter... seems tediously slow, relatively speaking.
just use jQuery.sort(fn) after loading each dataset... seems hacky at best (given the number of times it'd be run, and the volume of DOM manipulation), but by far the easiest to implement code-wise, since it's like 5 lines.
I considered some sort of buffer queue between the async returns and the DOM manipulation, but there's far too much that I DON'T know about JS and threading to feel comfortable with that method.
I like the idea of inserting directly into the sorted slot, but I am aware that DOM manipulation can be "slow" (depending on how it's done, etc - and I am by no means a JS guru - thus asking here). The idea of the buffer queue with a separate reader/DOM handling seemed like it might provide a reasonable compromise between responsiveness and the DOM manipulation pains, but that's all theoretical for me at this point... and all said and done, if it ends up being more hassle than it's worth, I'll either leave as-is, or just go the lazy route of jQ.sort the DOM.
Your knowledgeable advice would be greatly appreciated :)
Thanks
I'd go with Option 2. The DOM is just objects in a tree structure, so there's no need for a separate abstract tree other than if you want one. You can associate data directly with the elements via attributes or expando properties (if doing the latter, beware of naming conflicts, pick something very specific to your project) — the latter have the advantage of speed and not being limited to strings.
Searching through a list of DOM elements isn't markedly slow at all, nor is it much work.
Example inserting random numbers in divs within a container div:
var container = document.getElementById("container");

function insertOne() {
    // Our random number
    var num = Math.floor(Math.random() * 1000);
    // Convenient access to child elements
    var children = Array.prototype.slice.call(container.children);
    // Find the one we should insert in front of
    var before = children.find(function(element) {
        return element.__x__random > num;
    });
    // Create the div
    var div = document.createElement('div');
    div.innerHTML = num;
    div.__x__random = num;
    // Insert (when `before` is undefined/null, insertBefore behaves like an append)
    container.insertBefore(div, before);
}

// Do it every 250ms
var timer = setInterval(insertOne, 250);
// Stop after 5 seconds
setTimeout(function() {
    clearInterval(timer);
}, 5000);
<div id="container"></div>

Why is "this" more effective than a saved selector?

I was doing this test case to see how much using the this selector speeds up a process. While doing it, I decided to try out pre-saved element variables as well, assuming they would be even faster. Using an element variable saved before the test appears to be the slowest, quite to my confusion. I thought only having to "find" the element once would immensely speed up the process. Why is this not the case?
Here are my tests from fastest to slowest, in case anyone can't load it:
1
$("#bar").click(function(){
    $(this).width($(this).width() + 100);
});
$("#bar").trigger("click");
2
$("#bar").click(function(){
    $("#bar").width($("#bar").width() + 100);
});
$("#bar").trigger("click");
3
var bar = $("#bar");
bar.click(function(){
    bar.width(bar.width() + 100);
});
bar.trigger("click");
4
// `par` is a jQuery object saved before the test runs (in the test's setup code)
par.click(function(){
    par.width(par.width() + 100);
});
par.trigger("click");
I'd have assumed the order would go 4, 3, 1, 2, based on how often each one has to use the selector to "find" the element.
UPDATE: I have a theory, though I'd like someone to verify this if possible. I'm guessing that on click, it has to reference the variable, instead of just the element, which slows it down.
Fixed test case: http://jsperf.com/this-vs-thatjames/10
TL;DR: Number of click handlers executed in each test grows because the element is not reset between tests.
The biggest problem with testing for micro-optimizations is that you have to be very very careful with what you're testing. There are many cases where the testing code interferes with what you're testing. Here is an example from Vyacheslav Egorov of a test that "proves" multiplication is almost instantaneous in JavaScript because the testing loop is removed entirely by the JavaScript compiler:
// I am using Benchmark.js API as if I would run it in the d8.
Benchmark.prototype.setup = function() {
    function multiply(x, y) {
        return x * y;
    }
};

var suite = new Benchmark.Suite;
suite.add('multiply', function() {
    var a = Math.round(Math.random() * 100),
        b = Math.round(Math.random() * 100);
    for (var i = 0; i < 10000; i++) {
        multiply(a, b);
    }
})
Since you're already aware there is something counter-intuitive going on, you should pay extra care.
First of all, you're not testing selectors there. Your testing code is doing: zero or more selectors, depending on the test, a function creation (which in some cases is a closure, others it is not), assignment as the click handler and triggering of the jQuery event system.
Also, the element you're testing on is changing between tests. It's obvious that the width in one test is more than the width in the test before. That isn't the biggest problem though. The problem is that the element in one test has X click handlers associated. The element in the next test has X+1 click handlers.
So when you trigger the click handlers for the last test, you also trigger the click handlers associated in all the tests before, making it much slower than tests made earlier.
I fixed the jsPerf, but keep in mind that it still doesn't test just the selector performance. Still, the most important factor that skews the results is eliminated.
Note: There are some slides and a video about doing good performance testing with jsPerf, focused on common pitfalls that you should avoid. Main ideas (a small sketch applying them follows the list):
don't define functions in the tests, do it in the setup/preparation phase
keep the test code as simple as possible
compare things that do the same thing or be upfront about it
test what you intend to test, not the setup code
isolate the tests, reset the state after/before each test
no randomness. mock it if you need it
be aware of browser optimizations (dead code removal, etc)
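For instance, a rough sketch applying a few of those points (helpers defined in setup, no randomness, a guard against dead-code removal) with the Benchmark.js API used above; the sink variable is an assumption of mine to make it less likely the engine discards the call:
// Helpers and shared state are defined in the setup phase, not in the timed body.
Benchmark.prototype.setup = function() {
    var sink = 0;
    function multiply(x, y) {
        return x * y;
    }
};

var suite = new Benchmark.Suite;

suite.add('multiply', function() {
    // Fixed inputs (no randomness), and the result is consumed via `sink`.
    sink += multiply(3, 7);
});

suite.run();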
You aren't really testing the performance difference between the techniques.
If you look at the output of the console for this modified test:
http://jsperf.com/this-vs-thatjames/8
You will see how many event listeners are attached to the #bar object.
And you will see that they are not removed at the beginning for each test.
So the following tests will always become slower than the previous ones, because the trigger function has to call all the previously attached callbacks.
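A rough sketch of the kind of reset each test would need (off requires jQuery 1.7+; in older versions unbind plays the same role):
// Teardown between tests: drop the accumulated click handlers and
// restore the width the previous tests kept growing.
$("#bar").off("click").width(100); // 100 is just an illustrative starting width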
Part of the difference is that with this the object reference is already at hand, so the engine doesn't have to go looking up a variable:
$("#bar").click(function(){
$(this).width($(this).width()+100); // Only has to check the function call
}); // each time, not search the whole memory
as opposed to
var bar = $("#bar");
...
bar.click(function(){
bar.width(bar.width()+100); // Has to search the memory to find it
}); // each time it is used
As zerkms said, dereferencing (having to look up the memory reference as described above) has some effect on performance, but only a small one.
Thus the main source of the difference in the tests you performed is the fact that the DOM is not reset between each test. In actuality, a saved selector performs just about as fast as this.
Looks like the performance results you're getting have nothing to do with the code. If you look at these edited tests, you can see that having the same code in two of the tests (first and last) yields totally different results.
I don't know, but if I had to guess I would say it is due to concurrency and multithreading.
When you do $(...) you call the jQuery constructor and create a new object that gets stored in the memory. However, when you reference to an existing variable you do not create a new object (duh).
Although I have no source to quote, I believe that every JavaScript event gets called in its own thread so events don't interfere with each other. By this logic the compiler would have to get a lock on the variable in order to use it, which might take time.
Once again, I am not sure. Very interesting test btw!

MooTools selectors $ or $$: which is more performant?

I am curious as to which selector would be quicker for the browser to look up. I am interested in general rather than in specific browsers.
$('accountDetailsTable').getElements('.toggler')
This will look for accountDetailsTable as if you were using document.getElementById('accountDetailsTable'); and then look for .toggler inside of the element.
$$('.toggler')
Whereas this one will return all matching elements directly. But ultimately they will both give me the same result.
So which one would be quicker? How could I test this?
Performance of selectors will differ between browsers significantly, e.g. with querySelector and querySelectorAll (QSA) vs without; and when without QSA, with getElementsByClassName or without.
The amount of work that the selector engine needs to do will vary.
You can write this differently; there are 3 ways to do so, not just 2:
1. $('accountDetailsTable').getElements('.toggler')
The anatomy of the above is:
get the element (1 call).
call the getElements method from the element prototype (or the element directly in old IE)
look for all childNodes that match the class selector
This will be consistent across browsers in that it gets a root node and calls methods off it. If there is no QSA, it will go to getElementsByClassName or walk all childNodes and filter the className property until it has a list of matches. Performance-wise, this will be worse in modern browsers because it needs to chain, whereas method 3 will be a direct result.
Because of how selectors work in JS, unlike in CSS - it's from left to right, and more qualified selectors improve performance - having something like .toggler means an older browser needlessly has to consider every single DOM node in the tree (except for text nodes). Always qualify the selectors when you can, i.e. div.toggler or a.toggler.
2. $$('.toggler')
If QSA is available, rely on the browser to return stuff (no extra hacks here like :not, :contains, :has or ! (reverse combinators)).
If there is no QSA, it will parse the expression via the Slick parser and basically degrade internally to something like case 1 above. The biggest difference here is the lack of context - it will run against the whole document, as $$ internally does something like document.getElements('.toggler'), so there are a lot more nodes to consider. Always anchor your queries to a stable upper-most common node, or pass it into the query string for qualification as in case 3.
Once again, performance will improve by making it more qualified, e.g.:
$$('a.toggler')
3. $$('#accountDetailsTable .toggler')
Similar to case 2, but when QSA is available it will be faster.
When QSA is not there, it will run against the context of the #accountDetailsTable node, so it will be better than case 2.
Making this more qualified will make a difference:
$$('#accountDetailsTable td.control > a.toggler')
Big caveat: it will depend on how big your DOM is and how many matches are found and returned. On a simple DOM, the expected performance order may differ.
Performance optimisations of selectors are increasingly irrelevant these days.
The days of SlickText comparisons of frameworks are over, and application performance has little to do with selector speed - or you are doing something wrong.
If you do your job right, you won't need to be constantly selecting elements. You can cache things, reuse, be smart and reduce DOM lookups to a minimum.
Events etc can be attached through smart event delegation, where appropriate, completely negating the need to select and add events to multiple nodes etc - use common sense and don't get hung up on theoretical performance benchmarks. Instead, use a profiler and test your actual app and see where you lose time / CPU cycles.
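As a rough sketch of that idea in MooTools (the :relay pseudo-event comes from Element.Delegation in MooTools More; the selector is borrowed from the question and the class name toggled here is made up):
// One handler on the container instead of one per matching element.
$('accountDetailsTable').addEvent('click:relay(a.toggler)', function(event, target) {
    // `target` is the a.toggler that was actually clicked
    target.toggleClass('active'); // 'active' is an illustrative class name
});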
You can run a simple selector like that over 50000 times in the space of a second. That's micro-benchmarking and does not measure actual real world performance inside your APP, DOM, browser etc.
More thoughts on benchmarking performance and premature optimisations: http://www.youtube.com/watch?v=65-RbBwZQdU
Careful
Write for correctness and readability first. Even if you can get extra performance by doing something, don't do it at the expense of readability if it's not critical / reused all the time. Selectors like .toggler, .new, .button etc tend to be too generic and can be re-used across the app in different parts of the DOM. You need to qualify selectors to also ensure your intended functionality will continue working when it gets moved into a different page / widget / DOM and your project gets an influx of new developers.
Just my two cents - I know you already accepted the answer.
My internal MooTools knowledge got kind of rusty over the last two years, but generally speaking I expect the $$-selector to be much faster, since it has to run through the elements only once.
Why not give it a try? See this JSfiddle:
Rough HTML:
<div id="accountDetailsTable">
<div id=".toggler">1</div>
<div id=".toggler">2</div>
<div id=".toggler">3</div>
</div>
JavaScript (please don't comment on the bad code, it's just to demonstrate the functionality):
window.addEvent('domready', function() {
    var iter = 50000;

    var tstart = new Date().getTime();
    for (var i = 0; i < iter; i++) {
        var x = $('accountDetailsTable').getElements('.toggler');
    }
    var tend = new Date().getTime();
    var tdur = tend - tstart;
    console.log('$: ' + tdur + ' ms');

    var tstart = new Date().getTime();
    for (var i = 0; i < iter; i++) {
        var x = $$('.toggler');
    }
    var tend = new Date().getTime();
    var tdur = tend - tstart;
    console.log('$$: ' + tdur + ' ms');
});
That said, my tests with about 50000 iterations led to roughly these results:
$('accountDetailsTable').getElements('.toggler') : ~ 4secs
$$('.toggler') : ~ 2secs
Results will vary across browsers, elements and much more, so this is only a rough approximation.
This said, you'll probably only feel a difference if you have that many selectors in such a short period. Yes, performance should be thought about, but if you only have a few selectors in your application, it shouldn't really matter.
I prefer the $$(), not because of the better performance, but because of better readability.

Very slow filters with CouchDB even with Erlang

I have a database (CouchDB) with about 90k documents in it. The documents are very simple, like this:
{
    "_id": "1894496e-1c9e-4b40-9ba6-65ffeaca2ccf",
    "_rev": "1-2d978d19-3651-4af9-a8d5-b70759655e6a",
    "productName": "Cola"
}
Now I want to one day sync this database with a mobile device. Obviously 90k docs shouldn't go to the phone all at once. This is why I wrote filter functions. These are supposed to filter by "productName" - at first in JavaScript, later in Erlang to gain performance. These filter functions look like this in JavaScript:
{
    "_id": "_design/local_filters",
    "_rev": "11-57abe842a82c9835d63597be2b05117d",
    "filters": {
        "by_fanta": "function(doc, req){ if(doc.productName == 'Fanta'){ return doc;}}",
        "by_wasser": "function(doc, req){if(doc.productName == 'Wasser'){ return doc;}}",
        "by_sprite": "function(doc, req){if(doc.productName == 'Sprite'){ return doc;}}"
    }
}
and like this in Erlang:
{
    "_id": "_design/erlang_filter",
    "_rev": "74-f537ec4b6508cee1995baacfddffa6d4",
    "language": "erlang",
    "filters": {
        "by_fanta": "fun({Doc}, {Req}) -> case proplists:get_value(<<\"productName\">>, Doc) of <<\"Fanta\">> -> true; _ -> false end end.",
        "by_wasser": "fun({Doc}, {Req}) -> case proplists:get_value(<<\"productName\">>, Doc) of <<\"Wasser\">> -> true; _ -> false end end.",
        "by_sprite": "fun({Doc}, {Req}) -> case proplists:get_value(<<\"productName\">>, Doc) of <<\"Sprite\">> -> true; _ -> false end end."
    }
}
To keep it simple there is no query yet but a "hardcoded" string. The filters all work. The problem is they are way too slow. I wrote a test program, first in Java, later in Perl, to measure the time it takes to filter the documents. Here is one of my Perl scripts:
use LWP::Simple;              # provides get()
use DBIx::Class::TimeStamp;   # provides get_timestamp()

$dt = DBIx::Class::TimeStamp->get_timestamp();
$content = get("http://127.0.0.1:5984/mobile_product_test/_changes?filter=local_filters/by_sprite");
$dy = DBIx::Class::TimeStamp->get_timestamp() - $dt;
$dm = $dy->minutes();
$dz = $dy->seconds();
@contArr = split("\n", $content);
$arraysz = @contArr;          # array in scalar context gives the element count
$arraysz = $arraysz - 3;
$\ = "\n";
print($dm.':'.$dz.' with '.$arraysz.' Elements (JavaScript)');
And now the sad part. These are the times I get:
2:35 with 2 Elements (Erlang)
2:40 with 10000 Elements (Erlang)
2:38 with 30000 Elements (Erlang)
2:31 with 2 Elements (JavaScript)
2:40 with 10000 Elements (JavaScript)
2:51 with 30000 Elements (JavaScript)
BTW, these are minutes:seconds. The number is the number of elements returned by the filter, and the database had 90k elements in it. The big surprise was that the Erlang filter was not faster at all.
To request all elements only takes 9 seconds. And creating views about 15. But it is not possible for my use on a phone to transfer all documents (volume and security reasons).
Is there a way to filter on a view to get a performance increase?
Or is something wrong with my Erlang filter functions (I'm not surprised by the times for the JavaScript filters)?
EDIT:
As pointed out by pgras, the reason why this is slow is posted in the answer to this question. To have the Erlang filters run faster I need to go a "layer" below and program the Erlang directly into the database, not as a _design document. But I don't really know where to start or how to do this. Any tips would be helpful.
This has been a while since I asked this question. But I thought I would come back to it and share what we ended up doing to solve this.
So the short answer is filter speed can't really be improved.
The reason lies in the way filters work. If you check your database changes, they are here:
http://<ip>:<port>/<databaseName>/_changes
This document contains all changes belonging to your database. If you do anything in your database, new lines are just added. When one now wants to use a filter, the filter is parsed from JSON into the specified language and applied to every line in this file. To be clear, as far as I am aware the parsing is done for each line as well. This is not very efficient and can't be changed.
So I personally think for most use cases filters are too slow and can't be used. This means we have to find a way around this. I do not imply that I have a general solution. I can just say that for us here it was possible to use views instead of filters. Views generate trees internally and are lightning fast compared to filters. A simple view is also stored in a design document and could look like this:
{
    "_id": "_design/all",
    "language": "javascript",
    "views": {
        "fantaView": {
            "map": "function(doc) { \n if (doc.productName == 'Fanta') \n emit(doc.locale, doc)\n} "
        }
    }
}
Where fantaView is the name of the view. I guess the function is self-explanatory. So this is what we did; I hope it helps someone who runs into a similar issue.
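Querying such a view is then just an HTTP GET against the design document (the host and database name are taken from the question's Perl script; the key value is a made-up locale):
http://127.0.0.1:5984/mobile_product_test/_design/all/_view/fantaView
http://127.0.0.1:5984/mobile_product_test/_design/all/_view/fantaView?key="de_DE"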
I may be wrong, but filter functions should return boolean values, so try changing one to:
function(doc, req){ return doc.productName === 'Fanta';}
It may solve your performance problem...
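Along the same lines, instead of one hard-coded filter per product you could read the product name from the request - req.query holds the query-string parameters passed to _changes (the filter and parameter names here are just examples):
function(doc, req) {
    // called as .../_changes?filter=local_filters/by_product&name=Fanta
    return doc.productName === req.query.name;
}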
Edit:
Here is an explanation about why it is slow (at least with JavaScript)...
One solution would be to use a view to select the ids of the documents to sync and then to start the sync by specifying the doc_ids to sync.
For example the view would be:
function(doc){
    emit(doc.productName, doc._id)
}
You could call the view with _design/docs/_view/by_productName?key="Fanta"
And then start the replication with the found doc ids...
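Kicking off that replication is a POST to _replicate with the collected ids - roughly like this (source, target and the id list are placeholders):
POST /_replicate
{
    "source": "mobile_product_test",
    "target": "http://phone:5984/mobile_product_test",
    "doc_ids": ["1894496e-1c9e-4b40-9ba6-65ffeaca2ccf", "..."]
}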
In general CouchDB filters are slow. Others have already explained why they are slow. What I found was that the only reasonable way to use filters is with "since". Otherwise, in a reasonably large database (mine has 47k docs, and they are complex docs) filters don't work. We learnt this the hard way by migrating from dev to prod [a few hundred docs to ~47k docs]. We also changed the design to query a view, and because we required continuous-feed-like behaviour, we used Spring's @Scheduled.
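For reference, since is just another query parameter on the _changes feed, so each poll only runs the filter over changes made after the given sequence number (the sequence number here is made up):
http://127.0.0.1:5984/mobile_product_test/_changes?filter=local_filters/by_fanta&since=41000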

Contextualizing jQuery

I've got a fairly large site, with a lot of jQuery code for lots of different pages. We're talking about 1000 lines of fairly well optimized code (excluding plugins).
I know jQuery is fairly good at ignoring listeners for page elements that don't exist, but it still has to test their existence when the page loads. I'm also creating a load of vars (including decent sized arrays and objects), but only a few of them are used on each page.
My Question is: What's the best method of cutting down the amount of work each page has to do?
However, I do NOT want to split up the code into separate files. I want to keep all my code in 1 place and for the sake of efficiency I only want to call the server to download JS once (it's only 30kb, smaller than most images).
I've thought of several ways so far:
Put all my code chunks into named functions, and have each page call the functions it needs from inline <script> tags.
Have each page output a variable as the pageID, and for each chunk of code have an if statement: if (pageID == 'about' || pageID == 'contact') {code...}
Give each page (maybe the body tag) a class or ID that can be used to identify the chunks that need executing: if ($('.about').length || $('.contact').length) {code...}
Combine 1 and 2 (or 1 and 3), so that each page outputs a variable, and the if statements are all together and call the functions: if (pageID == 'about') {function calls...}
Any other ideas? Or which is the best/most efficient of those?
Your first option will be fastest (by a minute margin).
You'll need to remember to call the functions from the correct pages.
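A minimal sketch of option 1, with made-up names, just to show the shape:
// In the shared JS file: one named init function per page.
var Site = {
    initAbout:   function () { /* about-page bindings, vars, etc. */ },
    initContact: function () { /* contact-page bindings */ }
};

// In each page, a tiny inline script tag calls only what that page needs:
// <script> $(function () { Site.initAbout(); }); </script>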
However, don't bother.
Unless you've measured a performance impact in a profiler, there is no need to optimize this much.
I would argue that you are taking more of a performance hit from downloading the 30kb than you will ever see from the code execution. That said, you could always test your URL to determine the page and run all setup methods through a bootloader that determines the correct functions to run / events to bind at load time. Something like the following, maybe:
$(function(){
    // Map each page name to the setup functions it needs
    var page_methods = {
            home  : [meth_1, meth_2],
            about : [meth_3, meth_2]
        },
        // Grab the last path segment, e.g. "/about" -> "about"
        page = location.pathname.match(/\/(\w+)$/)[1],
        i = 0,
        meth;

    // Run every method registered for the current page
    for ( ; meth = page_methods[ page ][ i++ ] ; ){
        meth();
    }
});
