Finding the Indices of the Differences using Damerau–Levenshtein Method - javascript

https://stackoverflow.com/a/11958496/379650 provides a very good function for calculating the Damerau–Levenshtein distance; however, I would also like to find the index of each difference, expressed in terms of either the first or the second string.
I am open to a method other than Damerau–Levenshtein (if there is a better one), but it seemed like the most logical choice.
Example:
//Indices given in terms of the first string
ld('abc','abc');//[] no mistakes
ld('abc','abd');//[2]
ld('abc','aad');//[1,2]
ld('abc','ac');//[1]
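For what it's worth, one way to get those indices is to compute the plain Levenshtein DP matrix and backtrack through it, recording where edits happen. This is only a minimal sketch (no transposition handling, so it is Levenshtein rather than full Damerau–Levenshtein, and insertions are reported as the position in the first string where the missing character would sit):

function ld(a, b) {
  const m = a.length, n = b.length;
  // d[i][j] = edit distance between a.slice(0, i) and b.slice(0, j)
  const d = Array.from({ length: m + 1 }, (_, i) => [i]);
  for (let j = 0; j <= n; j++) d[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,       // deletion of a[i-1]
        d[i][j - 1] + 1,       // insertion of b[j-1]
        d[i - 1][j - 1] + cost // substitution (or match)
      );
    }
  }
  // Walk back from the bottom-right corner, collecting edit positions.
  const indices = [];
  let i = m, j = n;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && a[i - 1] === b[j - 1] && d[i][j] === d[i - 1][j - 1]) {
      i--; j--;                      // characters match, no edit
    } else if (i > 0 && j > 0 && d[i][j] === d[i - 1][j - 1] + 1) {
      indices.push(i - 1); i--; j--; // substitution at index i-1
    } else if (i > 0 && d[i][j] === d[i - 1][j] + 1) {
      indices.push(i - 1); i--;      // deletion of a[i-1]
    } else {
      indices.push(i); j--;          // insertion before index i
    }
  }
  return indices.reverse();
}

ld('abc', 'abc'); // []
ld('abc', 'abd'); // [2]
ld('abc', 'aad'); // [1, 2]
ld('abc', 'ac');  // [1]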

Related

Realm-JS: Performant way to find the index of an element in sorted results list

I am searching for a performant way to find the index of a given realm-object in a sorted results list.
I am aware of this similar question, which was answered with using indexOf, so my current solution looks like this:
const sortedRecords = realm.objects('mySchema').sorted('time', true) // 'time' property is a timestamp
// grab element of interest by id (e.g. 123)
const item = realm.objectForPrimaryKey('mySchema','123')
// find index of that object in my sorted results list
const index = sortedRecords.indexOf(item)
My basic concern here is performance for larger datasets. Is the indexOf implementation of a realm list optimized in any way, or is it the same as a JavaScript array's? I know there is the possibility to create indexed properties; would indexing the time property improve performance in this case?
Note:
In the realm-js API documentation, the indexOf section does not reference Array.prototype.indexOf, as other sections do. This made me optimistic that it's their own implementation, but it isn't stated clearly.
Realm query methods return a Results object, which is quite different from an Array object; the main difference is that a Results object can change over time even without calling methods on it: adding and/or deleting records in the source schema can change the Results object.
The only common thing between Results.indexOf and Array.indexOf is the name of the method.
That said, it makes little sense to compare the efficiency of the two methods.
In general, a problem common to all indexOf implementations is that they need a sequential scan; in the worst case (the not-found case) a full scan is required. The worst-implemented indexOf executed against 10 elements has no impact on program performance, while the best-implemented indexOf executed against 1M elements can have a severe impact. When possible, it's a good idea to avoid using indexOf on large amounts of data.
Hope this helps.
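If you do need repeated position look-ups, one generic workaround (a sketch only, not Realm-specific; sortedRecords here is a plain array standing in for a Results list) is to build an id-to-index Map once in O(n), then answer each look-up in O(1):

// Stand-in data; a real Results list would come from a Realm query.
const sortedRecords = [
  { id: '123', time: 3 },
  { id: '456', time: 2 },
  { id: '789', time: 1 },
];

// Build the index once: O(n).
const indexById = new Map(sortedRecords.map((r, i) => [r.id, i]));

console.log(indexById.get('456')); // 1

// Caveat: the Map must be rebuilt (or patched) whenever the underlying
// data changes, which, for a live Results object, can happen at any time.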

Is it a bad idea to use indexOf inside loops?

I was studying big O notation for a technical interview and realized that JavaScript's indexOf method has a time complexity of O(n), as it traverses each element of an array and returns the index where it's found.
We also know that a time complexity of O(n^2) (n squared) does not scale well to larger data.
So is it a bad idea to use indexOf inside loops? In JavaScript, it's common to see code where indexOf is used inside loops, perhaps to test membership or to prepare some object.
Rather than arrays, should we prefer objects where appropriate, since they provide lookup in constant time, O(1)?
Any suggestions will be appreciated.
It can be a bad idea to use indexOf inside loops, especially if the data structure you are searching through is quite large.
One workaround is to keep a hash table or dictionary mapping each item to its index, which you can build in O(n) time by looping through the data structure once, and update every time you add to the data structure.
If you push something onto the end of the data structure, updating this table takes O(1) time; the worst case is pushing something onto the beginning, which takes O(n) because every stored index shifts.
In most scenarios it will be worth it, as getting an index becomes O(1). (See the sketch below.)
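As a concrete illustration of that trade-off (a sketch with made-up data), replacing indexOf inside a loop with a Set built once turns an O(n*m) scan into roughly O(n + m):

const firstArray = ['a', 'b', 'c', 'd'];
const secondArray = ['b', 'd', 'e'];

// O(n*m): indexOf rescans secondArray for every element of firstArray.
const slow = firstArray.filter(x => secondArray.indexOf(x) !== -1);

// O(n + m): build the lookup once, then each .has() is O(1) on average.
const lookup = new Set(secondArray);
const fast = firstArray.filter(x => lookup.has(x));

console.log(slow, fast); // ['b', 'd'] ['b', 'd']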
To be honest, tl;dr. But I did some speed tests of the various ways of checking for occurrences in a string (if that is your goal for using indexOf; if you are actually trying to get the position of the match, I personally don't know how to help you there). The ways I tested were:
.includes()
.match()
.indexOf()
(There are also the variants such as .search(), .lastIndexOf(), etc. Those I have not tested).
Here is the test:
var test = 'test string';
console.time('match');
console.log(test.match(/string/));
console.timeEnd('match');
console.time('includes');
console.log(test.includes('string'));
console.timeEnd('includes');
console.time('indexOf');
console.log(test.indexOf('string') !== -1);
console.timeEnd('indexOf');
I know they are not loops, but they show that all three are basically the same speed. And honestly, each does different things depending on what you need (do you want to search by regex? Do you need to be pre-ECMAScript 2015 compatible? etc.; I have not even listed all of them), so is it really necessary to analyze it this much?
From my tests, sometimes indexOf() would win, sometimes one of the other ones would win.
Based on the browser, indexOf has different implementations (using graphs, trees, ...), so the time complexity of each indexOf also differs.
What is clear, though, is that implementing indexOf as O(n) would be naive, and I don't think any browser implements it as a simple loop. Therefore, using indexOf in a for loop is not the same as using two nested for loops.
So, this one:
// could be O(n*m) which m is so small
// could be O(log n)
// or any other O(something) that is for sure smaller than O(n^2)
console.time('1')
firstArray.forEach(item => {
  secondArray.indexOf(item)
})
console.timeEnd('1')
is different than:
// has O(n^2)
console.time('2')
firstArray.forEach(item => {
  secondArray.forEach(secondItem => {
    // extra things to do here
  })
})
console.timeEnd('2')

How to make min() choose two numbers

I would like min() to choose the two smallest numbers in a list, is this possible and how should I do it?
Actually, Math.min only gives you the single smallest value of a list, so you need to simply sort them all and grab the quantity you need:
var list = [7, 4, 5, 4, 2, 23, 4, 6, 4, 6];
var smalls = list.sort(function (a, b) { return a - b; }).slice(0, 2);
alert(smalls); // shows "2,4"
Sorting is a little inefficient, as it takes O(n log n) time. You could do the same thing with a single pass. Just keep two variables initialized to Infinity (so any element is smaller) or something similar.
For every element encountered, check it against the two variables and update them accordingly.
It is pretty simple, and it takes only linear time, O(n).
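For example, a single-pass version of that idea might look like this (a sketch; twoSmallest is a made-up name):

function twoSmallest(list) {
  let first = Infinity, second = Infinity;
  for (const x of list) {
    if (x < first) {
      second = first; // previous smallest becomes second smallest
      first = x;
    } else if (x < second) {
      second = x;
    }
  }
  return [first, second];
}

twoSmallest([7, 4, 5, 4, 2, 23, 4, 6, 4, 6]); // [2, 4]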
Search the documentation for "Math" (it is a JavaScript object with many useful math-related functions and constants).
Hint.

Why is looping through an Array so much faster than JavaScript's native `indexOf`?

Why is looping through an Array so much faster than JavaScript's native indexOf? Is there an error or something that I'm not accounting for? I expected native implementations would be faster.
              For Loop      While Loop     indexOf
Chrome 10.0   50,948,997    111,272,979    12,807,549
Firefox 3.6    9,308,421     62,184,430     2,089,243
Opera 11.10   11,756,258     49,118,462     2,335,347
http://jsben.ch/#/xm2BV
Five years on, a lot has changed in browsers. indexOf performance has increased and is now definitely better than any of the custom alternatives.
Chrome Version 49.0.2623.87 (64-bit)
Ok, looking at the other benchmarks here I am scratching my head at the way that most developers seem to do their benchmarking.
Apologies, but the way it is done leads to horribly wrong conclusions, so I have to go a bit meta and give a comment on the answers provided.
What is wrong with the other benchmarks here
Measuring where to find element 777 in an array that never changes, always leading to index 117, seems so inappropriate for obvious reasons that I have trouble explaining why. You can't reasonably extrapolate anything from such an overly specific benchmark! The only analogy I can come up with is performing anthropological research on one person, and then calling the findings a generalized overview of the entire culture of the country that this person lives in. The other benchmarks aren't much better.
Even worse: the accepted answer is an image without a link to the benchmark that was used, so we have no way to verify that the code for that benchmark is correct (I hope it is a screenshot of a jsperf link that was originally in the question and later edited out in favour of the new jsben.ch link). It's not even an explanation of the original question: why one performs better than the other (a highly debatable statement to begin with).
First, you should know that not all benchmarking sites are created equal - some can add significant errors to certain types of measurements due to their own framework interfering with the timing.
Now, we are supposed to be comparing the performance of different ways to do linear search on an array. Think about the algorithm itself for a second:
look at a value for a given index into an array.
compare the value to another value.
if equal, return the index
if it is not equal, move to the next index and compare the next value.
That's the whole linear search algorithm, right?
So some of the linked benchmarks compare sorted and unsorted arrays (sometimes incorrectly labeled "random", despite being in the same order each iteration - relevant XKCD). It should be obvious that this does not affect the above algorithm in any way - the comparison operator does not see that all values increase monotonically.
Yes, ordered vs unsorted arrays can matter, when comparing the performance of linear search to binary or interpolation search algorithms, but nobody here is doing that!
Furthermore, all benchmarks shown use a fixed length array, with a fixed index into it. All that tells you is how quickly indexOf finds that exact index for that exact length - as stated above, you cannot generalise anything from this.
Here is the result of more-or-less copying the benchmark linked in the question to perf.zone (which is more reliable than jsben.ch), but with the following modifications:
we pick a random value of the array each run, meaning we assume each element is as likely to be picked as any other
we benchmark for 100 and for 1000 elements
we compare integers and short strings.
https://run.perf.zone/view/for-vs-while-vs-indexof-100-integers-1516292563568
https://run.perf.zone/view/for-vs-while-vs-indexof-1000-integers-1516292665740
https://run.perf.zone/view/for-vs-while-vs-indexof-100-strings-1516297821385
https://run.perf.zone/view/for-vs-while-vs-indexof-1000-strings-1516293164213
Here are the results on my machine:
https://imgur.com/a/fBWD9
As you can see, the result changes drastically depending on the benchmark and the browser being used, and each of the options wins in at least one of the scenarios: cached length vs uncached length, while loop vs for-loop vs indexOf.
So there is no universal answer here, and this will surely keep changing in the future as browsers and hardware change.
Should you even be benchmarking this?
It should be noted that before you proceed to build benchmarks, you should determine whether or not the linear search part is a bottleneck to begin with! It probably isn't, and if it is, the better strategy is probably to use a different data structure for storing and retrieving your data anyway, and/or a different algorithm.
That is not to say that this question is irrelevant - it is rare, but it can happen that linear search performance matters; I happen to have an example of that: establishing the speed of constructing/searching through a prefix trie constructed through nested objects (using dictionary look-up) or nested arrays (requiring linear search).
As can be seen in this GitHub comment, the benchmarks involve various realistic and best/worst-case payloads on various browsers and platforms. Only after going through all that do I draw conclusions about expected performance. In my case, for most realistic situations the linear search through an array is faster than dictionary look-up, but worst-case performance is worse to the point of freezing the script (and easy to construct), so the implementation was marked as an "unsafe" method to signal to others that they should think about the context the code would be used in.
Jon J's answer is also a good example of taking a step back to think about the real problem.
What to do when you do have to micro-benchmark
So let's assume we know that we did our homework and established that we need to optimize our linear search.
What matters then is the eventual index at which we expect to find our element (if at all), the type of data being searched, and of course which browsers to support.
In other words: is any index equally likely to be found (uniform distribution), or is it more likely to be centered around the middle (normal distribution)? Will we find our data at the start or near the end? Is our value guaranteed to be in the array, or only a certain percentage of the time? What percentage?
Am I searching an array of strings? Objects? Numbers? If they're numbers, are they floating-point values or integers? Are we trying to optimize for older smartphones, up-to-date laptops, or school desktops stuck with IE10?
This is another important thing: do not optimize for best-case performance, optimize for realistic worst-case performance. If you are building a web app where 10% of your customers use very old smartphones, optimize for that; their experience will be unbearable with bad performance, while the micro-optimization is wasted on the newest generation of flagship phones.
Ask yourself these questions about the data you are applying linear search to, and the context within which you do it. Then make test-cases fitting for these criteria, and test them on the browsers/hardware that represents the targets you are supporting.
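As a starting point, a benchmark along those lines might randomize the target on every run instead of searching for one fixed value. A rough sketch (note that the Math.random call sits inside the timed region, adding a small constant cost to every variant equally):

const N = 1000;
const arr = Array.from({ length: N }, (_, i) => i);

function bench(label, search, runs = 100000) {
  console.time(label);
  for (let r = 0; r < runs; r++) {
    // Uniformly distributed target: every index is equally likely.
    const target = arr[Math.floor(Math.random() * N)];
    search(target);
  }
  console.timeEnd(label);
}

bench('indexOf', t => arr.indexOf(t));
bench('for-loop', t => {
  for (let i = 0; i < arr.length; i++) {
    if (arr[i] === t) return i;
  }
  return -1;
});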
Probably because the actual indexOf implementation is doing a lot more than just looping through the array. You can see the Firefox internal implementation of it here:
https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Array/indexOf
There are several things that can slow down the loop that are there for sanity purposes:
The array is being re-cast to an Object
The fromIndex is being cast to a Number
They're using Math.max instead of a ternary
They're using Math.abs
indexOf does a bunch of type-checking and validation that the for loop and while loop ignore.
Here's the indexOf algorithm:
https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Array/indexOf
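To make that concrete, here is a rough sketch of the extra work the spec requires, simplified from the polyfill MDN used to show (this is not the engine's actual code):

function indexOfSketch(arr, searchElement, fromIndex) {
  const O = Object(arr);                 // the array is re-cast to an Object
  const len = O.length >>> 0;            // length coerced to an unsigned integer
  let n = Number(fromIndex) || 0;        // fromIndex is cast to a Number
  n = n < 0 ? Math.max(len + n, 0) : n;  // negative fromIndex handled via Math.max
  for (let k = n; k < len; k++) {
    // The `k in O` check skips holes in sparse arrays.
    if (k in O && O[k] === searchElement) return k;
  }
  return -1;
}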
Edit: My guess is indexOf is faster for big arrays because it caches the length of the array before looping through it.
Run the test one more time with the edits I've made.
I've increased the size of the array, and made the index you're searching for larger as well. It seems in large arrays indexOf may be a faster choice.
http://jsben.ch/#/xm2BV
EDIT: Based on more tests, indexOf seems to run faster than a for loop in the version of Safari I'm using (5.0.3) and slower in just about everything else.
It might be worth noting that if all you are trying to do is keep a list of items and check for existence (e.g. avoid adding duplicate IDs to an array), it would be far faster to keep an OBJECT with keys that reflect each ID. If you think I'm wrong, compare the following with an array + indexOf. We are talking 181ms for the object method vs. 1 MINUTE for the array indexOf method.
var objs = []
var i_uid = {} // method 1: object keyed by uid
var a_uid = [] // method 2: array searched with indexOf
var total_count = 100000, idLen = 5
var ts, te, cObj = 0

// method 1: object look-up
ts = new Date()
while (cObj < total_count) {
  var u = uid(idLen),
    o = {
      uid: u,
      text: 'something',
      created: new Date()
    }
  if (!(u in i_uid)) { // ensure unique uids only ('in' avoids treating a stored index of 0 as falsy)
    objs.push(o)
    i_uid[u] = cObj // current array position as placeholder
    cObj++
  }
  else {
    console.log('unique violation [duplicate uid', u, ']')
  }
}
te = new Date()
console.log('loaded ' + total_count + ' with object method in', (te - ts), 'ms')

i_uid = {} // free up
cObj = 0 // reset
objs = [] // reset

// method 2: array + indexOf
ts = new Date()
while (cObj < total_count) {
  var u = uid(idLen),
    o = {
      uid: u,
      text: 'something',
      created: new Date()
    }
  if (a_uid.indexOf(u) == -1) { // ensure unique uids only
    objs.push(o)
    a_uid.push(u)
    cObj++
  }
  else {
    console.log('unique violation [duplicate uid', u, ']')
  }
}
te = new Date()
console.log('loaded ' + total_count + ' with array + indexOf method in', (te - ts), 'ms')

// random alphanumeric id of length l
function uid(l) {
  var t = '',
    p = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
    pl = p.length
  for (var i = 0; i < l; i++)
    t += p.charAt(Math.floor(Math.random() * pl))
  return t
}
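For modern code, a Set expresses the same idea more directly (a sketch; addUnique is a made-up helper):

const seen = new Set();
function addUnique(list, id) {
  if (seen.has(id)) return false; // duplicate; .has() is O(1) on average
  seen.add(id);
  list.push(id);
  return true;
}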

JavaScript sort with numbers

The following program (taken from a tutorial) prints the numbers in an array in order from lowest to highest. In this case, the result will be 2,4,5,13,31.
My question relates to the parameters a and b of the function compareNumbers. When the function is called in numArray.sort(compareNumbers), which numbers become the parameters a and b? Does it just move along the array? For example, does it start with a=13 and b=2? After that, does the function run again comparing a=2 and b=31, or would it next compare a=31 and b=4?
Can someone please explain how that part works, and also how it manages to sort them from lowest to highest? I don't see how the function manages to do the necessary calculations on the numbers in the array.
function compareNumbers(a,b) {
return a - b;
}
var numArray = [13,2,31,4,5];
alert(numArray.sort(compareNumbers));
The particular pairs that get passed in depend on the sorting algorithm being used. As the algorithm tries to go about sorting the range, it needs to be able to compare pairs of values to determine their ordering. Whenever this happens, it will call your function to get that comparison.
Because of this, without inside knowledge about how the sorting algorithm works, you cannot predict what pairs will get compared. The choice of algorithm will directly determine what elements get compared and in what order.
Interestingly, though, you can actually use the comparison function to visualize how the sort works or to reverse-engineer the sorting algorithm! The website sortviz.org has many visualizations of sorting algorithms generated by passing custom comparators into sorting functions that track the positions of each element. If you take a look, you can see how differently each algorithm moves its elements around.
Even more interestingly, you can use comparison functions as offensive weapons! Some sorting algorithms, namely quicksort, have particular inputs that can cause them to run much more slowly than usual. In "A Killer Adversary for Quicksort," the author details how to use a custom comparator to deliberately construct a bad input for a sorting algorithm.
Hope this helps!
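If you are curious, you can watch the pairs your own engine picks by logging from inside the comparator (a quick sketch; the output differs between engines):

var numArray = [13, 2, 31, 4, 5];
numArray.sort(function (a, b) {
  console.log('comparing', a, 'and', b);
  return a - b;
});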
The two parameters will be elements of your array. The system will compare enough pairs to be able to sort them correctly. Nothing else is guaranteed.
There are lots of things the sort method could be doing under the hood; see, e.g., http://en.wikipedia.org/wiki/Sorting_algorithm for some of them. Most Javascript implementations probably use some variant of either quicksort or mergesort.
(Here are super-brief descriptions of those. Quicksort is: pick an element in the array, rearrange the array to put everything smaller than that in front of everything larger, then sort the "smaller" and "larger" bits. Mergesort is: sort the first half of the array, sort the second half of the array, and then merge the two sorted halves. In both cases you need to sort smaller arrays, which you do with the same algorithm until you get to arrays so small that sorting them is trivial. In both cases, good practical implementations do all sorts of clever things I haven't mentioned.)
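For instance, a compact mergesort matching that description might look like this (a sketch, without the clever practical tweaks mentioned above):

function mergeSort(arr) {
  if (arr.length <= 1) return arr;            // trivially sorted
  const mid = arr.length >> 1;
  const left = mergeSort(arr.slice(0, mid));  // sort the first half
  const right = mergeSort(arr.slice(mid));    // sort the second half
  const out = [];                             // merge the sorted halves
  let i = 0, j = 0;
  while (i < left.length && j < right.length) {
    out.push(left[i] <= right[j] ? left[i++] : right[j++]);
  }
  return out.concat(left.slice(i), right.slice(j));
}

mergeSort([13, 2, 31, 4, 5]); // [2, 4, 5, 13, 31]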
It will be called for whichever pairs of a and b the sorting algorithm needs in order to get the whole array sorted. Check out http://en.wikipedia.org/wiki/Sorting_algorithm for a brief list of sorting algorithms.
When you pass a function to Array.sort(), that function is expected to take two parameters and return a numerical value.
If you return a negative value, the first parameter will be placed before the second parameter in the array.
If you return a positive value, the first parameter will be placed after the second parameter in the array.
If you return 0, the two elements keep their relative positions.
By doing return a - b;, you are returning a negative number if a is less than b (2 - 13 = -11), a positive number if b is less than a (13 - 2 = 11), and zero if they are equal (13 - 13 = 0).
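For example, swapping the operands reverses the order:

[13, 2, 31, 4, 5].sort(function (a, b) { return b - a; }); // [31, 13, 5, 4, 2]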
As far as which numbers are compared in what order, I believe that is up to the javascript engine to determine.
Check out the documentation on javascript array sorting at the MDC Doc Center for more detailed information.
https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Array/sort
(BTW, I always check the MDC Doc Center for any questions about how javascript works, they have the best information on the language AFAIK.)
