String sequence similarity / difference ratio in javascript and python - javascript

Say I had a reference string
"abcdabcd"
and a target string
"abcdabEd"
Is there a simple way in javascript and python to get the string sequence similarity ratio?
Example:
"abcdabcd" differs from "abcdabEd" only by the character "E", so the similarity ratio is high but less than 1.0.
"bcdabcda" differs greatly from "abcdabEd" because the character at every string index is different, so the similarity ratio is 0.0.
Note that the similarity ratio is not how many characters the two strings have in common, but how similar the sequences are position by position.
Therefore code like
# python - incorrect for this problem
difflib.SequenceMatcher(None, "bcdabcda", "abcdabEd").ratio()
would be wrong

You can use this general formula; it works with strings or arrays of objects of the same or different lengths:
similarity=#common/(sqrt(nx*ny));
where #common are the common occurrences (in this case the number of matching characters);
nx is the length of the array of objects x (or the string called x);
ny is the length of the array of objects y (or the string called y).
If the length of the strings is the same that formula reduces to the simple case:
similarity=#common/n;
where:
n=nx=ny.
In python this formula for similarity of strings (considering the order of characters, as you want) can be written as:
from math import sqrt
def similarity(x, y):
    n = min(len(x), len(y))
    common = 0
    for i in range(n):
        if x[i] == y[i]:
            common += 1
    return common / sqrt(len(x) * len(y))
and in javascript it's analogous.
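For reference, a JavaScript version of the same formula might look like the sketch below. The function name mirrors the Python one, and the guard for empty inputs is an added assumption that the formula itself does not cover.

```javascript
// Position-wise similarity: matching positions divided by sqrt(nx * ny).
function similarity(x, y) {
  const n = Math.min(x.length, y.length);
  let common = 0;
  for (let i = 0; i < n; i++) {
    if (x[i] === y[i]) common += 1;
  }
  // Guard against dividing by zero when either input is empty (an added assumption).
  const denominator = Math.sqrt(x.length * y.length);
  return denominator === 0 ? 0 : common / denominator;
}

console.log(similarity("abcdabcd", "abcdabEd")); // 0.875
console.log(similarity("bcdabcda", "abcdabEd")); // 0
```

Strings are compared by index here, so characters outside the Basic Multilingual Plane count as two positions.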

How about:
float(sum([a==b for a,b in zip(my_string1,my_string2)]))/len(my_string1)
>>> s1,s2 = "abcdabcd","abcdabEd"
>>> print(float(sum([a==b for a,b in zip(s1,s2)]))/len(s1))
0.875

Related

How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?

I am wondering how to generate a GUID given an input string, such that the same input string results in the same GUID (sort of like an MD5 hash). The problem with MD5 hashes is they just guarantee low collision rate, rather than uniqueness. Instead I would like something like this:
guid('v1.0.0') == 1231231231123123123112312312311231231231
guid('v1.0.1') == 6154716581615471658161547165816154716581
guid('v1.0.2') == 1883939319188393931918839393191883939319
How would you go about implementing this sort of thing (ideally in JavaScript)? Is it even possible to do? I am not sure where to start. Things like the uuid module don't take a seed string, and they don't let you use a custom format/alphabet.
I am not looking for the canonical UUID format, but rather a GUID, ideally one made up of just integers.
What you would need is define a one-to-one mapping of text strings (such as "v1.0.0") onto 40 digit long strings (such as "123123..."). This is also known as a bijection, although in your case an injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As you note, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.
However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.
For example, you can't generally map all 120-character strings one-to-one with all 40-digit strings unless you restrict the format of the 120-character strings in some way. However, your problem of creating 40-digit output strings can be solved if you can accept limiting input strings to no more than 10^40 values (about 132 bits), or if you can otherwise exploit redundancy in the input strings so that they are guaranteed to compress losslessly to 40 decimal digits (about 132 bits) or less, which may or may not be possible. See also this question.
The algorithm involves two steps:
First, transform the string to a BigInt by building up the string's charCodeAt() values similarly to the stringToInt method given in another answer. Throw an error if any charCodeAt() is 0x80 or greater, or if the resulting BigInt is equal to or greater than BigInt(alphabet_length)**BigInt(output_length).
Then, transform the integer to another string by taking the mod of the BigInt and the output alphabet's size and replacing each remainder with the corresponding character in the output alphabet, until the BigInt reaches 0.
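The two steps above can be sketched roughly as follows. This is only an illustration: the name stringToGuid, its parameters, the base-128 encoding, the rejection of character code 0, and the left-padding to a fixed width are all assumptions, not something the answer specifies. It also assumes the alphabet contains at least two distinct characters.

```javascript
// Sketch: map a short ASCII string injectively onto a fixed-length string
// over a custom alphabet. Names and defaults are illustrative only.
function stringToGuid(input, alphabet = "0123456789", outputLength = 40) {
  const base = BigInt(alphabet.length);
  const limit = base ** BigInt(outputLength);

  // Step 1: read the string as a base-128 number.
  let value = 0n;
  for (let i = 0; i < input.length; i++) {
    const code = input.charCodeAt(i);
    // Code 0 is also rejected (an added assumption) so that a leading NUL
    // cannot make two different strings produce the same number.
    if (code === 0 || code >= 0x80) {
      throw new RangeError(`Unsupported character at index ${i}`);
    }
    value = value * 128n + BigInt(code);
  }
  if (value >= limit) {
    throw new RangeError("Input is too long for the requested output length");
  }

  // Step 2: rewrite the number in the output alphabet, one remainder at a time.
  let digits = "";
  while (value > 0n) {
    digits = alphabet[Number(value % base)] + digits;
    value /= base;
  }
  // Left-pad with the alphabet's first symbol to get a fixed width.
  return digits.padStart(outputLength, alphabet[0]);
}

console.log(stringToGuid("v1.0.0"));
console.log(stringToGuid("v1.0.0") === stringToGuid("v1.0.0")); // true
console.log(stringToGuid("v1.0.0") === stringToGuid("v1.0.1")); // false
```

With the default 40-digit decimal output, this sketch accepts inputs of up to 18 ASCII characters, since 128^18 is below 10^40; longer inputs may be rejected.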
One approach would be to use the method from that answer:
/*
* uuid-timestamp (emitter)
* UUID v4 based on timestamp
*
* Created by tarkh
* tarkh.com (C) 2020
* https://stackoverflow.com/a/63344366/1261825
*/
const uuidEmit = () => {
  // Get now time
  const n = Date.now();
  // Generate random
  const r = Math.random(); // <- swap this
  // Stringify now time and generate additional random number
  const s = String(n) + String(~~(r*9e4)+1e4);
  // Form UUID and return it
  return `${s.slice(0,8)}-${s.slice(8,12)}-4${s.slice(12,15)}-${[8,9,'a','b'][~~(r*3)]}${s.slice(15,18)}-${s.slice(s.length-12)}`;
};
// Generate 5 UUIDs
console.log(`${uuidEmit()}
${uuidEmit()}
${uuidEmit()}
${uuidEmit()}
${uuidEmit()}`);
And simply swap out the Math.random() call to a different random function which can take your seed value. (There are numerous algorithms out there for creating a seedable random method, so I won't try prescribing a particular one).
Most seedable generators expect a numeric seed, so you could convert a seed string to an integer by adding up the character values, multiplying each by a power of 10 based on its position. Note that this does not guarantee a unique number, because character codes are larger than 9 (for example, "ab" and "bX" both give 10680):
const stringToInt = str =>
  Array.prototype.slice.call(str).reduce((result, char, index) => result += char.charCodeAt(0) * (10**(str.length - index)), 0);
console.log(stringToInt("v1.0.0"));
console.log(stringToInt("v1.0.1"));
console.log(stringToInt("v1.0.2"));
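As an illustration of the swap described above, here is one small seedable generator. It follows a commonly circulated version of mulberry32; the choice of generator is an assumption for the sake of example, not something the answer prescribes, and it is not suitable for anything security-related.

```javascript
// mulberry32: a small seedable 32-bit generator, used here purely as an example.
function mulberry32(seed) {
  let a = seed >>> 0; // keep the state as an unsigned 32-bit integer
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// The same numeric seed always yields the same sequence.
const randomA = mulberry32(12345);
const randomB = mulberry32(12345);
console.log(randomA() === randomB()); // true
console.log(randomA() === randomB()); // true
```

Only the low 32 bits of the seed are used, so a large integer derived from a string is reduced modulo 2^32 and different strings can end up with the same seed.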
If you want to generate the exact same string every time, you can take a similar approach to tarkh's uuidEmit() method but get rid of the bits that change:
const strToInt = str =>
  Array.prototype.slice.call(str).reduce((result, char, index) => result += char.charCodeAt(0) * (10**(str.length - index)), 0);
const strToId = (str, len = 40) => {
  // Generate random
  const r = strToInt(str);
  // Multiply the number by some things to get it to the right number of digits
  const rLen = `${r}`.length; // length of r as a string
  // If you want to avoid any chance of collision, you can't provide too long of a string
  // If a small chance of collision is okay, you can instead just truncate the string to
  // your desired length
  if (rLen > len) throw new Error('String too long');
  // our string length is n * (r+m) + e = len, so we'll do some math to get n and m
  const mMax = 9; // maximum for the exponent, too much longer and it might be represented as an exponent. If you discover "e" showing up in your string, lower this value
  let m = Math.floor(Math.min(mMax, len / rLen)); // exponent
  let n = Math.floor(len / (m + rLen)); // number of times we repeat r and m
  let e = len - (n * (rLen + m)); // extra to pad us to the right length
  return (new Array(n)).fill(0).map((_, i) => String(r * (i * 10**m))).join('')
    + String(10**e);
};
console.log(strToId("v1.0.0"));
console.log(strToId("v1.0.1"));
console.log(strToId("v1.0.2"));
console.log(strToId("v1.0.0") === strToId("v1.0.0")); // check they are the same
console.log(strToId("v1.0.0") === strToId("v1.0.1")); // check they are different
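Note: this will only work with short strings (probably about 10 characters at most). You could tweak it to handle longer strings (remove the multiplying bit from stringToInt), but then you increase the risk of collisions. Also keep in mind that the string-to-integer step is itself not collision-free (for example, "ab" and "bX" both map to 10680), so uniqueness is not guaranteed for arbitrary inputs.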
I suggest using MD5...
Following the classic birthday problem, all things being equal, the odds of 2 people sharing a birthday out of a group of 23 people is ( see https://en.wikipedia.org/wiki/Birthday_problem )...
For estimating MD5 collisions, I'm going to simplify the birthday problem formula, erring in the favor of predicting a higher chance of a collision...
Note though that whereas in the birthday problem, a collision is a positive result, in the MD5 problem, a collision is a negative result, and therefore providing higher than expected collision odds provides a conservative estimate of the chance of an MD5 collision. Plus this higher predicted chance can in some way be considered a fudge factor for any uneven distribution in the MD5 output, although I do not believe there is any way to quantify this without a God computer...
An MD5 hash is 16 bytes long, resulting in a range of 256^16 possible values. Assuming that the MD5 algorithm is generally uniform in its results, let's suppose we create one quadrillion (i.e., a million billion or 10^15) unique strings to run through the hash algorithm. Then using the modified formula (to ease the collision calculations and to add a conservative fudge factor), the odds of a collision are...
So, after 10^15 or one quadrillion unique input strings, the estimated odds of a hash collision are on par with the odds of winning the Powerball or the Mega Millions Jackpot (which are on order of 1 in ~300,000,000 per https://www.engineeringbigdata.com/odds-winning-powerball-grand-prize-r/ ).
Note too that 256^16 is 340282366920938463463374607431768211456, which is 39 digits, falling within the desired range of 40 digits.
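The formulas used in this answer are not reproduced here. As a rough check, assuming the simplified bound is p <= n^2 / N for n inputs and N possible hash values (an assumption: it overestimates the usual birthday approximation n^2 / (2N) and is consistent with the odds quoted above), the numbers can be worked out with BigInt:

```javascript
const hashSpace = 256n ** 16n; // number of distinct 16-byte MD5 values
const inputs = 10n ** 15n;     // one quadrillion unique input strings

// Conservative bound: p <= inputs^2 / hashSpace, so the odds are about 1 in:
const oneIn = hashSpace / (inputs * inputs);

console.log(hashSpace.toString().length); // 39
console.log(oneIn);                       // 340282366n
```

That is roughly 1 in 340 million, the same order of magnitude as the lottery odds mentioned above.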
So, suggest using the MD5 hash ( converting to BigInt ), and if you do run into a collision, I will be more than glad to spot you a lottery ticket, just to have a chance to tap into your luck and split the proceeds...
( Note: I used https://keisan.casio.com/calculator for the calculations. )
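A minimal sketch of the MD5-to-BigInt suggestion, assuming a Node.js environment with the built-in crypto module. The function name md5Decimal and the zero-padding to 40 digits are illustrative choices, not part of the answer.

```javascript
const crypto = require("crypto"); // Node.js built-in module

// Hash the input with MD5 and render the 128-bit digest as 40 decimal digits.
function md5Decimal(input) {
  const hex = crypto.createHash("md5").update(input, "utf8").digest("hex");
  return BigInt("0x" + hex).toString(10).padStart(40, "0");
}

console.log(md5Decimal("v1.0.0"));
console.log(md5Decimal("v1.0.0") === md5Decimal("v1.0.0")); // true
```

As the question points out, this gives a very low collision probability rather than guaranteed uniqueness.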
While UUID v4 is just used for random ID generation, UUID v5 is more like a hash for a given input string and namespace. It's perfect for what you describe.
As you already mentioned, you can use this npm package:
npm install uuid
And it's pretty easy to use.
import {v5 as uuidv5} from 'uuid';
// use a UUIDV4 as a unique namespace for your application.
// you can generate one here: https://www.uuidgenerator.net/version4
const UUIDV5_NAMESPACE = '...';
// Finally, provide the input and namespace to get your unique id.
const uniqueId = uuidv5(input, UUIDV5_NAMESPACE);

String comparison - Javascript

I'm trying to get my head around string comparisons in Javascript
function f(str){
  return str[0] < str[str.length -1]
}
f("a+"); // false
In ASCII: 'a' == 97, '+' == 43
Am I correct in thinking my test: f(str) is based on ASCII values above?
You don't need a function or a complicated test pulling a string apart for this. Just do 'a' < '+' and learn from what happens. Or, more simply, check the char's charcode using 'a'.charCodeAt(0).
You are almost right. It is based on Unicode code units (not code points; these are the 16-bit UTF-16 encoded values), not on ASCII values.
From the ECMAScript 2015 specification:
If both px and py are Strings, then
If py is a prefix of px, return false. (A String value p is a prefix of String value q if q can be the result of concatenating p and some other String r. Note that any String is a prefix of itself, because r may be the empty String.)
If px is a prefix of py, return true.
Let k be the smallest nonnegative integer such that the code unit at index k within px is different from the code unit at index k within py. (There must be such a k, for neither String is a prefix of the other.)
Let m be the integer that is the code unit value at index k within px.
Let n be the integer that is the code unit value at index k within py.
If m < n, return true. Otherwise, return false.
NOTE 2
The comparison of Strings uses a simple lexicographic ordering on
sequences of code unit values. There is no attempt to use the more
complex, semantically oriented definitions of character or string
equality and collating order defined in the Unicode specification.
Therefore String values that are canonically equal according to the
Unicode standard could test as unequal. In effect this algorithm
assumes that both Strings are already in normalized form. Also, note
that for strings containing supplementary characters, lexicographic
ordering on sequences of UTF-16 code unit values differs from that on
sequences of code point values.
Basically it means that string comparison is based on a lexicographical order of "code units", which are the 16-bit numeric values of the string's UTF-16 encoding (for ASCII characters they coincide with the ASCII codes).
JavaScript engines are allowed to use either UCS-2 or UTF-16 (which is the same for most practical purposes).
So, technically, your function is based on UTF-16 values and you were comparing 0x0061 and 0x002B.
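A short illustration of the difference between code unit order and code point order; the two sample characters are arbitrary choices made for this example.

```javascript
// Relational comparison of strings works on UTF-16 code units.
console.log("a".charCodeAt(0), "+".charCodeAt(0)); // 97 43
console.log("a" < "+");                            // false, because 97 > 43

// For supplementary characters, code unit order and code point order can disagree.
const fullwidthTilde = "\uFF5E";  // U+FF5E, a single code unit 0xFF5E
const grinningFace = "\u{1F600}"; // U+1F600, stored as the surrogate pair 0xD83D 0xDE00
console.log(fullwidthTilde.codePointAt(0) < grinningFace.codePointAt(0)); // true
console.log(fullwidthTilde < grinningFace); // false, because 0xFF5E > 0xD83D
```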

Working with string (array?) of bits of an unspecified length

I'm a javascript code monkey, so this is virgin territory for me.
I have two "strings" that are just zeros and ones:
var first = "00110101011101010010101110100101010101010101010";
var second = "11001010100010101101010001011010101010101010101";
I want to perform a bitwise & (which I've never before worked with) to determine if there's any index where 1 appears in both strings.
These could potentially be VERY long strings (in the thousands of characters). I thought about adding them together as numbers, then converting to strings and checking for a 2, but javascript can't hold precision in large intervals and I get back numbers as strings like "1.1111111118215729e+95", which doesn't really do me much good.
Can I take two strings of unspecified length (they may not be the same length either) and somehow use a bitwise & to compare them?
I've already built the loop-through-each-character solution, but 1001^0110 would strike me as a major performance upgrade. Please do not give the javascript looping solution as an answer, this question is about using bitwise operators.
As you already noticed yourself, JavaScript has limited capabilities when it comes to large integer values. You'll have to chop your strings into "edible" portions and work your way through them. Since the parseInt() function accepts a base, you could convert up to 32 characters at a time to a 4-byte int (JavaScript's bitwise operators work on 32-bit integers, and parseInt() loses precision beyond 53 bits) and use the AND operator to test for set bits (if ((a & b) != 0); the parentheses matter because != binds more tightly than &).
// Caution: these strings are longer than 53 bits, so parseInt() rounds them and ^
// only sees the low 32 bits; the results below are not an exact bitwise XOR.
var first = "00110101011101010010101110100101010101010101010010001001010001010100011111",
    second = "10110101011101010010101110100101010101010101010010001001010001010100011100",
    firstInt = parseInt(first, 2),
    secondInt = parseInt(second, 2),
    xorResult = firstInt ^ secondInt, //524288
    xorString = xorResult.toString(2); //"10000000000000000000"
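A sketch of the chunked approach for the original AND question, under a few assumptions: indexes are counted from the left, only the overlapping part of the two strings is compared, and the function names are made up for illustration. A BigInt variant is included for engines that support it.

```javascript
// Compare 32 characters at a time so each chunk fits a 32-bit bitwise operation.
function hasCommonSetBit(first, second, chunkSize = 32) {
  const length = Math.min(first.length, second.length);
  for (let i = 0; i < length; i += chunkSize) {
    const end = Math.min(i + chunkSize, length);
    // Both slices have the same length, so their bit positions line up.
    const a = parseInt(first.slice(i, end), 2);
    const b = parseInt(second.slice(i, end), 2);
    if ((a & b) !== 0) return true;
  }
  return false;
}

// Alternative: BigInt handles arbitrarily long bit strings in one operation.
function hasCommonSetBitBigInt(first, second) {
  const length = Math.min(first.length, second.length);
  if (length === 0) return false;
  const a = BigInt("0b" + first.slice(0, length));
  const b = BigInt("0b" + second.slice(0, length));
  return (a & b) !== 0n;
}

console.log(hasCommonSetBit("1001", "0110")); // false
console.log(hasCommonSetBit("1001", "0101")); // true
```

Whether either variant is actually faster than a per-character loop depends on the engine and the input sizes, so it is worth measuring.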

Memory Size: What's Smallest Between String or Array

var string = '';
var array = [];
for(var i = 0; i < 10000; i++){
  string += '0';
  array.push(0);
}
Which one would be smaller? When/where is the breakpoint between the two?
Note: The numbers are always 1 digit.
Creating the array is about 50% faster than creating the string.
Based on the answer here, you can roughly calculate the size of different data-types in JavaScript.
The equations used, pertaining directly to your question, to calculate the size in bytes:
string = string.length * 2
number = 8
Based on this, the size of your array variable would depend on the content-type being placed in it. As you're inserting numeric values, each offset would be 8 bytes, so:
array[number] = array.length * 8
With these equations, the sizes are:
string = 20000
array = 80000
If you were to use array.push('0') instead (i.e. use strings), the sizes of string and array should be roughly equal.
References:
The String Type - ECMAScript Language Specification:
The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values.
The Number Type - ECMAScript Language Specification:
The Number type has exactly 18437736874454810627 (that is, 2^64−2^53+3) values, representing the double-precision 64-bit format IEEE 754 values as specified in the IEEE Standard for Binary Floating-Point Arithmetic
To store small numbers in an array, the best way is to use an Int8Array.
(https://developer.mozilla.org/en-US/docs/Web/API/Int8Array).
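The rough estimates above can be reproduced with a few lines. This only applies the rule-of-thumb formulas (2 bytes per string character, 8 bytes per number); it does not measure what an engine actually allocates, which may differ considerably. The typed array is the one case where the byte size is exposed directly.

```javascript
const count = 10000;
let string = '';
const array = [];
for (let i = 0; i < count; i++) {
  string += '0';
  array.push(0);
}
const typed = new Int8Array(count); // one byte per element

const estimate = {
  string: string.length * 2, // rule of thumb: 2 bytes per UTF-16 code unit
  array: array.length * 8,   // rule of thumb: 8 bytes per number
  int8Array: typed.byteLength,
};
console.log(estimate); // { string: 20000, array: 80000, int8Array: 10000 }
```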
The array will be faster always.
With the string, each time you append, the runtime has to allocate space for the new string, and then throw away the last version of the string.
With the array, it's just extending a linked list.
http://en.wikipedia.org/wiki/Linked_list
On the other hand, the string will probably consume less memory since all the data will be in a single contiguous block of RAM, whereas the array will have the data and all the linked-list pointers too.

Javascript array sort speed affected by string length?

Just wondering, I have seen diverging opinions on this subject.
If you take an array of strings, say 1000 elements and use the sort method. Which one would be faster? An array in which the strings are 100 characters long or one in which the strings are only 3 characters long?
I tried to test but I have a bug with Firebug at the moment and Date() appears too random.
Thank you!
It depends on what the strings contain. If they differ in their first few characters, the rest of each string doesn't have to be checked during comparison, so the length doesn't matter.
For example, "abc" < "bca" Here only the first character had to be checked.
You can read the specs for this: http://ecma-international.org/ecma-262/5.1/#sec-11.8.5
Specifically:
Else, both px and py are Strings
If py is a prefix of px, return false. (A String value p is a prefix of String value
q if q can be the result of concatenating p and some other String r. Note that any
String is a prefix of itself, because r may be the empty String.)
If px is a prefix of py, return true.
Let k be the smallest nonnegative integer such that the character at position k within px is
different from the character at position k within py. (There must be such a k, for neither
String is a prefix of the other.)
Let m be the integer that is the code unit value for the character at position k within
px.
Let n be the integer that is the code unit value for the character at position k within
py.
If m < n, return true. Otherwise, return false.
It really depends on how different the strings are, but I guess the differences would be minimal due to the fact that what's called to do the comparison is way slower than actually comparing the strings.
But then again, modern browsers use some special optimizations for sort, so they cut some comparisons to speed things up. And this would happen more often sorting an array of short strings.
And FYI, if you want to make some benchmark, use a reliable tool like jsPerf.
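If you just want a quick local measurement, a rough sketch like the following may help. It assumes an environment where performance.now() is available globally (browsers and recent Node.js versions); the helper names are made up, and results will vary with the engine and with how early the random strings differ.

```javascript
// Build `count` random lowercase strings of the given length (illustrative helper).
function randomStrings(count, length) {
  const letters = "abcdefghijklmnopqrstuvwxyz";
  return Array.from({ length: count }, () => {
    let s = "";
    for (let i = 0; i < length; i++) {
      s += letters[Math.floor(Math.random() * letters.length)];
    }
    return s;
  });
}

// Time sorting fresh copies of the array and return the average in milliseconds.
function timeSort(strings, runs = 50) {
  const start = performance.now();
  for (let i = 0; i < runs; i++) {
    strings.slice().sort();
  }
  return (performance.now() - start) / runs;
}

console.log("3 chars:  ", timeSort(randomStrings(1000, 3)));
console.log("100 chars:", timeSort(randomStrings(1000, 100)));
```

Random strings almost always differ within the first character or two, which is the case where length should matter least; strings sharing long common prefixes would be a fairer worst case.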
