Best way to compare 2 similar strings? [closed] - javascript

Original Question:
I have a lot of products with various names, and for each product I have two variations of the name that I need to compare (basically to find out whether the two strings refer to the same product). I don't want any false positives. Does anyone have recommendations on how I can achieve this?
Here is a product example:
Canon 50mm f/1.2L vs Canon EF 50mm f/1.2L USM Lens
There are other variations, but this is the typical difference. Is there any easy functionality I could implement to get a definite answer? The only thing I can think of is maybe splitting the strings, comparing the parts, and saying it's a match if x matches a, b, or c.
My original question was a bit vague. The end goal is to be able to compare two strings and see how similar they are, e.g. 0%, 50%, or 100% similar. In this scenario I am using lens products from different sources; they use similar names, but I have no product SKU/ID for a proper comparison.
The string score plugin has solved my issue, providing a value of how similar these products are.

In the bioinformatics world, and I believe in other domains as well, this kind of pattern matching/searching is called fuzzy search.
There is a Node.js module called string_score for it. Essentially you feed the API two strings and it returns a score indicating how similar they are.
Example:
var test = require('string_score'); // requiring string_score adds a score() method to String.prototype
var match_percent = "Canon EF 50mm f/1.2L USM Lens".score("Canon 50mm f/1.2L");
console.log("Match score= " + match_percent);
Output:
Match score= 0.7938133874239354
Use the score as a baseline for comparison. For example, you can say that if a pair scores 0.8 or higher, it is a match.
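A minimal sketch of that threshold idea (it assumes string_score has already been loaded so that .score() exists on strings; the 0.8 cut-off is just a starting point to tune against your own data):
var isSameProduct = function (a, b, threshold) {
  threshold = threshold || 0.8; // tune this cut-off for your data
  return a.score(b) >= threshold;
};
// The example pair above scores ~0.79, so it passes a 0.75 cut-off but not 0.8:
console.log(isSameProduct("Canon EF 50mm f/1.2L USM Lens", "Canon 50mm f/1.2L", 0.75)); // true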
More examples:
var score = 0;
score = "hello world".score("he");
console.log("Match score => " + score);
score = "hello world".score("hel");
console.log("Match score => " + score);
score = "hello world".score("hell");
console.log("Match score => " + score);
score = "hello world".score("hello");
console.log("Match score => " + score);
<script type="text/javascript" src="//cdnjs.cloudflare.com/ajax/libs/string_score/0.1.10/string_score.min.js"></script>
References:
String_score: https://github.com/joshaven/string_score

You have to think about how you would recognize, just by reading them, that two strings describe the same product.
Based solely on the examples you provided, it seems that the way to tell that two strings represent the same product is to check whether every word (a token separated by spaces) from the shorter string is contained in the longer string.
You might also want to ignore capitalization.
Something like this should work for the basic usage:
const tokens = s => s.toLowerCase().split(/\s+/g);

const sameProducts = (s1, s2) => {
  const s1Tokens = tokens(s1);
  const s2Tokens = tokens(s2);
  const [shorterTokens, longerTokens] = s1Tokens.length > s2Tokens.length
    ? [s2Tokens, s1Tokens]
    : [s1Tokens, s2Tokens];
  return shorterTokens.every(st => longerTokens.includes(st));
}

console.log(
  sameProducts(
    'Canon 50mm f/1.2L',
    'Canon EF 50mm f/1.2L USM Lens'
  )
)
This code has quadratic time complexity because, for every token in the shorter string, it has to iterate through every token in the longer string.
A simple optimization would be to build a Set<token> from the longer string. This makes the operation linear, because looking up a value in a Set is O(1) on average.
const tokens = s => s.toLowerCase().split(/\s+/g);

const sameProducts = (s1, s2) => {
  const s1Tokens = tokens(s1);
  const s2Tokens = tokens(s2);
  const [shorterTokens, longerTokens] = s1Tokens.length > s2Tokens.length
    ? [s2Tokens, s1Tokens]
    : [s1Tokens, s2Tokens];
  const longerTokensSet = longerTokens.reduce((s, t) => {
    s.add(t);
    return s;
  }, new Set());
  return shorterTokens.every(st => longerTokensSet.has(st));
}

console.log(
  sameProducts(
    'Canon 50mm f/1.2L',
    'Canon EF 50mm f/1.2L USM Lens'
  )
)
Now you have to consider: do all tokens have to match, or maybe only the tokens corresponding to the brand and the focal length?
If that is the case, you might also want to validate both strings while parsing them and return false immediately if a product string is invalid.
Here's a rough idea:
const productSet = new Set(['canon'])
const focalLengthsSet = new Set(['50mm']);
const isMeaningful = t => productSet.has(t) || focalLengthsSet.has(t);
const meaningfulTokens = s => s.toLowerCase().split(/\s+/g).filter(isMeaningful);
const validTokens = (tokens, s) => {
const valid = tokens.length === 2; // <-- could do better validation here
console.assert(valid, `Missing token(s) in ${s}`);
return valid;
}
const sameProducts = (s1, s2) => {
const s1Tokens = meaningfulTokens(s1);
if (!validTokens(s1Tokens, s1)) { return false; }
const s2Tokens = meaningfulTokens(s2);
if (!validTokens(s2Tokens, s2)) { return false; }
const [shorterTokens, longerTokens] = s1Tokens.length > s2Tokens.length
? [s2Tokens, s1Tokens]
: [s1Tokens, s2Tokens];
const longerTokensSet = longerTokens.reduce((s, t) => {
s.add(t);
return s;
}, new Set());
return shorterTokens.every(st => longerTokensSet.has(st));
}
console.log(
sameProducts(
'Canon 50mm f/1.3',
'Canon EF 50mm f/1.2'
)
)
console.log(
sameProducts(
'Canon 50mm f/1.3',
'Canon EF f/1.2' // <-- missing focal length
)
)
You could also consider whether every focal length can appear with every product, or whether it is more product-specific, and whether any tokens depend on previously matched tokens.
All of the above are just basic approaches and techniques you could use; the actual solution will depend heavily on your exact circumstances.
A common algorithm for measuring string similarity is the Levenshtein distance.
The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
This algorithm would let you match the strings directly if your edit distance threshold is strict enough (although this could produce false positives), or you could account for misspelled products by comparing individual tokens and making sure they are within a specific edit distance of one another.
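A rough sketch of that token-level idea (the levenshtein helper and the maxEdits cut-off below are illustrative assumptions, not part of the answers above):
// Classic dynamic-programming Levenshtein distance.
const levenshtein = (a, b) => {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)));
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
};

// A token "matches" the other product name if some token there is within maxEdits of it.
const tokenMatches = (token, otherTokens, maxEdits = 1) =>
  otherTokens.some(other => levenshtein(token, other) <= maxEdits);

console.log(levenshtein('cannon', 'canon'));                  // 1
console.log(tokenMatches('cannon', ['canon', 'ef', '50mm'])); // true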

Related

Javascript - How to know how much string matched in another string?

I have been implementing a simple quiz for English. In it, we need to validate answers that users enter in an input field. In the current implementation, I am comparing the correct answer with the user's answer exactly, like this:
HTML
<input type="text" id="answer" />
<button onclick="validate()">Validate</button>
Javascript
var question = "Do you like movies?",
answer = "No, I don't like movies.";
function validate() {
var userInput = document.getElementById('answer').value;
if(answer == userInput) {
console.log("correct");
} else {
console.log("wrong");
}
}
But I don't want to validate exactly; I want to ignore case, commas, apostrophes, etc. For example, if the user enters
i dont like movies
the answer should be accepted as correct. I don't know how or where to start. Any help is appreciated.
One option would be to strip out everything except word characters and spaces, and compare the lower-cased version of each stripped string:
var question = "Do you like movies?",
answer = "No, I don't like movies.";
const normalize = str => str
.replace(/[^\w ]/g, '')
.toLowerCase();
function validate(userInput) {
const noramlizedInput = normalize(userInput)
const noramlizedAnswer = normalize(answer);
if (noramlizedInput == noramlizedAnswer) {
console.log("correct");
} else {
console.log("wrong");
}
}
validate('No i dont like movies');
validate("NO!!!!! I DON''t like movies.");
Another option would be to loop through all possible substrings of the userInput and figure out which has the most overlap with the desired answer, but that's a whole lot more complicated.
An easier option would be to check to see how many overlapping words there are:
var question = "Do you like movies?",
answer = "No, I don't like movies.";
const normalize = str => str
.replace(/[^\w ]/g, '')
.toLowerCase()
.split(/\s+/)
function validate(userInput) {
const noramlizedInputArr = normalize(userInput);
const noramlizedAnswerArr = normalize(answer);
const overlapCount = noramlizedInputArr.reduce((a, word) => (
a + Number(noramlizedAnswerArr.includes(word))
), 0);
console.log(overlapCount);
if (overlapCount >= 4) {
console.log("correct");
} else {
console.log("wrong");
}
}
validate('No i dont like movies');
validate("NO!!!!! I DON''t like movies.");
validate("i dont like movies.");
validate("Yes I like movies.");
If you are interested in simply catching spelling errors and small variations, a standard metric is called edit distance or Levenshtein distance. This is a count of the minimum number of deletions, insertions, or substitutions you need to change one text into another. Strings like "No I don't like the movies" and "No I don't like the moveys" will have small edit distances.
Here's a quick and dirty recursive edit distance function that will give you an idea:
function validate(text, pattern) {
// some simple preprocessing
let p = pattern.toLowerCase().replace(/[^a-z]+/ig, '')
let t= text.toLowerCase().replace(/[^a-z]+/ig, '')
// memoize recursive algorithm
let matrix = Array.from({length: t.length + 1}, () => [])
function editDistance(text, pattern, i = 0, j = 0){
if(i == text.length && j == pattern.length) return 0
if(i == text.length) return pattern.length - j
if(j == pattern.length) return text.length - i
let choices = [
(matrix[i+1][j+1] || (matrix[i+1][j+1] = editDistance(text, pattern, i+1, j+1))) + (text[i].toLowerCase() === pattern[j].toLowerCase() ? 0 : 1),
(matrix[i+1][j] || (matrix[i+1][j] = editDistance(text, pattern, i+1, j))) + 1,
(matrix[i][j+1] || (matrix[i][j+1] = editDistance(text, pattern, i, j+1))) + 1
]
return Math.min(...choices)
}
return editDistance(t, p)
}
// similar strings have smaller edit distances
console.log(validate("No I dont lik moves","No i dont like movies"))
// a little less similar
console.log(validate("Yes I like movies","No i dont like movies"))
// totally different
console.log(validate("Where is the bathroom","No i dont like movies"))
// careful -- small edit distance !== close meaning
console.log(validate("I do like tacos","I don't like tacos"))
Picking a maximum acceptable edit distance works pretty well for matching strings with small typos. Of course, if you are trying to gauge user intent, none of these simple heuristics will work. Strings like "I love tacos" and "I loathe tacos" have a small edit distance, and you can't tell that they mean the opposite without knowledge of the language. If you need to do this level of checking you can try using a service like Watson Conversation that will return user intents for the input.

How do I shorten and expand a uuid to a 15 or less characters

Given a UUID (v4) without dashes, how can I shorten it to a string of 15 or fewer characters? I should also be able to go back from the 15-character string to the original UUID.
I am trying to shorten it to send it in a flat file, and the file format specifies this field as a 15-character alphanumeric field. Given that shortened UUID, I should be able to map it back to the original UUID.
Here is what I tried, but it is definitely not what I want.
export function shortenUUID(uuidToShorten: string, length: number) {
  const uuidWithoutDashes = uuidToShorten.replace(/-/g, '');
  const radix = uuidWithoutDashes.length;
  const randomId = [];
  for (let i = 0; i < length; i++) {
    randomId[i] = uuidWithoutDashes[0 | Math.random() * radix];
  }
  return randomId.join('');
}
As AuxTaco pointed out, if you actually mean "alphanumeric" as in it matches /^[A-Za-z0-9]{0,15}/ (giving an alphabet of 26 + 26 + 10 = 62 characters), then it is really impossible. You can't fit three gallons of water in a one-gallon bucket without losing something. A UUID is 128 bits, so to represent it in a 62-character alphabet you'd need at least 22 characters (log base 62 of 2^128 is about 21.5, rounded up to 22).
If you are more flexible with your charset and just need 15 Unicode characters you can put in a text document, then my answer will help.
Note: While writing the first part of this answer, I thought the question said a length of 16, not 15. The simpler approach won't quite work for 15, but the more complex version below still will.
In order to do so, you'd have to use some kind of two-way compression algorithm (similar to the algorithms used for zipping files).
However, the problem with trying to compress something like a UUID is that you'd probably have lots of collisions.
A UUID v4 is 32 characters long (without dashes). It's hexadecimal, so its character space is 16 characters (0123456789ABCDEF).
That gives you 16^32 possible combinations, approximately 3.4028237e+38 or 340,282,370,000,000,000,000,000,000,000,000,000,000. To make it recoverable after compression, you'd have to make sure you don't have any collisions (i.e., no two UUIDs turn into the same value). That's a lot of possible values (which is exactly why UUIDs use that many: the chance of two random UUIDs colliding is only about 1 out of that big number).
To crunch that many possibilities into 16 characters, you'd have to have at least as many possible values. With 16 characters, your alphabet would have to have 256 characters (the 16th root of that big number: 256^16 == 16^32). That's assuming you have an algorithm that never creates a collision.
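A quick sanity check of the arithmetic above (BigInt keeps the big comparison exact):
// 128 bits expressed in a 62-character alphabet needs at least 22 symbols.
console.log(Math.ceil(128 / Math.log2(62))); // 22
// A base-256 string of length 16 covers exactly the same space as a 32-digit hex string.
console.log(16n ** 32n === 256n ** 16n); // true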
One way to ensure you never have collisions would be to convert it from a base-16 number to a base-256 number. That would give you a 1-to-1 relation, ensuring no collisions and making it perfectly reversible. Normally, switching bases is easy in JavaScript: parseInt(someStr, radix).toString(otherRadix) (e.g., parseInt('00FF', 16).toString(20)). Unfortunately, JavaScript's built-in conversion only supports radixes up to 36, so we'll have to do the conversion ourselves.
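For example, the built-in conversion works like this, and throws once the radix passes 36:
console.log(parseInt('00FF', 16).toString(20)); // "cf" (255 in base 20)
try {
  (255).toString(37);
} catch (e) {
  console.log(e instanceof RangeError); // true: toString() only accepts radixes 2 through 36
}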
The catch with such a large base is representing it. You could arbitrarily pick 256 different characters, throw them in a string, and use that for a manual conversion. However, I don't think there are 256 different symbols on a standard US keyboard, even if you treat upper and lowercase as different glyphs.
A simpler solution would be to just use arbitrary character codes from 0 to 255 with String.fromCharCode().
Another small catch: if we tried to treat all of that as one big number, we'd run into issues, because it's a really big number and JavaScript's Number type can't represent it exactly.
Instead, since we already have hexadecimal, we can just split it into pairs of hex digits, convert each pair, then spit them out. 32 hexadecimal digits = 16 pairs, so that'll (coincidentally) be perfect. (If you had to solve this for an arbitrary size, you'd have to do some extra math and conversion to split the number into pieces, convert them, then reassemble.)
const uuid = '1234567890ABCDEF1234567890ABCDEF';
const letters = uuid.match(/.{2}/g).map(pair => String.fromCharCode(parseInt(pair, 16)));
const str = letters.join('');
console.log(str);
Note that there are some odd-looking characters in there, because not every char code maps to a "normal" symbol. If whatever you are sending this to can't handle them, you'll instead need to go with the array approach: find 256 characters it can handle, make an array of them, and instead of String.fromCharCode(num), use charset[num].
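A hedged sketch of that array approach; the 256 characters here are taken from code points 0x100 upward purely as an illustrative assumption (they are all printable), and you'd swap in whatever characters your target actually accepts:
const charset = Array.from({ length: 256 }, (_, i) => String.fromCharCode(0x100 + i));

const compressWithCharset = uuid =>
  uuid.match(/.{2}/g).map(pair => charset[parseInt(pair, 16)]).join('');

const expandWithCharset = str =>
  [...str].map(ch => charset.indexOf(ch).toString(16).padStart(2, '0')).join('');

const packed = compressWithCharset('1234567890ABCDEF1234567890ABCDEF');
console.log(packed.length, expandWithCharset(packed)); // 16 "1234567890abcdef1234567890abcdef"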
To convert it back, you just do the reverse: get each char code, convert it to hex, and join them back together:
const uuid = '1234567890ABCDEF1234567890ABCDEF';
const compress = uuid =>
uuid.match(/.{2}/g).map(pair => String.fromCharCode(parseInt(pair, 16))).join('');
const expand = str =>
str.split('').map(letter => ('0' + letter.charCodeAt(0).toString(16)).substr(-2)).join('');
const str = compress(uuid);
const original = expand(str);
console.log(str, original, original.toUpperCase() === uuid.toUpperCase());
For fun, here is how you could do it for any arbitrary input base and output base.
This code is a bit messy because it is really expanded to make it more self-explanatory, but it basically does what I described above.
Since JavaScript doesn't have an infinite level of precision, if you end up converting a really big number (one that gets displayed like 2.00000000e+10), every digit that isn't shown has essentially been chopped off and replaced with a zero. To account for that, you'll have to break it up in some way.
In the code below, there is a "simple" way which doesn't account for this, and so only works for smaller strings, and then a proper way which breaks it up. I chose a simple, yet somewhat inefficient, approach of just breaking up the string based on how many digits it gets turned into. This isn't the best way (since math doesn't really work like that), but it does the trick (at the cost of needing a bigger charset).
You could employ a smarter splitting mechanism if you really needed to keep your charset size to a minimum.
const smallStr = '1234';
const str = '1234567890ABCDEF1234567890ABCDEF';
const hexCharset = '0123456789ABCDEF'; // could also be an array
const compressedLength = 16;
const maxDigits = 16; // this may be a bit browser specific. You can make it smaller to be safer.
const logBaseN = (num, n) => Math.log(num) / Math.log(n);
const nthRoot = (num, n) => Math.pow(num, 1/n);
const digitsInNumber = num => Math.log(num) * Math.LOG10E + 1 | 0;
const partitionString = (str, numPartitions) => {
const partsSize = Math.ceil(str.length / numPartitions);
let partitions = [];
for (let i = 0; i < numPartitions; i++) {
partitions.push(str.substr(i * partsSize, partsSize));
}
return partitions;
}
console.log('logBaseN test:', logBaseN(256, 16) === 2);
console.log('nthRoot test:', nthRoot(256, 2) === 16);
console.log('partitionString test:', partitionString('ABCDEFG', 3));
// charset.length should equal radix
const toDecimalFromCharset = (str, charset) =>
str.split('')
.reverse()
.map((char, index) => charset.indexOf(char) * Math.pow(charset.length, index))
.reduce((sum, num) => (sum + num), 0);
const fromDecimalToCharset = (dec, charset) => {
const radix = charset.length;
let str = '';
for (let i = Math.ceil(logBaseN(dec + 1, radix)) - 1; i >= 0; i--) {
const part = Math.floor(dec / Math.pow(radix, i));
dec -= part * Math.pow(radix, i);
str += charset[part];
}
return str;
};
console.log('toDecimalFromCharset test 1:', toDecimalFromCharset('01000101', '01') === 69);
console.log('toDecimalFromCharset test 2:', toDecimalFromCharset('FF', hexCharset) === 255);
console.log('fromDecimalToCharset test:', fromDecimalToCharset(255, hexCharset) === 'FF');
const arbitraryCharset = length => new Array(length).fill(1).map((a, i) => String.fromCharCode(i));
// the Math.pow() bit is the possible number of values in the original
const simpleDetermineRadix = (strLength, originalCharsetSize, compressedLength) => nthRoot(Math.pow(originalCharsetSize, strLength), compressedLength);
// the simple ones only work for values whose decimal representation is small enough that the lack of precision doesn't mess things up
// compressedCharset.length must be >= compressedLength
const simpleCompress = (str, originalCharset, compressedCharset, compressedLength) =>
fromDecimalToCharset(toDecimalFromCharset(str, originalCharset), compressedCharset);
const simpleExpand = (compressedStr, originalCharset, compressedCharset) =>
fromDecimalToCharset(toDecimalFromCharset(compressedStr, compressedCharset), originalCharset);
const simpleNeededRadix = simpleDetermineRadix(str.length, hexCharset.length, compressedLength);
const simpleCompressedCharset = arbitraryCharset(simpleNeededRadix);
const simpleCompressed = simpleCompress(str, hexCharset, simpleCompressedCharset, compressedLength);
const simpleExpanded = simpleExpand(simpleCompressed, hexCharset, simpleCompressedCharset);
// Notice, it gets a little confused because of a lack of precision in the really big number.
console.log('Original string:', str, toDecimalFromCharset(str, hexCharset));
console.log('Simple Compressed:', simpleCompressed, toDecimalFromCharset(simpleCompressed, simpleCompressedCharset));
console.log('Simple Expanded:', simpleExpanded, toDecimalFromCharset(simpleExpanded, hexCharset));
console.log('Simple test:', simpleExpanded === str);
// Notice it works fine for smaller strings and/or charsets
const smallCompressed = simpleCompress(smallStr, hexCharset, simpleCompressedCharset, compressedLength);
const smallExpanded = simpleExpand(smallCompressed, hexCharset, simpleCompressedCharset);
console.log('Small string:', smallStr, toDecimalFromCharset(smallStr, hexCharset));
console.log('Small simple compressed:', smallCompressed, toDecimalFromCharset(smallCompressed, simpleCompressedCharset));
console.log('Small expanded:', smallExpanded, toDecimalFromCharset(smallExpanded, hexCharset));
console.log('Small test:', smallExpanded === smallStr);
// these will break the decimal up into smaller numbers with a max length of maxDigits
// it's a bit browser specific where the lack of precision is, so a smaller maxDigits
// may make it safer
//
// note: charset may need to be a little bit bigger than what determineRadix decides, since we're
// breaking the string up
// also note: we're breaking the string into parts based on the number of digits in it as a decimal
// this will actually make each individual parts decimal length smaller, because of how numbers work,
// but that's okay. If you have a charset just barely big enough because of other constraints, you'll
// need to make this even more complicated to make sure it's perfect.
const partitionStringForCompress = (str, originalCharset) => {
const numDigits = digitsInNumber(toDecimalFromCharset(str, originalCharset));
const numParts = Math.ceil(numDigits / maxDigits);
return partitionString(str, numParts);
}
const partitionedPartSize = (str, originalCharset) => {
const parts = partitionStringForCompress(str, originalCharset);
return Math.floor((compressedLength - parts.length - 1) / parts.length) + 1;
}
const determineRadix = (str, originalCharset, compressedLength) => {
const parts = partitionStringForCompress(str, originalCharset);
return Math.ceil(nthRoot(Math.pow(originalCharset.length, parts[0].length), partitionedPartSize(str, originalCharset)));
}
const compress = (str, originalCharset, compressedCharset, compressedLength) => {
const parts = partitionStringForCompress(str, originalCharset);
const partSize = partitionedPartSize(str, originalCharset);
return parts.map(part => simpleCompress(part, originalCharset, compressedCharset, partSize)).join(compressedCharset[compressedCharset.length-1]);
}
const expand = (compressedStr, originalCharset, compressedCharset) =>
compressedStr.split(compressedCharset[compressedCharset.length-1])
.map(part => simpleExpand(part, originalCharset, compressedCharset))
.join('');
const neededRadix = determineRadix(str, hexCharset, compressedLength);
const compressedCharset = arbitraryCharset(neededRadix);
const compressed = compress(str, hexCharset, compressedCharset, compressedLength);
const expanded = expand(compressed, hexCharset, compressedCharset);
console.log('String:', str, toDecimalFromCharset(str, hexCharset));
console.log('Needed radix size:', neededRadix); // bigger than normal because of how we're breaking it up... this could be improved if needed
console.log('Compressed:', compressed);
console.log('Expanded:', expanded);
console.log('Final test:', expanded === str);
To use the above specifically to answer the question, you would use:
const hexCharset = '0123456789ABCDEF';
const compressedCharset = arbitraryCharset(determineRadix(uuid, hexCharset));
// UUID to 15 characters
const compressed = compress(uuid, hexCharset, compressedCharset, 15);
// 15 characters to UUID
const expanded = expand(compressed, hexCharset, compressedCharset);
If there are problematic characters in the arbitrary charset, you'll have to do something to either filter them out or hard-code specific replacements. Just make sure all of the functions are deterministic (i.e., same result every time).
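For instance, a small sketch of one deterministic way to skip problematic characters when building the charset (which characters count as problematic is an assumption here; adjust the filter for your file format):
// Builds a charset of the requested size from increasing code points, skipping
// ASCII control characters and the delete character. Because the scan order is
// fixed, the same length always produces the same charset, so compress/expand
// stay reversible.
const safeArbitraryCharset = length => {
  const chars = [];
  for (let code = 0x20; chars.length < length; code++) {
    if (code !== 0x7f) chars.push(String.fromCharCode(code));
  }
  return chars;
};
console.log(safeArbitraryCharset(5)); // [" ", "!", "\"", "#", "$"]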

Look for substring in a string with at most one different character-javascript

I am new to programming and right now I am working on a program. The program needs to find a substring in a string and return the index where the matching chain starts. I know that I can use indexOf for that, but it's not that simple: I want to find substrings with at most one different character.
I was thinking about regular expressions... but I don't really know how to use them here, because I would need a regular expression for every element of the string. Here is some code which will probably clarify what I want to do:
var A = "abbab";
var B = "ba";
var tb = [];
console.log(A.indexOf(B));
for (var i = 0; i < B.length; i++) {
  var D = B.replace(B[i], "[a-z]");
  tb.push(A.indexOf(D));
}
console.log(tb);
I know that the substring B and the string A contain only lowercase letters. It would be nice to get any advice on how to do this using regular expressions. Thanks.
Simple Input:
A B
1) abbab ba
2) hello world
3) banana nan
Expected Output:
1) 1 2
2) No Match!
3) 0 2
While it is probably theoretically possible, I think it would be very complicated to try this kind of search while attempting to incorporate all possible search query options in one long, complex regular expression. I think a better approach is to use JavaScript to dynamically create various simpler options and then search with each separately.
The following code sequentially replaces each character in the initial query string with a regular expression wild card (i.e. a period, '.') and then searches the target string with that. For example, if the initial query string is 'nan', it will search with '.an', 'n.n' and 'na.'. It will only add the position of the hit to the list of hits if that position has not already been hit on a previous search. i.e. It ensures that the list of hits contains only unique values, even if multiple query variations found a hit at the same location. (This could be implemented even better with ES6 sets, but I couldn't get the Stack Overflow code snippet tool to cooperate with me while trying to use a set, even with the Babel option checked.) Finally, it sorts the hits in ascending order.
Update: The search algorithm has been updated/corrected. Originally, some hits were missed because the exec search for any query variation would only iterate as per the JavaScript default, i.e. after finding a match, it would start the next search at the next character after the end of the previous match, e.g. it would find 'aa' in 'aaaa' at positions 0 and 2. Now it starts the next search at the next character after the start of the previous match, e.g. it now finds 'aa' in 'aaaa' at positions 0, 1 and 2.
const findAllowingOneMismatch = (target, query) => {
  const numLetters = query.length;
  const queryVariations = [];
  for (let variationNum = 0; variationNum < numLetters; variationNum += 1) {
    queryVariations.push(query.slice(0, variationNum) + "." + query.slice(variationNum + 1));
  }
  let hits = [];
  queryVariations.forEach(queryVariation => {
    const re = new RegExp(queryVariation, "g");
    let searchResult;
    while ((searchResult = re.exec(target)) !== null) {
      re.lastIndex = searchResult.index + 1;
      const hit = searchResult.index;
      // console.log('found a hit with ' + queryVariation + ' at position ' + hit);
      if (hits.indexOf(hit) === -1) {
        hits.push(searchResult.index);
      }
    }
  });
  hits = hits.sort((a, b) => (a - b));
  console.log('Found "' + query + '" in "' + target + '" at positions:', JSON.stringify(hits));
};

[
  ['abbab', 'ba'],
  ['hello', 'world'],
  ['banana', 'nan'],
  ['abcde abcxe abxxe xbcde', 'abcd'],
  ['--xx-xxx--x----x-x-xxx--x--x-x-xx-', '----']
].forEach(pair => { findAllowingOneMismatch(pair[0], pair[1]); });

Searching for multiple partial phrases so that one original phrase can not match multiple search phrases

Given a predefined set of phrases, I'd like to perform a search based on user's query. For example, consider the following set of phrases:
index phrase
-----------------------------------------
0 Stack Overflow
1 Math Overflow
2 Super User
3 Webmasters
4 Electrical Engineering
5 Programming Jokes
6 Programming Puzzles
7 Geographic Information Systems
The expected behaviour is:
query result
------------------------------------------------------------------------
s Stack Overflow, Super User, Geographic Information Systems
web Webmasters
over Stack Overflow, Math Overflow
super u Super User
user s Super User
e e Electrical Engineering
p Programming Jokes, Programming Puzzles
p p Programming Puzzles
To implement this behaviour I used a trie. Every node in the trie has an array of indices (empty initially).
To insert a phrase to the trie, I first break it to words. For example, Programming Puzzles has index = 6. Therefore, I add 6 to all the following nodes:
p
pr
pro
prog
progr
progra
program
programm
programmi
programmin
programming
pu
puz
puzz
puzzl
puzzle
puzzles
The problem is, when I search for the query prog p, I first get a list of indices for prog which is [5, 6]. Then, I get a list of indices for p which is [5, 6] as well. Finally, I calculate the intersection between the two, and return the result [5, 6], which is obviously wrong (should be [6]).
How would you fix this?
Key Observation
We can use the fact that two words in a query can match the same word in a phrase only if one query word is a prefix of the other query word (or if they are the same). So if we process the query words in descending lexicographic order (prefixes come after their "superwords"), then we can safely remove words from the phrases at the first match. Doing so leaves no possibility of matching the same phrase word twice. As I said, it is safe because a prefix matches a superset of the phrase words that its "superwords" can match, and a pair of query words where one is not a prefix of the other always matches disjoint sets of phrase words.
We don't have to remove words from phrases or the trie "physically", we can do it "virtually".
Implementation of the Algorithm
var PhraseSearch = function () {
var Trie = function () {
this.phraseWordCount = {};
this.children = {};
};
Trie.prototype.addPhraseWord = function (phrase, word) {
if (word !== '') {
var first = word.charAt(0);
if (!this.children.hasOwnProperty(first)) {
this.children[first] = new Trie();
}
var rest = word.substring(1);
this.children[first].addPhraseWord(phrase, rest);
}
if (!this.phraseWordCount.hasOwnProperty(phrase)) {
this.phraseWordCount[phrase] = 0;
}
this.phraseWordCount[phrase]++;
};
Trie.prototype.getPhraseWordCount = function (prefix) {
if (prefix !== '') {
var first = prefix.charAt(0);
if (this.children.hasOwnProperty(first)) {
var rest = prefix.substring(1);
return this.children[first].getPhraseWordCount(rest);
} else {
return {};
}
} else {
return this.phraseWordCount;
}
}
this.trie = new Trie();
}
PhraseSearch.prototype.addPhrase = function (phrase) {
var words = phrase.trim().toLowerCase().split(/\s+/);
words.forEach(function (word) {
this.trie.addPhraseWord(phrase, word);
}, this);
}
PhraseSearch.prototype.search = function (query) {
var answer = {};
var phraseWordCount = this.trie.getPhraseWordCount('');
for (var phrase in phraseWordCount) {
if (phraseWordCount.hasOwnProperty(phrase)) {
answer[phrase] = true;
}
}
var prefixes = query.trim().toLowerCase().split(/\s+/);
prefixes.sort();
prefixes.reverse();
var prevPrefix = '';
var superprefixCount = 0;
prefixes.forEach(function (prefix) {
if (prevPrefix.indexOf(prefix) !== 0) {
superprefixCount = 0;
}
phraseWordCount = this.trie.getPhraseWordCount(prefix);
function phraseMatchedWordCount(phrase) {
return phraseWordCount.hasOwnProperty(phrase) ? phraseWordCount[phrase] - superprefixCount : 0;
}
for (var phrase in answer) {
if (answer.hasOwnProperty(phrase) && phraseMatchedWordCount(phrase) < 1) {
delete answer[phrase];
}
}
prevPrefix = prefix;
superprefixCount++;
}, this);
return Object.keys(answer);
}
function test() {
var phraseSearch = new PhraseSearch();
var phrases = [
'Stack Overflow',
'Math Overflow',
'Super User',
'Webmasters',
'Electrical Engineering',
'Programming Jokes',
'Programming Puzzles',
'Geographic Information Systems'
];
phrases.forEach(phraseSearch.addPhrase, phraseSearch);
var queries = {
's': 'Stack Overflow, Super User, Geographic Information Systems',
'web': 'Webmasters',
'over': 'Stack Overflow, Math Overflow',
'super u': 'Super User',
'user s': 'Super User',
'e e': 'Electrical Engineering',
'p': 'Programming Jokes, Programming Puzzles',
'p p': 'Programming Puzzles'
};
for(var query in queries) {
if (queries.hasOwnProperty(query)) {
var expected = queries[query];
var actual = phraseSearch.search(query).join(', ');
console.log('query: ' + query);
console.log('expected: ' + expected);
console.log('actual: ' + actual);
}
}
}
One can test this code here: http://ideone.com/RJgj6p
Possible Optimizations
Storing the phrase word count in each trie node is not very memory efficient. But by implementing a compressed trie it is possible to reduce the worst-case memory complexity to O(n m), where n is the number of different words in all the phrases and m is the total number of phrases.
For simplicity I initialize answer by adding all the phrases. But a more time-efficient approach is to initialize answer with the phrases matched by the query word that matches the fewest phrases, then intersect with the phrases of the query word matching the second-fewest phrases, and so on.
Relevant Differences from the Implementation Referenced in the Question
In each trie node I store not only the phrase references (ids) matched by the subtrie, but also the number of matched words in each of these phrases. So the result of a match is not only the matched phrase references, but also the number of matched words in them.
I process query words in descending lexicographic order.
I subtract the number of superprefixes (query words of which the current query word is a prefix) from current match results (by using variable superprefixCount), and a phrase is considered matched by the current query word only when the resulting number of matched words in it is greater than zero. As in the original implementation, the final result is the intersection of the matched phrases.
As one can see, changes are minimal and asymptotic complexities (both time and memory) are not changed.
If the set of phrases is defined and does not contain long phrases, maybe you can create not 1 trie, but n tries, where n is the maximum number of words in one phrase.
In i-th trie store i-th word of the phrase. Let's call it the trie with label 'i'.
To process query with m words let's consider the following algorithm:
For each phrase we will store the lowest label of a trie, where the word from this phrase was found. Let's denote it as d[j], where j is the phrase index. At first for each phrase j, d[j] = -1.
Search the first word in each of n tries.
For each phrase j find the label of a trie that is greater than d[j] and where the word from this phrase was found. If there are several such labels, pick the smallest one. Let's denote such label as c[j].
If there is no such index, this phrase can not be matched. You can mark this case with d[j] = n + 1.
If there is such a c[j] with c[j] > d[j], then assign d[j] = c[j].
Repeat for every word left.
Every phrase with -1 < d[j] < n is matched.
This is not very optimal. To improve performance you should store only the usable values of the d array: after the first word, keep only the phrases matched by that word. Also, instead of assigning d[j] = n + 1, delete index j, and process only the phrase indices that are still stored.
You can solve it as a Graph Matching Problem in a Bipartite Graph.
For each document, query pair define the graph:
G=(V,E) Where
V = {t1 | for each term t1 in the query} U { t2 | for each term t2 in the document}
E = { (t1,t2) | t1 is a match for t2 }
Intuitively: you have a vertex for each term in the query, a vertex for each term in the document, and an edge between a document term and a query term, only if the query term matches the document term. You have already solved this part with your trie.
You got yourself a bipartite graph, there are only edges between the "query vertices" and the "document vertices" (and not between two vertices of the same type).
Now, invoke a matching problem for bipartite graph, and get an optimal matching {(t1_1,t2_1), ... , (t1_k,t2_k)}.
Your algorithm should return a document d for a query q with m terms if (and only if) all m terms are satisfied, which means you have a maximum matching with k = m.
In your example, for the query "prog p" and the document "Programming Jokes", you get a bipartite graph in which both query terms, prog and p, can only be matched to the single document term Programming (it doesn't matter which of the two actually gets matched to it).
For the same query and the document "Programming Puzzles", you get a bipartite graph with the matching (prog, Programming), (p, Puzzles).
As you can see, for the first example there is no matching that covers all the terms, so you "reject" the document. For the second example you are able to match all the terms, so you return it.
For performance issues, you can do the suggested algorithm only on a subset of the phrases, that were already filtered out by your initial approach (intersection of documents that have matching for all terms).
After some thought I came up with an idea similar to dened's: in addition to the index of a matched phrase, each prefix also refers to how many words it is a prefix of in that phrase. That count can then be reduced during the query by the number of the prefix's superprefixes among the other query words, and the returned results include only those phrases whose number of matched words is at least the number of words in the query.
We can implement an additional small tweak to avoid large cross-checks by adding (for the English language) a maximum of approximately 26 choose 2 + 26 choose 3 and even an additional 26 choose 4 special elements to the trie that refer to ordered first-letter intersections. When a phrase is inserted, the special elements in the trie referring to the 2 and 3 first-letter combinations will receive its index. Then match results from larger query words can be cross-checked against these. For example, if our query is "Geo i", the match results for "Geo" would be cross-checked against the special trie element, "g-i", which hopefully would have significantly less match results than "i".
Also, depending on the specific circumstances, large cross-checks could at times be more efficiently handled in parallel (for example, via a bitset &).
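A hedged sketch of that first-letter-pair idea from two paragraphs above (the data structures and names are illustrative assumptions, not part of the answer's implementation):
// Map from an ordered pair of word-initial letters (e.g. "g-i") to the set of
// phrase indices whose words start with both letters.
const pairIndex = new Map();

const indexPhrase = (phrase, idx) => {
  const firsts = [...new Set(phrase.toLowerCase().split(/\s+/).map(w => w[0]))].sort();
  for (let a = 0; a < firsts.length; a++) {
    for (let b = a + 1; b < firsts.length; b++) {
      const key = `${firsts[a]}-${firsts[b]}`;
      if (!pairIndex.has(key)) pairIndex.set(key, new Set());
      pairIndex.get(key).add(idx);
    }
  }
};

indexPhrase('Geographic Information Systems', 7);
indexPhrase('Programming Jokes', 5);
// For the query "Geo i", candidates can be narrowed via the "g-i" bucket first:
console.log(pairIndex.get('g-i')); // Set { 7 }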

How to generate short uid like "aX4j9Z" (in JS)

For my web application (in JavaScript) I want to generate short guids (for different objects - that are actually different types - strings and arrays of strings)
I want something like "aX4j9Z" for my uids (guids).
So these uids should be lightweight enough for web transfer and js string processing and quite unique for not a huge structure (not more than 10k elements). By saying "quite unique" I mean that after the generation of the uid I could check whether this uid does already exist in the structure and regenerate it if it does.
See #Mohamed's answer for a pre-packaged solution (the shortid package). Prefer that instead of any other solutions on this page if you don't have special requirements.
A 6-character alphanumeric sequence is more than enough to randomly index a 10k collection (36^6 ≈ 2.2 billion and 36^3 = 46,656).
function generateUID() {
  // I generate the UID from two parts here
  // to ensure the random number provides enough bits.
  var firstPart = (Math.random() * 46656) | 0;
  var secondPart = (Math.random() * 46656) | 0;
  firstPart = ("000" + firstPart.toString(36)).slice(-3);
  secondPart = ("000" + secondPart.toString(36)).slice(-3);
  return firstPart + secondPart;
}
Randomly generated UIDs will start to collide after generating roughly √N of them (the birthday paradox), thus 6 digits are needed for safe generation without checking (the old version only generated 4 digits, which would collide after about 1300 IDs if you don't check).
If you do collision checking, the number of digits can be reduced to 3 or 4, but note that performance will degrade linearly as you generate more and more UIDs.
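A quick check of those numbers (√N is the usual birthday-paradox rule of thumb):
// Collisions become likely around the square root of the ID space.
console.log(Math.sqrt(36 ** 6)); // 46656  -> 6 base-36 digits are plenty for ~10k items
console.log(Math.sqrt(36 ** 4)); // 1296   -> why 4 digits start colliding after ~1300 IDs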
var _generatedUIDs = {};
function generateUIDWithCollisionChecking() {
  while (true) {
    var uid = ("0000" + ((Math.random() * Math.pow(36, 4)) | 0).toString(36)).slice(-4);
    if (!_generatedUIDs.hasOwnProperty(uid)) {
      _generatedUIDs[uid] = true;
      return uid;
    }
  }
}
Consider using a sequential generator (e.g. user134_item1, user134_item2, …) if you require uniqueness and not unpredictability. You could "Hash" the sequentially generated string to recover unpredictability.
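A minimal sketch of that sequential idea (the user/item naming is just illustrative):
// Uniqueness comes from the counter; nothing here is random or unpredictable.
let itemCounter = 0;
const nextItemId = userId => `user${userId}_item${++itemCounter}`;
console.log(nextItemId(134)); // "user134_item1"
console.log(nextItemId(134)); // "user134_item2"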
UIDs generated using Math.random are not secure (and you shouldn't trust the client anyway). Do not rely on their uniqueness or unpredictability for mission-critical tasks.
Update 08/2020:
shortid has been deprecated in favor of nanoid which is smaller and faster:
Small. 108 bytes (minified and gzipped). No dependencies. Size Limit controls the size.
Fast. It is 40% faster than UUID.
Safe. It uses cryptographically strong random APIs. Can be used in clusters.
Compact. It uses a larger alphabet than UUID (A-Za-z0-9_-). So ID size was reduced from 36 to 21 symbols.
Portable. Nano ID was ported to 14 programming languages.
import { nanoid } from 'nanoid'
// 21 characters (default)
// ~149 billion years needed, in order to have a 1% probability of at least one collision.
console.log(nanoid()) //=> "V1StGXR8_Z5jdHi6B-myT"
// 11 characters
// ~139 years needed, in order to have a 1% probability of at least one collision.
console.log(nanoid(11)) //=> "bdkjNOkq9PO"
More info here : https://zelark.github.io/nano-id-cc/
Old answer
There is also an awesome npm package for this : shortid
Amazingly short non-sequential url-friendly unique id generator.
ShortId creates amazingly short non-sequential url-friendly unique ids. Perfect for url shorteners, MongoDB and Redis ids, and any other id users might see.
By default 7-14 url-friendly characters: A-Z, a-z, 0-9, _-
Non-sequential so they are not predictable.
Supports cluster (automatically), custom seeds, custom alphabet.
Can generate any number of ids without duplicates, even millions per day.
Perfect for games, especially if you are concerned about cheating so you don't want an easily guessable id.
Apps can be restarted any number of times without any chance of repeating an id.
Popular replacement for Mongo ID/Mongoose ID.
Works in Node, io.js, and web browsers.
Includes Mocha tests.
Usage
var shortid = require('shortid');
console.log(shortid.generate()); //PPBqWA9
Here is a one liner, but it gives only lowercase letters and numbers:
var uuid = Math.random().toString(36).slice(-6);
console.log(uuid);
Get a simple counter to start from 100000000, convert the number into radix 36.
(100000000).toString(36); //1njchs
(2100000000).toString(36); //yqaadc
You can comfortably have 2 billion elegant unique ids, just like YouTube
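A small sketch of that counter idea; because the id is just the counter in base 36, it can also be decoded back with parseInt(id, 36):
let counter = 100000000; // arbitrary starting point from the example above
const nextShortId = () => (counter++).toString(36);

const id = nextShortId();
console.log(id, parseInt(id, 36)); // "1njchs" 100000000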
The following generates 62^3 (238,328) unique 3-character values, provided the ids are treated case-sensitively and digits are allowed in all positions. If case insensitivity is required, remove either the upper-case or the lower-case characters from the chars string and it will generate 36^3 (46,656) unique values.
It can easily be adapted so that the first character is always a letter, or all characters are letters.
No doubt it can be optimised, and it could also refuse to return an id when the limit is reached.
var nextId = (function() {
var nextIndex = [0,0,0];
var chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'.split('');
var num = chars.length;
return function() {
var a = nextIndex[0];
var b = nextIndex[1];
var c = nextIndex[2];
var id = chars[a] + chars[b] + chars[c];
a = ++a % num;
if (!a) {
b = ++b % num;
if (!b) {
c = ++c % num;
}
}
nextIndex = [a, b, c];
return id;
}
}());
var letters = 'abcdefghijklmnopqrstuvwxyz';
var numbers = '1234567890';
var charset = letters + letters.toUpperCase() + numbers;

function randomElement(array) {
  return array[Math.floor(Math.random() * array.length)];
}

function randomString(length) {
  var R = '';
  for (var i = 0; i < length; i++)
    R += randomElement(charset);
  return R;
}
This is an old question and there are some good answers; however, I notice that we are in 2022 and we can use ES6, so if you don't want to depend on third-party libs, here is a solution for you.
I implemented a very simple generator using the built-in functions that JavaScript offers these days. It uses Crypto.getRandomValues() and Uint8Array(), so check the code below.
const hashID = size => {
  const LETTERS = 'abcdefghijklmnopqrstuvwxyz'
  const NUMBERS = '1234567890'
  const charset = `${NUMBERS}${LETTERS}${LETTERS.toUpperCase()}`.split('')
  const bytes = new Uint8Array(size)
  crypto.getRandomValues(bytes)
  // map each random byte onto the 62-character set; the modulo introduces a
  // tiny bias, which is acceptable for non-cryptographic IDs
  return bytes.reduce((acc, byte) => `${acc}${charset[byte % charset.length]}`, '')
}

console.log({id: hashID(6)})
This implementation uses the characters [A-Z], [a-z] and [0-9], 62 characters in total; if we add _ and - it goes up to 64 characters, like this:
const hashID = size => {
  const MASK = 0x3f // 63: a six-bit mask, matching the 64-character set below
  const LETTERS = 'abcdefghijklmnopqrstuvwxyz'
  const NUMBERS = '1234567890'
  const charset = `${NUMBERS}${LETTERS}${LETTERS.toUpperCase()}_-`.split('')
  const bytes = new Uint8Array(size)
  crypto.getRandomValues(bytes)
  return bytes.reduce((acc, byte) => `${acc}${charset[byte & MASK]}`, '')
}

console.log(`id: ${hashID(6)}`)
Note:
It will take around 2 days in order to have a 1% probability of at least one collision for 1000 IDs generated per hour with ID length of 6 characters. Keep this in mind when it is implemented into your project.
This will generate a sequence of unique values. It improves on RobG's answer by growing the string length when all values have been exhausted.
var IdGenerator = (function () {
var defaultCharset = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890!##$%^&*()_-+=[]{};:?/.>,<|".split("");
var IdGenerator = function IdGenerator(charset) {
this._charset = (typeof charset === "undefined") ? defaultCharset : charset;
this.reset();
};
IdGenerator.prototype._str = function () {
var str = "",
perm = this._perm,
chars = this._charset,
len = perm.length,
i;
for (i = 0; i < len; i++) {
str += chars[perm[i]];
}
return str;
};
IdGenerator.prototype._inc = function () {
var perm = this._perm,
max = this._charset.length - 1,
i;
for (i = 0; true; i++) {
if (i > perm.length - 1) {
perm.push(0);
return;
} else {
perm[i]++;
if (perm[i] > max) {
perm[i] = 0;
} else {
return;
}
}
}
};
IdGenerator.prototype.reset = function () {
this._perm = [];
};
IdGenerator.prototype.current = function () {
return this._str();
};
IdGenerator.prototype.next = function () {
this._inc();
return this._str();
};
return IdGenerator;
}).call(null);
Usage:
var g = new IdGenerator(),
i;
for (i = 0; i < 100; i++) {
console.log(g.next());
}
This gist contains the above implementation and a recursive version.
just randomly generate some strings:
function getUID(len){
var chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
out = '';
for(var i=0, clen=chars.length; i<len; i++){
out += chars.substr(0|Math.random() * clen, 1);
}
// ensure that the uid is unique for this page
return getUID.uids[out] ? getUID(len) : (getUID.uids[out] = out);
}
getUID.uids = {};
You can shorten a GUID to 20 printable ASCII characters without losing information or the uniqueness of the GUID.
Jeff Atwood blogged about that years ago:
Equipping our ASCII Armor
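A hedged sketch of that idea: treat the GUID's 128 bits as one BigInt and re-encode it with an 85-character printable-ASCII alphabet, which needs ceil(128 / log2(85)) = 20 symbols. The alphabet below is an illustrative assumption, not the exact one from the article, and the decode step is omitted for brevity:
// 85 consecutive printable ASCII characters, starting at '(' (code 40).
const ALPHABET = Array.from({ length: 85 }, (_, i) => String.fromCharCode(40 + i)).join('');

const encodeGuid = guid => {
  let n = BigInt('0x' + guid.replace(/-/g, ''));
  let out = '';
  for (let i = 0; i < 20; i++) {        // 85^20 > 2^128, so 20 symbols always suffice
    out = ALPHABET[Number(n % 85n)] + out;
    n /= 85n;
  }
  return out;
};

console.log(encodeGuid('aabbccdd-1234-4f00-8000-0123456789ab')); // 20 printable characters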
This solution combines Math.random() with a counter.
Math.random() should give about 53 bits of entropy (compared with UUIDv4's 128), but when combined with a counter should give plenty enough uniqueness for a temporary ID.
let _id_counter = 0
function id() {
return '_' + (_id_counter++).toString(36) + '_' + Math.floor(Math.random() * Number.MAX_SAFE_INTEGER).toString(36)
}
console.log(Array.from({length: 100}).map(() => id()))
Features:
Simple implementation
Output of about 13 chars
Case-insensitive
Safe for use as HTML id and React key
Not suitable for database storage
You can use the md5 algorithm to generate a random-looking string; md5 here is the npm package of the same name.
var randomChars = Math.random().toString(36).replace(/[^a-z]+/g, '').substr(0, 2);
var shortUrl = md5(originalUrl + randomChars + new Date()).substring(0, 5).toString();
console.log(shortUrl);
This will generate a different short string each time, though with only five characters collisions are still possible.
