Find same position in two strings that are slightly different

Find same position in two strings that are slightly different - javascript

I have two strings where the first is a master string and the second is a slave string. They both contain similar values except that the slave will have characters added or removed.
I need to find the character offset from the master string in the slave string for each character of the master string.
I'm currently using a percentage as the algorithm for finding the similar offset in the slave string.
For example;
const master = 'The chicken is blue, but not really a chicken';
const slave = 'This large bird is blue, but is really a dog';
function slaveOffset(m, offset, s): number {
return Math.floor(s.length * (offset / m.length));
}
console.log(slaveOffset(master, 15, slave)); // prints 12
When translating the position 15 from the master (which reads "The chicken is ") the slave position is calculated as 12. Which reads as "This large b" because using percentage is not at all accurate (doesn't take into account added or removed characters).
The correct value should have been 18 (which reads as "The large bird is"), because the master offset ends at "is".
I need an algorithm for slaveOffset() that can handle added and removed characters and find the most likely slave offset. It does not need to be overly accurate but should solve the problem of large deviations caused by character changes.

This is a classic problem in computer science, typically called "data comparison" or simply "diff". The most common algorithms apply Longest Common Subsequence techniques, but in the general case this is a NP-hard problem so various heuristics are applied to get a "good enough" outcome, often tuned by a human in the loop.
Look up some diff algorithms to get some ideas.
In your case you probably want to start with the heuristic of "where does the slave string start to differ from the master and where does it become the same again". The strings match for the first two characters, but the next time you get a sequence of more than 3 characters matching is at the characters , i and s. The points become markers that you can use in your slaveOffset function.

Related

Can one encrypt an n-digit number, returning a unique n-digit number?

StackOverflow warns me that I may be down-voted for this question, but I'd appreciate your not doing so, as I post this simply to try to understand a programming exercise I've been posed with, and over which I've been puzzling a while now.
I'm doing some javascript coding exercises and one of the assignments was to devise an "encryption function", encipher, which would encrypt a 4-digit number by multiplying it by a number sufficiently low such that none of its digits exceeds 9, so that a 4-digit number is returned. Thus
encipher(0204)
might yield
0408
where the multiplier would have been 2. -- This is very basic material, simply to practice the Javascript. -- But as far as I can see, the numbers returned can never be deciphered (which is the next part of the exercise). Even if you store a dictionary internal to encipher, along the lines of
{'0408':'2'}, etc
so that you could do a lookup on 0408 and return 0204, these entries could not be assured to be unique. If one for example were to get the number 9999 to be deciphered, one would never know whether the original number was 9999 (multiplied by 1), 3333 (multiplied by 3) or 1111 (multiplied by 9). Is that correct? I realise this is a fairly silly and artificial problem, but I'm trying to understand if the instructions to the exercise are not quite right, or if I'm missing something. Here is the original problem:
Now, let's add one more level of security. After changing the position of the digits, we will multiply each member by a number whose multiplication does not exceed 10. (If it is higher than 10, we will get a two-digit multiplication and the code will no longer be 4 values). Now, implement in another function the decrypter (), which will receive as an argument an encrypted code (and correspondingly multiplied in the section above and return the decrypted code.
Leaving the exercise behind, I'm just curious whether there exists any way to "encrypt" (when I say "encrypt", I mean at a moderate javascript level, as I'm not a cryptography expert) an n-digit number and return a unique n-digit number?
Thanks for any insights. --

encrypt a 4-digit number by multiplying it by a number sufficiently low such that none of its digits exceeds 9, so that a 4-digit number is returned
If your input is 9999, there is no integer other than 1 or 0 that you can multiply your input by and get a positive number with a maximum of 4 digits. Therefore, there is no solution that involves only integer multiplication. However, integer multiplication can be used as part of an algorithm such as rotating digits (see below).
If instead you're looking for some sort of bijective algorithm (one that uniquely maps A to B and B to A), you can look at something like rotating the digits left or right, reversing the order of the digits, or using a unique mapping of each individual digit to another. Those can also be mixed.
Examples
Rotate
1234 -> 2341
Reverse
1234 -> 4321
Remap digits e.g. 2 mapped to 8, 3 mapped to 1
2323 -> 8181
Note that none of these are cryptographically sound methods to encrypt information, but they do seem to more-or-less meet the objectives of the exercise.

Given a dictionary and a list of letters, make a program learn to generate valid words | Javascript

I'm working on a big machine learning/nlp project and I'm stuck at a small part of it. (PM me, if you want to know what I'm working on exactly.)
I try to code a program in Javascript that learns to generate valid words, only by using all letters of the alphabet.
What I have is a database of 500K different words. It's a big JS object, structured like this (the words are german):
database = {
"um": {id: 1, word: "um", freq: 10938},
"oder": {id: 2, word: "oder", freq: 10257},
"Er": {id: 3, word: "Er", freq: 9323},
...
}
"freq" means frequency obviously. (Maybe this value sometimes gets important but I currently don't use it, so just ignore it.)
The way my program currently works is:
In the first iteration, it generates a completely random word between 2 and 13 letters long and searches for it in the database. If it's there, every letter in the word gets a good rating, if it's not there, they get a bad rating. Also the word length gets rated. If the word is valid, its word length gets a good rating, if it's not, its word length gets a bad rating.
In the iterations after that first one, it doesn't generate a word with random letters and a random word length. It uses probabilities based on the ratings of the letters and the word length.
For example, let's say it found the words "the", "so" and "if" after the first 100 iterations. So the letters "t", "h", "e" and the letters "s", "o", and the letters "i", "f" are good rated, and the word length of 2 and 3 is also good rated. So the word generated in the next iteration will more likely contain these good rated letters than bad rated letters.
Of course, the program also checks if the currently generated word already was generated and if so, then this word doesn't get rated again and it generates a new one.
In theory it should learn the optimal letter frequency and the optimal word-length-frequency by its own and sometimes only generate valid words.
Yeah. Of course this doesn't work. It gets better for the first few iterations, but as soon as it has found all the 2-lettered words it gets worse. I think my whole way how I do this is wrong. I've actually tried it out and have a (not so beautiful) graph after 5000 iterations for you:
Red line: wrong words generated
Green line: right words generated
Yeah. What is the problem here? Am I doing machine learning wrong? And do you have a solution? Some algorithm or trie system?
PS: I'm aware of this, but it's not in JS, I don't understand it and I can't comment on it.

An alternative method would be to use a Markov Model.
Start by counting up the letter frequencies and also word length frequencies in your dictionary. Then, to create a word:
Pick a weighted random number (see below) between 1 and the maximum existing word length. That's how many letters you're going to generate.
For each letter in the word, pick a weighted random letter and add it to the word.
That's an order-0 Markov model. It's based on the frequency of letters that occur in the corpus. It will probably give you results that are similar to the system you have.
You'll get better results from an order-1 Markov model, where instead of computing letter frequencies, you compute bigram (two-letter permutations) frequencies. So to pick the first letter, you choose only from the bigrams that are used to begin words. For subsequent letters, you choose a letter that follows the previously generated letter. That's going to give you somewhat better results than an order-0 model.
An order-2 model is surprisingly effective. See my blog post, Shakespeare vs. Markov, for an example.
A weighted random number is a number selected "at random," but skewed to reflect some distribution. In the English language, for example, the letter 'e' occurs approximately 12.7% of the time. 't' occurs 9.06% of the time, etc. See https://en.wikipedia.org/wiki/Letter_frequency. So you'd want your weighted random number generator's output to approximate that distribution. Or, in your case, you'd want it to approximate the distribution in your corpus. See Weighted random numbers for an example of how that's done.

Search string for numbers

I have a javascript chat bot where a person can type into an input box any question they like and hope to get an accurate answer. I can do this but I know I'm going about this all wrong because I don't know what position the number will appear in the sentence. If a person types in exactly:
what's the square root of 5 this works fine.
If he types in things like this it doesn't.
what is the square root of the 5
the square root of 5 is what
do you know what the square root of 5 is
etc
I need to be able to determine where the number appears in the sentence then do the calculation from there. Note the line below is part of a bigger working chatbot. In the line below I'm just trying to be able to answer any square root question regardless of where the number appears in the sentence. I also know there are many pitfalls with an open ended input box where a person can type anything such as spelling errors etc. This is just for entertainment not a serious scientific project. :)
if(
(word[0]=="what's") &&
(word[1]=="the") &&
(word[2]=="square") &&
(word[3]=="root") &&
(word [4]=="of") &&
(input.search(/\d{1,10}/)!=-1) &&
(num_of_words==6)
){
var root= word[5];
if(root<0){
document.result.result.value = "The square root of a negative number is not possible.";
}else{
word[5] = Math.sqrt(root);
word[5] = Math.round(word[5]*100)/100
document.result.result.value = "The square root of "+ root +" is "+ word[5] +".";
}
return true;
}
Just to be clear the bot is written using "If statemments" for a reason. If the input in this case doesn't include the words "what" and "square root" and "some number" the line doesn't trigger and is answered further down by the bot with a generic "I don't know type of response". So I'm hoping any answer will fit the format I am using. Be kind, I'm new here. I like making bots but I'm not much of a programmer. Thanks.

You can do this using a regular expression.
"What is the sqrt of 5?".match(/\d+/g);
["5"]
The output is an array containing all of the numbers found in the string. If you have more than one, like "Is the square root of 100 10?" then it will be
"Is the square root of 100 10?".match(/\d+/g);
["100", "10"]
so you can pick what number you want to use in your script.

You will need to use regular expressions, if you do not know what they are you should look them up as it would take too long to explain them in this response. Here is a useful website for regular expressions in JavaScript https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions.
Assuming that you know regular expressions you should first search for numbers and store all the numbers that you find, if no numbers are found print out an error message. As an added note, you may what to consider searching for mathematical constants such as pi or e. This should work
nums = someString./\d+|pi|e/gi
The next part is going to be hard but to boil it down to is core concept, you need to look for key words such as 'square root', 'times', or 'plus'. You should do this word by word going left to right. For example if a user inputs
What is 5 plus 3 minus 8?
you should detect the plus before the minus, while if this is inputted
What is 5 minus 3 plus 8?
You should detect the minus before the plus.
For operations that uses two numbers you need to take the first two numbers that you found and do the operation and replace the two numbers with the result. I am trying to use reverse polish notation if you do not quite understand what I am trying to do, look it up if do not know what it is.
I hope I understood your question correctly and provided some help to coming to a solution because what you asked is very hard but seems like fun. Good luck. Also as a warning I am not considering order of operations in my response.

Can i generate unique text 1 lacs time in javascript through any method who does not need any framework?

i have a word from A to Z. all word should in small latter (Capital not include) and 1 to 9 (included all special word who can be used in email address (just for a test)).
how i can generate unique 1 lacs text who never repeat itself. can anyone solve this puzzle.
i want a another thing that all words should not more then 10 char and not should minimum 6 char long

Put the characters in an array. Copy the array as the source of a new line. Randomly slice words from the array and put them in the line (use Math.random() * array.length | 0). Keep going for the required number of words.
You can also just use a string and charAt(index) if you only want single characters, but you have to keep cutting out the character that you select which is likely less efficient than using array.slice.
Whatever suits though, since performance is likely irrelevant.

Looking for a better javascript text-matching scoring system

I've been using String Score for a lot of projects. It's great for sorting lists, like names, countries, etc.
Right now, I'm working on a project where I want to match a term against a bigger set of text, not just a few words. Like, a paragraph.
Given the following two strings:
string1 = "I want to eat.";
string2 = "I want to eat. Let's go eat. All this talk about eating is making me hungry. Ready to eat?";
I'd like the term eat to return string2 as higher than string1. However, string1 scores higher:
string1.score('eat');
> 0.5261904761904762
string2.score('eat');
> 0.4477777777777778
Maybe I'm wrong in thinking string2 should score higher, and I'd love to hear arguments for that logic, if that is your logic. Otherwise, any ideas on a more contextual javascript matching algorithm?

If the score is not taking into account repetitions then only one occurrence of "eat" in string2 adds to the score so the other occurrences of "eat" are treated as unmatched garbage which counts against in the total score.
Many string similarity metrics behave this way, e.g. in Edit distance the more non-matching characters the lower the score and repetitions are treated as non-matching.
It's not clear to me from reading the source what algo it is using, but the score variables
var total_character_score = 0,
start_of_string_bonus,
abbreviation_score,
fuzzies=1,
final_score;
don't seem to take into account multiple repetitions.
If you want multiple occurrences to count, then it sounds like what you want is not a string-similarity algo, but a fuzzy match algo so you can find the number of matches.
Maybe yeti witch will work for you.

Develop Reference

JavaScript is the programming language of the Web.