Finding duplicates within strings - javascript

I have been handed a project at work where I need to find duplicate pairings from multiple rows within a dataset. While the data set is much larger, the main portion revolves around the date of a training, the location of a training, and the names of the trainers. So every row of data has a date, a location, and then a comma separated list of names:
Date Location Names
1/13/2014 Seattle A, B, D
1/16/2014 Dallas C, D, E
1/20/2014 New York A, D
1/23/2014 Dallas C, E
1/27/2014 Seattle B, D
1/30/2014 Houston C, A, F
2/3/2014 Washington DC D, A, F
2/6/2014 Phoenix B, E
2/10/2014 Seattle C, B
2/13/2014 Miami A, B, E
2/17/2014 Miami C, D
2/20/2014 New York B, E, F
2/24/2014 Houston A, B, F
My goal is to find rows with similar pairings of names. For example, I would like to know that A & B were paired in Seattle on 1/13, Miami on 2/13, and Houston on 2/24, even though the third name is different in each occurrence. So instead of simply finding duplicates of the entire string of names, I would also like to find pairings among partial segments of the “Names” column.
Is this possible to do within Excel or would I need to use a programming language to accomplish the task?
While I can manually do this, it represents a lot of time that could be used towards other things. If there was a way that I could automate this it would make this portion of my task a lot simpler.
Thank you in advance for any assistance or advice on a way forward.

You can do it with VBA. The solution below assumes:
Your data is on the active sheet in columns A:C
Your results will be output in columns E:G
The output will be a list sorted by pairs, and then by dates, so you can easily see where pairs repeated.
The routine assumes no more than three trainers at a time, but could be modified to add more possible combinations.
Cities with just a single trainer will be ignored.
The routine uses a Class module to gather the information, and two Collections to process the data. It also makes use of the feature that collections will not allow addition of two items with the same key.
Class Module
Rename the Class Module: cPairs
Option Explicit
Private pTrainer1 As String
Private pTrainer2 As String
Private pCity As String
Private pDT As Date
Public Property Get Trainer1() As String
Trainer1 = pTrainer1
End Property
Public Property Let Trainer1(Value As String)
pTrainer1 = Value
End Property
Public Property Get Trainer2() As String
Trainer2 = pTrainer2
End Property
Public Property Let Trainer2(Value As String)
pTrainer2 = Value
End Property
Public Property Get City() As String
City = pCity
End Property
Public Property Let City(Value As String)
pCity = Value
End Property
Public Property Get DT() As Date
DT = pDT
End Property
Public Property Let DT(Value As Date)
pDT = Value
End Property
Regular Module
Option Explicit
Option Compare Text
Public cP As cPairs, colP As Collection
Public colCityPairs As Collection
Public vSrc As Variant
Public vRes() As Variant
Public rRes As Range
Public I As Long, J As Long
Public V As Variant
Public sKey As String
Sub FindPairs()
vSrc = Range("A1", Cells(Rows.Count, "C").End(xlUp))
Set colP = New Collection
Set colCityPairs = New Collection
'Collect Pairs
For I = 2 To UBound(vSrc)
V = Split(Replace(vSrc(I, 3), " ", ""), ",")
If UBound(V) >= 1 Then
'sort the pairs
SingleBubbleSort V
Select Case UBound(V)
Case 1
AddPairs V(0), V(1)
Case 2
AddPairs V(0), V(1)
AddPairs V(0), V(2)
AddPairs V(1), V(2)
End Select
End If
Next I
ReDim vRes(0 To colCityPairs.Count, 1 To 3)
vRes(0, 1) = "Date"
vRes(0, 2) = "Location"
vRes(0, 3) = "Pairs"
For I = 1 To colCityPairs.Count
With colCityPairs(I)
vRes(I, 1) = .DT
vRes(I, 2) = .City
vRes(I, 3) = .Trainer1 & ", " & .Trainer2
End With
Next I
Set rRes = Range("E1").Resize(UBound(vRes, 1) + 1, UBound(vRes, 2))
With rRes
.EntireColumn.Clear
.Value = vRes
With .Rows(1)
.HorizontalAlignment = xlCenter
.Font.Bold = True
End With
.Sort key1:=.Columns(3), order1:=xlAscending, key2:=.Columns(1), order2:=xlAscending, _
Header:=xlYes
.EntireColumn.AutoFit
V = VBA.Array(vbYellow, vbGreen)
J = 0
For I = 2 To rRes.Rows.Count
If rRes(I, 3) = rRes(I - 1, 3) Then
.Rows(I).Interior.Color = .Rows(I - 1).Interior.Color
Else
J = J + 1
.Rows(I).Interior.Color = V(J Mod 2)
End If
Next I
End With
End Sub
Sub AddPairs(T1, T2)
Set cP = New cPairs
With cP
.Trainer1 = T1
.Trainer2 = T2
.City = vSrc(I, 2)
.DT = vSrc(I, 1)
sKey = .Trainer1 & "|" & .Trainer2
On Error Resume Next
colP.Add cP, sKey
If Err.Number = 457 Then
Err.Clear
colCityPairs.Add colP(sKey), sKey & "|" & colP(sKey).DT & "|" & colP(sKey).City
colCityPairs.Add cP, sKey & "|" & .DT & "|" & .City
Else
If Err.Number <> 0 Then Stop
End If
On Error GoTo 0
End With
End Sub
Sub SingleBubbleSort(TempArray As Variant)
'copied directly from support.microsoft.com
Dim Temp As Variant
Dim I As Integer
Dim NoExchanges As Integer
' Loop until no more "exchanges" are made.
Do
NoExchanges = True
' Loop through each element in the array.
For I = LBound(TempArray) To UBound(TempArray) - 1
' If the element is greater than the element
' following it, exchange the two elements.
If TempArray(I) > TempArray(I + 1) Then
NoExchanges = False
Temp = TempArray(I)
TempArray(I) = TempArray(I + 1)
TempArray(I + 1) = Temp
End If
Next I
Loop While Not (NoExchanges)
End Sub
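Since the question is tagged javascript, the same pair-collection idea can also be sketched outside Excel. This is only a minimal sketch and not a port of the VBA above: the `rows` array and the `findRepeatedPairs` name are made up for illustration, and unlike the VBA routine it does not limit the number of trainers per row.

```javascript
// Hypothetical input: each row is [date, location, comma-separated names].
const rows = [
  ["1/13/2014", "Seattle", "A, B, D"],
  ["1/20/2014", "New York", "A, D"],
  ["2/13/2014", "Miami", "A, B, E"],
  ["2/24/2014", "Houston", "A, B, F"],
];

// Group every unordered pair of names by the rows in which it occurs,
// then keep only the pairs that occur more than once.
function findRepeatedPairs(rows) {
  const byPair = new Map();
  for (const [date, location, names] of rows) {
    // Split, trim and sort so that "B, A" and "A, B" produce the same key.
    const people = names.split(",").map((n) => n.trim()).filter(Boolean).sort();
    for (let i = 0; i < people.length; i++) {
      for (let j = i + 1; j < people.length; j++) {
        const key = people[i] + "|" + people[j];
        if (!byPair.has(key)) byPair.set(key, []);
        byPair.get(key).push({ date, location });
      }
    }
  }
  return new Map([...byPair].filter(([, hits]) => hits.length > 1));
}

for (const [pair, hits] of findRepeatedPairs(rows)) {
  console.log(pair, hits.map((h) => `${h.location} ${h.date}`).join("; "));
}
```

With the four sample rows, only the pairs A|B (three rows) and A|D (two rows) are reported; rows with a single name simply produce no pairs.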

OK. I got bored and did this whole thing in Python. I assume you are familiar with the language; either way, you should be able to get the following code to work on any computer with Python installed.
I have made a few assumptions. For instance, I have treated your example input as the exact input format.
A few things will mess up the program:
Ignoring case sensitivity. Beware of capital letters etc.
Having an input file that contains the header row "Date Location Names". Just remove it and keep only the data rows in the file. I got lazy and did not bother handling this.
A ton of other small stuff. Just do what the program asks you to do and don't enter funky input.
About the program:
It revolves around a dictionary with person names as keys. Each value in the dictionary is a set of tuples containing the places that person has been and on what date. By comparing these sets and taking their intersection, we can find the answer.
It is kind of messy since I took this as Python practice. I have not coded in Python for a while and I got a thrill out of doing it all without using objects. Just follow the "instructions" and keep the input file, which stores all the information, in the same folder as the code you are running.
As a side note, you might want to check that the program yields correct output.
If you have any questions, feel free to contact me.
def readWord(line, stringIndex):
    # Reads characters up to the next space, so a multi-word location
    # such as "New York" is not parsed correctly.
    word = ""
    while(line[stringIndex] != " "):
        word += line[stringIndex]
        stringIndex += 1
    return word, stringIndex
def removeSpacing(line, stringIndex):
    while(line[stringIndex] == " "):
        stringIndex += 1
    return stringIndex
def readPeople(line, stringIndex):
    # Names are assumed to be single characters separated by ", ".
    lineSize = len(line)
    people = []
    while(stringIndex < lineSize):
        people.append(line[stringIndex])
        stringIndex += 3
    return people
def readLine(travels, line):
    stringIndex = 0
    date, stringIndex = readWord(line, stringIndex)
    stringIndex = removeSpacing(line, stringIndex)
    location, stringIndex = readWord(line, stringIndex)
    stringIndex = removeSpacing(line, stringIndex)
    people = readPeople(line, stringIndex)
    for person in people:
        if(person not in travels.keys()):
            travels[person] = set()
        travels[person].add((date, location))
    return travels
def main():
    f = open(input("Enter filename (must be in same folder as this program code. For instance, name could be: testDocument.txt\n\n"))
    travels = dict()
    for line in f:
        travels = readLine(travels, line)
    print("\n\n\n\n PROGRAM RUNNING \n \n")
    while(True):
        persons = []
        userInput = "empty"
        while(userInput):
            userInput = input("Enter person name (Type Enter to finish typing names): ")
            if(userInput):
                persons.append(userInput)
        output = travels[persons[0]]
        for person in persons[1:]:
            output = output.intersection(travels[person])
        print("")
        for hit in output:
            print(hit)
        print("\nFINISHED WITH ONE RUN. STARTING NEW ONE\n")
main()
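For readers who would rather stay in JavaScript (the language the question is tagged with), the dictionary-of-sets idea described above can be sketched roughly as follows. The `trainings` data and both function names are invented for this illustration. JavaScript has no tuples and a `Set` compares objects by reference, so each (date, location) pair is stored as a single string here.

```javascript
// Hypothetical sample rows: [date, location, comma-separated names].
const trainings = [
  ["1/13/2014", "Seattle", "A, B, D"],
  ["1/16/2014", "Dallas", "C, D, E"],
  ["2/13/2014", "Miami", "A, B, E"],
  ["2/24/2014", "Houston", "A, B, F"],
];

// Map each person to the set of "date @ location" strings they attended.
function buildTravels(rows) {
  const travels = new Map();
  for (const [date, location, names] of rows) {
    for (const name of names.split(",").map((n) => n.trim())) {
      if (!travels.has(name)) travels.set(name, new Set());
      travels.get(name).add(`${date} @ ${location}`);
    }
  }
  return travels;
}

// Intersect the sets of all requested people; unknown names give no hits.
function sharedTrainings(travels, people) {
  const sets = people.map((p) => travels.get(p) || new Set());
  if (sets.length === 0) return [];
  return [...sets[0]].filter((t) => sets.every((s) => s.has(t)));
}

console.log(sharedTrainings(buildTravels(trainings), ["A", "B"]));
```

Because names are split on commas instead of being read character by character, multi-word locations and longer names are not a problem in this version.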

Related

Emojis to/from codepoints in Javascript

In a hybrid Android/Cordova game that I am creating I let users provide an identifier in the form of an Emoji + an alphanumeric - i.e. 0..9,A..Z,a..z - name. For example
🙋‍️Stackoverflow
Server-side the user identifiers are stored with the Emoji and Name parts separated, with only the Name part required to be unique. From time to time the game displays a "league table" so the user can see how well they are performing compared to other players. For this purpose the server sends back a sequence of ten "high score" values consisting of Emoji, Name and Score.
This is then presented to the user in a table with three columns - one each for Emoji, Name and Score. And this is where I have hit a slight problem. Initially I had quite naively assumed that I could figure out the Emoji by simply looking at handle.codePointAt(0). When it dawned on me that an Emoji could in fact be a sequence of one or more 16 bit Unicode values I changed my code as follows
Part I: Dissecting the user-supplied "handle"
var i = 0, username,
codepoints = [],
handle = "🙋‍️StackOverflow",
len = handle.length;
while ((i < len) && (255 < handle.codePointAt(i)))
{codepoints.push(handle.codePointAt(i));i += 2;}
username = handle.substring(codepoints.length + 1);
At this point I have the "dissected" handle with
codepoints = [128587, 8205, 65039];
username = 'Stackoverflow';
A note of explanation for the i += 2 and the use of handle.length above. This article suggests that
handle.codePointAt(n) will return the code point for the full surrogate pair if you hit the leading surrogate. In my case since the Emoji has to be first character the leading surrogates for the sequence of 16 bit Unicodes for the emoji are at 0,2,4....
From the same article I learnt that String.length in Javascript will return the number of 16 bit code units.
Part II: Regenerating the emojis for the "league table"
Suppose the league table data squirted back to the app by my servers has the entry {emoji: [128583, 8205, 65039],username:"Stackexchange",points:100} for the emoji character 🙇‍️. Now here is the bothersome thing. If I do
var origCP = [],
i = 0,
origEmoji = '🙇‍️',
origLen = origEmoji.length;
while ((i < origLen) && (255 < origEmoji.codePointAt(i)))
{origCP.push(origEmoji.codePointAt(i));i += 2;}
I get
origLen = 5, origCP = [128583, 8205, 65039]
However, if I regenerate the emoji from the provided data
var reEmoji = String.fromCodePoint.apply(String,[128583, 8205, 65039]),
reEmojiLen = reEmoji.length;
I get
reEmoji = '🙇‍️'
reEmojiLen = 4;
So while reEmoji has the correct emoji its reported length has mysteriously shrunk down to 4 code units in place of the original 5.
If I then extract code points from the regenerated emoji
var reCP = [],
i = 0;
while ((i < reEmojiLen) && (255 < reEmoji.codePointAt(i)))
{reCP.push(reEmoji.codePointAt(i));i += 2;}
which gives me
reCP = [128583, 8205];
Even curiouser, origEmoji.codePointAt(3) gives the trailing surrogate pair value of 9794, while reEmoji.codePointAt(3) gives the value of the next full surrogate pair, 65039.
I could at this point just say
Do I really care?
After all, I just want to show the league table emojis in a separate column, so as long as I am getting the right emoji the niceties of what is happening under the hood do not matter. However, this might well be storing up problems for the future.
Can anyone here shed any light on what is happening?
Emojis are more complicated than single characters: they come in "sequences", e.g. a zwj-sequence (which combines multiple emojis into one image) or a presentation sequence (which provides different variations of the same symbol), and some more; see tr51 for all the nasty details.
If you "dump" your string like this
str = "🙋‍️StackOverflow"
console.log(...[...str].map(x => x.codePointAt(0).toString(16)))
you'll see that it's actually an (incorrectly formed) zwj-sequence wrapped in a presentation sequence.
So, to slice emojis accurately, you need to iterate the string as an array of codepoints (not units!) and extract plane 1 CPs (>0xffff) + ZWJ's + variation selectors. Example:
function sliceEmoji(str) {
let res = ['', ''];
for (let c of str) {
let n = c.codePointAt(0);
let isEmoji = n > 0xfff || n === 0x200d || (0xfe00 <= n && n <= 0xfeff);
res[1 - isEmoji] += c;
}
return res;
}
function hex(str) {
return [...str].map(x => x.codePointAt(0).toString(16))
}
myStr = "🙋‍️StackOverflow"
console.log(sliceEmoji(myStr))
console.log(sliceEmoji(myStr).map(hex))
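The numbers reported in the question can be reproduced with a small experiment. It assumes the original emoji was the sequence U+1F647 U+200D U+2642 U+FE0F, which is consistent with the reported length of 5 and the value 9794 (0x2642) at index 3; the emoji is written with escapes so that no invisible characters get lost when copying.

```javascript
// Assumed original emoji: U+1F647 U+200D U+2642 U+FE0F (5 UTF-16 code units).
const origEmoji = "\u{1F647}\u200D\u2642\uFE0F";

// Spreading a string yields whole code points, so no manual "i += 2" is needed.
const codePoints = [...origEmoji].map((ch) => ch.codePointAt(0));
console.log(origEmoji.length, codePoints); // 5 and [128583, 8205, 9794, 65039]

// Stepping by 2 visits indexes 0, 2 and 4 and never looks at index 3,
// so the code point 9794 is silently dropped.
const stepped = [];
for (let i = 0; i < origEmoji.length; i += 2) {
  stepped.push(origEmoji.codePointAt(i));
}
console.log(stepped); // [128583, 8205, 65039]

// Rebuilding from the incomplete list gives a string one code unit shorter.
const rebuilt = String.fromCodePoint(...stepped);
console.log(rebuilt.length); // 4

// Rebuilding from the complete list round-trips exactly.
console.log(String.fromCodePoint(...codePoints) === origEmoji); // true
```

Under that assumption, the "shrinking" length is explained by the extraction loop rather than by String.fromCodePoint: only the first code point occupies two code units, so stepping by two skips every other BMP code point that follows it.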

String.split function is not working properly with some text

I have a string parser in Node.js. The input string comes from a Telegram channel.
Now I have a serious problem with the String.split function.
It works with some text but not with other text.
When I receive a string in Telegram that was not processed, I copy it and send it to the channel again.
In that case, the parser processes it correctly.
Is there any advice for this issue?
let teams = [];
teamSeps =[" vs ", " v ", " - ", " x " ,"-", " -"];
for(let i = 0; i< teamSeps.length; i++){
teams = newTip.Match.toLowerCase().split(teamSeps[i]);
if(teams.length === 2) break;
}
newTip.Home = teams[0].trim();
newTip.Away = teams[1].trim();
return;
Instead of adding multiple options with optional spaces on either side of -, you can use a single regex with some alternation.
/\s*-\s*|\s+(?:vs|v|x)\s+/
\s*-\s*: Allows optional space around -
\s+(?:vs|v|x)\s+: Requires at least one space on each side of vs, v or x (otherwise, any x or v inside a team name would cause a split)
function customSplit(str) {
return str.split(/\s*-\s*|\s+(?:vs|v|x)\s+/);
}
console.log(customSplit("Man United vs Man City"))
console.log(customSplit("France - Croatia"))
console.log(customSplit("Belgium-England"))
console.log(customSplit("Liverpool x Spurs"))
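The question does not show a failing input, so the cause can only be guessed at. One plausible explanation is that some messages contain non-breaking spaces (U+00A0) instead of ordinary spaces: a literal separator such as " vs " does not match them, while `\s` in a JavaScript regex does. A small sketch of the asker's Home/Away logic on top of the regex, where `parseMatch` is a made-up helper name:

```javascript
const separator = /\s*-\s*|\s+(?:vs|v|x)\s+/;

// Returns { home, away }, or null when the text does not split into two teams.
function parseMatch(match) {
  const teams = match.toLowerCase().split(separator).map((t) => t.trim());
  return teams.length === 2 ? { home: teams[0], away: teams[1] } : null;
}

// "\u00a0" is a non-breaking space; the literal " vs " separator misses it...
console.log("man united\u00a0vs\u00a0man city".split(" vs ").length); // 1

// ...but the regex version still finds both teams.
console.log(parseMatch("Man United\u00a0vs\u00a0Man City"));
console.log(parseMatch("Belgium-England"));
```

Returning null instead of reading teams[1] unconditionally also avoids a TypeError when no separator matches.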

Look for substring in a string with at most one different character-javascript

I am new to programming and right now I am working on a program. The program needs to find a substring in a string and return the index where the match starts. I know that I can use "indexOf" for that, but it is not so easy: I want to find substrings with at most one different character.
I was thinking about regular expressions, but I do not really know how to use them here, because I would need a regular expression for every element of the string. Here is some code which will probably clarify what I want to do:
var A= "abbab";
var B= "ba";
var tb=[];
console.log(A.indexOf(B));
for (var i=0;i<B.length; i++){
var D=B.replace(B[i],"[a-z]");
tb.push(A.indexOf(D));
}
console.log(tb);
I know that the substring B and the string A consist of lowercase letters. It would be nice to get any advice on how to do this using regular expressions. Thanks.
Sample Input:
A B
1) abbab ba
2) hello world
3) banana nan
Expected Output:
1) 1 2
2) No Match!
3) 0 2
While probably theoretically possible, I think it would be very complicated to try this kind of search while attempting to incorporate all possible search query options in one long complex regular expression. I think a better approach is to use JavaScript to dynamically create various simpler options and then search with each separately.
The following code sequentially replaces each character in the initial query string with a regular expression wild card (i.e. a period, '.') and then searches the target string with that. For example, if the initial query string is 'nan', it will search with '.an', 'n.n' and 'na.'. It will only add the position of the hit to the list of hits if that position has not already been hit on a previous search. i.e. It ensures that the list of hits contains only unique values, even if multiple query variations found a hit at the same location. (This could be implemented even better with ES6 sets, but I couldn't get the Stack Overflow code snippet tool to cooperate with me while trying to use a set, even with the Babel option checked.) Finally, it sorts the hits in ascending order.
Update: The search algorithm has been updated/corrected. Originally, some hits were missed because the exec search for any query variation would only iterate as per the JavaScript default, i.e. after finding a match, it would start the next search at the next character after the end of the previous match, e.g. it would find 'aa' in 'aaaa' at positions 0 and 2. Now it starts the next search at the next character after the start of the previous match, e.g. it now finds 'aa' in 'aaaa' at positions 0, 1 and 2.
const findAllowingOneMismatch = (target, query) => {
const numLetters = query.length;
const queryVariations = [];
for (let variationNum = 0; variationNum < numLetters; variationNum += 1) {
queryVariations.push(query.slice(0, variationNum) + "." + query.slice(variationNum + 1));
};
let hits = [];
queryVariations.forEach(queryVariation => {
const re = new RegExp(queryVariation, "g");
let searchResult;
while ((searchResult = re.exec(target)) !== null) {
re.lastIndex = searchResult.index + 1;
const hit = searchResult.index;
// console.log('found a hit with ' + queryVariation + ' at position ' + hit);
if (hits.indexOf(hit) === -1) {
hits.push(searchResult.index);
}
}
});
hits = hits.sort((a,b)=>(a-b));
console.log('Found "' + query + '" in "' + target + '" at positions:', JSON.stringify(hits));
};
[
['abbab', 'ba'],
['hello', 'world'],
['banana', 'nan'],
['abcde abcxe abxxe xbcde', 'abcd'],
['--xx-xxx--x----x-x-xxx--x--x-x-xx-', '----']
].forEach(pair => {findAllowingOneMismatch(pair[0], pair[1])});
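As an alternative to building one regular expression per position, the same result can be obtained by comparing characters directly. The following is only a sketch of that different technique, not the approach used above; the function name and the `maxMismatches` parameter are assumptions added for illustration.

```javascript
// Returns every start index where `query` matches `target` with at most
// `maxMismatches` differing characters (assumes a non-empty query).
function findWithMismatches(target, query, maxMismatches = 1) {
  const hits = [];
  for (let start = 0; start + query.length <= target.length; start++) {
    let mismatches = 0;
    // Stop comparing this window as soon as the mismatch budget is exceeded.
    for (let k = 0; k < query.length && mismatches <= maxMismatches; k++) {
      if (target[start + k] !== query[k]) mismatches++;
    }
    if (mismatches <= maxMismatches) hits.push(start);
  }
  return hits;
}

console.log(findWithMismatches("abbab", "ba"));    // [1, 2]
console.log(findWithMismatches("hello", "world")); // []
console.log(findWithMismatches("banana", "nan"));  // [0, 2]
```

Each start index is visited exactly once, so the hits are unique and already sorted, and characters that are special in regular expressions need no escaping.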

Is there a way to measure string similarity in Google BigQuery

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.
It seems like it would be a neat function to have.
In my case, I need to compare the similarity of two URLs, as I want to be fairly sure they refer to the same article.
I can find examples using JavaScript, so maybe a UDF is the way to go, but I've not used UDFs at all (or JavaScript for that matter :) ).
I'm just wondering if there may be a way using existing regex functions, or if anyone might be able to get me started with porting the JavaScript example into a UDF.
Any help much appreciated, thanks.
EDIT: Adding some example code
So if i have a UDF defined as:
// distance function
function levenshteinDistance (row, emit) {
//if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
if (typeof row.inputA === 'undefined') {var myresult = 1};
if (typeof row.inputB === 'undefined') {var myresult = 1};
//if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};
var myresult = Math.min(
levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
) + 1;
emit({outputA: myresult})
}
bigquery.defineFunction(
'levenshteinDistance', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
levenshteinDistance // Reference to JavaScript UDF
);
// make a test function to test individual parts
function test(row, emit) {
if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
emit({outputA: x});
}
bigquery.defineFunction(
'test', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
test // Reference to JavaScript UDF
);
And when I test it with a query such as:
SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))
I get error:
Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39
Error Location: User-defined function
It seems like maybe row.inputA is not a string, or for some reason string functions are not able to work on it. I am not sure if this is a type issue or something funny about which utilities the UDF can use by default.
Again any help much appreciated, thanks.
Ready to use shared UDFs - Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')
6 2 0
Soundex:
SELECT fhoffa.x.soundex('felipe')
, fhoffa.x.soundex('googgle')
, fhoffa.x.soundex('guugle')
F410 G240 G240
Fuzzy choose one:
SELECT fhoffa.x.fuzzy_extract_one('jony'
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
#, ['john', 'johnny', 'jonathan', 'jonas']
)
johnny
How-to:
https://medium.com/#hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83
If you're familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery using external libraries loaded from GCS.
Steps:
Download the javascript version of fuzzywuzzy (fuzzball)
Take the compiled file of the library: dist/fuzzball.umd.min.js and rename it to a clearer name (like fuzzball)
Upload it to a google cloud storage bucket
Create a temp function to use the lib in your query (set the path in OPTIONS to the relevant path)
CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
library="gs://my-bucket/fuzzball.js");
with data as (select "my_test_string" as a, "my_other_string" as b)
SELECT a, b, token_set_ratio(a, b) from data
Levenshtein via JS would be the way to go. You can use the algorithm to get the absolute string distance, or convert it to a percentage similarity by calculating abs((strlen - distance) / strlen).
The easiest way to implement this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.
To invoke it, you'd then pass in the two URLs as columns aliased to 'a' and 'b':
SELECT a, b, distance
FROM
Levenshtein(
SELECT
some_url AS a, other_url AS b
FROM
your_table
)
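The JavaScript side of such a UDF could look roughly like the sketch below. It is plain JavaScript with no BigQuery-specific calls, so it would still need to be wrapped in whichever UDF mechanism you use (for example the body of a CREATE TEMP FUNCTION ... LANGUAGE js, as in the fuzzball answer in this thread). The function names, and the choice to normalise by the longer string, are assumptions.

```javascript
// Iterative dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a, b) {
  // prev[j] is the distance between the empty prefix of a and b.slice(0, j).
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1], normalised by the longer string (an assumed choice).
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : (longest - levenshtein(a, b)) / longest;
}

console.log(levenshtein("abc", "abd"), similarity("abc", "abd"));
```

Unlike the recursive attempt in the question, this version takes two plain strings, never calls itself, and runs in time proportional to the product of the two lengths.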
Below is a much simpler version for Hamming distance that uses WITH OFFSET instead of ROW_NUMBER() OVER()
#standardSQL
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings,
(SELECT COUNT(1)
FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
ON x = y AND a != b) hamming_distance
FROM Input
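For comparison, or for checking the SQL results locally, the same position-by-position count is only a few lines of JavaScript. This is a sketch with a made-up function name; unlike the join above, which only compares positions present in both strings, it rejects inputs of different lengths.

```javascript
// Counts the positions at which two equal-length strings differ.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new RangeError("Hamming distance needs strings of equal length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

// Same sample strings as in the query above, compared against 'abcdef'.
for (const s of ["abcdef", "defdef", "1bcdef", "1bcde4", "123de4", "abc123"]) {
  console.log(s, hammingDistance("abcdef", s));
}
```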
I did it like this:
CREATE TEMP FUNCTION trigram_similarity(a STRING, b STRING) AS (
(
WITH a_trigrams AS (
SELECT
DISTINCT tri_a
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(a), ''), [3,3])) AS tri_a
),
b_trigrams AS (
SELECT
DISTINCT tri_b
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(b), ''), [3,3])) AS tri_b
)
SELECT
COUNTIF(tri_b IS NOT NULL) / COUNT(*)
FROM
a_trigrams
LEFT JOIN b_trigrams ON tri_a = tri_b
)
);
Here is a comparison to Postgres's pg_trgm:
select trigram_similarity('saemus', 'seamus');
-- 0.25 vs. pg_trgm 0.272727
select trigram_similarity('shamus', 'seamus');
-- 0.5 vs. pg_trgm 0.4
I gave the same answer on How to perform trigram operations in Google BigQuery?
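To make the arithmetic behind those numbers easier to follow, here is a rough JavaScript equivalent of the SQL function. It is a sketch under assumptions: the function names are invented, and returning 0 for inputs shorter than three characters is an arbitrary choice made for this example.

```javascript
// Distinct lower-cased character trigrams of a string.
function trigrams(s) {
  const chars = [...s.toLowerCase()];
  const out = new Set();
  for (let i = 0; i + 3 <= chars.length; i++) {
    out.add(chars.slice(i, i + 3).join(""));
  }
  return out;
}

// Share of a's trigrams that also occur in b. Like the LEFT JOIN in the SQL,
// this is asymmetric: the denominator is the trigram count of `a` only.
function trigramSimilarity(a, b) {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0) return 0; // assumed behaviour for very short inputs
  let shared = 0;
  for (const t of ta) {
    if (tb.has(t)) shared++;
  }
  return shared / ta.size;
}

console.log(trigramSimilarity("saemus", "seamus")); // 0.25
console.log(trigramSimilarity("shamus", "seamus")); // 0.5
```

For 'saemus' the trigrams are sae, aem, emu and mus, of which only mus also occurs in 'seamus', giving 1/4. For 'shamus' both amu and mus are shared, giving 2/4.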
I couldn't find a direct answer to this, so I propose this solution, in standard SQL
#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
(
SELECT
SUM(counter) AS diff
FROM (
SELECT
CASE
WHEN X.value != Y.value THEN 1
ELSE 0
END AS counter
FROM (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(a, "")) AS value ) X
JOIN (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(b, "")) AS value ) Y
ON
X.row = Y.row )
)
);
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;
Compared to other solutions (like this one), it takes two strings (of the same length, following the definition for hamming distance) and outputs the expected distance.
While looking for an answer like Felipe's above, I worked on my own query and ended up with two versions, one which I called string approximation and another which I called string resemblance.
The first looks at the shortest distance between letters of the source string and the test string and returns a score between 0 and 1, where 1 is a complete match. It always scores based on the longer of the two strings. It turns out to return results similar to the Levenshtein distance.
#standardSql
CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
(select avg(best_result) from (
select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref,
case
when min(result) is null then 0
else 1 / (min(result) + 1)
end as best_result,
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)),
greatest(length(testString),length(sourceString))) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
)
);
The second is a variation of the first, where it will look at sequences of matching distances, so that a character matching at equal distance from the character preceding or following it will count as one point. This works quite well, better than string approximation but not quite as well as I would like to (see example output below).
#standardSQL
CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (
(
select avg(sequence)
from (
select ref,
if(array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.before))) > 0
or array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.after))) > 0
, 1, 0) as sequence
from (
select ref,
collection,
lag(collection) over (order by ref) as before,
lead(collection) over (order by ref) as after
from (
select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
array_agg(result ignore nulls) as collection
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)), null) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
) as comparison
)
)
);
Now here is a sample of result:
#standardSQL
with test_subjects as (
select 'benji' as name union all
select 'benjamin' union all
select 'benjamin alan artis' union all
select 'ben artis' union all
select 'artis benjamin'
)
select name, quick.stringApproximation('benjamin artis', name) as approximation, quick.stringResemblance('benjamin artis', name) as resemblance
from test_subjects
order by resemblance desc
This returns
+---------------------+--------------------+--------------------+
| name                | approximation      | resemblance        |
+---------------------+--------------------+--------------------+
| artis benjamin      | 0.2653061224489796 | 0.8947368421052629 |
+---------------------+--------------------+--------------------+
| benjamin alan artis | 0.6078947368421053 | 0.8947368421052629 |
+---------------------+--------------------+--------------------+
| ben artis           | 0.4142857142857142 | 0.7142857142857143 |
+---------------------+--------------------+--------------------+
| benjamin            | 0.6125850340136053 | 0.5714285714285714 |
+---------------------+--------------------+--------------------+
| benji               | 0.36269841269841263| 0.28571428571428575|
+---------------------+--------------------+--------------------+
Edited: updated the resemblance algorithm to improve results.
Try Flookup for Google Sheets... it's definitely faster than Levenshtein distance and it calculates percentage similarities right out of the box.
One Flookup function you might find useful is this:
FUZZYMATCH (string1, string2)
Parameter Details
string1: compares to string2.
string2: compares to string1.
The percentage similarity is then calculated based on these comparisons. Both parameters can be ranges.
I'm currently trying to optimise it for large data sets, so your feedback would be very welcome.
Edit: I'm the creator of Flookup.

Doing assignment in VBscript now.. Need to give positions of each "e" in a string

I've done this in JavaScript but needless to say I can't just swap it over.
In Jscript I used this:
var estr = tx_val
index = 0
positions = []
while((index = estr.indexOf("e", index + 1)) != -1)
{
positions.push(index);
}
document.getElementById('ans6').innerHTML = "Locations of 'e' in string the are: "
+ positions;
I tried using the same logic with VBS terms, ie join, I also tried using InStr. I'm just not sure how to yank out that 'e'... Maybe I'll try replacing it with another character.
Here is what I tried with VBScript. I tried using InStr and Replace to yank out the first occurrence of 'e' in each loop and replace it with an 'x'. I thought that maybe this would make the next pass through the loop give the location of the next 'e'. When I don't get a "subscript out of range: 'i'" error, I only get one location back from the script, and it's 0.
(6) show the location of each occurrence of the character "e" in the string "tx_val" in the span block with id="ans6"
countArr = array()
countArr = split(tx_val)
estr = tx_val
outhtml = ""
positions = array()
i=0
for each word in countArr
i= i+1
positions(i) = InStr(1,estr,"e",1)
estr = replace(estr,"e","x",1,1)
next
document.getElementById("ans6").innerHTML = "E is located at: " & positions
What can I do that is simpler than this and works? and thank you in advance, you all help a lot.
EDIT AGAIN: I finally got it working right. I'm not 100% sure how. But I ran through the logic in my head a few dozen times before I wrote it, and after a few kinks it works.
local = ""
simon = tx_val
place=(InStr(1,simon,"e"))
i=(len(simon))
count = tx_val
do
local = (local & " " & (InStr((place),simon,"e")))
place = InStr((place+1),simon,"e")
count = (InStr(1,simon,"e"))
loop while place <> 0
document.getElementById("ans6").innerHTML= local
InStr has slightly different parameters to indexOf:
InStr([start, ]string, searchValue[, compare])
start: The index at which to start searching
string: The string to search
searchValue: The string to search for
Also note that Visual Basic indexes strings beginning at 1 so all the input and return index values are 1 more than the original JavaScript.
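To make the off-by-one relationship concrete, here is a small JavaScript sketch (the `positionsOf` name is made up) that collects every match and reports it as the 1-based position InStr would return. It starts searching at index 0, so a match on the very first character is found too.

```javascript
// Returns the 1-based positions of every occurrence of `search` in `text`.
function positionsOf(text, search) {
  const positions = [];
  if (search === "") return positions; // avoid looping forever on an empty search
  let index = text.indexOf(search);    // 0-based, -1 when there is no match
  while (index !== -1) {
    positions.push(index + 1);         // +1 gives the position InStr would report
    index = text.indexOf(search, index + 1);
  }
  return positions;
}

console.log(positionsOf("eleven elves", "e")); // [1, 3, 5, 8, 11]
```

The loop has the same shape as a VBScript loop that calls InStr with a start argument of the previous position plus 1 until it returns 0.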
You can try split(). For example a simple string like this:
string = "thisismystring"
Split on "s", so we have
mystring = Split(string,"s")
So in the array mystring, we have
thi i my tring
^ ^ ^ ^
[0] [1] [2] [3]
All you have to do is check the length of each array item using Len(). For example, item 0 has a length of 3 (thi), so the first "s" is at position 4 (which is index 3). Take note of this, and repeat for the next item: item 1 has a length of 1, so adding that plus the length of the delimiter (1) to 4 gives position 6 (index 5) for the second "s", and so on.
Update: here's an example using VBScript
thestring = "thisismystring"
delimiter="str"
mystring = Split(thestring,delimiter)
c=0
For i=0 To UBound(mystring)-1
c = c + Len(mystring(i)) + Len(delimiter)
WScript.Echo "index of str is: " & c - Len(delimiter)
Next
Trial:
C:\test> cscript //nologo test.vbs
index of str is: 8
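The same running-offset bookkeeping, written in JavaScript for comparison with the original indexOf loop. This is a sketch: `indexesViaSplit` is a hypothetical name, and it assumes a non-empty delimiter. It reports 0-based indexes, matching the output of the script above.

```javascript
// Returns the 0-based index of every occurrence of `delimiter` in `text`,
// computed from the lengths of the pieces produced by split().
function indexesViaSplit(text, delimiter) {
  const parts = text.split(delimiter);
  const indexes = [];
  let offset = 0;
  // The last part has no delimiter after it, so stop one short.
  for (let i = 0; i < parts.length - 1; i++) {
    offset += parts[i].length;  // move past the text before the delimiter
    indexes.push(offset);       // the delimiter starts here
    offset += delimiter.length; // move past the delimiter itself
  }
  return indexes;
}

console.log(indexesViaSplit("thisismystring", "s"));   // [3, 5, 8]
console.log(indexesViaSplit("thisismystring", "str")); // [8]
```

Add 1 to each result to get the 1-based positions that InStr reports.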
