How to convert selected columns of a CSV to Javascript Array - javascript

This is a small part of the CSV:
"LEADIN","Y","0.003","0.002","3","4.27","584.99","699.59","1162.36","1587.05","4.31","1","80","Small Rutting","Small Rutting","17.8","53.71785592","-2.56060898","173.1","FALSE","","04/11/2021","09:43:27","MFV_01","PG68BCU","HG","Y","1"
"LEADIN","Y","0.008","0.007","1.42","4.41","413.34","1237.43","306.49","2743.2","4.44","1","90","Small Rutting","Small Rutting","21.7","53.71789703","-2.56059787","172.9","FALSE","","04/11/2021","09:43:27","MFV_01","PG68BCU","HG","Y","6"
"LEADIN","Y","0.013","0.012","2.02","2.6","654.11","611.97","693.14","883.1","2.77","1","70","Small Rutting","Small Rutting","25.3","53.71794075","-2.56058166","172.7","FALSE","","04/11/2021","09:43:27","MFV_01","PG68BCU","HG","Y","11"
"LEADIN","Y","0.018","0.017","1.26","6.34","478.49","1054.13","337.17","3550.75","6.34","1","100","Small Rutting","Large Radius","29.8","53.7179844","-2.56056205","172.5","FALSE","","04/11/2021","09:43:27","MFV_01","PG68BCU","HG","Y","16"
"LEADIN","Y","0.023","0.022","5.72","2.6","682.96","1180.03","1959.48","1558.87","5.72","2","100","Short Radius - Single Rut","Small Rutting","34","53.71802785","-2.56053799","172.3","FALSE","","04/11/2021","09:43:27","MFV_01","PG68BCU","HG","Y","21"
"LEADIN","Y","0.028","0.027","2.76","2.29","734.58","959.17","1120.54","1196.8","2.95","2","60","Small Rutting","Small Rutting","37.8","53.71807003","-2.56050743","172.2","FALSE","","04/11/2021","09:43:27","MFV_01","PG68BCU","HG","Y","26"
"LEADIN","Y","0.033","0.032","1.88","2.7","758.48","738.18","812.85","1119.24","2.79","1","90","Small Rutting","Small Rutting","39.8","53.71811095","-2.56047369","171.9","FALSE","","04/11/2021","09:43:27","MFV_01","PG68BCU","HG","Y","31"
"LEADIN","Y","0.038","0.037","2.85","4.13","1124.35","1150.24","1531.35","2762.81","4.21","1","90","Small Rutting","Small Rutting","40.3","53.71815122","-2.56043949","171.7","FALSE","","04/11/2021","09:43:27","MFV_01","PG68BCU","HG","Y","36"
"LEADIN","Y","0.043","0.042","9.58","3.92","1861.02","1210.96","10202.89","2443.48","9.58","2","100","Large Radius","Small Rutting","41.4","53.71819101","-2.56040444","171.4","FALSE","","04/11/2021","09:43:27","MFV_01","PG68BCU","HG","Y","41"
I want columns 16 and 17 in array form.

A quick Google search turns up several examples of a CSVToArray function, such as this one: https://www.30secondsofcode.org/js/s/csv-to-array
const CSVToArray = (data, delimiter = ',', omitFirstRow = false) =>
data
.slice(omitFirstRow ? data.indexOf('\n') + 1 : 0)
.split('\n')
.map(v => v.split(delimiter));
I think you can easily adapt it to solve your request.
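One way that adaptation might look is sketched below. It assumes that every field is wrapped in double quotes, as in the sample, that no field contains a comma or a newline, and that "columns 16 and 17" are counted from 1. The helper name pickColumns is made up for this illustration, and the data is the first 18 fields of the first two sample rows.

```javascript
// Hypothetical helper: returns the requested 1-based columns of each CSV row.
// Assumes every field is double-quoted and contains no commas or newlines.
const pickColumns = (data, columns, delimiter = ',') =>
  data
    .split(/\r?\n/)
    .filter(line => line.trim() !== '') // skip blank lines
    .map(line => {
      const fields = line
        .split(delimiter)
        .map(field => field.replace(/^"|"$/g, '')); // strip the surrounding quotes
      return columns.map(n => fields[n - 1]);
    });

// First 18 fields of the first two rows from the question.
const csv = [
  '"LEADIN","Y","0.003","0.002","3","4.27","584.99","699.59","1162.36","1587.05","4.31","1","80","Small Rutting","Small Rutting","17.8","53.71785592","-2.56060898"',
  '"LEADIN","Y","0.008","0.007","1.42","4.41","413.34","1237.43","306.49","2743.2","4.44","1","90","Small Rutting","Small Rutting","21.7","53.71789703","-2.56059787"'
].join('\n');

console.log(pickColumns(csv, [16, 17]));
// → [ [ '17.8', '53.71785592' ], [ '21.7', '53.71789703' ] ]
```

The values come back as strings; map them through Number if you need numeric arrays. For CSV files whose fields may contain commas or escaped quotes, a real CSV parser is the safer choice.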

Related

get pairs of string from a string

I want from this string :
'Paris , Bruxelles , Amsterdam , Berlin'
Get this result in an array :
['Paris_Bruxelles' , 'Bruxelles_Amsterdam' , 'Amsterdam_Berlin' ]
Can anyone help me please ?
You could split the string and slice the array and get the pairs.
var string = 'Paris , Bruxelles , Amsterdam , Berlin',
array = string.split(/\s*,\s*/),
result = array.slice(1).map((s, i) => [array[i], s].join('_'));
console.log(result);
Basically in functional languages you are expected to use a zipWith function, where the accepted answer mimics that in JS with side effects.
However you may also mimic Haskell's pattern matching in JS and come up with a recursive solution without any side effects
var cities = "Paris , Bruxelles , Amsterdam , Berlin , Bayburt".split(/\s*,\s*/),
pairs = ([f,s,...rest]) => s == void 0 ? []
: [f + "_" + s].concat(pairs([s,...rest]));
console.log(pairs(cities));

Truncate Text with Pattern

I have a function that highlights text, given an array of matched index pairs and the text itself. I also want to truncate the parts of the text that don't contain a match; see the code below.
const highlight = (matchData, text) => {
var result = [];
var matches = [].concat(matchData);
var pair = matches.shift();
for (var i = 0; i < text.length; i++) {
var char = text.charAt(i);
if (pair && i == pair[0]) {
result.push("<u>");
}
result.push(char);
if (pair && i == pair[1]) {
result.push("</u>");
truncatedIndex = i;
pair = matches.shift();
}
}
return result.join("");
};
console.log(
highlight(
[[23, 29], [69, 74]],
"Some text that doesn't include the main thing, the main thing is the result, you may know I meant that"
)
);
// This returns the highlighted HTML - Result will be => "Some text that doesn't <u>include</u> the main thing, the main thing is the <u>result</u>, you may know I meant that"
But this returns the whole text. I want to truncate everything except the 20 characters before and after each match, so the result is clean but still understandable, like:
"... text that doesn't <u>include</u> the main thing ... the <u>result</u> you may know I ..."
I can't find out a way to make that. Help is appreciated.
I've modified your function considerably, to make it easier to understand, and so that it works...
Instead of using an array of arrays, which I find cumbersome to deal with, I modified to use an array of objects. The objects are simple:
{
start: 23,
end: 30
}
Basically, it just adds names to the indices you had previously.
The code should be relatively easy to follow. Here's a line-by-line explanation:
Armed with the new structure, you can use a simple substring command to snip the appropriate piece of text.
Since we're in a loop, and I don't want two sets of ellipses between matches, I check to see if we're on the first pass through and only add an ellipsis before the match on the first pass.
The text before the piece we've snipped is the 20 characters before the start of the match, or the number of characters to the beginning of the string. Math.max() provides an easy way of getting the highest index available.
The text after the piece we've snipped is the 20 characters after the end of the match, or the number of characters to the end of the string. Math.min() provides an easy way of getting the lowest index available.
Concatenating them together, we get the match's new text. I'm using template literals to make that easier to read than a bunch of + " " + and whatnot.
const highlight = (matches, text) => {
let newText = '';
matches.forEach((match) => {
const piece = text.substring(match.start, match.end);
const preEllipses = newText.length === 0 ? '... ' : '';
const textBefore = text.substring(Math.max(0, match.start - 20), match.start);
const textAfter = text.substring(match.end, Math.min(text.length, match.end + 20));
newText += `${preEllipses}${textBefore}<u>${piece}</u>${textAfter} ... `;
});
return newText.trim();
}
// Sample Usage
const result = highlight([{ start: 23, end: 30 }, { start: 69, end: 75 }], "Some text that doesn't include the main thing, the main thing is the result, you may know I meant that");
console.log(result);
document.getElementById("output").innerHTML = result;
// Result will be => "... e text that doesn't <u>include</u> the main thing, the ... e main thing is the <u>result</u>, you may know I mea ..."
<div id="output"></div>
Note that I am using simple string concatenation here, rather than putting parts into an array and using join. Modern JavaScript engines optimize string concatenation extremely well, to the point where it makes the most sense to just use it. See e.g., Most efficient way to concatenate strings in JavaScript?, and Dr. Axel Rauschmayer's post on 2ality.
Note
There's an update below that I think shows a better version of this same idea. But this is where it started.
Original Version
Here's another attempt, building a more flexible solution out of reusable parts.
const intoPairs = (xs) =>
xs .slice (1) .map ((x, i) => [xs[i], x])
const splitAtIndices = (indices, str) =>
intoPairs (indices) .map (([a, b]) => str .slice (a, b))
const alternate = Object.assign((f, g) => (xs, {START, MIDDLE, END} = alternate) =>
xs .map (
(x, i, a, pos = i == 0 ? START : i == a.length - 1 ? END : MIDDLE) =>
i % 2 == 0 ? f (x, pos) : g (x, pos)
),
{START: {}, MIDDLE: {}, END: {}}
)
const wrap = (before, after) => (s) => `${before}${s}${after}`
const truncate = (count) => (s, pos) =>
pos == alternate.START
? s .length <= count ? s : '... ' + s .slice (-count)
: pos == alternate.END
? s .length <= count ? s : s .slice (0, count) + ' ...'
: // alternate.MIDDLE
s .length <= (2 * count) ? s : s .slice (0, count) + ' ... ' + s .slice (-count)
const highlighter = (f, g) => (ranges, str, flip = ranges[0][0] == 0) =>
alternate (flip ? g : f, flip ? f : g) (
splitAtIndices ([...(flip ? [] : [0]), ...ranges .flat() .sort((a, b) => a - b), str.length], str)
) .join ('')
const highlight = highlighter (truncate (20), wrap('<u>', '</u>'))
#output {padding: 0 1em;} #input {padding: .5em 1em 0;} textarea {width: 50%; height: 3em;} button, input {vertical-align: top; margin-left: 1em;}
<div id="input"> <textarea id="string">Some text that doesn't include the main thing, the main thing is the result, you may know I meant that</textarea> <input type="text" id="indices" value="[23, 30], [69, 75]"/> <button id="run">Highlight</button></div><h4>Output</h4><div id="output"></div> <script>document.getElementById('run').onclick = (evt) => { const str = document.getElementById('string').value; const idxString = document.getElementById('indices').value; const idxs = JSON.parse(`[${idxString}]`); const result = highlight(idxs, str); console.clear(); document.getElementById('output').innerHTML = ''; setTimeout(() => { console.log(result); document.getElementById('output').innerHTML = result; }, 300)}</script>
This involves the helper functions intoPairs, splitAtIndices, alternate, wrap and truncate. I think they are best shown by examples:
intoPairs (['a', 'b', 'c', 'd']) //=> [['a', 'b'], ['b', 'c'], ['c', 'd']]
splitAtIndices ([0, 3, 7, 15], 'abcdefghijklmno') //=> ["abc", "defg", "hijklmno"]
//                              ^  ^   ^       ^        `---'  `----'  `--------'
//                              |  |   |       |          |      |         |
//                              0  3   7       15       0 - 3  4 - 7    8 - 15
alternate (f, g) ([a, b, c, d, e, ...]) //=> [f(a), g(b), f(c), g(d), f(e), ...]
wrap ('<div>', '</div>') ('foo bar baz') //=> '<div>foo bar baz</div>'
//chars---+   input---+    position---+        output--+
//        |           |               |                |
//        V           V               V                V
truncate (10) ('abcdefghijklmnop', ~START~) //=> '... ghijklmnop'
truncate (10) ('abcdefghijklmnop', ~END~) //=> 'abcdefghij ...'
truncate (10) ('abcdefghijklmnop', ~MIDDLE~) //=> 'abcdefghijklmnop'
truncate (10) ('abcdefghijklmnopqrstuvwxyz', ~MIDDLE~) //=> 'abcdefghij ... qrstuvwxyz'
All of these are potentially reusable, and I personally have intoPairs and wrap in my general utility library.
truncate is the only complex one, and that is mostly because it does triple duty, handling the first string, the last string, and all the others in three distinct manners. You first supply a count and then you give a string as well as the position (START, MIDDLE, END, stored as properties of alternate). For the first string, it includes an ellipsis (...) and the last count characters. For the last one, it includes the first count characters and an ellipsis. For the middle ones, if the length is no more than double count, it returns the whole thing; otherwise it includes the first count characters, an ellipsis and the last count characters. This behavior might be different from what you want.
The main function is highlighter. It accepts two functions. The first one is how you want to handle the non-highlighted sections. The second is for the highlighted ones. It returns the style function you were looking for, one that accepts an array of two-element arrays of numbers (the ranges) and your input string, returning a string with the highlighted ranges and the non-highlighted ranges.
We use it to generate the highlight function by passing it truncate (20) and wrap('<u>', '</u>').
The intermediate forms might make it clearer what's going on.
We start with these indices:
[[23, 30], [69, 75]]
and our 102-character string,
"Some text that doesn't include the main thing, the main thing is the result, you may know I meant that"
First we flatten the ranges, prepending a zero if the first range doesn't start there and appending the length of the string, to get this:
[0, 23, 30, 69, 75, 102]
We pass that to splitAtIndices, along with our string, to get
[
"Some text that doesn't ",
"include",
" the main thing, the main thing is the ",
"result",
", you may know I meant that"
]
Then we map the appropriate functions over each of these strings to get
[
"... e text that doesn't ",
"<u>include</u>",
" the main thing, the main thing is the ",
"<u>result</u>",
", you may know I mea ..."
]
and join those together to get our final results:
"... e text that doesn't <u>include</u> the main thing, the main thing is the <u>result</u>, you may know I mea ..."
I like the flexibility this offers. It's easy to alter the highlight strategy as well as how you handle the unhighlighted parts -- just pass a different function to highlighter. It's also a useful breakdown of the work into reusable parts.
But there are two things I don't like.
First, I'm not thrilled with the handling of middle unhighlighted sections. Of course it's easy to change; but I don't know what would be appropriate. You might, for instance, want to change the doubling applied to the count there. Or you might have an entirely different idea.
Second, truncate is dependent upon alternate. We have to somehow pass signals from alternate to the two functions supplied to it to let them know where we are. My first pass involved passing the index and the entire array (the Array.prototype.map signature) to those functions. But that felt too coupled. We could make START, MIDDLE, and END into module-local properties, but then alternate and truncate would not be reusable. I'm not going to go back and try it now, but I think a better solution might be to pass four functions to highlighter: the function for the highlighted sections, and one each for start, middle, and end positions of the non-highlighted ones.
Update
I did go ahead and try that alternative I mentioned, and I think this version is cleaner, with all the complexity located in the single function highlighter:
const intoPairs = (xs) =>
xs .slice (1) .map ((x, i) => [xs[i], x])
const splitAtIndices = (indices, str) =>
intoPairs (indices) .map (([a, b]) => str .slice (a, b))
const wrap = (before, after) => (s) => `${before}${s}${after}`
const truncateStart = (count) => (s) =>
s .length <= count ? s : '... ' + s .slice (-count)
const truncateMiddle = (count) => (s) =>
s .length <= (2 * count) ? s : s .slice (0, count) + ' ... ' + s .slice (-count)
const truncateEnd = (count) => (s) =>
s .length <= count ? s : s .slice (0, count) + ' ...'
const highlighter = (highlight, start, middle, end) =>
(ranges, str, flip = ranges[0][0] == 0) =>
splitAtIndices ([...(flip ? [] : [0]), ...ranges .flat() .sort((a, b) => a - b), str.length], str)
.map (
(s, i, a) =>
(flip
? (i % 2 == 0 ? highlight : i == a.length - 1 ? end : middle)
: (i == 0 ? start : i % 2 == 1 ? highlight : i == a.length - 1 ? end : middle)
) (s)
) .join ('')
const highlight = highlighter (
wrap('<u>', '</u>'),
truncateStart(20),
truncateMiddle(20),
truncateEnd(20)
)
console .log (
highlight (
[[23, 30], [69, 75]],
"Some text that doesn't include the main thing, the main thing is the result, you may know I meant that"
)
)
console .log (
highlight (
[[23, 30], [86, 92]],
"Some text that doesn't include the main thing, because you see, the main thing is the result, you may know I meant that"
)
)
There is some real complexity built into highlighter, but I think it's fairly intrinsic to the problem. On each iteration, we have to choose one of our four functions based on the index, the length of the array, and whether the first range started at zero. This bit here simply chooses the function based on all that:
(flip
? (i % 2 == 0 ? highlight : i == a.length - 1 ? end : middle)
: (i == 0 ? start : i % 2 == 1 ? highlight : i == a.length - 1 ? end : middle)
)
where the flip boolean simply reports whether the first range starts at 0, a is the array of substrings to handle, and i is the current index in the array. If you see a cleaner way of choosing the function, I'd love to know.
If we wanted to write a gloss for this sort of highlighting, we could easily write
const truncatingHighlighter = (count, start, end) =>
highlighter (
wrap(start, end),
truncateStart(count),
truncateMiddle(count),
truncateEnd(count)
)
const highlight = truncatingHighlighter (20, '<u>', '</u>')
I definitely think this is a superior solution.

Convert an array to tab-delimited Mailchimp file

I need to convert a JSON array to a tab-delimited version so that I can save it as a .txt file so it can be uploaded on Mailchimp.
I would need result like this:
"Date","Pupil","Grade"
"25 May","Bloggs, Fred","C"
"25 May","Doe, Jane","B"
"15 July","Bloggs, Fred","A"
I'm not sure whether this helps, but you can follow this structure and adjust the details to match your desired output (I have no idea where the date and grade should come from, so this is just an example):
var json = '...your json string here...',
objects = JSON.parse( json ),
output = [],
finalString = '';
for ( let item in objects )
output.push([
new Date,
objects[ item ].lastName + ', ' + objects[ item ].firstName,
objects[ item ].gender
]);
Update:
To build the final string for the .txt file, join each inner array of output with tabs and put each one on its own line:
output.forEach( v => finalString += v.join( "\t" ) + "\n" )
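Putting the two steps together, a minimal sketch might look like this. The shape of the JSON is not shown in the question, so the property names Date, Pupil and Grade are assumptions taken from the desired output, and toTabDelimited is a made-up helper name. It also assumes no value contains a tab or a newline.

```javascript
// Hypothetical helper: turns an array of flat objects into tab-delimited text,
// with a header line built from the column names.
// Assumes no value contains a tab or a newline.
const toTabDelimited = (rows, columns) =>
  [columns, ...rows.map(row => columns.map(key => String(row[key] ?? '')))]
    .map(fields => fields.join('\t'))
    .join('\n');

// Assumed input shape, using the values from the question's desired output.
const pupils = [
  { Date: '25 May', Pupil: 'Bloggs, Fred', Grade: 'C' },
  { Date: '25 May', Pupil: 'Doe, Jane', Grade: 'B' },
  { Date: '15 July', Pupil: 'Bloggs, Fred', Grade: 'A' }
];

const text = toTabDelimited(pupils, ['Date', 'Pupil', 'Grade']);
console.log(text);
```

Because the separator is a tab, the commas inside names such as "Bloggs, Fred" need no quoting.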

JavaScript - Matching alphanumeric patterns with RegExp

I'm new to RegExp and to JS in general (Coming from Python), so this might be an easy question:
I'm trying to code an algebraic calculator in Javascript that receives an algebraic equation as a string, e.g.,
string = 'x^2 + 30x -12 = 4x^2 - 12x + 30';
The algorithm is already able to break the string in a single list, with all values on the right side multiplied by -1 so I can equate it all to 0, however, one of the steps to solve the equation involves creating a hashtable/dictionary, having the variable as key.
The string above results in a list eq:
eq = ['x^2', '+30x', '-12', '-4x^2', '+12x', '-30'];
I'm currently planning on iterating through this list, and using RegExp to identify both variables and the respective multiplier, so I can create a hashTable/Dictionary that will allow me to simplify the equation, such as this one:
hashTable = {
'x^2': [1, -4],
'x': [30, 12],
' ': [-12]
}
I plan on using some kind of for loop to iterate through the array, applying a match on each string to get the values I need, but quite frankly I'm stumped.
I have already used RegExp to separate the string into the individual parts of the equation and to remove eventual spaces, but I can't imagine a way to separate -4 from x^2 in '-4x^2'.
You can try this:
(-?\d+)x\^\d+
When you execute the match function:
var res = "-4x^2".match(/(-?\d+)x\^\d+/)
You will get res as an array : [ "-4x^2", "-4" ]
You have your '-4' in res[1].
By adding another group on the second \d+ (numeric char), you can retrieve the x power.
var res = "-4x^2".match(/(-?\d+)x\^(\d+)/) //res = [ "-4x^2", "-4", "2" ]
Hope it helps
If you know that the key of the hash table is always at the end of the string (for example, x at the end of '4x', or x^2 at the end of '-4x^2'), you can get the numeric part of the expression like this:
var exp = '-4x^2'
exp.split('x^2')[0] // returns the string '-4'
I hope this is what you were looking for.
function splitTerm(term) {
var regex = /([+-]?)([0-9]*)?([a-z](\^[0-9]+)?)?/
var match = regex.exec(term);
return {
constant: parseInt((match[1] || '') + (match[2] || 1), 10),
variable: match[3]
}
}
splitTerm('x^2'); // => {constant: 1, variable: "x^2"}
splitTerm('+30x'); // => {constant: 30, variable: "x"}
splitTerm('-12'); // => {constant: -12, variable: undefined}
Additionally, these tools may help you analyze and understand regular expressions:
https://regexper.com/
https://regex101.com/
http://rick.measham.id.au/paste/explain.pl
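To get from individual terms to the hash table described in the question, the split can be combined with a reduce. This is only a sketch: it assumes integer coefficients and single-letter variables with an optional integer power, and the names splitTerm (rewritten here so the block is self-contained) and groupTerms are illustrative.

```javascript
// Splits one term such as '-4x^2' into an integer coefficient and a variable part.
// Assumes integer coefficients and a single lowercase variable with an optional power.
const splitTerm = (term) => {
  const match = /^([+-]?)(\d*)([a-z](?:\^\d+)?)?$/.exec(term);
  if (!match) throw new Error(`Unrecognised term: ${term}`);
  // Constants have no variable part; file them under ' ' as in the question.
  const [, sign, digits, variable = ' '] = match;
  return { coefficient: parseInt(sign + (digits || '1'), 10), variable };
};

// Groups the coefficients of all terms by their variable part.
const groupTerms = (terms) =>
  terms.reduce((table, term) => {
    const { coefficient, variable } = splitTerm(term);
    (table[variable] = table[variable] || []).push(coefficient);
    return table;
  }, {});

const eq = ['x^2', '+30x', '-12', '-4x^2', '+12x', '-30'];
console.log(groupTerms(eq));
// → { 'x^2': [ 1, -4 ], x: [ 30, 12 ], ' ': [ -12, -30 ] }
```

Summing each array then gives the simplified coefficient for that variable.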

Is there a way to measure string similarity in Google BigQuery

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.
It seems like it would be a neat function to have.
My case is that I need to compare the similarity of two URLs, as I want to be fairly sure they refer to the same article.
I can find examples using JavaScript, so maybe a UDF is the way to go, but I've not used UDFs at all (or JavaScript for that matter :) ).
I'm just wondering if there is a way using the existing regex functions, or if anyone could get me started with porting the JavaScript example into a UDF.
Any help much appreciated, thanks
EDIT: Adding some example code
So if i have a UDF defined as:
// distance function
function levenshteinDistance (row, emit) {
//if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
if (typeof row.inputA === 'undefined') {var myresult = 1};
if (typeof row.inputB === 'undefined') {var myresult = 1};
//if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};
var myresult = Math.min(
levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
) + 1;
emit({outputA: myresult})
}
bigquery.defineFunction(
'levenshteinDistance', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
levenshteinDistance // Reference to JavaScript UDF
);
// make a test function to test individual parts
function test(row, emit) {
if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
emit({outputA: x});
}
bigquery.defineFunction(
'test', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
test // Reference to JavaScript UDF
);
And if I try to test it with a query such as:
SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))
I get error:
Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39
Error Location: User-defined function
It seems like maybe row.inputA is not a string, or for some reason string functions can't work on it. I'm not sure if this is a type issue or something odd about which utilities the UDF can use by default.
Again any help much appreciated, thanks.
Ready to use shared UDFs - Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')
6 2 0
Soundex:
SELECT fhoffa.x.soundex('felipe')
, fhoffa.x.soundex('googgle')
, fhoffa.x.soundex('guugle')
F410 G240 G240
Fuzzy choose one:
SELECT fhoffa.x.fuzzy_extract_one('jony'
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
#, ['john', 'johnny', 'jonathan', 'jonas']
)
johnny
How-to:
https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83
If you're familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery using external libraries loaded from GCS.
Steps:
Download the javascript version of fuzzywuzzy (fuzzball)
Take the compiled file of the library: dist/fuzzball.umd.min.js and rename it to a clearer name (like fuzzball)
Upload it to a google cloud storage bucket
Create a temp function to use the lib in your query (set the path in OPTIONS to the relevant path)
CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
library="gs://my-bucket/fuzzball.js");
with data as (select "my_test_string" as a, "my_other_string" as b)
SELECT a, b, token_set_ratio(a, b) from data
Levenshtein via JS would be the way to go. You can use the algorithm to get the absolute string distance, or convert it to a percentage similarity by calculating (strlen - distance) / strlen, where strlen is the length of the longer string.
The easiest way to implement this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.
To invoke it, you'd then pass in the two URLs as columns aliased to 'a' and 'b':
SELECT a, b, distance
FROM
Levenshtein(
SELECT
some_url AS a, other_url AS b
FROM
your_table
)
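The distance function that such a UDF would wrap can be sketched in plain JavaScript as follows. This is an iterative, two-row version rather than the recursive one attempted in the question, and the names levenshtein and similarity are illustrative. Normalising by the longer string is an assumption; other normalisations exist. The wiring into BigQuery is not shown.

```javascript
// Iterative (two-row) Levenshtein distance between strings a and b.
const levenshtein = (a, b) => {
  // previous[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      current[j] = Math.min(previous[j] + 1, current[j - 1] + 1, substitution);
    }
    previous = current;
  }
  return previous[b.length];
};

// Similarity in [0, 1], normalised by the longer string (an assumption).
const similarity = (a, b) => {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : (longest - levenshtein(a, b)) / longest;
};

console.log(levenshtein('abc', 'abd')); // → 1
console.log(levenshtein('kitten', 'sitting')); // → 3
```

For URLs it may help to strip the scheme and any tracking parameters before comparing, since those differences inflate the distance without saying anything about the article.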
Below is a somewhat simpler version of the Hamming distance query, using WITH OFFSET instead of ROW_NUMBER() OVER():
#standardSQL
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings,
(SELECT COUNT(1)
FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
ON x = y AND a != b) hamming_distance
FROM Input
I did it like this:
CREATE TEMP FUNCTION trigram_similarity(a STRING, b STRING) AS (
(
WITH a_trigrams AS (
SELECT
DISTINCT tri_a
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(a), ''), [3,3])) AS tri_a
),
b_trigrams AS (
SELECT
DISTINCT tri_b
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(b), ''), [3,3])) AS tri_b
)
SELECT
COUNTIF(tri_b IS NOT NULL) / COUNT(*)
FROM
a_trigrams
LEFT JOIN b_trigrams ON tri_a = tri_b
)
);
Here is a comparison to Postgres's pg_trgm:
select trigram_similarity('saemus', 'seamus');
-- 0.25 vs. pg_trgm 0.272727
select trigram_similarity('shamus', 'seamus');
-- 0.5 vs. pg_trgm 0.4
I gave the same answer on How to perform trigram operations in Google BigQuery?
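For readers who want to check the arithmetic outside BigQuery, the same ratio can be sketched in JavaScript: the share of the distinct trigrams of a that also occur in b. The function names are illustrative, and returning 0 for strings shorter than three characters is an assumption (the SQL version would divide by zero there). It works on UTF-16 code units, which is fine for ASCII input such as the examples.

```javascript
// Distinct character trigrams of a lower-cased string.
const trigrams = (s) => {
  const lower = s.toLowerCase();
  const grams = new Set();
  for (let i = 0; i + 3 <= lower.length; i++) grams.add(lower.slice(i, i + 3));
  return grams;
};

// Share of a's distinct trigrams that also occur in b. Not symmetric in a and b.
const trigramSimilarity = (a, b) => {
  const gramsA = trigrams(a);
  const gramsB = trigrams(b);
  if (gramsA.size === 0) return 0; // assumption: too-short strings score 0
  let shared = 0;
  for (const gram of gramsA) if (gramsB.has(gram)) shared++;
  return shared / gramsA.size;
};

console.log(trigramSimilarity('saemus', 'seamus')); // → 0.25
console.log(trigramSimilarity('shamus', 'seamus')); // → 0.5
```

Both values match the results reported for the SQL function above.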
I couldn't find a direct answer to this, so I propose this solution, in standard SQL
#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
(
SELECT
SUM(counter) AS diff
FROM (
SELECT
CASE
WHEN X.value != Y.value THEN 1
ELSE 0
END AS counter
FROM (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(a, "")) AS value ) X
JOIN (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(b, "")) AS value ) Y
ON
X.row = Y.row )
)
);
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;
Compared to other solutions (like this one), it takes two strings (of the same length, following the definition for hamming distance) and outputs the expected distance.
While looking at Felipe's answer above, I worked on my own query and ended up with two versions, one which I called string approximation and another which I called string resemblance.
The first looks at the shortest distance between letters of the source string and the test string and returns a score between 0 and 1, where 1 is a complete match. It always scores based on the longer of the two strings. It turns out to return results similar to the Levenshtein distance.
#standardSql
CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
(select avg(best_result) from (
select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref,
case
when min(result) is null then 0
else 1 / (min(result) + 1)
end as best_result,
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)),
greatest(length(testString),length(sourceString))) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
)
);
The second is a variation of the first, where it will look at sequences of matching distances, so that a character matching at equal distance from the character preceding or following it will count as one point. This works quite well, better than string approximation but not quite as well as I would like to (see example output below).
#standardSQL
CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (
(
select avg(sequence)
from (
select ref,
if(array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.before))) > 0
or array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.after))) > 0
, 1, 0) as sequence
from (
select ref,
collection,
lag(collection) over (order by ref) as before,
lead(collection) over (order by ref) as after
from (
select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
array_agg(result ignore nulls) as collection
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)), null) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
) as comparison
)
)
);
Here is a sample query:
#standardSQL
with test_subjects as (
select 'benji' as name union all
select 'benjamin' union all
select 'benjamin alan artis' union all
select 'ben artis' union all
select 'artis benjamin'
)
select name, quick.stringApproximation('benjamin artis', name) as approximation, quick.stringResemblance('benjamin artis', name) as resemblance
from test_subjects
order by resemblance desc
This returns
+---------------------+---------------------+---------------------+
| name                | approximation       | resemblance         |
+---------------------+---------------------+---------------------+
| artis benjamin      | 0.2653061224489796  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| benjamin alan artis | 0.6078947368421053  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| ben artis           | 0.4142857142857142  | 0.7142857142857143  |
+---------------------+---------------------+---------------------+
| benjamin            | 0.6125850340136053  | 0.5714285714285714  |
+---------------------+---------------------+---------------------+
| benji               | 0.36269841269841263 | 0.28571428571428575 |
+---------------------+---------------------+---------------------+
Edited: updated the resemblance algorithm to improve results.
Try Flookup for Google Sheets... it's definitely faster than Levenshtein distance and it calculates percentage similarities right out of the box.
One Flookup function you might find useful is this:
FUZZYMATCH (string1, string2)
Parameter Details
string1: compares to string2.
string2: compares to string1.
The percentage similarity is then calculated based on these comparisons. Both parameters can be ranges.
I'm currently trying to optimise it for large data sets, so your feedback would be very welcome.
Edit: I'm the creator of Flookup.
