Is there a way to measure string similarity in Google BigQuery - javascript

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.
It seems like it would be a neat function to have.
In my case, I need to compare the similarity of two URLs, as I want to be fairly sure they refer to the same article.
I can find examples using JavaScript, so maybe a UDF is the way to go, but I've not used UDFs at all (or JavaScript, for that matter :) ).
I'm just wondering if there may be a way using the existing regex functions, or if anyone might be able to get me started with porting the JavaScript example into a UDF.
Any help much appreciated, thanks.
EDIT: Adding some example code
So if i have a UDF defined as:
// distance function
function levenshteinDistance (row, emit) {
//if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
if (typeof row.inputA === 'undefined') {var myresult = 1};
if (typeof row.inputB === 'undefined') {var myresult = 1};
//if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};
var myresult = Math.min(
levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
) + 1;
emit({outputA: myresult})
}
bigquery.defineFunction(
'levenshteinDistance', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
levenshteinDistance // Reference to JavaScript UDF
);
// make a test function to test individual parts
function test(row, emit) {
if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
emit({outputA: x});
}
bigquery.defineFunction(
'test', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
test // Reference to JavaScript UDF
);
And I try to test it with a query such as:
SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))
I get error:
Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39
Error Location: User-defined function
It seems that row.inputA is perhaps not a string, or that for some reason the string functions are not able to work on it. I'm not sure if this is a type issue or something odd about which utilities a UDF is able to use by default.
Again any help much appreciated, thanks.

Ready to use shared UDFs - Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')
6 2 0
Soundex:
SELECT fhoffa.x.soundex('felipe')
, fhoffa.x.soundex('googgle')
, fhoffa.x.soundex('guugle')
F410 G240 G240
Fuzzy choose one:
SELECT fhoffa.x.fuzzy_extract_one('jony'
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
#, ['john', 'johnny', 'jonathan', 'jonas']
)
johnny
How-to:
https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83

If you're familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery using external libraries loaded from GCS.
Steps:
Download the javascript version of fuzzywuzzy (fuzzball)
Take the compiled file of the library: dist/fuzzball.umd.min.js and rename it to a clearer name (like fuzzball)
Upload it to a google cloud storage bucket
Create a temp function to use the lib in your query (set the path in OPTIONS to the relevant path)
CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
library="gs://my-bucket/fuzzball.js");
with data as (select "my_test_string" as a, "my_other_string" as b)
SELECT a, b, token_set_ratio(a, b) from data

Levenshtein via JS would be the way to go. You can use the algorithm to get the absolute string distance, or convert it to a percentage similarity by calculating (strlen - distance) / strlen, where strlen is, for example, the length of the longer of the two strings.
The easiest way to implement this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.
To invoke it, you'd then pass in the two URLs as columns aliased to 'a' and 'b':
SELECT a, b, distance
FROM
Levenshtein(
SELECT
some_url AS a, other_url AS b
FROM
your_table
)
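As a rough illustration of the idea above, here is a minimal sketch of an iterative Levenshtein distance in plain JavaScript, together with a similarity score. The function names and the choice to normalise by the longer string's length are assumptions made for this example, not part of any BigQuery API.

```javascript
// Iterative (row-by-row) Levenshtein distance between two strings.
// Returns null when either input is missing, instead of throwing.
function levenshteinDistance(a, b) {
  if (a == null || b == null) return null;
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // deleting i characters turns a[0..i) into ""
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; assumes normalising by the longer string's length.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings are identical
  return (longest - levenshteinDistance(a, b)) / longest;
}

console.log(levenshteinDistance("abc", "abd")); // 1
console.log(similarity("abc", "abd"));
```

Unlike a naive recursive version, this takes plain strings rather than a row object and runs in O(len(a) * len(b)) time. The two function bodies could presumably be placed inside a JavaScript UDF, but that wiring is not shown here.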

Below is a simpler version of the Hamming distance that uses WITH OFFSET instead of ROW_NUMBER() OVER():
#standardSQL
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings,
(SELECT COUNT(1)
FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
ON x = y AND a != b) hamming_distance
FROM Input

I did it like this:
CREATE TEMP FUNCTION trigram_similarity(a STRING, b STRING) AS (
(
WITH a_trigrams AS (
SELECT
DISTINCT tri_a
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(a), ''), [3,3])) AS tri_a
),
b_trigrams AS (
SELECT
DISTINCT tri_b
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(b), ''), [3,3])) AS tri_b
)
SELECT
COUNTIF(tri_b IS NOT NULL) / COUNT(*)
FROM
a_trigrams
LEFT JOIN b_trigrams ON tri_a = tri_b
)
);
Here is a comparison to Postgres's pg_trgm:
select trigram_similarity('saemus', 'seamus');
-- 0.25 vs. pg_trgm 0.272727
select trigram_similarity('shamus', 'seamus');
-- 0.5 vs. pg_trgm 0.4
I gave the same answer on How to perform trigram operations in Google BigQuery?
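To make the logic of the SQL function above easier to follow, here is a minimal JavaScript sketch of the same measure: the share of distinct lowercase trigrams of a that also occur in b. The function names are hypothetical, and returning null for inputs shorter than three characters is a choice made for this example.

```javascript
// Collect the distinct 3-character substrings of a lowercased string.
function trigrams(s) {
  const chars = Array.from(s.toLowerCase());
  const result = new Set();
  for (let i = 0; i + 3 <= chars.length; i++) {
    result.add(chars.slice(i, i + 3).join(""));
  }
  return result;
}

// Fraction of a's distinct trigrams that also appear in b.
// Note that this is asymmetric: it divides by the trigram count of a only.
function trigramSimilarity(a, b) {
  const triA = trigrams(a);
  const triB = trigrams(b);
  if (triA.size === 0) return null; // assumption: no trigrams means "undefined"
  let shared = 0;
  for (const t of triA) {
    if (triB.has(t)) shared++;
  }
  return shared / triA.size;
}

console.log(trigramSimilarity("saemus", "seamus")); // 0.25
console.log(trigramSimilarity("shamus", "seamus")); // 0.5
```

These two values agree with the figures quoted for the SQL version.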

I couldn't find a direct answer to this, so I propose this solution, in standard SQL
#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
(
SELECT
SUM(counter) AS diff
FROM (
SELECT
CASE
WHEN X.value != Y.value THEN 1
ELSE 0
END AS counter
FROM (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(a, "")) AS value ) X
JOIN (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(b, "")) AS value ) Y
ON
X.row = Y.row )
)
);
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;
Compared to other solutions (like this one), it takes two strings (of the same length, following the definition for hamming distance) and outputs the expected distance.

While I was looking at Felipe's answer above, I worked on my own query and ended up with two versions: one that I call string approximation and another that I call string resemblance.
The first looks at the shortest distance between the letters of the source string and the test string and returns a score between 0 and 1, where 1 is a complete match. It always scores based on the longer of the two strings. It turns out to return results similar to the Levenshtein distance.
#standardSql
CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
(select avg(best_result) from (
select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref,
case
when min(result) is null then 0
else 1 / (min(result) + 1)
end as best_result,
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)),
greatest(length(testString),length(sourceString))) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
)
);
The second is a variation of the first: it looks at sequences of matching distances, so that a character matching at the same distance as the character preceding or following it counts as one point. This works quite well, better than string approximation, but not quite as well as I would like (see the example output below).
#standardSQL
CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (
(
select avg(sequence)
from (
select ref,
if(array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.before))) > 0
or array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.after))) > 0
, 1, 0) as sequence
from (
select ref,
collection,
lag(collection) over (order by ref) as before,
lead(collection) over (order by ref) as after
from (
select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
array_agg(result ignore nulls) as collection
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)), null) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
) as comparison
)
)
);
Now here is a sample of result:
#standardSQL
with test_subjects as (
select 'benji' as name union all
select 'benjamin' union all
select 'benjamin alan artis' union all
select 'ben artis' union all
select 'artis benjamin'
)
select name, quick.stringApproximation('benjamin artis', name) as approximation, quick.stringResemblance('benjamin artis', name) as resemblance
from test_subjects
order by resemblance desc
This returns
+---------------------+---------------------+---------------------+
| name                | approximation       | resemblance         |
+---------------------+---------------------+---------------------+
| artis benjamin      | 0.2653061224489796  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| benjamin alan artis | 0.6078947368421053  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| ben artis           | 0.4142857142857142  | 0.7142857142857143  |
+---------------------+---------------------+---------------------+
| benjamin            | 0.6125850340136053  | 0.5714285714285714  |
+---------------------+---------------------+---------------------+
| benji               | 0.36269841269841263 | 0.28571428571428575 |
+---------------------+---------------------+---------------------+
Edited: updated the resemblance algorithm to improve results.

Try Flookup for Google Sheets... it's definitely faster than Levenshtein distance and it calculates percentage similarities right out of the box.
One Flookup function you might find useful is this:
FUZZYMATCH (string1, string2)
Parameter Details
string1: compares to string2.
string2: compares to string1.
The percentage similarity is then calculated based on these comparisons. Both parameters can be ranges.
I'm currently trying to optimise it for large data sets, so your feedback would be very welcome.
Edit: I'm the creator of Flookup.

Related

eval() triggers Unexpected number when part of string passed into it has decimal

The below works perfectly until a key is sent to it with a decimal in it. Then it triggers an "Unexpected number". I can think of some work arounds that have to do with modifying the keys in the object sent from the database, but want to figure out why this triggers an error first.
What is happening in the below:
A number of percentiles are sent from the FE by the user (e.g., 5th, 15th, 35th, 62.5th, etc.) as an object (e.g. incP1: 5th, incP2: 15th, etc.) which are then mapped.
If the key starts with inc it does a certain set of logic.
It constructs a string (fieldStr) that corresponds with a key in the cr object which is basically the actual values of the percentiles the user requested.
In this case it would construct something like cr.TestInc15
The let fieldObj = eval(fieldStr) then returns the value from cr. of the key that was constructed.
Hopefully that makes sense, but that is why I am using eval() because I can't get the value from just the key as string otherwise. It works fine until it hits something like the 62.5th percentile where the key would be constructed as cr.TestInc62.5 which definitely has a value in cr. as I can console.log it out.
renderData(percentiles, cr, varName) {
return (
_.map(
_.pickBy(percentiles, function (value, key) {
return _.startsWith(key, 'inc')
}), p => {
let fieldStr = 'cr.' + varName + 'Inc' +
(p == 'n' ? 'N' :
(p == 50 ? 'Median' : p
));
// a bunch of junk after this, but error stops it here
let fieldObj = eval(
fieldStr
);
}
)
)
}
You can get the value with a string: an object property can be accessed without eval, even when the property name contains dots, as in your case.
This is called bracket notation.
See: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Property_Accessors
var cr = { "TestInc62.5": "Val123" }
console.log(cr["TestInc62.5"]);
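The eval version fails because `cr.TestInc62.5` is parsed as the property access `cr.TestInc62` followed by the numeric literal `.5`, hence "Unexpected number". As a sketch of how the key construction from the question could use bracket notation instead, the helper below builds the key as a string and looks it up directly. The helper name `lookupPercentile` and the sample values in `cr` are made up for illustration.

```javascript
// Build the property name the same way the question does, then read it
// with bracket notation instead of eval().
function lookupPercentile(cr, varName, p) {
  const suffix = p === "n" ? "N" : (p == 50 ? "Median" : p);
  const key = varName + "Inc" + suffix; // e.g. "TestInc62.5"
  return cr[key];
}

// Hypothetical data, only to show the lookup working.
const cr = {
  "TestInc15": 0.12,
  "TestInc62.5": 0.48,
  "TestIncMedian": 0.3,
  "TestIncN": 200
};

console.log(lookupPercentile(cr, "Test", 15));   // 0.12
console.log(lookupPercentile(cr, "Test", 62.5)); // 0.48
console.log(lookupPercentile(cr, "Test", 50));   // 0.3
```

Because the key is only ever used as a string, any characters are allowed in it, and no code from the database or the front end is executed.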

JavaScript regex: capturing optional groups while having a strict matching

I am trying to extract some data from user input that should follow this format: 1d 5h 30m, which means the user is entering an amount of time of 1 day, 5 hours and 30 minutes.
I am trying to extract the value of each part of the input. However, each group is optional, meaning that 2h 20m is a valid input.
I am trying to be flexible in the input (in the sense that not all parts need to be entered), but at the same time I don't want my regex to match some random input like asdfasdf20m. This one should be rejected (no match).
So first I am getting rid of any separator the user might have used (their input can look like 4h, 10m and that's ok):
input = input.replace(/[\s.,;_|#-]+/g, '');
Then I am capturing each part, which I indicate as optional using ?:
var match = /^((\d+)d)?((\d+)h)?((\d+)m)?$/.exec(input);
It is kind of messy capturing an entire group including the letter when I only want the actual value, but I cannot say that cluster is optional without wrapping it with parentheses right?
Then, when an empty group is captured its value in match is undefined. Is there any function to default undefined values to a particular value? For example, 0 would be handy here.
An example where input is "4d, 20h, 55m", and the match result is:
["4d20h55m", "4d", "4", "20h", "20", "55m", "55", index: 0, input: "4d20h55m"]
My main issues are:
How can I indicate a group as optional but avoid capturing it?
How can I deal with input that can potentially match, like abcdefg6d8m?
How can I deal with an altered order? For example, the user could input 20m 10h.
When I'm asking "how to deal with x" I mean I'd like to be able to reject those matches.
As variant:
HTML:
<input type="text">
<button>Check</button>
<div id="res"></div>
JS:
var r = [];
document.querySelector('button').addEventListener('click', function(){
var v = document.querySelector('input').value;
v.replace(/(\d+d)|(\d+h)|(\d+m)/ig, replacer);
document.querySelector('#res').innerText = r;
}, false);
function trim(s, mask) {
while (~mask.indexOf(s[0])) {
s = s.slice(1);
}
while (~mask.indexOf(s[s.length - 1])) {
s = s.slice(0, -1);
}
return s;
}
function replacer(str){
if(/d$/gi.test(str)){
r[0] = str;
}
else if(/h$/gi.test(str)){
r[1] = str;
}
else if(/m$/gi.test(str)){
r[2] = str;
}
return trim(r.join(', '), ',');
}
See here.
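Another possible approach, closer to the regex in the question, is to wrap each optional part in a non-capturing group `(?:...)` so that only the digits are captured, and to fill in defaults while destructuring the match. This is only a sketch; the function name `parseDuration` and the shape of the returned object are assumptions.

```javascript
// Parse input such as "1d 5h 30m" into numeric parts.
// Returns null when the input does not follow the d/h/m order or
// contains anything else, so "asdfasdf20m" and "20m 10h" are rejected.
function parseDuration(input) {
  // Strip the separators the question allows between the parts.
  const cleaned = input.replace(/[\s.,;_|#-]+/g, "");
  // (?:...) groups without capturing; only the inner (\d+) is captured.
  const match = /^(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?$/.exec(cleaned);
  if (!match || cleaned === "") return null;
  // Destructuring defaults replace the undefined groups with "0".
  const [, d = "0", h = "0", m = "0"] = match;
  return { days: Number(d), hours: Number(h), minutes: Number(m) };
}

console.log(parseDuration("4d, 20h, 55m")); // { days: 4, hours: 20, minutes: 55 }
console.log(parseDuration("2h 20m"));       // { days: 0, hours: 2, minutes: 20 }
console.log(parseDuration("abcdefg6d8m"));  // null
console.log(parseDuration("20m 10h"));      // null
```

The `^` and `$` anchors are what reject leading garbage and out-of-order parts; the explicit empty-string check is needed because every group is optional.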

JavaScript - Matching alphanumeric patterns with RegExp

I'm new to RegExp and to JS in general (Coming from Python), so this might be an easy question:
I'm trying to code an algebraic calculator in Javascript that receives an algebraic equation as a string, e.g.,
string = 'x^2 + 30x -12 = 4x^2 - 12x + 30';
The algorithm is already able to break the string in a single list, with all values on the right side multiplied by -1 so I can equate it all to 0, however, one of the steps to solve the equation involves creating a hashtable/dictionary, having the variable as key.
The string above results in a list eq:
eq = ['x^2', '+30x', '-12', '-4x^2', '+12x', '-30'];
I'm currently planning on iterating through this list, and using RegExp to identify both variables and the respective multiplier, so I can create a hashTable/Dictionary that will allow me to simplify the equation, such as this one:
hashTable = {
'x^2': [1, -4],
'x': [30, 12],
' ': [-12]
}
I plan on using some kind of for loop to iter through the array, and applying a match on each string to get the values I need, but I'm quite frankly, stumped.
I have already used RegExp to separate the string into the individual parts of the equation and to remove eventual spaces, but I can't imagine a way to separate -4 from x^2 in '-4x^2'.
You can try this
(-?\d+)x\^\d+.
When you execute match function :
var res = "-4x^2".match(/(-?\d+)x\^\d+/)
You will get res as an array : [ "-4x^2", "-4" ]
You have your '-4' in res[1].
By adding another group on the second \d+ (numeric char), you can retrieve the x power.
var res = "-4x^2".match(/(-?\d+)x\^(\d+)/) //res = [ "-4x^2", "-4", "2" ]
Hope it helps
If you know that the key of the hashtable (the variable part) is always at the end of the string, for example x at the end of '4x' or x^2 at the end of '-4x^2', then you can get the coefficient of the term by splitting on it:
var exp = '-4x^2'
exp.split('x^2')[0] // will return -4
I hope this is what you were looking for.
function splitTerm(term) {
var regex = /([+-]?)([0-9]*)?([a-z](\^[0-9]+)?)?/
var match = regex.exec(term);
return {
constant: parseInt((match[1] || '') + (match[2] || 1)),
variable: match[3]
}
}
splitTerm('x^2'); // => {constant: 1, variable: "x^2"}
splitTerm('+30x'); // => {constant: 30, variable: "x"}
splitTerm('-12'); // => {constant: -12, variable: undefined}
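Building on splitTerm, the hashtable from the question could be assembled roughly as follows. This is a sketch: `groupTerms` is a hypothetical helper, splitTerm is repeated (with an explicit radix) so the example is self-contained, and constants are stored under the ' ' key as in the question.

```javascript
// Same approach as splitTerm above: sign, digits, then an optional variable part.
function splitTerm(term) {
  const regex = /([+-]?)([0-9]*)?([a-z](\^[0-9]+)?)?/;
  const match = regex.exec(term);
  return {
    constant: parseInt((match[1] || "") + (match[2] || 1), 10),
    variable: match[3]
  };
}

// Group the coefficients of every term by its variable part.
function groupTerms(terms) {
  const table = {};
  for (const term of terms) {
    const { constant, variable } = splitTerm(term);
    const key = variable === undefined ? " " : variable; // ' ' holds plain constants
    if (!table[key]) table[key] = [];
    table[key].push(constant);
  }
  return table;
}

const eq = ["x^2", "+30x", "-12", "-4x^2", "+12x", "-30"];
console.log(groupTerms(eq));
// { 'x^2': [ 1, -4 ], x: [ 30, 12 ], ' ': [ -12, -30 ] }
```

Summing each array then gives the simplified coefficient for that variable.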
Additionally, these tools may help you analyze and understand regular expressions:
https://regexper.com/
https://regex101.com/
http://rick.measham.id.au/paste/explain.pl

Searching for multiple partial phrases so that one original phrase can not match multiple search phrases

Given a predefined set of phrases, I'd like to perform a search based on user's query. For example, consider the following set of phrases:
index phrase
-----------------------------------------
0 Stack Overflow
1 Math Overflow
2 Super User
3 Webmasters
4 Electrical Engineering
5 Programming Jokes
6 Programming Puzzles
7 Geographic Information Systems
The expected behaviour is:
query result
------------------------------------------------------------------------
s Stack Overflow, Super User, Geographic Information Systems
web Webmasters
over Stack Overflow, Math Overflow
super u Super User
user s Super User
e e Electrical Engineering
p Programming Jokes, Programming Puzzles
p p Programming Puzzles
To implement this behaviour I used a trie. Every node in the trie has an array of indices (empty initially).
To insert a phrase to the trie, I first break it to words. For example, Programming Puzzles has index = 6. Therefore, I add 6 to all the following nodes:
p
pr
pro
prog
progr
progra
program
programm
programmi
programmin
programming
pu
puz
puzz
puzzl
puzzle
puzzles
The problem is, when I search for the query prog p, I first get a list of indices for prog which is [5, 6]. Then, I get a list of indices for p which is [5, 6] as well. Finally, I calculate the intersection between the two, and return the result [5, 6], which is obviously wrong (should be [6]).
How would you fix this?
Key Observation
We can use the fact that two words in a query can match the same word in a phrase only if one query word is a prefix of the other (or if they are the same). So if we process the query words in descending lexicographic order (prefixes come after their "superwords"), we can safely remove words from the phrases at the first match. This leaves no possibility of matching the same phrase word twice. It is safe because a prefix matches a superset of the phrase words that its "superwords" can match, and a pair of query words where neither is a prefix of the other always matches disjoint sets of phrase words.
We don't have to remove words from phrases or the trie "physically", we can do it "virtually".
Implementation of the Algorithm
var PhraseSearch = function () {
var Trie = function () {
this.phraseWordCount = {};
this.children = {};
};
Trie.prototype.addPhraseWord = function (phrase, word) {
if (word !== '') {
var first = word.charAt(0);
if (!this.children.hasOwnProperty(first)) {
this.children[first] = new Trie();
}
var rest = word.substring(1);
this.children[first].addPhraseWord(phrase, rest);
}
if (!this.phraseWordCount.hasOwnProperty(phrase)) {
this.phraseWordCount[phrase] = 0;
}
this.phraseWordCount[phrase]++;
};
Trie.prototype.getPhraseWordCount = function (prefix) {
if (prefix !== '') {
var first = prefix.charAt(0);
if (this.children.hasOwnProperty(first)) {
var rest = prefix.substring(1);
return this.children[first].getPhraseWordCount(rest);
} else {
return {};
}
} else {
return this.phraseWordCount;
}
}
this.trie = new Trie();
}
PhraseSearch.prototype.addPhrase = function (phrase) {
var words = phrase.trim().toLowerCase().split(/\s+/);
words.forEach(function (word) {
this.trie.addPhraseWord(phrase, word);
}, this);
}
PhraseSearch.prototype.search = function (query) {
var answer = {};
var phraseWordCount = this.trie.getPhraseWordCount('');
for (var phrase in phraseWordCount) {
if (phraseWordCount.hasOwnProperty(phrase)) {
answer[phrase] = true;
}
}
var prefixes = query.trim().toLowerCase().split(/\s+/);
prefixes.sort();
prefixes.reverse();
var prevPrefix = '';
var superprefixCount = 0;
prefixes.forEach(function (prefix) {
if (prevPrefix.indexOf(prefix) !== 0) {
superprefixCount = 0;
}
phraseWordCount = this.trie.getPhraseWordCount(prefix);
function phraseMatchedWordCount(phrase) {
return phraseWordCount.hasOwnProperty(phrase) ? phraseWordCount[phrase] - superprefixCount : 0;
}
for (var phrase in answer) {
if (answer.hasOwnProperty(phrase) && phraseMatchedWordCount(phrase) < 1) {
delete answer[phrase];
}
}
prevPrefix = prefix;
superprefixCount++;
}, this);
return Object.keys(answer);
}
function test() {
var phraseSearch = new PhraseSearch();
var phrases = [
'Stack Overflow',
'Math Overflow',
'Super User',
'Webmasters',
'Electrical Engineering',
'Programming Jokes',
'Programming Puzzles',
'Geographic Information Systems'
];
phrases.forEach(phraseSearch.addPhrase, phraseSearch);
var queries = {
's': 'Stack Overflow, Super User, Geographic Information Systems',
'web': 'Webmasters',
'over': 'Stack Overflow, Math Overflow',
'super u': 'Super User',
'user s': 'Super User',
'e e': 'Electrical Engineering',
'p': 'Programming Jokes, Programming Puzzles',
'p p': 'Programming Puzzles'
};
for(var query in queries) {
if (queries.hasOwnProperty(query)) {
var expected = queries[query];
var actual = phraseSearch.search(query).join(', ');
console.log('query: ' + query);
console.log('expected: ' + expected);
console.log('actual: ' + actual);
}
}
}
One can test this code here: http://ideone.com/RJgj6p
Possible Optimizations
Storing the phrase word count in each trie node is not very memory
efficient. But by implementing compressed trie it is possible to
reduce the worst case memory complexity to O(n m), where n is the
number of different words in all the phrases, and m is the total
number of phrases.
For simplicity I initialize answer by adding all the phrases. But
a more time efficient approach is to initialize answer by adding
the phrases matched by the query word matching least number of
phrases. Then intersect with the phrases of the query word matching
second least number of phrases. And so on...
Relevant Differences from the Implementation Referenced in the Question
In trie node I store not only the phrase references (ids) matched by the subtrie, but also the number of matched words in these phrases. So, the result of the match is not only the matched phrase references, but also the number of matched words in these phrases.
I process query words in descending lexicographic order.
I subtract the number of superprefixes (query words of which the current query word is a prefix) from current match results (by using variable superprefixCount), and a phrase is considered matched by the current query word only when the resulting number of matched words in it is greater than zero. As in the original implementation, the final result is the intersection of the matched phrases.
As one can see, changes are minimal and asymptotic complexities (both time and memory) are not changed.
If the set of phrases is defined and does not contain long phrases, maybe you can create not 1 trie, but n tries, where n is the maximum number of words in one phrase.
In i-th trie store i-th word of the phrase. Let's call it the trie with label 'i'.
To process query with m words let's consider the following algorithm:
For each phrase we will store the lowest label of a trie, where the word from this phrase was found. Let's denote it as d[j], where j is the phrase index. At first for each phrase j, d[j] = -1.
Search the first word in each of n tries.
For each phrase j find the label of a trie that is greater than d[j] and where the word from this phrase was found. If there are several such labels, pick the smallest one. Let's denote such label as c[j].
If there is no such index, this phrase can not be matched. You can mark this case with d[j] = n + 1.
If there is such a c[j] that c[j] > d[j], then assign d[j] = c[j].
Repeat for every word left.
Every phrase with -1 < d[j] < n is matched.
This is not very optimal. To improve performance you should store only usable values of d array. After first word, store only phrases, matched with this word. Also, instead of assignment d[j] = n + 1, delete index j. Process only already stored phrase indexes.
You can solve it as a Graph Matching Problem in a Bipartite Graph.
For each document, query pair define the graph:
G=(V,E) Where
V = {t1 | for each term t1 in the query} U { t2 | for each term t2 in the document}
E = { (t1,t2) | t1 is a match for t2 }
Intuitively: you have a vertex for each term in the query, a vertex for each term in the document, and an edge between a document term and a query term, only if the query term matches the document term. You have already solved this part with your trie.
You got yourself a bipartite graph, there are only edges between the "query vertices" and the "document vertices" (and not between two vertices of the same type).
Now, invoke a matching problem for bipartite graph, and get an optimal matching {(t1_1,t2_1), ... , (t1_k,t2_k)}.
Your algorithm should return a document d for a query q with m terms in the query, if (and only if) all m terms are satisfied, which means - you have maximal matching where k=m.
In your example, the graph for query="prog p", and document="Programming Jokes", you will get the bipartite graph with the matching: (or with Programming,p matched - doesn't matter which)
And, for the same query, and document="Programming Puzzles", you will get the bipartite graph with the matching:
As you can see, for the first example - there is no matching that covers all the terms, and you will "reject" the document. For the 2nd example - you were able to match all terms, and you will return it.
For performance issues, you can do the suggested algorithm only on a subset of the phrases, that were already filtered out by your initial approach (intersection of documents that have matching for all terms).
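The matching check described above can be sketched for a single phrase with a simple augmenting-path search (often called Kuhn's algorithm). This is an illustrative sketch only: the function name is hypothetical, "match" is assumed to mean "the query word is a prefix of the phrase word", and the prefix test is done directly with startsWith rather than through the trie.

```javascript
// Returns true when every query word can be assigned its own distinct
// phrase word that it is a prefix of (a matching of size m).
function phraseMatchesQuery(phrase, query) {
  const phraseWords = phrase.trim().toLowerCase().split(/\s+/);
  const queryWords = query.trim().toLowerCase().split(/\s+/);
  // matchedBy[w] = index of the query word currently assigned to phrase word w
  const matchedBy = new Array(phraseWords.length).fill(-1);

  // Try to find a phrase word for query word q, re-assigning others if needed.
  function tryAssign(q, visited) {
    for (let w = 0; w < phraseWords.length; w++) {
      if (visited[w] || !phraseWords[w].startsWith(queryWords[q])) continue;
      visited[w] = true;
      if (matchedBy[w] === -1 || tryAssign(matchedBy[w], visited)) {
        matchedBy[w] = q;
        return true;
      }
    }
    return false;
  }

  return queryWords.every(function (_, q) {
    return tryAssign(q, new Array(phraseWords.length).fill(false));
  });
}

console.log(phraseMatchesQuery("Programming Jokes", "prog p"));   // false
console.log(phraseMatchesQuery("Programming Puzzles", "prog p")); // true
```

Since phrases and queries contain only a handful of words, the cost of this check per candidate phrase is small, which fits the suggestion of running it only on phrases that survive the initial intersection.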
After some thought I came up with a similar idea to dened's - in addition to the index of a matched phrase, each prefix will refer to how many words it is a prefix of in that phrase - then that number can be reduced in the query process by the number of its superfixes among other query words, and the returned results include only those with at least the same number of matched words as the query.
We can implement an additional small tweak to avoid large cross-checks by adding (for the English language) a maximum of approximately 26 choose 2 + 26 choose 3 and even an additional 26 choose 4 special elements to the trie that refer to ordered first-letter intersections. When a phrase is inserted, the special elements in the trie referring to the 2 and 3 first-letter combinations will receive its index. Then match results from larger query words can be cross-checked against these. For example, if our query is "Geo i", the match results for "Geo" would be cross-checked against the special trie element, "g-i", which hopefully would have significantly less match results than "i".
Also, depending on the specific circumstances, large cross-checks could at times be more efficiently handled in parallel (for example, via a bitset &).

In an HTML table, can a numeric value be expressed with comma formatting?

In Visual FoxPro, a number can be formatted in a textbox or grid and still be seen by the program as just a number, even though it is shown with commas and period format.
Presently I'm inserting data into a DataTable Jquery as follows:
oTable.fnAddData( ["Bogus data","1,541,512.52","12.5%","0","0","0"]);
But I would like to be entering the data as follows and yet show it with the commas for clarity:
oTable.fnAddData( ["Bogus data",1541512.52,"12.5%","0","0","0"]);
The reason is that when you sort the rows on this column, the character strings in the first example will produce a mess. The numbers, hopefully, will produce a well-ordered list.
If you have any other suggestions on how to fix the sorting of the character number column please suggest it...
TIA
Dennis
A sort plugin for formatted numbers is detailed here, by the author of DataTables:
http://www.datatables.net/plug-ins/sorting
See "formatted numbers":
This plug-in will provide numeric sorting for numeric columns which have extra formatting, such as thousands separators, currency symbols or any other non-numeric data.
jQuery.extend( jQuery.fn.dataTableExt.oSort, {
"formatted-num-pre": function ( a ) {
a = (a==="-") ? 0 : a.replace( /[^\d\-\.]/g, "" );
return parseFloat( a );
},
"formatted-num-asc": function ( a, b ) {
return a - b;
},
"formatted-num-desc": function ( a, b ) {
return b - a;
}
} );
Here's a quick and dirty way to do it;
var number = 1541512.52;
var nicelyformattedNumber = number.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",");
oTable.fnAddData( ["Bogus data", nicelyformattedNumber, "12.5%","0","0","0"]);
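One caveat with applying that regex to the whole string is that a long fractional part would also receive commas. A slightly more careful sketch, assuming a hypothetical helper named `formatWithCommas`, groups only the integer part:

```javascript
// Insert thousands separators into the integer part only,
// leaving any fractional digits untouched.
function formatWithCommas(value) {
  const parts = String(value).split(".");
  const grouped = parts[0].replace(/\B(?=(\d{3})+(?!\d))/g, ",");
  return parts.length > 1 ? grouped + "." + parts[1] : grouped;
}

console.log(formatWithCommas(1541512.52)); // "1,541,512.52"
console.log(formatWithCommas(0.123456));   // "0.123456"
console.log(formatWithCommas(-1234));      // "-1,234"
```

The formatted string is still text, so sorting on that column would need something like the formatted-number sort plug-in quoted above.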
Source
