Finding n-gram frequencies in a large set of sentences - javascript

I have a set of text messages. Lets call them m1, m2, ..... The maximum number of message is below 1,000,000. Each message is below 1024 characters in length, and all are in lowercase. Lets also pick an n-gram s1.
I need to find frequency of all possible substring from all of these messages. For example, lets say we have only two messages:
m1 = a cat in a cage
m2 = a bird in a cage
The frequency of some n-gram in these two messages:
'a' = 4
'in a cage' = 2
'a bird' = 1
'a cat' = 1
...
Note that, as in = 2, in a = 2, and a cage = 2 are subsets of in a cage = 2 and have same frequency, they should not be listed. Only take the longest one that have the highest frequency; follow this condition: the longest sn-gram should consist of at most 8 words, with a total character count below 30. If a n-gram exceeds this limit, it can be broken into two or more n-grams and listed separately.
I need to find such n-grams for all of these text messages and sort them by their number of occurrences in descending order.
How to I approach this problem? I need a solution in javascript.
PS: I need help, but do not know to where to ask this. If the question
is not for this site, then where should I post it? please guide this
newbie here.

May be you can approach as follows. I will edit to add explanation as soon as i have some time.
var subSentences = (w,...ws) => ws.length ? ws.reduce((r,s) => (r.push(r[r.length-1] + ` ${s}`), r),[w])
.concat(subSentences(...ws))
: [w],
frequencyMap = sss => sss.reduce((map,ss) => subSentences(...ss.split(/\s+/)).reduce((m,s) => m.set(s, m.get(s) + 1 || 1), map), new Map());
frequencies = frequencyMap(["this is a test string",
"this is another one",
"yet another one is here"]);
console.log(...frequencies.entries()); // logging map object seems not possible hence entries
.as-console-wrapper { max-height : 100% !important
}

Related

adding indicators into a string according to different case

I will receive an array of string-like below.
In each string, there may be three signs: $,%,* in the string
For example,
“I would $rather %be $happy, %if working in a chocolate factory”
“It is ok to play tennis”
“Tennis $is a good sport”
“AO is really *good sport”
However, there may be no signs in it, maybe only one sign in it.
There are only five cases in string,
1. no sign at all,
2. having $,% ;
3. having only $,
4 having only %,
5 having only *
If there is no sign, I don’t need to process it.
Otherwise, I need to process it and add an indicator to the left of the first sign that occurs in the sentence.
For example:
“I would ---dollorAndperSign—-$rather %be $happy, %if working in a chocolate factory”
“Tennis --dollorSign—-$is a good sport”
This is my idea code.
So, I need to decide if the string contains any sign. If there is no sign, I don’t need to process it.
texts.map((text) => {
if (text.includes("$") || text.includes("%") || text.includes("*")) {
//to get the index of signs
let indexOfdollar, indexOfper, indexOfStar;
indexOfdollar = text.indexOf("$");
indexOfper = text.indexOf("%");
indexOfStar = text.indexOf("*");
//return a completed process text
}
});
Question:
how do I know which index is the smallest one in order to locate the position of the first sign occurring in the text? Getting the smallest value may not be the correct approach coz there may be the case that I will get -1 from the above code?
I focussed only on the "get the smallest index" part of your question... Since you will be able to do what you want with it after.
You can have the indexOf() in an array, filter it to remove the -1 and then use Math.min() to get the smallest one.
Edited to output an object instead, which includes the first index and some booleans for the presence each char.
const texts = [
"I would $rather %be $happy, %if working in a chocolate factory",
"It is ok to play tennis",
"Tennis $is a good sport",
"AO is really *good sport"
]
const minIndexes = texts.map((text,i) => {
//to get the signs
const hasDollard = text.indexOf("$") >= 0
const hasPercent = text.indexOf("%") >= 0
const hasStar = text.indexOf("*") >= 0
//to get the first index
const indexes = [text.indexOf("$"), text.indexOf("%"), text.indexOf("*")].filter((index) => index >= 0)
if(!indexes.length){
return null
}
return {
index: Math.min( ...indexes),
hasDollard,
hasPercent,
hasStar
}
});
console.log(minIndexes)
const texts = [
"I would $rather %be $happy, %if working in a chocolate factory",
"It is ok to play tennis",
"Tennis $is a good sport",
"AO is really *good sport"
]
texts.forEach(text => {
let sighs = ["%","$","*"];
let chr = text.split('').find(t => sighs.find(s => s==t));
if (!chr)
return;
text = text.replace(chr, "---some text---" + chr);
console.log(text);
})
const data = ['I would $rather %be $happy, %if working in chocolate factory', 'It is ok to play tennis', 'Tennis $is a good sport', 'AO is really *good sport']
const replace = s => {
signs = { $: 'dollar', '%': 'per', '*': 'star' },
characters = Array.from(s, (c,i)=> '$%*'.includes(c)? c:'').join('')
headText = [...new Set(Array.from(characters))].map(c => signs[c]).join('|')
s.replace(/[\$\%\*]/, `--${text}--$&`);
}
const result = data.map(replace)

Javascript number interval search optimization

As topic says i ran into optimization problem when it comes to a large amount of intervals
Variables: FirstPrice, LastPrice, Increment and the number that user writes
Task: Need to round number to one side or another in specific interval.
Example:
FirstPrice = 0.99, LastPrice = 9.99, Increment = 1 User number = 5.35
Workflow:
I need to find interval in which one number exists, and what i could think of is to push numbers into array. So for this example array would be:
["0.99","1.99","2.99","3.99","4.99","5.99","6.99","7.99","8.99","9.99"].
Then i use for loop to find in which interval (in this case number = 5.35) number exists. In this case interval would be from 4.99 to 5.99. And then user number is updating to 4.99 or 5.99.
Problem:
It works fine doing that with low amount of numbers. But it comes really hard to execute high ranges. E.g. if FirstPrice = 1 LastPrice = 1000000 and Increment = 1. Then my array gets thousands of values and it takes way too long to push every value into array and find the number. Array gets ~999999 values and then loop goes through all of them to find specific interval.
So i think my problem is clear. The optimization. I need a better way of doing this. I tried cutting price range to half and a half but then intervals are wrong. Tried working with the inserted number but the same problem occurred.
Correct me if I'm wrong, but isn't this simply:
lower = floor((q - s) / i) * i + s
upper = lower + i
where:
q = User number
s = FirstPrice
i = Increment
function foo(q, s, i) {
const lower = Math.floor((q - s) / i) * i + s;
const upper = lower + i;
return [lower, upper];
}
console.log(
foo(5.35, 0.99, 1),
foo(5.35, 0.99, 2),
foo(5.35, 0.99, 3)
);

Best way to compare 2 similar strings? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
Original Question:
I have a lot of products with various names, I have two variations of
the names I need compared (Basically finding out if these two strings
are the same products). I don't want any false flags, does anyone have
recommendations on how I can achieve this?
Here is a product example:
Canon 50mm f/1.2L vs Canon EF 50mm f/1.2L USM Lens
There are other variations, but this will be the typical difference.
Is there any easy functionality I could implement to get a certain
answer? Only thing I can think of is maybe splitting the strings and
comparing and say if x matches a, b, or c.
My original question was a bit vague. The end goal is to be able to compare two strings and see how similar they are - e.g. 0%, 50%, or 100% similar. In this scenario I am using lens products from different sources, they use similar names - however I have no product sku/id for proper comparison.
The string score plugin has solved my issue, providing a value of how similar these products are.
In the bioinformatics word and I believe in other domains, this kind of pattern matching/searching algorithm is called fuzzy search.
There is a nodeJS module called string_score for it. Essentially you feed the API with 2 pieces of string and it returns you a score of how similar they are.
Example:
var test = require('string_score');
var match_percent = "Canon EF 50mm f/1.2L USM Lens".score("Canon 50mm f/1.2L");
console.log("Match score= " + match_percent);
Output:
Match score= 0.7938133874239354
Using the score as a baseline for comparison. You can say if it has a score of equip or over 80 then it is a match.
More Example:
var score = 0;
score = "hello world".score("he");
console.log("Match score => " + score);
score = "hello world".score("hel");
console.log("Match score => " + score);
score = "hello world".score("hell");
console.log("Match score => " + score);
score = "hello world".score("hello");
console.log("Match score => " + score);
<script type="text/javascript" src="//cdnjs.cloudflare.com/ajax/libs/string_score/0.1.10/string_score.min.js"></script>
References:
String_score: https://github.com/joshaven/string_score
You have to think about how would you recognize if two strings are the same product yourself, just by reading them.
Based solely on the examples you provided, it seems that the way to tell two strings representing a product are the same is if every word (a token separated by spaces) from the shorter string is contained in the longer string.
You might also want to ignore capitalization.
Something like this should work for the basic usage:
const tokens = s => s.toLowerCase().split(/\s+/g);
const sameProducts = (s1, s2) => {
const s1Tokens = tokens(s1);
const s2Tokens = tokens(s2);
const [shorterTokens, longerTokens] = s1Tokens.length > s2Tokens.length
? [s2Tokens, s1Tokens]
: [s1Tokens, s2Tokens];
return shorterTokens.every(st => longerTokens.includes(st));
}
console.log(
sameProducts(
'Canon 50mm f/1.2L',
'Canon EF 50mm f/1.2L USM Lens'
)
)
This code would have quadratic time complexity because the most expensive operation means that, for every token in the shorter string, you have to iterate through every token in the longer string.
A simple optimization would be to build a Set<token> from the longer string. This would make the operation linear because searching a set is O(1).
const tokens = s => s.toLowerCase().split(/\s+/g);
const sameProducts = (s1, s2) => {
const s1Tokens = tokens(s1);
const s2Tokens = tokens(s2);
const [shorterTokens, longerTokens] = s1Tokens.length > s2Tokens.length
? [s2Tokens, s1Tokens]
: [s1Tokens, s2Tokens];
const longerTokensSet = longerTokens.reduce((s, t) => {
s.add(t);
return s;
}, new Set());
return shorterTokens.every(st => longerTokensSet.has(st));
}
console.log(
sameProducts(
'Canon 50mm f/1.2L',
'Canon EF 50mm f/1.2L USM Lens'
)
)
Now you have to consider, do all tokens have to match? Maybe only tokens corresponding to the brand and focal-length have to match.
If this is the case, you might also want to validate both strings while parsing them and return false immediately if the product strings are invalid.
Here's a rough idea:
const productSet = new Set(['canon'])
const focalLengthsSet = new Set(['50mm']);
const isMeaningful = t => productSet.has(t) || focalLengthsSet.has(t);
const meaningfulTokens = s => s.toLowerCase().split(/\s+/g).filter(isMeaningful);
const validTokens = (tokens, s) => {
const valid = tokens.length === 2; // <-- could do better validation here
console.assert(valid, `Missing token(s) in ${s}`);
return valid;
}
const sameProducts = (s1, s2) => {
const s1Tokens = meaningfulTokens(s1);
if (!validTokens(s1Tokens, s1)) { return false; }
const s2Tokens = meaningfulTokens(s2);
if (!validTokens(s2Tokens, s2)) { return false; }
const [shorterTokens, longerTokens] = s1Tokens.length > s2Tokens.length
? [s2Tokens, s1Tokens]
: [s1Tokens, s2Tokens];
const longerTokensSet = longerTokens.reduce((s, t) => {
s.add(t);
return s;
}, new Set());
return shorterTokens.every(st => longerTokensSet.has(st));
}
console.log(
sameProducts(
'Canon 50mm f/1.3',
'Canon EF 50mm f/1.2'
)
)
console.log(
sameProducts(
'Canon 50mm f/1.3',
'Canon EF f/1.2' // <-- missing focal length
)
)
Now you could consider does every focal length correspond to every product or is it more product-specific?
Do tokens contain logic that explicitly depends on previously matched tokens?
All of the above are just basic approaches and techniques you could use but the actual solution would heavily depend on your exact circumstances.
A common algorithm for measuring string similarity is called the Levenstein distance.
The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
This algorithm would allow you to perhaps match the strings directly if you edit distance threshold is strict enough (although this could provide false positives) or you could even account for misspelled products for example when comparing individual tokens by making sure they are within a specific edit distance from one another.

How to count occurrence of multiple sub-string in a long string with JavaScript

I am a fresh with JavaScript. I just tried a lot, but did not get the answer and information to show how to count occurrence of multiple sub-string in a long string at one time.
Further information: I need get the occurrence of these sub-string and if the number of their occurrence to much, I need replace them at one time,so I need get the occurrence at one time.
Here is an example:
The long string Text as below,
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
The sub-string is a question, but what I need is to count each word occurrence in this sub-string at one time. for example, the word "name","NFL","championship","game" and "is","the" in this string.
What is the name of NFL championship game?
One of problems is some sub-string is not in the text, and some have shown many times.(which I might replaced it)
The Code I have tried as below, it is wrong, I have tried many different ways but no good results.
$(".showMoreFeatures").click(function(){
var text= $(".article p").text(); // This is to get the text.
var textCount = new Array();
// Because I use match, so for the first word "what", will return null, so
this is to avoid this null. and I was plan to get the count number, if it is
more than 7 or even more, I will replace them.
var qus = item2.question; //This is to get the sub-string
var checkQus = qus.split(" "); // I split the question to words
var newCheckQus = new Array();
// This is the array I was plan put the sub-string which count number less than 7, which I really needed words.
var count = new Array();
// Because it is a question as sub-string and have many words, so I wan plan to get their number and put them in a array.
for(var k =0; k < checkQus.length; k++){
textCount = text.match(checkQus[k],"g")
if(textCount == null){
continue;
}
for(var j =0; j<checkQus.length;j++){
count[j] = textCount.length;
}
//count++;
}
I was tried many different ways, and searched a lot, but no good results. The above code just want to show what I have tried and my thinking(might totally wrong). But actually it is not working , if you know how to implement it,solve my problem, please just tell me, no need to correct my code.
Thanks very much.
If I have understood the question correctly then it seems you need to count the number of times the words in the question (que) appear in the text (txt)...
var txt = "Super Bowl 50 was an American ...etc... Arabic numerals 50.";
var que = "What is the name of NFL championship game?";
I'll go through this in vanilla JavaScript and you can transpose it for JQuery as required.
First of all, to focus on the text we can make things a little simpler by changing the strings to lowercase and removing some of the punctuation.
// both strings to lowercase
txt = txt.toLowerCase();
que = que.toLowerCase();
// remove punctuation
// using double \\ for proper regular expression syntax
var puncArray = ["\\,", "\\.", "\\(", "\\)", "\\!", "\\?"];
puncArray.forEach(function(P) {
// create a regular expresion from each punctuation 'P'
var rEx = new RegExp( P, "g");
// replace every 'P' with empty string (nothing)
txt = txt.replace(rEx, '');
que = que.replace(rEx, '');
});
Now we can create a cleaner array from str and que as well as a hash table from que like so...
// Arrays: split at every space
var txtArray = txt.split(" ");
var queArray = que.split(" ");
// Object, for storing 'que' counts
var queObject = {};
queArray.forEach(function(S) {
// create 'queObject' keys from 'queArray'
// and set value to zero (0)
queObject[S] = 0;
});
queObject will be used to hold the words counted. If you were to console.debug(queObject) at this point it would look something like this...
console.debug(queObject);
/* =>
queObject = {
what: 0,
is: 0,
the: 0,
name: 0,
of: 0,
nfl: 0,
championship: 0,
game: 0
}
*/
Now we want to test each element in txtArray to see if it contains any of the elements in queArray. If the test is true we'll add +1 to the equivalent queObject property, like this...
// go through each element in 'queArray'
queArray.forEach(function(A) {
// create regular expression for testing
var rEx = new RegExp( A );
// test 'rEx' against elements in 'txtArray'
txtArray.forEach(function(B) {
// is 'A' in 'B'?
if (rEx.test(B)) {
// increase 'queObject' property 'A' by 1.
queObject[A]++;
}
});
});
We use RegExp test method here rather than String match method because we just want to know if "is A in B == true". If it is true then we increase the corresponding queObject property by 1. This method will also find words inside words, such as 'is' in 'San Francisco' etc.
All being well, logging queObject to the console will show you how many times each word in the question appeared in the text.
console.debug(queObject);
/* =>
queObject = {
what: 0
is: 2
the: 17
name: 0
of: 2
nfl: 1
championship: 0
game: 4
}
*/
Hoped that helped. :)
See MDN for more information on:
Array.forEach()
Object.keys()
RegExp.test()

Searching for multiple partial phrases so that one original phrase can not match multiple search phrases

Given a predefined set of phrases, I'd like to perform a search based on user's query. For example, consider the following set of phrases:
index phrase
-----------------------------------------
0 Stack Overflow
1 Math Overflow
2 Super User
3 Webmasters
4 Electrical Engineering
5 Programming Jokes
6 Programming Puzzles
7 Geographic Information Systems
The expected behaviour is:
query result
------------------------------------------------------------------------
s Stack Overflow, Super User, Geographic Information Systems
web Webmasters
over Stack Overflow, Math Overflow
super u Super User
user s Super User
e e Electrical Engineering
p Programming Jokes, Programming Puzzles
p p Programming Puzzles
To implement this behaviour I used a trie. Every node in the trie has an array of indices (empty initially).
To insert a phrase to the trie, I first break it to words. For example, Programming Puzzles has index = 6. Therefore, I add 6 to all the following nodes:
p
pr
pro
prog
progr
progra
program
programm
programmi
programmin
programming
pu
puz
puzz
puzzl
puzzle
puzzles
The problem is, when I search for the query prog p, I first get a list of indices for prog which is [5, 6]. Then, I get a list of indices for p which is [5, 6] as well. Finally, I calculate the intersection between the two, and return the result [5, 6], which is obviously wrong (should be [6]).
How would you fix this?
Key Observation
We can use the fact that two words in a query can match the same word in a phrase only if one query word is a prefix of the other query word (or if they are same). So if we process the query words in descending lexicographic order (prefixes come after their "superwords"), then we can safely remove words from the phrases at the first match. Doing so we left no possibility to match the same phrase word twice. As I said, it is safe because prefixes match superset of phrase words what their "superwords" can match, and pair of query words, where one is not a prefix of the other, always match disjoint set of phrase words.
We don't have to remove words from phrases or the trie "physically", we can do it "virtually".
Implementation of the Algorithm
var PhraseSearch = function () {
var Trie = function () {
this.phraseWordCount = {};
this.children = {};
};
Trie.prototype.addPhraseWord = function (phrase, word) {
if (word !== '') {
var first = word.charAt(0);
if (!this.children.hasOwnProperty(first)) {
this.children[first] = new Trie();
}
var rest = word.substring(1);
this.children[first].addPhraseWord(phrase, rest);
}
if (!this.phraseWordCount.hasOwnProperty(phrase)) {
this.phraseWordCount[phrase] = 0;
}
this.phraseWordCount[phrase]++;
};
Trie.prototype.getPhraseWordCount = function (prefix) {
if (prefix !== '') {
var first = prefix.charAt(0);
if (this.children.hasOwnProperty(first)) {
var rest = prefix.substring(1);
return this.children[first].getPhraseWordCount(rest);
} else {
return {};
}
} else {
return this.phraseWordCount;
}
}
this.trie = new Trie();
}
PhraseSearch.prototype.addPhrase = function (phrase) {
var words = phrase.trim().toLowerCase().split(/\s+/);
words.forEach(function (word) {
this.trie.addPhraseWord(phrase, word);
}, this);
}
PhraseSearch.prototype.search = function (query) {
var answer = {};
var phraseWordCount = this.trie.getPhraseWordCount('');
for (var phrase in phraseWordCount) {
if (phraseWordCount.hasOwnProperty(phrase)) {
answer[phrase] = true;
}
}
var prefixes = query.trim().toLowerCase().split(/\s+/);
prefixes.sort();
prefixes.reverse();
var prevPrefix = '';
var superprefixCount = 0;
prefixes.forEach(function (prefix) {
if (prevPrefix.indexOf(prefix) !== 0) {
superprefixCount = 0;
}
phraseWordCount = this.trie.getPhraseWordCount(prefix);
function phraseMatchedWordCount(phrase) {
return phraseWordCount.hasOwnProperty(phrase) ? phraseWordCount[phrase] - superprefixCount : 0;
}
for (var phrase in answer) {
if (answer.hasOwnProperty(phrase) && phraseMatchedWordCount(phrase) < 1) {
delete answer[phrase];
}
}
prevPrefix = prefix;
superprefixCount++;
}, this);
return Object.keys(answer);
}
function test() {
var phraseSearch = new PhraseSearch();
var phrases = [
'Stack Overflow',
'Math Overflow',
'Super User',
'Webmasters',
'Electrical Engineering',
'Programming Jokes',
'Programming Puzzles',
'Geographic Information Systems'
];
phrases.forEach(phraseSearch.addPhrase, phraseSearch);
var queries = {
's': 'Stack Overflow, Super User, Geographic Information Systems',
'web': 'Webmasters',
'over': 'Stack Overflow, Math Overflow',
'super u': 'Super User',
'user s': 'Super User',
'e e': 'Electrical Engineering',
'p': 'Programming Jokes, Programming Puzzles',
'p p': 'Programming Puzzles'
};
for(var query in queries) {
if (queries.hasOwnProperty(query)) {
var expected = queries[query];
var actual = phraseSearch.search(query).join(', ');
console.log('query: ' + query);
console.log('expected: ' + expected);
console.log('actual: ' + actual);
}
}
}
One can test this code here: http://ideone.com/RJgj6p
Possible Optimizations
Storing the phrase word count in each trie node is not very memory
efficient. But by implementing compressed trie it is possible to
reduce the worst case memory complexity to O(n m), there n is the
number of different words in all the phrases, and m is the total
number of phrases.
For simplicity I initialize answer by adding all the phrases. But
a more time efficient approach is to initialize answer by adding
the phrases matched by the query word matching least number of
phrases. Then intersect with the phrases of the query word matching
second least number of phrases. And so on...
Relevant Differences from the Implementation Referenced in the Question
In trie node I store not only the phrase references (ids) matched by the subtrie, but also the number of matched words in these phrases. So, the result of the match is not only the matched phrase references, but also the number of matched words in these phrases.
I process query words in descending lexicographic order.
I subtract the number of superprefixes (query words of which the current query word is a prefix) from current match results (by using variable superprefixCount), and a phrase is considered matched by the current query word only when the resulting number of matched words in it is greater than zero. As in the original implementation, the final result is the intersection of the matched phrases.
As one can see, changes are minimal and asymptotic complexities (both time and memory) are not changed.
If the set of phrases is defined and does not contain long phrases, maybe you can create not 1 trie, but n tries, where n is the maximum number of words in one phrase.
In i-th trie store i-th word of the phrase. Let's call it the trie with label 'i'.
To process query with m words let's consider the following algorithm:
For each phrase we will store the lowest label of a trie, where the word from this phrase was found. Let's denote it as d[j], where j is the phrase index. At first for each phrase j, d[j] = -1.
Search the first word in each of n tries.
For each phrase j find the label of a trie that is greater than d[j] and where the word from this phrase was found. If there are several such labels, pick the smallest one. Let's denote such label as c[j].
If there is no such index, this phrase can not be matched. You can mark this case with d[j] = n + 1.
If there is such c[j] that c[j] > d[j], than assign d[j] = c[j].
Repeat for every word left.
Every phrase with -1 < d[j] < n is matched.
This is not very optimal. To improve performance you should store only usable values of d array. After first word, store only phrases, matched with this word. Also, instead of assignment d[j] = n + 1, delete index j. Process only already stored phrase indexes.
You can solve it as a Graph Matching Problem in a Bipartite Graph.
For each document, query pair define the graph:
G=(V,E) Where
V = {t1 | for each term t1 in the query} U { t2 | for each term t2 in the document}
E = { (t1,t2) | t1 is a match for t2 }
Intuitively: you have a vertex for each term in the query, a vertex for each term in the document, and an edge between a document term and a query term, only if the query term matches the document term. You have already solved this part with your trie.
You got yourself a bipartite graph, there are only edges between the "query vertices" and the "document vertices" (and not between two vertices of the same type).
Now, invoke a matching problem for bipartite graph, and get an optimal matching {(t1_1,t2_1), ... , (t1_k,t2_k)}.
Your algorithm should return a document d for a query q with m terms in the query, if (and only if) all m terms are satisfied, which means - you have maximal matching where k=m.
In your example, the graph for query="prog p", and document="Programming Jokes", you will get the bipartite graph with the matching: (or with Programming,p matched - doesn't matter which)
And, for the same query, and document="Programming Puzzles", you will get the bipartite graph with the matching:
As you can see, for the first example - there is no matching that covers all the terms, and you will "reject" the document. For the 2nd example - you were able to match all terms, and you will return it.
For performance issues, you can do the suggested algorithm only on a subset of the phrases, that were already filtered out by your initial approach (intersection of documents that have matching for all terms).
After some thought I came up with a similar idea to dened's - in addition to the index of a matched phrase, each prefix will refer to how many words it is a prefix of in that phrase - then that number can be reduced in the query process by the number of its superfixes among other query words, and the returned results include only those with at least the same number of matched words as the query.
We can implement an additional small tweak to avoid large cross-checks by adding (for the English language) a maximum of approximately 26 choose 2 + 26 choose 3 and even an additional 26 choose 4 special elements to the trie that refer to ordered first-letter intersections. When a phrase is inserted, the special elements in the trie referring to the 2 and 3 first-letter combinations will receive its index. Then match results from larger query words can be cross-checked against these. For example, if our query is "Geo i", the match results for "Geo" would be cross-checked against the special trie element, "g-i", which hopefully would have significantly less match results than "i".
Also, depending on the specific circumstances, large cross-checks could at times be more efficiently handled in parallel (for example, via a bitset &).

Categories

Resources