Fastest / most efficient way to compare two string arrays Javascript - javascript

Hi, I was wondering whether anyone could offer some advice on the fastest / most efficient way to compare two arrays of strings in JavaScript.
I am developing a kind of tag cloud based on a user's input, the input being a written piece of text such as a blog article or the like.
I therefore keep an array of words not to include: is, a, the, etc.
At the moment I am doing the following:
Remove all punctuation from the input string, tokenize it, compare each word to the exclude array and then remove any duplicates.
The comparisons are performed by looping over each item in the exclude array for every word in the input text. This seems kind of brute force and is crashing Internet Explorer on arrays of more than a few hundred words.
I should also mention my exclude list has around 300 items.
Any help would really be appreciated.
Thanks

I'm not sure about the whole approach, but rather than building a huge array then iterating over it, why not put the "keys" into a map-"like" object for easier comparison?
e.g.
var excludes = {};//object
//set keys into the "map"
excludes['bad'] = true;
excludes['words'] = true;
excludes['exclude'] = true;
excludes['all'] = true;
excludes['these'] = true;
Then when you want to compare... just do
var wordsToTest = ['these','are','all','my','words','to','check','for'];
var checkWord;
for(var i=0;i<wordsToTest.length;i++){
checkWord = wordsToTest[i];
if(excludes[checkWord]){
//bad word, ignore...
} else {
//good word... do something with it
}
}
allows these words through ['are','my','to','check','for']
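A minimal sketch of the same lookup idea using a Set, which also covers the duplicate-removal step from the question. The helper name filterWords is hypothetical, and Set assumes an ES2015-capable engine, so it would not run in the older Internet Explorer versions the question mentions. One reason to consider it: with a plain object, words such as "constructor" are found on the prototype chain and would be treated as excluded.

```javascript
// Hypothetical helper: keeps each word once, skipping anything in the exclude list.
function filterWords(words, excludeList) {
  var excludes = new Set(excludeList); // membership test without a nested loop
  var seen = new Set();                // words already emitted
  var kept = [];
  for (var i = 0; i < words.length; i++) {
    var word = words[i].toLowerCase();
    if (excludes.has(word) || seen.has(word)) continue;
    seen.add(word);
    kept.push(word);
  }
  return kept;
}

var result = filterWords(
  ['these', 'are', 'all', 'my', 'words', 'to', 'check', 'for', 'my'],
  ['bad', 'words', 'exclude', 'all', 'these']
);
```

Each word costs one lookup regardless of how long the exclude list is, instead of one pass over the whole list.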

It would be worth a try to combine the words into a single regex, and then compare with that. The regex engine's optimizations might allow the search to skip forward through the search text a lot more efficiently than you could do by iterating yourself over separate strings.

You could use a hashing function for strings (I don't know if JS has one but I'm sure uncle Google can help ;] ). Then you would calculate hashes for all the words in your exclude list and create an array of booleans indexed by those hashes. Then just iterate through the text and check the word hashes against that array.

I have taken scunliffe's answer and modified it as follows:
var excludes = ['bad','words','exclude','all','these']; //array
Now let's prototype a function that checks whether a value is inside an Array:
Array.prototype.hasValue= function(value) {
for (var i=0; i<this.length; i++)
if (this[i] === value) return true;
return false;
}
Let's test some words:
var wordsToTest = ['these','are','all','my','words','to','check','for'];
var checkWord;
for(var i=0; i< wordsToTest.length; i++){
checkWord = wordsToTest[i];
if( excludes.hasValue(checkWord) ){
//is bad word
} else {
//is good word
console.log( checkWord );
}
}
output:
['are','my','to','check','for']

I'd opt for the regex version
text = 'This is a text that contains the words to delete. It has some <b>HTML</b> code in it, and punctuation!';
deleteWords = ['is', 'a', 'that', 'the', 'to', 'this', 'it', 'in', 'and', 'has'];
// clear punctuation and HTML code
onlyWordsReg = /\<[^>]*\>|\W/g;
onlyWordsText = text.replace(onlyWordsReg, ' ');
reg = new RegExp('\\b' + deleteWords.join('\\b|\\b') + '\\b', 'ig');
cleanText = onlyWordsText.replace(reg, '');
// tokenize after this
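If the exclude list can contain characters that are special in a regex (a dot, a plus sign, and so on), joining the words directly may build a pattern that matches something else. A hedged sketch of one way to guard against that; escapeRegExp and buildExcludeRegex are hypothetical helper names, not part of the answer above:

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: one alternation wrapped in a non-capturing group,
// with a single word boundary on each side.
function buildExcludeRegex(words) {
  var escaped = words.map(escapeRegExp);
  return new RegExp('\\b(?:' + escaped.join('|') + ')\\b', 'gi');
}

var excludeReg = buildExcludeRegex(['is', 'a', 'the', 'to']);
var cleaned = 'This is a test of the thing'
  .replace(excludeReg, '')  // drop the excluded words
  .replace(/\s+/g, ' ')     // collapse the gaps they leave behind
  .trim();
```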

Related

How to define a line break in extendscript for Adobe Indesign

I am using extendscript to build some invoices from downloaded plaintext emails (.txt)
At points in the file there are lines of text that look like "Order Number: 123456" and then the line ends. I have a script made from parts I found on this site that finds the end of "Order Number:" in order to get a starting position of a substring. I want to use where the return key was hit to go to the next line as the second index number to finish the substring. To do this, I have another piece of script from the helpful people of this site that makes an array out of the indexes of every instance of a character. I will then use whichever array object is a higher number than the first number for the substring.
It's a bit convoluted, but I'm not great with Javascript yet, and if there is an easier way, I don't know it.
What is the character I need to use to emulate a return key in a txt file in javascript for extendscript for indesign?
Thank you.
I have tried things like \n and \r\n and ^p both with and without quotes around them but none of those seem to show up in the array when I try them.
//Load Email as String
var b = new File("~/Desktop/Test/email.txt");
b.open('r');
var str = "";
while (!b.eof)
str += b.readln();
b.close();
var orderNumberLocation = str.search("Order Number: ") + 14;
var orderNumber = str.substring(orderNumberLocation, ARRAY NUMBER GOES HERE)
var loc = orderNumberLocation.lineNumber
function indexes(source, find) {
var result = [];
for (i = 0; i < source.length; ++i) {
// If you want to search case insensitive use
// if (source.substring(i, i + find.length).toLowerCase() == find) {
if (source.substring(i, i + find.length) == find) {
result.push(i);
}
}
alert(result)
}
indexes(str, NEW PARAGRAPH CHARACTER GOES HERE)
I want all my line breaks to show up as an array of indexes in the variable "result".
Edit: My method of importing stripped all line breaks from the document. Using the code below instead works better. Now \n works.
var file = File("~/Desktop/Test/email.txt", "utf-8");
file.open("r");
var str = file.read();
file.close();
You need to use regular expressions. Depending on the fields you need to search for, you'll need to tweak the regular expression, but I can give you a starting point. If the fields in the email are separated by new lines, something like this will work:
var str; //your string
var fields = {}
var lookFor = /(Order Number:|Adress:).*?\n/g;
str.replace(lookFor, function(match){
var order = match.split(':');
var field = order[0].replace(/\s/g, '');//remove all spaces
var value = order[1];
fields[field]= value;
})
With (Order Number:|Adress:) you are looking for the fields; you can add more fields, separated by the or character |, inside the parentheses. The .*?\n part matches any characters up to the first line break. The g flag indicates that you want to look for all matches. Then you call str.replace, because it lets you run a function on each match. So, if the separator between the field and the value is a colon ':', you split the match into an array of two values, ['Order number', 12345], and then store them in an object. That code will produce:
fields = {
OrderNumber: 12345,
Adress: "my fake adress 000"
}
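A sketch of a variant of the field lookup above that uses capture groups and exec instead of split(':'), so a value that itself contains a colon is kept whole and the line break is left out of the value. The helper name parseFields, the sample text and its labels are assumptions for illustration, and the labels are assumed to contain no regex special characters. The syntax is kept to plain var and while so it should also suit older engines.

```javascript
// Hypothetical helper: collects "Label: value" lines into an object.
function parseFields(str, labels) {
  var fields = {};
  // [^\r\n]* stops at the end of the line whether the file uses \n, \r\n or \r.
  var lookFor = new RegExp('(' + labels.join('|') + '):[ \\t]*([^\\r\\n]*)', 'g');
  var match;
  while ((match = lookFor.exec(str)) !== null) {
    var key = match[1].replace(/\s/g, ''); // "Order Number" becomes "OrderNumber"
    fields[key] = match[2];                // values stay strings
  }
  return fields;
}

// Hypothetical sample input
var sample = 'Order Number: 123456\r\nAddress: my fake address 000\r\n';
var parsed = parseFields(sample, ['Order Number', 'Address']);
```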
Please try \n and \r
Example: indexes(str, "\r");
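Since a text file may use "\n", "\r" or "\r\n" depending on where it came from, one option is to look for all three at once. This is only a sketch: lineBreakIndexes is a hypothetical replacement for the indexes function in the question, and the sample string is made up.

```javascript
// Returns the index of every line break in source.
// "\r\n" is counted once, at the position of the "\r".
function lineBreakIndexes(source) {
  var result = [];
  for (var i = 0; i < source.length; i++) {
    var ch = source.charAt(i);
    if (ch === '\r') {
      result.push(i);
      if (source.charAt(i + 1) === '\n') i++; // skip the second half of "\r\n"
    } else if (ch === '\n') {
      result.push(i);
    }
  }
  return result;
}

// Hypothetical sample input
var emailText = 'Order Number: 123456\r\nTotal: 10\n';
var label = 'Order Number: ';
var breaks = lineBreakIndexes(emailText);
var start = emailText.indexOf(label) + label.length;

// Use the first line break that comes after the label as the end of the substring.
var end = emailText.length;
for (var j = 0; j < breaks.length; j++) {
  if (breaks[j] >= start) { end = breaks[j]; break; }
}
var orderNumber = emailText.substring(start, end);
```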
If I've understood correctly, what you need is str.split():
function indexes(source, find) {
var order;
var result = [];
var orders = source.split('\n'); //returns an array of strings: ["order: 12345", "order:54321", ...]
for (var i = 0, l = orders.length; i < l; i++)
{
order = orders[i];
if (order.indexOf(find) !== -1){ // /find/ would only match the literal text "find"
result.push(i)
}
}
return result;
}

A javascript regular expression to tokenize the query

Hi, I've stumbled upon a problem related to regular expressions that I cannot resolve.
I need to tokenize the query (split query into parts), suppose the following one as an example:
These are the separate query elements "These are compound composite term"
What I eventually need is to have an array of 7 tokens:
1) These
2) are
3) the
4) separate
5) query
6) elements
7) These are compound composite term
The seventh token consists of several words because it was inside double quotation marks.
My question is: is it possible to tokenize the input string according to the explanation above using one regular expression?
Edit
I was curious about the possibility of using RegExp.exec or similar code instead of split to achieve the same thing, so I did some investigation, which was followed by another question here. As another answer to the question, the following regex can be used:
(?:")(?:\w+\W*)+(?:")|\w+
With the following one-liner usage scenario:
var tokens = query.match(/(?:")(?:\w+\W*)+(?:")|\w+/g);
Hope it will be useful...
You can use this regex:
var s = 'These are the separate query elements "These are compound composite term"';
var arr = s.split(/(?=(?:(?:[^"]*"){2})*[^"]*$)\s+/g);
//=> ["These", "are", "the", "separate", "query", "elements", "\"These are compound composite term\""]
This regex splits on spaces that are outside double quotes, using a lookahead to make sure there is an even number of quotes after the space.
You can use a simpler approach to split the string and grab the substrings inside double quotation marks, and then get rid of empty array items with clean function:
Array.prototype.clean = function() {
for (var i = 0; i < this.length; i++) {
if (this[i] == undefined || this[i] == '') {
this.splice(i, 1);
i--;
}
}
return this;
};
var re = /"(.*?)"|\s/g;
var str = 'These are the separate query elements "These are compound composite term"';
var arr = str.split(re);
alert(arr.clean());
You can get everything that is between one quote and the next ".*?" or everything that is not a whitespace \S+:
var re = /".*?"|\S+/g,
str = 'These are the separate query elements "These are compound composite term"',
m,
arr = [];
while ( m = re.exec( str ) ){
arr.push( m[0] );
}
alert( arr.join('\n') );
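The exec loop above keeps the surrounding quotation marks in the quoted token. If the token is wanted without them, as in the list in the question, a sketch with capture groups could look like this; the function name tokenize is an assumption:

```javascript
// Hypothetical helper: splits a query into bare words and quoted phrases,
// returning quoted phrases without their quotation marks.
function tokenize(query) {
  var re = /"([^"]*)"|(\S+)/g;
  var tokens = [];
  var m;
  while ((m = re.exec(query)) !== null) {
    // m[1] is set for a quoted phrase, m[2] for a bare word.
    tokens.push(m[1] !== undefined ? m[1] : m[2]);
  }
  return tokens;
}

var tokens = tokenize('These are the separate query elements "These are compound composite term"');
```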
\s(?=[^"]*(?:"[^"]*")*[^"]*$)
You can split by this. See demo:
https://www.regex101.com/r/fJ6cR4/20

Match String of Input to text/element and highlight reactive

HTML(JADE)
p#result Lorem ipsum is javascript j s lo 1 2 4 this meteor thismeteor. meteor
input.search
JS
Template.pg.events({
'keyup .search': function(e){
e.preventDefault();
var text = $('p#result').text();
var splitText = text.match(/\S+\s*/g);
var input = $(e.target).val();
var splitInput = input.match(/\S+\s*/g);
if(_.intersection(splitText, splitInput)) {
var match = _.intersection(splitText, splitInput);
var matchToString = match.toString();
$('p#result').text().replace(matchToString, '<b>'+matchToString+ '</b>')
}
console.log(splitText); //check what I get
console.log(splitInput); //check what I get
}
})
I have the above code.
What I'm trying to do is match the input field's value against the text. I attached the function to keyup so that it is reactive.
When the input and the text match, it should add bold tags to the matched strings.
I think I'm almost there, but not quite yet.
How would I proceed on from here?
MeteorPad
Here
In your code, you seem to only be matching on whole words, although your question does not specify that. If you want to match on any text in the input (e.g., if you type "a", all "a"s in the text are made bold), you can do that relatively easily using the javascript split and join String methods:
Template.pg.events({
'keyup .search': function(e){
e.preventDefault();
var text = $('p#result').text();
var input = $(e.target).val();
var splitText = text.split(input); // Produces an array without whatever's in the input
console.log(splitText);
var rep = splitText.join("<b>" + input + "</b>"); // Produces a string with inputs replaced by boldened inputs
console.log(rep);
$('p#result').html(rep);
}
});
Notably, you have to replace the text on the page using $('p#result').html(), which was missing in your MeteorPad example. Note also that this is a case-sensitive implementation; you can use a regex to do the split, but it gets a bit more complicated when you want to replace the text in the join. You can play around with it on this MeteorPad.
To do this case-insensitively, the split is very straightforward -- you can use a RegExp like so:
...
var regex = new RegExp($(e.target).val(), 'gi'); // global and case-insensitive, where `input` used to be
The tricky thing is to extract the correct case of what you want to pull out, then put it back in -- you can't do this with a simple join, so you'll have to interleave the two arrays. You can see an example of interleaved arrays here, which was taken from this question. I've amended that a bit to deal with the uneven array lengths, here:
var interleave = function(array1, array2) {
return $.map(array1, function(v, i) {
if (array2[i]) { return [v, array2[i]]; } // deals with uneven array lengths
else { return [v]; }
});
}
I've also created another MeteorPad that you can play around with that does all of this. lo is a good test string to check out.
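As an alternative to splitting and interleaving, the case-insensitive highlight can be sketched as a single pass with exec, which keeps the original case of each match. Everything here is an assumption for illustration: the helper names highlight, escapeRegExp and escapeHtml are hypothetical, and the input is treated as literal text rather than as a pattern.

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: make text safe to pass to .html().
function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Wraps every case-insensitive occurrence of input in <b> tags.
function highlight(text, input) {
  if (!input) return escapeHtml(text); // nothing typed yet: nothing to highlight
  var regex = new RegExp(escapeRegExp(input), 'gi');
  var out = '';
  var last = 0;
  var m;
  while ((m = regex.exec(text)) !== null) {
    out += escapeHtml(text.slice(last, m.index)) + '<b>' + escapeHtml(m[0]) + '</b>';
    last = m.index + m[0].length;
  }
  return out + escapeHtml(text.slice(last));
}

var highlighted = highlight('Lorem ipsum lo', 'lo');
```

In the event handler, the result would be passed to $('p#result').html(...) in place of the joined string.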

Regular expression in Javascript: table of positions instead of table of occurrences

Regular expressions are most powerful. However, the result they return is sometimes useless:
For example:
I want to manage a CSV string using semicolons.
I define a string like:
var data = "John;Paul;Pete;Stuart;George";
If I use the instruction:
var tab = data.match(/;/g)
then "tab" contains an array of four ";":
tab[0]=";", tab[1]=";", tab[2]=";", tab[3]=";"
This array is not useful in the present case, because I knew it even before using the regular expression.
Indeed, what I want to do is 2 things:
1stly: Suppress the 4th element (not "Stuart" as "Stuart", but "Stuart" as 4th element)
2ndly: Replace the 3rd element by "Ringo" so as to get back (to where you once belonged!) the following result:
data == "John;Paul;Ringo;George";
In this case, I would greatly prefer to obtain an array giving the positions of semicolons:
tab[0]=4, tab[1]=9, tab[2]=14 tab[3]=21
instead of the useless (in this specific case)
tab[0]=";", tab[1]=";", tab[2]=";", tab[3]=";"
So, here's my question: Is there a way to obtain this numeric array using regular expressions?
To get tab[0]=4, tab[1]=9, tab[2]=14 tab[3]=21, you can do
var tab = [];
var startPos = 0;
var data = "John;Paul;Pete;Stuart;George";
while (true) {
var currentIndex = data.indexOf(";", startPos);
if (currentIndex == -1) {
break;
}
tab.push(currentIndex);
startPos = currentIndex + 1; // move past this match, otherwise indexOf finds it again forever
}
But if the result wanted is "John;Paul;Ringo;George", you can do
var tab = data.split(';'); // Split the string into an array of strings
tab.splice(3, 1); // Suppress the 4th element
tab[2] = "Ringo"; // Replace the 3rd element by "Ringo"
var str = tab.join(';'); // Join the elements of the array into a string
The second approach is maybe better in your case.
String.split
Array.splice
Array.join
You should try a different approach, using split.
tab = data.split(';') will return an array of the form
tab[0]="John", tab[1]="Paul", tab[2]="Pete", tab[3]="Stuart", tab[4]="George"
You should be able to achieve your goal with this array.
Why use a regex to perform this operation? You have a built-in function split, which can split your string based on the delimiter you pass.
var data = "John;Paul;Pete;Stuart;George";
var temp=data.split(';');
temp[0],temp[1]...
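To the question as literally asked (positions rather than occurrences from a regular expression): each result of exec carries an index property, so a loop over a global regex collects the positions. A short sketch, reusing the data string from the question; the variable name positions is an assumption:

```javascript
var data = "John;Paul;Pete;Stuart;George";
var re = /;/g; // the g flag is required so exec() resumes after the previous match
var positions = [];
var m;
while ((m = re.exec(data)) !== null) {
  positions.push(m.index); // position of this match within data
}
// positions is now [4, 9, 14, 21]
```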

Javascript / jQuery faster alternative to $.inArray when pattern matching strings

I've got a large array of words in Javascript (~100,000), and I'd like to be able to quickly return a subset of them based on a text pattern.
For example, I'd like to return all the words that begin with a pattern so typing hap should give me ["happy", "happiness", "happening", etc, etc], as a result.
If it's possible I'd like to do this without iterating over the entire array.
Something like this is not working fast enough:
// data contains an array of beginnings of words e.g. 'hap'
$.each(data, function(key, possibleWord) {
found = $.inArray(possibleWord, words);
// do something if found
});
Any ideas on how I could quickly reduce the set to possible matches without iterating over the whole word set? The word array is in alphabetical order if that helps.
If you just want to search for prefixes there are data structures just for that, such as the Trie and Ternary search trees
A quick Google search turns up some promising JavaScript Trie and autocomplete implementations:
http://ejohn.org/blog/javascript-trie-performance-analysis/
Autocomplete using a trie
http://odhyan.com/blog/2010/11/trie-implementation-in-javascript/
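To make the trie idea concrete, here is a minimal sketch, not taken from any of the linked implementations. The Trie constructor and its insert and withPrefix methods are hypothetical names. Looking up a prefix walks one node per character and then collects only the words below that node.

```javascript
// Minimal trie: each node has a map of child nodes keyed by character
// and a flag saying whether a word ends at that node.
function Trie() {
  this.root = { children: {}, isWord: false };
}

Trie.prototype.insert = function (word) {
  var node = this.root;
  for (var i = 0; i < word.length; i++) {
    var ch = word.charAt(i);
    if (!node.children[ch]) {
      node.children[ch] = { children: {}, isWord: false };
    }
    node = node.children[ch];
  }
  node.isWord = true;
};

Trie.prototype.withPrefix = function (prefix) {
  var node = this.root;
  for (var i = 0; i < prefix.length; i++) {
    node = node.children[prefix.charAt(i)];
    if (!node) return []; // no word starts with this prefix
  }
  var results = [];
  (function collect(n, path) {
    if (n.isWord) results.push(path);
    for (var ch in n.children) collect(n.children[ch], path + ch);
  })(node, prefix);
  return results;
};

var trie = new Trie();
['happy', 'happiness', 'happening', 'nothere'].forEach(function (w) { trie.insert(w); });
var hapWords = trie.withPrefix('hap');
```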
I have absolutely no idea if this is any faster (a jsperf test is probably in order...), but you can do it with one giant string and a RegExp search instead of arrays:
var giantStringOfWords = giantArrayOfWords.join(' ');
function searchForBeginning(beginning, str) {
var pattern = new RegExp('\\b' + beginning + '\\w*', 'g'), // build from the prefix; g returns every match
matches = str.match(pattern);
return matches;
}
var hapResults = searchForBeginning('hap', giantStringOfWords);
The best approach is to structure the data better. Make an object with keys like "hap". That member holds an array of words (or word suffixes if you want to save space) or a separated string of words for regexp searching.
This means you will have shorter objects to iterate/search. Another way is to sort the arrays and use a binary search pattern. There's a good conversation about techniques and optimizations here: http://ejohn.org/blog/revised-javascript-dictionary-search/
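A sketch of the binary search idea, assuming the array is sorted with JavaScript's default string ordering (for example by calling words.sort() once). The names lowerBound and wordsWithPrefix are hypothetical. In that ordering all words sharing a prefix sit next to each other, so it is enough to find the first candidate and walk forward until the prefix stops matching.

```javascript
// Index of the first element that is >= target in a sorted array.
function lowerBound(sorted, target) {
  var lo = 0, hi = sorted.length;
  while (lo < hi) {
    var mid = (lo + hi) >>> 1;
    if (sorted[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// All words starting with prefix, found without scanning the whole array.
function wordsWithPrefix(sorted, prefix) {
  var start = lowerBound(sorted, prefix);
  var end = start;
  // lastIndexOf(prefix, 0) === 0 is a "starts with" test that only looks at position 0.
  while (end < sorted.length && sorted[end].lastIndexOf(prefix, 0) === 0) end++;
  return sorted.slice(start, end);
}

var sortedWords = ['foo', 'hap', 'happening', 'happiness', 'happy', 'harp', 'zoo'];
var hapMatches = wordsWithPrefix(sortedWords, 'hap');
```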
I suppose that using raw javascript can help a bit, you can do:
var arr = ["happy", "happiness", "nothere", "notHereEither", "happening"], subset = [];
for(var i = 0, len = arr.length; i < len; i ++) {
if(arr[i].indexOf("hap") === 0) { // matches only words that begin with "hap"
subset.push(arr[i]);
}
}
//subset === ["happy", "happiness","happening"]
Also, if the array is ordered you could break early if the first letter is bigger than the first of your search, instead of looping the entire array.
var data = ['foo', 'happy', 'happiness', 'foohap'];
jQuery.each(data, function(i, item) {
if(item.match(/^hap/))
console.log(item)
});
If you have the data in an array, you're going to have to loop through the whole thing.
A really simple optimization is on page load go through your big words array and make a note of what index ranges apply to each starting letter. E.g., in my example below the "a" words go from 0 to 2, "b" words go from 3 to 4, etc. Then when actually doing a pattern match only look through the applicable range. Although obviously some letters will have more words than others, a given search will only have to look through an average of 100,000/26 words.
// words array assumed to be lowercase and in alphabetical order
var words = ["a","an","and","be","blue","cast","etc."];
// figure out the index for the first and last word starting with
// each letter of the alphabet, so that later searches can use
// just the appropriate range instead of searching the whole array
var letterIndexes = {},
i,
l,
letterIndex = 0,
firstLetter;
for (i=0, l=words.length; i<l; i++) {
if (words[i].charAt(0) === firstLetter)
continue;
if (firstLetter)
letterIndexes[firstLetter] = {first : letterIndex, last : i-1};
letterIndex = i;
firstLetter = words[i].charAt(0);
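}
// the loop only stores a range when the next letter begins, so store the final letter's range here
if (firstLetter)
letterIndexes[firstLetter] = {first : letterIndex, last : l-1};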
function getSubset(pattern) {
pattern = pattern.toLowerCase()
var subset = [],
fl = pattern.charAt(0),
matched = false;
if (letterIndexes[fl])
for (var i = letterIndexes[fl].first, l = letterIndexes[fl].last; i <= l; i++) {
if (pattern === words[i].substr(0, pattern.length)) {
subset.push(words[i]);
matched = true;
} else if (matched) {
break;
}
}
return subset;
}
Note also that when searching through the (range within the) words array, once a match is found I set a flag, which indicates we've gone past all of the words that are alphabetically before the pattern and are now making our way through the matching words. That way as soon as the pattern no longer matches we can break out of the loop. If the pattern doesn't match at all we still end up going through all the words for that first letter though.
Also, if you're doing this as a user types, when letters are added to the end of the pattern you only have to search through the previous subset, not through the whole list.
P.S. Of course if you want to break the word list up by first letter you could easily do that server-side.
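The point above about only searching the previous subset as the user types can be sketched like this. The factory name makeIncrementalSearch is hypothetical, and a plain linear prefix filter stands in for whichever lookup is actually used.

```javascript
// Returns a search function that remembers its last result and reuses it
// when the new pattern merely extends the previous one.
function makeIncrementalSearch(words) {
  var lastPattern = '';
  var lastSubset = words;
  return function (pattern) {
    // If the user added letters to the end, every match is already in lastSubset.
    var extendsLast = pattern.lastIndexOf(lastPattern, 0) === 0;
    var pool = extendsLast ? lastSubset : words;
    var subset = [];
    for (var i = 0; i < pool.length; i++) {
      if (pool[i].lastIndexOf(pattern, 0) === 0) subset.push(pool[i]);
    }
    lastPattern = pattern;
    lastSubset = subset;
    return subset;
  };
}

var search = makeIncrementalSearch(['a', 'an', 'and', 'be', 'blue', 'cast']);
var afterA = search('a');    // searches the full list
var afterAn = search('an');  // searches only the three "a" words
var afterB = search('b');    // pattern changed, so it falls back to the full list
```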
