Tokenize in JavaScript - javascript

If I have a string, how can I split it into an array of words and filter out some stopwords? I only want words of length 2 or greater.
If my string is
var text = "This is a short text about StackOverflow.";
I can split it with
var words = text.split(/\W+/);
But using split(/\W+/), I get all words. I could check if the words have a length of at least 2 with
function validate(token) {
return /\w{2,}/.test(token);
}
but I guess I could do this smarter/faster with regexp.
I also have an array var stopwords = ['has', 'have', ...] which shouldn't be allowed in the array.
Actually, if I can find a way to filter out stopwords, I could just add all letters a, b, c, ..., z to the stopwords array to only accept words with at least 2 characters.

I would do what you started: split by /\W+/ and then validate each token (length and stopwords) in the array by using .filter().
var text = "This is a short text about StackOverflow.";
var stopwords = ['this'];
var words = text.split(/\W+/).filter(function(token) {
token = token.toLowerCase();
return token.length >= 2 && stopwords.indexOf(token) == -1;
});
console.log(words); // ["is", "short", "text", "about", "StackOverflow"]
You could easily tweak a regex to look for words >= 2 characters, but there's no point if you're already going to need to post-process to remove stopwords (token.length will be faster than any fancy regex you write).
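A minimal sketch of the same split-then-filter idea, assuming the stopword list is large enough that a Set lookup is worth it. The names tokenize and minLength are illustrative and not part of the answer above.

```javascript
// Hypothetical helper: split on non-word characters, then drop short tokens and stopwords.
function tokenize(text, stopwords, minLength) {
  // A Set gives fast membership checks when the stopword list is long.
  const stopSet = new Set(stopwords.map(function (w) { return w.toLowerCase(); }));
  return text.split(/\W+/).filter(function (token) {
    return token.length >= minLength && !stopSet.has(token.toLowerCase());
  });
}

const tokens = tokenize("This is a short text about StackOverflow.", ["this", "is"], 2);
console.log(tokens); // ["short", "text", "about", "StackOverflow"]
```

The length check also discards the empty string that split() leaves behind when the text ends in punctuation.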

Easy with Ramda:
var text = "This is a short text about how StackOverflow has gas.";
var stopWords = ['have', 'has'];
var isLongWord = R.compose(R.gte(R.__, 2), R.length); // length of 2 or more, as the question asks
var isGoWord = R.compose(R.not, R.contains(R.__, stopWords));
var tokenize = R.compose(R.filter(isGoWord), R.filter(isLongWord), R.split(' '));
tokenize(text); // ["This", "is", "short", "text", "about", "how", "StackOverflow", "gas."]
http://bit.ly/1V5bVrP

What about splitting on something like this if you want to use a pure regex approach:
\W+|\b\w{1,2}\b
https://regex101.com/r/rB4cJ4/1
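As a rough sketch of the regex-only direction, assuming it is acceptable to match the wanted words rather than split on the unwanted ones, match() with \w{2,} collects words of two or more characters in one pass. This is a different technique from the split pattern above, and stopwords would still need a separate filter.

```javascript
// Match runs of at least two word characters; single-character words are skipped.
const sample = "This is a short text about StackOverflow.";
const longWords = sample.match(/\b\w{2,}\b/g) || []; // match() returns null when nothing matches
console.log(longWords); // ["This", "is", "short", "text", "about", "StackOverflow"]
```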

Something like this?
function filterArray(a, num_words, stop_words) {
  var b = [];
  for (var ct = 0; ct < a.length; ct++) {
    // keep words that are long enough and are not stopwords
    if (a[ct].length >= num_words && !ArrayContains(a[ct], stop_words)) {
      b.push(a[ct]);
    }
  }
  return b;
}
function ArrayContains(word, a) {
  for (var ct = 0; ct < a.length; ct++) {
    if (word == a[ct]) {
      return true;
    }
  }
  // only report false after every element has been checked
  return false;
}
var words = "He walks the dog";
var stops = ["dog"];
var a = words.split(" ");
var f = filterArray(a, 2, stops); // ["He", "walks", "the"]

This should help:
(?:\b\W*\w\W*\b)+|\W+
output:
ThisisashorttextaboutStackOverflow. A..Zabc..xyz.
where is matched string.

Related

Separate characters and numbers from a string

I have a string variable that contains characters and numbers like this
var sampleString = "aaa1211"
Note that the variable always starts with one or more characters and ends with one or more numbers. The number of characters and digits is not fixed. It could be something like the following:
var sampleString = "aaaaa12111"
var sampleString = "aaa12111"
I need to separate the characters and numbers and assign them to separate variables.
How could I do that?
I tried to use split and substring, but I couldn't apply them to this scenario. I know this is a basic question, but I searched the internet and was unable to find an answer.
Thank you
Please use
[A-Za-z] - all letters (uppercase and lowercase)
[0-9] - all numbers
function myFunction() {
var str = "aaaaAZE12121212";
var patt1 = /[0-9]/g;
var patt2 = /[a-zA-Z]/g;
var letters = str.match(patt2);
var digits = str.match(patt1);
document.getElementById("alphabets").innerHTML = letters;
document.getElementById("numbers").innerHTML = digits;
}
Codepen-http://codepen.io/nagasai/pen/pbbGOB
A shorter solution if the string always starts with letters and ends with numbers as you say:
var str = 'aaaaa12111';
var chars = str.slice(0, str.search(/\d/));
var numbs = str.replace(chars, '');
console.log(chars, numbs);
You can do it with a single regex:
var st = 'red123';
var regex = new RegExp('([0-9]+)|([a-zA-Z]+)','g');
var splittedArray = st.match(regex); // ["red", "123"]
var text = splittedArray[0];
var num = splittedArray[1];
This will give you both the text and the number.
Using Match
const str = "ASDF1234";
const [word, digits] = str.match(/\D+|\d+/g);
console.log(word); // "ASDF"
console.log(digits); // "1234"
The above will work even if your string starts with digits.
Using Split
with Positive lookbehind (?<=) and Positive lookahead (?=):
const str = "ASDF1234";
const [word, digits] = str.split(/(?<=\D)(?=\d)/);
console.log(word); // "ASDF"
console.log(digits); // "1234"
where \D stands for not a digit and \d for digit.
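If the input is known to be letters followed by digits, as the question states, an anchored regex with two capture groups is another option. The sketch below assumes that shape and returns null otherwise; the function name splitLettersDigits is illustrative.

```javascript
// Assumes the input is one or more non-digits followed by one or more digits.
function splitLettersDigits(str) {
  const m = /^(\D+)(\d+)$/.exec(str);
  // Return null when the input does not have the letters-then-digits shape.
  return m ? { letters: m[1], digits: m[2] } : null;
}

console.log(splitLettersDigits("aaaaa12111")); // { letters: "aaaaa", digits: "12111" }
console.log(splitLettersDigits("12abc"));      // null
```

Unlike the lookbehind version, this does not rely on lookbehind support in the regex engine.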
Use isNaN() to differentiate
var sampleString = "aaa1211";
var newnum = "";
var newstr = "";
for (var i = 0; i < sampleString.length; i++) {
  if (isNaN(sampleString[i])) {
    newstr = newstr + sampleString[i];
  } else {
    // append to newnum, not newstr
    newnum = newnum + sampleString[i];
  }
}
console.log(newnum); // 1211
console.log(newstr); // aaa
If, like me, you want to separate letters and digits regardless of their position, try this:
var separateTextAndNum = (x) => {
var textArray = x.split('')
var text = []
var num = []
textArray.forEach(t=>{
if (t>-1) {
num.push(t)
} else {
text.push(t)
}
})
return [text, num]
}
For ex - if you try this:
separateTextAndNum('abcd1234ava') // result [["a","b","c","d","a","v","a"],["1","2","3","4"]]
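A shorter sketch of the same position-independent separation, assuming strings are wanted instead of arrays and that anything that is neither an ASCII letter nor a digit can be dropped. The helper name is illustrative.

```javascript
// Keep only letters or only digits by deleting everything else.
function separateLettersAndDigits(str) {
  return {
    letters: str.replace(/[^A-Za-z]/g, ""),
    digits: str.replace(/\D/g, "")
  };
}

const parts = separateLettersAndDigits("abcd1234ava");
console.log(parts.letters); // "abcdava"
console.log(parts.digits);  // "1234"
```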
This isn't beautiful but it works.
function splitLettersAndNumbers(string) {
  // lookup list of digit characters (kept separate from the result variables)
  var digits = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
  var letters, numbers;
  for (var i = 0; i < string.length; i++) {
    if (digits.indexOf(string[i]) > -1) {
      letters = string.substring(0, i);
      numbers = string.substring(i);
      return [letters, numbers];
    }
  }
  // if no digit is found, return the whole string as the only element
  return [string];
}
Essentially just looping through the string until you find a number and you split it at that index. Returning a tuple with your letters and numbers. So when you run it you can do something like this:
var results = splitLettersAndNumbers(string);
var letters = results[0];
var numbers = results[1];
A functional approach...
var sampleString = "aaaaa12111";
var seperate = sampleString.split('').reduce(function(start, item) {
  // test for a digit explicitly; Number(item) is falsy for "0"
  if (/\d/.test(item)) {
    start[0] += item;
  } else {
    start[1] += item;
  }
  return start;
}, ['', '']);
console.log(seperate); // ["12111", "aaaaa"]
You can loop through the string, check each character, and add it to the matching variable.
It is not clear whether you want to assign each character to its own variable, or all letters to one variable and all digits to another.
var sampleString = "aaa12111";
var _num = "";
var _alp = "";
for (var i = 0; i < sampleString.length; i++) {
  if (isNaN(sampleString[i])) {
    // not a number: collect as a letter
    _alp += sampleString[i];
  } else {
    _num += sampleString[i];
  }
}
console.log(_num, _alp); // 12111 aaa

Extract Keywords from String: Javascript

Let's say I have a string and want to extract uncommon keywords for SEO: $text = "This is some text. This is some text. Vending Machines are great.";
I will also define an array of common words to exclude from the extracted list, like $commonWords = ['i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www'];
Expected output: Result=[some,text,machines,vending]
I would really appreciate it if anyone could help write generic logic or a procedure for extracting keywords from a string.
This can help (it supports multiple languages):
https://github.com/michaeldelorenzo/keyword-extractor
var keyword_extractor = require("keyword-extractor");
var sentence = "President Obama woke up Monday facing a Congressional defeat that many in both parties believed could hobble his presidency."
// Extract the keywords
var extraction_result = keyword_extractor.extract(sentence,{
language:"english",
remove_digits: true,
return_changed_case:true,
remove_duplicates: false
});
Something like this:
var $commonWords = ['i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www'];
var $text = "This is some text. This is some text. Vending Machines are great.";
// Convert to lowercase
$text = $text.toLowerCase();
// replace unnecessary chars. leave only chars, numbers and space
$text = $text.replace(/[^\w\d ]/g, '');
var result = $text.split(' ');
// remove $commonWords
result = result.filter(function (word) {
  return $commonWords.indexOf(word) === -1;
});
// Unique words (arrays have no built-in unique(), so keep the first occurrence of each word)
result = result.filter(function (word, index) {
  return result.indexOf(word) === index;
});
console.log(result); // ["some", "text", "vending", "machines", "great"]
var string = "This is some text. This is some text. Vending Machines are great.";
var substrings = ['your', 'words', 'here'];
var results = [];
for (var i = substrings.length - 1; i >= 0; --i) {
  if (string.indexOf(substrings[i]) != -1) {
    // string contains substrings[i]
    results.push(substrings[i]);
  }
}
var arrayLength = commonWords.length;
var words = []; //new array to say the words
for (var i = 0; i < arrayLength; i++) {
if ($text.indexOf(commonWords[i]) > -1){
words.push(commonWords[i]);
}
}
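Putting the steps from the answers above together (lowercase, tokenize, drop common words, de-duplicate), a minimal sketch could look like the following. The function name extractKeywords and the shortened common-word list are assumptions for illustration only.

```javascript
// Hypothetical helper: lowercase, pull out word tokens, drop common words, de-duplicate.
function extractKeywords(text, commonWords) {
  const ignore = new Set(commonWords);
  const tokens = text.toLowerCase().match(/\w+/g) || [];
  const seen = new Set();
  return tokens.filter(function (word) {
    if (ignore.has(word) || seen.has(word)) {
      return false;
    }
    seen.add(word);
    return true;
  });
}

const keywords = extractKeywords(
  "This is some text. This is some text. Vending Machines are great.",
  ["this", "is", "are", "a", "the"] // shortened list, for illustration
);
console.log(keywords); // ["some", "text", "vending", "machines", "great"]
```

Note that "great" survives here because it is not in the common-word list; the expected output in the question omits it.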

How to match overlapping keywords with regex

This example finds only sam. How to make it find both sam and samwise?
var regex = /sam|samwise|merry|pippin/g;
var string = 'samwise gamgee';
var match = string.match(regex);
console.log(match);
Note: this is a simple example, but my real regexes are created by joining 500 keywords at a time, so it's too cumbersome to search for all overlaps and make a special case for each with something like /sam(wise)/. The other obvious solution I can think of is to just iterate through all keywords individually, but I think there must be a fast and elegant single-regex solution.
You can use lookahead regex with capturing group for this overlapping match:
var regex = /(?=(sam))(?=(samwise))/;
var string = 'samwise';
var match = string.match( regex ).filter(Boolean);
//=> ["sam", "samwise"]
It is important not to use the g (global) flag in the regex.
filter(Boolean) is used to remove first empty result from matched array.
Why not just map indexOf() on array substr:
var string = 'samwise gamgee';
var substr = ['sam', 'samwise', 'merry', 'pippin'];
var matches = substr.map(function(m) {
return (string.indexOf(m) < 0 ? false : m);
}).filter(Boolean);
console.log(matches);
// Array [ "sam", "samwise" ]
Probably of better performance than using regex. But if you need the regex functionality e.g. for caseless matching, word boundaries, returned matches... use with exec method:
var matches = substr.map(function(v) {
var re = new RegExp("\\b" + v, "i");
var m = re.exec(string);
return (m !== null ? m[0] : false);
}).filter(Boolean);
This one with i-flag (ignore case) returns each first match with initial \b word boundary.
I can't think of a simple and elegant solution, but I've got something that uses a single regex:
function quotemeta(s) {
return s.replace(/\W/g, '\\$&');
}
let keywords = ['samwise', 'sam'];
let subsumed_by = {};
keywords.sort();
for (let i = keywords.length; i--; ) {
let k = keywords[i];
for (let j = i - 1; j >= 0 && k.startsWith(keywords[j]); j--) {
(subsumed_by[k] = subsumed_by[k] || []).push(keywords[j]);
}
}
keywords.sort(function (a, b) { return b.length - a.length; });
let re = new RegExp('(?=(' + keywords.map(quotemeta).join('|') + '))[\\s\\S]', 'g');
let string = 'samwise samgee';
let result = [];
let m;
while (m = re.exec(string)) {
result.push(m[1]);
result.push.apply(result, subsumed_by[m[1]] || []);
}
console.log(result);
How about:
var re = /((sam)(?:wise)?)/;
var m = 'samwise'.match(re); // gives ["samwise", "samwise", "sam"]
var m = 'sam'.match(re); // gives ["sam", "sam", "sam"]
You can use "Unique values in an array" to remove duplicates.
If you don't want to create special cases, and if order doesn't matter, why not first match only full names with:
\b(sam|samwise|merry|pippin)\b
and then check whether any of these contains a shorter keyword? For example with:
(sam|samwise|merry|pippin)(?=\w+\b)
It is not one elegant regex, but I suppose it is simpler than iterating through all matches.
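For comparison, here is a plain-scan sketch that is not a single-regex solution: it checks every keyword at every position with startsWith, so overlapping keywords are all reported. The function name findAllKeywords is an assumption, and the cost grows with text length times keyword count, which may or may not be acceptable for 500 keywords.

```javascript
// Report every keyword that starts at every position, so overlapping keywords are all found.
function findAllKeywords(text, keywords) {
  const found = [];
  for (let i = 0; i < text.length; i++) {
    for (const keyword of keywords) {
      if (text.startsWith(keyword, i)) {
        found.push({ keyword: keyword, index: i });
      }
    }
  }
  return found;
}

const hits = findAllKeywords("samwise gamgee", ["sam", "samwise", "merry", "pippin"]);
console.log(hits.map(function (h) { return h.keyword; })); // ["sam", "samwise"]
```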

Testing for a common word between 2 strings in javascript

I have to compare 2 strings and, if at least one word is the same in both, show a success message.
var str1 = "Hello World";
var str2 = "world is beautiful";
I need to match/compare these 2 strings. The word "world" appears in both, so I need to print a success message. How do I go about it?
The following code will output all the matching words in the both strings:
var words1 = str1.split(/\s+/g),
words2 = str2.split(/\s+/g),
i,
j;
for (i = 0; i < words1.length; i++) {
for (j = 0; j < words2.length; j++) {
if (words1[i].toLowerCase() == words2[j].toLowerCase()) {
console.log('word '+words1[i]+' was found in both strings');
}
}
}
You can avoid comparing all the words in one list with all the words in the other by sorting each and eliminating duplicates. Adapting bjornd's answer:
var words1 = str1.split(/\s+/g),
words2 = str2.split(/\s+/g);
var allwords = {};
// set 1 for all words in words1
for(var wordid=0; wordid < words1.length; ++wordid) {
var low = words1[wordid].toLowerCase();
allwords[low] = 1;
}
// add 2 for all words in words2
for(var wordid=0; wordid < words2.length; ++wordid) {
  var current = 0;
  var low = words2[wordid].toLowerCase();
  if(allwords.hasOwnProperty(low)) {
    if(allwords[low] > 1) {
      continue;
    }
    // keep the words1 bit so that 1 + 2 = 3 marks a word seen in both
    current = allwords[low];
  }
  current += 2;
  allwords[low] = current;
}
// now those seen in both lists have value 3, the rest either 1 or 2.
// this is effectively a bitmask where the unit bit indicates words1 membership
// and the 2 bit indicates words2 membership
var both = [];
for(var prop in allwords) {
if(allwords.hasOwnProperty(prop) && (allwords[prop] == 3)) {
both.push(prop);
}
}
This version should be reasonably efficient, because we are using a dictionary/hash structure to store information about each set of words. The whole thing is O(n) in javascript expressions, but inevitably dictionary insertion is not, so expect something like O(n log n) in practice. If you only care that a single word matches, you can quit early in the second for loop; the code as-is will find all matches.
This is broadly equivalent to sorting both lists, reducing each to unique words, and then looking for pairs in both lists. In C++ etc you would do it via two sets, as you could do it without using a dictionary and the comparison would be O(n) after the sorts. In Python because it's easy to read:
words1 = set(item.lower() for item in str1.split())
words2 = set(item.lower() for item in str2.split())
common = words1 & words2
Python sets are hash-based rather than sorted, so building each set is O(n) on average for word count n, and the intersection (&) is then efficient, roughly O(m) in the size m of the smaller set.
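The same set-intersection idea can be sketched in JavaScript with the built-in Set. The function name commonWords is illustrative, and splitting on whitespace only is an assumption carried over from the answers above (punctuation stays attached to words).

```javascript
// Lowercase and split on whitespace, then keep words of the first string that also occur in the second.
function commonWords(str1, str2) {
  const toWords = function (s) {
    return s.toLowerCase().split(/\s+/).filter(Boolean);
  };
  const second = new Set(toWords(str2));
  // De-duplicate the first list before intersecting.
  return Array.from(new Set(toWords(str1))).filter(function (w) {
    return second.has(w);
  });
}

const shared = commonWords("Hello World", "world is beautiful");
console.log(shared); // ["world"]
if (shared.length > 0) {
  console.log("success");
}
```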
I just tried this on WriteCodeOnline and it works there:
var s1 = "hello world, this is me";
var s2 = "I am tired of this world and I want to get off";
var s1s2 = s1 + ";" + s2;
var captures = /\b(\w+)\b.*;.*\b\1\b/i.exec(s1s2);
if (captures) // exec() returns null when there is no match
{
document.write(captures[1] + " occurs in both strings");
}
else
{
document.write("no match in both strings");
}
Just adapting @Phil H's code with a real bitmask:
var strings = ["Hello World", "world is beautiful"]; // up to 32 word lists
var occurrences = {},
result = [];
for (var i=0; i<strings.length; i++) {
var words = strings[i].toLowerCase().split(/\s+/),
bit = 1<<i;
for (var j=0, l=words.length; j<l; j++) {
var word = words[j];
if (word in occurrences)
occurrences[word] |= bit;
else
occurrences[word] = bit;
}
}
// now lets do a match for all words which are both in strings[0] and strings[1]
var filter = 3; // 1<<0 | 1<<1
for (var word in occurrences)
if ((occurrences[word] & filter) === filter)
result.push(word);
OK, the simple way:
function isMatching(a, b)
{
return new RegExp("\\b(" + a.match(/\w+/g).join('|') + ")\\b", "gi").test(b);
}
isMatching("in", "pin"); // false
isMatching("Everything is beautiful, in its own way", "Every little thing she does is magic"); // true
isMatching("Hello World", "world is beautiful"); // true
...understand?
I basically converted "Hello, World!" to the regular expression /\b(Hello|World)\b/gi
Something like this would also do:
isMatching = function(str1, str2) {
str2 = str2.toLowerCase();
for (var i = 0, words = str1.toLowerCase().match(/\w+/g); i < words.length; i++) {
if (str2.search(words[i]) > -1) return true;
}
return false;
};
var str1 = "Hello World";
var str2 = "world is beautiful";
isMatching(str1, str2); // returns true
isMatching(str1, 'lorem ipsum'); // returns false

Regex for comma separated string

I'm not a regex-master, but I'm looking for a regex that would give this result in js:
var regex = ...;
var result = '"a b", "c, d", e f, g, "h"'.match(regex);
and result would be
['"a b"', '"c, d"', 'e f', 'g', '"h"']
EDIT:
Escaped quotes don't need to be handled. It's for a tagging field, where users must be able to enter:
tag1, tag2
but also
"New York, USA", "Boston, USA"
EDIT2:
Thank you for your amazingly quick answer minitech, that did the trick!
I'd just use a loop:
function splitCSVFields(row) {
var result = [];
var i, c, q = false;
var current = '';
for(i = 0; c = row.charAt(i); i++) {
if(c === '"') {
current += c;
q = !q;
} else if(c === ',' && !q) {
result.push(current.trim());
current = '';
} else {
current += c;
}
}
if(row.length > 0) {
result.push(current.trim());
}
return result;
}
Note: requires String#trim, which you can shiv as follows:
if(!String.prototype.trim) {
String.prototype.trim = function() {
return this.replace(/^\s+/, '').replace(/\s+$/, '');
};
}
Regular expressions may not be the best tool for this task. You may want to do it instead by looping through the characters and deciding what to do with each one. Here's some pseudocode that would do that:
Loop through the characters:
    Is it a quote?
        Toggle the quote flag.
    Is it a comma when the quote flag is not set?
        Add the accumulated string to the array.
        Clear the accumulated string.
        Skip the remaining steps in this iteration.
    Add the current character to the string being accumulated.
Is the accumulated string not empty?
    Add the accumulated string into the array.
Optionally, strip the whitespace off of all the strings in the array.
    This could also be done when adding the strings into the array.
var result = input.match(/(?:(?:"((?:[^"]|"")*)")|([^",\n]*))/g);
for (var i = 0; i < result.length; i++) {
result[i] = result[i].replace(/^\s*/, "").replace(/\s*$/, "");
if (result[i].length === 0) {
result.splice(i--, 1);
}
}
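For the tagging-field input in the question, where escaped quotes do not need to be handled, a single match() pattern can also be sketched. The pattern and the helper name splitTags below are assumptions for illustration, not taken from any of the answers.

```javascript
// Either a double-quoted run (no escaped quotes), or an unquoted run up to the next comma or quote.
const tagPattern = /"[^"]*"|[^",\s][^",]*/g;

function splitTags(input) {
  const matches = input.match(tagPattern) || [];
  // Unquoted tags may carry trailing spaces before the comma, so trim them.
  return matches.map(function (tag) { return tag.trim(); });
}

console.log(splitTags('"a b", "c, d", e f, g, "h"'));
// ['"a b"', '"c, d"', 'e f', 'g', '"h"']
```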
