Extract Keywords from String: JavaScript

Let's say I have a string and want to extract the uncommon keywords for SEO: $text = "This is some text. This is some text. Vending Machines are great.";
I will also define an array of common words to exclude from the extracted list, like $commonWords = ['i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www'];
Expected output: Result=[some,text,machines,vending]
I would really appreciate it if anyone could help us write generic logic or a procedure for extracting keywords from a string.

This can help (it supports multiple languages):
https://github.com/michaeldelorenzo/keyword-extractor
// Assumes the package was installed with: npm install keyword-extractor
var keyword_extractor = require("keyword-extractor");
var sentence = "President Obama woke up Monday facing a Congressional defeat that many in both parties believed could hobble his presidency."
// Extract the keywords
var extraction_result = keyword_extractor.extract(sentence, {
  language: "english",
  remove_digits: true,
  return_changed_case: true,
  remove_duplicates: false
});

Something like this:
var $commonWords = ['i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www'];
var $text = "This is some text. This is some text. Vending Machines are great.";
// Convert to lowercase
$text = $text.toLowerCase();
// Remove unnecessary chars; keep only word characters (letters, digits, underscore) and spaces
$text = $text.replace(/[^\w ]/g, '');
var result = $text.split(' ');
// Remove $commonWords
result = result.filter(function (word) {
  return $commonWords.indexOf(word) === -1;
});
// Keep unique words only (arrays have no built-in unique(), so keep each first occurrence)
result = result.filter(function (word, index, all) {
  return all.indexOf(word) === index;
});
console.log(result); // ["some", "text", "vending", "machines", "great"]
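The steps above (lowercase, tokenize, drop common words, de-duplicate) can also be wrapped in one reusable function. This is only a minimal sketch: the name extractKeywords and its parameters are assumptions for illustration, and it uses a Set so that each stop-word lookup does not scan the whole list.

```javascript
// Hypothetical helper: extractKeywords is an illustrative name, not from the answers above.
function extractKeywords(text, stopWords) {
  // A Set gives fast membership checks for the stop-word list.
  const stop = new Set(stopWords.map((w) => w.toLowerCase()));
  const seen = new Set();
  const keywords = [];
  // Lowercase, then pull out runs of letters, digits and underscores.
  const tokens = text.toLowerCase().match(/\w+/g) || [];
  for (const token of tokens) {
    if (!stop.has(token) && !seen.has(token)) {
      seen.add(token); // remember the word so later repeats are skipped
      keywords.push(token);
    }
  }
  return keywords;
}

const sample = "This is some text. This is some text. Vending Machines are great.";
console.log(extractKeywords(sample, ["this", "is", "are"]));
// -> ["some", "text", "vending", "machines", "great"]
```

Keywords come back in order of first appearance, which is usually what you want when showing them to a user.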

var string = "This is some text. This is some text. Vending Machines are great.";
var substrings = ['your', 'words', 'here'];
var results = [];
for (var i = substrings.length - 1; i >= 0; --i) {
  if (string.indexOf(substrings[i]) != -1) {
    // string contains substrings[i]
    results.push(substrings[i]);
  }
}

var arrayLength = $commonWords.length;
var words = []; // new array to hold the common words found in $text
for (var i = 0; i < arrayLength; i++) {
  if ($text.indexOf($commonWords[i]) > -1) {
    words.push($commonWords[i]);
  }
}
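One caveat with indexOf on the raw string: it finds substrings, so a short word such as "in" is reported even when it only occurs inside a longer word. Here is a small sketch of the difference; the helper names containsSubstring and containsWord are assumptions for illustration.

```javascript
// Illustrative names only: these helpers are not from the answers above.
function containsSubstring(text, word) {
  // True whenever the characters appear anywhere, even inside another word.
  return text.toLowerCase().indexOf(word.toLowerCase()) !== -1;
}

function containsWord(text, word) {
  // Tokenize first, then compare whole tokens.
  const tokens = text.toLowerCase().match(/\w+/g) || [];
  return tokens.includes(word.toLowerCase());
}

const phrase = "Vending machines";
console.log(containsSubstring(phrase, "in")); // true: "in" occurs inside "vending"
console.log(containsWord(phrase, "in"));      // false: "in" is not a whole word here
```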

Related

Tokenize in JavaScript

If I have a string, how can I split it into an array of words and filter out some stopwords? I only want words of length 2 or greater.
If my string is
var text = "This is a short text about StackOverflow.";
I can split it with
var words = text.split(/\W+/);
But using split(/\W+/), I get all words. I could check if the words have a length of at least 2 with
function validate(token) {
return /\w{2,}/.test(token);
}
but I guess I could do this smarter/faster with regexp.
I also have an array var stopwords = ['has', 'have', ...] which shouldn't be allowed in the array.
Actually, if I can find a way to filter out stopwords, I could just add all letters a, b, c, ..., z to the stopwords array to only accept words with at least 2 characters.
I would do what you started: split by /\W+/ and then validate each token (length and stopwords) in the array by using .filter().
var text = "This is a short text about StackOverflow.";
var stopwords = ['this'];
var words = text.split(/\W+/).filter(function(token) {
token = token.toLowerCase();
return token.length >= 2 && stopwords.indexOf(token) == -1;
});
console.log(words); // ["is", "short", "text", "about", "StackOverflow"]
You could easily tweak a regex to look for words >= 2 characters, but there's no point if you're already going to need to post-process to remove stopwords (token.length will be faster than any fancy regex you write).
Easy with Ramda:
var text = "This is a short text about how StackOverflow has gas.";
var stopWords = ['have', 'has'];
var isLongWord = R.compose(R.gt(R.__, 2), R.length);
var isGoWord = R.compose(R.not, R.contains(R.__, stopWords));
var tokenize = R.compose(R.filter(isGoWord), R.filter(isLongWord), R.split(' '));
tokenize(text); // ["This", "short", "text", "about", "how", "StackOverflow", "gas."]
http://bit.ly/1V5bVrP
What about splitting on something like this if you want to use a pure regex approach:
\W+|\b\w{1,2}\b
https://regex101.com/r/rB4cJ4/1
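If a regex-only approach is preferred, matching the words directly instead of splitting avoids the empty strings that split can leave behind. A minimal sketch, assuming a helper named longWords with a minLength parameter (both names are made up for this example):

```javascript
// Illustrative helper; longWords and minLength are assumed names, not from the thread.
function longWords(text, minLength = 2) {
  // Builds a pattern like /\b\w{2,}\b/g; match() returns null when nothing matches.
  const pattern = new RegExp("\\b\\w{" + minLength + ",}\\b", "g");
  return text.match(pattern) || [];
}

console.log(longWords("This is a short text about StackOverflow."));
// -> ["This", "is", "short", "text", "about", "StackOverflow"]
console.log(longWords("This is a short text about StackOverflow.", 3));
// -> ["This", "short", "text", "about", "StackOverflow"]
```

Stopwords would still need a separate .filter() pass afterwards.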
Something like this?
function filterArray(a, num_words, stop_words) {
  var b = [];
  for (var ct = 0; ct <= a.length - 1; ct++) {
    // Keep words longer than num_words characters that are not stop words
    if (!(a[ct].length <= num_words) && !ArrayContains(a[ct], stop_words)) {
      b.push(a[ct]);
    }
  }
  return b;
}
function ArrayContains(word, a) {
  for (var ct = 0; ct <= a.length - 1; ct++) {
    if (word == a[ct]) {
      return true;
    }
  }
  return false; // only after every element has been checked
}
var words = "He walks the dog";
var stops = ["dog"];
var a = words.split(" ");
var f = filterArray(a, 2, stops); // ["walks", "the"]
This should help:
(?:\b\W*\w\W*\b)+|\W+
output:
ThisisashorttextaboutStackOverflow. A..Zabc..xyz.
where is matched string.

Get Full string using part of a given string

var string = "Let's say the user inputs hello world inputs inputs inputs";
The input I use to find the whole word is "put".
The word I expect to get back is "inputs".
Can anyone share a solution?
Thanks in advance.
One way to do what you're asking is to split the input string into tokens, then check each one to see if it contains the desired substring. To eliminate duplicates, store the words in an object and only put a word into the result list if you're seeing it for the first time.
function findUniqueWordsWithSubstring(text, sub) {
var words = text.split(' '),
resultHash = {},
result = [];
for (var i = 0; i < words.length; ++i) {
var word = words[i];
if (word.indexOf(sub) == -1) {
continue;
}
if (resultHash[word] === undefined) {
resultHash[word] = true;
result.push(word);
}
}
return result;
}
var input = 'put some putty on the computer output',
words = findUniqueWordsWithSubstring(input, 'put');
alert(words.join(', '));
A RegEx and a filter to remove duplicates:
var string = "I put putty on the computer. putty, PUT do I"
var uniques = {};
var result = (string.match(/\b\w*put\w*\b/ig) || []).filter(function(item) {
item = item.toLowerCase();
return uniques[item] ? false : (uniques[item] = true);
});
document.write( result.join(", ") );
// put, putty, computer
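When the search term comes from user input rather than a hard-coded "put", it has to be escaped before being placed in a RegExp, otherwise characters such as "." or "(" are read as regex syntax. A sketch under that assumption; wordsContaining and escapeRegExp are hypothetical helper names:

```javascript
// Hypothetical helpers; escapeRegExp uses the commonly cited escaping pattern.
function escapeRegExp(s) {
  // Prefix every regex metacharacter with a backslash.
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function wordsContaining(text, sub) {
  const pattern = new RegExp("\\b\\w*" + escapeRegExp(sub) + "\\w*\\b", "gi");
  const seen = new Set();
  // Keep the first occurrence of each word, compared case-insensitively.
  return (text.match(pattern) || []).filter((word) => {
    const key = word.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

console.log(wordsContaining("Let's say the user inputs hello world inputs inputs inputs", "put"));
// -> ["inputs"]
```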

Word count issue with JS

I am very, very new at JS with no programming experience and I am struggling with creating a script that counts words in a text box. I have the following code and I can't get anything to populate:
var myTextareaElement = document.getElementById("myWordsToCount");
myTextareaElement.onkeyup = function(){
var wordsCounted = myTextareaElement.value;
var i = 0;
var str = wordsCounted;
var words = str.split('');
for (var i = words.length; i++) {if (words[i].length > 0; i++) { words[i] };
}
And for the Span Id in my HTML, I put the following:
<span id="wordsCounted"></span>
Any direction on where I am royally messing up would be great. I have tried it in JSFiddle and can't get it to populate.
The split method needs a proper separator: you can use a space " " or a regex that matches any whitespace. "My name is XXX".split(/\s+/) returns ["My", "name", "is", "XXX"].
If you just want the number of words you can do "My name is XXX".split(/\s+/).length, which will return 4.
Try this; it may do what you want. Instead of using a for loop, split the text into words and display the length of the resulting array.
var myTextareaElement = document.getElementById("myWordsToCount");
myTextareaElement.onkeyup = function () {
  var str = myTextareaElement.value.trim();
  // Split on runs of whitespace; split('') would return single characters instead of words
  var words = str === "" ? [] : str.split(/\s+/);
  document.getElementById('wordsCounted').innerHTML = words.length;
};
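It can help to keep the counting logic in a small pure function so it can be checked without a page. A minimal sketch; countWords is a made-up name, and it treats an empty or whitespace-only box as zero words:

```javascript
// Hypothetical pure helper, separate from any DOM code.
function countWords(text) {
  const trimmed = text.trim();
  // Splitting "" would still yield one (empty) element, so handle it explicitly.
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

console.log(countWords("My name is XXX")); // 4
console.log(countWords("   "));            // 0
```

Inside the onkeyup handler it would be called as countWords(myTextareaElement.value), with the result written into the wordsCounted span.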

Get first word of string

Okay, here is my code with details of what I have tried to do:
var str = "Hello m|sss sss|mmm ss";
//Now I separate them by "|"
var str1 = str.split("|");
//Now I want to get the first word of each of the split string parts:
for (var i = 0; i < codelines.length; i++) {
//What to do here to get the first word of every split part
}
So what should I do there? :\
What I want to get is :
firstword[0] will give "Hello"
firstword[1] will give "sss"
firstword[2] will give "mmm"
Use a regular expression:
var totalWords = "foo love bar very much.";
var firstWord = totalWords.replace(/ .*/,'');
$('body').append(firstWord);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
Split again by a whitespace:
var firstWords = [];
for (var i=0;i<codelines.length;i++)
{
var words = codelines[i].split(" ");
firstWords.push(words[0]);
}
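Or use String.prototype.substring() (probably faster):
var firstWords = [];
for (var i = 0; i < codelines.length; i++) {
  var codeLine = codelines[i];
  var spaceIndex = codeLine.indexOf(" ");
  // indexOf returns -1 when there is no space; in that case the whole line is the first word
  var firstWord = spaceIndex === -1 ? codeLine : codeLine.substring(0, spaceIndex);
  firstWords.push(firstWord);
}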
To get first word of string you can do this:
let myStr = "Hello World"
let firstWord = myStr.split(" ")[0]
console.log(firstWord)
split(" ") will convert your string into an array of words (substrings resulted from the division of the string using space as divider) and then you can get the first word accessing the first array element with [0].
See more about the split method.
I'm using this:
function getFirstWord(str) {
let spaceIndex = str.indexOf(' ');
return spaceIndex === -1 ? str : str.substring(0, spaceIndex);
};
How about using Underscore.js:
str = "There are so many places on earth that I want to go, i just dont have time. :("
firstWord = _.first( str.split(" ") )
An improvement upon previous answers (working on multi-line or tabbed strings):
String.prototype.firstWord = function(){return this.replace(/\s.*/,'')}
Or using search and substr:
String.prototype.firstWord = function(){let sp=this.search(/\s/);return sp<0?this:this.substr(0,sp)}
Or without regex:
String.prototype.firstWord = function(){
let sps=[this.indexOf(' '),this.indexOf('\u000A'),this.indexOf('\u0009')].
filter((e)=>e!==-1);
return sps.length? this.substr(0,Math.min(...sps)) : this;
}
Examples:
String.prototype.firstWord = function(){return this.replace(/\s.*/,'')}
console.log(`linebreak
example 1`.firstWord()); // -> linebreak
console.log('space example 2'.firstWord()); // -> space
console.log('tab example 3'.firstWord()); // -> tab
Split each part on a space and take the first element:
var str = "Hello m|sss sss|mmm ss";
// Now I separate them by "|"
var str1 = str.split('|');
// Now I want to get the first word of each of the split string parts:
for (var i = 0; i < str1.length; i++) {
  // Take the first space-separated token of each part
  var firstWord = str1[i].split(' ')[0];
  alert(firstWord);
}
This code should get you the first word:
var str = "Hello m|sss sss|mmm ss";
// Now I separate them by "|"
var str1 = str.split('|');
// Now I want to get the first word of each of the split string parts:
for (var i = 0; i < str1.length; i++) {
  // The first element after splitting on a space is the first word
  var words = str1[i].split(" ");
  console.log(words[0]);
}
In modern JS, this is simplified, and you can write something like this:
const firstWords = str =>
str .split (/\|/) .map (s => s .split (/\s+/) [0])
const str = "Hello m|sss sss|mmm ss"
console .log (firstWords (str))
We first split the string on the | and then split each string in the resulting array on any white space, keeping only the first one.
I'm surprised this method hasn't been mentioned: "Some string".split(' ').shift()
To answer the question directly:
let firstWords = []
let str = "Hello m|sss sss|mmm ss";
const codeLines = str.split("|");
for (var i = 0; i < codeLines.length; i++) {
const first = codeLines[i].split(' ').shift()
firstWords.push(first)
}
const getFirstWord = string => {
const firstWord = [];
for (let i = 0; i < string.length; i += 1) {
if (string[i] === ' ') break;
firstWord.push(string[i]);
}
return firstWord.join('');
};
console.log(getFirstWord('Hello World'));
or simplify it:
const getFirstWord = string => {
const words = string.split(' ');
return words[0];
};
console.log(getFirstWord('Hello World'));
This code should get you the first word:
const myName = 'Jahid Bhuiyan';
console.log(myName.slice(0, myName.indexOf(' ')));
The answer will be "Jahid".
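Note that slicing up to indexOf(' ') only works when the string actually contains a space: with no space, indexOf returns -1 and slice(0, -1) drops the last character. A sketch of a more forgiving variant, using a hypothetical helper named firstWordOf that also tolerates leading whitespace:

```javascript
// Illustrative helper; firstWordOf is an assumed name.
function firstWordOf(str) {
  // \S+ matches the first run of non-whitespace characters, wherever it starts.
  const match = str.match(/\S+/);
  return match ? match[0] : "";
}

console.log(firstWordOf("Jahid Bhuiyan")); // "Jahid"
console.log(firstWordOf("Jahid"));         // "Jahid"
console.log(firstWordOf("  padded text")); // "padded"
console.log("Jahid".slice(0, "Jahid".indexOf(" "))); // "Jahi" (indexOf returned -1)
```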

Testing for a common word between 2 strings in javascript

I have to compare 2 strings, and if at least one word is the same in both, I need to show a success message.
var str1 = "Hello World";
var str2 = "world is beautiful";
I need to match/compare these 2 strings. The word "world" appears in both, so I need to print a success message. How do I go about it?
The following code will output all the matching words in both strings:
var words1 = str1.split(/\s+/g),
words2 = str2.split(/\s+/g),
i,
j;
for (i = 0; i < words1.length; i++) {
for (j = 0; j < words2.length; j++) {
if (words1[i].toLowerCase() == words2[j].toLowerCase()) {
console.log('word '+words1[i]+' was found in both strings');
}
}
}
You can avoid comparing all the words in one list with all the words in the other by sorting each and eliminating duplicates. Adapting bjornd's answer:
var words1 = str1.split(/\s+/g),
words2 = str2.split(/\s+/g);
var allwords = {};
// set 1 for all words in words1
for(var wordid=0; wordid < words1.length; ++wordid) {
var low = words1[wordid].toLowerCase();
allwords[low] = 1;
}
// add 2 for all words in words2
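for(var wordid=0; wordid < words2.length; ++wordid) {
  var current = 0;
  var low = words2[wordid].toLowerCase();
  if(allwords.hasOwnProperty(low)) {
    if(allwords[low] > 1) {
      continue; // already marked as seen in words2
    }
    current = allwords[low]; // start from the existing value (1) so the sum can reach 3
  }
  current += 2;
  allwords[low] = current;
}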
// now those seen in both lists have value 3, the rest either 1 or 2.
// this is effectively a bitmask where the unit bit indicates words1 membership
// and the 2 bit indicates words2 membership
var both = [];
for(var prop in allwords) {
if(allwords.hasOwnProperty(prop) && (allwords[prop] == 3)) {
both.push(prop);
}
}
This version should be reasonably efficient, because we are using a dictionary/hash structure to store information about each set of words. The whole thing is O(n) in JavaScript expressions, but inevitably dictionary insertion is not, so expect something like O(n log n) in practice. If you only care that a single word matches, you can quit early in the second for loop; the code as-is will find all matches.
This is broadly equivalent to sorting both lists, reducing each to unique words, and then looking for pairs in both lists. In C++ etc you would do it via two sets, as you could do it without using a dictionary and the comparison would be O(n) after the sorts. In Python because it's easy to read:
words1 = set(item.lower() for item in str1.split())
words2 = set(item.lower() for item in str2.split())
common = words1 & words2
Python's set is hash-based, so no sort actually takes place here: building each set is O(n) on average in the word count n, and the intersection (&) is then efficient, O(m) on average in the size m of the smaller set.
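The same set-intersection idea can be written in JavaScript with the built-in Set. This is only a sketch; the helper name commonWords is an assumption, and words are split on whitespace exactly as in the Python lines above:

```javascript
// Hypothetical helper; commonWords is an assumed name, not part of the answer above.
function commonWords(a, b) {
  // Lowercase, split on whitespace, and drop empty strings before building the Set.
  const toSet = (s) => new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const wordsA = toSet(a);
  const wordsB = toSet(b);
  // Keep the words of the first set that also appear in the second.
  return [...wordsA].filter((word) => wordsB.has(word));
}

console.log(commonWords("Hello World", "world is beautiful")); // ["world"]
```

For the success message in the question, checking commonWords(str1, str2).length > 0 is enough.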
I just tried this on WriteCodeOnline and it works there:
var s1 = "hello world, this is me";
var s2 = "I am tired of this world and I want to get off";
var s1s2 = s1 + ";" + s2;
var captures = /\b(\w+)\b.*;.*\b\1\b/i.exec(s1s2);
// exec returns null when there is no match, so test captures itself before indexing it
if (captures)
{
document.write(captures[1] + " occurs in both strings");
}
else
{
document.write("no match in both strings");
}
Just adapting @Phil H's code with a real bitmask:
var strings = ["Hello World", "world is beautiful"]; // up to 32 word lists
var occurrences = {},
result = [];
for (var i=0; i<strings.length; i++) {
var words = strings[i].toLowerCase().split(/\s+/),
bit = 1<<i;
for (var j=0, l=words.length; j<l; j++) {
var word = words[j];
if (word in occurrences)
occurrences[word] |= bit;
else
occurrences[word] = bit;
}
}
// now let's do a match for all words which are in both strings[0] and strings[1]
var filter = 3; // 1<<0 | 1<<1
for (var word in occurrences)
if ((occurrences[word] & filter) === filter)
result.push(word);
OK, the simple way:
function isMatching(a, b)
{
return new RegExp("\\b(" + a.match(/\w+/g).join('|') + ")\\b", "gi").test(b);
}
isMatching("in", "pin"); // false
isMatching("Everything is beautiful, in its own way", "Every little thing she does is magic"); // true
isMatching("Hello World", "world is beautiful"); // true
...understand?
I basically converted "Hello, World!" to the regular expression /\b(Hello|World)\b/gi
Something like this would also do:
isMatching = function(str1, str2) {
str2 = str2.toLowerCase();
for (var i = 0, words = str1.toLowerCase().match(/\w+/g); i < words.length; i++) {
if (str2.search(words[i]) > -1) return true;
}
return false;
};
var str1 = "Hello World";
var str2 = "world is beautiful";
isMatching(str1, str2); // returns true
isMatching(str1, 'lorem ipsum'); // returns false
