Matching all excerpts which start and end with specific words - javascript

I have a text which looks like:
some non interesting part
trans-top
body of first excerpt
trans-bottom
next non interesting part
trans-top
body of second excerpt
trans-bottom
non interesting part
And I want to extract all excerpts starting with trans-top and ending with trans-bottom into an array. I tried that:
match(/(?=trans-top)(.|\s)*/g)
to find strings which start with trans-top, and it works. Now I want to specify the end:
match(/(?=trans-top)(.|\s)*(?=trans-bottom)/g)
and it doesn't. Firebug gives me an error:
regular expression too complex
I tried many other ways, but I can't find a working solution... I'm sure I made some stupid mistake :(

This works pretty well, but it's not all in one regex:
var test = "some non interesting part\ntrans-top\nbody of first excerpt\ntrans-bottom\nnext non interesting part\ntrans-top\nbody of second excerpt\ntrans-bottom\nnon interesting part";
var matches = test.match(/(trans-top)([\s\S]*?)(trans-bottom)/gm) || []; // match() returns null when nothing matches
for(var i=0; i<matches.length; i++) {
matches[i] = matches[i].replace(/^trans-top|trans-bottom$/gm, '');
}
console.log(matches);
If you don't want the leading and trailing linebreaks, change the loop body to:
matches[i] = matches[i].replace(/^trans-top[\s\S]|[\s\S]trans-bottom$/gm, '');
That should eat the linebreaks.

This tested function uses one regex and loops through the matches, collecting the contents of each into an array, which it returns:
function getParts(text) {
var a = [];
var re = /trans-top\s*([\S\s]*?)\s*trans-bottom/g;
var m = re.exec(text);
while (m != null) {
a.push(m[1]);
m = re.exec(text);
}
return a;
}
It also filters out any leading and trailing whitespace surrounding the contents of each match.
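On engines that support String.prototype.matchAll (ES2020), the same idea can probably be written without the explicit exec loop. This is only a sketch under that assumption; the name extractExcerpts is made up for illustration and is not from the answers above.

```javascript
// Hypothetical helper: collect the text between each trans-top / trans-bottom pair.
function extractExcerpts(text) {
  // [\s\S]*? is lazy, so each match stops at the nearest trans-bottom.
  const re = /trans-top\s*([\s\S]*?)\s*trans-bottom/g;
  // matchAll yields one match object per occurrence; m[1] is the captured body.
  return Array.from(text.matchAll(re), (m) => m[1]);
}

const sample =
  "some non interesting part\ntrans-top\nbody of first excerpt\ntrans-bottom\n" +
  "next non interesting part\ntrans-top\nbody of second excerpt\ntrans-bottom\n" +
  "non interesting part";

console.log(extractExcerpts(sample));
```

With the sample text from the question this logs the two excerpt bodies; with no markers in the input it returns an empty array rather than null.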

Related

Regex to get the first element of each line

I'm trying to get the first element of each line, be it either a number or a string but when the line starts with a number, my current attempt still includes it:
const totalWords = "===========\n\n 1-test\n\n 2-ests \n\n 1 zfzrf";
const firstWord = totalWords.replace(/\s.*/,'')
The output I get :
1-test
2-ests
1 zfzrf
The output I would like:
1
2
1
Alternatively, if you are interested in a non-regexp version (should be faster)
var str = "===========\n\n 1-test\n\n 2-ests \n\n 1 zfzrf";
var res = str.split("\n");
for (const row of res) {
let words = row.trim().split(' ');
let firstWord = words[0].trim();
// get the first character, parse it to an int, and check that it is in fact an integer
let element = firstWord.charAt(0);
if (Number.isInteger(parseInt(element))) {
console.log('Row', row);
console.log('Element: ', element);
}
}
Your Regex should skip leading spaces and then capture everything until a space or a dash, so you might want to go with ^\s*([^ -]+).
(See https://regex101.com/r/u7ELiw/1 for application to your examples)
If you additionally know exactly that you are looking for a single digit, you can instead go for ^\s*(\d)
(See https://regex101.com/r/IWhEQ1/1 again for applications)
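To apply the single-digit pattern to every line from JavaScript, it needs the g and m flags (m lets ^ match at the start of each line). A minimal sketch, assuming String.prototype.matchAll is available; the variable name firstDigits is just for illustration:

```javascript
const totalWords = "===========\n\n 1-test\n\n 2-ests \n\n 1 zfzrf";

// g: find every match; m: let ^ match at the start of each line.
// m[1] is the captured digit, without the leading whitespace.
const firstDigits = Array.from(totalWords.matchAll(/^\s*(\d)/gm), (m) => m[1]);

console.log(firstDigits);
```

For the sample string this gives the three digits 1, 2 and 1 (as strings); the ===== line is skipped because it does not start with a digit.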
Maybe I'm not sure what you are asking, but why use something as convoluted as a regex when
line.trim().charAt(0)
works pretty well? (The trim() is needed because the sample lines start with a space.)

Regex extracting multiple matches for string [duplicate]

I'm trying to obtain all possible matches from a string using regex with javascript. It appears that my method of doing this is not matching parts of the string that have already been matched.
Variables:
var string = 'A1B1Y:A1B2Y:A1B3Y:A1B4Z:A1B5Y:A1B6Y:A1B7Y:A1B8Z:A1B9Y:A1B10Y:A1B11Y';
var reg = /A[0-9]+B[0-9]+Y:A[0-9]+B[0-9]+Y/g;
Code:
var match = string.match(reg);
All matched results I get:
A1B1Y:A1B2Y
A1B5Y:A1B6Y
A1B9Y:A1B10Y
Matched results I want:
A1B1Y:A1B2Y
A1B2Y:A1B3Y
A1B5Y:A1B6Y
A1B6Y:A1B7Y
A1B9Y:A1B10Y
A1B10Y:A1B11Y
In my head, I want A1B1Y:A1B2Y to be a match along with A1B2Y:A1B3Y, even though A1B2Y in the string will need to be part of two matches.
Without modifying your regex, you can set it to start matching at the beginning of the second half of the match after each match using .exec and manipulating the regex object's lastIndex property.
var string = 'A1B1Y:A1B2Y:A1B3Y:A1B4Z:A1B5Y:A1B6Y:A1B7Y:A1B8Z:A1B9Y:A1B10Y:A1B11Y';
var reg = /A[0-9]+B[0-9]+Y:A[0-9]+B[0-9]+Y/g;
var matches = [], found;
while (found = reg.exec(string)) {
matches.push(found[0]);
reg.lastIndex -= found[0].split(':')[1].length;
}
console.log(matches);
//["A1B1Y:A1B2Y", "A1B2Y:A1B3Y", "A1B5Y:A1B6Y", "A1B6Y:A1B7Y", "A1B9Y:A1B10Y", "A1B10Y:A1B11Y"]
As per Bergi's comment, you can also take the index of the last match and increment it by 1, so that instead of resuming from the second half of the match, the regex resumes from the second character of each match:
reg.lastIndex = found.index+1;
The final outcome is the same. Though, Bergi's update has a little less code and performs slightly faster. =]
You cannot get the direct result from match, but it is possible to produce the result via RegExp.exec and with some modification to the regex:
var regex = /A[0-9]+B[0-9]+Y(?=(:A[0-9]+B[0-9]+Y))/g;
var input = 'A1B1Y:A1B2Y:A1B3Y:A1B4Z:A1B5Y:A1B6Y:A1B7Y:A1B8Z:A1B9Y:A1B10Y:A1B11Y'
var arr;
var results = [];
while ((arr = regex.exec(input)) !== null) {
results.push(arr[0] + arr[1]);
}
I used zero-width positive look-ahead (?=pattern) in order not to consume the text, so that the overlapping portion can be rematched.
Actually, it is possible to abuse the replace method to achieve the same result:
var input = 'A1B1Y:A1B2Y:A1B3Y:A1B4Z:A1B5Y:A1B6Y:A1B7Y:A1B8Z:A1B9Y:A1B10Y:A1B11Y'
var results = [];
input.replace(/A[0-9]+B[0-9]+Y(?=(:A[0-9]+B[0-9]+Y))/g, function ($0, $1) {
results.push($0 + $1);
return '';
});
However, since it is replace, it does extra useless replacement work.
Unfortunately, it's not quite as simple as a single string.match.
The reason is that you want overlapping matches, which the /g flag doesn't give you.
You could use lookahead:
var re = /A\d+B\d+Y(?=:A\d+B\d+Y)/g;
But now you get:
string.match(re); // ["A1B1Y", "A1B2Y", "A1B5Y", "A1B6Y", "A1B9Y", "A1B10Y"]
The reason is that lookahead is zero-width, meaning that it just says whether the pattern comes after what you're trying to match or not; it doesn't include it in the match.
You could use exec to try and grab what you want. If a regex has the /g flag, you can run exec repeatedly to get all the matches:
// using re from above to get the overlapping matches
var m;
var matches = [];
var re2 = /A\d+B\d+Y:A\d+B\d+Y/g; // make another regex to get what we need
while ((m = re.exec(string)) !== null) {
// m is a match object, which has the index of the current match
matches.push(string.substring(m.index).match(re2)[0]);
}
// matches now holds:
// [
// "A1B1Y:A1B2Y",
// "A1B2Y:A1B3Y",
// "A1B5Y:A1B6Y",
// "A1B6Y:A1B7Y",
// "A1B9Y:A1B10Y",
// "A1B10Y:A1B11Y"
// ]
Here's a fiddle of this in action. Open up the console to see the results
Alternatively, you could split the original string on :, then loop through the resulting array, pulling out the pairs where array[i] and array[i+1] both match the pattern you want.
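The split-based alternative might look roughly like this. It is a sketch that assumes every segment is separated by exactly one colon; overlappingPairs is a made-up name, and the per-segment pattern is anchored, which is slightly stricter than the original unanchored regex.

```javascript
// Hypothetical helper illustrating the split-based approach.
function overlappingPairs(input) {
  // One segment of a pair; no g flag, so test() keeps no state between calls.
  const part = /^A\d+B\d+Y$/;
  const parts = input.split(':');
  const pairs = [];
  for (let i = 0; i + 1 < parts.length; i++) {
    // Keep the pair only when both neighbours match.
    if (part.test(parts[i]) && part.test(parts[i + 1])) {
      pairs.push(parts[i] + ':' + parts[i + 1]);
    }
  }
  return pairs;
}

const input = 'A1B1Y:A1B2Y:A1B3Y:A1B4Z:A1B5Y:A1B6Y:A1B7Y:A1B8Z:A1B9Y:A1B10Y:A1B11Y';
console.log(overlappingPairs(input));
```

Because each segment is examined as both the right half of one pair and the left half of the next, the overlap falls out naturally without touching lastIndex.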

Use JavaScript string operations to cut out exact text

I'm trying to cut out some text from a scraped site and I'm not sure what functions or libraries I can use to make this easier.
Example of code I run from PhantomJS:
var latest_release = page.evaluate(function () {
// everything inside this function is executed inside our
// headless browser, not PhantomJS.
var links = $('[class="interesting"]');
var releases = {};
for (var i=0; i<links.length; i++) {
releases[links[i].innerHTML] = links[i].getAttribute("href");
}
// its important to take note that page.evaluate needs
// to return simple object, meaning DOM elements won't work.
return JSON.stringify(releases);
});
Class interesting has what I need, surrounded by new lines and tabs and whatnot.
here it is:
{"\n\t\t\t\n\t\t\t\tI_Am_Interesting\n\t\t\t\n\t\t":null,"\n\t\t\t\n\t\t\t\tI_Am_Interesting\n\t\t\t\n\t\t":null,"\n\t\t\t\n\t\t\t\tI_Am_Interesting\n\t\t\t\n\t\t":null}
I tried string.slice("\n"); and nothing happened. I really want an effective way to cut out strings like this, based on their position relative to those \n's and \t's.
By the way this was my split code:
var x = latest_release.split('\n');
Cheers.
It's a simple case of stripping out all the whitespace, a job that regexes do beautifully.
var s = " \n\t\t\t\n\t\t\t\tI Am Interesting\n\t\t \t \n\t\t";
s = s.replace(/[\r\t\n]+/g, ''); // remove all non space whitespace
s = s.replace(/^\s+/, ''); // remove all space from the front
s = s.replace(/\s+$/, ''); // remove all space at the end :)
console.log(s);
Further reading: https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/RegExp
var interesting = {
"\n\t\t\t\n\t\t\t\tI_Am_Interesting1\n\t\t\t\n\t\t":null,
"\n\t\t\t\n\t\t\t\tI_Am_Interesting2\n\t\t\t\n\t\t":null,
"\n\t\t\t\n\t\t\t\tI_Am_Interesting3\n\t\t\t\n\t\t":null
}
var found = [];
for (var x in interesting) {
found.push(x.match(/\w+/g)); // each key yields an array of its word-like runs
}
alert(found);
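Could you try "\\n" as the pattern? The value you get back is a JSON string, so each newline appears in it as the two characters \ and n rather than as a real line break. Note also that replace with a string pattern only replaces the first occurrence, so use a global regex:
var new_string = string.replace(/\\n/g, "").replace(/\\t/g, "");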
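Since latest_release comes out of JSON.stringify, another option is to parse it back into an object and trim the keys. A minimal sketch, assuming the shape shown in the question; the keys are made distinct here purely for the demo, and the name names is illustrative:

```javascript
// Stand-in for the JSON string returned by page.evaluate (keys made distinct for the demo).
const latest_release = JSON.stringify({
  "\n\t\t\t\n\t\t\t\tI_Am_Interesting1\n\t\t\t\n\t\t": null,
  "\n\t\t\t\n\t\t\t\tI_Am_Interesting2\n\t\t\t\n\t\t": null
});

// Parsing turns the escaped \n and \t sequences back into real whitespace characters...
const releases = JSON.parse(latest_release);

// ...which trim() then strips from both ends of each key.
const names = Object.keys(releases).map((key) => key.trim());

console.log(names);
```

This avoids hand-written patterns for the escaped sequences altogether; it only removes whitespace at the ends of each key, not inside it.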

Count the number of occurrences of a case-sensitive word in a paragraph in jQuery

I want to count the number of occurrences of specific words in a paragraph.
I am writing my code for the keydown event. I may have a few hundred words initially, and that may increase later on.
So when the user is typing, I will match the words in the paragraph and get the number of occurrences. I also need to make sure that the match is case sensitive.
Right now I am using this code:
$('.msg').val().split("AP").length - 1
Where AP is the keyword to match.
But I am not very happy with this.
I actually have a list of a few hundred keywords; how can I implement this efficiently?
Please note that the words to match have spaces on both sides, i.e. they are delimited by word boundaries.
Any help is appreciated.
You can try something like the following:
var wordList = ["some", "word", "or", "other", "CASE", "Sensitive", "is", "required"],
wordCount = [];
for (var i=0; i < wordList.length; i++)
wordCount[i] = 0;
$("#someField").keyup(function(){
var i,
text = this.value,
re;
for (i = 0; i < wordList.length; i++) {
re = new RegExp("\\b" + wordList[i] + "\\b", "g");
wordCount[i] = 0;
while (re.test(text)) wordCount[i]++;
}
});
Demo: http://jsfiddle.net/zMdYg/2/ (updated with longer word list)
I don't really know what you want to do with the results, so I've just stuck them in a simple array, but you can see in the demo I then output them to the page so you can see it working. Obviously you'd substitute your own requirement in that part.
This is using a regex to test each word. You'll notice that with .split() or .indexOf() you'll get partial matches, e.g., if you look for "other" it will also match partway through "bother" (and so forth), but with the regex I've used \b to test on word boundaries.
For a large list of words you might want to create all the regexes in advance rather than redoing them on the fly in the loop, but it seemed to work fine for my simple test so I thought I wouldn't start doing premature optimisations. I'll leave that as an exercise for the reader...
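For what it's worth, building the regexes in advance could look something like the sketch below. It assumes the words contain no regex metacharacters (otherwise they would need escaping first); countWords and wordRegexes are made-up names, and the jQuery keyup wiring is left out.

```javascript
const wordList = ["some", "word", "CASE", "Sensitive"];

// Build each regex once, up front, rather than on every keyup.
const wordRegexes = wordList.map((word) => new RegExp("\\b" + word + "\\b", "g"));

// Hypothetical helper: returns counts in the same order as wordList.
function countWords(text) {
  return wordRegexes.map((re) => {
    const found = text.match(re); // null when there is no match
    return found ? found.length : 0;
  });
}

console.log(countWords("some word, some other word; case is not CASE"));
```

Inside the keyup handler you would then call countWords(this.value). Without the i flag the match stays case-sensitive, so the lowercase "case" in the sample text is not counted.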
split() with a string separator is already case-sensitive, but it builds a throwaway array on every keystroke; indexOf() is also case-sensitive and avoids that.
So maybe something like:
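var words_array = ['one', 'two', 'three'];
var text = $('.msg').val(); // read the field once instead of on every iteration
var n_occurences = 0;
$.each(words_array, function(index, value){
    // start from the beginning of the text for each word
    var carot = text.indexOf(' ' + value + ' ');
    while (carot > -1) {
        n_occurences++;
        // resume at the trailing space so it can also act as the next match's leading space
        carot = text.indexOf(' ' + value + ' ', carot + value.length + 1);
    }
});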
I haven't tested this but I hope you get the idea.

Javascript Regex: Get everything from inside / tags

What I want
From the subject below I want to get search=adam, page=content and message=2.
Subject:
/search=adam/page=content/message=2
What I have tried so far
(\/)+search+\=+(.*)\/
But this is not good because sometimes the subject ends with nothing and in my case there must be a /
(\/)+search+\=+(.*?)+(\/*?)
But this is not good because it goes through the (\/*?) and shows me everything that's after /search=
Tool Tip:
Regex Tester
Use String.split(), no regex required:
var A = '/search=adam/page=content/message=2'.split('/');
Note that you may have to discard the first array item using .slice(1).
Then you can iterate through the name-value pairs using something like:
for(var x = 0; x < A.length; x++) {
var nameValue = A[x].split('=');
if(nameValue[0] == 'search') {
// do something with nameValue[1]
}
}
This assumes that no equals signs will be in the value. Hopefully this is the case, but if not, you could use nameValue.slice(1).join('=') instead of nameValue[1].
shows me everything that's after /search=
You used a greedy .* that will happily match slashes as well. You can use a non-greedy .*?, or a character class that excludes the slash:
(\/|^)search=([^\/]*)(\/|$)
Here the front and end may be either a slash or the start/end (^/$) of the string. (I removed the +s, as I can't work out at all what they're supposed to be doing.)
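As a quick check of that pattern, here is a small sketch; getSearchValue is a hypothetical wrapper, not something from the question:

```javascript
// Hypothetical helper: returns the value of the "search" pair, or null if absent.
function getSearchValue(subject) {
  // [^\/]* cannot cross a slash, so the value stops at the next "/" or at the end.
  const m = /(\/|^)search=([^\/]*)(\/|$)/.exec(subject);
  // m[2] is the second capture group, i.e. the text after "search=".
  return m ? m[2] : null;
}

console.log(getSearchValue('/search=adam/page=content/message=2'));
console.log(getSearchValue('/page=content/search=adam'));
console.log(getSearchValue('/page=content'));
```

The first two calls log adam, whether or not anything follows the pair, and the last logs null because there is no search pair at all.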
Alternatively, forget the regex:
var params= {};
var pieces= subject.split('/');
for (var i= pieces.length; i-->0;) {
var ix= pieces[i].indexOf('=');
if (ix!==-1)
params[pieces[i].slice(0, ix)]= pieces[i].slice(ix+1);
}
Now you can just say params.search, params.page etc.
