I have a bunch of tracklist content on my site that is in this format:
<div class="tracklist">
1. Artist - Title (Record Label)
2. Another artist - Title (Another label)
</div>
I want to use regular expressions to find the artist and label names and wrap them in links like so:
<div class="tracklist">
1. Artist - Title (Record Label)
2. Another artist - Title (Another label)
</div>
I figured I can find the artist and label names with a JavaScript regex:
var artist = /[0-9]\. .*? -/gi
var label = /\(.*?\)/gi
use jQuery to find the matching strings:
$(".tracklist").html().match(label)
$(".tracklist").html().match(artist)
and then remove the number, period, spaces, dashes and parentheses with the substring() method. But what would be a good way to then insert the links and keep the text as well?
On a more general level, is this idea viable or would it fall under the "don't parse HTML with JavaScript"? Would a server side implementation be preferable (with some XML/XSL magic)?
It doesn't fall under the "don't parse HTML with ..." rule, because you are not parsing HTML: you are parsing text and creating HTML from it.
You could get the whole text content of the div:
var text = $('.tracklist').text();
Then split into lines:
var lines = text.split(/\r?\n/);
And parse each line separately:
function parseLine(line) {
var match = line.match(/^\d+\.\s+([^-]+)\s-\s([^(]+)(\s*(.*))/);
if (match) {
var artist = match[1], title = match[2], label = match[4];
// create HTML here
}
}
$.each(lines, function(index, line) {
var elems = parseLine(line);
// append elems to the div
});
The regex can be explained as follows:
/^\d+\. # this matches the number followed by the dot at the beginning
\s+ # the number is followed by one or more whitespace characters
([^-]+) # the artist: match everything except "-"
\s-\s # matches the "-" with a single whitespace character on each side
([^(]+) # the title: matches everything except "("
(\s* # zero or more whitespace
(.*))/ # the label (still including its parentheses)
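To answer the "how do I insert the links and keep the text" part, here is a minimal sketch that builds the HTML as a string, one line at a time. The URL scheme (`/artist/...` and `/label/...`) and the helper names `link`, `linkifyLine` and `linkifyTracklist` are assumptions made up for illustration, and the regex is a slightly stricter variant that captures the label without its parentheses.

```javascript
// Hypothetical URL scheme: /artist/<name> and /label/<name> are assumptions.
function link(kind, name) {
  return '<a href="/' + kind + '/' + encodeURIComponent(name) + '">' + name + '</a>';
}

// Turn one "1. Artist - Title (Label)" line into HTML.
// Lines that do not look like a track are returned unchanged.
function linkifyLine(line) {
  var match = line.match(/^\s*(\d+)\.\s+([^-]+?)\s+-\s+([^(]+?)\s*\((.*)\)\s*$/);
  if (!match) return line;
  return match[1] + '. ' + link('artist', match[2]) + ' - ' + match[3] +
         ' (' + link('label', match[4]) + ')';
}

// Process the whole text content of the tracklist, line by line.
function linkifyTracklist(text) {
  return text.split(/\r?\n/).map(linkifyLine).join('\n');
}

console.log(linkifyTracklist(
  '1. Artist - Title (Record Label)\n2. Another artist - Title (Another label)'
));
```

With jQuery on the page, the result could then be written back with something like `$('.tracklist').html(linkifyTracklist($('.tracklist').text()))`, assuming the names contain no characters that need HTML escaping.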
A server side implementation would definitely be better. Where are you pulling the data below from? Surely you have the information in an array or similar?
1. Artist - Title (Record Label)
2. Another artist - Title (Another label)
Also, a server-side implementation would degrade gracefully if the user didn't have JavaScript (almost negligible nowadays, but it does happen!)
I don't see any point in switching to XSLT, because you'd still have to process the content of the DIV as text. For that kind of thing, jQuery/regex is about as good as it gets. You just aren't using regexes as effectively as you could be. Like @arnaud said, you should match and process one whole line at a time, using capturing groups to break out the interesting parts. Here's the regex I would use:
/^(\d+)\.\s*([^-]+?)\s*-\s*([^(]+?)\s*\((.*)\)/
match[1] is the track number,
match[2] is the artist,
match[3] is the title, and
match[4] is the label
I also arranged it so that none of the surrounding whitespace or other characters are captured--in fact, most of the whitespace is optional. In my experience, formatted data like this often contains inconsistencies in spacing; this makes it more likely the regex will match what you want it to, and it gives you the power to correct the inconsistencies. (Of course, it can also contain more serious flaws, but those usually have to be dealt with on a case-by-case basis.)
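As a rough illustration of "correcting the inconsistencies", the sketch below runs that regex over sloppily spaced lines and rebuilds each one in a single format. The function name `normalizeTrack` is made up for this example, and the `trim()` on the label is an addition, since the regex itself does not strip whitespace inside the parentheses.

```javascript
var trackPattern = /^(\d+)\.\s*([^-]+?)\s*-\s*([^(]+?)\s*\((.*)\)/;

// Rebuild a line in one consistent format; return null when it does not match.
function normalizeTrack(line) {
  var match = line.match(trackPattern);
  if (!match) return null;
  return match[1] + '. ' + match[2] + ' - ' + match[3] + ' (' + match[4].trim() + ')';
}

console.log(normalizeTrack('1.Artist-Title(Record Label)'));
console.log(normalizeTrack('2.   Another artist   -  Title   ( Another label )'));
```

An artist name that itself contains a hyphen would still be split at the first "-", which is one of the flaws that has to be handled case by case.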
Hello, I would like to write a JavaScript function that returns the 7 words before and after a match of a specific keyword.
I tried as follows:
function myFunction(text) {
b=text.match("(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,7}"+text+"(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,7}");
return b;
}
However, when I search my text for "create" I just get:
create
My desired output would be:
the Community, and view patterns you create or favorite in My Patterns. Explore results
My complete code is below, with the string to search stored in a variable called Text. I would appreciate any help with this task.
<!DOCTYPE html>
<html>
<body>
<p id="demo"></p>
<script>
var Text='RegExr was created by gskinner.com, and is proudly hosted by Media Temple. Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported. The side bar includes a Cheatsheet, full Reference, and Help. You can also Save & Share with the Community, and view patterns you create or favorite in My Patterns. Explore results with the Tools below. Replace & List output custom results. Details lists capture groups. Explain describes your expression in plain English.'
function myFunction(text) {
b=text.match("(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,7}"+text+"(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,7}");
return b;
}
document.getElementById("demo").innerHTML = myFunction("create");
</script>
</body>
</html>
Regular expressions aren't a great tool for this type of task. I would recommend using split to break your sentence into an array of words and then indexOf to find the matching word and print the adjacent words.
Here's a working example:
let sentence = "blah blah blah the Community, and view patterns you create or favorite in My Patterns. Explore results blah blah blah";
let words = sentence.split(" ");
let index = words.indexOf("create");
let result = [];
if (index > -1) {
for (let i = index - 7; i < (index + 8); i++) {
result.push(words[i]);
}
}
console.log(result.join(" "));
That's the gist of it, but you'll need to modify my code sample to take into account edge cases (i.e., multiple matching words, less than 7 words preceding/following the matching word).
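For those edge cases, one possible sketch (the function name `wordContexts` and the parameter `n` are made up here) collects a snippet for every occurrence and clamps the window at both ends of the text:

```javascript
// Return one snippet per occurrence of `term`, with up to `n` words on each side.
function wordContexts(sentence, term, n) {
  var words = sentence.split(/\s+/).filter(Boolean);
  var snippets = [];
  words.forEach(function (word, index) {
    if (word === term) {
      var start = Math.max(0, index - n); // clamp at the start of the text
      // slice() stops at the end of the array, so no upper clamp is needed
      snippets.push(words.slice(start, index + n + 1).join(' '));
    }
  });
  return snippets;
}

console.log(wordContexts('you create or favorite, then create again', 'create', 2));
// [ 'you create or favorite,', 'favorite, then create again' ]
```

Like the example above, this compares whole space-separated tokens, so "create," with a trailing comma would not count as a match.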
You can split the text into a words array and find the index of the word then use Array#slice() and Array#join()
The following also strips a comma or a trailing period from each word before comparing, in case the word is followed by punctuation, and compares case-insensitively:
var Text = 'RegExr was created by gskinner.com, and is proudly hosted by Media Temple. Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported. The side bar includes a Cheatsheet, full Reference, and Help. You can also Save & Share with the Community, and view patterns you create or favorite in My Patterns. Explore results with the Tools below. Replace & List output custom results. Details lists capture groups. Explain describes your expression in plain English.'
var term = 'create',
words = Text.split(' '),
index = words.findIndex(s => s.replace(/,|\.$/, '').toLowerCase() === term.toLowerCase()),
start = index > 6 ? index - 7 : 0;
var res = words.slice(start, index + 8).join(' ')
console.log(res)
Your regular expression works perfectly for me.
Your hiccup is that you have two variables with similar names: Text and text.
Change b=text.match to b=Text.match, because you want to match against the string outside of your function. Currently, you match the expression to a string containing only the desired word.
Something else to look for when you make your change: match returns the first occurrence of "create" which happens to be a substring of the third word. You may want to consider modifying the expression to prevent partial matches.
Some issues with your attempt:
Regexes are not strings. If you want to create a regex from a string, you need to use new RegExp()
The variables are mixed up. It does not help that one variable is called Text and the other text. And so you end up trying to find text inside text, which obviously is not what you want. So, use distinct variable names, and also pass both of them to the function
The word you search ("create") will first match with "created" near the start of the input. As the regex specifies that what follows is all optional ({0,7}), this will be considered a match! To avoid this, require that there is at least one word interruption following, or the end of the string. The same for the part that precedes the matching word: it should not be completely optional. Use {1,7} and don't require the word part in it (*). Give as alternative ^ or $ respectively.
The match method will return an array when there is a match, so you'll want to return the value inside that array (if there is a match).
So with minimal changes, your code could be made to work like this:
var text='RegExr was created by gskinner.com, and is proudly hosted by Media Temple. Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported. The side bar includes a Cheatsheet, full Reference, and Help. You can also Save & Share with the Community, and view patterns you create or favorite in My Patterns. Explore results with the Tools below. Replace & List output custom results. Details lists capture groups. Explain describes your expression in plain English.'
function myFunction(text, find) {
b = text.match(new RegExp("(?:(?:[a-zA-Z'-]*[^a-zA-Z'-]+){1,7}|^)"+find+"(?:(?:[^a-zA-Z'-]+[a-zA-Z'-]*){1,7}|$)"));
return b && b[0];
}
console.log( myFunction(text, "create") );
Be aware that gskinner.com, is counted as two distinct words in your regular expression. I assume this was your purpose.
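One more thing to watch when building a regex from a string: if the search term can contain characters such as "." or "+", they will be interpreted as regex syntax. A possible sketch, assuming the same pattern as above; `escapeRegExp` and `contextOf` are names made up for this example:

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function contextOf(text, find) {
  var word = "[a-zA-Z'-]", gap = "[^a-zA-Z'-]";
  var pattern = new RegExp(
    "(?:(?:" + word + "*" + gap + "+){1,7}|^)" +
    escapeRegExp(find) + // the term is now matched literally
    "(?:(?:" + gap + "+" + word + "*){1,7}|$)"
  );
  var b = text.match(pattern);
  return b && b[0];
}

console.log(contextOf('Replace a.b with a+b or (a|b) in the text', 'a+b'));
```

Without the escaping, "a+b" would be read as "one or more a, then b" and would not find the literal text.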
You use a string, where you should use a RegExp constructor.
You use the same variable 'text' to search and match. You want to search 'Text' and use 'text' in the regex.
You should add a 'Word boundary' around your variable to match Words.
Here's the code:
function myFunction(text) {
b=Text.match(new RegExp("(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,7}\\b" + text + "\\b(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,7}"));
return b;
}
Hope this works for you.
How about this one:
(?:\w+[,\.]? ){6}create(?:[\,.]? \w+){7}
(?:\w+[,\.]? ) is a word optionally followed by a comma or dot, and then a space
{6} means this group repeats exactly 6 times.
create matches itself literally
(?:[\,.]? \w+){7} matches 7 words, each preceded by an optional comma or dot and a space
Try it out at Regex101 or check the snippet.
var string = "RegExr was created by gskinner.com, and is proudly hosted by Media Temple. Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported. The side bar includes a Cheatsheet, full Reference, and Help. You can also Save & Share with the Community, and view patterns you create or favorite in My Patterns. Explore results with the Tools below. Replace & List output custom results. Details lists capture groups. Explain describes your expression in plain English.";
var regex = /(?:\w+[,\.]? ){6}create(?:[\,.]? \w+){7}/;
var output = string.match(regex);
console.log(output[0]);
The snippet prints:
the Community, and view patterns you create or favorite in My Patterns. Explore results
Edit: In which cases should the word create itself count as one of the 7 words?
I need to redact health information from emails that are loaded into a string variable by replacing characters with █. In the emails in question, the content between the words "health issues?" and "Have you worked" needs to be replaced, while ignoring anything that appears inside tags. Additionally, lines are often wrapped with = signs, and those newlines, spaces, and = signs can occur right in the middle of a tag. They can also occur in the middle of the strings used to identify the start and end.
Example:
(More content)
.....have any health issues? We currently do not have any health issues</sp=
an></li>
<li id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439_17326" styl=
e=3D"margin-top:0;margin-bottom:0;vertical-align:middle;line-height:15pt;co=
lor:black"><span id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439=
_17327" style=3D"font-family:Arial;font-size:11.0pt">Some more text.
Have
you worked.....(more content)
I am figuring there is a way to do this in javascript using one or more regular expressions, but I am at a loss to see how.
The desired result would look like:
(More content)
.....have any health issues?███████████████████████████████████████████</sp=
an></li>
<li id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439_17326" styl=
e=3D"margin-top:0;margin-bottom:0;vertical-align:middle;line-height:15pt;co=
lor:black"><span id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439=
_17327" style=3D"font-family:Arial;font-size:11.0pt">███████████████
Have
you worked.....(more content)
You could solve this with two nested replace calls. The first one matches everything from health issues? to Have you worked and splits it into three capturing groups. We are interested in the second capturing group:
(health issues\?)([\s\S]*?)(Have\s+you\s+worked)
^^^^^^^^
We then run a second replace on this captured group, which substitutes each character outside of tags with a █. This is the regex:
(<\/?\w[^<>]*>)|[\s\S]
We need to keep first capturing group (they are probably HTML tags) and replace the other side of alternation ([\s\S]) with the mentioned character.
Disclaimer: this is not bulletproof as regex shouldn't be used to parse HTML tags.
Demo:
var str = `(More content)
.....have any health issues? We currently do not have any health issues</sp=
an></li>
<li id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439_17326" styl=
e=3D"margin-top:0;margin-bottom:0;vertical-align:middle;line-height:15pt;co=
lor:black"><span id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439=
_17327" style=3D"font-family:Arial;font-size:11.0pt">Some more text.
Have
you worked.....(more content)`;
console.log(str.replace(/(health issues\?)([\s\S]*?)(Have\s+you\s+worked)/, function(match, $1, $2, $3) {
return $1 + $2.replace(/(<\/?\w[^<>]*>)|[\s\S]/g, function(match, $1) {
return $1 ? $1 : '█';
}) + $3;
}));
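The demo above assumes "health issues?" itself is not broken by a soft line break. If the markers can be split by "=" plus a newline, as the question says, one possible sketch is to generate the marker patterns so that such a break is allowed between any two characters. The helper names `tolerantSource` and `redact` are made up here, and the same caveat about regex and HTML applies.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern source that matches `marker` even when soft line breaks
// ("=" followed by a newline) split it; a space in the marker matches any
// run of whitespace, including a plain newline.
function tolerantSource(marker) {
  var softBreak = '(?:=\\r?\\n)*';
  return Array.from(marker)
    .map(function (ch) { return ch === ' ' ? '\\s+' : escapeRegExp(ch); })
    .join(softBreak);
}

// Replace every character between the two markers with a block, keeping tags.
function redact(str, startMarker, endMarker) {
  var re = new RegExp(
    '(' + tolerantSource(startMarker) + ')([\\s\\S]*?)(' + tolerantSource(endMarker) + ')'
  );
  return str.replace(re, function (match, start, body, end) {
    return start + body.replace(/(<\/?\w[^<>]*>)|[\s\S]/g, function (m, tag) {
      return tag ? tag : '█';
    }) + end;
  });
}

console.log(redact(
  'any health iss=\nues? We have none</span> Have\nyou worked here',
  'health issues?',
  'Have you worked'
));
```

As in the demo above, newlines and "=" signs between the markers are replaced too; keeping them would need an extra alternative in the inner regex.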
I've been browsing some answers here, but I can't find a solution to my problem:
I have this regexp which matches everything inside an HTML span tag, including its contents:
<span\b[^>]*>(.*?)</span>
and I want to find a way to make a search in all the text, except for what is matched with that regexp.
For example, if my text is:
var text = '...for there is a class of <span class="highlight">guinea</span> pigs which...'
... then the regexp would match:
<span class="highlight">guinea</span>
and I want to be able to make a regexp such that if I search for "class", regexp will match "...for there is a class of..."
and will not match inside the tag, like in
"... class="highlight"..."
The word to be matched ("class") might be anywhere within the text. I've tried
(?!<span\b[^>]*>(.*?)</span>)class
but it keeps searching inside tags as well.
I want to find a solution using only regexp, not dealing with DOM nor JQuery. Thanks in advance :).
Although I wouldn't recommend this, I would do something like below
(class)(?:(?=.*<span\b[^>]*>))|(?:(?<=<\/span>).*)(class)
You can see this in action here
Rubular Link for this regex
You can capture your matches from the groups and work with them as needed. If you can, use a HTML parser and then find matches from the text element.
It's not pretty, but if I get you right, this should do what you want. It's done with a single RegEx, but JS can't (to my knowledge) extract the result without joining the matches in a loop.
The RegEx: /(?:<span\b[^>]*>.*?<\/span>)|(.)/g
Example js code:
var str = '...for there is a class of <span class="highlight">guinea</span> pigs which...',
pattern = /(?:<span\b[^>]*>.*?<\/span>)|(.)/g,
match,
res = '';
match = pattern.exec(str)
while( match != null )
{
if (match[1] !== undefined) res += match[1]; // skip span matches, which capture nothing
match = pattern.exec(str)
}
document.writeln('Result:' + res);
In English: Do a non capturing test against your tag-expression or capture any character. Do this globally to get the entire string. The result is a capture group for each character in your string, except the tag. As pointed out, this is ugly - can result in a serious number of capture groups - but gets the job done.
If you need to send it in and retrieve the result in one call, I'd have to agree with previous contributors - It can't be done!
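If the goal is only to locate the word rather than to rebuild the text, the same alternation idea can be used to collect match positions instead. A rough sketch; the function name `findOutsideSpans` is made up, and it assumes the search word contains no regex metacharacters and that spans are not nested:

```javascript
// Return the start index of every whole-word occurrence of `word`
// that is not inside a <span ...>...</span> element.
function findOutsideSpans(text, word) {
  var pattern = new RegExp('<span\\b[^>]*>.*?<\\/span>|(\\b' + word + '\\b)', 'g');
  var hits = [];
  var m;
  while ((m = pattern.exec(text)) !== null) {
    // Group 1 is only set when the right-hand alternative matched.
    if (m[1] !== undefined) hits.push(m.index);
  }
  return hits;
}

var text = '...for there is a class of <span class="highlight">guinea</span> pigs which...';
console.log(findOutsideSpans(text, 'class')); // [ 18 ]
```

The `class` inside the span's attribute is consumed by the left-hand alternative, so it never reaches the capturing group.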
Using Javascript, I'm trying to wrap span tags around certain text on the page, but I don't want to wrap tags around text already inside a set of span tags.
Currently I'm using:
html = $('#container').html();
var regex = /([\s| ]*)(apple)([\s| ]*)/g;
html = html.replace(regex, '$1<span class="highlight">$2</span>$3');
It works but if it's used on the same string twice or if the string appears in another string later, for example 'a bunch of apples' then later 'apples', I end up with this:
<span class="highlight">a bunch of <span class="highlight">apples</span></span>
I don't want it to replace 'apples' the second time because it's already inside span tags.
It should match 'apples' here:
Red apples are my <span class="highlight">favourite fruit.</span>
But not here:
<span class="highlight">Red apples are my favourite fruit.</span>
I've tried using this but it doesn't work:
([\s| ]*)(apples).*(?!</span)
Any help would be appreciated. Thank you.
First off, you should know that parsing HTML with regex is generally considered to be a bad idea; a DOM parser is usually recommended. With this disclaimer, I will show you a simple regex solution.
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
We can solve it with a beautifully-simple regex:
<span.*?<\/span>|(\bapples\b)
The left side of the alternation | matches complete <span... /span> tags. We will ignore these matches. The right side matches and captures apples to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This program shows how to use the regex (see the results in the right pane of the online demo). Please note that in the demo I replaced with [span] instead of <span> so that the result would show in the browser (which interprets the html):
var subject = 'Red apples are my <span class="highlight">favourite apples.</span>';
var regex = /<span.*?<\/span>|(\bapples\b)/g;
var replaced = subject.replace(regex, function(m, group1) {
if (!group1) return m; // Group 1 not set: this was a <span>...</span> match, keep it as is
else return "<span class=\"highlight\">" + group1 + "</span>";
});
document.write("<br>*** Replacements ***<br>");
document.write(replaced);
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
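Since the original complaint was about running the replacement twice, here is a small check, under the assumption that spans are not nested and the word has no regex metacharacters, showing that this approach gives the same result on a second pass. The function name `highlight` is made up for this sketch.

```javascript
// Wrap each whole-word occurrence of `word` that is not already inside a <span>.
function highlight(html, word) {
  var regex = new RegExp('<span.*?<\\/span>|(\\b' + word + '\\b)', 'g');
  return html.replace(regex, function (m, group1) {
    // group1 is only set when the word was found outside a span
    return group1 ? '<span class="highlight">' + group1 + '</span>' : m;
  });
}

var once = highlight('Red apples are my <span class="highlight">favourite apples.</span>', 'apples');
var twice = highlight(once, 'apples');
console.log(once);
console.log(once === twice); // true
```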
What is the best way to capture everything except when faced with two or more new lines?
ex:
name1
address1
zipcode
name2
address2
zipcode
name3
address3
zipcode
One regex I considered was /[^\n\n]*\s*/g. But this stops when it is faced with a single \n character.
Another way I considered was /((?:.*(?=\n\n)))\s*/g. But this seems to only capture the last line ignoring the previous lines.
What is the best way to handle similar situation?
UPDATE
You can consider replacing the variable length separator with some known fixed length string not appearing in your processed text and then split. For instance:
> var s = "Hi\n\n\nBye\nCiao";
> var x = s.replace(/\n{2,}/g, "#");
> x.split("#");
["Hi", "Bye
Ciao"]
I think it is an elegant solution. You could also use the following somewhat contrived regex
> s.match(/((?!\n{2,})[\s\S])+/g);
["Hi", "
Bye
Ciao"]
and then process the resulting array by applying the trim() string method to its members in order to get rid of any \n at the beginning/end of every string in the array.
((.+)\n?)* (you probably want to make the groups non-capturing; I left it as is for readability)
The inner part (.+)\n? means "non-empty line" (at least one non-newline character as . does not match newlines unless the appropriate flag is set, followed by an optional newline)
Then, that is repeated an arbitrary number of times (matching an entire block of non-blank lines).
However, depending on what you are doing, regexp probably is not the answer you are looking for. Are you sure just splitting the string by \n\n won't do what you want?
Do you have to use regex? The solution is simple without it.
var data = 'name1...';
var matches = data.split('\n\n');
To access an individual sub section split it by \n again.
//the first section's name
var name = matches[0].split('\n')[0];
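If the number of blank lines between records varies, split accepts a regex as the separator, so no placeholder character is needed. A minimal sketch, assuming each record has exactly the three lines shown in the question; the function name `parseRecords` and the field names are made up:

```javascript
// Split the text into records on two or more newlines, then into fields.
function parseRecords(data) {
  return data
    .trim()
    .split(/\n{2,}/) // two or more newlines separate records
    .map(function (block) {
      var lines = block.split('\n');
      return { name: lines[0], address: lines[1], zipcode: lines[2] };
    });
}

var records = parseRecords(
  'name1\naddress1\nzipcode\n\nname2\naddress2\nzipcode\n\n\nname3\naddress3\nzipcode'
);
console.log(records.length);  // 3
console.log(records[1].name); // name2
```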