I have a list of nearly 1000 awards to appear on a site. Each award is in its own span and follows the format
x for y
example:
<span>Broadcast Film Critics Association Award for Best Director</span>
For each of these spans, I would like to bold all the text before "for". How can I match any number of words before (and not including) the word "for" (the whole word, not just the characters), and bold them?
I know with an expression like
\S+\s+\S+\s+for
I am searching up to 2 words before (and including) the characters f, o, and r. But I want to match the word "for", and not just the characters, and I don't want to include "for" in what is being bolded.
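Regex would seem to be the best solution here. However, I would suggest using a regex with capturing groups so you can rebuild the string when adding the <b> tags. Try this:
$('span').html(function(i, v) {
  // Lazy (.+?) stops at the first standalone word "for"; \b keeps "Performance" from matching
  var matches = /^(.+?)\b(for\b.+)$/i.exec(v);
  // Leave spans that contain no "for" unchanged
  return matches ? '<b>' + matches[1] + '</b>' + matches[2] : v;
});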
This assumes that everything before the very first for is what you need to match. If you worry about more than one for in the string, you'll need a different solution.
(.*?\b)for
This is "reluctantly match and capture everything up to the first word break followed by for."
http://rubular.com/r/OLNZjkA2ck
You could also use
.*?\b(?=for)
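As a rough sketch of how the lookahead idea could be applied in plain JavaScript (no jQuery): the helper name boldBeforeFor is made up for illustration, and the pattern additionally assumes that "for" is a whole word preceded by whitespace.

```javascript
// Hypothetical helper: bolds everything before the first standalone word "for".
function boldBeforeFor(text) {
  // ^(.*?) lazily grabs the shortest prefix; the lookahead (?=\s+for\b)
  // requires whitespace plus the whole word "for" right after it,
  // without making "for" part of the match.
  return text.replace(/^(.*?)(?=\s+for\b)/i, '<b>$1</b>');
}

console.log(boldBeforeFor('Broadcast Film Critics Association Award for Best Director'));
console.log(boldBeforeFor('Best Performance for a Lead Actor'));
```

Because the lookahead does not consume anything, the text from " for" onward is left untouched, and a string with no standalone "for" is returned unchanged.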
You can also use a regex replace to achieve this, as below.
$('span').html(function(index, content) {
  // Lazy (.+?) and \b make the split happen at the first standalone word "for"
  return content.replace(/^(.+?)\b(for\b.+)$/i, '<strong>$1</strong>$2');
});
I need to redact health information from emails that are loaded into a string variable by replacing characters with █. In these emails, the content between the phrases "health issues?" and "Have you worked" needs to be replaced, while anything that appears inside tags is left alone. Additionally, lines are often wrapped with = signs, and those newlines, spaces, and = signs can occur right in the middle of a tag, and they can also occur in the middle of the strings used to identify the start and end.
Example:
(More content)
.....have any health issues? We currently do not have any health issues</sp=
an></li>
<li id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439_17326" styl=
e=3D"margin-top:0;margin-bottom:0;vertical-align:middle;line-height:15pt;co=
lor:black"><span id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439=
_17327" style=3D"font-family:Arial;font-size:11.0pt">Some more text.
Have
you worked.....(more content)
I am figuring there is a way to do this in javascript using one or more regular expressions, but I am at a loss to see how.
The desired result would look like:
(More content)
.....have any health issues?███████████████████████████████████████████</sp=
an></li>
<li id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439_17326" styl=
e=3D"margin-top:0;margin-bottom:0;vertical-align:middle;line-height:15pt;co=
lor:black"><span id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439=
_17327" style=3D"font-family:Arial;font-size:11.0pt">███████████████
Have
you worked.....(more content)
You could use two replace calls to solve this problem. The first one matches everything from health issues? to Have you worked and splits it into three capturing groups. We are interested in the second capturing group:
(health issues\?)([\s\S]*?)(Have\s+you\s+worked)
^^^^^^^^
We run the second replace on this captured group; it substitutes each character outside of tags with a █. This is the regex:
(<\/?\w[^<>]*>)|[\s\S]
We keep whatever the first capturing group matches (these are probably HTML tags) and replace whatever the other side of the alternation ([\s\S]) matches with the █ character.
Disclaimer: this is not bulletproof as regex shouldn't be used to parse HTML tags.
Demo:
var str = `(More content)
.....have any health issues? We currently do not have any health issues</sp=
an></li>
<li id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439_17326" styl=
e=3D"margin-top:0;margin-bottom:0;vertical-align:middle;line-height:15pt;co=
lor:black"><span id=3D"m_-622133557606915713yui_3_16_0_ym19_1_1515713539439=
_17327" style=3D"font-family:Arial;font-size:11.0pt">Some more text.
Have
you worked.....(more content)`;
console.log(str.replace(/(health issues\?)([\s\S]*?)(Have\s+you\s+worked)/, function(match, $1, $2, $3) {
return $1 + $2.replace(/(<\/?\w[^<>]*>)|[\s\S]/g, function(match, $1) {
return $1 ? $1 : '█';
}) + $3;
}));
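The desired output in the question keeps the line breaks in place, whereas [\s\S] also turns newlines into █. A possible variant, shown here only as a sketch, swaps [\s\S] for [^\r\n] in the inner regex so that newlines outside tags survive. The function name redactBetween and the short test string are made up for illustration.

```javascript
// Hypothetical helper: redacts text between the two markers from the question,
// keeping tags and line breaks intact.
function redactBetween(str) {
  return str.replace(
    /(health issues\?)([\s\S]*?)(Have\s+you\s+worked)/,
    function (match, start, middle, end) {
      // First alternative: a tag (may span lines), kept as is.
      // Second alternative: any single non-newline character, replaced.
      // Newlines outside tags match neither alternative and are left alone.
      var redacted = middle.replace(/(<\/?\w[^<>]*>)|[^\r\n]/g, function (m, tag) {
        return tag ? tag : '█';
      });
      return start + redacted + end;
    }
  );
}

var sample = 'any health issues? No</sp=\nan><b>x\nHave you worked';
console.log(redactBetween(sample));
```

Like the original answer, this is not bulletproof: it relies on a regex to recognize tags, and it does not handle = line wraps that fall inside the start or end marker.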
Given the following data
green-pineapple-bird
red-apple-dog
blue-apple-cat
green-apple-orange-horse
green-apple-mouse
I am trying to figure out how to get (Javascript) RegExp.test() to match for any entry that contains the word "apple" (anywhere) but not match any entry that contains the word "orange" (anywhere). The resulting list would be:
red-apple-dog
blue-apple-cat
green-apple-mouse
I have included the dashes in the data just to make it easier to read. The actual data may, or may not, include dashes.
If I try this:
/^(?!orange).*(apple).*/gm
using https://regex101.com/ it matches all lines.
Using the question "JavaScript RegEx excluding certain word/phrase?" for inspiration, I tried:
/^(?!.*apple\.(?:orange|butter)).*apple\.\w+.*/gm
If it makes a difference I am using Mozilla Rhino 1.7R4.
For each character not in apple (before or after), you need to repeat the negative lookahead for orange. Because you don't want pineapple to match, you should also put word boundaries around the apple:
const re = /^((?!orange).)*\bapple\b((?!orange).)*$/;
`green-pineapple-bird
red-apple-dog
blue-apple-cat
green-apple-orange-horse
green-apple-mouse`
.split('\n')
.forEach(str => {
console.log(re.test(str) + ' ' + str)
});
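If the per-character lookahead feels heavy, a single lookahead anchored at the start of the line should give the same result for this data. The sketch below uses var and function expressions on the assumption that an older engine such as Rhino 1.7R4 may not support const or arrow functions; the variable names are illustrative.

```javascript
// One lookahead at the start rejects any line that contains "orange" anywhere;
// the rest of the pattern only has to find the whole word "apple".
var appleNotOrange = /^(?!.*orange).*\bapple\b/;

var lines = [
  'green-pineapple-bird',
  'red-apple-dog',
  'blue-apple-cat',
  'green-apple-orange-horse',
  'green-apple-mouse'
];

// No g flag on the regex, so test() keeps no state between calls.
var kept = lines.filter(function (line) {
  return appleNotOrange.test(line);
});

console.log(kept);
```

The word boundaries still exclude "pineapple". If the real data has no separators around "apple", the \b anchors would need to be dropped or replaced with something suited to that data.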
I have a simple regex which finds a word in text:
var patern = new RegExp("\\bsomething\\b", "gi");
This matches the word when it has spaces or punctuation around it.
So it matches:
I have something.
But doesn't match:
I havesomething.
which is fine and exactly what I need.
But I have an issue with, for example, Arabic. If I have this regex:
var patern = new RegExp("\\bرياضة\\b", "gi");
and text:
رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي
The keyword which I am looking for is at the end of the text.
But this doesn't work, it just doesn't find it.
It works if I remove \b from regex:
var patern = new RegExp("رياضة", "gi");
But that is not what I want, because I don't want to find it when it is part of another word, as in the English example above:
I havesomething.
I don't know much about regex, so I would appreciate any help making this work with both English and languages like Arabic.
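First we have to understand what \b means:
\b is an anchor that matches at a position called a "word boundary". In JavaScript it is defined in terms of \w, which covers only ASCII letters, digits, and the underscore, so Arabic letters never count as word characters.
In your case, the boundary you are looking for is a position with no other Arabic letter next to the word.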
To match only Arabic letters in Regex, we use unicode:
[\u0621-\u064A]+
Or we can simply use Arabic letters directly
[ء-ي]+
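The code above will match any run of Arabic letters. To make a word boundary out of it, we can negate the class on both sides:
[^ء-ي]ARABIC TEXT[^ء-ي]
The code above means: require a non-Arabic character on each side of the Arabic word, which will work in your case. Note that the two surrounding characters become part of the match, so a word at the very start or end of the string needs padding (the example below pads the text with spaces).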
Consider this example that you gave us which I modified a little bit:
أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا
If we search for رياض alone, the search will also match inside رياضة, رياضيات, and رياضتي. With the negated classes added, however, only the standalone رياض is matched.
var x = " أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا ";
x = x.replace(/([^ء-ي]رياض[^ء-ي])/g, '<span style="color:red">$1</span>');
document.write (x);
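A possible refinement, sketched below, replaces the two negated classes with lookarounds so that the neighbouring characters are not consumed and a word at the very start or end of the string still matches. This assumes an engine that supports lookbehind (ES2018 or later), that the text carries no diacritics, and that the search word contains no regex metacharacters; the helper name wholeWordRegex is made up for illustration.

```javascript
// Same letter range as above, written as escapes so it can be reused in a string.
const ARABIC_LETTER = '[\\u0621-\\u064A]';

// Hypothetical helper: builds a regex that matches `word` only when it is not
// directly preceded or followed by another Arabic letter.
function wholeWordRegex(word) {
  return new RegExp('(?<!' + ARABIC_LETTER + ')' + word + '(?!' + ARABIC_LETTER + ')', 'g');
}

const word = 'رياض';
const text = 'رياض أنا أحب رياضتي رياضة رياضيات';

// Only the standalone word at the start should be found; the longer words
// all continue with another Arabic letter and are rejected by the lookahead.
const matches = text.match(wholeWordRegex(word)) || [];
console.log(matches.length);
```

Since the lookarounds match positions rather than characters, the replacement no longer wraps the surrounding spaces, and the text does not need padding.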
If you would like to account for أآإا with one pattern, you could use something like [\u0622\u0623\u0625\u0627] or simply list them all between square brackets: [أآإا]. Here is a complete example:
var x = "أنا هنا وانا هناك .. آنا هنا وإنا هناك";
x = x.replace(/([أآإا]نا)/g, '<span style="color:red">$1</span>');
document.write (x);
Note: If you want to match every possible Arabic characters in Regex including all Arabic letters أ ب ت ث ج, all diacritics َ ً ُ ٌ ِ ٍ ّ, and all Arabic numbers ١٢٣٤٥٦٧٨٩٠, use this regex: [،-٩]+
Useful link about the ranking of Arabic characters in Unicode: https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
This doesn't work because the regex engine's \b only recognizes ASCII word characters, so Arabic letters are not supported.
You could search for the Unicode characters in the text instead (Unicode ranges).
Or you could use an encoding to convert the text into Unicode and then build the regex from that (I have never tried this, but it should work).
I used this ء-ي٠-٩ and it works for me
If you don't need a complicated RegEx (for instance, because you're looking for a particular word or a short list of words), then I've found that it's actually easier to tokenize the search text and find it that way:
>>> text = 'رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي '
>>> tokens = text.split()
>>> print(tokens)
['رياضة', 'أنا', 'أحب', 'رياضتي', 'وأنا', 'سعيد', 'حقا', 'هنا', 'لها', 'حبي']
>>> search_words = ['رياضة', 'رياضت']
>>> found = [w for w in tokens if w in search_words]
>>> print(found)
['رياضة'] # returns only full-word match
I'm sure that this is slower than RegEx, but not enough that I've ever noticed.
If your text had punctuation, you could do a more sophisticated tokenization (so it would find things like 'رياضة؟') using NLTK.
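The same tokenize-then-compare idea could look roughly like this in JavaScript. As a simple stand-in for a real tokenizer, the sketch splits on any run of characters outside the basic Arabic letter range, which also strips punctuation such as ؟. It assumes text without diacritics, and the variable names are illustrative.

```javascript
const text = 'رياضة؟ أنا أحب رياضتي وأنا سعيد';

// Split on runs of anything that is not an Arabic letter (spaces, punctuation),
// then drop empty strings left at the edges.
const tokens = text.split(/[^\u0621-\u064A]+/).filter(Boolean);

const searchWords = ['رياضة', 'رياضت'];

// Keep only tokens that are exactly equal to one of the search words.
const found = tokens.filter(function (w) {
  return searchWords.includes(w);
});

console.log(found);
```

Because whole tokens are compared, رياضت does not match inside رياضتي, and the trailing ؟ does not prevent رياضة from being found.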
What is the best way to capture everything except when faced with two or more new lines?
ex:
name1
address1
zipcode

name2
address2
zipcode

name3
address3
zipcode
One regex I considered was /[^\n\n]*\s*/g. But this stops when it is faced with a single \n character.
Another way I considered was /((?:.*(?=\n\n)))\s*/g. But this seems to only capture the last line ignoring the previous lines.
What is the best way to handle similar situation?
UPDATE
You can consider replacing the variable-length separator with some known fixed-length string that does not appear in your text, and then splitting on it. For instance:
> var s = "Hi\n\n\nBye\nCiao";
> var x = s.replace(/\n{2,}/g, "#");
> x.split("#");
["Hi", "Bye
Ciao"]
I think it is an elegant solution. You could also use the following somewhat contrived regex
> s.match(/((?!\n{2,})[\s\S])+/g);
["Hi", "
Bye
Ciao"]
and then process the resulting array by applying the trim() string method to its members in order to get rid of any \n at the beginning/end of every string in the array.
((.+)\n?)* (you probably want to make the groups non-capturing; I left them as is for readability)
The inner part (.+)\n? means "non-empty line": at least one non-newline character (. does not match newlines unless the appropriate flag is set), followed by an optional newline.
Then, that is repeated an arbitrary number of times (matching an entire block of non-blank lines).
However, depending on what you are doing, regexp probably is not the answer you are looking for. Are you sure just splitting the string by \n\n won't do what you want?
Do you have to use regex? The solution is simple without it.
var data = 'name1...';
var matches = data.split('\n\n');
To access an individual sub section split it by \n again.
//the first section's name
var name = matches[0].split('\n')[0];
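Since split() also accepts a regular expression, the "two or more newlines" case can be handled in the same way without a placeholder string. The following is a minimal sketch; the sample data mirrors the question, and the variable names are made up.

```javascript
var data = 'name1\naddress1\nzipcode\n\n' +
           'name2\naddress2\nzipcode\n\n\n' +
           'name3\naddress3\nzipcode';

// Split wherever two or more consecutive newlines occur,
// then break each block into its individual lines.
var records = data.split(/\n{2,}/).map(function (block) {
  return block.split('\n');
});

console.log(records.length);  // number of blocks
console.log(records[1][0]);   // first line of the second block
```

A single newline never separates blocks here, so the lines within each block stay together regardless of how many blank lines sit between blocks.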
I have a bunch of tracklist content on my site that is in this format:
<div class="tracklist">
1. Artist - Title (Record Label)
2. Another artist - Title (Another label)
</div>
I want to use regular expressions to find the artist and label names and wrap them in links like so:
<div class="tracklist">
1. Artist - Title (Record Label)
2. Another artist - Title (Another label)
</div>
I figured I can find the artist and label names with a JavaScript regex:
var artist = /[0-9]\. .*? -/gi
var label = /\(.*?\)/gi
use jQuery to find the matching strings:
$(".tracklist").html().match(label)
$(".tracklist").html().match(artist)
and then remove the number, period, spaces, dashes and parentheses with the substring() method. But what would be a good way to then insert the links and keep the text as well?
On a more general level, is this idea viable or would it fall under the "don't parse HTML with JavaScript"? Would a server side implementation be preferable (with some XML/XSL magic)?
It doesn't fall under "don't parse HTML with ..." because you are not parsing HTML; you are parsing text and creating HTML from it.
You could get the whole text content of the div:
var text = $('.tracklist').text();
Then split into lines:
var lines = text.split(/\r?\n/);
And parse each line separately:
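function parseLine(line) {
  var match = line.match(/^\d+\.\s+([^-]+)\s-\s([^(]+)(\s*(.*))/);
  if (!match) {
    return null; // not a track line (e.g. a blank line)
  }
  // match[4] still includes the parentheses, so strip them from the label
  return {
    artist: match[1].trim(),
    title: match[2].trim(),
    label: match[4].replace(/^\(|\)\s*$/g, '')
  };
}
$.each(lines, function(index, line) {
  var track = parseLine(line);
  // create HTML from track here and append it to the div
});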
The regex can be explained as follows:
/^\d+\. # the track number followed by a dot at the beginning
\s+ # one or more whitespace characters after the number
([^-]+) # the artist: everything except "-"
\s-\s # the "-" with one whitespace character on each side
([^(]+) # the title: everything except "("
(\s* # optional whitespace
(.*))/ # the label (including its parentheses)
A server side implementation would definitely be better. Where are you pulling the data below from? Surely you have the information in an array or similar?
1. Artist - Title (Record Label)
2. Another artist - Title (Another label)
Also, a server-side implementation would degrade gracefully if the user didn't have JavaScript (almost negligible nowadays, but it does happen!).
I don't see any point in switching to XSLT, because you'd still have to process the content of the DIV as text. For that kind of thing, jQuery/regex is about as good as it gets. You just aren't using regexes as effectively as you could be. As @arnaud said, you should match and process one whole line at a time, using capturing groups to break out the interesting parts. Here's the regex I would use:
/^(\d+)\.\s*([^-]+?)\s*-\s*([^(]+?)\s*\((.*)\)/
match[1] is the track number,
match[2] is the artist,
match[3] is the title, and
match[4] is the label
I also arranged it so that none of the surrounding whitespace or other characters are captured--in fact, most of the whitespace is optional. In my experience, formatted data like this often contains inconsistencies in spacing; this makes it more likely the regex will match what you want it to, and it gives you the power to correct the inconsistencies. (Of course, it can also contain more serious flaws, but those usually have to be dealt with on a case-by-case basis.)
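To illustrate how the captured groups could be turned into links, here is a rough sketch built on the regex above. The URL prefixes /artist/ and /label/ and the helper names are pure assumptions for illustration; the question does not say where the links should point.

```javascript
var TRACK_RE = /^(\d+)\.\s*([^-]+?)\s*-\s*([^(]+?)\s*\((.*)\)/;

// Minimal escaping so that text taken from the page is safe to put back as HTML.
function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;')
          .replace(/>/g, '&gt;').replace(/"/g, '&quot;');
}

// Hypothetical link builder; the URL scheme is an assumption.
function link(prefix, name) {
  return '<a href="' + prefix + encodeURIComponent(name) + '">' + escapeHtml(name) + '</a>';
}

// Rebuilds one tracklist line with the artist and label wrapped in links.
// Lines that do not look like a track are returned escaped but otherwise unchanged.
function linkifyTrack(line) {
  var m = TRACK_RE.exec(line.trim());
  if (!m) {
    return escapeHtml(line);
  }
  return m[1] + '. ' + link('/artist/', m[2]) + ' - ' + escapeHtml(m[3]) +
         ' (' + link('/label/', m[4]) + ')';
}

console.log(linkifyTrack('1. Artist - Title (Record Label)'));
```

In the page itself, one option would be to map the lines from $('.tracklist').text() through such a function and write the joined result back with .html(), which also normalizes any inconsistent spacing.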