Get string between tags when multiple tags present - javascript

Just trying to figure this one out, as regex is nowhere near my strong point :(
Basically I'm trying to get the value between bbcode tags, which could look like any of the following:
[center]text[/center]
[left][center]text[/center][/left]
[right][left][center]text[/center][/left][/right]
I currently have this hideous if/else block of code to cope with nesting like the third option above.
if (/\[left\]|\[\/left\]/.test(text[2])) {
// set the value in the [left][/left] tags
text[2] = text[2].match(/\[left\](.*?)\[\/left\]/)[1];
} else if (/\[right\]|\[\/right\]/.test(text[2])) {
// set value in the [right][/right] tags
text[2] = text[2].match(/\[right\](.*?)\[\/right\]/)[1];
} else if (/\[center\]|\[\/center\]/.test(text[2])) {
// set value in the [center][/center] tags
text[2] = text[2].match(/\[center\](.*?)\[\/center\]/)[1];
}
What I'd like to do is shorten it down to a single regular expression that grabs the value text from the above examples. I've gotten down to an expression like this:
/\[(?:center|left|right)\](.*?)\[\/(?:center|left|right)\]/
But as you can see in this RegExr demo, it doesn't match what I need it to.
How can I achieve this?
Note
It should only match left|right|center as the selected text could also have various other bbcode tags.
If the string looks like this:
[center][left][img]/link/to/img.png[/img][/left][/center]
I want to get what is between the left|center|right tags which in this case would be:
[img]/link/to/img.png[/img]
More examples:
[center][url=lintosomething.com]LINK TEXT[/url][/center]
Should only get: [url=lintosomething.com]LINK TEXT[/url]
Or
[center]egibibskdfbgfdkfbg sd fgkgb fkgbgk fhwo3g regbiurb geir so go to [url=lintosomething.com]LINK TEXT[/url] and ibgri gbenkenbieurgnerougnerogrnreog erngo[/center]
Wanting:
egibibskdfbgfdkfbg sd fgkgb fkgbgk fhwo3g regbiurb geir so go to [url=lintosomething.com]LINK TEXT[/url] and ibgri gbenkenbieurgnerougnerogrnreog erngo

Edit: Ok, I think this fits your needs.
My regex:
/[^\]\[]*\[(\w+)[=\.\"\w]*\][^\]]+\[\/\1\][^\]\[]*/g
Explanation:
1. Match 0 or more characters that aren't [ or ]
2. Match a single [
3. Match 1 or more word characters (\w); we'll use this later as a backreference
4. Match 0 or more of = . " or word characters
5. Match a single ]
6. Match 1 or more non-] characters
7. Match a single [
8. Match a single /
9. Match the same characters as step 3 (our backreference)
10. Match a single ]
11. Match 0 or more characters that aren't [ or ]
See it in action
However, I would like to state that if you're going to be parsing BBCode, you're almost certainly better off just using a BBCode parser.

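Why not just replace all those tags with an empty string?
var rawString; // your input string
// JavaScript regex literals use / delimiters; the g flag replaces every tag, not just the first
var cleanedString = rawString.replace(/\[\/?(left|right|center)\]/g, '');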
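A minimal sketch of this tag-stripping idea, applied to the examples from the question. The helper name stripAlignmentTags is hypothetical, and the sketch assumes alignment tags never appear as part of the wanted text itself:

```javascript
// Hypothetical helper: removes only [left], [right] and [center] tags
// (opening and closing) and leaves every other BBCode tag in place.
function stripAlignmentTags(input) {
  return input.replace(/\[\/?(?:left|right|center)\]/g, '');
}

console.log(stripAlignmentTags('[right][left][center]text[/center][/left][/right]'));
// prints "text"
console.log(stripAlignmentTags('[center][left][img]/link/to/img.png[/img][/left][/center]'));
// prints "[img]/link/to/img.png[/img]"
```

Because nothing is anchored, this also removes alignment tags that sit in the middle of the text, which may or may not be what you want.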

You could use a capturing group like this:
(?:\[\w+\])*(\w+)(?:\[\/\w+\])*
Or with a capture group named "value" like this:
(?:\[\w+\])*(?<value>\w+)(?:\[\/\w+\])*
The first and last groups are non-capturing: (?: ...)
The middle group is capturing: (\w+)
And the middle group is named like this: (?<value>\w+)
Note: For simplicity, I replaced your center|left|right values with \w+ but you could swap them back in with no impact.
I use an app called RegExRX. Here's a screenshot with the RegEx and captured values.
Lots of ways you could tweak it. Good luck!
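One possible tweak, sketched under a few assumptions of mine: center|left|right is swapped back in for the outer groups, the middle \w+ is replaced by a lazy [\s\S]*? so the value may contain other BBCode tags, and the pattern is anchored with ^ and $. The names ALIGN, wrapped and innerValue are hypothetical.

```javascript
// Alternation for the only tags that should be peeled off.
const ALIGN = '(?:center|left|right)';

// One or more opening alignment tags, a lazily matched named group "value",
// then one or more closing alignment tags up to the end of the string.
const wrapped = new RegExp(
  '^(?:\\[' + ALIGN + '\\])+(?<value>[\\s\\S]*?)(?:\\[\\/' + ALIGN + '\\])+$'
);

// Returns the text between the alignment tags, or null if there are none.
function innerValue(input) {
  const match = input.match(wrapped);
  return match ? match.groups.value : null;
}

console.log(innerValue('[center][left][img]/link/to/img.png[/img][/left][/center]'));
// prints "[img]/link/to/img.png[/img]"
```

This sketch does not check that the opening and closing tags pair up correctly; it only peels alignment tags off both ends.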

Related

Extracting a complicated part of the string with plain Javascript

I have the following string:
Text
Using JavaScript, I want to extract 'pl' or 'pl_company_com' from this string.
There are a few variables:
jan_kowalski is a first name and surname; it can change, and sometimes even has 3 elements
the country code (in this example 'pl') will change to others such as en / de / fr (this is the part of the string I want to get)
the rest of the string remains the same in every case (the beginning, plus everything from _company_com onwards)
P.S. I tried to do it with split, but my knowledge of JS is very basic and I can't get what I want. Please help.
An alternative to Randy Casburn's solution using regex
let out = new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx').href.match('.*_(.*_company_com)')[1];
console.log(out);
Or if you want to just get that string with those country codes you specified
let out = new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx').href.match('.*_((en|de|fr|pl)_company_com)')[1];
console.log(out);
A proof of concept that this solution also works for other combinations
let urls = [
new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx'),
new URL('https://my.domain.com/personal/firstname_middlename_lastname_pl_company_com/Documents/Forms/All.aspx')
]
urls.forEach(url => console.log(url.href.match('.*_(en|de|fr|pl).*')[1]))
I have been very successful before with this kind of problem using regular expressions:
var string = 'Text';
var regExp = /([\w]{2})_company_com/;
var find = string.match(regExp);
console.log(find); // array with found matches
console.log(find[1]); // first group of regexp = country code
First you have your given string. Second you have a regular expression, which is delimited by a slash at the beginning and at the end. A regular expression is mostly used for string searches (you can even replace complicated text in all major editors with it, which can be VERY useful).
In this case it matches exactly two word characters, [\w]{2}, followed directly by _company_com (\w indicates a word character, the [] groups all wanted character types, here only word characters, and the {} indicates the number of characters to be found). To find the wanted part, string.match(regExp) has to be called. It returns an array with the whole matched string followed by all capture groups within the regExp (which are denoted by ()), or null if nothing matches. So in this case you get the country code with find[1], which is the first and only capture group of the regular expression.
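As a rough sketch of the same idea applied to the URL used in the other answer: the helper name extractCountryCode is hypothetical, and restricting the code to two lowercase letters directly before _company_com is an assumption based on the examples given (pl, en, de, fr).

```javascript
// Hypothetical helper: returns the two-letter code that sits directly
// before "_company_com" in the URL path, or null if there is none.
function extractCountryCode(href) {
  const path = new URL(href).pathname;
  const match = path.match(/_([a-z]{2})_company_com(?:\/|$)/);
  return match ? match[1] : null;
}

console.log(extractCountryCode(
  'https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx'
)); // prints "pl"
```

Checking for null before reading match[1] avoids a TypeError when the URL does not follow the expected shape.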

javascript regex insert new element into expression

I am passing a URL to a block of code in which I need to insert a new element into the regex. Pretty sure the regex is valid and the code seems right but no matter what I can't seem to execute the match for regex!
//** Incoming url's
//** url e.g. api/223344
//** api/11aa/page/2017
//** Need to match to the following
//** dir/api/12ab/page/1999
//** Hence the need to add dir at the front
var url = req.url;
//** pass in: /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var re = myregex.toString();
//** Insert dir into regex: /^dir\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var regVar = re.substr(0, 2) + 'dir' + re.substr(2);
var matchedData = url.match(regVar);
matchedData === null ? console.log('NO') : console.log('Yay');
I hope I am just missing the obvious but can anyone see why I can't match and always returns NO?
Thanks
Let's break down your regex
^\/api\/ matches the beginning of the string, followed by exactly the string "/api/"
([a-zA-Z0-9-_~ %]+) is a capturing group: it captures any of the characters inside the brackets, with the + meaning 1 or more, so for example this section will match abAB25-_ %
(?:\/page\/([a-zA-Z0-9-_~ %]+)) groups multiple tokens together as well, but does not create a capturing group like the one above (the ?: makes it non-capturing). You are first matching the exact string "/page/", followed by a capturing group like the one described in the previous paragraph (matching a-z, A-Z, 0-9, etc.)
?$ is at the end: the ? means match 0 or 1 of the preceding group, and the $ matches the end of the string
This regex will match this string, for example: /api/abAB25-_ %/page/abAB25-_ %
You may be able to take advantage of capturing groups, however, and use something like this instead to get similar results: ^\/api\/([a-zA-Z0-9-_~ %]+)\/page\/\1?$. Here, we are using \1 to reference that first capturing group and match exactly the same text it matched. EDIT: actually, this probably won't work, since the text after /api/ and the text after /page/ will most likely be different, carrying on...
Afterwards, you are adding "dir" to the beginning of your pattern, so you can now match something like this: dir/api/abAB25-_ %/page/abAB25-_ %
You have also converted the regex to a string with toString(), which keeps the surrounding / delimiters, so, as Crayon Violent pointed out in their comment, this will break your expected functionality. You can fix this by reading .source from your original regex instead (var re = myregex.source;), which gives the pattern text without the delimiters, and inserting dir right after the leading ^: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/source
Now you can properly match a string like this: dir/api/11aa/page/2017. See this example: https://repl.it/Mj8h
As mentioned by Crayon Violent in the comments, it seems you're passing a String rather than a regular expression to the .match() function. Maybe try the following:
url.match(new RegExp(regVar, "i"));
to convert the string to a regular expression. Note that the string passed to new RegExp must not include the surrounding / delimiters that toString() adds. The "i" is for ignore case; I don't know if that's what you want. Learn more here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
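Putting both suggestions together, here is a minimal sketch of building the prefixed regex from .source. The helper name prefixPattern is hypothetical, and it assumes the original pattern starts with ^ and that the prefix contains no regex metacharacters:

```javascript
// Hypothetical helper: inserts `prefix` right after the leading ^ of an
// existing regex and returns a new RegExp with the same flags.
function prefixPattern(regex, prefix) {
  const source = regex.source; // pattern text without the / delimiters
  if (source.charAt(0) !== '^') {
    throw new Error('expected a pattern anchored with ^');
  }
  return new RegExp('^' + prefix + source.slice(1), regex.flags);
}

const apiRoute = /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/;
const dirRoute = prefixPattern(apiRoute, 'dir');

const matchedData = 'dir/api/12ab/page/1999'.match(dirRoute);
console.log(matchedData === null ? 'NO' : 'Yay'); // prints "Yay"
```

Working with .source avoids the problem in the question, where the delimiters from toString() ended up inside the pattern as literal slashes.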

Regular expression to match a string which is NOT matched by a given regexp

I've been browsing through some answers here, and I can't find a solution to my problem:
I have this regexp which matches everything inside an HTML span tag, including contents:
<span\b[^>]*>(.*?)</span>
and I want to find a way to make a search in all the text, except for what is matched with that regexp.
For example, if my text is:
var text = '...for there is a class of <span class="highlight">guinea</span> pigs which...'
... then the regexp would match:
<span class="highlight">guinea</span>
and I want to be able to make a regexp such that if I search for "class", regexp will match "...for there is a class of..."
and will not match inside the tag, like in
"... class="highlight"..."
The word to be matched ("class") might be anywhere within the text. I've tried
(?!<span\b[^>]*>(.*?)</span>)class
but it keeps searching inside tags as well.
I want to find a solution using only regexp, not dealing with DOM nor JQuery. Thanks in advance :).
Although I wouldn't recommend this, I would do something like below
(class)(?:(?=.*<span\b[^>]*>))|(?:(?<=<\/span>).*)(class)
You can see this in action here
Rubular Link for this regex
You can capture your matches from the groups and work with them as needed. If you can, use an HTML parser and then find matches in the text elements.
It's not pretty, but if I get you right, this should do what you want. It's done with a single RegEx, but JS can't (to my knowledge) extract the result without joining the results in a loop.
The RegEx: /(?:<span\b[^>]*>.*?<\/span>)|(.)/g
Example js code:
var str = '...for there is a class of <span class="highlight">guinea</span> pigs which...',
    pattern = /(?:<span\b[^>]*>.*?<\/span>)|(.)/g,
    match,
    res = '';
match = pattern.exec(str);
while (match != null)
{
    // match[1] is undefined when the span alternative matched; skip those
    if (match[1] !== undefined) {
        res += match[1];
    }
    match = pattern.exec(str);
}
document.writeln('Result:' + res);
In English: Do a non capturing test against your tag-expression or capture any character. Do this globally to get the entire string. The result is a capture group for each character in your string, except the tag. As pointed out, this is ugly - can result in a serious number of capture groups - but gets the job done.
If you need to send it in and retrieve the result in one call, I'd have to agree with previous contributors - It can't be done!
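A variation on the same alternation trick, sketched here for finding a search word only outside span tags: the span alternative consumes each whole tag, and only matches where the capture group is set are kept. The function name findOutsideSpans is hypothetical, and the sketch assumes span tags are not nested.

```javascript
// Hypothetical helper: returns the start indexes of `word` in `text`,
// ignoring anything inside <span ...>...</span>.
function findOutsideSpans(text, word) {
  if (word === '') {
    return []; // an empty word would match everywhere and never advance
  }
  // Escape regex metacharacters so the word is searched for literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(
    '<span\\b[^>]*>[\\s\\S]*?<\\/span>|(' + escaped + ')',
    'g'
  );
  const positions = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    // Group 1 is only set when the word matched outside a span.
    if (match[1] !== undefined) {
      positions.push(match.index);
    }
  }
  return positions;
}

const sample = '...for there is a class of <span class="highlight">guinea</span> pigs which...';
console.log(findOutsideSpans(sample, 'class')); // prints [ 18 ]
```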

Matching invisible characters in JavaScript RegEx

I've got some strings that contain invisible characters, but they are in somewhat predictable places. Typically they surround the piece of text I want to extract, and then after the 2nd occurrence I want to keep the rest of the text.
I can't seem to figure out how to both key off of the invisible characters, and exclude them from my result. To match invisibles I've been using this regex: /\xA0\x00-\x09\x0B\x0C\x0E-\x1F\x7F/ which does seem to work.
Here's an example: [invisibles]Keep as match 1[invisibles]Keep as match 2
Here's what I've been using so far without success:
/([\xA0\x00-\x09\x0B\x0C\x0E-\x1F\x7F]+)(.+)([\xA0\x00-\x09\x0B\x0C\x0E-\x1F\x7F]+)/(.+)
I've got the capture groups in there, but it's been a while since I've had to use regexes in this way, so I know I'm missing something important. I was hoping to just make the invisible matches non-capturing groups, but it seems that JavaScript does not support this.
Something like this seems like what you want. The second regex you have pretty much works, but the / is in totally the wrong place. Perhaps you weren't properly reading out the group data.
var s = "\x0EKeep as match 1\x0EKeep as match 2";
var r = /[\xA0\x00-\x09\x0B\x0C\x0E-\x1F\x7F]+(.+)[\xA0\x00-\x09\x0B\x0C\x0E-\x1F\x7F]+(.+)/;
var match = s.match(r);
var part1 = match[1];
var part2 = match[2];
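For what it's worth, JavaScript does support non-capturing groups with the (?: ...) syntax. Here is a minimal sketch using them with the same character class; the function name splitOnInvisibles is hypothetical, and making the first capture lazy (.+?) is my own choice so that it stops at the first run of invisible characters:

```javascript
// Invisible runs are wrapped in non-capturing groups, so only the two
// visible parts end up in match[1] and match[2].
const invisiblePattern =
  /(?:[\xA0\x00-\x09\x0B\x0C\x0E-\x1F\x7F]+)(.+?)(?:[\xA0\x00-\x09\x0B\x0C\x0E-\x1F\x7F]+)(.+)/;

// Hypothetical helper: returns the two visible parts, or null if the
// input does not have the expected shape.
function splitOnInvisibles(input) {
  const match = input.match(invisiblePattern);
  return match ? [match[1], match[2]] : null;
}

console.log(splitOnInvisibles('\x0EKeep as match 1\x0EKeep as match 2'));
// prints [ 'Keep as match 1', 'Keep as match 2' ]
```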

regex search a string for contents between two strings

I am trying my utmost to get my head around regex, however I'm not having much luck.
I am trying to search within a string for text. I know how the text starts and I know how it ends, and I want to return ALL the text in between, including the start and end.
Start search = [{"lx":
End search = }]
i.e
[{"lx":variablehere}]
So far I have tried
/^\[\{"lx":(*?)\}\]/;
and
/(\[\{"lx":)(*)(\}\])/;
But to no real avail... can anyone assist?
Many thanks
You're probably making the mistake of believing the * is a wildcard. Use the period (.) instead and you'll be fine.
Also, are you sure you want to stipulate zero or more? If there must be a value, use + (one or more).
Javascript:
'[{"lx":variablehere}]'.match(/^\[\{"lx":(.+?)\}\]/);
The * star character multiplies the preceding character. In your case there's no such character. You should either put ., which means "any character", or something more specific like \S, which means "any non whitespace character".
Possible solution:
var s = '[{"lx":variablehere}]';
var r = /\[\{"(.*?)":(.*?)\}\]/;
var m = s.match(r);
console.log(m);
This results in the following array:
[ '[{"lx":variablehere}]',
'lx',
'variablehere',
index: 0,
input: '[{"lx":variablehere}]' ]
\[\{"lx"\:(.*)\}\]
This should work for you. You can reach the captured variable by \1 notation.
Try this:
^\[\{\"lx\"\:(.*)\}\]$
All text between [{"lx": and }] will be in the backreference variable (something like \$1, depending on the programming language).
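In JavaScript specifically, the captured text is read from the array returned by match() rather than through \1 or $1. A minimal sketch, where the helper name extractLx is hypothetical and the pattern is left unanchored on the assumption that the block may sit inside a longer string:

```javascript
// Hypothetical helper: returns the whole [{"lx":...}] block and the value
// inside it, or null if the block is not present.
function extractLx(input) {
  const match = input.match(/\[\{"lx":(.+?)\}\]/);
  if (match === null) {
    return null;
  }
  return { whole: match[0], value: match[1] };
}

console.log(extractLx('before [{"lx":variablehere}] after'));
// prints { whole: '[{"lx":variablehere}]', value: 'variablehere' }
```

Here match[0] includes the start and end markers, which is what the question asks for, while match[1] holds only the part in between.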
