JavaScript RegExp: match tag names

I can't remember the name of it, but I believe you can reference already-matched strings within a RegExp object. What I want to do is match all tags within a given string, e.g.
<ul><li>something in the list</li></ul>
The RegExp should match only pairs of identical opening and closing tags; I will then use a recursive function to put all the individual matches in an array. If I could reference the first match, the regex that should work would be:
var reg = /(?:<(.*)>(.*)<(?:FIRST_MATCH)\/>)/g;
The matched array should then contain
match[0] = "<ul><li>something in the list</li></ul>";
match[1] = "ul";
match[2] = ""; // no text to match
match[3] = "li";
match[4] = "something in the list";
thanks for any help

It seems like you mean backreference (\1, \2):
var s = '<ul><li>something in the list</li></ul>';
s.match(/<([^>]+)><([^>]+)>(.*?)<\/\2><\/\1>/)
// => ["<ul><li>something in the list</li></ul>",
// "ul",
// "li",
// "something in the list"]
The result is not exactly the same as what you asked for, but the point is that the backreferences \1 and \2 match the same text that was captured by the earlier groups.
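The same idea can be written with named groups, which may be easier to read once there is more than one backreference. This is only a sketch: it assumes an engine with ES2018 named capture groups, and the group names outer, inner and text are made up for the illustration.

```javascript
// Named groups (ES2018): \k<name> matches the text captured by (?<name>...).
// The group names below are illustrative, not part of the original answer.
const html = '<ul><li>something in the list</li></ul>';
const nestedPair =
  /<(?<outer>[^>]+)><(?<inner>[^>]+)>(?<text>.*?)<\/\k<inner>><\/\k<outer>>/;

const found = html.match(nestedPair);
console.log(found.groups.outer); // "ul"
console.log(found.groups.inner); // "li"
console.log(found.groups.text);  // "something in the list"
```

Like the numbered version, this only handles exactly two levels of nesting.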

It is not possible to parse HTML in general using regular expressions (if you're interested in the specifics: HTML allows arbitrarily deep nesting, and matching nested structure requires a stronger type of automaton than the finite state automaton that a regular expression can express; look up finite automata vs. pushdown automata for more info).
You might be able to get away with some hack for a specific problem, but if you want to reliably parse HTML using JavaScript then there are other ways to do this. Search the web for "parse html javascript" and you'll get plenty of pointers on how to do this.
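A small sketch of where a backreference-based pattern goes wrong, assuming simple attribute-free tags. The sample strings and variable names are invented for the illustration.

```javascript
// Two variants of "opening tag ... matching closing tag".
const lazyPair = /<([a-z]+)>(.*?)<\/\1>/;   // stops at the FIRST closing tag
const greedyPair = /<([a-z]+)>(.*)<\/\1>/;  // runs to the LAST closing tag

// Nested identical tags: the lazy version pairs the outer <div>
// with the inner </div>.
const nested = '<div><div>inner</div> tail</div>';
const lazyMatch = nested.match(lazyPair);
console.log(lazyMatch[0]); // "<div><div>inner</div>"

// Sibling identical tags: the greedy version swallows both elements.
const siblings = '<b>one</b> and <b>two</b>';
const greedyMatch = siblings.match(greedyPair);
console.log(greedyMatch[2]); // "one</b> and <b>two"
```

Neither quantifier can count how many tags are currently open, which is what correct pairing needs.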

I made a dirty workaround. It still needs work, though.
var str = '<div><ul id="list"><li class="something">this is the text</li></ul></div>';
function parseHTMLFromString(str){
    var structure = [];
    var matches = [];
    var reg = /(<(.+)(?:\s([^>]+))*>)(.*)<\/\2>/;
    str.replace(reg, function(){
        //console.log(arguments);
        matches.push(arguments[4]);
        structure.push(arguments[1], arguments[4]);
    });
    while(matches.length){
        matches.shift().replace(reg, function(){
            console.log(arguments);
            structure.pop();
            structure.push(arguments[1], arguments[4]);
            matches.push(arguments[4]);
        });
    }
    return structure;
}
// parseHTMLFromString(str); // ["<div>", "<ul id="list">", "<li class="something">", "this is the text"]

Related

How do I pass a variable into regex with Node js?

So basically, I have a regular expression which is
var regex1 = /10661\" class=\"fauxBlockLink-linkRow u-concealed\">([\s\S]*?)<\/a>/;
var result=text.match(regex1);
user_activity = result[1].replace(/\s/g, "")
console.log(user_activity);
What I'm trying to do is this
var number = 1234;
var regex1 = /${number}\" class=\"fauxBlockLink-linkRow u-concealed\">([\s\S]*?)<\/a>/;
but it is not working, and when I tried with RegExp, I kept getting errors.
You can use RegExp to create regexp from a string and use variables in that string.
var number = 1234;
var regex1 = new RegExp(`${number}aa`);
console.log("1234aa".match(regex1));
You can build the regex string with templates and/or string concatenation and then pass it to the RegExp constructor. The key is to get the escaping right: you need an extra level of escaping for backslashes, because interpreting the string literal consumes one level of backslash and one backslash still has to survive to reach the RegExp constructor. Here's a working example:
function match(number, str) {
    let r = new RegExp(`${number}" class="fauxBlockLink-linkRow u-concealed">([\\s\\S]*?)<\\/a>`);
    return str.match(r);
}
const exampleHTML = '<a href="/forum/search/member?user_id=1234" class="fauxBlockLink-linkRow u-concealed">Some link text</a>';
console.log(match(1234, exampleHTML));
Note, using regex to match HTML like this becomes very order-sensitive (whereas the HTML itself isn't order-sensitive). And, your regex requires exactly one space between classes which HTML doesn't. If the class names were in a slightly different order or spacing different in the <a> tag, then it would not match. Depending upon what you're really trying to do, there may be better ways to parse and use the HTML that isn't order-sensitive.
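On the escaping point, one way to avoid doubling every backslash is the built-in String.raw template tag, which leaves backslashes in the template untouched. This is a minimal sketch: the function name buildLinkRegex and the sample markup are assumptions made for the illustration, and the interpolated value is assumed to be a plain number with no regex metacharacters.

```javascript
// String.raw keeps backslashes as written, so the pattern can be typed
// the same way as in a regex literal. buildLinkRegex is a hypothetical helper.
function buildLinkRegex(number) {
  return new RegExp(
    String.raw`${number}" class="fauxBlockLink-linkRow u-concealed">([\s\S]*?)<\/a>`
  );
}

// Hypothetical sample markup for the demonstration.
const sampleHTML =
  '<a href="/member?user_id=1234" class="fauxBlockLink-linkRow u-concealed">Some link text</a>';

const linkMatch = sampleHTML.match(buildLinkRegex(1234));
console.log(linkMatch ? linkMatch[1] : 'no match'); // "Some link text"
```

If the interpolated value can contain characters such as . or ?, it should be escaped first.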
I solved it with the method of Adem,
function escapeRegExp(string) {
    return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
}
var number = 1234;
var firstPart = `<a href="/forum/search/member?user_id=${number}" class="fauxBlockLink-linkRow u-concealed">`
var regexpString = escapeRegExp(firstPart) + '([\\s\\S]*?)' + escapeRegExp('</a>');
console.log(regexpString)
var sample = ` `
var regex1 = new RegExp(regexpString);
console.log(sample.match(regex1));
In the first place, the issue was actually the way I was reading the file: the data I was applying the match to was undefined.

Javascript RegEx contains

I'm using JavaScript RegEx to check whether a string matches a standard format.
I have this variable called inputName, which has the following format (sample):
input[name='data[product][tool_team]']
What I want to achieve with JavaScript's regex is to determine whether the string has this format and contains _team inside the last pair of brackets.
I tried the following:
var inputName = "input[name='data[product][tool_team]']";
var teamPattern = /\input[name='data[product][[_team]]']/g;
var matches = inputName.match(teamPattern);
console.log(matches);
I just get null with the result I gave as an example.
To be honest, RegEx isn't really my area, so I suppose it's wrong.
A couple of things:
You need to escape [ and ] as they have special meaning in regex
You need .* (or perhaps [^[]*) in front of _team if you want to allow anything there ([^[]* means "anything but a [, repeated zero or more times")
Example if you just want to know if it matches:
var string = "input[name='data[product][tool_team]']";
var teamPattern = /input\[name='data\[product\]\[[^[]*_team\]'\]/;
console.log(teamPattern.test(string));
Example if you need to capture the xyz_team bit:
var string = "input[name='data[product][tool_team]']";
var teamPattern = /input\[name='data\[product\]\[([^[]*_team)\]'\]/;
var match = string.match(teamPattern);
console.log(match ? match[1] : "no match");
If you are trying to check for DOM elements, you can use the attribute contains or attribute equals selector instead:
document.querySelectorAll("input[name*='_team]']")

How to split a string by a character not directly preceded by a character of the same type?

Let's say I have a string: "We.need..to...split.asap". What I would like to do is to split the string by the delimiter ., but I only wish to split by the first . and include any recurring .s in the succeeding token.
Expected output:
["We", "need", ".to", "..split", "asap"]
In other languages, I know that this is possible with a look-behind /(?<!\.)\./ but JavaScript unfortunately did not support such a feature when this was asked (look-behind assertions were only added in ES2018).
I am curious to see your answers to this question. Perhaps there is a clever use of look-aheads that presently evades me?
I was considering reversing the string, then re-reversing the tokens, but that seems like too much work for what I am after... plus controversy: How do you reverse a string in place in JavaScript?
Thanks for the help!
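For reference, in an engine that implements ES2018 look-behind assertions, the pattern from the question works directly with split. A minimal sketch (the variable names are illustrative):

```javascript
// Split on a "." only when the previous character is not also a ".".
// Requires look-behind support (ES2018 and later).
const sentence = "We.need..to...split.asap";
const tokens = sentence.split(/(?<!\.)\./);
console.log(tokens); // ["We", "need", ".to", "..split", "asap"]
```

The answers that follow avoid look-behind and therefore also run on older engines.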
Here's a variation of the answer by guest271314 that handles more than two consecutive delimiters:
var text = "We.need.to...split.asap";
var re = /(\.*[^.]+)\./;
var items = text.split(re).filter(function(val) { return val.length > 0; });
It relies on the fact that if the split expression includes a capture group, the captured items are included in the returned array. These captures are the tokens we are interested in; the pieces between them are empty strings, which we filter out, and only the text after the last delimiter comes through as an ordinary token.
EDIT: Unfortunately there's perhaps one slight bug with this. If the text to be split starts with a delimiter, that will be included in the first token. If that's an issue, it can be remedied with:
var re = /(?:^|(\.*[^.]+))\./;
var items = text.split(re).filter(function(val) { return !!val; });
(I think this regex is ugly and would welcome an improvement.)
You can do this without any lookaheads:
var subject = "We.need.to....split.asap";
var regex = /\.?(\.*[^.]+)/g;
var matches, output = [];
while(matches = regex.exec(subject)) {
    output.push(matches[1]);
}
document.write(JSON.stringify(output));
It seemed like it'd work in one line, as it did on https://regex101.com/r/cO1dP3/1, but had to be expanded in the code above because the /g option by default prevents capturing groups from returning with .match (i.e. the correct data was in the capturing groups, but we couldn't immediately access them without doing the above).
See: JavaScript Regex Global Match Groups
An alternative solution with the original one liner (plus one line) is:
document.write(JSON.stringify(
"We.need.to....split.asap".match(/\.?(\.*[^.]+)/g)
.map(function(s) { return s.replace(/^\./, ''); })
));
Take your pick!
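A third option, assuming an engine with String.prototype.matchAll (ES2020): it iterates over every global match and keeps the capture groups, so the explicit exec loop is not needed. The variable names are illustrative.

```javascript
// matchAll returns an iterator of full match objects, capture groups included.
const subjectText = "We.need.to....split.asap";
const parts = Array.from(
  subjectText.matchAll(/\.?(\.*[^.]+)/g),
  function (m) { return m[1]; } // keep only capture group 1 of each match
);
console.log(JSON.stringify(parts)); // ["We","need","to","...split","asap"]
```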
Note: This answer can't handle more than 2 consecutive delimiters, since it was written according to the example in the revision 1 of the question, which was not very clear about such cases.
var text = "We.need.to..split.asap";
// split "." if followed by "."
var res = text.split(/\.(?=\.)/).map(function(val, key) {
    // if `val[0]` does not begin with "." split "."
    // else split "." if not followed by "."
    return val[0] !== "." ? val.split(/\./) : val.split(/\.(?!.*\.)/)
});
// concat arrays `res[0]` , `res[1]`
res = res[0].concat(res[1]);
document.write(JSON.stringify(res));

Match a string if it comes after certain string

I need a regular expression for JavaScript to match John (case insensitive) after Name:
I know how to do it, but I don't know how to get string from a different line like so (from a textarea):
Name
John
This is what I tried to do :: var str = /\s[a-zA-Z0-9](?= Name)/;
The logic: get a string with letter/numbers on a linespace followed by Name.
Then, I would use the .test(); method.
EDIT:
I tried to make the question more simple than it should have been. The thing I don't quite understand is how do I isolate "John" (really anything) on a new line followed by a specific string (in this case Name).
E.g., IF John comes after Name {dosomething} else{dosomethingelse}
Unfortunately, JavaScript didn't support look-behinds when this was written (they arrived with ES2018). For something this simple, you can just match both parts of the string like this:
var str = /Name\s+([a-zA-Z0-9]+)/;
You then just have to extract the first capture group if you want to get John. For example:
"Name\n John".match(/Name\s+([a-zA-Z0-9]+)/)[1]; // John
However if you're just using .test, the capture group isn't necessary. For example:
var input = "Name\n John";
if (/Name\s+[a-zA-Z0-9]+/.test(input)) {
    // dosomething
} else {
    // dosomethingelse
}
Also, if you need to ensure that Name and John appear on separate lines with nothing but whitespace in between, you can use this pattern with the multi-line (m) flag.
var str = /Name\s*^\s*([a-zA-Z0-9]+)/m;
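Where look-behind is available (ES2018 and later), the name can also be matched on its own, with no capture group. This is a sketch rather than a replacement for the patterns above; the variable names are made up, and the i flag is added for the case-insensitivity the question asks for.

```javascript
// (?<=Name\s+) asserts that "Name" plus whitespace (including a newline)
// comes immediately before the match, without making it part of the match.
const textareaValue = "Name\n John";
const afterName = textareaValue.match(/(?<=Name\s+)[a-z0-9]+/i);
console.log(afterName ? afterName[0] : "no match"); // "John"
```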
You do not need a lookahead here, simply place Name before the characters you want to match. And to enable case-insensitive matching, place the i modifier on the end of your regular expression.
var str = 'Name\n John'
var re = /Name\s+[a-z0-9]+/i
if (re.test(str)) {
    // do something
} else {
    // do something else
}
Use the String.match method if you want to extract the name from the string.
'Name\n John'.match(/Name\s+([a-z0-9]+)/i)[1];
The [1] here refers back to what was matched/captured in capturing group #1

Return the part of the regex that matched

In a regular expression that uses OR (pipe), is there a convenient method for getting the part of the expression that matched?
Example:
/horse|caMel|TORTOISe/i.exec("Camel");
returns Camel. What I want is caMel.
I understand that I could loop through the options instead of using one big regular expression; that would make far more sense. But I'm interested to know if it can be done this way.
Very simply, no.
Regex matches have to do with your input string and not the text used to create the regular expression. Note that that text might well be lost, and theoretically is not even necessary. An equivalent matcher could be built out of something like this:
var test = function(str) {
    var text = str.toLowerCase();
    return text === "horse" || text === "camel" || text === "tortoise";
};
Another way to think of it is that the compilation of regular expressions can divorce the logic of the function from their textual representation. It's one-directional.
Sorry.
There is no way built into the JavaScript RegExp object to do this without changing your expression. The closest you can get is source, which just returns the entire expression as a string.
Since you know your expression is a series of | ORs, you could use capturing groups to figure out which group matched, and combine that with .source to find out the contents of that group:
var exp = /(horse)|(caMel)|(TORTOISe)/i;
var result = exp.exec("Camel");
var match = function(){
    for(var i = 1; i < result.length; i++){
        if(result[i]){
            return exp.source.match(new RegExp('(?:[^(]*\\((?!\\?\\:)){' + i + '}([^)]*)'))[1];
        }
    }
}();
// match == caMel
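If you control how the expression is built, a variation on the same capturing-group idea avoids parsing .source: keep the alternatives in an array and build the regex from it, so the index of the group that matched is also the index of the original text. This is a sketch under the assumption that the alternatives are plain strings with no groups or special characters of their own. The names alternatives and matchedAlternative are invented for the illustration.

```javascript
// Each alternative gets its own capture group, so group i+1 belongs to
// alternatives[i].
var alternatives = ['horse', 'caMel', 'TORTOISe'];
var combined = new RegExp(
  alternatives.map(function (a) { return '(' + a + ')'; }).join('|'),
  'i'
);

function matchedAlternative(input) {
  var m = combined.exec(input);
  if (!m) return undefined;
  for (var i = 1; i < m.length; i++) {
    if (m[i] !== undefined) return alternatives[i - 1];
  }
  return undefined;
}

console.log(matchedAlternative('Camel')); // "caMel"
```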
It is also extremely easy (although somewhat impractical) to write a RegExp engine from scratch, to which you could technically add that functionality. It would be much slower than using an actual RegExp object, since the whole engine would have to be interpreted at run-time. It would, however, be able to return exactly the matched portion of the expression for any regular expression, and would not be limited to one that consists of a series of | ORs.
The best way to solve your problem, however, is probably not to use a loop or a regular expression at all, but instead to create an object where you use a canonical form for the key:
var matches = {
    'horse': 'horse',
    'camel': 'caMel',
    'tortoise': 'TORTOISe'
};
// Test "Camel"
matches['Camel'.toLowerCase()]; // "caMel"
This will give the wanted value without looping:
var foo, pat, tres, res, reg = /horse|caMel|TORTOISe/i;
foo = reg.exec('Camel');
if (foo) {
    foo = foo[0].replace(/\./g, '\\.');
    pat = new RegExp('\\|' + foo + '\\|', 'i');
    tres = '|' + reg.source + '|';
    res = tres.match(pat)[0].replace(/\|/g, '');
}
alert(res);
If there's no match, now you get undefined, though it's easy to change to something else.
