In other words, there can be no other occurrence of the pattern between the end of the match and the second pattern. This needs to be implemented in a single regular expression.
In my specific case I have a page of HTML and need to extract all the content between
<w-block-content><span><div>
and
</div></span></w-block-content>
where
the elements might have attributes
the HTML might be formatted or not - there might be extra white space and newlines
there may be other content between any of the above tags, including inner div elements within the above outer div. But you can assume each <w-block-content> element
contains ONLY ONE direct <span> child (i.e. it may contain other non-span children)
which contains ONLY ONE direct <div> child
which wraps the content that must be extracted
🚩 the match must extend all the way to the last </div> within the <span> within the <w-block-content>, even if it is unmatched with an opening <div>.
the solution must be pure ECMAScript-spec Regex. No Javascript code can be used
Thus the problem stated in the question at the top.
The following regex successfully matches as long as there are NO internal </div> tags:
(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)
❌ But if there are additional </div> tags, the match ends prematurely, not including the entirety of the block.
I use [\s\S]*? to match against arbitrary content, including extra whitespace and newlines.
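The premature ending can be reproduced on a reduced input. This is only a sketch: `nestedSample` is a hypothetical cut-down version of the test data, and the variable names are not part of the original question.

```javascript
// The question's original pattern, unchanged.
const originalPattern = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Hypothetical reduced sample: the outer <div> wraps one inner <div>.
const nestedSample = [
  '<w-block-content><span>',
  '<div>',
  'before<div>inner</div>after',
  '</div>',
  '</span></w-block-content>',
].join('\n');

// The lazy capture group stops at the FIRST </div>, so "after" is lost.
const prematureCapture = originalPattern.exec(nestedSample)[1];
console.log(JSON.stringify(prematureCapture)); // "\nbefore<div>inner"
```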
Here is sample test data:
</tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
<div>
Další master<br><div><b>Master č. 2</b> </div><br>
</div>
</span></w-block-content>
</div>
</td>
</tr>
</tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
<div>
Další master<br><b>Master č. 2</b><br>
</div>
</span></w-block-content>
</div>
</td>
</tr>
which I've been testing here: https://regex101.com/r/jekZhr/3
The first extracted chunk should be:
Další master<br><div><b>Master č. 2</b> </div><br>
I know that regex is not the best tool for handling XML/HTML but I need to know if such regex is possible or if I need to change the structure of data.
As already commented, regex isn't a general purpose tool -- in fact it's a specific tool that matches patterns in a string. Having said that here's a regex solution that will match everything after the first <div> up to </w-block-content>. From there find the last index of </div> and .slice() it.
RegExp
/(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)
[\s\S]*?
(?=<\/w-block-content>)/g
regex101
Explanation
A look behind: (?<=...) must precede the match, but will not be included in the match itself.
A look ahead: (?=...) must follow the match, but will not be included in the match itself.
Segment | Description
(?<=<w-block-content[\s\S]*?<div[\s\S]*?>) | Asserts that the literal "<w-block-content", then anything, then the literal "<div", then anything, then the literal ">" comes before whatever is matched. It is not included in the match.
[\s\S]*? | Matches anything, as few characters as possible.
(?=<\/w-block-content>) | Asserts that the literal "</w-block-content>" comes after whatever is matched. It is not included in the match.
Example
const rgx = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/g;
const str = document.querySelector("main").innerHTML;
const A = str.match(rgx)[0];
const idx = A.lastIndexOf("</div>");
const X = A.slice(0, idx);
console.log(X);
<main>
<w-block-content id="A">
CONTENT OF #A
<span id="B">
CONTENT OF #B
<div id="C">
<div>CONTENT OF #C</div>
<div>CONTENT OF #C</div>
</div>
CONTENT OF #B
</span>
CONTENT OF #A
</w-block-content>
</main>
Your pattern uses [\s\S]*?, which matches any character, as few times as possible. Because that same construct also sits between the closing tags, the engine can end the capture group at the first </div> and let the following [\s\S]*? consume everything else up to </span>.
If you want to extract the parts that match, and as you already have a pattern that uses a capture group "as long as there are NO internal tags" you don't need any lookarounds.
You can make your pattern more specific and match the opening and closing tags with only optional whitespace chars in between.
<w-block-content[^<>]*>\s*<span[^<>]*>\s*<div[^<>]*>([^]*?)<\/div>\s*<\/span>\s*<\/w-block-content>
Explanation
<w-block-content[^<>]*>\s* Match the w-block-content element, where [^<>]* is a negated character class that matches optional chars other than < and >, and the \s* matches optional whitespace chars (including newlines)
<span[^<>]*>\s* The same for the span
<div[^<>]*> The same for the div
([^]*?) Capture group 1, match any character including newlines, as few as possible
<\/div>\s*<\/span>\s*<\/w-block-content> Match the ending part, where there can be optional whitespace chars in between the closing tags.
See a regex demo.
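A minimal usage sketch of the pattern above, assuming a reduced sample (`strictSample` is hypothetical, modelled on the question's test data) and reading capture group 1 from each match:

```javascript
// Pattern from this answer, with the g flag added so matchAll can be used.
const strictPattern = /<w-block-content[^<>]*>\s*<span[^<>]*>\s*<div[^<>]*>([^]*?)<\/div>\s*<\/span>\s*<\/w-block-content>/g;

// Hypothetical reduced sample with attributes and an inner <div>.
const strictSample = [
  '<w-block-content data-block-content-id="x"><span class="source-block-tooltip">',
  '<div>',
  'before<div>inner</div>after',
  '</div>',
  '</span></w-block-content>',
].join('\n');

// Group 1 holds the wrapped content; trim() drops the surrounding newlines.
const strictResults = Array.from(strictSample.matchAll(strictPattern), m => m[1].trim());
console.log(strictResults); // [ 'before<div>inner</div>after' ]
```

Because only whitespace is allowed between the closing tags, the inner </div> (followed by "after") cannot end the match.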
See why parsing HTML with a regex is not advisable
Pure regex solution that accepts trickier input than the sample data provided in the question.
The code and data snippet at the bottom includes such tricky input. For example, it includes additional (unexpected) non-whitespace within the matching elements that are not part of the extracted data, HTML comments in this case.
🚩 I inferred this as a requirement from the original regex provided in the question.
None of the other answers as of this writing can handle this input.
⚠️ It also accepts some illegal input, but that's what you get by requiring the use of regular expressions and disallowing a true HTML parser.
On the other hand, an HTML parser will make it difficult to handle the malformed HTML in the sample input given in the question. A conforming parser will handle such "tag soup" by forcibly matching the stray closing tag to an open div further up the tree, prematurely closing any intervening parent elements along the way. So not only will it use the first rather than the last </div> within the data record, it may also close container elements higher up and wreak havoc on how the rest of the file is parsed.
The regex
/<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g
The regex meets all the requirements stated in the question:
It is pure Regexp. It requires no Javascript other than the standard code needed to invoke it.
It can be invoked in one call via String.prototype.matchAll(), which returns an iterator over all matches (wrap it in Array.from() if an array is needed)
Or you can invoke it iteratively via RegExp.prototype.exec(), which returns successive matches on each call and keeps track of where it left off automatically. See the test code below.
Regex grouping is used so that the entire outer "record" is parsed and consumed but the "data" within is still available separately. Otherwise parsing successive records would require additional Javascript code to set the pointer to the end of the record before the next parse. That would not only go against the requirements but would also result in redundant and inefficient parsing.
The full record is available as group 0 of each match
The data within is available as group 1 of each match
It handles all legal extra whitespace within tags
It handles both whitespace and legal non-whitespace between elements (explained above).
In addition:
It works in older browsers, since it does not rely on lookbehind or dotAll
Lookbehind assertions have backward compatibility limits. Lookbehind was added in ECMAScript 2018, but as you can see at the above link and here, not even all of the latest browsers support it.
dotall also has backward compatibility limits
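The one-call invocation via matchAll mentioned above could look roughly like this. The input `twoRecords` is a hypothetical, shortened stand-in for the test data further down, and the variable names are illustrative only.

```javascript
// The regex from this answer (group 0 = full record, group 1 = data).
const recordPattern = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g;

// Hypothetical input: one simple record, one with a comment, an inner div,
// an unmatched closing div, and whitespace inside the closing span tag.
const twoRecords =
  '<w-block-content><span><div>one</div></span></w-block-content>\n' +
  '<w-block-content id="b">\n<!-- note -->\n<span>\n' +
  '<div>two<div>inner</div> stray</div> tail</div>\n' +
  '</span\n>\n</w-block-content>';

// matchAll returns an iterator; Array.from maps each match to its data group.
const extracted = Array.from(twoRecords.matchAll(recordPattern), m => m[1]);
console.log(extracted);
// [ 'one', 'two<div>inner</div> stray</div> tail' ]
```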
The regex explained
/
<w-block-content[^>]*>
opening w-block-content "record" tag with arbitrary attributes and whitespace
[\s\S]*?
arbitrary whitespace and non-whitespace within w-block-content before span
<span[^>]*>
expected nested span with arbitrary attributes and whitespace
[\s\S]*?
arbitrary whitespace and non-whitespace within span before div
<div[^>]*>
expected nested div with arbitrary attributes and whitespace. This div wraps the data.
([\s\S]*?)
the data
<\/div\s*>
the closing div tag with arbitrary legal whitespace.
(?:(?!<\/div\s*>)[\s\S])*?
arbitrary whitespace and non-whitespace within span after div 🌶 except that it guarantees that </div> matched by the preceding pattern is the last one within the span element.
<\/span\s*>
the closing span tag with arbitrary legal whitespace.
[\s\S]*?
arbitrary whitespace and non-whitespace within w-block-content after span
<\/w-block-content\s*>
the closing w-block-content tag with arbitrary legal whitespace.
/g
global flag that enables extracting multiple matches from the input. Affects how String.matchAll and RegExp.exec work.
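To see the "last </div>" guarantee in isolation, here is a small sketch that uses only the closing part of the regex. The string `tail` and the extra capture group around the tempered segment are additions for illustration, not part of the answer's regex.

```javascript
// Closing part only: </div>, then any text that contains no further </div>, then </span>.
const lastCloseDiv = /<\/div\s*>((?:(?!<\/div\s*>)[\s\S])*?)<\/span\s*>/;

// Hypothetical input with two closing div tags before the closing span.
const tail = 'a</div>b</div>c</span>';

// An attempt starting at the first </div> (index 1) fails because another </div>
// follows before </span>, so the match starts at the second one (index 8).
const lastMatch = lastCloseDiv.exec(tail);
console.log(lastMatch.index, lastMatch[1]); // 8 'c'
```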
Tricky Test Data and Example Usage/Test Code
'use strict'
const input = `<tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112">
<span class="source-block-tooltip">
<div>SIMPLE CASE DATA STARTS HERE
Další master<br><b>Master č. 2</b><br>
SIMPLE CASE DATA ENDS HERE</div>
</span>
</w-block-content>
</div>
</td>
</tr><tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content class="tricky"
data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112" >
<!-- TRICKY: whitespace within expected tags above and below,
and also this comment inserted between the tags -->
<span class="source-block-tooltip"
color="burgandy"
> <!-- TRICKY: some more non-whitespace
between expected tags -->
<div
>TRICKY CASE DATA STARTS HERE
<div> TRICKY inner div
Další master<br><b>Master č. 2</b><br>
</div>
TRICKY unmatched closing div tags
</div> Per the requirements, THIS closing div tag should be ignored and
the one below (the last one before the closing outer tags) should be
treated as the closing tag.
TRICKY CASE DATA ENDS HERE</div> TRICKY closing tags can have whitespace including newlines
<!-- TRICKY more stuff between closing tags -->
</span
>
<!-- TRICKY more stuff between closing tags -->
</w-block-content
>
</div>
</td>
</tr>
`
const regex = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>((?:(?!<\/div\s*>)[\s\S])*?)<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g
function extractNextRecord() {
const match = regex.exec(input)
if (match) {
return {record: match[0], data: match[1]}
} else {
return null
}
}
let output = '', result, count = 0
while (result = extractNextRecord()) {
count++
console.log(`-------------------- RECORD ${count} -----------------------\n${result.record}\n---------------------------------------------------\n\n`)
output += `<hr><pre>${result.data.replaceAll('<', '&lt;')}</pre>`
}
output += '<hr>'
output = `<p>Extracted ${count} records:</p>` + output
document.documentElement.innerHTML = output
Here's the regex that worked for me, when applied to the example you provided; I've broken it out to three separate lines for visual clarity, and presumably you'd combine them back into one line or something:
(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>)
[\s\S]*?
(?=<\/div>\s*<\/span>\s*<\/w-block-content>)
I don't think you need to use capture groups () in this case. If you're using a look-behind (?<=) and a look-ahead (?=) for your boundaries-finding (both of which are non-capturing), then you can just let the entire match be the content that you want to find.
I added this answer because I didn't see the other answers using [^>] (= negated character class) to allow the tag strings to be open-ended in accepting additional attributes without entirely skipping any enforcement of tag closure, which I think is a cleaner and safer approach.
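Combined back into one line, the lookaround version might be used as sketched below. It assumes an engine with lookbehind support (ECMAScript 2018 or later); `lookaroundSample` is a hypothetical reduced input.

```javascript
// The three lines above joined into a single pattern, with the g flag.
const lookaroundPattern = /(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>)[\s\S]*?(?=<\/div>\s*<\/span>\s*<\/w-block-content>)/g;

// Hypothetical reduced sample with attributes and an inner <div>.
const lookaroundSample = [
  '<w-block-content data-block-content-id="x"><span class="source-block-tooltip">',
  '<div>',
  'before<div>inner</div>after',
  '</div>',
  '</span></w-block-content>',
].join('\n');

// No capture group needed: the whole match is the wrapped content.
const lookaroundResults = lookaroundSample.match(lookaroundPattern);
console.log(lookaroundResults.map(s => s.trim())); // [ 'before<div>inner</div>after' ]
```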
I'm admittedly not a JavaScript guy here, so: today I learned that JavaScript regex-matching only gained single-line mode (the /s "dotAll" flag) in ECMAScript 2018, so for older engines you have to do those [\s\S] things as a work-around, instead of just using a dot (.). What a pain that must be for you JavaScript folks... sorry.
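For reference, engines that implement ECMAScript 2018 do accept the s (dotAll) flag, which lets the dot match newlines. A small sketch with a made-up input string:

```javascript
// Hypothetical two-line input.
const multiline = '<div>line one\nline two</div>';

// Without the s flag the dot stops at the newline, so nothing matches.
const withoutDotAll = /<div>(.*?)<\/div>/.exec(multiline);

// With the s flag the dot also matches the newline.
const withDotAll = /<div>(.*?)<\/div>/s.exec(multiline);

console.log(withoutDotAll);                 // null
console.log(JSON.stringify(withDotAll[1])); // "line one\nline two"
```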
The following solution assumes that there can only be whitespace and/or newlines between the target </div> and the </span>, which follows from the OP's statement that the <span> only has one direct child and this is the wrapper <div> whose contents we are seeking:
/(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>((?!<\/div>))*[\s]+<\/span>[\s]*?<\/w-block-content>)/gm
https://regex101.com/r/sn0frx/1
EDIT: explanation. This is essentially the OP's regex with the following changes:
A negative lookahead ((?!<\/div>))* is inserted after the pattern's <\/div> to ignore any earlier </div>s.
The OP's character class that now follows this insertion has had the \S removed so is now [\s]*? based on the assumption stated above.
Similarly, the same edit has been made to the character class following the <\/span>, based on the assumption that the </span> we are seeking is the one immediately preceding the </w-block-content>, whitespace and newlines notwithstanding, as indicated in the question.
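A quick check of this regex against a reduced input (`wsSample` is hypothetical). Note that [\s]+ requires at least one whitespace character between the target </div> and </span>, so fully minified input would not match.

```javascript
// The regex from this answer, unchanged.
const whitespaceOnlyPattern = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>((?!<\/div>))*[\s]+<\/span>[\s]*?<\/w-block-content>)/gm;

// Hypothetical reduced sample with an inner <div>.
const wsSample = [
  '<w-block-content data-block-content-id="x"><span class="source-block-tooltip">',
  '<div>',
  'before<div>inner</div>after',
  '</div>',
  '</span></w-block-content>',
].join('\n');

// The inner </div> is followed by "after", not whitespace, so it cannot end the capture.
const wsCapture = whitespaceOnlyPattern.exec(wsSample)[1];
console.log(JSON.stringify(wsCapture)); // "\nbefore<div>inner</div>after\n"
```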
I have a tag like <span style="font-size:10.5pt;\nfont-family:\nKaiTi"> and I want to remove every \n inside the tag (replace it with an empty string).
Note: The tag could be anything (it is not fixed).
I want a regular expression that does this replacement in JavaScript.
You should be able to strip out the \n character before applying this HTML to the page.
Having said that, try this (\\n)
You can see it here: regex101
Edit: A bit of refinement and I have this (\W\\n). It works with the example you provided. It breaks down if you have spaces in the body of the tags (<span> \n </span>).
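An alternative that leaves the tag bodies alone is to match each tag first and clean only inside it. This is a sketch, not the approach from the answer above: `stripNewlinesInTags` is a hypothetical helper, it handles both real newlines and the literal two-character sequence \n because the question does not say which one is meant, and it assumes no > appears inside attribute values.

```javascript
// Remove newlines (real ones or the literal characters "\" + "n") inside tags only.
function stripNewlinesInTags(html) {
  return html.replace(/<[^>]*>/g, tag => tag.replace(/\r?\n|\\n/g, ''));
}

// Real newlines inside the tag are removed; the one in the body is kept.
const cleanedReal = stripNewlinesInTags(
  '<span style="font-size:10.5pt;\nfont-family:\nKaiTi">line1\nline2</span>'
);

// Same idea when the markup contains a literal backslash followed by n.
const cleanedLiteral = stripNewlinesInTags('<b a="x\\ny">');

console.log(JSON.stringify(cleanedReal));
console.log(cleanedLiteral); // <b a="xy">
```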
I've tried everything I know to do. Perhaps someone with more regex experience can assist?
I'm trying to remove all the characters between the characters <p and </p> (basically all the attributes in the p tags).
With the following block of code, it removes everything, including the text inside the <p>
MyString.replace(/<p.*>/, '<p>');
Example: <p style="test" class="test">my content</p> gives <p></p>
Thank you in advance for your help!
Try this RegEx: /<p [^>]*>/, basically just remove the closing bracket from the accepted characters. . matches all characters, that's why this doesn't work. With the new one it stops at the first >.
Edit: You can add the global and multi-line flags: /<p [^>]*>/gm. Also, as one of the comments pointed out, removing the p makes the pattern apply to every tag, although that makes the replacement a bit harder. That RegEx is: /<[^>]*>/gm
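Applied to the example from the question, the replacement could look like this. The sketch uses \b instead of the literal space, which is a small variation on the pattern above so that a bare <p> also matches while <pre> does not.

```javascript
// Example input from the question.
const paragraph = '<p style="test" class="test">my content</p>';

// [^>]* stops at the first ">", so only the opening tag's attributes are dropped.
const stripped = paragraph.replace(/<p\b[^>]*>/g, '<p>');
console.log(stripped); // <p>my content</p>

// \b keeps other tags that merely start with "p" untouched.
const untouched = '<pre class="x">code</pre>'.replace(/<p\b[^>]*>/g, '<p>');
console.log(untouched); // <pre class="x">code</pre>
```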
MyString.replace(/\<p.*<\/p>/, '<p></p>');
I'm trying to replace the urls in the block of text with clickable link while rendering.
The regex I am using:
/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig
Example
This is the text i got from http://www.sample.com
it should be converted to
This is the text i got from
http://www.sample.com
The problem is that when the text contains an img tag, the URL in its src attribute also gets replaced, which I don't want.
Kindly help me replace only direct links, not the links inside src="" attributes.
Thanks
Add a negative look-behind assertion at the beginning of your regex, to search only for strings not after src=":
(?<!src=")
Edit: Unfortunately, look-behind assertions did not work in JavaScript regexes before ECMAScript 2018. Alternatively, you can use a negative look-ahead assertion like this:
((?!src=").{0,4})
remembering that you need to use the matched string in the replacement (otherwise you would delete 4 characters before http://).
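Another way to leave src attributes alone without any look-behind is to match whole tags as the first alternative and return them unchanged. This is a different technique from the one above, shown only as a sketch: `linkify` is a hypothetical helper, and the URL part is a simplified stand-in for the question's regex (it would, for instance, include trailing punctuation in the link).

```javascript
// Wrap bare URLs in <a> tags, skipping anything that sits inside an HTML tag.
function linkify(text) {
  const tagOrUrl = /<[^>]*>|\b(?:https?|ftp|file):\/\/[^\s<>"]+/gi;
  return text.replace(tagOrUrl, match =>
    // A match starting with "<" is a complete tag: return it untouched.
    match.startsWith('<') ? match : `<a href="${match}">${match}</a>`
  );
}

const linked = linkify(
  'This is the text i got from http://www.sample.com <img src="http://www.sample.com/a.png">'
);
console.log(linked);
```

Text that is already wrapped in an anchor element would still be linked a second time, so this only fits input where bare URLs and tags are the two cases to distinguish.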
I have this text:
<a>
a lot of text here with all types of symbols ! : . %& < >
</a>
<a>
another text here with all types of symbols ! : . %& < >
</a>
I want to match the tag name and its contents: so the procedure I'm using is match:
<([^]*?)>(?:([^]*)<\/\1>)?
NOTE: I make the group at the end optional because the closing part can be omitted, for example:
<a>
<a>
another text here with all types of symbols ! : . %& < >
</a>
But my problem is that the regex tries to consume every character, so it opens and closes the tag in a single match and the contents of the tag become:
<a>
another text here with all types of symbols ! : . %& < >
when I wanted to detect two matches one the isolated tag and the other the multiline tag.
NOTE2: This is NOT HTML or XML so I don't need to parse it like wise.
NOTE3: my idea was to replace the regex part:
(?:([^]*)....
by something that would 'match every character until '<' appears at the beginning of the line (this because in the text I'm parsing there can't be tags inside tags) so I thought that would be good.. but I can't seem to find a regex for that :(
I think what you want is /<([a-z0-9-]+)>([^]*?)(?:(<\/\1>)|$|(?=(?:<[a-zA-Z0-9\-]+>)))/gi
I suggest you parse it by program:
1. Match the first occurrence of any opening tag:
<([a-z0-9]+)>
With this, you can get the tag's name.
2. Get the position of the second occurrence of any opening tag and the position of the first occurrence of the closing tag with the same name as the one just read.
3. Compare these positions and decide whether it was a single-line, open-only tag or a multi-line open-and-close tag.
4. Get the content enclosed between the first opening tag and the lowest position obtained in step 2.
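The steps above can be sketched as follows. `parseTags` and its return shape are hypothetical, and the sketch assumes, as the question states, that tags are never nested inside other tags.

```javascript
// Returns one entry per opening tag: content is null for an isolated tag.
function parseTags(text) {
  const openTag = /<([a-z0-9]+)>/gi;
  const results = [];
  let pos = 0;
  while (true) {
    // Step 1: first opening tag at or after pos.
    openTag.lastIndex = pos;
    const open = openTag.exec(text);
    if (!open) break;
    const name = open[1];
    const contentStart = open.index + open[0].length;

    // Step 2: position of the next opening tag and of the matching closing tag.
    openTag.lastIndex = contentStart;
    const nextOpen = openTag.exec(text);
    const nextOpenPos = nextOpen ? nextOpen.index : Infinity;
    const closeTag = `</${name}>`;
    const closeIdx = text.indexOf(closeTag, contentStart);
    const closePos = closeIdx === -1 ? Infinity : closeIdx;

    // Steps 3 and 4: whichever comes first decides the kind of tag.
    if (closePos < nextOpenPos) {
      results.push({ name, content: text.slice(contentStart, closePos) });
      pos = closePos + closeTag.length;
    } else {
      results.push({ name, content: null });
      pos = contentStart;
    }
  }
  return results;
}

const parsed = parseTags('<a>\n<a>\nanother text ! : . %& < >\n</a>');
console.log(parsed);
```

On the question's second example this yields two entries: the isolated `<a>` with `content: null`, then the multi-line `<a>` with its text.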