This question already has answers here:
Parse an HTML string with JS
(15 answers)
Closed 3 years ago.
I have an issue parsing DOM elements when the text contains something like the example below. I want to remove the highlighted text from the markup using JavaScript. Can you please help me with this? I would like to rely on regular expressions for it.
I know how to get the quoted attributes using standard string functions and also using a DOM parser.
For nodes like the one below, string functions such as replace and slice may work, but I would need to traverse the entire string, which is a performance issue.
So I want to use regular expressions to find such attributes in a node.
<p class=MsoListParagraphCxSpFirst style='text-indent:-.25in;mso-list:l0 level1 lfo1'>
In the above example I want to remove the class attribute; the class name could be anything. These nodes are generated by MS Word and are not under my control.
EDIT: The following is the pattern I am using to search for the unquoted text, but it is not working:
var pattern = /<p class=\s*=\s*([^" >]+)/im
Regex101 Example
Regex:
\S+?=[^'"]\S*[^'"\s]
The tricky part with this one is finding the end of the unquoted attribute. In this example I'm assuming the value will not contain any whitespace characters, so I can use the first occurrence of whitespace to terminate the match.
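A minimal sketch of how this idea could be applied in JavaScript, narrowed to the class attribute from the question. It relies on the same assumption as the answer (an unquoted value never contains whitespace), and the helper name `removeUnquotedClass` is hypothetical. It would also strip text that merely looks like an unquoted class attribute inside another attribute's quoted value, so treat it as a starting point only.

```javascript
// Hypothetical helper: strips an unquoted class attribute from a tag string.
// Assumes the unquoted value contains no whitespace, quotes, or '>'.
function removeUnquotedClass(html) {
  // \s+class\s*=\s*  -> the attribute name and the equals sign
  // (?!["'])         -> continue only when the value is NOT quoted
  // [^\s"'>]+        -> the unquoted value, up to whitespace or '>'
  return html.replace(/\s+class\s*=\s*(?!["'])[^\s"'>]+/gi, "");
}

const node =
  "<p class=MsoListParagraphCxSpFirst style='text-indent:-.25in;mso-list:l0 level1 lfo1'>";

console.log(removeUnquotedClass(node));
// <p style='text-indent:-.25in;mso-list:l0 level1 lfo1'>
```

Quoted class attributes are left untouched because of the negative lookahead.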
This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have a string such as
var string = 'blabla blabla</custom-tag>';
where I want to strip all the custom-tag tags. The problem is that the deletion needs to happen sequentially (I think), as multiple instances occur and the tag can contain other tags or be nested inside other tags itself.
At the moment, the best solution I have is
var deleteTag = '<custom-tag>.*<\/custom-tag>';
string= string.replace(new RegExp(deleteTag , 'g'), '');
which leaves me with
blabla <a href="http://www.url.com"
instead of
blabla an url blabla blablabla between tags
Should I implement a loop or is there a way to do this with RegExp?
Thanks!
PS: It is not possible for me to parse my string as HTML, as it contains tags within tags and would thus produce invalid HTML (it is part of a templating module in our software, so the string goes through some iterations before it eventually ends up as HTML).
So it is not a duplicate of questions such as Remove specific HTML tag with its content from javascript string
You can avoid this particular issue by using a lazy match in your regex instead of a greedy one. Try this:
var deleteTag = '<custom-tag>.*?<\/custom-tag>';
string= string.replace(new RegExp(deleteTag , 'g'), '');
If you have nested <custom-tag>, though, regex is probably not the tool for the job.
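If the nesting is simple, one possible workaround is the loop the question asks about: remove only the innermost pairs, then repeat until nothing changes. The sketch below assumes the tags are well balanced and carry no attributes; the function name `stripCustomTags` is made up for illustration.

```javascript
// Hypothetical helper: removes <custom-tag>...</custom-tag> blocks, including
// nested ones, by repeatedly deleting the innermost pairs.
function stripCustomTags(input) {
  // An "innermost" pair is one whose content contains no further opening tag.
  const innermost = /<custom-tag>(?:(?!<custom-tag>)[\s\S])*?<\/custom-tag>/g;
  let previous;
  let current = input;
  do {
    previous = current;
    current = current.replace(innermost, "");
  } while (current !== previous); // stop once a pass removes nothing
  return current;
}

console.log(stripCustomTags("a <custom-tag>1</custom-tag> b <custom-tag>2</custom-tag> c"));
console.log(stripCustomTags("a <custom-tag>x <custom-tag>y</custom-tag> z</custom-tag> b"));
```

Unbalanced input (an opening tag without its closing tag) is simply left in place by this approach.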
This question already has answers here:
RegEx for a^b instead of pow(a,b)
(6 answers)
Closed 4 years ago.
I'd like to apply a regexp to the part of a string that comes before and/or after a specific character, and that character must be outside parentheses.
To be more specific, I am coding a website (in React.JS) that shows logical calculations, and I want to remove the first and last parentheses before and after the main logical operator.
For example, in the string:
"((p∧r)∧(q∧r))∧(p∧q)"
I want to get only: "(p∧r)∧(q∧r)" and "p∧q".
That means I want to get all the characters before and after the only "∧" that is outside any parentheses, and I want to remove the first opening parenthesis and the last closing parenthesis of each of the two strings. (The result could be an array containing the part before and the part after, for example.)
I am already able to remove the outer parentheses with this code:
str.replace(/(\()(.*)(\))/, "$2");
But right now that code is applied to the whole string.
So how do I apply this code to the two parts before and after the only "∧" outside parentheses? If possible I'd prefer a regexp-only solution, but a JavaScript part would do. Thanks in advance.
You will not be able to do this in pure regex; it is similar to trying to parse HTML with regex. In short, it is not possible (and this has been answered many times over in every possible way).
So you will need JavaScript. Doing this in JavaScript will not be simple either, as you will need to parse the entire string to find the ∧ operators that are not inside parentheses.
I would first check whether a parser for mathematical expressions (or something similar) already exists and could be adapted to your needs.
If you do not find one, you will need to write it yourself.
You could do it with a loop that goes through the string character by character, keeping a counter for the nesting level of the parentheses and producing a tree-like data structure as output.
You could also use regex to find the deepest nesting level and build a bottom-up parser: find all the innermost parenthesis groups, save them, and replace each with a special identifier character. You can then repeat the operation to find the groups that are now innermost, check whether they contain a special character, and place them in the tree (as the parent of the element identified by that character).
Once you have your tree structure (one way or another), the root and the first level of elements should be what you want.
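A minimal sketch of the character-by-character idea, reduced to what the question needs: it only splits at the first operator found at nesting level 0 instead of building a full tree. The function names `splitAtTopLevel` and `stripOuterParens` are hypothetical, and the code assumes balanced parentheses and a single-character operator.

```javascript
// Removes one pair of outer parentheses, but only if the first "(" is closed
// by the very last ")" (so "(a)∧(b)" is left alone).
function stripOuterParens(part) {
  if (!part.startsWith("(") || !part.endsWith(")")) return part;
  let depth = 0;
  for (let i = 0; i < part.length; i++) {
    if (part[i] === "(") depth++;
    else if (part[i] === ")") depth--;
    // The opening parenthesis closed before the end: not a wrapping pair.
    if (depth === 0 && i < part.length - 1) return part;
  }
  return part.slice(1, -1);
}

// Splits expr at the first `operator` that sits outside all parentheses.
// Returns [left, right], or null when no top-level operator exists.
function splitAtTopLevel(expr, operator) {
  let depth = 0; // current parenthesis nesting level
  for (let i = 0; i < expr.length; i++) {
    const ch = expr[i];
    if (ch === "(") depth++;
    else if (ch === ")") depth--;
    else if (ch === operator && depth === 0) {
      return [
        stripOuterParens(expr.slice(0, i)),
        stripOuterParens(expr.slice(i + 1)),
      ];
    }
  }
  return null;
}

console.log(splitAtTopLevel("((p∧r)∧(q∧r))∧(p∧q)", "∧"));
// [ '(p∧r)∧(q∧r)', 'p∧q' ]
```

Calling `splitAtTopLevel` again on each returned part would give the next level of the tree described above.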
You could do it this way:
// your string
const fixMe = "((p∧r)∧(q∧r))∧(p∧q)";
// Separate double paren items from single paren items
const parts = fixMe.match(/\(\(.*\)\)|\(.*\)/g).map(
// Get rid of leading and following parens
item => item.replace(/^\(|\)$/g, '')
);
However, note that this solution is not very flexible. A better approach might be some sort of recursive parenthesis lookup where each loop keeps track of the nesting level, etc.
This question already has answers here:
Finding substring whilst ignoring HTML tags
(3 answers)
Closed 2 years ago.
I have HTML content like this:
<p>The bedding was hardly <strong>able to cover</strong> it and seemed ready to slide off any moment.</p>
Here's a complete version of the HTML.
http://collabedit.com/gkuc2
I need to search for the string hardly able to cover (just an example), and I want to ignore any HTML tags inside the string I'm looking for, because in the HTML file there are tags inside the string and a simple search won't find it.
The use case is: I have two versions of a file:
An HTML file with text and tags
The same file but with the raw text only (removed any tags and extra spaces)
The substring that I want to search for (the needle) comes from the text version (which doesn't contain any HTML tags), and I want to find its position in the HTML version (the file that has tags).
What is the regular expression that would work?
Put this between each letter:
(?:<[^>]+>)*
and replace the spaces with:
(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*
Like:
h(?:<[^>]+>)*a(?:<[^>]+>)*r(?:<[^>]+>)*d(?:<[^>]+>)*l(?:<[^>]+>)*y(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*a(?:<[^>]+>)*b(?:<[^>]+>)*l(?:<[^>]+>)*e(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*t(?:<[^>]+>)*o(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*c(?:<[^>]+>)*o(?:<[^>]+>)*v(?:<[^>]+>)*e(?:<[^>]+>)*r
You only need the ones between each letter if you want to allow tags to break words, like: This is b<b>old</b>
This is it without the letter break:
hardly(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*able(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*to(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*cover
This should work for most cases. However, if the HTML is malformed such that a < or > is not HTML-encoded, you may run into issues. It may also break on script blocks or other elements with CDATA sections.
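Writing such a pattern by hand is tedious, so here is a rough sketch of generating it from the needle in JavaScript. The function names `escapeRegExp` and `buildTagTolerantRegex` and the `allowTagsInsideWords` flag are illustrative assumptions; the two sub-patterns are the ones given above.

```javascript
// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a regex that finds `needle` in HTML while tolerating tags between
// words, and optionally between letters.
function buildTagTolerantRegex(needle, allowTagsInsideWords = false) {
  const tagRun = "(?:<[^>]+>)*"; // goes between letters
  const gap = "(?:\\s*<[^>]+>\\s*)*\\s+(?:\\s*<[^>]+>\\s*)*"; // replaces spaces
  const words = needle.trim().split(/\s+/).map((word) => {
    const letters = Array.from(word).map((ch) => escapeRegExp(ch));
    return letters.join(allowTagsInsideWords ? tagRun : "");
  });
  return new RegExp(words.join(gap));
}

const sampleHtml =
  "<p>The bedding was hardly <strong>able to cover</strong> it and seemed ready to slide off any moment.</p>";
const found = buildTagTolerantRegex("hardly able to cover").exec(sampleHtml);

// found.index is the position of the needle inside the HTML version.
console.log(found.index, found[0]);
```

The same caveats apply: malformed HTML, script blocks, and CDATA sections can still defeat the pattern.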
Try to save the text in a variable or something, then remove all the tags and perform a normal search in that.
You can use the simple PHP function strip_tags().
EDIT:
So you might try to look for the first and last words (or just first and then play with the rest of the result) to locate the string, then parse the result and remove tags and check if it's the one you're looking for.
Like using regex:
hardly.*cover
or even
hardly.*$
And saving the location of each result.
Then use strip_tags() on the results and analyze each result if it's the one you want.
I know it's a kind of weird solution, but this way you can avoid endless regexes, etc.
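For comparison, the "remove the tags, then do a normal search" idea can be sketched in JavaScript while still recovering the position in the tagged version. This is only an illustration under simplifying assumptions: every `<` starts a tag, entities are not decoded, and whitespace is not collapsed. The function name `findInHtml` is hypothetical.

```javascript
// Hypothetical helper: strips tags while remembering, for every kept
// character, where it came from in the original HTML.
function findInHtml(html, needle) {
  if (needle === "") return null;
  let text = "";
  const htmlIndexOf = []; // htmlIndexOf[k] = index in html of text[k]
  let inTag = false;
  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (ch === "<") inTag = true;
    else if (ch === ">" && inTag) inTag = false;
    else if (!inTag) {
      text += ch;
      htmlIndexOf.push(i);
    }
  }
  const at = text.indexOf(needle); // normal search in the tag-free text
  if (at === -1) return null;
  return {
    start: htmlIndexOf[at],
    end: htmlIndexOf[at + needle.length - 1] + 1, // exclusive
  };
}

const page =
  "<p>The bedding was hardly <strong>able to cover</strong> it and seemed ready to slide off any moment.</p>";
const range = findInHtml(page, "hardly able to cover");
console.log(page.slice(range.start, range.end));
// hardly <strong>able to cover
```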
I've searched Stack Overflow and found similar questions, but none of them is really what I wanted.
Given an XML string "<a b=\"c\"></a>" in a JavaScript context, I want to create a regex that will capture the attribute value including the quotation marks.
NOTE: this is similar if you're using single quotation marks.
Currently I have a regular expression tailored to the XML specification:
[_A-Za-z][\w\.\-]*(?:=\"[^\"]*\")?
[_A-Za-z][\w\.\-]* //This will match the attribute name.
(?:=\"[^\"]*\")? //This will match the attribute value.
\"[^\"]*\" //This part concerns me.
My question now is, what if the xml string looks like this:
<shout statement="Hi! \"Richeve\"."></shout>
I know this is a dumb question to ask, but I want to handle the rare cases where this scenario might happen. I know the coder can use single quotes in this scenario, but there are cases where we don't know the current value of the attribute because it changes dynamically at runtime.
So to make this clearer, the result of that using the correct regex should be:
"Hi! \"Richeve\"."
I hope my question is clear. Thanks for all the help!
PS: Note that the language context is JavaScript. I know it is tempting to use lookbehinds, but currently lookbehinds are not supported.
PS: I know it is really hard to parse XML, but I have an elegant solution for that :) so I just need this small problem to be solved. The only focus of this question is capturing quoted string tokens that contain quotation marks inside the token.
The standard pattern for content with matching delimiters and embedded escaped delimiters goes like this:
"[^"\\]*(?:\\.[^"\\]*)*"
Ignoring the obvious first and last characters in the pattern, here's how the rest of the pattern works:
[^"\\]*: Consume all characters until a delimiter OR backslash (matching Hi! in your example)
(?:\\.[^"\\]*)* Try to consume a single escaped character \\. followed by a series of non delimiter/backslash characters, repeatedly (matching \"Richeve first and then \". next in your example)
That's it.
You can try to use a more generic delimiter approach using (['"]) and back references, or you can just allow for an alternate pattern with single quotes like so:
("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')
Here's another description of this technique that might also help (see the section called Strings): http://www.regular-expressions.info/examplesprogrammer.html
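As a quick check, the combined pattern can be run against the example from the question in JavaScript. This is just a sketch; the variable names are arbitrary, and `String.raw` is used so the backslashes in the sample appear exactly as they would in the XML text.

```javascript
// Double-quoted or single-quoted token, allowing backslash-escaped characters.
const quotedToken = /"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'/g;

// String.raw keeps the backslashes literally, as in the XML source.
const xml = String.raw`<shout statement="Hi! \"Richeve\"."></shout>`;

const tokens = xml.match(quotedToken);
console.log(tokens[0]);
// "Hi! \"Richeve\"."
```

No lookbehind is involved, so the pattern also works in JavaScript engines without lookbehind support.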
Description
I'm pretty sure embedding double quotes inside a double-quoted attribute value is not legal. You could use the character reference for a double quote, &quot; (or &#x22;), inside the value instead.
However to answer the question, this expression will:
allow escaped quotes inside attribute values
capture the statement attribute's value
allow attributes to appear in any order inside the tag
will avoid many of the edge cases which will trip up pattern matching inside html text
doesn't use lookbehinds
<shout\b(?=\s)(?=(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*?\sstatement=(['"])((?:\\['"]|.)*?)\1(?:\s|\/>|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/shout>
Example
Pretty Rubular
Ugly RegexPlanet set to Javascript
Sample Text
Note the difficult edge case in the first attribute :)
<shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
Matches
Group 0 gets the entire tag from open to close
Group 1 gets the quote surrounding the statement attribute value, this is used to match the closing quote correctly
Group 2 gets the statement attribute value which may include escaped quotes like \" but not including the surrounding quotes
[0][0] = <shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
[0][1] = "
[0][2] = Hi! \"Richeve\".
This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
I'm trying to use RegEx to select all strings between two dollar signs.
text = text.replace(/\$.*\$/g, "meow");
I'm trying to turn all text between two dollar signs into "meow" (placeholder).
EDIT:
Original question changed because the solution was too localized, but the accepted answer is useful information.
That's pretty close to what you want, but it will fail if you have multiple pairs of $text$ in your string. If you make your .* repeater lazy, it will fix that. E.g.,
text = text.replace(/\$.*?\$/g, "meow");
I see one problem: if you have more than one "template" like
aasdasdsadsdsa $a$ dasdasdsd $b$ asdasdasdsa
your regular expression will treat '$a$ dasdasdsd $b$' as one piece of text between two dollar signs. You can use a more restrictive regular expression like
/\$[^$]*\$/g
to match the two strings in this example separately.
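A small side-by-side comparison of the three variants discussed in this thread, using the sample string above. The variable names are arbitrary.

```javascript
const sample = "aasdasdsadsdsa $a$ dasdasdsd $b$ asdasdasdsa";

// Greedy: .* runs to the LAST dollar sign, swallowing both pairs at once.
const greedy = sample.replace(/\$.*\$/g, "meow");
// "aasdasdsadsdsa meow asdasdasdsa"

// Lazy: .*? stops at the first closing dollar sign.
const lazy = sample.replace(/\$.*?\$/g, "meow");
// "aasdasdsadsdsa meow dasdasdsd meow asdasdasdsa"

// Negated class: [^$]* can never cross a dollar sign.
const negated = sample.replace(/\$[^$]*\$/g, "meow");
// same result as the lazy version for this input

console.log(greedy);
console.log(lazy);
console.log(negated);
```

One difference worth knowing: `.` does not match a line break by default, while `[^$]` does, so only the negated-class version matches a `$...$` pair that spans several lines.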