How do you remove from /> back to < and everything in between? (Javascript)

I'm having an issue with some XML when processing it with my Javascript, because the Node modules (libxslt & libxmljs) don't know how to handle a self closing tag. Through some different testing I have narrowed the problem down to XML elements that self close, like the center element in the example below:
var string =
"<head>
<body>
<example />
</body>
</head>"
Simply put, I need a way of removing
<example />
entirely; without knowing the position prior, because there are multiple in a document, and without addressing the tag name directly, because the self closing tags vary from document to document.
If .replace() passes the position of the match to a function given as its second parameter, something like this could work:
string.replace('/>', function(match){
//search from match back for the closest '<' and remove that substring.
})

Thanks all for the advice, particularly to #Tonioyoyo, whose answer led to the solution below:
//Xml with random element tags
var xml = "<head><body><example1 /><example2 /><example3 /></body></head>"
//Convert to string
xml = xml.toString();
//Create pattern variable to match self-closing elements
var myRegexp = /.*?(\<\w+\s*\/\>).*/
//Removing all problem elements
var match = myRegexp.exec(xml);
while (match != null && match[1] != null) {
xml = xml.replace(match[1], '')
match = myRegexp.exec(xml);
}
//Log result
console.log(xml);
However, the real problem turned out to be a comma getting added, like so:
<opti,ons/>
This happened when porting from SQL to Node.js using the node package 'mssql' (the comma was not in the source SQL), and it produced the mismatched tags error. Using:
xml = xml.toString();
xml = xml.replace('<opti,ons/>', ''); //Fixes the mismatched tags error.
This means that #Quentin is correct: the Node modules libxslt & libxmljs do know how to deal with self-closing tags. The added comma was the problem, not the tags.

You can write your own regular expression to capture either self closing tags or code between classic tags.
For instance, if you do:
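var string =
"<head>\n" +
"<body>\n" +
"<example />\n" +
"</body>\n" +
"</head>";
// [^<>]* keeps each match inside one tag; the g flag removes every self-closing tag
var pattern = /<[^<>]*\/>\n?/g;
var result = string.replace(pattern, '');
You will end up with a string value equal to: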
<head>
<body>
</body>
</head>
And if you want to test your regular expression online, you may want to visit https://regex101.com/ (you can test for Javascript language)
Hope this helps :)
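The callback idea from the question can be sketched as follows. String.prototype.replace does pass the offset of each match to a replacement function, so every self-closing tag can be removed and its position recorded in one pass. The helper name removeSelfClosingTags and the sample input are assumptions for illustration, and the pattern assumes that no < or > appears inside attribute values.

```javascript
// Hypothetical helper: strips every self-closing tag and records where each one was found.
function removeSelfClosingTags(xml) {
  const removed = [];
  // [^<>]* cannot cross a tag boundary, so each match is exactly one <.../> tag.
  const cleaned = xml.replace(/<[^<>]*\/>/g, (match, offset) => {
    removed.push({ tag: match, offset: offset }); // offset is relative to the original string
    return "";
  });
  return { cleaned: cleaned, removed: removed };
}

const sampleXml = "<head><body><example1 /><example2 /><example3 /></body></head>";
const outcome = removeSelfClosingTags(sampleXml);
console.log(outcome.cleaned);        // <head><body></body></head>
console.log(outcome.removed.length); // 3
```

Recording the offsets is optional; it is only there to show that the position is available without a manual backwards search for the nearest <.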

Related

Replace with RegExp only outside tags in the string

I have strings in which some HTML tags may be present, like
this is a nice day for bowling <b>bbbb</b>
How can I use a RegExp to replace all b symbols with, for example, :blablabla:, but ONLY outside HTML tags?
So in that case the resulting string should become
this is a nice day for :blablabla:owling <b>bbbb</b>
EDIT: I would like to be more specific, based on the answers I have received. So first of all I have just a string, not DOM element, or anything else. The string may or may not contain tags (opening and closing). The main idea is to be able to replace anywhere in the text except inside tags. For example if I have a string like
not feeling well today :/ check out this link <a href="http://example.com">http://example.com</a>
the regexp should replace only the first :/ with a real smiley image, but should not replace the second and third, because they are inside (and part of) the tag. Here's an example snippet using the regexp from one of the answers.
var s = 'not feeling well today :/ check out this link <a href="http://example.com">http://example.com</a>';
var replaced = s.replace(/(?:<[^\/]*?.*?<\/.*?>)|(:\/)/g, "smiley_image_here");
document.querySelector("pre").textContent = replaced;
<pre></pre>
Strangely, the demo shows that the regexp captures the correct group, but the same regexp does not seem to work in the replace function.
The regex itself to replace all bs with :blablabla: is not that hard:
.replace(/b/g, ":blablabla:")
It is a bit tricky to get the text nodes where we need to perform search and replace.
Here is a DOM-based example:
function replaceTextOutsideTags(input) {
var doc = document.createDocumentFragment();
var wrapper = document.createElement('myelt');
wrapper.innerHTML = input;
doc.appendChild( wrapper );
return textNodesUnder(doc);
}
function textNodesUnder(el){
var n, walk=document.createTreeWalker(el,NodeFilter.SHOW_TEXT,null,false);
while(n=walk.nextNode())
{
if (n.parentNode.nodeName.toLowerCase() === 'myelt')
n.nodeValue = n.nodeValue.replace(/:\/(?!\/)/g, "smiley_here");
}
return el.firstChild.innerHTML;
}
var s = 'not feeling well today :/ check out this link <a href="http://example.com">http://example.com</a>';
console.log(replaceTextOutsideTags(s));
Here, we only modify the text nodes that are direct children of the custom-created element named myelt.
Result:
not feeling well today smiley_here check out this link <a href="http://example.com">http://example.com</a>
var input = "this is a nice day for bowling <b>bbbb</b>";
var result = input.replace(/(^|>)([^<]*)(<|$)/g, function(_,a,b,c){
return a
+ b.replace(/b/g, ':blablabla:')
+ c;
});
document.querySelector("pre").textContent = result;
<pre></pre>
You can do this:
var result = input.replace(/(^|>)([^<]*)(<|$)/g, function(_,a,b,c){
return a
+ b.replace(/b/g, ':blablabla:') // you may do something else here
+ c;
});
Note that in most (not all, but most) real, complex use cases, it's much more convenient to manipulate a parsed DOM rather than just a string. If you're starting with an HTML page, you might use a library (some, like mine, accept regexes to do so).
I think you can use a regex like this (only for simple data, not nested tags):
/<[^\/]*?b.*?<\/.*?>|(b)/ig
If you want to use a regex, I suggest the regex below, applied repeatedly until all tags are removed:
/<[^\/][^<]*>[^<]*<\/.*?>/g
then use a replace for finding any b.
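One way the alternation pattern can work with replace is sketched below. A plain replacement string overwrites every match, including the tag alternative, so the sketch uses a replacement function that returns tag matches unchanged and only substitutes when the capture group matched. The helper name replaceOutsideElements is an assumption, and the pattern is a simplified variant that only handles simple, non-nested elements.

```javascript
// Hypothetical helper: replaces every "b" that is not part of a simple, non-nested element.
function replaceOutsideElements(input) {
  // First alternative: a whole element such as <b>bbbb</b>, matched so it gets skipped.
  // Second alternative: a lone "b" outside any element, captured in group 1.
  const pattern = /<[^>]*>[^<]*<\/[^>]*>|(b)/g;
  return input.replace(pattern, (match, outside) =>
    // The group is undefined when the element alternative matched: keep that text as-is.
    outside === undefined ? match : ":blablabla:"
  );
}

console.log(replaceOutsideElements("this is a nice day for bowling <b>bbbb</b>"));
// this is a nice day for :blablabla:owling <b>bbbb</b>
```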

Javascript regex not containing keyword with backslashes

I'm having a problem with a javascript regex that has to comment out all tags inside a script tag. But it must not comment out the special first script tag with id "ignorescript".
Here is a sample string to regex:
<script id="ignorescript">
var test = '<script>test<\/script>';
var xxxx = 'x';
</script>
The script tag inside ignorescript has an extra backslash because it is JSON encoded (from PHP).
And here is the final result I have to get:
<script id="ignorescript">
var test = '<!--ignore <script>test<\/script> ignore-->';
var xxxx = 'x';
</script>
Following example works:
content = content.replace(/(<script>.*<\\\/script>)/g,
"<!--ignore $1 ignore-->");
But I need to check that it does not contain the keyword "ignorescript". If that keyword comes up, I do not want to replace anything; otherwise, ignore comments should be added around the whole script tag. This is what I have so far:
content = content.replace(/(<script.((?!ignorescript).)*<\/script>)/g,
"<!--ignore $1 ignore-->");
It kind of works, but not the way it is supposed to. I also have one more backslash in the ending tag, so I changed it to:
content = content.replace(/(<script.((?!ignorescript).)*<\\\/script>)/g,
"<!--ignore $1 ignore-->");
Now it does not find anything at all.
Got it finally working.
Here is the working regex:
/(<script(?!\sid="ignorescript").*?<\\\/script>)/g
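A quick check of that regex against the sample from the question might look like this. The variable names are assumptions, and the sample is built with an escaped backslash so that the string really contains the JSON-style <\/script> ending.

```javascript
// Sample content; "\\/" in the source produces a literal backslash followed by a slash.
const content = [
  '<script id="ignorescript">',
  "var test = '<script>test<\\/script>';",
  "var xxxx = 'x';",
  "</script>"
].join("\n");

// (?!\sid="ignorescript") rejects the outer tag; <\\\/script> only matches the escaped ending.
const ignorePattern = /(<script(?!\sid="ignorescript").*?<\\\/script>)/g;
const commented = content.replace(ignorePattern, "<!--ignore $1 ignore-->");

console.log(commented);
```

Only the inner, JSON-encoded script tag is wrapped. The outer tag is skipped by the lookahead, and its closing </script> has no backslash, so the pattern cannot end a match there.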

regex replace characters within tags

I'm already using a html parser, but I need to create a regex that will select the < and > symbols within the first instance of <code> tags - in this case, the one with the class "html".
<code class="html">
<b>test</b><script>lol</script>
<code>test</code> <b> test </b>
<lol>
</lol>
<test>
</code>
So every < or > within the indented area starting from <b> to the start of the last </code> should be replaced, leaving the outer <code> tags alone.
I'm using javascript's .replace method and would like all < and > symbols within the code area to turn into the entities &lt; and &gt;.
I imagine it's best to use a look forward/back regex using $1 etc. but can't figure out where to begin, so any help would be much appreciated.
How about something like this? In this example I'm creating a variable and populating the variable with html, just to get things started
var doc = document.createElement( 'div' );
doc.innerHTML = ---your input html here
Here I'm pulling the code tag
var string = doc.getElementsByTagName( 'code' )[0].innerHTML; // getElementsByTagName returns a collection, so take the first element
Once you have the string then simply replace the desired brackets with
string = string.replace(/</g, "&lt;"); // g flag replaces every occurrence, not just the first
string = string.replace(/>/g, "&gt;");
then just reinsert the replaced value back into your source html
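For a string-only variant of these steps, a minimal sketch could look like the following. It assumes the outer element is the first <code ...> opening tag in the source and that the last </code> closes it; the helper name escapeInsideFirstCode is hypothetical.

```javascript
// Hypothetical helper: escapes < and > between the first <code ...> and the last </code>.
function escapeInsideFirstCode(source) {
  const open = source.match(/<code\b[^>]*>/);
  const closeIndex = source.lastIndexOf("</code>");
  if (!open || closeIndex < 0) return source;  // no code element found
  const start = open.index + open[0].length;   // first character after the opening tag
  if (closeIndex < start) return source;       // malformed input: leave it untouched
  const inner = source.slice(start, closeIndex)
    .replace(/[<>]/g, ch => (ch === "<" ? "&lt;" : "&gt;"));
  return source.slice(0, start) + inner + source.slice(closeIndex);
}

console.log(escapeInsideFirstCode('<code class="html"><b>test</b><code>x</code></code>'));
// <code class="html">&lt;b&gt;test&lt;/b&gt;&lt;code&gt;x&lt;/code&gt;</code>
```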
The easy way:
var elem = $('.html');
elem.text(elem.html());
This will not necessarily use literally &lt; for escaping; if you're fine with a different escape, it's much simpler than anything else you can do, though.
If you have multiple elements like that, you might need to wrap the second line in an elem.each(); otherwise the html() method will probably just concatenate the content from all elements or something similarly pointless.

Remove and extract text in javascript

I'm wanting to do the following in JavaScript as efficiently as possible:
Remove <ul></ul> tags from a string and everything in between.
For what remains, every string that is encased within <li> and </li> I want dumped in an array, without any newline characters lurking at the end.
I'm thinking regexes are the answer but I've never used them before. Guess I could figure out a way but eventually it would probably not be the most efficient.
As others have said, you have to be careful when parsing HTML with regexes. It can work fine if the HTML is controlled (e.g. it comes from a known source in a known format), has no nested ul or li tags, and has no embedded strings that contain valid HTML tags or < or > chars. Here's one way to do what I think you were asking for:
function parseList(str) {
var output = [], matches;
var re = /<\s*li[^>]*>(.*?)<\/li>/gi;
// remove newlines
str = str.replace(/\n|\r/igm, "");
// get text between ul tags
matches = str.match(/<\s*ul[^>]*>(.*?)<\/ul\s*>/);
if (matches) {
str = matches[1];
// get text between each li tag
while (matches = re.exec(str)) {
output.push(matches[1]);
}
}
return(output);
}
It is more foolproof to use an actual HTML parser that understands the finer points of the format (like nested tags, tag values in embedded strings, etc...), but if you have none of that, a simpler parser like this can be used.
You can see it work here: http://jsfiddle.net/jfriend00/c9ZLT/
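A variant of the same idea that also returns the text left over once the list is removed is sketched below. It is an alternative under the same assumptions (a single, non-nested list from a known source), it relies on String.prototype.matchAll (ES2020), and the name extractListItems is hypothetical.

```javascript
// Hypothetical helper: returns the li contents and the input with the ul block removed.
function extractListItems(html) {
  // [\s\S] matches newlines too, so the input does not need to be flattened first.
  const list = html.match(/<ul\b[^>]*>([\s\S]*?)<\/ul\s*>/i);
  if (!list) return { items: [], remainder: html };
  const items = Array.from(
    list[1].matchAll(/<li\b[^>]*>([\s\S]*?)<\/li\s*>/gi),
    m => m[1].trim() // trim drops newlines and spaces around each item
  );
  return { items: items, remainder: html.replace(list[0], "") };
}

const parsedList = extractListItems("before<ul>\n<li>one\n</li>\n<li> two</li>\n</ul>after");
console.log(parsedList.items);     // [ 'one', 'two' ]
console.log(parsedList.remainder); // beforeafter
```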

How to get regex to match multiple script tags?

I'm trying to return the contents of any <script> tags in a body of text. I'm currently using the following expression, but it only captures the contents of the first tag and ignores any others after that.
Here's a sample of the html:
<script type="text/javascript">
alert('1');
</script>
<div>Test</div>
<script type="text/javascript">
alert('2');
</script>
My regex looks like this:
//scripttext contains the sample
re = /<script\b[^>]*>([\s\S]*?)<\/script>/gm;
var scripts = re.exec(scripttext);
When I run this on IE6, it returns 2 matches. The first containing the full tag, the 2nd containing alert('1').
When I run it on http://www.pagecolumn.com/tool/regtest.htm it gives me 2 results, each containing the script tags only.
The "problem" here is in how exec works. It matches only the first occurrence, but stores the current index (i.e. caret position) in the lastIndex property of the regex. To get all matches, simply apply the regex to the string until it fails to match (this is a pretty common way to do it):
var scripttext = ' <script type="text/javascript">\nalert(\'1\');\n</script>\n\n<div>Test</div>\n\n<script type="text/javascript">\nalert(\'2\');\n</script>';
var re = /<script\b[^>]*>([\s\S]*?)<\/script>/gm;
var match;
while (match = re.exec(scripttext)) {
// full match is in match[0], whereas captured groups are in ...[1], ...[2], etc.
console.log(match[1]);
}
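In environments that support ES2020, String.prototype.matchAll performs the same iteration without a manual loop. The sketch below assumes such an environment; the function name scriptContents and the sample variable are made up for illustration.

```javascript
// Sample input equivalent to the HTML in the question.
const sampleHtml = [
  '<script type="text/javascript">',
  "alert('1');",
  "</script>",
  "<div>Test</div>",
  '<script type="text/javascript">',
  "alert('2');",
  "</script>"
].join("\n");

// Hypothetical helper: returns the trimmed body of every script element.
function scriptContents(html) {
  // matchAll requires the g flag; each result exposes the capture groups like exec does.
  const re = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  return Array.from(html.matchAll(re), m => m[1].trim());
}

console.log(scriptContents(sampleHtml)); // [ "alert('1');", "alert('2');" ]
```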
Don't use regular expressions for parsing HTML. HTML is not a regular language. Use the power of the DOM. This is much easier, because it is the right tool.
var scripts = document.getElementsByTagName('script');
Try using the global flag:
document.body.innerHTML.match(/<script.*?>([\s\S]*?)<\/script>/gmi)
Edit: added multiple line and case insensitive flags (for obvious reasons).
The first group contains the content of the tags.
Edit: Don't you have to surround the regex statement with quotes? Like:
re = "/<script\b[^>]*>([\s\S]*?)<\/script>/gm";
In .Net there's a submatch method and in PHP there's preg_match_all, either of which should solve your problem. In Javascript there isn't such a method, but you can write one yourself.
Test in
http://www.pagecolumn.com/tool/regtest.htm
Selecting the $1elements method will return what you want.
try this
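var scripts = document.getElementsByTagName('script');
for (var i = 0; i < scripts.length; i++) {
var x = scripts[i];
if (x && x.innerHTML) {
// .* (any characters) replaces the original \.* which only matched literal dots
var yourRegex = /http:\/\/.*\.com/g;
var matches = yourRegex.exec(x.innerHTML);
if (matches) {
// your code
}
}
}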
