javascript regex for xml/html attributes - javascript

I can't seem to build a good regular expression (in JavaScript) that extracts each attribute from an XML node. For example, given
<Node attribute="one" attribute2="two" n="nth"></node>
I need an expression that gives me an array of
['attribute="one"', 'attribute2="two"' ,'n="nth"']
...
Any help would be appreciated. Thank you

In case you missed Kerrek's comment:
you can't parse XML with a regular expression.
And the link: RegEx match open tags except XHTML self-contained tags
You can get the attributes of a node by iterating over its attributes property:
function getAttributes(el) {
var r = [];
var a, atts = el.attributes;
for (var i=0, iLen=atts.length; i<iLen; i++) {
a = atts[i];
r.push(a.name + ': ' + a.value);
}
alert(r.join('\n'));
}
Of course, you probably want to do something other than just put them in an alert.
Here is an article on MDN that includes links to relevant standards:
https://developer.mozilla.org/En/DOM/Node.attributes

Try this:
<script type="text/javascript">
var myregexp = /<node((\s+\w+=\"[^\"]+\")+)><\/node>/im;
var match = myregexp.exec("<Node attribute=\"one\" attribute2=\"two\" n=\"nth\"></node>");
if (match != null) {
// match[1] holds the whole attribute list; splitting on whitespace
// only works while no attribute value contains a space
var result = match[1].trim();
var arrayAttrs = result.split(/\s+/);
alert(arrayAttrs);
}
</script>

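I think you could get it using the following. The first and second capture groups hold the attribute name and value. Note that a repeated group only keeps its last iteration, so a single match gives you the last attribute of the tag.
<[\w\-]+(?:\s+([\w\-]+)="([^"]*)")*\s*>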

The regex is /\w+="[^"]*"/g (note the g for global; [^"]* stops at the closing quote, whereas a greedy .+ would run on to the last quote in the tag).
You can try it right now in your Firebug or Chrome console by doing:
var matches = '<Node attribute="one" attribute2="two" n="nth"></node>'.match(/\w+="[^"]*"/g)
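If you need the names and values separately, a loop over exec() with two capture groups is one way to get them. This is only a minimal sketch, assuming the input is a single well-formed tag string with double-quoted values; listAttributes is a made-up helper name, not something from the answers above.

```javascript
// Hypothetical helper: assumes one well-formed tag with double-quoted values.
function listAttributes(tag) {
  var pattern = /([\w-]+)="([^"]*)"/g;
  var pairs = [];
  var match;
  // With the g flag, each exec() call resumes after the previous match.
  while ((match = pattern.exec(tag)) !== null) {
    pairs.push({ name: match[1], value: match[2], text: match[0] });
  }
  return pairs;
}

var attributePairs = listAttributes('<Node attribute="one" attribute2="two" n="nth"></node>');
// Logs the three 'name="value"' strings asked for in the question.
console.log(attributePairs.map(function (p) { return p.text; }));
```

It will still trip over single-quoted values, unquoted values, or a ">" inside a value, which is why the DOM route is the safer one for real documents.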

Related

Javascript string replace with regex variable manipulation

How do I replace every number within a string pattern with that number plus an offset?
Say I want to replace the number in every matching HTML tag with that number plus an offset:
strRegEx = /<ol start="(\d+)">/gi;
strContent = strContent.replace(strRegEx, function() {
/* return $1 + numOffset; */
});
@Tomalak is right, you shouldn't really use regexes with HTML; you should use the browser's own HTML DOM or an XML parser.
For example, if that tag also had another attribute assigned to it, such as a class, the regex would not match it.
<ol start="#" > does not equal <ol class="foo" start="#">.
Regexes are not a reliable way to do this. Instead, go through the DOM to find the elements you are looking for, grab their attributes, check whether they match, and then go from there.
function replaceWithOffset(offset) {
var elements = document.getElementsByTagName("ol");
for(var i = 0; i < elements.length; i++) {
if(elements[i].hasAttribute("start")) {
// parse the current value as a base-10 integer before adding the offset
elements[i].setAttribute("start", parseInt(elements[i].getAttribute("start"), 10) + offset);
}
}
}
A plain replacement string can't do arithmetic on the captured value, so doing what you need requires a bit more effort.
Executing a global regex with .exec() multiple times returns subsequent results until no more matches are available, at which point null is returned. You can use that in a while loop and then use the returned match to slice the original input and perform your modifications manually:
var strContent = "<ol start=\"1\"><ol start=\"2\"><ol start=\"3\"><ol start=\"4\">"
var strRegEx = /<ol start="(\d+)">/g;
var match = null
while (match = strRegEx.exec(strContent)) {
var tag = match[0]
var value = match[1]
var rightMark = match.index + tag.length - 2
var leftMark = rightMark - value.length
strContent = strContent.substr(0, leftMark) + (1 + +value) + strContent.substr(rightMark)
}
console.log(strContent)
Note: as @Tomalak said, parsing HTML with regexes is generally a bad idea. But if you're parsing just a piece of content whose precise structure you know beforehand, I don't see any particular issue ...
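For what it's worth, String.prototype.replace also accepts a function as its second argument, and that function receives the whole match followed by the captured groups. A minimal sketch of the offset replacement along those lines, assuming the tags look exactly like <ol start="N"> (addOffsetToStart and its parameter names are hypothetical):

```javascript
// Hypothetical helper: only handles tags written exactly as <ol start="N">.
function addOffsetToStart(html, offset) {
  return html.replace(/<ol start="(\d+)">/gi, function (whole, digits) {
    // digits is the text captured by (\d+); convert it before adding.
    return '<ol start="' + (parseInt(digits, 10) + offset) + '">';
  });
}

var shifted = addOffsetToStart('<ol start="1"><ol start="9">', 5);
console.log(shifted); // <ol start="6"><ol start="14">
```

Because the callback returns the full replacement text, there is no index bookkeeping, and values that change length (9 becoming 14) need no special care.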

javascript regular expression - getting value after colon, without the colon

I have tried the https://www.regex101.com/#javascript tool, as well as a similar Stack Overflow question, and yet haven't been able to solve or understand this. Hopefully someone here can explain what I am doing wrong. I have made the example as detailed and step-by-step as I can.
My goal is to be able to parse custom attributes, so for example:
I wrote some jquery code to pull in the attribute and the value, and then wanted to run regex against the result.
Below is the html/js, the output screenshot, and the regular expression screenshot, which says my regex query should match what I am expecting.
Expected result: 'valOne'
Result: ':valOne' <-- why am I getting a ':' character?
<html>
<head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js"></script>
<script>
$(document).ready(function() {
$('[customAttr]').each(function(){
var attrValues = $(this).attr('customAttr');
var regEx_attrVal = /[\w:]+?(?=;|$)/g;
var regEx_preColon = /[\w]+?(?=:)/g;
var regEx_postColon = /:(\w*)+?(?=;|\b)/g;
var customAttrVal = attrValues.match(regEx_attrVal);
var customAttrVal_string = customAttrVal.toString();
console.log('customAttrVal:');
console.log(customAttrVal);
console.log('customAttrVal_string: '+customAttrVal_string);
var preColon = customAttrVal_string.match(regEx_preColon);
preColon_string =preColon.toString();
console.log('preColon');
console.log(preColon);
console.log('preColon_string: '+preColon_string);
var postColon = customAttrVal_string.match(regEx_postColon);
postColon_string = postColon.toString();
console.log('postColon');
console.log(postColon);
console.log('postColon_string: '+postColon_string);
console.log('pre: '+preColon_string);
console.log('post: '+postColon_string);
});
});
</script>
</head>
<body>
<div customAttr="val1:valOne">
Test custom attr
</div>
</body>
</html>
When you use String#match() with a regex that has the global modifier, all capture groups defined in the pattern are lost, and you only get an array of whole matched values. (The captured values are the strings shown as Group 1 and higher in the 'MATCH INFORMATION' pane at the bottom right of regex101.com.)
You need to remove /g from your regexps and fix them as follows:
var regEx_attrVal = /[\w:]+(?=;|$)/;
var regEx_preColon = /\w+(?=:)/;
var regEx_postColon = /:(\w+)(?=;|\b)/;
Then, when getting the regEx_postColon captured value, use
var postColon = customAttrVal_string.match(regEx_postColon);
var postColon_string = postColon !== null ? postColon[1] : "";
First, check if there is a postColon regex match, then access the captured value with postColon[1].
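The difference is easy to see in a console. A small sketch using the value from the question (the variable names are only for illustration):

```javascript
var input = "val1:valOne";

// With /g, match() returns only the whole matches; capture groups are dropped.
var withGlobal = input.match(/:(\w+)/g);

// Without /g, match() returns the whole match at index 0 and the groups after it.
var withoutGlobal = input.match(/:(\w+)/);

console.log(withGlobal[0]);    // ":valOne"
console.log(withoutGlobal[0]); // ":valOne"
console.log(withoutGlobal[1]); // "valOne"
```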
See the whole updated code:
$(document).ready(function() {
$('[customAttr]').each(function() {
var attrValues = $(this).attr('customAttr');
var regEx_attrVal = /[\w:]+(?=;|$)/;
var regEx_preColon = /\w+(?=:)/;
var regEx_postColon = /:(\w+)(?=;|\b)/;
var customAttrVal = attrValues.match(regEx_attrVal);
var customAttrVal_string = customAttrVal.toString();
console.log('customAttrVal:');
console.log(customAttrVal);
console.log('customAttrVal_string: ' + customAttrVal_string);
var preColon = customAttrVal_string.match(regEx_preColon);
var preColon_string = preColon.toString();
console.log('preColon');
console.log(preColon);
console.log('preColon_string: ' + preColon_string);
var postColon = customAttrVal_string.match(regEx_postColon);
var postColon_string = postColon !== null ? postColon[1] : "";
console.log('postColon');
console.log(postColon);
console.log('postColon_string: ' + postColon_string);
console.log('pre: ' + preColon_string);
console.log('post: ' + postColon_string);
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div customAttr="val1:valOne">
Test custom attr
</div>
I haven't trudged through all the code, but something you need to understand about regexes is the difference between $0 and $1.
$0 is highlighted in blue. That is the entire part the regex matched.
You want $1. That's where the matches captured by the parentheses are.
Read more about capture groups here.
var match = myRegexp.exec(myString);
alert(match[1]); // This accesses $1
Use data attributes. You can store JSON strings in them and access them like objects.
HTML
<div id='div' data-custom='{"val1":"valOne","a":"b"}'></div>
jQ
$("#div").data("custom").val1; //valOne
$("#div").data("custom").a; //b
I guess this is the regex pattern that you're looking for:
(?!(.*?):).*
Explanation
(.*?): matches any characters, as few as possible, followed by a colon (:) symbol
(?! ) is a negative lookahead, so the match can only start at a position that is not followed by such a sequence ending in a colon
.* then matches everything from that position to the end of the line
You can also do the same with the plain JavaScript substring method, which for me is the simplest way to do it, just like this:
How to substring in jquery
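A minimal sketch of that substring approach, assuming the part you want is everything after the first colon (afterColon is a hypothetical helper name):

```javascript
// Hypothetical helper: returns the text after the first colon, or "" if none.
function afterColon(text) {
  var index = text.indexOf(":");
  return index === -1 ? "" : text.substring(index + 1);
}

console.log(afterColon("val1:valOne")); // "valOne"
```

No regex is involved, so there is no leading colon to strip afterwards.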

Javascript to extract *.com

I am looking for a javascript function/regex to extract *.com from a URI... (to be done on client side)
It should work for the following cases:
siphone.com = siphone.com
qwr.siphone.com = siphone.com
www.qwr.siphone.com = siphone.com
qw.rock.siphone.com = siphone.com
<http://www.qwr.siphone.com> = siphone.com
Much appreciated!
Edit: Sorry, I missed a case:
http://www.qwr.siphone.com/default.htm = siphone.com
I guess this regex should work for a few cases:
/[\w]+\.(com|ca|org|net)/
I'm not good with JavaScript, but there should be a library for splitting URIs out there, right?
According to that link, here's a "strict" regex:
/^(?:([^:\/?#]+):)?(?:\/\/((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?))?((((?:[^?#\/]*\/)*)([^?#]*))(?:\?([^#]*))?(?:#(.*))?)/
As you can see, you're better off just using the "library". :)
This should do it. I added a few cases for some nonmatches.
var cases = [
"siphone.com",
"qwr.siphone.com",
"www.qwr.siphone.com",
"qw.rock.siphone.com",
"<http://www.qwr.siphone.com>",
"hamstar.corm",
"cheese.net",
"bro.at.me.come",
"http://www.qwr.siphone.com/default.htm"];
var grabCom = function(str) {
// require a non-word character or the end of the string right after ".com"
var result = str.match(/(\w+\.com)(?:\W|$)/);
if(result !== null)
return result[1];
return null;
};
for(var i = 0; i < cases.length; i++) {
console.log(grabCom(cases[i]));
}
var myStrings = [
'siphone.com',
'qwr.siphone.com',
'www.qwr.siphone.com',
'qw.rock.siphone.com',
'<http://www.qwr.siphone.com>'
];
for (var i = 0; i < myStrings.length; i++) {
document.write( myStrings[i] + '=' + myStrings[i].match(/[\w]+\.(com)/gi) + '<br><br>');
}
I've placed the given demo strings in the myStrings array.
i is the index used to iterate through this array. The following line does the matching trick:
myStrings[i].match(/[\w]+\.(com)/gi)
and returns the value siphone.com. If you'd like to match .net and so on, use (com|net|other) instead of just (com).
Also, you may find the following link useful: Regular expressions Cheat Sheet
Update: the missed case works too.
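You could split the string and then search the parts for "com", like so:
var url = 'music.google.com';
var parts = url.split('.');
var hasCom = false;
// loop over the values, not the keys, of the parts array
for (var i = 0; i < parts.length; i++) {
if (parts[i] == 'com') {
hasCom = true;
}
}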
uri = "foo.bar.baz.com"
uri.split(".").slice(-2).join(".") // returns baz.com
This assumes that you want just the hostname and TLD. It also assumes that there is no path information.
Updated: now that you also need to handle URIs with paths, strip the scheme and the path before splitting on dots:
uri.replace(/^\w+:\/\//, "").split("/")[0].split(".").slice(-2).join(".")
Use a regexp to do that. This way modifications to the detection are quite easy.
var url = 'www.siphone.com';
var domain = url.match(/[^.]+\.com/i)[0];
If you use url.match(/(([^.]+)\.com)(?![a-z])/i)[1] instead, you can ensure that the ".com" is not followed by any other letters.
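Another option is to let the built-in URL parser find the hostname and only then keep the last two labels. This is a sketch, assuming an environment where the global URL constructor exists (modern browsers and Node, not old IE) and assuming a two-label domain such as siphone.com; it would give the wrong answer for suffixes like co.uk. baseDomain is a hypothetical name.

```javascript
// Hypothetical helper: returns the last two labels of the hostname.
function baseDomain(input) {
  // Drop the angle brackets that wrap one of the sample inputs.
  var cleaned = input.trim().replace(/^<|>$/g, "");
  // The URL constructor needs a scheme, so add a placeholder one if missing.
  var hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(cleaned);
  var host = new URL(hasScheme ? cleaned : "http://" + cleaned).hostname;
  return host.split(".").slice(-2).join(".");
}

console.log(baseDomain("http://www.qwr.siphone.com/default.htm")); // "siphone.com"
```

The path, query string and port are all handled by the parser, so the string work is limited to the hostname itself.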

Matching a string only if it is not in <script> or <a> tags

I'm working on a browser plugin that replaces all instances of "someString" (as defined by a complicated regex) with $1. This generally works ok just doing a global replace on the body's innerHTML. However it breaks the page when it finds (and replaces) the "someString" inside <script> tags (i.e. as a JS variable or other JS reference). It also breaks if "someString" is already part of an anchor.
So basically I want to do a global replace on all instances of "someString" unless it falls inside a <script></script> or <a></a> tag set.
Essentially what I have now is:
var body = document.getElementsByTagName('body')[0].innerHTML;
body = body.replace(/(someString)/gi, '$1');
document.getElementsByTagName('body')[0].innerHTML = body;
But obviously that's not good enough. I've been struggling for a couple hours now and reading all of the answers here (including the many adamant ones that insist regex should not be used with HTML), so I'm open to suggestions on how to do this. I'd prefer using straight JS, but can use jQuery if necessary.
Edit - Sample HTML:
<body>
someString
<script type="text/javascript">
var someString = 'blah';
console.log(someString);
</script>
someString
</body>
In that case, only the very first instance of "someString" should be replaced.
Try this and see if it meets your needs (tested in IE 8 and Chrome).
<script src="jquery-1.4.4.js" type="text/javascript"></script>
<script>
var pattern = /(someString)/gi;
var replacement = "$1";
$(function() {
$("body :not(a,script)")
.contents()
.filter(function() {
return this.nodeType == 3 && this.nodeValue.search(pattern) != -1;
})
.each(function() {
var span = document.createElement("span");
span.innerHTML = "&nbsp;" + $.trim(this.nodeValue.replace(pattern, replacement));
this.parentNode.insertBefore(span, this);
this.parentNode.removeChild(this);
});
});
</script>
The code uses jQuery to find all the text nodes within the document's <body> that are not in <a> or <script> blocks and that contain the search pattern. Once those are found, a span containing the target node's modified content is injected, and the old text node is removed.
The only issue I saw was that IE 8 handles text nodes containing only whitespace differently than Chrome, so sometimes a replacement would lose a leading space, hence the insertion of the non-breaking space before the text containing the regex replacements.
Well, you can use XPath with Mozilla (assuming you're writing the plugin for Firefox). The call is document.evaluate. Or you can use an XPath library to do it (there are a few out there).
var matches = document.evaluate(
'//*[not(name() = "a") and not(name() = "script") and contains(., "string")]',
document,
null,
XPathResult.UNORDERED_NODE_ITERATOR_TYPE,
null
);
Then replace using a callback function:
var callback = function(node) {
var text = node.nodeValue;
text = text.replace(/(someString)/gi, '$1');
var div = document.createElement('div');
div.innerHTML = text;
// childNodes is live, so move nodes out until the div is empty
while (div.firstChild) {
node.parentNode.insertBefore(div.firstChild, node);
}
node.parentNode.removeChild(node);
};
var nodes = [];
//cache the tree since we want to modify it as we iterate
var node = matches.iterateNext();
while (node) {
nodes.push(node);
node = matches.iterateNext();
}
for (var key = 0, length = nodes.length; key < length; key++) {
node = nodes[key];
// Check for a Text node
if (node.nodeType == Node.TEXT_NODE) {
callback(node);
} else {
for (var i = 0, l = node.childNodes.length; i < l; i++) {
var child = node.childNodes[i];
if (child.nodeType == Node.TEXT_NODE) {
callback(child);
}
}
}
}
I know you don't want to hear this, but this doesn't sound like a job for a regex. Regular expressions don't do negative matches very well before becoming complicated and unreadable.
Perhaps this regex might be close enough though:
/>[^<]*(someString)[^<]*</
It captures any instance of someString that is between a > and a <.
Another idea is if you do use jQuery, you can use the :contains pseudo-selector.
$('*:contains(someString)').each(function(i)
{
var markup = $(this).html();
// modify markup to insert anchor tag
$(this).html(markup)
});
This will grab any DOM element that contains 'someString' in its text. I don't think it will traverse <script> tags, so you should be good.
You could try the following:
/(someString)(?![^<]*?(<\/a>|<\/script>))/
I didn't test every scenario, but it basically uses a negative lookahead to look for the next opening bracket following someString; if that bracket is part of an anchor or script closing tag, it does not match.
Your example seems to work in this fiddle, although it certainly doesn't cover all possibilities. In cases where the innerHTML in your <a></a> contains tags (like <b> or <span>), or the code in your script tags generates html (contains strings with tags in it), you would need something more complex.
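If you do stay with a string-based approach, a common variation is to let the regex consume whole <script> and <a> blocks first and hand them back unchanged, so that only matches outside them are rewritten. The sketch below assumes the blocks are well formed and not nested; replaceOutsideTags and wrap are hypothetical names, and a match sitting inside some other tag's attribute would still be rewritten.

```javascript
// Hypothetical helper: wrap(text) returns the replacement for one match.
function replaceOutsideTags(html, pattern, wrap) {
  // Alternative 1 swallows complete <script>...</script> and <a>...</a> blocks;
  // alternative 2 is the caller's pattern.
  var combined = new RegExp(
    "(<script\\b[\\s\\S]*?<\\/script>|<a\\b[\\s\\S]*?<\\/a>)|(" + pattern.source + ")",
    "gi"
  );
  return html.replace(combined, function (whole, skipped, hit) {
    // A skipped block is returned untouched; anything else gets wrapped.
    return skipped !== undefined ? skipped : wrap(hit);
  });
}

var sampleHtml = 'x someString <a href="#">someString</a> ' +
  '<script>var someString = 1;</script> someString';
var replaced = replaceOutsideTags(sampleHtml, /someString/, function (text) {
  return "[" + text + "]";
});
console.log(replaced);
```

Walking the DOM's text nodes, as in the jQuery and XPath answers, remains the more robust choice when the page structure is not under your control.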

Regex: how to get contents from tag inner (use javascript)?

Page contents:
aa<b>1;2'3</b>hh<b>aaa</b>..
.<b>bbb</b>
blabla..
I want to get this result:
1;2'3aaabbb
The tags to match are <b> and </b>.
How do I write this regex in JavaScript?
Thanks!
Lazyanno,
If and only if:
you have read SLaks's post (as well as the previous article he links to), and
you fully understand the numerous and wondrous ways in which extracting information from HTML using regular expressions can break, and
you are confident that none of the concerns apply in your case (e.g. you can guarantee that your input will never contain nested, mismatched etc. <b>/</b> tags or occurrences of <b> or </b> within <script>...</script> or comment <!-- .. --> tags, etc.)
you absolutely and positively want to proceed with regular expression extraction
...then use:
var str = "aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..";
var match, result = "", regex = /<b>(.*?)<\/b>/ig;
while (match = regex.exec(str)) { result += match[1]; }
alert(result);
Produces:
1;2'3aaabbb
You cannot parse HTML using regular expressions.
Instead, you should use Javascript's DOM.
For example (using jQuery):
var text = "";
$('<div>' + htmlSource + '</div>')
.find('b')
.each(function() { text += $(this).text(); });
I wrap the HTML in a <div> tag to find both nested and non-nested <b> elements.
Here is an example without a jQuery dependency:
// get all elements with a certain tag name
var b = document.getElementsByTagName("B");
// getElementsByTagName returns an HTMLCollection, which has no map(),
// so borrow Array.prototype.map to build an array from the function results...
var text = Array.prototype.map.call(b, function(element) {
// ...in this case we are interested in the element text
if (typeof element.textContent != "undefined")
return element.textContent; // standards compliant browsers
else
return element.innerText; // IE
});
// now that we have an array of strings, we can join it
var result = text.join('');
var regex = /(<([^>]+)>)/ig;
var bdy="aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..";
var result =bdy.replace(regex, "");
alert(result) ;
See : http://jsfiddle.net/abdennour/gJ64g/
If you want to use regular expressions, just put a '?' character after the quantifier in the pattern for your inner text, so that it matches as little as possible.
For example, change:
".*" to "(.*?)"
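A short sketch of what that '?' changes, using a made-up sample string:

```javascript
var sample = "<b>one</b> and <b>two</b>";

// Greedy: .* runs to the end and backs up to the LAST </b>.
var greedy = sample.match(/<b>(.*)<\/b>/)[1];

// Lazy: .*? stops at the FIRST </b> it can reach.
var lazy = sample.match(/<b>(.*?)<\/b>/)[1];

console.log(greedy); // "one</b> and <b>two"
console.log(lazy);   // "one"
```

Note that . does not match newlines, so content spanning several lines would need [\s\S]*? instead.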
