How to check if an element has duplicated attributes with cheerio js


I'm parsing HTML files with cheerio (to later test with Mocha). The HTML elements in these files can have lots of attributes, and I want to check whether an attribute is repeated within the same element.
Example partial file that has an element with a repeated "class" attribute:
<div class="logo-center" data-something-very-long="something long" ... class="logo" data-more-stuff>
Here is the code that loads the file:
var fileContents = fs.readFileSync(file, "utf8");
var $ = cheerio.load(fileContents);
Note: it doesn't have to be a class attribute, it could be any other attribute that repeats.

Parse the element under test again. For that to work, you need to dive a bit into the raw DOM object produced by cheerio/htmlparser2. It uses properties that are documented for domhandler, but not for cheerio, so some care with the versions might be needed. I have tested with
└─┬ cheerio@1.0.0-rc.1
├─┬ htmlparser2@3.9.2
│ ├── domhandler@2.4.1
I have formulated this ES6-style, but you could do the same as easily with older, more conventional constructs.
The RegExp may need some refining, though, depending on your expectations on the files you are testing.
const fileContents = fs.readFileSync(file, "utf8");
const $ = cheerio.load(fileContents, {
useHtmlParser2: true,
withStartIndices: true,
withEndIndices: true
});
function getDuplicateAttributes ($elem) {
const dom = $elem.get(0);
// identify tag text position in string
const start = dom.startIndex;
const end = dom.children.length ? dom.children[0].startIndex : dom.endIndex + 1;
// extract
const html = fileContents.slice(start, end);
// generator function loops through all attribute matches on the html string
function* multivals (attr) {
const re = new RegExp(`\\s${attr}="(.*?)"`, 'g');
let match;
while((match = re.exec(html)) !== null) {
// yield each property value found for the attr name
yield match[1];
}
}
// the DOM will contain all attribute names once
const doubleAttributeList = Object.keys(dom.attribs)
// compound attribute names with all found values
.map((attr) => {
const matchIterator = multivals(attr);
return [attr, Array.from(matchIterator)];
})
// filter for doubles
.filter((entry) => entry[1].length > 1);
return new Map(doubleAttributeList);
}
You haven't stated what you want to do once you have found doubles, so they are just returned.
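One refinement the RegExp may need: the attribute names are interpolated into the pattern unescaped, and only double-quoted values are matched. The following is a minimal sketch, not part of the tested answer, that only counts name occurrences in the extracted start-tag string. The helper names escapeRegExp and countAttribute are hypothetical, and it assumes attribute values do not themselves contain text that looks like another attribute.

```javascript
// Escape characters that have a special meaning inside a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Count how often an attribute name occurs in the text of one start tag.
// Assumption: attribute values do not contain " name=" style text themselves.
function countAttribute(startTagHtml, attr) {
  const re = new RegExp('\\s' + escapeRegExp(attr) + '(?=[\\s=>/])', 'gi');
  return (startTagHtml.match(re) || []).length;
}

const tag = '<div class="logo-center" data-x="1" class="logo" data-more-stuff>';
console.log(countAttribute(tag, 'class'));           // 2
console.log(countAttribute(tag, 'data-x'));          // 1
console.log(countAttribute(tag, 'data-more-stuff')); // 1
```

Because the lookahead accepts whitespace, "=", ">" and "/", valueless attributes such as data-more-stuff are counted as well.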

@ccprog's answer worked; here is a small ES5 refactor:
var file = 'some file';
var fileContents = fs.readFileSync(file, 'utf8');
var $ = cheerio.load(fileContents, {
useHtmlParser2: true,
withStartIndices: true,
withEndIndices: true
});
function getDuplicateAttributes ($elem) {
var dom = $elem.get(0);
// identify tag text position in fileContents
var start = dom.startIndex;
var end = dom.children.length ? dom.children[0].startIndex : dom.endIndex + 1;
// extract
var html = fileContents.slice(start, end);
// the DOM will contain all attribute names once
return Object.keys(dom.attribs)
// compound attribute names with all found values
.map(function (attr) {
// modify regexp to capture values if needed
var regexp = new RegExp('\\s' + attr + '[\\s>=]', 'g');
return (html.match(regexp) || []).length > 1 ? attr : null;
})
// filter for doubles
.filter(function (attr) {
return attr !== null;
});
}
var duplicatedAttrs = getDuplicateAttributes($(".some-elem"));
Changes in this version:
removes the generator
converts ES6 to ES5
improves the RegExp
uses string.match() instead of regexp.exec().

Related

How to extract the content of the first paragraph in html string react native

I am working on a react native project and I have an html string json api response.
I am using react-native-render-html to render it, and I can get all paragraphs and apply specific settings such as the number of lines. However, I want to get only the first paragraph in the response.
const response = '<p>text1</p> <p>text2</p> <p>text3</p>';
Is it possible to write a regular expression to get only the content of the first paragraph, which here would be text1?

I don't use React Native but in javascript you could do something like that:
const paragraphs = response.split("</p>")
const firstParagraph = paragraphs[0]+'</p>';
Or with a regex you can do something like that:
// extract all paragraphs from the string
const matches = [];
response.replace(/<p>(.*?)<\/p>/g, function () {
//use arguments[0] if you need to keep <p></p> html tags
matches.push(arguments[1]);
});
// get first paragraph
const firstParagraph = (matches.length) ? matches[0] : ""
Or like that (I think it is the best way in your case)
const response='<p>text1</p> <p>text2</p> <p>text3</p>';
const regex = /<p>(.*?)<\/p>/;
const corresp = regex.exec(response);
const firstParagraph = (corresp) ? corresp[0] : "" // <p>text1</p>
const firstParagraphWithoutHtml = (corresp) ? corresp[1] : "" // text1
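The patterns above assume a bare <p> tag and single-line content, since "." does not match line breaks. A minimal sketch that also tolerates attributes on the tag and newlines inside the paragraph could look like this; firstParagraphText is a hypothetical name, and the sketch assumes paragraphs are not nested.

```javascript
// Return the inner text of the first <p>...</p> pair, or "" if none exists.
// Assumption: paragraphs are not nested and the markup is reasonably regular.
function firstParagraphText(html) {
  const match = /<p\b[^>]*>([\s\S]*?)<\/p>/i.exec(html);
  return match ? match[1].trim() : '';
}

console.log(firstParagraphText('<p>text1</p> <p>text2</p> <p>text3</p>'));       // text1
console.log(firstParagraphText('<p class="lead">\n  first\n</p><p>second</p>')); // first
console.log(firstParagraphText('<div>no paragraph</div>'));                      // (empty string)
```

The \b after "p" keeps tags such as <pre> from being mistaken for a paragraph.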

Hope it will help
var response='<p>text1</p> <p>text2</p> <p>text3</p>';
var firstParagraphElement=response.split("</p>")[0] //firstparagraphElement="<p>text1"
var paragraphContent=firstParagraphElement.replace("<p>","") //paragraphContent="text1"

In React Native you can also use parse5 to extract a string from HTML code. I have used this code in a project for doing so:
import parse5 from 'parse5'
const isText = (tagName): Boolean => tagName === '#text'
const processNode = (node): String => {
const nodeName = node.nodeName
if (isText(nodeName)) {
return node.value
}
if (!node.childNodes) {
return ''
}
return node.childNodes.map((child, index) => processNode(child)).join(' ')
}
export const htmlToText = (html): String => {
const root = parse5.parseFragment(html)
return processNode(root).replace(/\s+/g, ' ').trim()
}
Here is a simple JEST test for the function above:
test('when full document htmlToText should get text', () => {
const htmlToText1 = htmlToText("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>")
expect(htmlToText1)
.toBe(`titleTest test01 test02 test03`);
});

JSON route matching via regex

Consider I have following JSON object
var urls = {
"GET/users/:id":1,
"POST/users":0
}
and if I have string "GET/users/10". How can I use this as key to get the value from urls JSON i.e. "GET/users/10" should match "GET/users/:id".
I don't want to iterate urls JSON and use regex for every key.
Is there a way to access JSON object using regex?
Thanks in advance.

Here is something that should work for you. I took some of the pieces from the Durandal router's RegEx matching logic which basically dynamically creates a regular expression object based on a defined route string and then tests with it against a passed string.
Here is the working example:
var urls = {
"GET/users/:id": 1,
"POST/users": 0
}
const getRouteRegExp = (
routeString,
routesAreCaseSensitive = false,
optionalParam = /\((.*?)\)/g,
namedParam = /(\(\?)?:\w+/g,
splatParam = /\*\w+/g,
escapeRegExp = /[\-{}\[\]+?.,\\\^$|#\s]/g
) => {
routeString = routeString.replace(escapeRegExp, '\\$&')
.replace(optionalParam, '(?:$1)?')
.replace(namedParam, function(match, optional) {
return optional ? match : '([^\/]+)';
})
.replace(splatParam, '(.*?)');
return new RegExp('^' + routeString + '$', routesAreCaseSensitive ? undefined : 'i');
}
const getRouteByString = (string) => {
var resultArr = Object.entries(urls).find(([k, v]) => {
var regEx = getRouteRegExp(k)
return regEx.test(string)
}) || []
return resultArr[0]
}
console.log(getRouteByString('GET/users/10'))
console.log(getRouteByString('POST/users'))
console.log(getRouteByString('POST/users2'))
So what you have is the getRouteRegExp function which is the main thing here which would compose a regular expression object based on a passed route.
After that we go and for each existing route defined in urls we create one RegExp and try to match it against the provided string route. This is what the find does. If one is found we return it.
Since we are doing Object.entries we return the 0 index which contains the result.
Since this comes straight from the Durandal bits it supports all the route expressions that are built in Durandal ... like:
Static route: tickets
Parameterized: tickets/:id
Optional Parameter: users(/:id)
Splat Route: settings*details
You can read more about Durandal Router here
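If building a RegExp on every lookup is the concern, the patterns can be compiled once up front. The following is a simplified sketch, not the Durandal code: it only understands static segments and ":name" parameters (no optional or splat parts), it still loops over the compiled routes, and the names escapeSegment, compiledRoutes and matchRoute are hypothetical.

```javascript
const urls = {
  'GET/users/:id': 1,
  'POST/users': 0
};

// Escape RegExp metacharacters in a literal path segment.
const escapeSegment = (segment) => segment.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Compile every route key once; ":name" segments become capture groups.
const compiledRoutes = Object.keys(urls).map((key) => {
  const paramNames = [];
  const source = key
    .split('/')
    .map((segment) => {
      if (segment.startsWith(':')) {
        paramNames.push(segment.slice(1));
        return '([^/]+)';
      }
      return escapeSegment(segment);
    })
    .join('/');
  return { key, paramNames, regex: new RegExp('^' + source + '$') };
});

// Look up a concrete path; returns the route key, its value and the parameters.
function matchRoute(path) {
  for (const { key, paramNames, regex } of compiledRoutes) {
    const match = regex.exec(path);
    if (match) {
      const params = {};
      paramNames.forEach((name, i) => { params[name] = match[i + 1]; });
      return { key, value: urls[key], params };
    }
  }
  return undefined;
}

console.log(matchRoute('GET/users/10')); // { key: 'GET/users/:id', value: 1, params: { id: '10' } }
console.log(matchRoute('POST/users2'));  // undefined
```

Returning the captured parameters alongside the value avoids having to parse the path a second time.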

From your question what I can understand is your key is dynamic, so you can do something like this:
var urls = {
"GET/users/:id":1,
"POST/users":0
}
let id = 10
const yourValue = urls["GET/users/" + id]

You can use this code:
var urls = {
"GET/users/:id":1,
"POST/users":0
}
var regex = /"([^"]+?)"\s*/g;
var urlsJson = JSON.stringify(urls);
let result = regex.exec(urlsJson)
if(result && result.length > 0) {
var keyJson = result[1];
var value = urls[keyJson]
console.log('value', value)
}

Try Something like this:
const urls = (id) => ({
[`GET/users/${id}`]:1,
"POST/users":0,
});
console.log(urls(2));
I hope it may be helpful.

The json would look fine, just do a replace on the url, so replace the ending integer with :id and then you have the key by which you can directly access the value in the json.
So:
var url = "GET/users/10";
var urls = {
"GET/users/:id":1,
"POST/users":0
}
url = url.replace(/users\/\d+/, 'users/:id');
console.log(urls[url]);
Do as many replaces on the URL as needed to convert all possible URLs to the keys in your JSON.

Add arrays into multi-dimensional array or object

I'm parsing content generated by a wysiwyg into a table of contents widget in React.
So far I'm looping through the headers and adding them into an array.
How can I get them all into one multi-dimensional array or object (what's the best way) so that it looks more like:
h1-1
h2-1
h3-1
h1-2
h2-2
h3-2
h1-3
h2-3
h3-3
and then I can render it with an ordered list in the UI.
const str = "<h1>h1-1</h1><h2>h2-1</h2><h3>h3-1</h3><p>something</p><h1>h1-2</h1><h2>h2-2</h2><h3>h3-2</h3>";
const patternh1 = /<h1>(.*?)<\/h1>/g;
const patternh2 = /<h2>(.*?)<\/h2>/g;
const patternh3 = /<h3>(.*?)<\/h3>/g;
let h1s = [];
let h2s = [];
let h3s = [];
let matchh1, matchh2, matchh3;
while (matchh1 = patternh1.exec(str))
h1s.push(matchh1[1])
while (matchh2 = patternh2.exec(str))
h2s.push(matchh2[1])
while (matchh3 = patternh3.exec(str))
h3s.push(matchh3[1])
console.log(h1s)
console.log(h2s)
console.log(h3s)

I don't know about you, but I hate parsing HTML using regexes. Instead, I think it's a better idea to let the DOM handle this:
const str = `<h1>h1-1</h1>
<h3>h3-1</h3>
<h3>h3-2</h3>
<p>something</p>
<h1>h1-2</h1>
<h2>h2-2</h2>
<h3>h3-2</h3>`;
const wrapper = document.createElement('div');
wrapper.innerHTML = str.trim();
let tree = [];
let leaf = null;
for (const node of wrapper.querySelectorAll("h1, h2, h3, h4, h5, h6")) {
const nodeLevel = parseInt(node.tagName[1]);
const newLeaf = {
level: nodeLevel,
text: node.textContent,
children: [],
parent: leaf
};
while (leaf && newLeaf.level <= leaf.level)
leaf = leaf.parent;
if (!leaf)
tree.push(newLeaf);
else
leaf.children.push(newLeaf);
leaf = newLeaf;
}
console.log(tree);
This answer does not require h3 to follow h2; h3 can follow h1 if you so please. If you want to turn this into an ordered list, that can also be done:
const str = `<h1>h1-1</h1>
<h3>h3-1</h3>
<h3>h3-2</h3>
<p>something</p>
<h1>h1-2</h1>
<h2>h2-2</h2>
<h3>h3-2</h3>`;
const wrapper = document.createElement('div');
wrapper.innerHTML = str.trim();
let tree = [];
let leaf = null;
for (const node of wrapper.querySelectorAll("h1, h2, h3, h4, h5, h6")) {
const nodeLevel = parseInt(node.tagName[1]);
const newLeaf = {
level: nodeLevel,
text: node.textContent,
children: [],
parent: leaf
};
while (leaf && newLeaf.level <= leaf.level)
leaf = leaf.parent;
if (!leaf)
tree.push(newLeaf);
else
leaf.children.push(newLeaf);
leaf = newLeaf;
}
const ol = document.createElement("ol");
(function makeOl(ol, leaves) {
for (const leaf of leaves) {
const li = document.createElement("li");
li.appendChild(new Text(leaf.text));
if (leaf.children.length > 0) {
const subOl = document.createElement("ol");
makeOl(subOl, leaf.children);
li.appendChild(subOl);
}
ol.appendChild(li);
}
})(ol, tree);
// add it to the DOM
document.body.appendChild(ol);
// or get it as text
const result = ol.outerHTML;
Since the HTML is parsed by the DOM and not by a regex, this solution will not encounter any errors if the h1 tags have attributes, for example.
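One thing to keep in mind with the tree built above: the parent back-references make it circular, so JSON.stringify(tree) would throw. If the tree needs to be serialized or stored, a copy without parent can be made. This is a minimal sketch with a hypothetical helper name, toPlainTree; the sample data is built by hand in the same shape so that it also runs outside a browser.

```javascript
// Recursively copy the leaves, keeping only level, text and children.
function toPlainTree(leaves) {
  return leaves.map(({ level, text, children }) => ({
    level,
    text,
    children: toPlainTree(children)
  }));
}

// Hand-built sample in the same shape the loop above produces.
const root = { level: 1, text: 'h1-1', children: [], parent: null };
const child = { level: 3, text: 'h3-1', children: [], parent: root };
root.children.push(child);

console.log(JSON.stringify(toPlainTree([root])));
// [{"level":1,"text":"h1-1","children":[{"level":3,"text":"h3-1","children":[]}]}]
```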

You can simply gather all h* and then iterate over them to construct a tree as such:
Using ES6 (I inferred this is ok from your usage of const and let)
const str = `
<h1>h1-1</h1>
<h2>h2-1</h2>
<h3>h3-1</h3>
<p>something</p>
<h1>h1-2</h1>
<h2>h2-2</h2>
<h3>h3-2</h3>
`
const patternh = /<h(\d)>(.*?)<\/h(\d)>/g;
let hs = [];
let matchh;
while (matchh = patternh.exec(str))
hs.push({ lev: matchh[1], text: matchh[2] })
console.log(hs)
// constructs a tree with the format [{ value: ..., children: [{ value: ..., children: [...] }, ...] }, ...]
const add = (res, lev, what) => {
if (lev === 0) {
res.push({ value: what, children: [] });
} else {
add(res[res.length - 1].children, lev - 1, what);
}
}
// reduces all hs found into a tree using above method starting with an empty list
const tree = hs.reduce((res, { lev, text }) => {
add(res, lev-1, text);
return res;
}, []);
console.log(tree);
But because your html headers are not in a tree structure themselves (which I guess is your use case) this only works under certain assumptions, e.g. you cannot have a <h3> unless there's a <h2> above it and a <h1> above that. It will also assume a lower-level header will always belong to the latest header of an immediately higher level.
If you want to further use the tree structure for e.g. rendering a representative ordered-list for a TOC, you can do something like:
// function to render a bunch of <li>s
const renderLIs = children => children.map(child => `<li>${renderOL(child)}</li>`).join('');
// function to render an <ol> from a tree node
const renderOL = tree => tree.children.length > 0 ? `<ol>${tree.value}${renderLIs(tree.children)}</ol>` : tree.value;
// use a root node for the TOC
const toc = renderOL({ value: 'TOC', children: tree });
console.log(toc);
Hope it helps.
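The assumption noted above (no h3 without an h2 above it) can be relaxed by tracking the open headings on a stack instead of recursing by level. A minimal sketch, using the same { lev, text } input and { value, children } output shapes; buildOutline is a hypothetical name.

```javascript
// Build a nested outline from a flat list of { lev, text } headings.
// A stack holds the chain of currently open headings, deepest last.
function buildOutline(headings) {
  const roots = [];
  const stack = [];
  for (const { lev, text } of headings) {
    const node = { value: text, children: [] };
    const level = Number(lev);
    // Close every open heading that is not a strict ancestor of this one.
    while (stack.length > 0 && stack[stack.length - 1].level >= level) {
      stack.pop();
    }
    const siblings = stack.length > 0 ? stack[stack.length - 1].node.children : roots;
    siblings.push(node);
    stack.push({ level, node });
  }
  return roots;
}

const flat = [
  { lev: '1', text: 'h1-1' },
  { lev: '3', text: 'h3-1' }, // h3 directly under h1, no h2 in between
  { lev: '1', text: 'h1-2' },
  { lev: '2', text: 'h2-2' }
];
console.log(JSON.stringify(buildOutline(flat)));
// [{"value":"h1-1","children":[{"value":"h3-1","children":[]}]},{"value":"h1-2","children":[{"value":"h2-2","children":[]}]}]
```

A heading is attached to the nearest preceding heading with a smaller level number, so skipped levels simply end up one step shallower in the tree.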

What you want to do is known as (a variant of a) document outline, eg. creating a nested list from the headings of a document, honoring their hierarchy.
A simple implementation for the browser using the DOM and DOMParser APIs goes as follows (put into a HTML page and coded in ES5 for easy testing):
<!DOCTYPE html>
<html>
<head>
<title>Document outline</title>
</head>
<body>
<div id="outline"></div>
<script>
// test string wrapped in a document (and body) element
var str = "<html><body><h1>h1-1</h1><h2>h2-1</h2><h3>h3-1</h3><p>something</p><h1>h1-2</h1><h2>h2-2</h2><h3>h3-2</h3></body></html>";
// util for traversing a DOM and emit SAX startElement events
function emitSAXLikeEvents(node, handler) {
handler.startElement(node)
for (var i = 0; i < node.children.length; i++)
emitSAXLikeEvents(node.children.item(i), handler)
handler.endElement(node)
}
var outline = document.getElementById('outline')
var rank = 0
var context = outline
emitSAXLikeEvents(
(new DOMParser()).parseFromString(str, "text/html").body,
{
startElement: function(node) {
if (/h[1-6]/.test(node.localName)) {
var newRank = +node.localName.substr(1, 1)
// set context li node to append
while (newRank <= rank--)
context = context.parentNode.parentNode
rank = newRank
// create (if 1st li) or
// get (if 2nd or subsequent li) ol element
var ol
if (context.children.length > 0)
ol = context.children[0]
else {
ol = document.createElement('ol')
context.appendChild(ol)
}
// create and append li with text from
// heading element
var li = document.createElement('li')
li.appendChild(
document.createTextNode(node.innerText))
ol.appendChild(li)
context = li
}
},
endElement: function(node) {}
})
</script>
</body>
</html>
I'm first parsing your fragment into a Document, then traverse it to create SAX-like startElement() calls. In the startElement() function, the rank of a heading element is checked against the rank of the most recently created list item (if any). Then a new list item is appended at the correct hierarchy level, and possibly an ol element is created as container for it. Note the algorithm as it is won't work with "jumping" from h1 to h3 in the hierarchy, but can be easily adapted.
If you want to create an outline/table of content on node.js, the code could be made to run server-side, but requires a decent HTML parsing lib (a DOMParser polyfill for node.js, so to speak). There are also the https://github.com/h5o/h5o-js and the https://github.com/hoyois/html5outliner packages for creating outlines, though I haven't tested those. These packages supposedly can also deal with corner cases such as heading elements in iframe and quote elements which you generally don't want in the outline of your document.
The topic of creating an HTML5 outline has a long history; see eg. http://html5doctor.com/computer-says-no-to-html5-document-outline/. HTML4's practice of using no sectioning roots (in HTML5 parlance) wrapper elements for sectioning and placing headings and content at the same hierarchy level is known as "flat-earth markup". SGML has the RANK feature for dealing with H1, H2, etc. ranked elements, and can be made to infer omitted section elements, thus automatically create an outline, from HTML4-like "flat earth markup" in simple cases (eg. where only section or another single element is allowed as sectioning root).

I'll use a single regex to get the <hx></hx> contents and then group them by x using Array.reduce.
Here is the base, but it's not finished yet:
// The string you need to parse
const str = "\
<h1>h1-1</h1>\
<h2>h2-1</h2>\
<h3>h3-1</h3>\
<p>something</p>\
<h1>h1-2</h1>\
<h2>h2-2</h2>\
<h3>h3-2</h3>";
// The regex that will cut down the <hx>something</hx>
const regex = /<h[0-9]{1}>(.*?)<\/h[0-9]{1}>/g;
// We get the matches now
const matches = str.match(regex);
// We match the hx togethers as requested
const matchesSorted = Object.values(matches.reduce((tmp, x) => {
// We get the number behind hx ---> the x
const hNumber = x[2];
// If the container do not exist, create it
if (!tmp[hNumber]) {
tmp[hNumber] = [];
}
// Push the new parsed content into the array
// 4 is to start after <hx>
// length - 9 is to get all except <hx></hx>
tmp[hNumber].push(x.substr(4, x.length - 9));
return tmp;
}, {}));
console.log(matchesSorted);
As you are parsing HTML content, I want to make you aware of special cases such as the presence of \n or spaces. For example, look at the following non-working snippet:
// The string you need to parse
const str = "\
<h1>h1-1\n\
</h1>\
<h2> h2-1</h2>\
<h3>h3-1</h3>\
<p>something</p>\
<h1>h1-2 </h1>\
<h2>h2-2 \n\
</h2>\
<h3>h3-2</h3>";
// The regex that will cut down the <hx>something</hx>
const regex = /<h[0-9]{1}>(.*?)<\/h[0-9]{1}>/g;
// We get the matches now
const matches = str.match(regex);
// We match the hx togethers as requested
const matchesSorted = Object.values(matches.reduce((tmp, x) => {
// We get the number behind hx ---> the x
const hNumber = x[2];
// If the container do not exist, create it
if (!tmp[hNumber]) {
tmp[hNumber] = [];
}
// Push the new parsed content into the array
// 4 is to start after <hx>
// length - 9 is to get all except <hx></hx>
tmp[hNumber].push(x.substr(4, x.length - 9));
return tmp;
}, {}));
console.log(matchesSorted);
We need to add .replace() and .trim() in order to remove unwanted \n and spaces.
Use this snippet:
// The string you need to parse
const str = "\
<h1>h1-1\n\
</h1>\
<h2> h2-1</h2>\
<h3>h3-1</h3>\
<p>something</p>\
<h1>h1-2 </h1>\
<h2>h2-2 \n\
</h2>\
<h3>h3-2</h3>";
// Remove all unwanted \n
const preparedStr = str.replace(/(\r\n\t|\n|\r\t)/gm, "");
// The regex that will cut down the <hx>something</hx>
const regex = /<h[0-9]{1}>(.*?)<\/h[0-9]{1}>/g;
// We get the matches now
const matches = preparedStr.match(regex);
// We match the hx togethers as requested
const matchesSorted = Object.values(matches.reduce((tmp, x) => {
// We get the number behind hx ---> the x
const hNumber = x[2];
// If the container do not exist, create it
if (!tmp[hNumber]) {
tmp[hNumber] = [];
}
// Push the new parsed content into the array
// 4 is to start after <hx>
// length - 9 is to get all except <hx></hx>
// call trim() to remove unwanted spaces
tmp[hNumber].push(x.substr(4, x.length - 9).trim());
return tmp;
}, {}));
console.log(matchesSorted);
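The fixed substr offsets can be avoided by letting the regex capture the level and the text directly. This is a sketch of that variation, assuming an environment that supports String.prototype.matchAll (ES2020); groupHeadings is a hypothetical name.

```javascript
const html = '<h1>h1-1\n</h1><h2> h2-1</h2><p>something</p><h1>h1-2 </h1><h2>h2-2 \n</h2>';

// Group heading texts by level. \1 forces the closing tag to match the opening level,
// and [\s\S] also matches line breaks, so no pre-cleaning of "\n" is needed.
function groupHeadings(source) {
  const groups = {};
  for (const [, level, text] of source.matchAll(/<h([1-6])>([\s\S]*?)<\/h\1>/g)) {
    (groups[level] = groups[level] || []).push(text.trim());
  }
  return Object.values(groups);
}

console.log(groupHeadings(html));
// [ [ 'h1-1', 'h1-2' ], [ 'h2-1', 'h2-2' ] ]
```

Note that trim() only removes leading and trailing whitespace; line breaks in the middle of a heading are kept as they are.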

I wrote this code to work with jQuery. (Please don't downvote; someone may need a jQuery answer later.)
This recursive function creates li elements from the string, and if an item has children, it converts them to an ol.
const str =
"<div><h1>h1-1</h1><h2>h2-1</h2><h3>h3-1</h3></div><p>something</p><h1>h1-2</h1><h2>h2-2</h2><h3>h3-2</h3>";
function strToList(stri) {
const tags = $(stri);
function partToList(el) {
let output = "<li>";
if ($(el).children().length) {
output += "<ol>";
$(el)
.children()
.each(function() {
output += partToList($(this));
});
output += "</ol>";
} else {
output += $(el).text();
}
return output + "</li>";
}
let output = "<ol>";
tags.each(function(itm) {
output += partToList($(this));
});
return output + "</ol>";
}
$("#output").append(strToList(str));
li {
padding: 10px;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="output"></div>
(This code can be converted to pure JS easily)

Getting Attributes from User submitted text! RegExp?

I am trying to pull the attributes out of piece of submitted text in Javascript and change it to an array.
So the user submits this:
<iframe src="http://www.stackoverflow.com/" width="123" height="123" frameborder="1"></iframe>
and I would get:
arr['src'] = http://www.stackoverflow.com/
arr['width'] = 123
arr['height'] = 123
arr['frameborder'] = 1
Just need a regexp I think but any help would be great!

I recommend using a RegExp to parse user-input HTML instead of creating a DOM object, because it's not desirable to load external content (iframe, script, link, style, object, ...) when performing a "simple" task such as getting attribute values from an HTML string.
Using methods similar to those in my previous answer, I've created a function to match attribute values. Both quoted and non-quoted attributes are matched.
The code currently returns an object with attributes from the first tag, but it's easily extensible to retrieve all HTML elements (see bottom of answer).
Fiddle: http://jsfiddle.net/BP4nF/1/
// Example:
var htmlString = '<iframe src="http://www.stackoverflow.com/" width="123" height="123" frameborder="1" non-quoted=test></iframe>';
var arr = parseHTMLTag(htmlString);
//arr is the desired object. An easy method to verify:
alert(JSON.stringify(arr));
function parseHTMLTag(htmlString){
var tagPattern = /<[a-z]\S*(?:[^<>"']*(?:"[^"]*"|'[^']*'))*?[^<>]*(?:>|(?=<))/i;
var attPattern = /([-a-z0-9:._]+)\s*=(?:\s*(["'])((?:[^"']+|(?!\2).)*)\2|([^><\s]+))/ig;
// 1 = attribute, 2 = quote, 3 = value, 4=non-quoted value (either 3 or 4)
var tag = htmlString.match(tagPattern);
var attributes = {};
if(tag){ //If there's a tag match
tag = tag[0]; //Match the whole tag
var match;
while((match = attPattern.exec(tag)) !== null){
//match[1] = attribute, match[3] = value, match[4] = non-quoted value
attributes[match[1]] = match[3] || match[4];
}
}
return attributes;
}
The output of the example is equivalent to:
var arr = {
"src": "http://www.stackoverflow.com/",
"width": "123",
"height": "123",
"frameborder": "1",
"non-quoted": "test"
};
Extra: Modifying the function to get multiple matches (only showing code to update)
function parseHTMLTags(htmlString){
var tagPattern = /<([a-z]\S*)(?:[^<>"']*(?:"[^"]*"|'[^']*'))*?[^<>]*(?:>|(?=<))/ig;
// 1 = tag name
var attPattern = /([-a-z0-9:._]+)\s*=(?:\s*(["'])((?:[^"']+|(?!\2).)*)\2|([^><\s]+))/ig;
// 1 = attribute, 2 = quote, 3 = value, 4=non-quoted value (either 3 or 4)
var htmlObject = [];
var tag, match, attributes;
while(tag = tagPattern.exec(htmlString)){
attributes = {};
while(match = attPattern.exec(tag[0])){ // tag is a match array; tag[0] is the full tag text
attributes[match[1]] = match[3] || match[4];
}
htmlObject.push({
tagName: tag[1],
attributes: attributes
});
}
return htmlObject; //Array of all HTML elements
}

Assuming you're doing this client side, you're better off not using RegExp, but using the DOM:
var tmp = document.createElement("div");
tmp.innerHTML = userStr;
tmp = tmp.firstChild;
console.log(tmp.src);
console.log(tmp.width);
console.log(tmp.height);
console.log(tmp.frameBorder);
Just make sure you don't add the created element to the document without sanitizing it first. You might also need to loop over the created nodes until you get to an element node.

Assuming they will always enter an HTML element you could parse it and read the elements from the DOM, like so (untested):
var getAttributes = function(str) {
var a={}, div=document.createElement("div");
div.innerHTML = str;
var attrs=div.firstChild.attributes, len=attrs.length, i;
for (i=0; i<len; i++) {
a[attrs[i].nodeName] = attrs[i].nodeValue;
}
return a;
};
var x = getAttributes(inputStr);
x; // => {width:'123', height:123, src:'http://...', ...}

Instead of regexp, use pure JavaScript:
Grab iframe element:
var iframe = document.getElementsByTagName('iframe')[0];
and then access its properties using:
var arr = {
src : iframe.src,
width : iframe.width,
height : iframe.height,
frameborder : iframe.frameBorder
};

I would personally do this with jQuery, if possible. With it, you can create a DOM element without actually injecting it into your page and creating a potential security hazard.
var userTxt = '<iframe src="http://www.stackoverflow.com/" width="123" height="123" frameborder="1"></iframe>';
var userInput = $(userTxt);
console.log(userInput.attr('src'));
console.log(userInput.attr('width'));
console.log(userInput.attr('height'));
console.log(userInput.attr('frameborder'));

regular expression to extract all the attributes of a div

I have a requirement to extract all the attributes of some tag, so I want to use a regex for this. For example: <sometag attr1="val1" attr2="val2" ></sometag>. I want the attributes and values as name-value pairs.
Any help appreciated
thanks

var s = '<sometag attr1="val1" attr2="val2" ></sometag>';
var reg = /\s(\w+?)="(.+?)"/g;
while( true ) {
var res = reg.exec( s );
if( res !== null ) {
alert( 'name = '+res[1] );
alert( 'value = '+res[2] );
} else {
break;
}
}
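The pattern above only sees double-quoted values and \w names, so an attribute such as data-foo or a single-quoted value would be missed. A hedged sketch that also accepts those and returns the pairs instead of alerting them; attributePairs is a hypothetical name, and this is not a full HTML parser (it assumes no ">" inside attribute values).

```javascript
// Collect [name, value] pairs from the first start tag in the string.
// Pairs (rather than an object) keep repeated attribute names visible.
function attributePairs(html) {
  const startTag = (/<[a-z][^>]*>/i.exec(html) || [''])[0];
  const re = /\s([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  const pairs = [];
  let match;
  while ((match = re.exec(startTag)) !== null) {
    // Exactly one of the two value groups is defined for each match.
    pairs.push([match[1], match[2] !== undefined ? match[2] : match[3]]);
  }
  return pairs;
}

console.log(attributePairs('<sometag attr1="val1" data-x=\'val2\' ></sometag>'));
// [ [ 'attr1', 'val1' ], [ 'data-x', 'val2' ] ]
```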

preg_match_all( '/\s(\w+?)="(.+?)"/', '<sometag attr1="val1" attr2="val2" ></sometag>', $matches );
for( $i = 0; $i < count( $matches[1] ); ++$i ) {
$name = $matches[1][$i];
$value = $matches[2][$i];
echo 'name'.$i.' = "'.$name.'", value'.$i.' = "'.$value.'", ';
}
result:
name0 = "attr1", value0 = "val1", name1 = "attr2", value1 = "val2",
of course you need to tweak this to fit your need and deal with bad html.

You could use jQuery to get all attributes of an element:
$('sometag').getAttributes();
http://plugins.jquery.com/project/getAttributes

A regex is not required. It is much easier to use Element.attributes:
var attributes = element.attributes;
"Returns an array (NamedNodeMap) containing all the attributes defined for the element in question, including custom attributes." See the link for examples on how to access each attribute and its value.

You can't do this in native JavaScript by using a regular expression. Using native JavaScript you have a couple of basic options. You can enumerate all of the node's properties and intelligently filter to get just the things you want, like:
window.extractAttributes = function(node) {
var attribString = "";
var template = document.createElement(node.tagName);
template.innerHTML = node.innerHTML;
for (var key in node) {
if (typeof node[key] == "string" && node[key] != "" && ! template[key]) {
if (attribString.length > 0) {
attribString += ", ";
}
attribString += key + "=" + node[key];
}
}
return attribString;
};
Or you can use Element.attributes to iterate the list of declared attributes (note that this may not detect non-standard attribute values that are added dynamically at runtime), like:
window.extractAttributesAlternate = function(node) {
var attribString = "";
for (var index = 0; index < node.attributes.length; index++) {
if (attribString.length > 0) {
attribString += ", ";
}
attribString += node.attributes[index].name+ "=" + node.attributes[index].nodeValue;
}
return attribString;
};
Note that the first approach may not pick up custom attributes that have been defined in the page markup, and that the second approach may not pick up custom attributes that have been defined dynamically by JavaScript on the page.
Which gives us option 3. You can enumerate the attributes both ways, and then merge the results. This has the benefit of being able to reliably pick up upon any custom attributes no matter when/how they were added to the element.
Here's an example of all 3 options: http://jsfiddle.net/cgj5G/3/

You can use XML parser, because provided input is well-formed XML.

Do not use Regex for this! JavaScript's DOM already has all the information you need, easily accessible.
List all attributes of a DOM element:
var element = document.getElementById('myElementName');
var attributes = element.attributes;
for(var attr=0; attr<attributes.length; attr++) {
alert(attributes[attr].name+" = "+attributes[attr].nodeValue);
}
(tested above code in FF5, IE8, Opera11.5, Chrome12: Works in all of them, even with non-standard attributes)

Given the text of a single element which includes its start and end tags (equivalent of the element's outerHTML), the following function will return an object containing all of the attribute name=value pairs. Each attribute value can be single quoted, double quoted or un-quoted. Attribute values are optional and if not present will take on the attribute's name.
function getElemAttributes(elemText) {
// Regex to pick out start tag from start of element's HTML.
var re_start_tag = /^<\w+\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*\/?>/;
var start_tag = elemText.match(re_start_tag);
start_tag = start_tag ? start_tag[0] : '';
// Regex to pick out attribute name and (optional) value from start tag.
var re_attribs = /\s+([\w\-.:]+)(\s*=\s*(?:"([^"]*)"|'([^']*)'|([\w\-.:]+)))?/g;
var attribs = {}; // Store attribute name=value pairs in object.
var match = re_attribs.exec(start_tag);
while (match != null) {
var attrib = match[1]; // Attribute name in $1.
var value = match[1]; // Assume no value specified.
if (match[2]) { // If match[2] is set, then attribute has a value.
value = match[3] ? match[3] : // Attribute value is in $3, $4 or $5.
match[4] ? match[4] : match[5];
}
attribs[attrib] = value;
match = re_attribs.exec(start_tag);
}
return attribs;
}
Given this input:
<sometag attr1="val1" attr2='val2' attr3=val3 attr4 >TEST</sometag>
This is the output:
attrib = {
attr1: "val1",
attr2: "val2",
attr3: "val3",
attr4: "attr4"
};
