Regex for capturing repeated groups in JavaScript

I have some test data in the following format:
"lorem ipsum <img src='some_url' class='some_class' /> lorem ipsum <img src='some_url' class='some_class' /> ipsum <img src='some_url' class='some_class' />"
My goal is to identify all the image tags along with their respective source URLs and CSS classes, and to store them together with the remaining text in an ordered array like this:
["lorem ipsum", {imageObject1}, "lorem ipsum", {imageObject2}, "ipsum", {imageObject3}]
For this I tried the following regex:
var regex = /(.*(<img\s+src=['"](.+)['"]\s+(class=['"].+['"])?\s+\/>)+?.*)+/ig
When I run this regex against the sample text, I get:
regex.exec(sample_text) => [0:"lorem ipsum <img src='some_url1' class='some_class1' /> lorem ipsum <img src='some_url2' class='some_class2' /> ipsum <img src='some_url3' class='some_class3' />"
1:"lorem ipsum <img src='some_url1' class='some_class1' /> lorem ipsum <img src='some_url2' class='some_class2' /> ipsum <img src='some_url3' class='some_class3' />"
2:"<img src='some_url3' class='some_class3' />"
3:"some_url3"
4:"class='some_class3'"]
How can I transform the sample HTML text into an array of tagged HTML objects with their attributes in JavaScript?

Do not use regular expressions to parse HTML. Use a DOMParser to parse the string and then CSS queries to get the images from the DOM; it will be much more reliable and easier to read.
var html = "lorem ipsum <img src='some_url' class='some_class' /> lorem ipsum <img src='some_url' class='some_class' /> ipsum <img src='some_url' class='some_class' />"
var nodes = new DOMParser().parseFromString(html, "text/html").body.childNodes
That will get you almost what you wanted (just some empty Text nodes you can filter out).
Or do something a little bit more accurate like this in case you don't have just images and text in the HTML:
var images = new DOMParser().parseFromString(html, "text/html").querySelectorAll("img")
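// flatMap keeps document order and repeated text; a Map would silently drop any repeated text key
var array = [...images].flatMap(img => [(img.previousSibling?.textContent ?? "").trim(), img])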
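A repeated capturing group keeps only its last iteration, which is why the regex in the question reports just the third image. If no DOM is available (for example in a plain Node script) and the markup is known to have exactly the shape shown in the question, the ordered array can be sketched with a global regex and matchAll instead. This is only a sketch for that narrow input, not a general HTML parser; the function name splitTextAndImages and the { src, class } object shape are assumptions made for illustration.

```javascript
// Hypothetical helper: splits a string into trimmed text chunks and image objects, in order.
// Assumes every tag looks like <img src='...' class='...' />, with the class attribute optional.
function splitTextAndImages(input) {
  // \1 and \3 make sure each attribute value is closed by the same quote that opened it.
  const imgPattern = /<img\s+src=(['"])(.*?)\1(?:\s+class=(['"])(.*?)\3)?\s*\/?>/gi;
  const result = [];
  let lastIndex = 0;
  for (const match of input.matchAll(imgPattern)) {
    // Text between the previous tag (or the start) and this tag.
    const text = input.slice(lastIndex, match.index).trim();
    if (text) result.push(text);
    result.push({ src: match[2], class: match[4] });
    lastIndex = match.index + match[0].length;
  }
  // Any text left after the last tag.
  const tail = input.slice(lastIndex).trim();
  if (tail) result.push(tail);
  return result;
}

console.log(splitTextAndImages("lorem ipsum <img src='some_url1' class='some_class1' /> ipsum"));
```

The loop relies on match.index to slice out the text between tags, so the order of text and images in the result follows the order in the input.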

Related

Regex Match everything except words of particular flag format

I need a regex that matches everything except the random flags.
All flags have the same format: they start with FLAG: and end with ;
FLAG:random_token;
Example:
hello
hello world
lorem ipsum dolor sit amet
FLAG:xyz6767abcd45xyz; and lorem
lorem ipsum dolor
FLAG:abc123; and hello there,..
hello there....
Output I'm trying to obtain:
hello
hello world
lorem ipsum dolor sit amet
and lorem
lorem ipsum dolor
and hello there,..
hello there....
So far I've tried:
^(?!FLAG:(.*?);).*
and
(?!.*\bFLAG:.*$)^.*$
But both fail to extract the text after the semicolon in FLAG:random_token;
Any help would be appreciated.
I have also tried deleting all flags from the block, but I need the token values later, and I thought a regex would be the best fit.
One way to do this is to remove the flags from the input string with String.replace and a regex that matches FLAG: plus the random token (everything up to the next ;). A callback function can then store the tokens as they are found:
let str = `hello
hello world
lorem ipsum dolor sit amet
FLAG:xyz6767abcd45xyz; and lorem
lorem ipsum dolor
FLAG:abc123; and hello there,..
hello there....`;
const tokens = [];
str = str.replace(/FLAG:([^;]+);/g, (_, p1) => {
tokens.push(p1);
return '';
});
console.log(str);
console.log(tokens);
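The replace above leaves the space that followed each flag at the start of its line (" and lorem" rather than "and lorem"). A small variation, assuming a flag is followed only by optional spaces or tabs before the remaining text, strips that whitespace as well. The name extractFlags is hypothetical.

```javascript
// Hypothetical helper: removes FLAG:token; markers and returns the cleaned text plus the tokens.
function extractFlags(text) {
  const tokens = [];
  // [ \t]* also swallows spaces or tabs right after the semicolon, but never the newline.
  const cleaned = text.replace(/FLAG:([^;]+);[ \t]*/g, (_, token) => {
    tokens.push(token);
    return '';
  });
  return { cleaned, tokens };
}

const flagResult = extractFlags('hello\nFLAG:abc123; and hello there,..\nhello there....');
console.log(flagResult.cleaned);
console.log(flagResult.tokens);
```

Using [ \t]* rather than \s* keeps the line structure intact when a flag happens to be the last thing on a line.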

Cheerio unmatched selector error while selecting plain text

I'm scraping a web page with cheerio's .map method. The page's html code looks like this:
<div class="foo">
<h1>Lorem</h1>
<p>Lorem ipsum dolor sit amet.</p>
TEXT WITHOUT TAG
<p>Lorem ipsum dolor sit amet.</p>
</div>
Here is what I do:
let $ = cheerio.load(body);
let contentHtml = $('foo').html();
$(contentHtml).map((index, element) => {
console.log(element);
});
When .map sees the 'TEXT WITHOUT TAG', it throws an error like this:
Unmatched selector: ...
That is expected, because the plain text is not a valid selector. I want to wrap that plain text in <p> tags, but I couldn't figure out how.
Your element has the class foo, but your selector is missing the leading dot of a class selector:
let contentHtml = $('.foo').html();

ES6 - Parse HTML string to Array

I have an HTML formatted string:
let dataString = '<p>Lorem ipsum</p> <figure><img src="" alt=""></figure> <p>Lorem ipsum 2</p> <figure><img src="" alt=""></figure>';
How can I parse this string to get an array of tags as below?
let dataArray = [
'<p>Lorem ipsum</p>',
'<figure><img src="" alt=""></figure>',
'<p>Lorem ipsum 2</p>',
'<figure><img src="" alt=""></figure>',
];
Turn it into a document with DOMParser, then take the children of the body and .map their .outerHTML:
const str = '<p>Lorem ipsum</p> <figure><img src="" alt=""></figure> <p>Lorem ipsum 2</p> <figure><img src="" alt=""></figure>';
const doc = new DOMParser().parseFromString(str, 'text/html');
const arr = [...doc.body.children].map(child => child.outerHTML);
console.log(arr);
(You can also achieve this by creating an element, setting its innerHTML to the string, and then iterating over its children, but that could allow arbitrary code execution if the input string isn't trustworthy.)
DOM parsing is recommended.
Here is a version in vanilla JS that does not use the DOMParser from the other answer:
let dataString = `<p>Lorem ipsum</p> <figure><img src="" alt=""></figure> <p>Lorem ipsum 2</p> <figure><img src="" alt=""></figure>`;
let domFragment = document.createElement("div");
domFragment.innerHTML = dataString;
const arr = [...domFragment.querySelectorAll("div>p,div>figure")].map(el => el.outerHTML)
console.log(arr)
If you cannot use that, your SPECIFIC string can be split like this once the nested quotes are fixed.
Note that any change to the input, for example adding a space after the <img..>, will break such a script.
let dataString = `<p>Lorem ipsum</p> <figure><img src="" alt=""></figure> <p>Lorem ipsum 2</p> <figure><img src="" alt=""></figure>`;
dataString = dataString.replace(/> /g,">|").split("|")
console.log(dataString)
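The replace-then-split above also breaks if the text itself contains a | character. A sketch that splits directly on a space sitting between a closing > and an opening <, assuming lookbehind support (ES2018), avoids the placeholder. It is just as tied to this specific string shape as the version above; splitTopLevelTags is a hypothetical name.

```javascript
// Hypothetical helper: splits only on a single space that sits between ">" and "<".
// Like the replace/split version, this assumes the exact "tag space tag" layout of the sample.
function splitTopLevelTags(html) {
  // (?<=>) looks behind for ">", (?=<) looks ahead for "<"; neither consumes characters.
  return html.split(/(?<=>) (?=<)/);
}

const sample = `<p>Lorem ipsum</p> <figure><img src="" alt=""></figure> <p>Lorem ipsum 2</p> <figure><img src="" alt=""></figure>`;
console.log(splitTopLevelTags(sample));
```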
Your question is not entirely clear to me. Is that a random string or an HTML string? Is the rule to slice the original string into its HTML element parts?
If so, I think we can handle it with a dummy element.
For convenience, I use jQuery here:
let stringToSplit = `<p>Lorem ipsum</p> <figure><img src="" alt=""></figure> <p>Lorem ipsum 2</p> <figure><img src="" alt=""></figure>`
var $dummy = $("<div/>"); // create a dummy
$dummy.html(stringToSplit);
var dataArray = [];
var dummyChildren = $dummy.children();
for (var i = 0; i < dummyChildren.length; i++) {
dataArray[i] = dummyChildren[i].outerHTML
}
$dummy = null; // remove from memory
console.log(dataArray)
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

Find the index position of every element starting with `icon-`in a block of text using javascript

What is the best way to loop through a text block and find the index position of every element whose class starts with icon-, using JavaScript or jQuery?
I also want to ignore any <br> tags in the index position calculation.
I have thought about using substring to find the position of the elements.
Here is an example text block
<div class="intro">
Lorem dolor sit<br>
<span class="icon-pin"></span> consectetur<br>
adiposcing elit, sed do <span class="icon-hand"></span> lorem<br>
ipsum dolor sit amet.
</div>
What I want to get out of this is how many characters in (not counting white space and tags) each [class^=icon-] element is.
For example, the first [class^=icon-] is 14 characters in.
Thanks
I think this is what you're looking for. It finds the index of each span among the elements in .intro and ignores the <br> tags:
$(".intro [class^=icon-]").each(function() {
var i = $(".intro *:not(br)").index(this)
console.log(i)
})
Demo
$(".intro [class^=icon-]").each(function() {
var i = $(".intro *:not(br)").index(this)
console.log(i)
})
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div class="intro">
Lorem dolor sit<br>
<span class="icon-pin"></span> consectetur<br> adiposcing elit, sed do <span class="icon-hand"></span> lorem<br> ipsum dolor sit amet.
</div>
You can achieve it with jQuery's each, as in this example:
$('[class^="icon-"]','.intro').each(function(index, element){
console.log(index,element);
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div class="intro">
Lorem dolor sit<br>
<span class="icon-pin"></span> consectetur<br>
adiposcing elit, sed do <span class="icon-hand"></span> lorem<br>
ipsum dolor sit amet.
</div>
You can put more than one class on an element. So you can keep your "icon-" classes and add another one, such as "grabber", to capture them. Then find the "grabber" elements with a for loop, like this:
var grabbers = $('.grabber'); // all elements carrying the extra class
for (var x = 0; x < grabbers.length; x++) {
// grabbers[x] is the plain DOM element; wrap it in $() to call jQuery methods on it
console.log(grabbers[x]);
}

Javascript reg exp between closing tag to opening tag

How do I select, with a regular expression, the text after the </h2> closing tag up to the next <h2> opening tag?
<h2>my title here</h2>
Lorem ipsum dolor sit amet <b>with more tags</b>
<h2>my title here</h2>
consectetur adipisicing elit quod tempora
In this case I want to select this text: Lorem ipsum dolor sit amet <b>with more tags</b>
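Try this: /<\/h2>(.*?)<h2/
This finds a closing </h2> tag, then captures everything up to the next opening <h2> tag. Note that . does not match line breaks, so this works only when the text sits on a single line.
In JS, you'd do this to get just the text: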
substr = str.match(/<\/h2>(.*?)<h2/)[1];
var str = '<h2>my title here</h2>Lorem ipsum <b>dolor</b> sit amet<h2>my title here</h2>consectetur adipisicing elit quod tempora';
var substr = str.match(/<\/h2>(.*?)<h2/)[1].replace(/<.*?>/g, '');
console.log(substr);
//returns: Lorem ipsum dolor sit amet
Try
/<\/h2>((?:\s|.)*)<h2/
And you can see it in action on this regex tester.
You can see it in this example below too.
(function() {
"use strict";
var inString, regEx, res, outEl;
outEl = document.getElementById("output");
inString = "<h2>my title here</h2>\n" +
"Lorem ipsum dolor sit amet <b>with more tags</b>\n" +
"<h2> my title here </h2>\n" +
"consectetur adipisicing elit quod tempora"
regEx = /<\/h2>((?:\s|.)*)<h2/
res = regEx.exec(inString);
console.log(res);
res.slice(1).forEach(function(match) {
var newEl = document.createElement("pre");
newEl.innerHTML = match.replace(/</g, "&lt;").replace(/>/g, "&gt;");
outEl.appendChild(newEl);
});
}());
<main>
<div id="output"></div>
</main>
I added \n to your example to simulate new lines. No idea why you aren't just selecting the <h2> with a querySelector() and getting the text that way.
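Both patterns above return only the first section, and the greedy * in the second one runs all the way to the last <h2 when the input has more than two headings. A sketch that collects every section, assuming the s (dotAll) flag from ES2018 is available, combines a lazy quantifier with a lookahead. The name sectionsAfterHeadings is hypothetical, and like any regex over HTML it assumes simple, well-formed input.

```javascript
// Hypothetical helper: returns the text that follows each </h2>,
// up to the next <h2 or the end of the input.
function sectionsAfterHeadings(html) {
  // s flag: "." also matches newlines. The lazy *? stops at the first position
  // where the lookahead sees "<h2" or the end of the string.
  const pattern = /<\/h2>(.*?)(?=<h2|$)/gs;
  return [...html.matchAll(pattern)].map(m => m[1].trim());
}

const page = "<h2>my title here</h2>\n" +
  "Lorem ipsum dolor sit amet <b>with more tags</b>\n" +
  "<h2>my title here</h2>\n" +
  "consectetur adipisicing elit quod tempora";
console.log(sectionsAfterHeadings(page));
```

Because the lookahead does not consume the next <h2, each heading remains available as the starting point of the following match.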
Match the tags and remove them with the string replace() function. This solution also removes self-closing tags such as <br/> and <hr/>.
var htmlToParse = document.getElementsByClassName('input')[0].innerHTML;
var htmlToParse = htmlToParse.replace(/[\r\n]+/g,""); // clean up the multiLine HTML string into singleline
var selectedRangeString = htmlToParse.match(/(<h2>.+<h2>)/g); //match the string between the h2 tags
var parsedString = selectedRangeString[0].replace(/((<\w+>(.*?)<\/\w+>)|<.*?>)/g, ""); //removes all the tags and string within it, Also single tags like <br/> <hr/> are also removed
document.getElementsByClassName('output')[0].innerHTML += parsedString;
<div class='input'>
<i>Input</i>
<h2>my title here</h2>
Lorem ipsum dolor sit amet <br/> <b>with more tags</b>
<hr/>
<h2>my title here</h2>
consectetur adipisicing elit quod tempora
</div>
<hr/>
<div class='output'>
<i>Output</i>
<br/>
</div>
A couple of things to remember about the code:
htmlToParse.match(/(<h2>.+<h2>)/g); returns an array of strings, i.e. all the strings matched by this regex.
selectedRangeString[0]: I am using only the first match for demo purposes. If you want to process all the matches, loop over the array with the same logic.
