How to remove <div> and <br> using Cheerio js? - javascript

I have the following HTML that I would like to parse with Cheerio.
var $ = cheerio.load('<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>This works well.</div><div><br clear="none"/></div><div>So I have been doing this for several hours. How come the space does not split? Thinking that this could be an issue.</div><div>Testing next paragraph.</div><div><br clear="none"/></div><div>Im testing with another post. This post should work.</div><div><br clear="none"/></div><h1>This is for test server.</h1></body></html>', {
normalizeWhitespace: true,
});
// trying to parse the html
// the goals are to
// 1. remove all the 'div'
// 2. clean up <br clear="none"/> into <br>
// 3. Have all the new 'empty' element added with 'p'
var testData = $('div').map(function(i, elem) {
var test = $(elem)
if ($(elem).has('br')) {
console.log('spaceme');
var test2 = $(elem).removeAttr('br');
} else {
var test2 = $(elem).removeAttr('div').add('p');
}
console.log(i +' '+ test2.html());
return test2.html()
})
res.send(test2.html())
My end goals are to parse the HTML and:
remove all the div elements,
clean up <br clear="none"/> and change it into <br>,
and finally wrap the text of each former div (the sentences) in <p> ... </p>.
I tried to start with a smaller goal in the code above. Removing the div elements seemed to work, but I'm unable to find the br. I have been trying for days and have made no headway.
So I'm writing here to ask for some help and hints on how I can get to my end goal.
Thank you :D

It's easier than it looks. First you iterate over all the divs
$('div').each(function() { ...
and for each div, you check if it has a <br> tag
$(this).find('br').length
if it does, you remove the attribute
$(this).find('br').removeAttr('clear');
if not you create a P with the same content
var p = $('<p>' + $(this).html() + '</p>');
and then just replace the DIV with the P
$(this).replaceWith(p);
and output
res.send($.html());
All together it's
$('div').each(function() {
if ( $(this).find('br').length ) {
$(this).find('br').removeAttr('clear');
} else {
var p = $('<p>' + $(this).html() + '</p>');
$(this).replaceWith(p);
}
});
res.send($.html());

You don't want to remove an attribute; you want to remove the tag, so switch removeAttr to remove, like so:
var testData = $('div').map(function(i, elem) {
var test = $(elem)
if ($(elem).has('br')) {
console.log('spaceme');
var test2 = $(elem).remove('br');
} else {
var test2 = $(elem).remove('div').add('p');
}
console.log(i +' '+ test2.html());
return test2.html()
})

Related

How to exclude JPEG file names from a regexp replace?

I'm using a search function for a documentation site which, upon selection of a search hit, shows the page with the text highlighted (just as a PDF reader or NetBeans would do).
To achieve the highlight I use JavaScript with:
function searchHighlight(searchTxt) {
var target = $('#page').html();
var re = new RegExp(searchTxt, 'gi');
target = target.replace(
re,
'<span class="high">' + searchTxt + '</span>'
);
$('#page').html(target);
}
Problem / Question:
Since the page includes images with filenames based on MD5, some searches mess up the image src.
Searching for "1000" will distort
<img src="53451000abababababa---.jpg"
to
<img src="5345<span class="high">1000</span>abababababa---.jpg">
Is it possible to solve this with a regexp, somehow excluding anything adjacent to ".jpg"?
Or would it be possible to replace the images with placeholders before highlighting, and revert back to the src after the replace?
Example:
Replace all <img *> with {{I-01}}, {{I-02}} etc. and keep the real src in a var.
Do the replace above.
Revert back from {{I-01}} to the <img src=".."/>
DOM manipulation is of course an option, but I figure this could be done with a regexp somehow; however, my regexp skills are badly lacking.
UPDATE
This code works for me now:
function searchHighlight(searchTxt) {
var stack = new Array();
var stackPtr = 0;
var target = $('#page').html();
//pre
target = target.replace(/<img.+?>/gi,function(match) {
stack[stackPtr] = match;
return '{{im' + (stackPtr++) + '}}';
});
//replace
var re = new RegExp(searchTxt, 'gi');
target = target.replace(re,'<span class="high">' + searchTxt + '</span>');
//post
stackPtr = 0;
target = target.replace(/{{im.+?}}/gi,function(match) {
return stack[stackPtr++];
});
$('#page').html(target);
}
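The placeholder idea can also be sketched as a pure string function that never touches the text inside tags. This is only a minimal sketch under stated assumptions: the names escapeRegExp and highlightOutsideTags are made up for illustration, and it assumes no attribute value contains a literal ">". Unlike the code above, it escapes regex metacharacters in the search text and re-inserts the text as it was matched (via $&), so the original casing is kept.

```javascript
// Hypothetical helper: escape characters that are special in a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: highlight searchTxt in html, skipping <...> tags.
function highlightOutsideTags(html, searchTxt) {
  if (!searchTxt) return html;
  var re = new RegExp(escapeRegExp(searchTxt), 'gi');
  // Splitting on a capturing group keeps the tags in the result:
  // even indexes are text, odd indexes are tags.
  return html.split(/(<[^>]*>)/).map(function (part, i) {
    if (i % 2 === 1) return part; // a tag: leave it untouched
    // '$&' re-inserts the text exactly as matched, preserving its case.
    return part.replace(re, '<span class="high">$&</span>');
  }).join('');
}

var sample = '<p>Port 1000 <img src="53451000ab.jpg"> Yoda</p>';
console.log(highlightOutsideTags(sample, '1000'));
console.log(highlightOutsideTags(sample, 'yoda'));
```

It could be wired in with $('#page').html(highlightOutsideTags($('#page').html(), searchTxt)). Because only the text between tags is searched, the img src stays intact without a restore step.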
One approach would be to create an array of all possible valid search terms and set each term as the .textContent of a <span> element within the #page parent element.
In the searchHighlight function, check whether searchTxt matches an element of the array. If it does, select the span using the index of the matched array element and toggle the "high" class on that #page span; otherwise notify the user that searchTxt does not match any valid search term.
$(function() {
var words = [];
var input = $("input[type=text]");
var button = $("input[type=button][value=Search]");
var reset = $("input[type=button][value=Reset]");
var label = $("label");
var page = $("#page");
var contents = $("h1, p", page).contents()
.filter(function() {
return this.nodeType === 3 && /\w+/.test(this.nodeValue)
}).map(function(i, text) {
var span = text.nodeValue.split(/\s/).filter(Boolean)
.map(function(word, index) {
words.push(word);
return "<span>" + word + "</span> "
});
$(text.parentElement).find(text).replaceWith(span);
})
var spans = $("span", page);
button.on("click", function(event) {
spans.removeClass("high");
label.html("");
if (input.val().length && /\w+/.test(input.val())) {
var terms = input.val().match(/\w+/g);
var indexes = $.map(terms, function(term) {
var search = $.map(words, function(word, index) {
return word.toLowerCase().indexOf(term.toLowerCase()) > -1 && index
}).filter(Boolean);
return search
});
if (indexes.length) {
$.each(indexes, function(_, index) {
spans.eq(index).addClass("high")
})
} else {
label.html("Search term <em>" + input.val() + "</em> not found.");
}
}
});
reset.on("click", function(event) {
spans.removeClass("high");
input.val("");
label.html("");
})
})
.high {
background-color: #caf;
}
label em {
font-weight: bold;
background-color: darkorange;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type="text" />
<input type="button" value="Search" />
<input type="button" value="Reset" />
<label></label>
<div id="page" style="max-width:500px;border:1px solid #ccc;">
<h1 style="margin:0px;">test of replace</h1>
<p>After Luke comes to Dagobah, Yoda initially withholds his true identity. He’s trying to get a sense of who Luke is as a person; Yoda understands that there’s a lot at risk in training Luke to be a Jedi, especially considering what happened with his
father.
<img style="float:right;" width="200" src="http://a.dilcdn.com/bl/wp-content/uploads/sites/6/2013/11/04-400x225.jpg">And Yoda is not impressed — Luke is impatient and selfish. With “Adventure. Excitement. A Jedi craves not these things,” the Jedi Master makes clear that Luke must understand the significance and meaning of the journey he thinks he wants to make.
It’s an important lesson for Luke and for audiences, because when Luke faces Vader at the film’s climax, we see the stakes involved in the life of a Jedi</p>
<p>Now Yoda-search works, however a search on "sites" will break the image-link. (Yes, I know this implementation isn't perfect but I'm dealing with reality)</p>
</div>

Function to remove <span></span> from a string in a JSON object array in JavaScript

I know there are many similar questions posted, and have tried a couple solutions, but would really appreciate some guidance with my specific issue.
I would like to remove the following HTML markup from my string for each item in my array:
<SPAN CLASS="KEYWORDSEARCHTERM"> </SPAN>
I have an array of JSON objects (printArray) with a printArray.header that might contain the HTML markup.
The header text is not always the same.
Below are 2 examples of what the printArray.header might look like:
<SPAN CLASS="KEYWORDSEARCHTERM">MOST EMPOWERED</SPAN> COMPANIES 2016
RECORD WINE PRICES AT <SPAN CLASS="KEYWORDSEARCHTERM">NEDBANK</SPAN> AUCTION
I would like to strip the HTML markup, leaving me with the following results:
MOST EMPOWERED COMPANIES 2016
RECORD WINE PRICES AT NEDBANK AUCTION
Here is my function:
var newHeaderString;
var printArrayWithExtract;
var summaryText;
this.setPrintItems = function(printArray) {
angular.forEach(printArray, function(printItem){
if (printItem.ArticleText === null) {
summaryText = '';
}
else {
summaryText = '... ' + printItem.ArticleText.substring(50, 210) + '...';
}
// Code to replace the HTML markup in printItem.header
// and return newHeaderString
printArrayWithExtract.push(
{
ArticleText: printItem.ArticleText,
Summary: summaryText,
Circulation: printItem.Circulation,
Headline: newHeaderString,
}
);
});
return printArrayWithExtract;
};
Try this function. It will remove all markup tags...
function strip(html)
{
var tmp = document.createElement("DIV");
tmp.innerHTML = html;
return tmp.textContent || tmp.innerText || "";
}
Call this function sending the html as a string. For example,
var str = '<SPAN CLASS="KEYWORDSEARCHTERM">MOST EMPOWERED</SPAN> COMPANIES 2016';
var expectedText = strip(str);
expectedText now holds the text without the markup.
It can be done using regular expressions, see below:
var s1 = '<SPAN CLASS="KEYWORDSEARCHTERM">MOST EMPOWERED</SPAN> COMPANIES 2016';
var s2 = 'RECORD WINE PRICES AT <SPAN CLASS="KEYWORDSEARCHTERM">NEDBANK</SPAN> AUCTION';
function removeSpanInText(s) {
return s.replace(/<\/?SPAN[^>]*>/gi, "");
}
$("#x1").text(removeSpanInText(s1));
$("#x2").text(removeSpanInText(s2));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
1 ->
<span id="x1"></span>
<br/>2 ->
<span id="x2"></span>
For more info, see e.g. Javascript Regex Replace HTML Tags.
And jQuery is not needed, just used here to show the output.
I used this little replace function:
if (printItem.Headline === null) {
headlineText = '';
}
else {
var str = printItem.Headline;
var rem1 = str.replace('<SPAN CLASS="KEYWORDSEARCHTERM">', '');
var rem2 = rem1.replace('</SPAN>', '');
var newHeaderString = rem2;
}
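One thing to watch with the two-step version: replace() with a string pattern only replaces the first occurrence and is case-sensitive, so a headline with two highlighted terms would keep some markup. A minimal sketch of a global, case-insensitive variant applied across the whole array follows. The helper name stripSpans and the sample data are hypothetical; the Headline field name is taken from the snippet above.

```javascript
// Remove every opening and closing SPAN tag, whatever its attributes or case.
function stripSpans(text) {
  if (text == null) return ''; // covers both null and undefined
  return String(text).replace(/<\/?span\b[^>]*>/gi, '');
}

// Hypothetical sample data shaped like the question's printArray.
var printArray = [
  { Headline: '<SPAN CLASS="KEYWORDSEARCHTERM">MOST EMPOWERED</SPAN> COMPANIES 2016' },
  { Headline: '<SPAN CLASS="KEYWORDSEARCHTERM">WINE</SPAN> AND MORE <SPAN CLASS="KEYWORDSEARCHTERM">WINE</SPAN>' },
  { Headline: null }
];

// Build a plain list of cleaned headlines, one per item.
var headlines = printArray.map(function (printItem) {
  return stripSpans(printItem.Headline);
});
console.log(headlines);
```

Inside the angular.forEach loop from the question, the same helper could be called as newHeaderString = stripSpans(printItem.Headline).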

Find ID in String using match

I'm building a widget that will generate a graph for an element when it is double clicked on the page. Without remaking all widgets this is the only way for me to tackle the problem.
I want to find the ID of a widget from the HTML of an element.
All the widgets I want this to work for are inside a div element with the id panel_content_id_#
I want to find the number on this line of code
var io_id=32715;
How can I search the string for this pattern and get the number (32715)?
$('div[id^="panel_content_id_"]').dblclick(function(e){
console.log($(this).attr('id'));
var code = $(this).html();
// Find ID
var id = -1;
var search = code.match("var io_id=");
if(search > -1){
}
console.log($(this).html());
});
The line of code I'm looking for will look like this
var io_id=xxxxx;
where xxxxx is some number I don't know in advance.
I want to find xxxxx.
Split it into two parts: everything before var io_id= and everything after it.
You know that the line ends with ;, so from the second part you keep only what comes before the semicolon.
CODE
$('div[id^="panel_content_id_"]').dblclick(function(e){
console.log($(this).attr('id'));
var code = $(this).html();
// Find ID
var id = -1;
if (code.indexOf("var io_id=")>-1) {
// take the text after "var io_id=" and keep what precedes the first ";"
id = parseInt(code.split("var io_id=")[1].split(";")[0], 10);
}
console.log("The id is: " +id);
});
Maybe you could try this regex pattern, which captures the digits that follow var io_id=:
var io_id=(\d+)
Used like this:
$('div[id^="panel_content_id_"]').dblclick(function(e) {
console.log($(this).attr('id'));
var code = $(this).html();
// Find ID
var id = -1;
var search = code.match(/var io_id=(\d+)/);
if (search) { // match() returns null when nothing is found
id = parseInt(search[1], 10); // the first capture group holds the digits
alert(id);
}
console.log($(this).html());
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="panel_content_id_32715">
Div content
<br><br>
var io_id=32715;
</div>

Extract the text out of HTML string using JavaScript

I am trying to get the inner text of an HTML string using a JS function (the string is passed as an argument). Here is the code:
function extractContent(value) {
var content_holder = "";
for (var i = 0; i < value.length; i++) {
if (value.charAt(i) === '>') {
continue;
while (value.charAt(i) != '<') {
content_holder += value.charAt(i);
}
}
}
console.log(content_holder);
}
extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");
The problem is that nothing gets printed on the console (content_holder stays empty). I think the problem is caused by the === operator.
Create an element, store the HTML in it, and get its textContent:
function extractContent(s) {
var span = document.createElement('span');
span.innerHTML = s;
return span.textContent || span.innerText;
};
alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
Here's a version that allows you to have spaces between nodes, although you'd probably want that for block-level elements only:
function extractContent(s, space) {
var span= document.createElement('span');
span.innerHTML= s;
if(space) {
var children= span.querySelectorAll('*');
for(var i = 0 ; i < children.length ; i++) {
if(children[i].textContent)
children[i].textContent+= ' ';
else
children[i].innerText+= ' ';
}
}
return [span.textContent || span.innerText].toString().replace(/ +/g,' ');
};
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>. Nice to <em>see</em><strong><em>you!</em></strong>"));
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>. Nice to <em>see</em><strong><em>you!</em></strong>",true));
One line (more precisely, one statement) version:
function extractContent(html) {
return new DOMParser()
.parseFromString(html, "text/html")
.documentElement.textContent;
}
textContent is a very good technique for achieving the desired result, but sometimes we don't want to load a DOM. A simple workaround is the following regular expression:
let htmlString = "<p>Hello</p><a href='http://w3c.org'>W3C</a>"
let plainText = htmlString.replace(/<[^>]+>/g, '');
Use this regex to remove the HTML tags and keep only the inner text.
For the example string it yields HelloW3C:
var content_holder = value.replace(/<(?:.|\n)*?>/gm, '');
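One caveat with the regex approaches: they remove tags but leave character entities such as &amp; encoded, whereas textContent decodes them. A minimal sketch that strips tags and then decodes a handful of common entities in a single pass follows. The helper name stripTags and the entity table are assumptions for illustration and are deliberately incomplete.

```javascript
// Entities this sketch knows how to decode (deliberately incomplete).
var ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', '#39': "'" };

function stripTags(html) {
  return html
    .replace(/<[^>]+>/g, '') // drop anything that looks like a tag
    .replace(/&(amp|lt|gt|quot|#39);/g, function (match, name) {
      // One pass over the string, so "&amp;lt;" becomes "&lt;", not "<".
      return ENTITIES[name];
    });
}

console.log(stripTags("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
console.log(stripTags('<p>Fish &amp; chips &lt;3</p>'));
```

For anything beyond simple, trusted markup, a real parser (the DOM in a browser, or a library in Node.js) is the more robust choice.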
Try This:-
<!DOCTYPE html>
<html>
<body>
<script type="text/javascript">
function extractContent(value){
var div = document.createElement('div')
div.innerHTML=value;
var text= div.textContent;
return text;
}
window.onload=function()
{
alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
};
</script>
</body>
</html>
For Node.js
This uses the jsdom library, since Node.js doesn't have the DOM features a browser has.
import * as jsdom from "jsdom";
const html = "<h1>Testing</h1>";
// textContent is null on the document node itself, so read it from the body
const text = new jsdom.JSDOM(html).window.document.body.textContent;
console.log(text);
Use the match() function to pull out the HTML tags (note that this returns the tags themselves, not the text between them):
const text = `<div>Hello World</div>`;
console.log(text.match(/<[^>]*?>/g));
You could temporarily write it out to a block-level element that is positioned off the page, something like this:
HTML:
<div id="tmp" style="position:absolute;top:-400px;left:-400px;">
</div>
JavaScript:
<script type="text/javascript">
function extractContent(value){
var div=document.getElementById('tmp');
div.innerHTML=value;
console.log(div.children[0].innerHTML);//console out p
}
extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");
</script>
With jQuery, a selector can list several comma-separated tags:
var readableText = [];
$("p, h1, h2, h3, h4, h5, h6").each(function(){
readableText.push( $(this).text().trim() );
})
console.log( readableText.join(' ') );
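You need an array to hold the values. Note that the continue in the original loop also makes the inner while unreachable, so track whether you are inside a tag instead:
function extractContent(value) {
var content_holder = new Array();
var inTag = false;
for (var i = 0; i < value.length; i++) {
var ch = value.charAt(i);
if (ch === '<') {
inTag = true; // entering a tag: stop collecting
} else if (ch === '>') {
inTag = false; // leaving a tag: resume collecting
} else if (!inTag) {
content_holder.push(ch); // plain text character
}
}
return content_holder.join('');
}
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>")); // HelloW3C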

Remove node function on parent element

I'm new to JS. I'm trying to delete the parent node with all the children by clicking a button. But the console tells me that undefined is not a function. What am I missing?
Fiddle:
http://jsfiddle.net/vy0d8bqt/
HTML:
<button type="button" id="output">Get contacts</button>
<button type="button" id="clear_contacts">clear contact</button>
<div id="output_here"></div>
JS:
// contact book, getting data from JSON and outputting via a button
// define a JSON structure
var contacts = {
"friends" :
[
{
"name" : "name1",
"surname" : "surname1"
},
{
"name" : "name2",
"surname" : "surname2"
}
]
};
//get button ID and id of div where content will be shown
var get_contacts_btn = document.getElementById("output");
var output = document.getElementById("output_here");
var clear = document.getElementById("clear_contacts");
var i;
// get length of JSON
var contacts_length = contacts.friends.length;
get_contacts_btn.addEventListener('click', function(){
//console.log("clicked");
for(i = 0; i < contacts_length; i++){
var data = contacts.friends[i];
var name = data.name;
var surname = data.surname;
output.style.display = 'block';
output.innerHTML += "<p> name: " + name + "| surname: " + surname + "</p>";
}
});
//get Children of output div to remove them on clear button
//get output to clear
output_to_clear = document.getElementById("output_here");
clear.addEventListener('click', function(){
output_to_clear.removeNode(true);
});
You should use remove() instead of removeNode()
http://jsfiddle.net/vy0d8bqt/1/
However, this also removes the output_to_clear node itself. You can use output_to_clear.innerHTML = '' if you would like to delete just the content of the node without removing the node itself (so you can click the 'Get contacts' button again after clearing it).
http://jsfiddle.net/vy0d8bqt/3/
You want this for broad support:
output_to_clear.parentNode.removeChild(output_to_clear);
Or this in modern browsers only:
output_to_clear.remove();
But either way, make sure you don't try to remove it after it has already been removed. Since you're caching the reference, that could be an issue, so this may be safer:
if (output_to_clear.parentNode != null) {
output_to_clear.remove();
}
If you were hoping to empty its content, then do this:
while (output_to_clear.firstChild) {
output_to_clear.removeChild(output_to_clear.firstChild);
}
I think using jQuery's .remove() is probably the best choice here. If you can't or don't want to use jQuery, the Mozilla docs for Node provide a function to remove all child nodes.
Element.prototype.removeAll = function () {
while (this.firstChild) { this.removeChild(this.firstChild); }
return this;
};
Which you would use like:
output_to_clear.removeAll();
For a one-off given the example provided:
while (output_to_clear.firstChild) { output_to_clear.removeChild(output_to_clear.firstChild); }
