Search body for document for {~ contents ~}


Alright, so basically I would like to search the Body tags for {~ , then get whatever follows that until ~} and turn that into a string (not including the {~ or ~} ).

const match = document.body.innerHTML.match(/\{~(.+?)~\}/); // lazy, so it stops at the first ~}
if (match) console.log(match[1]);
else console.log('No match found');
<body>text {~inner~} text </body>

$(function(){
var bodyText = document.getElementsByTagName("body")[0].innerHTML;
var found = bodyText.match(/{~(.*?)~}/gi) || []; // match() returns null when nothing is found
$.each(found, function( index, value ) {
var ret = value.replace(/{~/g,'').replace(/~}/g,'');
console.log(ret);
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
<body> {~Content 1~}
{~Content 2~}
</body>
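There you go: put the g flag at the end of the regex so that match() returns every occurrence. The i flag only adds case-insensitivity, which this pattern doesn't really need.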

This is a harder problem to solve than it would first appear; things like script tags and comments can throw a wrench into things if you just grab the innerHTML of the body. The following function takes a base element to search, in your case you'll want to pass in document.body, and returns an array containing any of the strings found.
function getMyTags (baseElement) {
const rxFindTags = /{~(.*?)~}/g;
// .childNodes contains not only elements, but any text that
// is not inside of an element, comments as their own node, etc.
// We will need to filter out everything that isn't a text node
// or a non-script tag.
let nodes = baseElement.childNodes;
let matches = [];
nodes.forEach(node => {
let nodeType = node.nodeType
// if this is a text node or an element, and it is not a script tag
if (nodeType === 3 || (nodeType === 1 && node.nodeName !== 'SCRIPT')) {
let html;
if (node.nodeType === 3) { // text node
html = node.nodeValue;
} else { // element
html = node.innerHTML; // or .innerText if you don't want the HTML
}
let match;
// search the html for matches until it can't find any more
while ((match = rxFindTags.exec(html)) !== null) {
// the [1] is to get the first capture group, which contains
// the text we want
matches.push(match[1]);
}
}
});
return matches;
}
console.log('All the matches in the body:', getMyTags(document.body));
console.log('Just in header:', getMyTags(document.getElementById('title')));
<h1 id="title"><b>{~Foo~}</b>{~bar~}</h1>
Some text that is {~not inside of an element~}
<!-- This {~comment~} should not be captured -->
<script>
// this {~script~} should not be captured
</script>
<p>Something {~after~} the stuff that shouldn't be captured</p>
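A variation on the function above, offered as a sketch rather than a drop-in replacement: it recurses through the tree and reads text nodes only, so comments and scripts nested inside other elements are skipped as well. collectTags is a hypothetical name. It only relies on nodeType, nodeName, nodeValue and childNodes, so plain objects stand in for DOM nodes here.

```javascript
// Hypothetical helper: walks any DOM-like tree and collects {~...~} contents
// from text nodes only (nodeType 3). Comments (8) and <script>/<style>
// elements are skipped at every depth.
function collectTags(node, out = []) {
  if (node.nodeType === 3) {
    for (const m of node.nodeValue.matchAll(/{~(.*?)~}/g)) out.push(m[1]);
  } else if (node.nodeType === 1 && !/^(SCRIPT|STYLE)$/.test(node.nodeName)) {
    for (const child of Array.from(node.childNodes)) collectTags(child, out);
  }
  return out;
}

// Plain-object stand-ins for DOM nodes, so the sketch runs outside a browser.
const text = value => ({ nodeType: 3, nodeValue: value, childNodes: [] });
const el = (name, ...kids) => ({ nodeType: 1, nodeName: name, childNodes: kids });
const fakeBody = el('BODY',
  el('H1', el('B', text('{~Foo~}')), text('{~bar~}')),
  { nodeType: 8, nodeValue: ' {~comment~} ', childNodes: [] },
  el('DIV', el('SCRIPT', text('// {~script~}'))),
  el('P', text('Something {~after~}')));

const collected = collectTags(fakeBody);
console.log(collected); // [ 'Foo', 'bar', 'after' ]
// In a browser the call would be: collectTags(document.body)
```

One limitation of this approach: a {~...~} pair that is split across two text nodes (for example by a <b> in the middle) will not be found.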
The regular expression /{~(.*?)~}/g works like this:
{~ start our match at {~
(.*?) capture anything after it; the ? makes it "non-greedy" (also known as "lazy") so, if you have two instances of {~something~} in any of the strings we are searching it captures each individually instead of capturing from the first {~ to the last ~} in the string.
~} says there has to be a ~} after our match.
The g option makes it a 'global' search, meaning it will look for all matches in the string, not just the first one.
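To see the difference the ? and the g flag make, here is a small string-only sketch. The sample string is made up for illustration.

```javascript
// Hypothetical sample string, not taken from the page above.
const sample = 'a {~one~} b {~two~} c';

// Greedy: .* runs all the way to the LAST ~} in the string.
const greedy = sample.match(/{~(.*)~}/)[1];

// Lazy + global: each {~...~} pair is captured on its own.
const lazy = Array.from(sample.matchAll(/{~(.*?)~}/g), m => m[1]);

console.log(greedy); // one~} b {~two
console.log(lazy);   // [ 'one', 'two' ]
```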
Further reading
childNodes
nodeType
Regular-Expressions.info has a great regular expression tutorial.
MDN RegExp documentation
Tools
There are lots of different tools out there to help you develop regular expressions. Here are a couple I've used:
RegExr has a great tool that explains how a particular regular expression works.
RegExPal

Related

replace and regex exception

I want to wrap all the words of a text in a <trans> tag, to be able to work on each word: hover them, translate on click, etc.
For that I need an exception in my replace function to ignore html tags like <br> or <span>.
Here is the function I have :
function wrapWords(str, tmpl) {
return str.replace(/(?![<br>\<span class="gras">\</span>])[a-zA-ZÀ-ÿ]+/gi, tmpl || "<trans>$&</trans>");
}
This function is working well with russian characters but not with french ones. The problem is that the <br> and <span> exception is excluding french characters b,r,s,p,a... Because of that some words are not wrapped correctly in my <trans> tag.
Does anyone know how I could exclude specific tags such as <br> without affecting the letters b and r in French words?
Thanks for any answer!

Doing it properly with the DOM takes a bit more code, but it is straightforward and leaves no corner cases to worry about.
You want to split the text, thus it makes sense to only operate on text nodes. To find all text nodes, we could evaluate an XPath, or we could construct a TreeWalker.
Once we know which nodes we want to operate on, we take one node at a time and get all-space and no-space sequences. Each will be transformed into another text node, but the no-space sequences will additionally be wrapped inside a <trans> element. We insert them one by one in front of the original node, which guarantees the correct order, and finally remove the original node once the replacement nodes are all in place.
function getTextNodes(node) {
let walker = document.createTreeWalker(node, NodeFilter.SHOW_TEXT, null, false);
let textnodes = [];
let textnode;
while (textnode = walker.nextNode()) {
textnodes.push(textnode);
}
return textnodes;
}
function wrap(element) {
getTextNodes(element).forEach(node => {
node.textContent.replace(/(\S+)|(\s+)/g, (match, word, space) => {
let textnode = document.createTextNode(match);
let newnode;
if (word) {
newnode = document.createElement('trans');
newnode.appendChild(textnode);
} else {
newnode = textnode;
}
node.parentNode.insertBefore(newnode, node);
});
node.remove();
});
}
wrap(document.getElementById('wrapthis'));
trans {
background-color: pink;
}
Not affected<br/>
<div id="wrapthis">
This is affected<br>
<span class="gras">HTML tags are fine</span><br/>
This as well<br/>
</div>
Not affected<br/>
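If you only have the markup as a string, one alternative is to let the regex match whole tags as well as words, and hand the tags back unchanged. A minimal sketch, assuming no tag contains a literal > inside an attribute value; this wrapWords is a hypothetical rewrite, not the asker's original, and the open/close parameters are my own addition.

```javascript
// Hypothetical rewrite of wrapWords: tags are matched first and passed
// through untouched; only letter runs outside tags get wrapped.
function wrapWords(str, open = '<trans>', close = '</trans>') {
  return str.replace(/<[^>]*>|[a-zA-ZÀ-ÿ]+/g, token =>
    token.startsWith('<') ? token : open + token + close);
}

const wrappedHtml = wrapWords('brasser<br><span class="gras">père</span>');
console.log(wrappedHtml);
// <trans>brasser</trans><br><span class="gras"><trans>père</trans></span>
```

Because the tag alternative comes first, the letters b, r, s, p, a and n are only ever consumed as part of a tag when they sit between < and >.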

Here's a quick way:
"foo bar baz".split(" ").map(w => "<trans>" + w + "</trans>").join(" ");
Explanation:
The sentence is split on the space character, which gives an Array. Each element of this Array is then wrapped in <trans> tags. Then everything is joined back into a string.
Edit: usage in the DOM:
var sourceTextNode = document.createElement("div"); // here you're supposed to get an existing node...
sourceTextNode.textContent = "foo bar baz"; // ... and doing this is for the example purposes
sourceTextNode.innerHTML = sourceTextNode.textContent.split(" ").map(w => "<trans>" + w + "</trans>").join(" ");
sourceTextNode is:
<div>
<trans>foo</trans>
<trans>bar</trans>
<trans>baz</trans>
</div>
Note: You may want to exclude the empty elements that you'll get in the split Array when there are multiple consecutive space characters.
One way to do this is testing the non-emptiness of the elements in a filter:
sourceText.split(" ").filter(Boolean)...
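A quick illustration of what the filter does, using a made-up sentence with repeated spaces:

```javascript
const sentence = 'foo  bar   baz'; // hypothetical input with repeated spaces

const naive = sentence.split(' ');                 // keeps empty strings
const words = sentence.split(' ').filter(Boolean); // drops them

const wrapped = words.map(w => '<trans>' + w + '</trans>').join(' ');

console.log(naive.length); // 6
console.log(words);        // [ 'foo', 'bar', 'baz' ]
console.log(wrapped);      // <trans>foo</trans> <trans>bar</trans> <trans>baz</trans>
```

Note that joining with a single space collapses the original runs of whitespace.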

How to get numbers in elements' inner text by javascript's regex

I want to find the numbers in the inner text of an HTML document with a JavaScript regex, so that I can replace them.
For example, in the code below I want to match 1,2,3,4,5,6,1,2,3,1,2,3, but not the 444 inside the div tag.
<body>
aaaa123aaa456
<div style="background: #444">aaaa123aaaa</div>
aaaa123aaa
</body>
What could be the regular expression?

Your best bet is to use innerText or textContent to get at the text without the tags and then just use the regex /\d/g to get the numbers.
function digitsInText(rootDomNode) {
var text = rootDomNode.textContent || rootDomNode.innerText;
return text.match(/\d/g) || [];
}
For example,
alert(digitsInText(document.body));
If your HTML is not in the DOM, you can try to strip the tags yourself : JavaScript: How to strip HTML tags from string?
Since you need to do a replacement, I would still try to walk the DOM and operate on text nodes individually, but if that is out of the question, try
var HTML_TOKEN = /(?:[^<\d]|<(?!\/?[a-z]|!--))+|<!--[\s\S]*?-->|<\/?[a-z](?:[^">']|"[^"]*"|'[^']*')*>|(\d+)/gi;
function incrementAllNumbersInHtmlTextNodes(html) {
return html.replace(HTML_TOKEN, function (all, digits) {
if ("string" === typeof digits) {
return "" + (+digits + 1);
}
return all;
});
}
then
incrementAllNumbersInHtmlTextNodes(
'<b>123</b>Hello, World!<p>I <3 Ponies</p><div id=123>245</div>')
produces
'<b>124</b>Hello, World!<p>I <4 Ponies</p><div id=123>246</div>'
It will get confused around where special elements like <script> end and won't recognize digits that are entity encoded, but should work otherwise.
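For comparison, the "walk the DOM and operate on text nodes individually" route mentioned above could look roughly like this. It is a sketch only: replaceDigitsInText is a hypothetical name, and plain objects with nodeType, nodeValue and childNodes stand in for real nodes so it can run outside a browser.

```javascript
// Hypothetical helper: applies `fn` to every run of digits found in text
// nodes under `node`. Attributes such as style="background: #444" are
// never touched because they are not text nodes.
function replaceDigitsInText(node, fn) {
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === 3) {
      child.nodeValue = child.nodeValue.replace(/\d+/g, fn);
    } else if (child.nodeType === 1) {
      replaceDigitsInText(child, fn);
    }
  }
}

// Plain-object stand-ins mirroring the question's markup.
const text = value => ({ nodeType: 3, nodeValue: value, childNodes: [] });
const div = { nodeType: 1, nodeName: 'DIV', childNodes: [text('aaaa123aaaa')] };
const body = {
  nodeType: 1,
  nodeName: 'BODY',
  childNodes: [text('aaaa123aaa456'), div, text('aaaa123aaa')]
};

replaceDigitsInText(body, digits => String(Number(digits) + 1));
console.log(body.childNodes[0].nodeValue); // aaaa124aaa457
console.log(div.childNodes[0].nodeValue);  // aaaa124aaaa
// In a browser the call would be: replaceDigitsInText(document.body, fn)
```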

You don't necessarily need RegExp to get the text contents of an element excluding its descendant elements' — in fact I'd advise against it as RegExp matching for HTML is notoriously difficult — there are DOM solutions:
function getImmediateText(element){
var text = '';
// Text and elements are all DOM nodes. We can grab the lot of immediate descendants and cycle through them.
for(var i = 0, l = element.childNodes.length, node; i < l; ++i){
node = element.childNodes[i];
// nodeType 3 is text
if(node.nodeType === 3){
text += node.nodeValue;
}
}
return text;
}
var bodyText = getImmediateText(document.getElementsByTagName('body')[0]);
So here there's a function that will return only the immediate text content as a string. Of course, you could then strip that for numbers with the RegExp using something like this:
var numberString = (bodyText.match(/\d+/g) || []).join(''); // match() returns null when there are no digits

Just to answer my old question:
It is possible to achieve it by lookahead.
/\d(?=[^<>]*(<|$))/g
to replace the numbers
html.replace(/\d(?=[^<>]*(<|$))/g, function($0) {
return map[$0]
});
the source of the answer https://www.drupal.org/node/619198#comment-5710052
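A small string-only check of that lookahead, in case it helps. The map from the snippet above isn't shown, so this sketch swaps in a hypothetical toArabicIndic function as the replacement, and the sample markup is made up.

```javascript
// A digit only matches when the next angle bracket after it is "<" (or the
// string ends), i.e. when the digit is not sitting inside a tag.
const digitOutsideTag = /\d(?=[^<>]*(<|$))/g;

// Hypothetical replacement: ASCII digit -> Arabic-Indic digit (U+0660..U+0669).
const toArabicIndic = d => String.fromCharCode(0x0660 + Number(d));

const html = 'a12<b id="x3">4</b>';
const replaced = html.replace(digitOutsideTag, toArabicIndic);

const question = 'aaaa123aaa456<div style="background: #444">aaaa123aaaa</div>aaaa123aaa';
const found = question.match(digitOutsideTag).join('');

console.log(replaced); // the 1, 2 and 4 are swapped, the 3 inside id="x3" is kept
console.log(found);    // 123456123123
```

Like any regex over raw HTML, this assumes that < and > only ever appear as tag delimiters.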

How can I remove all instances of a specific text-phrase?

In a situation that the body area of a webpage is the only accessible part, is there a way to remove all instances of a particular text-phrase (written in HTML) using inline JavaScript or another inline capable language?
This could be useful in many situations, such as people using a Tiny.cc/customurl and wanting to remove the portion stating "tiny.cc/"
If specifics are allowed, we're modifying a calendar plugin using Tiny.cc to create a custom URLs (tiny.cc/customurl). The plugin shows the full URL by default so we'd like to strip the text "tiny.cc/" and keep the "customurl" portion in our code:
<div class="ews_cal_grid_custom_item_3">
<div class="ews_cal_grid_select_checkbox_clear" id="wGridTagChk" onclick="__doPostBack('wGridTagChk', 'tiny.cc/Baseball-JV');" > </div>
tiny.cc/Baseball-JV
</div>
The part we'd like to remove is the http://tiny.cc/ on the 3rd line by itself.

To do this without replacing all the HTML (which wrecks all event handlers) and to do it without recursion (which is generally faster), you can do this:
function removeText(top, txt) {
var node = top.firstChild, index;
while(node && node != top) {
// if text node, check for our text
if (node.nodeType == 3) {
// without using regular expressions (to avoid escaping regex chars),
// replace all copies of this text in this text node
while ((index = node.nodeValue.indexOf(txt)) != -1) {
node.nodeValue = node.nodeValue.substr(0, index) + node.nodeValue.substr(index + txt.length);
}
}
if (node.firstChild) {
// if it has a child node, traverse down into children
node = node.firstChild;
} else if (node.nextSibling) {
// if it has a sibling, go to the next sibling
node = node.nextSibling;
} else {
// go up the parent chain until we find a parent that has a nextSibling
// so we can keep going
while ((node = node.parentNode) != top) {
if (node.nextSibling) {
node = node.nextSibling;
break;
}
}
}
}
}
Working demo here: http://jsfiddle.net/jfriend00/2y9eH/
To do this on the entire document, you would just call:
removeText(document.body, "http://tiny.cc/");

As long as you can supply the data in string format, you can use Regular Expressions to do this for you.
You could parse the whole innerHTML of the body tag, if that is all that you can access. This is a slow and kinda-bad-practice method, but for explanation's sake:
document.body.innerHTML = document.body.innerHTML.replace(
/http:\/\/tiny\.cc\//gi, // The regular expression to search for (g = every occurrence)
""); // What to replace with (nothing).
The whole expression is contained within forward slashes, so any forward slashes inside the regexp need to be escaped with a backslash.
This goes for other characters that have special meaning in regexp, such as the period. A single period (.) denotes matching 'any' character. To match a period, it must be escaped (\.)
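If the phrase comes from a variable rather than a regex literal, the same escaping has to be done in code. A minimal sketch, where escapeRegExp is a hypothetical helper name and the sample text is made up:

```javascript
// Hypothetical helper: puts a backslash in front of every character that
// has a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const phrase = 'tiny.cc/';
const sampleText = 'see tiny.cc/Baseball-JV and tinyXcc/other';

// Escaped: the dot only matches a literal dot.
const escaped = sampleText.replace(new RegExp(escapeRegExp(phrase), 'g'), '');

// Unescaped: the dot matches any character, so "tinyXcc/" is removed too.
const unescaped = sampleText.replace(new RegExp(phrase, 'g'), '');

console.log(escaped);   // see Baseball-JV and tinyXcc/other
console.log(unescaped); // see Baseball-JV and other
```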
EDIT:
If you wish to keep the reference to the URL in the onclick, you can modify the regexp to not match when inside single quotes (as your example):
/([^']http:\/\/tiny\.cc\/[^'])/i

If you don't want to replace all the instances of that string in the HTML, then you'll have to recursively iterate over the node structure, for instance:
function textFilter(element, search, replacement) {
for (var i = 0; i < element.childNodes.length; i++) {
var child = element.childNodes[i];
var nodeType = child.nodeType;
if (nodeType == 1) { // element
textFilter(child, search, replacement);
} else if (nodeType == 3) { // text node
child.nodeValue = child.nodeValue.replace(search, replacement);
}
}
}
Then you just grab hold of the appropriate element, and call this function on it:
var el = document.getElementById('target');
textFilter(el, /http:\/\/tiny\.cc\//g, ""); // You could use a regex (note the escaped dot)
textFilter(el, "Baseball", "Basketball"); // or just a simple string
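One detail worth knowing about the second call: with a plain string as the search value, String.prototype.replace swaps only the first occurrence in each text node, whereas a regex with the g flag swaps them all. A string-only sketch with a made-up sentence:

```javascript
const sampleText = 'Baseball today, Baseball tomorrow'; // hypothetical text node value

const firstOnly = sampleText.replace('Baseball', 'Basketball');
const everyOne = sampleText.replace(/Baseball/g, 'Basketball');

console.log(firstOnly); // Basketball today, Baseball tomorrow
console.log(everyOne);  // Basketball today, Basketball tomorrow
```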

Help write regex that will surround certain text with <strong> tags, only if the <strong> tag isn't present

I have several posts on a website; all these posts are chat conversations of this type:
AD: Hey!
BC: What's up?
AD: Nothing
BC: Okay
They're marked up as simple paragraphs surrounded by <p> tags.
Using the javascript replace function, I want all instances of "AD" in the beginning of a conversation (ie, all instances of "AD" at the starting of a line followed by a ":") to be surrounded by <strong> tags, but only if the instance isn't already surrounded by a <strong> tag.
What regex should I use to accomplish this? Am I trying to do what this advises against?
The code I'm using is like this:
var posts = document.getElementsByClassName('entry-content');
for (var i = 0; i < posts.length; i++) {
posts[i].innerHTML = posts[i].innerHTML.replace(/some regex here/,
'replaced content here');
}

If AD: is always at the start of a line then the following regex should work, using the m switch:
.replace(/^AD:/gm, "<strong>AD:</strong>");
You don't need to check for the existence of <strong> because ^ will match the start of the line and the regex will only match if the sequence of characters that follows the start of the line are AD:.
You're not going against the "Don't use regex to parse HTML" advice because you're not parsing HTML, you're simply replacing a string with another string.
An alternative to regex would be to work with ranges, creating a range selecting the text and then using execCommand to make the text bold. However, I think this would be much more difficult and you would likely face differences in browser implementations. The regex way should be enough.
After seeing your comment, the following regex would work fine:
.replace(/<(p|br)>AD:/gm, "<$1><strong>AD:</strong>");
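Both patterns can be tried on plain strings. The sample conversation and markup below are made up, and the second sample assumes the tags are written exactly as <p> or <br> with no attributes.

```javascript
// Line-start version: the third line already begins with <strong>,
// so ^AD: does not match there and it is left alone.
const lines = "AD: Hey!\nBC: What's up?\n<strong>AD:</strong> Nothing";
const byLine = lines.replace(/^AD:/gm, '<strong>AD:</strong>');

// Tag version: only an AD: that directly follows <p> or <br> is wrapped.
const markup = '<p>AD: Hey!</p><p><strong>AD:</strong> Nothing</p>';
const byTag = markup.replace(/<(p|br)>AD:/g, '<$1><strong>AD:</strong>');

console.log(byLine);
// <strong>AD:</strong> Hey!
// BC: What's up?
// <strong>AD:</strong> Nothing
console.log(byTag);
// <p><strong>AD:</strong> Hey!</p><p><strong>AD:</strong> Nothing</p>
```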

Wouldn't it be easier to set the style property of the found paragraph to font-weight: bold, or give it a class that does roughly the same? That way you wouldn't have to worry about adding in tags, or searching for existing tags. Might perform better, too, if you don't have to do any string replaces.
If you really want to add the strong tags anyway, I'd suggest using DOM functions to find childNodes of your paragraph that are <strong>, and if you don't find one, add it and move the original (text) childNode of the paragraph into it.

Using regular expressions on the innerHTML isn't reliable and will potentially lead to problems. The correct way to do this is a tiresome process but is much more reliable.
E.g.
for (var i = 0, l = posts.length; i < l; i++) {
findAndReplaceInDOM(posts[i], /^AD:/g, function(match, node){
// Make sure current node does not have a <strong> as a parent
if (node.parentNode.nodeName.toLowerCase() === 'strong') {
return false;
}
// Create and return new <strong>
var s = document.createElement('strong');
s.appendChild(document.createTextNode(match[0]));
return s;
});
}
And the findAndReplaceInDOM function:
function findAndReplaceInDOM(node, regex, replaceFn) {
// Note: regex MUST have global flag
if (!regex || !regex.global || typeof replaceFn !== 'function') {
return;
}
var start, end, match, parent, leftNode,
rightNode, replacementNode, text,
d = document;
// Loop through all childNodes of "node"
if (node = node && node.firstChild) do {
if (node.nodeType === 1) {
// Regular element, recurse:
findAndReplaceInDOM(node, regex, replaceFn);
} else if (node.nodeType === 3) {
// Text node, introspect
parent = node.parentNode;
text = node.data;
regex.lastIndex = 0;
while (match = regex.exec(text)) {
replacementNode = replaceFn(match, node);
if (!replacementNode) {
continue;
}
end = regex.lastIndex;
start = end - match[0].length;
// Effectively split node up into three parts:
// leftSideOfReplacement + REPLACEMENT + rightSideOfReplacement
leftNode = d.createTextNode( text.substring(0, start) );
rightNode = d.createTextNode( text.substring(end) );
parent.insertBefore(leftNode, node);
parent.insertBefore(replacementNode, node);
parent.insertBefore(rightNode, node);
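// Remove original node from document, then carry on with the
// right-hand remainder so later matches and siblings are still visited
parent.removeChild(node);
node = rightNode;
text = node.data;
regex.lastIndex = 0;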
}
}
} while (node = node.nextSibling);
}

Regex: how to get contents from tag inner (use javascript)?

page contents:
aa<b>1;2'3</b>hh<b>aaa</b>..
.<b>bbb</b>
blabla..
I want to get this result:
1;2'3aaabbb
The tags to match are <b> and </b>.
How do I write this regex in JavaScript?
Thanks!

Lazyanno,
If and only if:
you have read SLaks's post (as well as the previous article he links to), and
you fully understand the numerous and wondrous ways in which extracting information from HTML using regular expressions can break, and
you are confident that none of the concerns apply in your case (e.g. you can guarantee that your input will never contain nested, mismatched etc. <b>/</b> tags or occurrences of <b> or </b> within <script>...</script> or comment <!-- .. --> tags, etc.)
you absolutely and positively want to proceed with regular expression extraction
...then use:
var str = "aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..";
var match, result = "", regex = /<b>(.*?)<\/b>/ig;
while (match = regex.exec(str)) { result += match[1]; }
alert(result);
Produces:
1;2'3aaabbb
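To see why the pattern above uses the lazy (.*?) rather than (.*), here is the same input run both ways. This is a sketch for comparison only and uses matchAll instead of the exec loop.

```javascript
const str = "aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..";

// Greedy: .* runs to the last </b> on the line ("." does not cross "\n").
const greedy = str.match(/<b>(.*)<\/b>/i)[1];

// Lazy + global: each <b>...</b> pair is captured separately.
const lazy = Array.from(str.matchAll(/<b>(.*?)<\/b>/ig), m => m[1]);

console.log(greedy);        // 1;2'3</b>hh<b>aaa
console.log(lazy.join('')); // 1;2'3aaabbb
```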

You cannot parse HTML using regular expressions.
Instead, you should use Javascript's DOM.
For example (using jQuery):
var text = "";
$('<div>' + htmlSource + '</div>')
.find('b')
.each(function() { text += $(this).text(); });
I wrap the HTML in a <div> tag to find both nested and non-nested <b> elements.

Here is an example without a jQuery dependency:
// get all elements with a certain tag name
var b = document.getElementsByTagName("B");
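// getElementsByTagName returns an HTMLCollection, which has no map() of
// its own, so borrow Array's: it executes a function on each member and
// builds a new array from the function results...
var text = Array.prototype.map.call(b, function(element) {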
// ...in this case we are interested in the element text
if (typeof element.textContent != "undefined")
return element.textContent; // standards compliant browsers
else
return element.innerText; // IE
});
// now that we have an array of strings, we can join it
var result = text.join('');

var regex = /(<([^>]+)>)/ig;
var bdy="aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..";
var result =bdy.replace(regex, "");
alert(result) ;
See : http://jsfiddle.net/abdennour/gJ64g/

If you want to use regular expressions, just put a '?' character after the quantifier in the pattern for your inner text to make it lazy.
for example:
".*" to "(.*?)"
