Get javascript node raw content - javascript

I have a JavaScript node in a variable, and if I log that variable to the console, I get this:
"&#8203;asekuhfas eo"
Just some random string in a javascript node. I want to get that literally to be a string. But the problem is, when I use textContent on it, I get this:
​asekuhfas eo
The special character is converted. I need to get the string to appear literally like this:
&#8203;asekuhfas eo
This way, I can deal with the special character (recognize when it exists in the string).
How can I get that node object to be a string LITERALLY as it appears?

As VisionN has pointed out, the conversion cannot be reversed: once the HTML has been parsed, the entity has already become the character it stands for.
However, by using charCodeAt() you can probably still achieve your goal.
Say you have your textContent. By iterating through each character, retrieving its char code, prepending "&#" and appending ";", you can get your desired result. The obvious downside of this method is that each and every character ends up in this notation, even those that do not require it. By introducing some kind of threshold you can restrict this to only the exotic characters.
A very naive approach would be something like this:
var a = div.textContent;
var result = "";
var threshold = 1000; // char codes above this are written as numeric entities
for (var i = 0; i < a.length; i++) {
if (a.charCodeAt(i) > threshold)
result += "&#" + a.charCodeAt(i) + ";";
else
result += a[i];
}

textContent returns everything correctly, as &#8203; is the Unicode Character 'ZERO WIDTH SPACE' (U+200B), which is:
commonly abbreviated ZWSP
this character is intended for invisible word separation and for line break control; it has no width, but its presence between two characters does not prevent increased letter spacing in justification
It can be easily proven with:
var div = document.createElement('div');
div.innerHTML = '&#8203;xXx';
console.log( div.textContent ); // "​xXx"
console.log( div.textContent.length ); // 4
console.log( div.textContent[0].charCodeAt(0) ); // 8203
As Eugen Timm mentioned in his answer, it is a bit tricky to convert Unicode characters back to HTML entities, and his solution is completely valid for non-standard characters with a char code higher than 1000. As an alternative, I propose a shorter RegExp solution which gives the same result:
var result = div.textContent.replace(/./g, function(x) {
var code = x.charCodeAt(0);
return code > 1e3 ? '&#' + code + ';' : x;
});
console.log( result ); // "&#8203;xXx"
For a better solution you may have a look at this answer which can handle all HTML special characters.
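The threshold idea from both answers can be wrapped in a small function. This is only a sketch: the name toNumericEntities and its threshold parameter are made up for illustration, and it swaps charCodeAt() for codePointAt() with a for...of loop so that characters outside the Basic Multilingual Plane come out as one entity instead of two surrogate halves.

```javascript
// Hypothetical helper: rewrite every character whose code point is above
// `threshold` as a decimal numeric entity, leaving the rest untouched.
function toNumericEntities(text, threshold = 1000) {
  let result = "";
  // for...of iterates by code point, so surrogate pairs stay together
  for (const ch of text) {
    const code = ch.codePointAt(0);
    result += code > threshold ? "&#" + code + ";" : ch;
  }
  return result;
}

console.log(toNumericEntities("\u200BxXx")); // "&#8203;xXx"
```

Lowering the threshold to 127 would encode everything outside ASCII, if that is what you need to detect.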

Related

JavaScript not removing text when a uppercase letter involved

So I have a text box on my website and I have coded this to prevent certain words from being used.
window.onload = function() {
var banned = ['MMM', 'XXX'];
document.getElementById('input_1_17').addEventListener('keyup', function(e) {
var text = document.getElementById('input_1_17').value;
for (var x = 0; x < banned.length; x++) {
if (text.toLowerCase().search(banned[x]) !== -1) {
alert(banned[x] + ' is not allowed!');
}
var regExp = new RegExp(banned[x]);
text = text.replace(regExp, '');
}
document.getElementById('input_1_17').value = text;
}, false);
}
The code works perfectly and removes the text from the text box when all the letters typed are lowercase. The problem is that when the text contains an uppercase letter, the alert is shown but the word is not removed from the text box.
The RegExp is a good direction; you just need some flags (i to make it case-insensitive, and g to make it global, so it replaces all occurrences):
var text="Under the xxx\nUnder the XXx\nDarling it's MMM\nDown where it's mmM\nTake it from me";
console.log("Obscene:",text);
var banned=["XXX","MMM"];
banned.forEach(nastiness=>{
text=text.replace(new RegExp(nastiness,"gi"),"");
});
console.log("Okay:",text);
Normally you should call .toLowerCase() on both sides when comparing strings so that they can actually match.
But the problem actually comes from the RegExp you are using, which is case-sensitive by default; you just need to add the i flag to it (and g to replace every occurrence):
var regExp = new RegExp(banned[x], 'gi');
text = text.replace(regExp, '');
Note: calling alert() inside a loop is not recommended; you can change your logic to report all the matched items in a single alert().
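Putting the flags and the single-report idea together, a minimal sketch could look like the following. The names escapeRegExp and stripBanned are hypothetical; escaping is added on the assumption that a banned word might one day contain a character with special meaning in a regular expression.

```javascript
// Escape characters that have a special meaning inside a RegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: remove every banned word (any letter case) and
// report which ones were found, so the caller can show one message.
function stripBanned(text, banned) {
  const found = [];
  for (const word of banned) {
    const re = new RegExp(escapeRegExp(word), "gi");
    const cleaned = text.replace(re, "");
    if (cleaned !== text) found.push(word);
    text = cleaned;
  }
  return { text, found };
}

const outcome = stripBanned("Under the XXx, it's mmM", ["MMM", "XXX"]);
console.log(outcome.text);
console.log(outcome.found.join(", ") + " not allowed");
```

In the keyup handler you would assign the returned text back to the input's value and call alert() once, only when found is not empty.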
You seem to have been expecting something unreasonable. Lowercase strings will never match strings containing uppercase letters.
Either convert both for comparison or use lowercase banned strings. The former would be more reliable, taking future human error out of the process.
What you can do is actually convert both variables to either all caps or all lowercase.
if (text.toLowerCase().includes(banned[x].toLowerCase())) {
alert(banned[x] + ' is not allowed!');
}
Not tested, but it should work. There is no need to use search since you don't need the index anyway; includes is cleaner.

Performance issue using regex to replace/clear substring

I have a string containing things like this:
<a#{style}#{class}#{data} id="#{attr:id}">#{child:content} #{child:whatever}</a>
All I need to do is remove every #{xxx}, except substrings starting with #{child: .
I used str.match() to get all sub-strings "#{*}" in an array to search and keep all #{child: substrings:
var matches = str.match(new RegExp("#\{(.*?)\}",'g'));
if (matches && matches.length){
for(var i=0; i<matches.length; i++){
if (matches[i].search("#{child:") == -1) str = str.replace(matches[i],'');
}
}
I got it running OK, but it's too slow when the string gets bigger (~2 seconds for 1000+ nodes like the one above).
Is there an alternative, maybe a rule (if one exists) to exclude #{child: directly in the regex and improve performance?
If I understand your question correctly, you don't want to remove the #{child:...} substrings, but everything else of the format #{...} should go. In that case you could change the regular expression to check that child: is not matched when you perform the replace:
var str = '<a#{style}#{class}#{data} id="#{attr:id}">#{child:content} #{child:whatever}</a>';
str = str.replace(/#\{((?!child:)[\s\S])+?\}/g, '');
This seems pretty fast.
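As a quick check of that replacement, the sketch below wraps it in a function and sets it next to a variant that tests for child: only once, right after #{, instead of before every character. Both function names are made up for illustration. The variant is an assumption about what is acceptable here: it only rejects placeholders that start with child:, whereas the original pattern rejects child: anywhere inside the braces. No timings are claimed; measure on your own data.

```javascript
// Pattern from the answer: the lookahead runs before every character.
function stripPlaceholders(str) {
  return str.replace(/#\{((?!child:)[\s\S])+?\}/g, "");
}

// Hypothetical variant: one lookahead right after "#{", then everything up to "}".
function stripPlaceholdersOnce(str) {
  return str.replace(/#\{(?!child:)[^}]+\}/g, "");
}

const sample = '<a#{style}#{class}#{data} id="#{attr:id}">#{child:content} #{child:whatever}</a>';
console.log(stripPlaceholders(sample));     // <a id="">#{child:content} #{child:whatever}</a>
console.log(stripPlaceholdersOnce(sample)); // same result for this input
```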

Reveal all non-printing ANSI characters and metacharacters in a Javascript string

I'm receiving piped stdout output from a multitude of fairly random shell processes, all as input (stdin) on a single node.js process. For debugging and for parsing, I need to be able to handle the different special character codes that are being piped into the process. It would really help me to see invisible characters (for debugging mostly) and to deal with them accordingly once I've identified the patterns in which they are used.
Given a javascript string with ANSI special characters \u001b* and/or metacharacters such as \n, \t, \r etc., how can one reveal these special characters so they aren't actually rendered, but rather exposed as their code value instead.
For example, let's say I have the following string printed in green (can't show the green colour on SO):
This is a string.
We are now using the green color.
I would like to be able to do a console.log (for example) on this string and have it replace the non-printing characters, metacharacters/newlines, color codes etc with their ANSI codes:
"\u001b[32m\tThis is a string.\nWe are now using the green color.\n"
I can do something like the following, but it is too specific, hard-coded, and inefficient:
line = line.replace(/[\f]/g, '\\n');
line = line.replace(/\u0008/g, '\\b');
line = line.replace(/\u001b|\u001B/g, '\\u001b');
line = line.replace(/\r\n|\r|\n/g, '\\n');
...
Try this:
var map = { // Special characters
'\\': '\\',
'\n': 'n',
'\r': 'r',
'\t': 't'
};
str = str.replace(/[\\\n\r\t]/g, function(i) {
return '\\'+map[i];
});
str = str.replace(/[^ -~]/g, function(i){
return '\\u'+("000" + i.charCodeAt(0).toString(16)).slice(-4);
});
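Wrapped in a function and applied to the string from the question, the two passes above behave as sketched below. The name revealSpecialChars is hypothetical; the patterns are the ones from the answer. The order matters: backslashes are escaped in the first pass, so the \u escapes produced by the second pass are not doubled.

```javascript
// Hypothetical wrapper around the two replace() passes shown above.
function revealSpecialChars(str) {
  const map = { "\\": "\\", "\n": "n", "\r": "r", "\t": "t" };
  return str
    // pass 1: backslash and the common escapes become \\, \n, \r, \t
    .replace(/[\\\n\r\t]/g, (ch) => "\\" + map[ch])
    // pass 2: anything still outside printable ASCII becomes \uXXXX
    .replace(/[^ -~]/g, (ch) => "\\u" + ("000" + ch.charCodeAt(0).toString(16)).slice(-4));
}

const green = "\u001b[32m\tThis is a string.\nWe are now using the green color.\n";
console.log(revealSpecialChars(green));
// prints: \u001b[32m\tThis is a string.\nWe are now using the green color.\n
```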
Here's a version that loops through the string, tests to see if it's a normal printable character and, if not, looks it up in a special table for your own representation of that character and if not found in the table, displays whatever default representation you want:
var tagKeys = {
'\n': 'New Line \n',
'\u0009': 'Tab',
'\u2029': 'Paragraph Separator'
/* and so on */
};
function tagSpecialChars(str) {
var output = "", ch, replacement;
for (var i = 0; i < str.length; i++) {
ch = str.charAt(i);
if (ch < ' ' || ch > '~') {
replacement = tagKeys[ch];
if (replacement) {
ch = replacement;
} else {
// default value
// could also use charCodeAt() to get the numeric value
ch = '*****';
}
}
output += ch;
}
return output;
}
Demo: http://jsfiddle.net/jfriend00/bCYa4/
This is obviously not some fancy regex solution, but you said performance was important and you rarely find the best performing operation using a regex and certainly not if you're going to use a whole bunch of them. Plus every regex replace has to loop through the whole string anyway.
This workman-like solution just loops through the input string once and lets you customize the display conversion for any non-printable character you want and also determine what you want to display when it's a non-printable character that you don't have a special display representation for.

extracting middle OR final part of a string

I want to extract only the first fontname out of a URL-string from the Google Webfont Directory. Here are some examples of possible strings and what part should be returned:
fonts.googleapis.com/css?family=Raleway // "Raleway"
fonts.googleapis.com/css?family=Caesar+Dressing // "Caesar Dressing"
fonts.googleapis.com/css?family=Raleway:300,400 // "Raleway"
fonts.googleapis.com/css?family=Raleway|Fondamento // "Raleway"
fonts.googleapis.com/css?family=Caesar+Dressing|Raleway:300,400|Fondamento // "Caesar Dressing"
So sometimes it's just one fontname, sometimes it has a weight indicated by a colon (:) and sometimes there are more fontnames divided by a pipe (|).
I have tried /family=(\S*)[:|]/ but it only matches the strings that contain : or |. I could do it like this, but it's not a nice solution:
var fontUrl = "fonts.googleapis.com/css?family=Caesar+Dressing|Raleway:300,400|Fondamento";
var fontName = /family=(\S*)/.exec(fontUrl)[1].replace(/\+/, " ");
if (fontName.indexOf(':') != -1){
fontName = fontName.split(':')[0];
}
if (fontName.indexOf('|') != -1){
fontName = fontName.split('|')[0];
}
console.log(fontName);
Is there a nice regex solution to this?
Instead of matching the character that (might) follow the string you want, match only the string you want except those characters:
/family=([^\s:|]*)/
Alternatively, you'd use a lookahead like this:
/family=(\S*?)(?=$|[:|])/
That should be better:
/family=([^:|]*)/
Of course for the + case, you'll have to replace it afterwards (or before maybe).
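A minimal sketch that combines the negated character class with the + replacement is shown below. The function name firstFontName is made up, and adding & to the excluded characters is an assumption, in case the URL carries further query parameters after family.

```javascript
// Hypothetical helper: return the first font family named in the URL,
// or null when there is no family parameter.
function firstFontName(url) {
  // capture everything after "family=" up to the first ":", "|" or "&"
  const match = /family=([^:|&]*)/.exec(url);
  if (!match) return null;
  // "+" stands for a space in the family name
  return match[1].replace(/\+/g, " ");
}

console.log(firstFontName("fonts.googleapis.com/css?family=Caesar+Dressing|Raleway:300,400|Fondamento"));
// prints: Caesar Dressing
```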
You can use one of the following (add the i and m modifiers in every case):
family=([a-z]+\+?[a-z]+)
or more simply
family=([a-z+]+)
or to avoid matching the + char:
family=([a-z]+)\+?([a-z]+)?
but the easier way is to use the second solution and replace the + characters with a space afterwards.
Try this:
/family\=(\S+?)[\:\|,]{0,2}\S*/ims
No regex is required in this case. Unless you are good with regexes or test them thoroughly, you are likely to make mistakes.
var fontUrls = [];
fontUrls.push("fonts.googleapis.com/css?family=Raleway");
fontUrls.push("fonts.googleapis.com/css?family=Caesar+Dressing");
fontUrls.push("fonts.googleapis.com/css?family=Raleway:300,400");
fontUrls.push("fonts.googleapis.com/css?family=Raleway|Fondamento");
fontUrls.push("fonts.googleapis.com/css?family=Caesar+Dressing|Raleway:300,400|Fondamento");
function getFirstFont(url) {
// take the first family, drop any weights, then turn "+" into spaces
return url.split("=")[1].split("|")[0].split(":")[0].replace(/\+/g, " ");
}
fontUrls.forEach(function (fontUrl) {
console.log(getFirstFont(fontUrl));
});

Javascript regexp replace, multiline

I have some text content (read in from the HTML using jQuery) that looks like either of these examples:
<span>39.98</span><br />USD
or across multiple lines with an additional price, like:
<del>47.14</del>
<span>39.98</span><br />USD
The numbers could be formatted like
1,234.99
1239,99
1 239,99
etc (i.e. not just a normal decimal number). What I want to do is get just whatever value is inside the <span></span>.
This is what I've come up with so far, but I'm having problems with the multiline approach, and also the fact that there's potentially two numbers and I want to ignore the first one. I've tried variations of using ^ and $, and the "m" multiline modifier, but no luck.
var strRegex = new RegExp(".*<span>(.*?)</span>.*", "g");
var strPrice = strContent.replace(strRegex, '$1');
I could use jQuery here if there's a way to target the span tag inside a string (i.e. it's not the DOM we're dealing with at this point).
You could remove all line breaks from the string first and then run your regex:
strContent = strContent.replace(/(\r\n|\n|\r)/gm,"");
var strRegex = new RegExp(".*<span>(.*?)</span>.*", "g");
var strPrice = strContent.replace(strRegex, '$1');
This is pretty easy with jQuery. Simply wrap your HTML string inside a div and use jQuery as usual:
var myHTML = "<span>Span 1 HTML</span><span>Span 2 HTML</span><br />USD";
var $myHTML = $("<div>" + myHTML + "</div>");
$myHTML.find("span").each(function() {
alert($(this).html());
});
try using
"[\s\S]*<span>(.*?)</span>[\s\S]*"
instead of
".*<span>(.*?)</span>.*"
EDIT: since you're using a string to define your regex, don't forget to escape your backslashes, so
[\s\S]
would be
[\\s\\S]
You want this?
var str = "<span>39.98</span><br />USD\n<del>47.14</del>\n\n<span>40.00</span><br />USD";
var regex = /<span>([^<]*?)<\/span>/g;
var matches = str.match(regex);
for (var i = 0; i < matches.length; i++)
{
document.write(matches[i]);
document.write("<br>");
}
Test here: http://jsfiddle.net/9LQGK/
The matches array will contain the matches. But it isn't really clear what you want. What does "there's potentially two numbers and I want to ignore the first one" mean?
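If only the text inside each span is wanted, rather than the whole matched tag, one option is to read the capture group with exec() in a loop. The sketch below assumes the spans carry no attributes and no nested tags; the name spanValues is made up for illustration. Because [^<] also matches line breaks, the multiline input needs no special handling.

```javascript
// Hypothetical helper: collect the contents of every <span>...</span>
// in a string, without touching the DOM.
function spanValues(html) {
  const re = /<span>([^<]*)<\/span>/g;
  const values = [];
  let m;
  // with the g flag, exec() resumes after the previous match on each call
  while ((m = re.exec(html)) !== null) {
    values.push(m[1]);
  }
  return values;
}

const content = "<del>47.14</del>\n<span>39.98</span><br />USD";
console.log(spanValues(content)); // [ '39.98' ]
```

The del price is skipped simply because it is not inside a span, so there is no need to "ignore the first number" explicitly.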
