Using javascript regex to replace a character entity - javascript

I have a set of notes which aren't displaying properly. Looking at the source, there is &#xd;&#xa; where the line breaks should be. I want to replace these instances with <br>. I've used a bookmarklet which works for a plain-text string such as ABC, but it won't work for the given string.
For example, this one works fine:
javascript:(function(){var re = new RegExp("ABC", "g"); document.getElementById('my_span').innerHTML=document.getElementById('my_span').innerHTML.replace(re, "<br>" );})();
Whereas this doesn't:
javascript:(function(){var re = new RegExp("&#xd;&#xa;", "g"); document.getElementById('my_span').innerHTML=document.getElementById('my_span').innerHTML.replace(re, "<br>" );})();
Oddly, the latter will take the first instance of &#xd;&#xa; and just remove it without replacing it.
The page I'm working with does not load jQuery, and I'm working in a corporate environment, so I'm loath to call something external.
Is this to do with innerHTML not receiving the &#xd;&#xa; as I expect it to? Thanks very much for your help.

It's exactly like you said: innerHTML is not receiving those characters as you expect it to. Inserting character entities into HTML causes them to be interpreted, and innerHTML returns the interpreted version, not the raw version. If you have:
<p id="test">Cookies &#x26; Cream</p>
and you use:
document.getElementById('test').innerHTML;
you don't get:
Cookies &#x26; Cream
you get:
Cookies &amp; Cream
Similarly, when you're using innerHTML in your bookmarklet, you're not getting &#xd;&#xa;: you're getting a carriage return and a line feed.
To match those characters with regex, use \r\n. \r will match the carriage return (&#xd;), while \n will match the line feed (&#xa;).
var regex = new RegExp('\r\n', 'g');
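A minimal sketch of that replacement, runnable outside the browser. The sample string is hypothetical and stands in for what innerHTML hands back once the entities have been decoded; in the bookmarklet you would read from and write to document.getElementById('my_span').innerHTML instead.

```javascript
// Hypothetical stand-in for the decoded innerHTML: it contains real
// carriage return and line feed characters, not character-entity text.
var decodedNotes = "First note\r\nSecond note\r\nThird note";

// In a regex literal, \r\n matches a carriage return followed by a line feed.
var crlf = /\r\n/g;

var notesWithBreaks = decodedNotes.replace(crlf, "<br>");
console.log(notesWithBreaks); // First note<br>Second note<br>Third note
```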

Related

Why do I need to replace \n with \n?

I have a line of data like this:
1•#00DDDD•deeppink•1•100•true•25•100•Random\nTopics•1,2,3,0•false
in a text file.
Specifically, for my "problem", I am using Random\nTopics as a piece of text data, and I then search for '\n', and split the message up into two lines based on the placement of '\n'.
It is stored in blockObj.msg, and I search for it using blockObj.msg.split('\n'), but I kept getting an array of 1 (no splits). I thought I was doing something fundamentally wrong and spent over an hour troubleshooting, until on a whim, I tried
blockObj.msg = blockObj.msg.replace(/\\n/g, "\n")
and that seemed to solve the problem. Any ideas as to why this is needed? My solution works, but I am clueless as to why, and would like to understand better so I don't need to spend so long searching for an answer as bizarre as this.
I have a similar error when reading "text" from an input text field. If I type a '\n' in the box, the split will not find it, but using a replace works (the replace seems pointless, but apparently isn't...)
obj.msg = document.getElementById('textTextField').value.replace(/\\n/g, "\n")
Sorry if this is jumbled, long time user of reading for solutions, first time posting a question. Thank you for your time and patience!
P.S. If possible... is there a way to do the opposite? Replace a real "\n" with a fake "\n"? (I would like to have my dynamically generated data file to have a "\n" instead of a new line)
You wrote: "It is stored in blockObj.msg, and I search for it using blockObj.msg.split('\n')".
In a JavaScript string literal, \n is an escape sequence representing a new line, so you are splitting the data on new lines.
The data you have doesn't have new lines in it, though. It has backslash characters followed by n characters. They are data, not escape sequences.
Your call to replace (blockObj.msg = blockObj.msg.replace(/\\n/g, "\n")) works around this by replacing each backslash-and-n pair with a new line.
That's an overcomplicated approach, though. You can match the characters you have directly: blockObj.msg.split('\\n')
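The difference can be seen in a small sketch. The variable msgFromFile is a hypothetical stand-in for blockObj.msg as read from the file; the doubled backslash in the literal is only how JavaScript source spells a single backslash character.

```javascript
// Hypothetical value as read from the file: the characters \ and n, not a newline.
// Inside a JavaScript string literal, that backslash has to be written as \\.
var msgFromFile = "Random\\nTopics";

var splitOnNewline = msgFromFile.split("\n");  // looks for a real newline character
var splitOnLiteral = msgFromFile.split("\\n"); // looks for a backslash followed by n

console.log(splitOnNewline.length); // 1
console.log(splitOnLiteral);        // [ 'Random', 'Topics' ]
```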
In your text file, the line
1•#00DDDD•deeppink•1•100•true•25•100•Random\nTopics•1,2,3,0•false
contains the two characters \ and n; that is how they are stored. To insert a new line character by replacement, you therefore have to search for the \ and n character pair.
obj.msg = document.getElementById('textTextField').value.replace(/\\n/g, "\n")
When you do the replace(/\\n/g, "\n"), you are searching for \\n, which is the escaped form of the pattern. The replace has to find every occurrence of the literal text \n, and to search for a backslash in a regex you need to escape it as \\, giving \\n.
EDIT
/\\n/g is the regex and "\n" is the replacement value. The general form is /REGEXSTUFFHERE/g: whatever follows the last / is the regex flags, so the g in /g makes it a global search.

Arabic text zero width joiners not working between elements

I am trying to implement a "Smart Search" feature which highlights text matches in a div as a user types a keyword. The highlighting works by using a regular expression to match the keyword in the div and replace it with
<span class="highlight">keyword</span>
The application supports both English and Arabic text. English works just fine, but when highlighting Arabic, the word "breaks" the word connection on the span rather than staying a single continuous word.
I'm trying to fix the issue by using 3 separate Regex expressions and adding zero width joiners appropriately to each case:
Match at the Beginning of a word
var startsWithRegex = new RegExp("((^|\\s)" + keyword + ")", "gi");
var newSpan = "<span class='highlight'>$1&zwj;</span>&zwj;";
Match in the Middle of a word (Note: There can be multiple middleOf matches in a single word)
var middleOfRegex = new RegExp("([^(^|\\s)])(" + keyword + ")([^($|\\s)])", "gi");
var newSpan = "&zwj;$1&zwj;<span class='highlight'>&zwj;$2&zwj;</span>&zwj;$3&zwj;";
Match at the End of a word
var endsWithRegex = new RegExp("(" + keyword + "($|\\s))", "gi");
var newSpan = "&zwj;<span class='highlight'>&zwj;$1</span>";
Both startsWithRegex and endsWithRegex appear to work as expected, but middleOfRegex does not. For example:
للأبد
transforms into:
ل‍‍ل‍‍أ‍بد
when the keyword is:
ل
I've tried various other combinations of &zwj; but nothing seems to work. Is this a limitation of WebKit? Is there another implementation I can use to get my desired result?
Thanks!
A few extra notes:
This is only happening for Webkit based browsers (Chrome specifically in my case) and we cannot use an alternative. I believe this bug is the root cause of the issue:
https://bugs.webkit.org/show_bug.cgi?id=6148
This question is an extension on these two stackoverflow questions:
Inserting HTML tag in the middle of Arabic word breaks word connection (cursive)
Partially colored Arabic word in HTML
Arabic is a special case because each letter has different forms depending on its position in the word. I remember solving a similar problem using Unicode code points: each form of a letter has its own code point.
You can find the Unicode table here
https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
You can get the Unicode value using
var code = $(selector).text().charCodeAt(0);
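As a rough illustration without jQuery, the sketch below reads the code point of the base letter lam and lists its four positional forms from the Arabic Presentation Forms-B block. The lamForms object is a hypothetical helper; how you would substitute these forms into the highlighted span is left open, since the answer does not spell it out.

```javascript
// Base letter lam as stored in ordinary Arabic text.
var lam = "\u0644";
console.log(lam.charCodeAt(0).toString(16)); // 644

// Positional presentation forms of lam (Arabic Presentation Forms-B block).
var lamForms = {
  isolated: "\uFEDD",
  final: "\uFEDE",
  initial: "\uFEDF",
  medial: "\uFEE0"
};

// Each form has its own code point but normalizes (NFKC) back to the base letter.
var allMapBack = Object.keys(lamForms).every(function (name) {
  return lamForms[name].normalize("NFKC") === lam;
});
console.log(allMapBack); // true
```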
I suggest not separating this ligature, but extending the <span> tag to enclose the entire lam+alif structure for highlighting.
According to http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237, ZWJ works as ZWJ+ZWNJ+ZWJ between ل(lam) and ا(alif). It should be rendered as a connected lam followed by a connected alif (ل‍‌‍ا), not like the required ligature (لا).
Seems to me most browsers/fonts adhere to this requirement.
My answer applies to other ligatures as well, if you use them in your application (non-required ones, e.g.: mim + mim).

dynamically creating a regex with forward slashes in javascript

I'm trying to match lines that have this format. Writing a static regex is fine, but I need to build the regex dynamically from two variables.
I can't work out how to escape the forward slashes properly. I've tried escaping them, not escaping them, and even double-escaping them (just for the heck of it), but Firebug shows the same regex being created no matter how I do it, and it doesn't match my input.
Lines to match look like this:
9.SSRDOCSWSHK1/////23NOV96/M//YEUNG/WINSTON/JEREMY-5.1
What I've tried:
var regString ='\\d{1,2}.SSRDOCS[0-9A-Z]{2}HK1[/]{5}\\d\\d[A-Z]{3}\\d\\d/[MF]//'+curGstNme+'([/A-Z]+)?-'+pax.slice(0,1)+'\.'
var namedGdocRegEx = new RegExp(regString,"g");
FireBug gives RegExp /\d{1,2}.SSRDOCS[0-9A-Z]{2}HK1[\/]{5}\d\d[A-Z]{3}\d\d\/[MF]\/\/CASTANEDA\/HAZEL([\/A-Z]+)?-1./g
---------------------------
var regString ='\\d{1,2}.SSRDOCS[0-9A-Z]{2}HK1[\/]{5}\\d\\d[A-Z]{3}\\d\\d\/[MF]\/\/'+curGstNme+'([\/A-Z]+)?-'+pax.slice(0,1)+'\.'
var namedGdocRegEx = new RegExp(regString,"g");
FireBug gives RegExp /\d{1,2}.SSRDOCS[0-9A-Z]{2}HK1[\/]{5}\d\d[A-Z]{3}\d\d\/[MF]\/\/CASTANEDA\/HAZEL([\/A-Z]+)?-1./g
---------------------------
var regString ='\\d{1,2}.SSRDOCS[0-9A-Z]{2}HK1[\\/]{5}\\d\\d[A-Z]{3}\\d\\d\\/[MF]\\/\\/'+curGstNme+'([\\/A-Z]+)?-'+pax.slice(0,1)+'\.'
var namedGdocRegEx = new RegExp(regString,"g");
FireBug gives RegExp /\d{1,2}.SSRDOCS[0-9A-Z]{2}HK1[\/]{5}\d\d[A-Z]{3}\d\d\/[MF]\/\/CASTANEDA\/HAZEL([\/A-Z]+)?-1./g
In regex you need to escape the DOTs since DOT will mean any character.
Use this regex:
regString ='\\d{1,2}\\.SSRDOCS[0-9A-Z]{2}HK1/{5}\\d\\d[A-Z]{3}\\d\\d/[MF]//'+ curGstNme + '([/A-Z]+)?-' + pax.slice(0,1) + '\\.';
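A quick way to check the pattern is to run it against the sample line from the question. In this sketch the values of curGstNme and pax are hypothetical, chosen to fit that sample line, and curGstNme is assumed to contain no regex metacharacters.

```javascript
// Hypothetical sample values, chosen to fit the example line in the question.
var curGstNme = "YEUNG/WINSTON";
var pax = "5.1";

// In a string literal each regex backslash is doubled; forward slashes need no escaping.
var regString = "\\d{1,2}\\.SSRDOCS[0-9A-Z]{2}HK1/{5}\\d\\d[A-Z]{3}\\d\\d/[MF]//" +
  curGstNme + "([/A-Z]+)?-" + pax.slice(0, 1) + "\\.";

// No "g" flag here: with "g", test() keeps state in lastIndex between calls.
var namedGdocRegEx = new RegExp(regString);

var goodLine = "9.SSRDOCSWSHK1/////23NOV96/M//YEUNG/WINSTON/JEREMY-5.1";
var badLine = "9XSSRDOCSWSHK1/////23NOV96/M//YEUNG/WINSTON/JEREMY-5.1";

console.log(namedGdocRegEx.test(goodLine)); // true
console.log(namedGdocRegEx.test(badLine));  // false: the escaped dot does not match "X"
```

Printing the regex object will typically show the slashes as \/, which is only the display form and does not change what the pattern matches.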
Actually, the problem was elsewhere. Apparently Firebug just shows the created regex (in the "watch" panel) with the escaping slashes visible. This made me think the regex was not being created correctly.
Your regex could be reduced like this:
var regString ='\\d{1,2}\\.SSRDOCS[0-9A-Z]{2}HK1/{5}\\d{2}[A-Z]{3}\\d{2}/[MF]//'+curGstNme+'([/A-Z]+)?-'+pax.slice(0,1)+'\\.'
What did I change?
\\d\\d can be written more compactly as \\d{2}.
. replaced with \\. to match a literal dot.
[/]{5} replaced by /{5}.

Regex appears to ignore multiple piped characters

Apologies for the awkward question title, I have the following JavaScript:
var wordRe = new RegExp('\\b(?:(?![<^>"])fox|hello(?![<\/">]))\\b', 'g'); // Words regex
console.log('<span>hello</span> <hello>fox</hello> fox link hello my name is fox'.replace(wordRe, 'foo'));
What I'm trying to do is replace any word that isn't nested in a HTML tag, or part of a HTML tag itself. I.e I want to only match "plain" text. The expression seems to be ignoring the rule for the first piped match "fox", and replacing it when it shouldn't be.
Can anyone point out why this is? I think I might have organised the expression incorrectly (at least the negative lookahead).
Here is the JSFiddle.
I'd also like to add that I am aware of the implications of using regex with HTML :)
For your regex to work, you would want lookbehind. However, as of this writing, this feature is not supported in JavaScript.
Here is a workaround:
Instead of matching what we want, we will match what we don't want and remove it from our input string. Later, we can perform the replace on the cleaned input string.
var nonWordRe = new RegExp('<([^>]+).*?>[^<]+?</\\1>', 'g');
var test = '<span>hello</span> <hello>fox</hello> fox link hello my name is fox';
var cleanedTest = test.replace(nonWordRe, '');
var final = cleanedTest.replace(/fox|hello/g, 'foo'); // once trimmed, final == 'foo link foo my name is foo'
Note:
I built this workaround based on your sample. Here are some points that may need to be explored if you run into them:
you may need to remove self-closing tags (<([^>]+).*?/\>) from the test string
you may need to trim the final string (final)
you may need a decent HTML parser if tags can contain other tags, as HTML allows this.
JavaScript doesn't, again as of this writing, support recursive patterns.
Demo
http://jsfiddle.net/yXd82/2/
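If the tagged parts have to stay in the output, one variation (not the method used in the answer above) is to match either a whole simple element or a target word, and only rewrite the word matches in a replace callback. This is a sketch under the same assumptions as the sample: tags without attributes and without nesting. The names words and tagOrWordRe are made up for the example.

```javascript
var words = ["fox", "hello"]; // words to replace

// First alternative: a simple element with plain text content (captures the tag name).
// Second alternative: one of the target words on word boundaries.
var tagOrWordRe = new RegExp(
  "<([^>]+)>[^<]+?</\\1>|\\b(?:" + words.join("|") + ")\\b",
  "g"
);

var input = "<span>hello</span> <hello>fox</hello> fox link hello my name is fox";

var output = input.replace(tagOrWordRe, function (match, tagName) {
  // tagName is undefined when the word alternative matched.
  return tagName ? match : "foo";
});

console.log(output);
// <span>hello</span> <hello>fox</hello> foo link foo my name is foo
```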

How to use a javascript Regex to globally find and replace the beginning of a string from a textarea

I am trying to replace any non-encoded ampersands in a string in JavaScript and was wondering if this is possible. I have built the regex to detect them in the string, but when I do a replace, I lose the parameter name.
Current input:
http://www.somesite.com/id/2343?paramA=1&paramB=asdf
From a textarea
<textarea id='test-box'>http://www.somesite.com/id/2343?paramA=1&paramB=asdf</textarea>
var str = $('#test-box').val();
var regex = /&[a-z]+=/gi;
str = str.replace(regex, [REPLACE &'s WITH &amp;'s]);
console.log(str);
Desired output:
http://www.somesite.com/id/2343?paramA=1&amp;paramB=asdf
How can I then use JavaScript to keep the name of the parameter and simply replace the '&' with '&amp;'?
Try this regex: /&(?=[a-z]+=)/gi and this replacement: &amp;
This uses a lookahead assertion rather than eating up the parameter name.
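A small sketch of the lookahead approach, using the URL from the question as plain string input (no textarea or jQuery involved). The variable names are made up for the example.

```javascript
var url = "http://www.somesite.com/id/2343?paramA=1&paramB=asdf";

// The lookahead checks that "name=" follows, but does not consume it.
var bareAmp = /&(?=[a-z]+=)/gi;

var encodedUrl = url.replace(bareAmp, "&amp;");
console.log(encodedUrl);
// http://www.somesite.com/id/2343?paramA=1&amp;paramB=asdf

// An already-encoded ampersand is left alone, because "&amp;" is not followed by "name=".
var encodedAgain = encodedUrl.replace(bareAmp, "&amp;");
console.log(encodedAgain === encodedUrl); // true
```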
If you have a URL which might be partially encoded in HTML, and you're trying to make a best effort at producing XHTML validating textarea content, then you can use the list of HTML character references to identify ampersands which are not part of an HTML character reference:
str.replace(/&(?!#(?:[0-9]|[xX][0-9A-Fa-f])|lt;|gt;|amp|...)/g, '&amp;')
where ... is replaced with the set of entities from that list that you care to recognize.
Note that most of those character references end in semicolon, so are not allowed to be followed immediately by an equals sign, so are not ambiguous with URL parameters. Only certain entities can appear without a semicolon for backwards compatibility.
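To make the idea concrete, here is a sketch with a deliberately short list of recognized references (lt;, gt;, amp and quot;). That list is an assumption made for the example; a real one would come from the character reference list mentioned above.

```javascript
// Assumption: only these references are recognized; extend the alternation as needed.
var notAReference = /&(?!#(?:[0-9]|[xX][0-9A-Fa-f])|lt;|gt;|amp|quot;)/g;

var mixed = "x=1&y=2 &lt; &amp; &#38; &#x26; &foo";

// Ampersands that start a recognized reference are kept; all others are encoded.
var escaped = mixed.replace(notAReference, "&amp;");
console.log(escaped);
// x=1&amp;y=2 &lt; &amp; &#38; &#x26; &amp;foo
```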
If you don't care about validating, then you can just let the browser take care of it by ensuring that your URL doesn't contain the substring </textarea by doing something like
str.replace(/</g, '%3c')
Apart from a lookahead assertion, you can also use a backreference:
var regex = /&([a-z]+)=/gi;
str = str.replace(regex, '&amp;$1=');
When $n appears in the replace string, it will be replaced by the n'th parenthesized pattern in the regexp.
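A short sketch of the capture-group version on a hypothetical query string. Because the pattern consumes the "=" as part of the match, the replacement string writes it back after $1.

```javascript
var query = "a=1&b=2&c=3"; // hypothetical input

// $1 re-inserts the captured parameter name; "=" is restored explicitly.
var withEntities = query.replace(/&([a-z]+)=/gi, "&amp;$1=");
console.log(withEntities); // a=1&amp;b=2&amp;c=3
```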
Who needs regex when you've got jQuery html(). Especially since you've got a jquery tag on your question :D
What this does is leverage the browser's innerHTML property.
var str = 'http://www.somesite.com/id/2343?paramA=1&paramB=asd';
$('#test-box').text(str);
$('#html-box').text($('#test-box').html());
