appearing in textarea elements but not in string - javascript

I am working on an autocomplete used inside a textarea. I know there is some autocompletes already created, but anyway.
It works well, but if when I'm typing something and I select one or many characters and delete it, a  appears at the end of my string (or where I was inside it). I tried to replace it while retrieving my html with replaceAll, but it doesn't work (There is not this special char when I use an indexOf). The problem is he doesn't find any result because of this char. Let's see an exemple :
This is my array (a little bit cut but we don't really care)
let array = [{
name: "test",
value: "I'm a test value"
},
{
name: "valueorange",
value: "I'm just an orange"
},
// This is how I get the contents of my span (I tried both innerHTML and innerText, same results).
// Same while using .text() or .html() with jquery
let value = jqElement.find("#searching-span")[0].innerHTML.substring(1).toLowerCase();
value = value.replaceAll(" ", " ");
value = value.replaceAll("", "");
I can replace every without any problems. Finally I check with a loop if there is some value with indexOf on each value, and if it returns anything I push it and get it in a new array. But when I have  I have no results.
Any idea how I can resolve it ?
I tried to be clear, I hope my english wasn't so bad, sorry if I made many mistakes !

Character entities and HTML escaped characters like and  appearing in HTML source code are converted by the HTML parser into unicode characters like \u00a0 and \ufeff before being inserted into the DOM.
If replacing them in JavaScript, use their unicode characters, not HTML escape sequences, to match them in DOM strings. For example:
p.textContent = p.textContent.replaceAll("\ufeff", '*'); // zwj
p.textContent = p.textContent.replaceAll("\xa0", '-'); // nbsp
<p id="p">   </p>
Note that zero width joiners are uses a lot in emoji character sequences and arbitrarily removing may break emoji character decoding (although decoding badly formed emoji strings is almost a prerequisite for handling emojis in the wild).
Second note: I am not suggesting this as a means of circumventing badly decoding characters that have been encoded using a Unicode Transform Format. Making sure decoding is performed correctly is always a better option.

Related

Match between simple delimiters, but not delimiters themselves

I was looking at JSON data that was just in a text file. I don't want to do anything aside from just use regex to get the values in between quotes. I'm just using this as a way to help practice regex and got to this point that seems like it should be simple, but it turns out it's not (at least to me and a few other people at the office). I've matched complicated urls with ease in regex so I'm not completely new to regex. This just seems like a weird case for me.
I've tried:
/(?:")(.*?)(?:")/
/"(.*?)"/
and several others but these got me the closest.
Basically we can forget that it's JSON and just say I want to match the words value and stuff out of "value" and "stuff". Everything I try includes the quotes, so I'd have to clean the strings afterwards of the delimiters or else the string is literally "value" with the quotes.
Any help would be much appreciated, whether this is simple or complicated, I'd love to know! Thanks
Update: Alright so I think I'll go with (?<=")(.*?)(?=") and read things by line without the global setting on so I just get the first match on each line. In my code I was just plopping in a huge string into a var in the code instead of actually opening a file with ajax/filereader or having a form setup to input data. I think I'll mark this as solved, much appreciated!
You have two choices to solve this problem:
Use capturing groups
You can match the delimiters and use capturing groups to get the text within. In this case your two regexes will work, but you need to use access capturing group 1 to get the results (demo). See How do you access the matched groups in a JavaScript regular expression? for how to do that.
Use zero-width assertions
You can use zero-width assertions to match only the text within, require delimiters around them without actually matching them (demo):
(?<=")(.*?)(?=")
but now since I'm not consuming the quotes it'll find instances between each quote, not just between pairs of quotes: e.g., a"b"c" would find b and c.
As for getting just the first match, I think that'll happen by default in JavaScript. You'd have to ask for repeated matching before you see the subsequent ones. So if you process your file one line at a time, you should get what you want.
get the values in between quotes
One thing to keep in mind is that valid JSON accepts escaped quotes inside the quoted values. Therefore, the RegEx should take this into account when capturing the groups which is done with the “unrolling-the-loop” pattern.
var pattern = /"[^"\\]*(?:\\.[^"\\]*)*"/g;
var data = {
"value": "This is \"stuff\".",
"empty": "",
"null": null,
"number": 50
};
var dataString = JSON.stringify(data);
console.log(dataString);
var matched = dataString.match(pattern);
matched.map(item => console.log(JSON.parse(item)));

Match attribute value of XML string in JS

I've researched stackoverflow and find similar results but it is not really what I wanted.
Given an xml string: "<a b=\"c\"></a>" in javascript context, I want to create a regex that will capture the attribute value including the quotation marks.
NOTE: this is similar if you're using single quotation marks.
Currently I have a regular expression tailored to the XML specification:
[_A-Za-z][\w\.\-]*(?:=\"[^\"]*\")?
[_A-Za-z][\w\.\-]* //This will match the attribute name.
(?:=\"[^\"]*\")? //This will match the attribute value.
\"[^\"]*\" //This part concerns me.
My question now is, what if the xml string looks like this:
<shout statement="Hi! \"Richeve\"."></shout>
I know this is a dumb question to ask but I just want to capture rare cases that this scenario might happen (I know the coder can use single quotes on this scenario) but there are cases that we don't know the current value of the attribute given that the attribute value changes dynamically at runtime.
So to make this clearer, the result of that using the correct regex should be:
"Hi! \"Richeve\"."
I hope my question is clear. Thanks for all the help!
PS: Note that the language context is Javascript and I know it is tempting to use lookbehinds but currently lookbehinds are not supported.
PS: I know it is really hard to parse XML but I have an elegant solution to this :) so I just need this small problem to be solved. So this problem only main focus is capturing quotation marked string tokens containing quotation marks inside the string token.
The standard pattern for content with matching delimiters and embedded escaped delimiters goes like this:
"[^"\\]*(?:\\.[^"\\]*)*"
Ignoring the obvious first and last characters in the pattern, here's how the rest of the pattern works:
[^"\\]*: Consume all characters until a delimiter OR backslash (matching Hi! in your example)
(?:\\.[^"\\]*)* Try to consume a single escaped character \\. followed by a series of non delimiter/backslash characters, repeatedly (matching \"Richeve first and then \". next in your example)
That's it.
You can try to use a more generic delimiter approach using (['"]) and back references, or you can just allow for an alternate pattern with single quotes like so:
("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')
Here's another description of this technique that might also help (see the section called Strings): http://www.regular-expressions.info/examplesprogrammer.html
Description
I'm pretty really sure embedding double quotes inside a double quoted attribute value is not legal. You could use the unicode equivalent of a double quote \x22 inside the value.
However to answer the question, this expression will:
allow escaped quotes inside attribute values
capture the attribute statement 's value
allow attributes to appear in any order inside the tag
will avoid many of the edge cases which will trip up pattern matching inside html text
doesn't use lookbehinds
<shout\b(?=\s)(?=(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*?\sstatement=(['"])((?:\\['"]|.)*?)\1(?:\s|\/>|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/shout>
Example
Pretty Rubular
Ugly RegexPlanet set to Javascript
Sample Text
Note the difficult edge case in the first attribute :)
<shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
Matches
Group 0 gets the entire tag from open to close
Group 1 gets the quote surrounding the statement attribute value, this is used to match the closing quote correctly
Group 2 gets the statement attribute value which may include escaped quotes like \" but not including the surrounding quotes
[0][0] = <shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
[0][1] = "
[0][2] = Hi! \"Richeve\".

Need to identify the non-matching character in regex

We use a regex to test for 'illegal' characters when a user provides a 'Version Name' before we save their content. The accepted characters are: A-Z, 0-9 and blank space. We test this using the following:
var version_name = document.getElementById('txtSaveVersionName').value;
if(version_name.search(/[^A-Za-z0-9\s]/)!= -1){
alert("Warning illegal characters have been removed etc");
version_name.replace(/[^A-Za-z0-9\s]/g,'');
document.getElementById('txtSaveVersionName').value = version_name;
}
This works fine when a user keys their version name. However the version name can also be populated from data taken from a dynamically populated select box - version names loaded in from our system.
When this occurs, the regexp throws out the space in the name. So "My Version" becomes "MyVersion"? This does not occur when the user types "My Version".
So it appears that the value taken from the select box contains a character that looks like a space but is not. I have copied this value from the text box into a unicode converter (http://rishida.net/tools/conversion/) that identifies the characters underlying values and both sets are reported as 0020 (space), yet only ones throws an exception??
Is there a way to identify what the character is that is causing this issue?
Any thoughts greatly appreciated!
Cheers
Mark
Try:
var str= getSelectBoxValue();
var rez = "";
for (var i=0;i<str.length;i++)
rez = rez+str[i]+"["+str.charCodeAt(i)+"]";
alert(rez);
It should give you the unicode values of all the characters in the string the way Javascript sees them. When you copy it from the screen, it could be the browser/OS that converts some weird UTF character into regular "0x20" character for some reason.
I noticed you have a bug in your code:
version_name.replace(/[^A-Za-z0-9\s]/g,'');
Should be
version_name = version_name.replace(/[^A-Za-z0-9\s]/g,'');
As, of course, replace creates a new string, it doesn't modify the existing string.
As you are finding that the replace sometimes works and sometimes doesn't I
would suspect that you have implimented this correctly in one place and incorrectly in another.

Remove a long dash from a string in JavaScript?

I've come across an error in my web app that I'm not sure how to fix.
Text boxes are sending me the long dash as part of their content (you know, the special long dash that MS Word automatically inserts sometimes). However, I can't find a way to replace it; since if I try to copy that character and put it into a JavaScript str.replace statement, it doesn't render right and it breaks the script.
How can I fix this?
The specific character that's killing it is —.
Also, if it helps, I'm passing the value as a GET parameter, and then encoding it in XML and sending it to a server.
This code might help:
text = text.replace(/\u2013|\u2014/g, "-");
It replaces all – (–) and — (—) symbols with simple dashes (-).
DEMO: http://jsfiddle.net/F953H/
That character is call an Em Dash. You can replace it like so:
str.replace('\u2014', '');​​​​​​​​​​
Here is an example Fiddle: http://jsfiddle.net/x67Ph/
The \u2014 is called a unicode escape sequence. These allow to to specify a unicode character by its code. 2014 happens to be the Em Dash.
There are three unicode long-ish dashes you need to worry about: http://en.wikipedia.org/wiki/Dash
You can replace unicode characters directly by using the unicode escape:
'—my string'.replace( /[\u2012\u2013\u2014\u2015]/g, '' )
There may be more characters behaving like this, and you may want to reuse them in html later. A more generic way to to deal with it could be to replace all 'extended characters' with their html encoded equivalent. You could do that Like this:
[yourstring].replace(/[\u0080-\uC350]/g,
function(a) {
return '&#'+a.charCodeAt(0)+';';
}
);
With the ECMAScript 2018 standard, JavaScript RegExp now supports Unicode property (or, category) classes. One of them, \p{Dash}, matches any Unicode character points that are dashes:
/\p{Dash}/gu
In ES5, the equivalent expression is:
/[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g
See the Unicode Utilities reference.
Here are some JavaScript examples:
const text = "Dashes: \uFF0D\uFE63\u058A\u1400\u1806\u2010-\u2013\uFE32\u2014\uFE58\uFE31\u2015\u2E3A\u2E3B\u2053\u2E17\u2E40\u2E5D\u301C\u30A0\u2E1A\u05BE\u2212\u207B\u208B\u3030𐺭";
const es5_dash_regex = /[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g;
console.log(text.replace(es5_dash_regex, '-')); // Normalize each dash to ASCII hyphen
// => Dashes: ----------------------------
To match one or more dashes and replace with a single char (or remove in one go):
/\p{Dash}+/gu
/(?:[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD)+/g

Is there a javascript function that converts characters to the &code; equivalent?

I have text that was created created with CKeditor, and it seems to be inserting where spaces should be. And it appears to do similar conversions with >, <, &, etc. which is fine, except, when I make a DOMSelection, those codes are removed.
So, this is what is selected:
beforeHatchery (2)
But this is what is actually in the DOM:
beforeHatchery (2)
note that I outputted the selection and the original text stored in the database using variable.inspect, so all the quotes are escaped (they wouldn't be when sent to the browser).
To save everyone the pain of looking for the difference:
From the first: Hatchery</a> (2) (The Selection)
From the second: Hatchery</a> (2) (The original)
These differences are at the very end of the selection.
So... there are three ways, I can see of, to approach this.
1) - Replace all characters commonly replaced with codes with their codes,
and hope for the best.
2) - Javascript may have some uncommon function / a library may exist that
replaces these characters for me (I think this might be the way CKeditor
does its character conversion).
3) - Figure out the way CKeditor converts and do the conversion exactly that way.
I'm using Ruby on Rails, but that shouldn't matter for this problem.
Some other things that I found out that it converts:
1: It seems to only convert spaces to if the space(s) is before or after a tag:
e.g.: "With quick <a href..."
2: It changes apostrophes to the hex value
e.g.: "opponent's"
3: It changes "&" to "&"
4: It changes angle brackets to ">" and "<" appropriately.
Does anyone have any thoughts on this?
To encode html entities in str (your question title asks for this, if I understand correctly):
$('<div/>').text(str).html();
To decode html entities in str:
$('<div/>').html(str).text();
These rely on jQuery, but vanilla alternatives are basically the same but more verbose.
To encode html entities in str:
var el = document.createElement('div');
el.innerText = str;
el.innerHTML;
To decode html entities in str:
var el = document.createElement('div');
el.innerHTML = str;
el.innerText;
Conversion of spaces to is usually done by the browser while editing content.
Conversion of ' to ' can be controled with http://docs.cksource.com/ckeditor_api/symbols/CKEDITOR.config.html#.entities_additional
and 4. are usually needed to avoid breaking code that it's written in design view when loading again that content. You can try to change http://docs.cksource.com/ckeditor_api/symbols/CKEDITOR.config.html#.basicEntities but that usually can lead to problems in the future.

Categories

Resources