Need to identify the non-matching character in regex - javascript

We use a regex to test for 'illegal' characters when a user provides a 'Version Name' before we save their content. The accepted characters are: A-Z, 0-9 and blank space. We test this using the following:
var version_name = document.getElementById('txtSaveVersionName').value;
if(version_name.search(/[^A-Za-z0-9\s]/)!= -1){
    alert("Warning illegal characters have been removed etc");
    version_name.replace(/[^A-Za-z0-9\s]/g,'');
    document.getElementById('txtSaveVersionName').value = version_name;
}
This works fine when a user keys in their version name. However, the version name can also be populated from a dynamically populated select box that holds version names loaded from our system.
When this occurs, the regexp strips the space from the name, so "My Version" becomes "MyVersion". This does not occur when the user types "My Version".
So it appears that the value taken from the select box contains a character that looks like a space but is not. I have copied this value from the text box into a Unicode converter (http://rishida.net/tools/conversion/) that identifies the characters' underlying values, and both sets are reported as 0020 (space), yet only one of them is rejected.
Is there a way to identify what the character is that is causing this issue?
Any thoughts greatly appreciated!
Cheers
Mark

Try:
var str = getSelectBoxValue();
var rez = "";
for (var i = 0; i < str.length; i++)
    rez = rez + str[i] + "[" + str.charCodeAt(i) + "]";
alert(rez);
It should give you the Unicode values of all the characters in the string the way JavaScript sees them. When you copy the text from the screen, the browser or OS may convert some unusual Unicode character into a regular "0x20" space for some reason.
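A variation on the loop above, as a minimal sketch: instead of dumping every character, it reports only the ones the validation regex would strip. The helper name findIllegalChars is made up for this example, and the sample input with a zero width space is just an illustration, not a claim about what the select box actually contains.

```javascript
// Sketch: list every character the validation regex would strip,
// with its position and Unicode code point.
// findIllegalChars is a hypothetical helper, not part of the original code.
function findIllegalChars(text) {
    var illegal = [];
    var pattern = /[^A-Za-z0-9\s]/g; // same character class as the question
    var match;
    while ((match = pattern.exec(text)) !== null) {
        var hex = match[0].charCodeAt(0).toString(16).toUpperCase();
        illegal.push({
            index: match.index,
            codePoint: "U+" + ("0000" + hex).slice(-4)
        });
    }
    return illegal;
}

// Example input: a zero width space (U+200B) hiding inside the name.
// This reports one entry: index 2, code point U+200B.
console.log(findIllegalChars("My\u200B Version"));
```

Calling it on the value read from the select box should show which code point, if any, the regex is actually rejecting.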

I noticed you have a bug in your code:
version_name.replace(/[^A-Za-z0-9\s]/g,'');
Should be
version_name = version_name.replace(/[^A-Za-z0-9\s]/g,'');
As, of course, replace creates a new string, it doesn't modify the existing string.
As you are finding that the replace sometimes works and sometimes doesn't, I would suspect that you have implemented this correctly in one place and incorrectly in another.
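To make the point concrete, here is a small sketch that keeps the result of replace() and also reports whether anything was removed. The function name sanitizeVersionName and its return shape are assumptions for illustration only.

```javascript
// Sketch: strings are immutable, so replace() returns a new string
// that has to be captured. sanitizeVersionName is a hypothetical helper.
function sanitizeVersionName(name) {
    var cleaned = name.replace(/[^A-Za-z0-9\s]/g, "");
    return { cleaned: cleaned, changed: cleaned !== name };
}

var result = sanitizeVersionName("My Version #2!");
console.log(result.cleaned); // "My Version 2"
console.log(result.changed); // true
```

The changed flag could drive the warning alert, so the message only appears when characters were really dropped.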

Related

&#65279; appearing in textarea elements but not in string

I am working on an autocomplete used inside a textarea. I know some autocompletes already exist, but anyway.
It works well, but when I'm typing something and I select one or more characters and delete them, a &#65279; appears at the end of my string (or wherever I was inside it). I tried to remove it with replaceAll while retrieving my HTML, but it doesn't work (indexOf doesn't find this special character either). The problem is that the search doesn't find any results because of this character. Here is an example:
This is my array (cut down a little, but the rest doesn't matter):
let array = [{
        name: "test",
        value: "I'm a test value"
    },
    {
        name: "valueorange",
        value: "I'm just an orange"
    },
    // (remaining entries cut)
];
// This is how I get the contents of my span (I tried both innerHTML and innerText, same results).
// Same while using .text() or .html() with jquery
let value = jqElement.find("#searching-span")[0].innerHTML.substring(1).toLowerCase();
value = value.replaceAll("&nbsp;", " ");
value = value.replaceAll("&#65279;", "");
I can replace every &nbsp; without any problems. Finally, I loop over the array and call indexOf on each entry; whenever it finds the value I push the entry into a new array. But when the string contains &#65279; I get no results.
Any idea how I can resolve this?
I tried to be clear; I hope my English wasn't too bad, sorry if I made many mistakes!
Character entities and HTML escaped characters like &nbsp; and &#65279; appearing in HTML source code are converted by the HTML parser into Unicode characters like \u00a0 and \ufeff before being inserted into the DOM.
If replacing them in JavaScript, use their Unicode characters, not HTML escape sequences, to match them in DOM strings. For example:
p.textContent = p.textContent.replaceAll("\ufeff", '*'); // zero width no-break space (BOM)
p.textContent = p.textContent.replaceAll("\xa0", '-'); // nbsp
<p id="p">   </p>
Note that U+FEFF is the zero width no-break space (also used as the byte order mark), not the zero width joiner (U+200D). Zero width joiners are used a lot in emoji character sequences, and removing them arbitrarily may break emoji decoding (although decoding badly formed emoji strings is almost a prerequisite for handling emojis in the wild).
Second note: I am not suggesting this as a means of working around characters that were decoded badly after being encoded with a Unicode Transformation Format. Making sure decoding is performed correctly is always a better option.
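Applied to the autocomplete from the question, the idea could look roughly like this. It is only a sketch: normalizeSearchText and findMatches are hypothetical helper names, and the search is assumed to run against the name field of each entry.

```javascript
// Hypothetical helper: normalise text read from the DOM before searching.
function normalizeSearchText(raw) {
    return raw
        .replace(/\uFEFF/g, "")   // drop zero width no-break spaces
        .replace(/\u00A0/g, " ")  // turn no-break spaces into plain spaces
        .toLowerCase();
}

// Hypothetical helper: keep the entries whose name contains the typed text.
function findMatches(items, raw) {
    var needle = normalizeSearchText(raw);
    return items.filter(function (item) {
        return item.name.indexOf(needle) !== -1;
    });
}

var items = [
    { name: "test", value: "I'm a test value" },
    { name: "valueorange", value: "I'm just an orange" }
];

// The stray U+FEFF no longer prevents the match on "valueorange".
console.log(findMatches(items, "Oran\uFEFFge"));
```

Matching on the escaped code points (\uFEFF, \u00A0) rather than on entity text is what makes this work on strings that come out of the DOM.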

Regex in Google Apps Script practical issue. Forms doesn't read regex as it should

I hope it's just something I'm not doing right.
I've been using a simple script to create a form out of a spreadsheet. The script seems to be working fine. The output form is going to collect inputs from third parties so I can analyze them in my consulting activity.
Creating the form was not a big deal, the structure is good to go. However, once the form creator script was working, I started on its validations, and that's where I'm stuck.
For text validations, I need to use specific regexes. Many of the inputs my clients need to give me are going to be places' and/or people's names, so I should only allow them to use A-Z, single spaces, apostrophes and dashes.
My resulting regexes are:
//Regex allowing a **single name** with the first letter capitalized and the occasional use of "apostrophes" or "dashes".
const reg1stName = /^[A-Z]([a-z\'\-])+/
//Should allow (a single name/surname) like Paul, D'urso, Mac'arthur, Saint-Germaine etc.
//Regex allowing **composite names and places names** with the first letter capitalized and the occasional use of "apostrophes" or "dashes". It must avoid double spaces, however.
const regNamesPlaces = /^[^\s]([A-Z]|[a-z]|\b[\'\- ])+[^\s]$/
//This should allow (names/surnames/places' names) like Giulius Ceasar, Joanne D'arc, Cosimo de'Medici, Cosimo de Medici, Jean-jacques Rousseau, Firenze, Friuli Venezia-giulia, L'aquila etc.
Further in the script, these regexes are used as validation patterns for the form's text items, according to each case.
//Validation for single names
var val1stName = FormApp.createTextValidation()
    .setHelpText("Only the person First Name Here! Use only (A-Z), a single apostrophe (') or a single dash (-).")
    .requireTextMatchesPattern(reg1stName)
    .build();
//Validation for composite names and places names
var valNamesPlaces = FormApp.createTextValidation()
    .setHelpText("Careful with double spaces, ok? Use only (A-Z), a single apostrophe (') or a single dash (-).")
    .requireTextMatchesPattern(regNamesPlaces)
    .build();
Further yet, I have a "for" loop that creates the form based on the spreadsheet's fields. Up to this point, things are working just fine.
for (var i = 0; i < numberRows; i++) {
    var questionType = data[i][0];
    if (questionType == '') {
        continue;
    }
    else if (questionType == 'TEXTNamesPlaces') {
        form.addTextItem()
            .setTitle(data[i][1])
            .setHelpText(data[i][2])
            .setValidation(valNamesPlaces)
            .setRequired(false);
    }
    else if (questionType == 'TEXT1stName') {
        form.addTextItem()
            .setTitle(data[i][1])
            .setHelpText(data[i][2])
            .setValidation(val1stName)
            .setRequired(false);
    }
}
The problem appears when I run the script and test the resulting form.
Both validation types get imported just fine (as can be seen in the form's edit mode), but when testing in preview mode I get an error, as if the regex wasn't matching (sorry, the error message is in Portuguese; I forgot to translate it as I did with the code above):
A screenshot of the form in edit mode
A screenshot of the form in preview mode
However, if I manually remove the slashes ("//") from this regex, it starts working!
A screenshot of the form in edit mode, regex without slashes
A screenshot of the form in preview mode, regex without slashes
What am I doing wrong? I'm no professional dev, but in my understanding it makes no sense to write a regex without slashes.
If this is some Google Forms convention for reading regexes, I still need all of this to be read by the Apps Script that creates the form. If I try to pass the regex without the slashes there, the script can't even be parsed.
const reg1stName = ^[A-Z]([a-z\'])+
const regNamesPlaces = ^[^\s]([A-Z]|[a-z]|\b[\'\- ])+[^\s]$
//Can't even be saved. Returns: SyntaxError: Unexpected token '^' (line 29, file "Code.gs")
Entering all the validations manually is not an option. Can anybody help me?
Thanks so much
This
/^[A-Z]([a-z\'\-])+/
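will not work because requireTextMatchesPattern expects the pattern as a string, so the / delimiters of the regex literal end up inside the pattern and are matched as literal characters.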
This
^[A-Z]([a-z\'\-])+
also will not work, because if the name is hyphenated, you will only match up to the hyphen. This will match the 'Some-' in 'Some-Name', for example. Also, perhaps you want a name like 'Saint John' to pass also?
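One way to keep a regex literal in the script while handing Forms a bare pattern is the standard source property of a RegExp, which holds the pattern text without the enclosing slashes. This is a sketch under the assumption (suggested by the question's own experiment) that the validator wants the pattern as a plain string; the variable name pattern1stName is made up here.

```javascript
// The regex literal from the question, usable as-is inside the script.
const reg1stName = /^[A-Z]([a-z\'\-])+/;

// .source is the pattern text without the enclosing slashes or flags.
const pattern1stName = reg1stName.source;
console.log(pattern1stName);

// In Apps Script the string could then be passed on. This part is left
// as a comment because FormApp only exists inside Google Apps Script:
// FormApp.createTextValidation()
//     .requireTextMatchesPattern(pattern1stName)
//     .build();
```

Writing the pattern directly as a string literal would work too, but then every backslash has to be doubled, which is easy to get wrong.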
I recommend the following :)
^[A-Z][a-z]*[-\.' ]?[A-Z]?[a-z]*
^ anchors to the start of the string
[A-Z] matches exactly 1 capital letter
[a-z]* matches zero or more lowercase letters (this enables you to match a name like D'Urso)
[-\.' ]? matches zero or 1 instances of - (hyphen), . (period), ' (apostrophe) or a single space (inside a character class the . (period) is already literal, so the backslash before it is optional but harmless)
[A-Z]? matches zero or 1 capital letter (in case there's a second capital in the name, like D'Urso, St John, Saint-Germaine)
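The pieces listed above can be tried out in plain JavaScript before putting the pattern into the form. In this sketch a trailing $ is added, which is not part of the recommended pattern; it is there so that test() has to match the whole input rather than just a prefix. The sample names are taken from the question and the answer.

```javascript
// The recommended pattern as a string, with a trailing $ added
// (an assumption of this sketch) so the whole input must match.
const namePattern = "^[A-Z][a-z]*[-\\.' ]?[A-Z]?[a-z]*$";
const nameRegex = new RegExp(namePattern);

const samples = [
    "Paul",                  // accepted
    "D'Urso",                // accepted
    "Saint-Germaine",        // accepted
    "St John",               // accepted
    "paul",                  // rejected: no leading capital
    "Paul  Smith",           // rejected: double space
    "Jean-jacques Rousseau"  // rejected: more than two parts
];

samples.forEach(function (name) {
    console.log(name + " -> " + nameRegex.test(name));
});
```

The last sample shows a limit of this pattern: it allows at most one separator, so names with three parts would need a repeated group instead.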

Arabic text zero width joiners not working between elements

I am trying to implement a "Smart Search" feature which highlights text matches in a div as a user types a keyword. The highlighting works by using a regular expression to match the keyword in the div and replace it with
<span class="highlight">keyword</span>
The application supports both English and Arabic text. English works just fine, but when highlighting Arabic, the span "breaks" the connection between the letters instead of leaving a single continuous word.
I'm trying to fix the issue by using 3 separate Regex expressions and adding zero width joiners appropriately to each case:
Match at the Beginning of a word
var startsWithRegex = new RegExp("((^|\\s)" + keyword + ")", "gi");
var newSpan = "<span class='highlight'>$1‍</span>‍";
Match in the Middle of a word (Note: There can be multiple middleOf matches in a single word)
var middleOfRegex = new RegExp("([^(^|\\s)])(" + keyword + ")([^($|\\s)])", "gi");
var newSpan = "‍$1‍<span class='highlight'>‍$2‍</span>‍$3‍";
Match at the End of a word
var endsWithRegex = new RegExp("(" + keyword + "($|\\s))", "gi");
var newSpan = "‍<span class='highlight'>‍$1</span>";
Both startsWithRegex and endsWithRegex appear to work as expected, but middleOfRegex is not. For example:
للأبد
transforms into:
ل‍‍ل‍‍أ‍بد
when the keyword is:
ل
I've tried various other combinations of zero width joiners (&zwj;) but nothing seems to work. Is this a limitation of WebKit? Is there another implementation I can use to get my desired result?
Thanks!
A few extra notes:
This is only happening for Webkit based browsers (Chrome specifically in my case) and we cannot use an alternative. I believe this bug is the root cause of the issue:
https://bugs.webkit.org/show_bug.cgi?id=6148
This question is an extension on these two stackoverflow questions:
Inserting HTML tag in the middle of Arabic word breaks word connection (cursive)
Partially colored Arabic word in HTML
Arabic is a special case because each letter has different forms depending on its position in the word. I remember solving a similar problem using Unicode values: each form of a letter has its own code point.
You can find the Unicode table here
https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
You can get the Unicode value using
var code = $(selector).text().charCodeAt(0);
I suggest not to separate this ligature, but to extend the <span> tag to enclose the entire lam+alif structure for highlighting.
According to http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237, ZWJ works as ZWJ+ZWNJ+ZWJ between ل(lam) and ا(alif). It should be rendered as a connected lam followed by a connected alif (ل‍‌‍ا), not like the required ligature (لا).
Seems to me most browsers/fonts adhere to this requirement.
My answer applies to other ligatures as well, if you use them in your application (non-required ones, e.g.: mim + mim).
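A rough sketch of the "extend the span" suggestion: when the keyword ends in lam and the next character is an alif, the alif is pulled into the highlighted span so the ligature is never split. The helper names and the list of alif forms are assumptions of this example, it only handles the case where the match ends with lam, and it does not deal with joiners between other letters.

```javascript
var LAM = "\u0644";
// Assumed set of alif forms that ligate with lam:
// alif with madda, with hamza above, with hamza below, and plain alif.
var ALIF_FORMS = "\u0622\u0623\u0625\u0627";

// Escape regex metacharacters in user input before building a RegExp.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap each keyword match in a highlight span and,
// if the keyword ends with lam, include a directly following alif.
function highlightKeepingLamAlif(text, keyword) {
    if (!keyword) {
        return text;
    }
    var source = escapeRegExp(keyword);
    if (keyword.charAt(keyword.length - 1) === LAM) {
        source += "[" + ALIF_FORMS + "]?";
    }
    return text.replace(new RegExp(source, "g"), "<span class='highlight'>$&</span>");
}

// The word from the question (lam, lam, alif with hamza, ba, dal).
var word = "\u0644\u0644\u0623\u0628\u062F";
console.log(highlightKeepingLamAlif(word, LAM));
```

For the sample word, the first lam is highlighted on its own and the second lam is highlighted together with the alif that follows it.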

Strange unselectable character in JavaScript output

I have a JavaScript object which I'm using to populate a form element (with jQuery):
var attribute = { name : '̈́Type' };
$('#container').html('<input type="text" value="'+attribute.name+'"/>');
But the output shows a strange character which is not selectable:
This character is also present when trying:
alert(attribute.name); //in Firefox
console.log(attribute.name); //in Chrome
My JavaScript file has UTF8 encoding.
What is this character and how do I make it go away?
The strange character is a unicode diacritic (\u0344) and is applied to the first single quote ' on the { name : '̈́Type' }declaration.
Just delete the offending single quote and retype it.
You have got something akin to this:
var strange_character = ' \u0344';
var attribute = { name : 'Type' };
$('#container').html('<input type="text" value="'+strange_character + attribute.name+'"/>');
This strange character has the code point U+0344 and is called COMBINING GREEK DIALYTIKA TONOS.
Description:
U+0344 was added to Unicode in version 1.1. It belongs to the block
Combining Diacritical Marks in the Basic Multilingual Plane.
This character is a Nonspacing Mark and inherits its script property
from the preceding character.
The glyph is a Canonical composition of the glyphs U+0301 and U+0308. It has a
Ambiguous East Asian Width. In bidirectional context it acts as
Nonspacing Mark and is not mirrored. In text U+0344 behaves as
Combining Mark regarding line breaks. It has type Extend for sentence
and Extend for word breaks. The Grapheme Cluster Break is Extend.
REF: http://codepoints.net/U+0344
If you zoom in really close on your question, you will see it's already there. You just need to retype it.
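If retyping by eye feels unreliable, stray combining marks can also be located programmatically. This is a minimal sketch using a Unicode property escape (\p{M}, which needs the u flag and a reasonably modern JavaScript engine); findCombiningMarks is a hypothetical helper name.

```javascript
// Hypothetical helper: report every combining mark in a string,
// with its position and code point.
function findCombiningMarks(text) {
    var found = [];
    var pattern = /\p{M}/gu; // any character in the Unicode "Mark" category
    var match;
    while ((match = pattern.exec(text)) !== null) {
        var hex = match[0].codePointAt(0).toString(16).toUpperCase();
        found.push({
            index: match.index,
            codePoint: "U+" + hex.padStart(4, "0")
        });
    }
    return found;
}

// A string that starts with U+0344, like the value in the question.
// This reports one entry: index 0, code point U+0344.
console.log(findCombiningMarks("\u0344Type"));
```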

Why is regex failing when input contains a newline?

I've inherited this javascript regex from another developer and now, even though nothing has changed, it doesn't seem to match the required text. Here is the regex:
/^.*(already (active|exists|registered)).*$/i
I need it to match any text that looks like
stuff stuff already exists more stuff etc
It looks perfectly fine to me, it only looks for those 2 words together and should in theory ignore the rest of the string. In my script I check the text like this
var cardUsedRE = /^.*(already (active|exists|registered)).*$/i;
if (cardUsedRE.test(responseText)) {
    mdiv.className = 'userError';
    mdiv.innerHTML = 'The card # has already been registered';
    document.getElementById('cardErrMsg').innerHTML = arrowGif;
}
I've stepped through this in FireBug and I've seen it fail to test this string:
> Error: <detail>Card number already registered for CLP.\n</detail>
Am I missing something? What is the likely issue with this?
Here's a simplified but functionally-equivalent regex that should handle newlines:
/(already\s+(active|exists|registered))/i
Not sure why you'd ever want to lead with ^.* or end with .*$ unless your goal is specifically to prevent newlines. Otherwise it's just superfluous.
EDIT: I replaced the space with \s+ so it will be more liberal with how it handles whitespace (e.g. one space, two spaces, a tab, etc. should all match).
tl;dr: Use the m modifier so that ^ and $ match at line boundaries. Note that . still never matches a newline unless the s (dotAll) flag is set. See the MDN regular expression documentation.
Failing (note the "\n" in the string literal):
var str = "Error: <detail>Card number already registered for CLP.\n</detail>"
str.match(/^.*(already (active|exists|registered)).*$/i)
Working (note the m flag, which lets $ match just before the "\n"):
var str = "Error: <detail>Card number already registered for CLP.\n</detail>"
str.match(/^.*(already (active|exists|registered)).*$/mi)
I would use a simpler form, however: (Adjust for definition of "space".)
var str = "Error: <detail>Card number already registered for CLP.\n</detail>";
str.match(/(?:already\s+(?:active|exists|registered))/i)
Happy coding.
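The variants discussed in these answers can be compared side by side. This is just a sketch for experimenting; the variable names are made up, and the s (dotAll) flag assumes an engine with ES2018 support.

```javascript
var str = "Error: <detail>Card number already registered for CLP.\n</detail>";

// Original: $ only matches at the very end, and . cannot cross the "\n".
var original = /^.*(already (active|exists|registered)).*$/i;
// m flag: ^ and $ also match at line boundaries.
var multiline = /^.*(already (active|exists|registered)).*$/im;
// s flag: . matches newlines too.
var dotAll = /^.*(already (active|exists|registered)).*$/is;
// No anchors at all: newlines elsewhere in the string are irrelevant.
var unanchored = /already\s+(?:active|exists|registered)/i;

console.log(original.test(str));   // false
console.log(multiline.test(str));  // true
console.log(dotAll.test(str));     // true
console.log(unanchored.test(str)); // true
```

The unanchored form is the least surprising of the four, since it does not depend on where line breaks happen to fall in the response text.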
