JS - Iterate through text snippet character by character, skipping certain characters - javascript

I am using JS to loop through a given text, referred to in the pseudocode below as "input_a".
Based on the contents of another, separate text, "input_b", I would like to manipulate the individual characters of "input_a" by assigning each of them a boolean value.
So far I've approached it the following way:
for (i=0; i < input_a.length; i++) {
if (input_b[i] == 0){
//do something to output
}
}
Now the issue with this is that the above loop, since it uses .length, also includes all blank/special characters, whereas I would only like to include A-Z, omitting all special characters (which in this case should not receive the boolean assigned to them).
How could I approach this efficiently and elegantly - and hopefully without reinventing the wheel or creating my own alphabet array?
Edit 1: I forgot to mention that the position of the special characters needs to be retained when the manipulated input_a is finally delivered as output. This makes an initial removal of all special characters from input_a a non-viable option.

It sounds like you want input_a to retain only alphabetical characters - you can transform it easily with a regular expression. Match all non-alphabetical characters, and replace them with the empty string:
const input_a = 'foo_%%bar*&#baz';
const sanitizedInputA = input_a.replace(/[^a-z]+/gi, '');
console.log(sanitizedInputA);
// iterate through sanitizedInputA
If you want the same to happen with input_b before processing it, just use the same .replace that was used on input_a.
If you need the respective indices to stay the same, then you can do a similar regular expression test while iterating: if the character being iterated over isn't alphabetical, just continue:
const input_a = 'foo_%%bar*&#baz';
for (let i = 0; i < input_a.length; i++) {
if (!/[a-z]/i.test(input_a[i])) continue;
console.log(input_a[i]);
}

You can check if the character at the current position is a letter, something like:
for (i=0; i < input_a.length; i++) {
if(/[a-z]/i.test(input_a[i])){
if (input_b[i] == 0){
//do something to output
}
}
}
The /[a-z]/i regex matches both upper and lower case letters.

Edited as per Edit 1 of the OP
If you would like to do this without RegEx you can use this function:
function isSpecial(char) {
  // A-Z letters have distinct lower- and uppercase forms; digits, whitespace
  // and punctuation do not, so they are treated as special here.
  return char.toLowerCase() === char.toUpperCase();
}
You can then call this function for each character as it comes into the loop.
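Putting the answers above together, here is a minimal sketch of one way to apply the flags while keeping special characters in place. It assumes input_b holds one '0' or '1' flag per letter of input_a (so special characters do not consume a flag), and it uses uppercasing as a stand-in for "do something to output". The name applyFlags is hypothetical.

```javascript
// Hypothetical helper: the name and the uppercase transformation are illustrative only.
function applyFlags(inputA, inputB) {
  let letterIndex = 0; // counts letters only, so specials do not consume a flag
  let output = '';
  for (const ch of inputA) {
    if (/[a-z]/i.test(ch)) {
      // Assumption: inputB holds one '0'/'1' flag per letter of inputA.
      output += inputB[letterIndex] === '1' ? ch.toUpperCase() : ch;
      letterIndex++;
    } else {
      output += ch; // special characters keep their original position
    }
  }
  return output;
}

console.log(applyFlags('foo_%%bar', '101010')); // FoO_%%bAr
```

If the flags in input_b are instead aligned with the raw positions in input_a, index input_b with the loop position rather than a separate letter counter.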

Related

JavaScript not removing text when a uppercase letter involved

So I have a text box on my website and I have coded this to prevent certain words from being used.
window.onload = function() {
var banned = ['MMM', 'XXX'];
document.getElementById('input_1_17').addEventListener('keyup', function(e) {
var text = document.getElementById('input_1_17').value;
for (var x = 0; x < banned.length; x++) {
if (text.toLowerCase().search(banned[x]) !== -1) {
alert(banned[x] + ' is not allowed!');
}
var regExp = new RegExp(banned[x]);
text = text.replace(regExp, '');
}
document.getElementById('input_1_17').value = text;
}, false);
}
The code works perfectly and removes the text from the text box when all the letters typed are lowercase. The problem is when the text contained an uppercase letter it will give the error but the word will not be removed from the text box.
The RegExp is a good direction; you just need some flags (i to make it case-insensitive, and g to make it global, so that all occurrences are replaced):
var text="Under the xxx\nUnder the XXx\nDarling it's MMM\nDown where it's mmM\nTake it from me";
console.log("Obscene:",text);
var banned=["XXX","MMM"];
banned.forEach(nastiness=>{
text=text.replace(new RegExp(nastiness,"gi"),"");
});
console.log("Okay:",text);
Normally you should call .toLowerCase() on both sides when comparing strings, so that they match regardless of case.
But the problem actually comes from the RegExp you are using, which is case-sensitive by default; you just need to add the i flag to it (and g to replace every occurrence):
var regExp = new RegExp(banned[x], 'gi');
text = text.replace(regExp, '');
Note also that using an alert() in a loop is not recommended; you can change your logic to report all the matched items in a single alert().
You seem to have been expecting something unreasonable. Lowercase strings will never match strings containing uppercase letters.
Either convert both for comparison or use lowercase banned strings. The former would be more reliable, taking future human error out of the process.
What you can do is actually convert both variables to either all caps or all lowercase.
if (text.toLowerCase().includes(banned[x].toLowerCase())) {
alert(banned[x] + ' is not allowed!');
}
Not tested, but it should work. There is no need to use search since you don't need the index anyway; using includes is cleaner.
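The suggestions in this thread (case-insensitive global replacement, plus collecting matches so they can be reported once) could be combined roughly as follows. This is a sketch, not the original poster's code: removeBanned and escapeRegExp are hypothetical names, and escaping the banned words is an added precaution in case a word contains regex metacharacters.

```javascript
// Hypothetical helper: escape regex metacharacters so a banned word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: removes every banned word, ignoring case,
// and returns the cleaned text plus the list of words that were found.
function removeBanned(text, banned) {
  const found = [];
  for (const word of banned) {
    const re = new RegExp(escapeRegExp(word), 'gi');
    const cleaned = text.replace(re, '');
    if (cleaned !== text) found.push(word); // something was removed
    text = cleaned;
  }
  return { text, found };
}

const result = removeBanned('aXxXb mmM c', ['MMM', 'XXX']);
console.log(result.text);  // "ab  c" (two spaces remain where mmM was removed)
console.log(result.found); // [ 'MMM', 'XXX' ]
```

In the keyup handler, result.found can then be joined into one message and shown with a single alert() instead of one alert per word.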

Split Kannada word into syllabic clusters

We are wondering if there is any method to split a Kannada word to get the syllabic clusters using JavaScript.
For example, I want to split the word ಕನ್ನಡ into the syllabic clusters ["ಕ", "ನ್ನ", "ಡ"]. But when I split it with split, the actual array obtained is ["ಕ", "ನ", "್", "ನ", "ಡ"]
Example Fiddle
I cannot say that this is a complete solution, but it works to an extent, given some basic understanding of how words are formed:
var k = 'ಕನ್ನಡ';
var arr = [];
for (var i = 0; i < k.length; i++) {
  var s = k.charAt(i);
  // keep appending while a next char exists and either it is not a
  // swara/vyanjana (U+0C85..U+0CB9) or the current char is a virama (U+0CCD)
  while ((i + 1) < k.length &&
      (k.charCodeAt(i + 1) < 0xC85 || k.charCodeAt(i + 1) > 0xCB9 || k.charCodeAt(i) == 0xCCD)) {
    s += k.charAt(i + 1);
    i++;
  }
  arr.push(s);
}
console.log(arr);
As the comments in the code say, we keep appending chars to previous char as long as they are not swara or vyanjana or previous char was a virama. You might have to work with different words to make sure you cover different cases. This particular case doesn't cover the numbers.
For Character codes you can refer to this link:
http://www.unicode.org/charts/PDF/U0C80.pdf
Consider using the "InSC" property associated with Unicode characters (you can get this from a database), which gives the Indic Syllabic Category. (You might also want to consult the "category", to see if it is "non-spacing mark".) For instance, "್" has the type "Virama" (see http://graphemica.com/0CCD). To take another example, "ಿ" (KANNADA VOWEL SIGN I) has an InSC of "Vowel_Dependent" (and is also in the "non-spacing mark" category). You could potentially then detect which individual graphemes need to be combined with others, and put back together complete characters, as follows:
const graphemes = [..."ಕನ್ನಡ"];
console.log("graphemes are", graphemes);
const rebuild = [graphemes[0], graphemes.slice(1, 4).join(''), graphemes[4]];
console.log(rebuild);
Even if you can make this work, you'll have more work to do. It's unclear to me how you would detect that the three characters "ನ", "್", and "ನ" are to be combined, rather than treated as the two characters "ನ್" and "ನ". The problem is that in this case the virama is used to indicate a consonant cluster, so you would need to identify the X-V-X pattern (where V is virama) and treat that as one combined character. There are probably many, many other such special cases.
This might be of interest: https://www.microsoft.com/typography/OpenTypeDev/kannada/intro.htmj. It talks about finding "syllable clusters", in this particular case as a prelude for rendering the characters graphically. You may also want to take a look at http://www.unicode.org/L2/L2003/03068-kannada.pdf.
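The X-V-X idea above (consonant, virama, consonant treated as one unit) can be sketched with a single regular expression. This is an approximation under stated assumptions, not a complete syllabifier: it treats U+0C95..U+0CB9 as consonants and U+0CCD as the virama (per the Kannada code chart linked above), attaches any following combining marks via \p{M}, and ignores ZWJ/ZWNJ and nukta sequences. The names KANNADA_CLUSTER and splitClusters are hypothetical.

```javascript
// Sketch only: consonant, then any number of (virama + consonant) pairs,
// then trailing combining marks. Anything else falls back to
// "one non-mark character plus its marks", or a run of stray marks.
const KANNADA_CLUSTER =
  /[\u0C95-\u0CB9](?:\u0CCD[\u0C95-\u0CB9])*\p{M}*|\P{M}\p{M}*|\p{M}+/gu;

// Hypothetical helper: returns the clusters of a word as an array of strings.
function splitClusters(word) {
  return word.match(KANNADA_CLUSTER) || [];
}

console.log(splitClusters('ಕನ್ನಡ')); // [ 'ಕ', 'ನ್ನ', 'ಡ' ]
```

The \p{M} property escape needs the u flag and an ES2018-capable engine. As with the loop-based answer, different words should be tested to see which cases this pattern misses.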

Match returning 100 instead of actual value

I have an array that I'm cycling through. For each value in the array, I'm analyzing it and then shunting the value off into another array based on which conditions it meets. For the purpose of this question, though, I'm simply trying to count how many periods are in the current array item.
Here's the relevant part of the code I'm trying to use:
for(i = 0; i < (sortarray.length) -1; i++)
{
var count = (sortarray[i].match(/./g)||[]).length;
console.log(count + ' periods found in name' + sortarray[i]);
if (count > 1)
{
alert('Error: One or more filenames contain periods.');
return;
}
else ...
Most values are filenames and would have a single period, whereas folder names would have no periods. Anything with more than 1 period should pop up an alert box. Seems simple enough, but for some reason my variable keeps returning 100 instead of 1, and therefore the box always pops up.
Is there a better way to count the dots in each array value?
The problem is with your regexp. An unescaped dot (.) matches any character (except line terminators). Furthermore, since you are using the g flag, match returns one entry for every character in the string.
That's why you're getting 100: the length of the result array equals the length of the whole string.
Thus you should escape the dot so that it matches literal dots instead of any character:
sortarray[i].match(/\./g)
Instead of that logic you can just compare the first index of . with the last index of .; if they are not equal, the filename has more than one .
for(i = 0; i < (sortarray.length) -1; i++)
{
if (sortarray[i].indexOf(".")!=sortarray[i].lastIndexOf("."))
{
alert('Error: One or more filenames contain periods.');
return;
}
}
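To see the difference between the escaped and unescaped dot side by side, here is a small sketch. The helper name countDots is hypothetical.

```javascript
// Hypothetical helper: counts literal periods in a name.
function countDots(name) {
  // \. matches a literal period; || [] covers the null returned when nothing matches
  return (name.match(/\./g) || []).length;
}

console.log(countDots('archive.tar.gz')); // 2
console.log(countDots('folder'));         // 0

// For comparison: the unescaped dot matches every character,
// so the count is simply the length of the string.
console.log(('a.b'.match(/./g) || []).length); // 3
```

An alternative without a regex is name.split('.').length - 1, which gives the same count.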

change regex to match some words instead of all words containing PRP

This regex matches all characters between whitespace if the word contains PRP.
How can I get it to match all words, or characters in between whitespace, if they contain PRP, but not if they contain me in any case?
So match all words containing PRP, but not containing ME or me.
Here is the regex to match words containing PRP: \S*PRP\S*
You can use negative lookahead for this:
(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)
Working Demo
PS: Use group #1 for your matched word.
Code:
var re = /(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)/;
var s = 'word abcPRP def';
var m = s.match(re);
if (m) console.log(m[1]); //=> abcPRP
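To collect every matching word rather than only the first, one possible variant uses the g flag with a lookbehind in place of the capture group. This is a sketch under assumptions: it needs an engine with lookbehind support (ES2018), it uses [Mm][Ee] so that mixed-case "Me" is also excluded, and the name PRP_NOT_ME is hypothetical.

```javascript
// (?<!\S)          the word must start at the beginning of the string or after whitespace
// (?!\S*[Mm][Ee])  the word must not contain "me" in any letter case
// \S*PRP\S*        the word must contain PRP
const PRP_NOT_ME = /(?<!\S)(?!\S*[Mm][Ee])\S*PRP\S*/g;

const sample = 'MELOLPRP PRPRP PRPthemeok PRPmhm';
console.log(sample.match(PRP_NOT_ME)); // [ 'PRPRP', 'PRPmhm' ]
```

Because there is no capture group here, match with the g flag returns the words themselves.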
Instead of using complicated regular expressions which would be confusing for almost anyone who's reading it, why don't you break up your code into two sections, separating the words into an array and filtering out the results with stuff you don't want?
function prpnotme(w) {
var r = w.match(/\S+/g);
if(r == null)
return [];
var i=0;
while(i<r.length) {
if(!r[i].includes('PRP') || r[i].toLowerCase().includes('me'))
r.splice(i,1);
else
i++;
}
return r;
}
console.log(prpnotme('whattttttt ok')); // []
console.log(prpnotme('MELOLPRP PRPRP PRPthemeok PRPmhm')); // ['PRPRP', 'PRPmhm']
For a very good reason why this is important, imagine if you ever wanted to add more logic. You're much more likely to make a mistake when modifying a complicated regex to make it even more complicated, and this way it's done with simple logic that makes perfect sense when reading each predicate, no matter how much you add on.
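The same split-then-filter idea can be written more compactly with Array.prototype.filter, where each predicate stays readable on its own. A minimal sketch, with the hypothetical name wordsWithPrpNotMe:

```javascript
// Hypothetical helper: keeps words that contain PRP and do not contain "me" in any case.
function wordsWithPrpNotMe(text) {
  return text
    .split(/\s+/)                                      // split on runs of whitespace
    .filter(word => word.includes('PRP'))              // must contain PRP
    .filter(word => !word.toLowerCase().includes('me')); // must not contain me/ME/Me/mE
}

console.log(wordsWithPrpNotMe('whattttttt ok')); // []
console.log(wordsWithPrpNotMe('MELOLPRP PRPRP PRPthemeok PRPmhm')); // [ 'PRPRP', 'PRPmhm' ]
```

Adding another rule later is then a matter of appending one more filter call.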

Regex for validation

Can anyone tell me how to write a regex for the following scenario. The input should only be numbers or - (hyphen) or , (comma). The input can be given as any of the following
23
23,26
1-23
1-23,24
24,25-56,58-40,45
Also when numbers is given in a range, the second number should be greater than the first one. 23-1 should not be allowed. If a number is already entered it should not be allowed again. Like 1-23,23 should not be allowed
I'm not going to quibble with "I think" or "maybe" -- you can not do this with a Regex.
Matching against a regex can validate that the form of the input is correct, and can also be used to extract pieces of the input, but it can not do value comparisons, or duplicate elimination (except in limited well defined circumstances), or range checking.
What you have as input I interpret as a comma-separated list of values or ranges of values; in BNFish notation:
value :: number
range :: value '-' value
term :: value | range
list :: term [','term]*
A regex can be built that will match this to verify correct structure, but you'll have to do other validation for the value comparisons and to prevent the duplicate numbers.
The most straightforward regex I can think of (on short notice) is this:
([0-9]+|[0-9]+-[0-9]+)(, *([0-9]+|[0-9]+-[0-9]+))*
You have digits or digits-digits, optionally followed by comma[optional space](digits or digits-digits) - repeated zero or more times.
I tested this regex at http://www.fileformat.info/tool/regex.htm with the input 3,4-12,6,2,90-221
Of course you can replace the [0-9] with [\d] for regex dialects that allow it.
var str = "24,25-56,24, 58- 40,a 45",
    trimmed = str.replace(/\s+/g, '');
// test for correct characters
if (trimmed.match(/[^,\-\d]/)) alert("Please use only digits and hyphens, separated by commas.");
// test for duplicates (sorting puts equal values next to each other)
var split = trimmed.split(/-|,/);
split.sort();
for (var i = 0; i < split.length - 1; i++) {
  if (split[i + 1] == split[i]) alert("Please avoid duplicate numbers.");
}
// test for ascending range: the first number must be smaller than the second
split = trimmed.split(/,/);
for (var j = 0; j < split.length; j++) {
  var ends = split[j].split("-");
  if (ends.length == 2 && Number(ends[0]) >= Number(ends[1])) alert("Please use an ascending range.");
}
I don't think you will be able to do this with a RegEx. Especially not the part about set logic - number already used, valid sequential range.
My suggestion would be to have a Regex verify the format, at the least -, number, comma. Then use the split method on commas and loop over the input to verify the set. Something like:
var number_ranges = numbers.split(',');
for (var i = 0; i < number_ranges.length; ++i) {
// verify number ranges in set
}
That logic is not exactly trivial.
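As a rough illustration of the approach described in these answers (regex for the format, then a loop for the set logic), here is a minimal sketch. The function name validateRanges and the error labels are hypothetical. It also assumes that a number already covered by an earlier range counts as a duplicate, which goes slightly beyond the 1-23,23 example in the question, and it expands each range number by number, so it is only suitable for modest range sizes.

```javascript
// Hypothetical validator: returns { ok: true } or { ok: false, error: <label> }.
function validateRanges(input) {
  const s = input.replace(/\s+/g, ''); // ignore whitespace
  // Format: number or number-number, separated by commas, anchored to the whole string.
  if (!/^\d+(-\d+)?(,\d+(-\d+)?)*$/.test(s)) {
    return { ok: false, error: 'format' };
  }
  const seen = new Set();
  for (const term of s.split(',')) {
    const parts = term.split('-').map(Number);
    const lo = parts[0];
    const hi = parts.length === 2 ? parts[1] : lo;
    // In a range, the second number must be greater than the first.
    if (parts.length === 2 && hi <= lo) {
      return { ok: false, error: 'range' };
    }
    // Assumption: any number covered by an earlier term is a duplicate.
    for (let n = lo; n <= hi; n++) {
      if (seen.has(n)) return { ok: false, error: 'duplicate' };
      seen.add(n);
    }
  }
  return { ok: true };
}

console.log(validateRanges('1-23,24')); // { ok: true }
console.log(validateRanges('23-1'));    // { ok: false, error: 'range' }
console.log(validateRanges('1-23,23')); // { ok: false, error: 'duplicate' }
```

For very large ranges, storing the intervals and checking them for overlap would avoid expanding every number.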
I think with regular expressions it is better to take the time to learn them than to throw someone else's script into yours without knowing exactly what it is doing. You have excellent resources out there to help you.
Try these sites:
regular-expressions.info
w3schools.com
evolt.org
Those are the first three results from a Google search. All are good resources. Good luck. Remember to double check what your regex is actually matching by outputting it to the screen; don't assume you know (that has bitten me more than one time).
