Split Kannada word into syllabic clusters - javascript

We are wondering if there is any method to split a Kannada word to get the syllabic clusters using JavaScript.
For example, I want to split the word ಕನ್ನಡ into the syllabic clusters ["ಕ", "ನ್ನ", "ಡ"]. But when I split it with split, the actual array obtained is ["ಕ", "ನ", "್", "ನ", "ಡ"]
Example Fiddle

I cannot say that this is a complete solution. But works to an extent with some basic understanding of how words are formed:
var k = 'ಕನ್ನಡ';
var parts = k.split('');
arr = [];
for(var i=0; i< parts.length; i++) {
var s = k.charAt(i);
// while the next char is not a swara/vyanjana or previous char was a virama
while((i+1) < k.length && k.charCodeAt(i+1) < 0xC85 || k.charCodeAt(i+1) > 0xCB9 || k.charCodeAt(i) == 0xCCD) {
s += k.charAt(i+1);
i++;
}
arr.push(s);
}
console.log(arr);
As the comments in the code say, we keep appending chars to previous char as long as they are not swara or vyanjana or previous char was a virama. You might have to work with different words to make sure you cover different cases. This particular case doesn't cover the numbers.
For Character codes you can refer to this link:
http://www.unicode.org/charts/PDF/U0C80.pdf

Consider using the "inSC" property associated with Unicode characters--you can get this from a database--which indicates the Indic Syllabic Character. (You might also want to consult the "category", to see if it is "non-spacing mark"). For instance, ""್" has the type "Virama" (see http://graphemica.com/0CCD). To take another example, "ಿ" (KANNADA VOWEL SIGN I) has an InSC of "Vowel_Dependent" (and is also in the "non-spacing mark" category). You could potentially then detect which individual graphemes need to be combined with others, and put back together complete characters, as follows:
const graphemes = [..."ಕನ್ನಡ"];
console.log("graphemes are", graphemes);
const rebuild = [graphemes[0], graphemes.slice(1, 4).join(''), graphemes[4]];
console.log(rebuild);
Even if you can make this work, you'll have more work to do. It's unclear to me how you would detect that the three characters "ನ", ""್", and "ನ" are to be combined, rather than treated as the two characters "ನ್" and "ನ". The problem is that in this case the virama is used to indicate a consonant cluster, so you would need to identify the X-V-X pattern (where V is virama) and treat that as one combined character. There are probably many, many other such special cases.
This might be of interest: https://www.microsoft.com/typography/OpenTypeDev/kannada/intro.htmj. It talks about finding "syllable clusters", in this particular case as a prelude for rendering the characters graphically. You may also want to take a look at http://www.unicode.org/L2/L2003/03068-kannada.pdf.

Related

JS - Iterate through text snippet character by character, skipping certain characters

I am using JS to loop through a given text, refered to in below pseudo as "input_a".
Based on the contents of another, and seperate text "input_b" I would like to manipulate the individual characters of text "input_a" by assigning them with a boolean value.
So far I've approached it the following way:
for (i=0; i < input_a.length; i++) {
if (input_b[i] == 0){
//do something to output
}
}
Now the issue with this is that the above loop, being that it uses .length also includes all blank/special characters whereas I would only like to include A-Z - ommitting all special characters (which in this case would not be applicable to recieve the boolean assigned to them).
How could I approach this efficiently and elegantly - and hopefully without reinventing the wheel or creating my own alphabet array?
Edit 1: Forgot to mention that the position of the special characters needs to be retained when the manipulated input_a is finally delivered as output. This makes an initial removal of all special characters from input_a a non viable option.
It sounds like you want input_a to retain only alphabetical characters - you can transform it easily with a regular expression. Match all non-alphabetical characters, and replace them with the empty string:
const input_a = 'foo_%%bar*&#baz';
const sanitizedInputA = input_a.replace(/[^a-z]+/gi, '');
console.log(sanitizedInputA);
// iterate through sanitizedInputA
If you want the do the same to happen with input_b before processing it, just use the same .replace that was used on a.
If you need the respective indicies to stay the same, then you can do a similar regular expression test while iterating - if the character being iterated over isn't alphabetical, just continue:
const input_a = 'foo_%%bar*&#baz';
for (let i = 0; i < input_a.length; i++) {
if (!/[a-z]/i.test(input_a[i])) continue;
console.log(input_a[i]);
}
You can check if the character at the current position is a letter, something like:
for (i=0; i < input_a.length; i++) {
if(/[a-z]/i.test(input_a[i])){
if (input_b[i] == 0){
//do something to output
}
}
}
the /[a-z]/i regex matches both upper and lower case letters.
Edited as per Edit 1 of PO
If you would like to do this without RegEx you can use this function:
function isSpecial(char) {
if(char.toLowerCase() != char.toUpperCase() || char.toLowerCase.trim() === ''){
return true;
}
return false;
}
You can then call this function for each character as it comes into the loop.

Is there an efficient way to test whether a string contains non-overlapping substrings to match the array of regular expressions?

All I want is to test whether a string contains non-overlapping substrings to match the array of regexes in the following way: if a substring matches some item of the array, remove the corresponding regex from the array, and continue. I will need a function func(arg1, arg2) that will take two arguments: the first one is the string itself, and the second one is an array of regular expressions to test.
I've read some explanations (such as Regular Expressions: Is there an AND operator?), but they do not answer this specific question. For example, in Javascript, the following three code snippets will return true:
/(?=ab)(?=abc)(?=abcd)/gi.test("eabzzzabcde");
/(?=.*ab)(?=.*abc)(?=.*abcd)/gi.test("eabzzzabcde");
/(?=.*?ab)(?=.*?abc)(?=.*?abcd)/gi.test("eabzzzabcde");
which is, obviously, not what I want (because "abc" and "abcd" in "eabzzzabcde" are just mixed together in an overlapping way). So, func("eabzzzabcde", [/ab/gi, /abc/gi, /abcd/gi]) should return false.
But, func("Fhh, fabcw wxabcdy yz... zab.", [/ab/gi, /abc/gi, /abcd/gi]) should return true because none of "ab", "abc" and "abcd" substrings overlap each other. The logic is the following. We have an array of regexes: [/ab/gi, /abc/gi, /abcd/gi], and some possible combination of three (where 3 is equal to the length of that array) non-overlapping, separate substrings of the original string: fabcw, xabcdy and zab. Does fabcw match /abc/gi? Yes. Okay, we remove /abc/gi from the array, and we have [/ab/gi, /abcd/gi] for xabcdy and zab. Does xabcdy match /abcd/gi? Yes. Okay, we remove /abcd/gi from the current array, and we have [/ab/gi] for zab. Does zab. match /ab/gi? Yes. No more regexes left in the current array, and we always answered "yes", so — return true.
The tricky part here is to find an efficient (such that performance is not too terrible) way to get at least one possible “good” combination of non-overlapping substrings.
The more complex case is e.g. func("acdxbaab ababaacb", [/.*?a.*?b.*?c/gi, /.*?c.*?b.*?a/gi]). Using the logic described above, we can see that if we take two non-overlapping parts of the original string — "acdxba" (or "cdxba") and "abaac" (or "abaacb", "babaac" etc.) — the first one matches /.*?c.*?b.*?a/gi, and the second one matches /.*?a.*?b.*?c/gi. So, func("acdxbaab ababaacb", [/.*?a.*?b.*?c/gi, /.*?c.*?b.*?a/gi]) should return true.
Is there any efficient way to solve such a problem?
Assuming each pattern should match exactly once, then we can construct a regexp of all of their permutations:
const patterns= ['ab', 'abc', 'abcd'];
const input = "Fhh, fabcw wxabcdy yz... zab.";
// Create a regexp of the form
// (.*?ab.*?abc.*?abcd.*?)
function build(patterns) {
return `(${['', ...patterns, ''].join('.*?')})`;
}
function match(input, patterns) {
const regexps = [...permute(patterns)].map(build);
// Create a regexp of the form
// /(.*?ab.*?abc.*?abcd.*?)|(.*?ab.*?abcd.*?abc.*?)|.../
const regexp = new RegExp(regexps.join('|'));
return regexp.test(input);
}
// Simple permutation generator.
function *permute(a, n = a.length) {
if (n <= 1) yield a.slice();
else for (let i = 0; i < n; i++) {
yield *permute(a, n - 1);
const j = n % 2 ? 0 : i;
[a[n-1], a[j]] = [a[j], a[n-1]];
}
}
console.log(match(input, patterns));
This will result in a very long regexp if there are more than a half-dozen or so patterns. To deal with this, we can test each permutation one at a time:
function match(input, patterns) {
return Array.from(permute(patterns))
.some(perm => input.match(build(perm)));
}
If there are ten patterns, we will end up doing a couple million tests.
Disclaimers
This uses ES6 features. Fall back to equivalent ES5 syntax if you need to.
The input patterns here are strings. To handle regexps instead would require a little bit of logic to extract the pattern from the regexp, and also escape any special regexp characters appearing in it.
Is there an efficient way to test whether a string contains non-overlapping substrings to match the array of regular expressions?
I doubt that you would call the above solution "efficient", but I don't know if there is a more efficient one. As far as I can see, any approach to this problem is going to involve backtracking. You could match the first nine of ten patterns, and then discover that the last one won't match because one of the earlier nine greedily ate up part of what the tenth needed, even though it could have matched itself somewhere later in the string. Therefore, I will go out on a limb and say that this problem is intrinsically of order O(n!).

Faster way match characters between strings than Regex?

The use case is I want to compare a query string of characters to an array of words, and return the matches. A match is when a word contains all the characters in the query string, order doesn't matter, repeated characters are okay. Regex seems like it may be too powerful (a sledgehammer where only a hammer is needed). I've written a solution that compares the characters by looping through them and using indexOf, but it seems consistently slower. (http://jsperf.com/indexof-vs-regex-inside-a-loop/10) Is Regex the fastest option for this type of operation? Are there ways to make my alternate solution faster?
var query = "word",
words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'],
reStoredMatches = [],
indexOfMatches = [];
function match(word, query) {
var len = word.length,
charMatches = [],
charMatch,
char;
while (len--) {
char = word[len];
charMatch = query.indexOf(char);
if (charMatch !== -1) {
charMatches.push(char);
}
}
return charMatches.length === query.length;
}
function linearIndexOf(words, query) {
var wordsLen = words.length,
wordMatch,
word;
while (wordsLen--) {
word = words[wordsLen];
wordMatch = match(word, query);
if (wordMatch) {
indexOfMatches.push(word);
}
}
}
function linearRegexStored(words, query) {
var wordsLen = words.length,
re = new RegExp('[' + query + ']', 'g'),
match,
word;
while (wordsLen--) {
word = words[wordsLen];
match = word.match(re);
if (match !== null) {
if (match.length >= query.length) {
reStoredMatches.push(word);
}
}
}
}
Note that your regex is wrong, that's most certainly why it goes so fast.
Right now, if your query is "word" (as in your example), the regex is going to be:
/[word]/g
This means look for one of the characters: 'w', 'o', 'r', or 'd'. If one matches, then match() returns true. Done. Definitively a lot faster than the most certainly more correct indexOf(). (i.e. in case of a simple match() call the 'g' flag is ignored since if any one thing matches, the function returns true.)
Also, you mention the idea/concept of any number of characters, I suppose as shown here:
'word', 'wwoorrddss'
The indexOf() will definitively not catch that properly if you really mean "any number" for each and every character. Because you should match an infinite number of cases. Something like this as a regex:
/w+o+r+d+s+/g
That you will certainly have a hard time to write the right code in plain JavaScript rather than use a regex. However, either way, that's going to be somewhat slow.
From the comment below, all the letters of the word are required, in order to do that, you have to have 3! tests (3 factorial) for a 3 letter word:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/
Obviously, a factorial is going to very quickly grow your number of possibilities and blow away your memory in a super long regex (although you can simplify if a word has the same letter multiple times, you do not have to test that letter more than once).
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
...
That's probably why your properly written test in plain JavaScript is much slower.
Also, in your case you should write the words nearly as done in Scrabble dictionaries: all letters once in alphabetical order (Scrabble keeps duplicates). So the word "word" would be "dorw". And as you shown in your example, the word "wwoorrddss" would be "dorsw". You can have some backend tool to generate your table of words (so you still write them as "word" and "words", and your tool massage those and convert them to "dorw" and "dorsw".) Then you can sort the letters of the words you are testing in alphabetical order and the result is that you do not need a silly factorial for the regex, you can simply do this:
/d.*o.*r.*w/
And that will match any word that includes the word "word" such as "password".
One easy way to sort the letters will be to split your word in an array of letters, and then sort the array. You may still get duplicates, it will depend on the sort capabilities. (I don't think that the default JavaScript sort will remove duplicates automatically.)
One more detail, if you test is supposed to be case insensitive, then you want to transform your strings to lowercase before running the test. So something like:
query = query.toLowerCase();
early on in your top function.
You are trying to speed up the algorithm "chars in word are a subset of the chars of query." You can short circuit this check and avoid some assignments (that are more readable but not strictly needed). Try the following version of match
function match(word, query) {
var len = word.length;
while (len--) {
if (query.indexOf(word[len]) === -1) { // found a missing char
return false;
}
}
return true; // couldn't find any missing chars
}
This gives a 4-5X improvement
Depending on the application you could try presorting words and presorting each word in words as another optimization.
The regexp match algorithm constructs a finite state automaton and makes its decisions on the current state and character read from left to right. This involves reading each character once and make a decision.
For static strings (to look a fixed string on a couple of text) you have better algorithms, like Knuth-Morris that allow you to go faster than one character at a time, but you must understand that this algorithm is not for matching regular expressions, just plain strings.
if you are interested in Knuth-Morris (there are several other algorithms) just have a round in wikipedia. http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
A good thing you can do is to investigate if you regexp match routines do it with an DFA or a NDFA, as NDFAs occupy less memory and are easier to compute, but DFAs do it faster, but with some compilation penalties and more memory occupied.
Knuth-Morris algorithm also needs to compile the string into an automaton before working, so perhaps it doesn't apply to your problem if you are using it just to find one word in some string.

How to compare two Strings and get Different part

now I have two strings,
var str1 = "A10B1C101D11";
var str2 = "A1B22C101D110E1";
What I intend to do is to tell the difference between them, the result will look like
A10B1C101D11
A10 B22 C101 D110E1
It follows the same pattern, one character and a number. And if the character doesn't exist or the number is different between them, I will say they are different, and highlight the different part. Can regular expression do it or any other good solution? thanks in advance!
Let me start by stating that regexp might not be the best tool for this. As the strings have a simple format that you are aware of it will be faster and safer to parse the strings into tokens and then compare the tokens.
However you can do this with Regexp, although in javascript you are hampered by the lack of lookbehind.
The way to do this is to use negative lookahead to prevent matches that are included in the other string. However since javascript does not support lookbehind you might need to go search from both directions.
We do this by concatenating the strings, with a delimiter that we can test for.
If using '|' as a delimiter the regexp becomes;
/(\D\d*)(?=(?:\||\D.*\|))(?!.*\|(.*\d)?\1(\D|$))/g
To find the tokens in the second string that are not present in the first you do;
var bothstring=str2.concat("|",str1);
var re=/(\D\d*)(?=(?:\||\D.*\|))(?!.*\|(.*\d)?\1(\D|$))/g;
var match=re.exec(bothstring);
Subsequent calls to re.exec will return later matches. So you can iterate over them as in the following example;
while (match!=null){
alert("\""+match+"\" At position "+match.index);
match=re.exec(t);
}
As stated this gives tokens in str2 that are different in str1. To get the tokens in str1 that are different use the same code but change the order of str1 and str2 when you concatenate the strings.
The above code might not be safe if dealing with potentially dirty input. In particular it might misbehave if feed a string like "A100|A100", the first A100 will not be considered as having a missing object because the regexp is not aware that the source is supposed to be two different strings. If this is a potential issue then search for occurences of the delimiting character.
You call break the string into an array
var aStr1 = str1.split('');
var aStr2 = str2.split('');
Then check which one has more characters, and save the smaller number
var totalCharacters;
if(aStr1.length > aStr2.length) {
totalCharacters = aStr2.length
} else {
totalCharacters = aStr1.length
}
And loop comparing both
var diff = [];
for(var i = 0; i<totalCharacters; i++) {
if(aStr1[i] != aStr2[i]) {
diff.push(aStr1[i]); // or something else
}
}
At the very end you can concat those last characters from the bigger String (since they obviously are different from the other one).
Does it helps you?

Regex with limited use of specific characters (like a scrabble bank of letters)

I would like to be able to do a regex where I can identify sort of a bank of letters like [dgos] for example and use that within my regex... but whenever a letter from that gets used, it takes it away from the bank.
So lets say that somehow \1 is able to stand for that bank of letters (dgos). I could write a regex something like:
^\1{1}o\1{2}$
and it would match basically:
\1{1} = [dgos]{1}
o
\1{2} = [dgos]{2} minus whatever was used in the first one
Matches could include good, dosg, sogd, etc... and would not include sods (because s would have to be used twice) or sooo (because the o would have to be used twice).
I would also like to be able to identify letters that can be used more than once. I started to write this myself but then realized I didn't even know where to begin with this so I've also searched around and haven't found a very elegant way to do this, or a way to do it that would be flexible enough that the regex could easily be generated with minimal input.
I have a solution below using a combination of conditions and multiple regexs (feel free to comment thoughts on that answer - perhaps it's the way I'll have to do it?), but I would prefer a single regex solution if possible... and something more elegant and efficient if possible.
Note that the higher level of elegance and single regex part are just my preferences, the most important thing is that it works and the performance is good enough.
Assuming you are using javascript version 1.5+ (so you can use lookaheads), here is my solution:
^([dogs])(?!.*\1)([dogs])(?!.*\2)([dogs])(?!.*\3)[dogs]$
So after each letter is matched, you perform a negative lookahead to ensure that this matched letter never appears again.
This method wouldn't work (or at least, would need to be made a heck of a lot more complicated!) if you want to allow some letters to be repeated, however. (E.g. if your letters to match are "example".)
EDIT: I just had a little re-think, and here is a much more elegant solution:
^(?:([dogs])(?!.*\1)){4}
After thinking about it some more, I have thought of a way that this IS possible, although this solution is pretty ugly! For example, suppose you want to match the letters "goods" (i.e. including TWO "o"s):
^(?=.*g)(?=.*o.*o)(?=.*d)(?=.*s).{5}$
- So, I have used forward lookaheads to check that all of these letters are in the text somewhere, then simply checked that there are exactly 5 letters.
Or as another example, suppose we want to match the letters "banana", with a letter "a" in position 2 of the word. Then you could match:
^(?=.*b)(?=.*a.*a.*a)(?=.*n.*n).a.{4}$
Building on #TomLord's answer, taking into account that you don't necessarily need to exhaust the bank of letters, you can use negative lookahead assertions instead of positive ones. For your example D<2 from bank>R<0-5 from bank>, that regex would be
/^(?![^o]*o[^o]*o)(?![^a]*a[^a]*a)(?![^e]*e[^e]*e[^e]*e)(?![^s]*s[^s]*s)d[oaes]{2}r[oaes]{0,5}$/i
Explanation:
^ # Start of string
(?![^o]*o[^o]*o) # Assert no more than one o
(?![^a]*a[^a]*a) # Assert no more than one a
(?![^e]*e[^e]*e[^e]*e) # Assert no more than two e
(?![^s]*s[^s]*s) # Assert no more than one s
d # Match d
[oaes]{2} # Match two of the letters in the bank
r # Match r
[oaes]{0,5} # Match 0-5 of the letters in the bank
$ # End of string
You could also write (?!.*o.*o) instead of (?![^o]*o[^o]*o), but the latter is faster, just harder to read. Another way to write this is (?!(?:[^o]*o){2}) (useful if the number of repetitions increases).
And of course you need to account for the number of letters in your "fixed" part of the string (which in this case (d and r) don't interfere with the bank, but they might do so in other examples).
Something like that would be pretty simple to specify, just very long
gdos|gdso|gods|gosd|....
You'd basically specify every permutation. Just write some code to generate every permutation, combine with the alternate operator, and you're done!
Although... It might pay to actually encode the decision tree used to generate the permutations...
I swear I think I answered this before on stackoverflow. Let me get back to you...
I couldn't think of a way to do this completely in a regex, but I was able to come up with this: http://jsfiddle.net/T2TMd/2/
As jsfiddle is limited to post size, I couldn't do the larger dictionary there. Check here for a better example using a 180k word dictionary.
Main function:
/*
* filter = regex to filter/split potential matches
* bank = available letters
* groups = capture groups that use the bank
* dict = list of words to search through
*/
var matchFind = function(filter, bank, groups, dict) {
var matches = [];
for(var i=0; i < dict.length; i++) {
if(dict[i].match(filter)){
var fail = false;
var b = bank;
var arr = dict[i].split(filter);
//loop groups that use letter bank
for(var k=0; k<groups.length && !fail; k++) {
var grp = arr[groups[k]] || [];
//loop characters of that group
for(var j=0; j<grp.length && !fail; j++) {
var regex = new RegExp(b);
var currChar = grp.charAt(j);
if(currChar.match(regex)) {
//match found, remove from bank
b = b.replace(currChar,"");
} else {
fail = true;
}
}
}
if(!fail) {
matches.push(dict[i]);
}
}
}
return matches;
}
Usage:
$("#go").click( function() {
var f = new RegExp($("#filter").val());
var b = "["+$("#bank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
var g = $("#groups").val().replace(/[^0-9,]+/g,"").split(",") || [];
$("#result").text(matchFind(f,b,g,dict).toString());
});
To make it easier to create scenarios, I created this as well:
$("#build").click( function() {
var bank = "["+$("#buildBank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
var buildArr = $("#builder").val().split(",");
var groups = [];
var build = "^";
for(var i=0; i<buildArr.length; i++) {
var part = buildArr[i];
if(/\</.test(part)) {
part = "(" + bank + part.replace("<", "{").replace(">", "}").replace("-",",") + ")";
build = build + part;
groups.push(i+1);
} else {
build = build + "("+part+")";
}
}
build = build + "$";
$("#filter").val(build);
$("#bank").val(bank);
$("#groups").val(groups.toString());
$("#go").click();
});
This would be useful in Scrabble, so lets say that you are in a position where a word must start with a "D", there are two spaces between that "D" and an "R" from a parallel word, and you have OAEES for your letter bank. In the builder I could put D,<2>,R,<0-3> because it must start with a D, then it must have 2 letters from the bank, then there is an R, then I'd have up to 3 letters to use (since I'm using 2 in between D and R).
The builder would use the letter bank and convert D,<2>,R,<0-3> to ^(D)([OAEES]{2})(R)([OAEES]{0,5})$ which would be used to filter for possible matches. Then with those possible matches it will look at the capture groups that use the letter bank, character by character, removing letters from that regex when it finds them so that it won't match if there are more of the letter bank letters used than there is in the letter bank.
Test the above scenario here.

Categories

Resources