reversed regex mashine implementation - javascript

I'm trying to match a string starting from the last character to fail as soon as possible. This way I can fail a match with a custom string cstr (see specification below) with least amount of operations (4th property).
From a theoritical perspective the regex can be represented as a finite state mashine and the arrows can be flipped, creating the reversed regex.
I'm looking for an implementation of this. A library/program which I can give the string and the pattern. cstr is implemented in python, so if possible a python module. (For the curious i-th character is not calculated until needed.) For anything other I need to do much more work because of cstr's calculation is hard to port to another language.
The implementation doesn't have to cover all latex syntax. I'm looking for the basics. No lookaheads or fancy stuff. See specification below.
I may be lacking common knowledge. Please do comment obvious things, too.
Specification
The custom string cstr has the following properties:
String can be calculated in finite time.
String has finite length
The last character is known
Every previous character requires a costly calculation
Until the string is calculated fully, length is unknown
When the string is calcualted fully, I want to match it with a simple regex which may contain these from the syntax. No look aheads or fancy stuff.
alphanumeric characters
uinicode characters
., *, +, ?, \w, \W, [], |, escape char \, range specifitation with { , }
PS: This is not a homework question. I'm trying to formulate my question as clear as possible.

OP here. Here are some thougts:
Since I'm looking for an unoptimized regex mashine, I have to build it myself, which takes time.
Alternatively we can define an upperbound for cstr length and create all strings that matches given regex with length < upperbound. Then we put all solutions to a tire data structure and match it. This depends on the use case and maybe a cache can be involved.
What I'm going for is python module greenery
from greenery import parse
pattern = parse.Pattern(...)
pattern.reversed()
...
this sometimes provieds a good matching experience. Sometimes not but it is ok for me.

Related

Why does NFKC normalize of all digits not work?

In JavaScript I am using NFKC normalization via String.prototype.normalize to normalize fullwidth to standard ASCII halfwidth characters.
'1'.normalize('NFKC') === '1'
> true
However, looking at more obscure digits like ૫ which is the digit 5 in Gujarati it does not normalize.
'૫'.normalize('NFKC') === '5'
> false
What am I missing?
Unicode normalisation is meant for characters that are variants of each other, not for every set of characters that might have similar meanings.
The character ‘1’ (FULLWIDTH DIGIT ONE) is essentially just the character ‘1’ (DIGIT ONE) with slightly different styling and would not have been encoded if it was not necessary for compatibility. They are – in some contexts – completely interchangeable, so the former was assigned a decomposition mapping to the latter. The character ‘૫’ (GUJARATI DIGIT FIVE) does not have a decomposition mapping because it is not a variant of any other character; it is its own distinct thing.
You can consult the Unicode Character Database to see which characters decompose and which (i.e. most of them) don’t. The link to the tool you posted as part of your question shows you for example that ૫ does not change under any form of Unicode normalisation.
You are looking the wrong problem.
Unicode is main purpose is about encoding characters (without loosing information). Fonts and other programs should be able to interpret such characters and give a glyph (according combination code point, nearby characters, and other characteristics outside code points [like language, epoch, font characteristic [script and non-script, uppercase, italic, etc changes how to combine characters and ligature (and also glyph form).
There are two main normalization (canonical and compatible) [and two variant: decomposed, and composed when possible]. Canonical normalization remove unneeded characters (repetition) and order composing characters in a standard way. Compatible normalization remove "compatible characters": characters that are in Unicode just not to lose information on converting to and from other charset.
Some digits (like small 2 exponent) have compatible character as normal digit (this is a formatting question, unicode is not about formatting). But on the other cases, digits in different characters should be keep ad different characters.
That was about normalization.
But you want to get the numeric value of a unicode character (warning: it could depends on other characters, position, etc.).
Unicode database provides also such property.
With Javascript, you may use unicode-properties javasript package, which provide you also the function getNumericValue(codePoint). This packages seems to use efficient compression of database, but so I don't know how fast it could be. The database is huge.
The NFKC which you are using here stands for Compatibility Decomposition followed by Canonical Composition, which in trivial english means first break things to smaller more often used symbols and then combine them to find the equivalent simpler character. For example 𝟘->0, fi->fi (Codepoint fi=64257).
It does not do conversion to ASCII, for example in ख़(2393)-> ख़([2326, 2364])
Reference:https://unicode.org/reports/tr15/#Norm_Forms
For simpler understanding:https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c

Optimisation: (\\[abc] | [^abc])*

I have a long Regex (JavaScript), and it contains the following construct:
((\\\\)|(\\[abc])|([^abc]))*
The regex says:
Match any String, that doesn't contain the letters a,b and c.
In except if they're escaped by a backslash.
If the backslash is escaped (eg. \\a), also don't match these letters.
Here's a simple match-example:
eeeaeaee\aee\\\\ae\\\\\aee
I wonder if it's possible to optimise this regulat expression. This is only a little example, the actual regex I'm using is bigger, and I have lots of code twice.
I think a more logical (and likely faster) regexp would be something like:
(?:[^abc\\]|\\.)*
In other words, a backslash will escape anything, including another backslash.
Note a few things: first, if you don't need to capture parts of the match, use non-capturing groups. That buys you a little performance. Second, when there are multiple alternatives, put the most common one first.
You might get even better performance this way (try it):
[^abc\\]*(?:\\.[^abc\\]*)*
Rather than going through the alternation for each and every character, that will "eat" runs of non-special characters with a single step. Nested * can be bad news, leading to quadratic (or worse) runtime in cases where the regex doesn't match, but in this case that won't happen.
When writing this answer, I discovered that JS's regex engine has no possessive matchers. That sucks -- you could get better worst-case performance if they were available. (An important tip for working towards regex mastery: when performance testing a regex, always test cases where it does match AND where it doesn't match. The worst-case performance generally occurs when it doesn't.)
You can match any character after a backslash or any character that is not in [abc]:
(\\.|[^abc])*
That will match the exact same language.
I think it's actually more clear what you're intention is if you flip it around like:
([^abc]|\\.)*

Need help creating javascript regexp [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
I have absolutly no experience with regular expressions and I need some help setting up one to match a string with. This is for phone number validation. I need to make sure that a string a user inputs has only upper case letters A-Z, numbers 0-9, open/close parentheses[()], and hyphens(-). I also don't know what string method I need to use either match or string.
RegEx is explained poorly all over the web. I don't fault anyone for asking more general questions about it and this is different from the other post which is more do-it-form-me google-evasion than specific question. The characters you asked about:
[A-Z]
[0-9] or \d
\(
\)
-
/matchme/ is a regular expression literal. This is preferable to useing the RegExp constructor because you end up having to escape your escape backslashes which gets real ugly.
You can actually use regEx literals in a lot of string methods, like replace, split, etc.
Without special characters following, any non-special character is about matching one character at that position in a string. Stuff in [] is a class and can match more than one KIND of character but only the character at that positions following the last position matched. You might [.- ] useful for identifying non-number characters for telephone numbers. You can also express ranges in character classes, e.g. [a-hA-H] or [4-9]
But one str position at a time goes out the window when you start using the follow-up characters:
? - one or none
* - 0 or many
+ - 1 or more
Avoid the . wildcard character. It is inefficient. For some reason that I suspect goes down all the way to implementation in assembly for efficiency's sake, it checks against every single possibility rather than the 1-2 teletype whitespace characters it actually doesn't represent and there is no honest use for on a computer. More importantly, the better-performing alternative is much more powerful and helpful. Negating character classes are much faster. [^<]* represents 0 or more positions of anything that is NOT a < character.
Very handy stuff for XML/SGML-style parsing which in spite of what many on Stack have said, is perfectly feasible with regEx, which is no longer technically confined to "regular" languages. You have to be aware of what your looking with something that allows as much sloppiness as somebody else's HTML but that's just a 'duh' in my book.
Crockford warns against negating character classes in JSlint. Crockford is painfully wrong on that count. They are not only much more efficient, they also make it much easier to think through how to tokenize stuff. If there is a security risk, you can set explicit limits to the number of characters matched with {} brackets, e.g. p{2,5} - which matches two to five p chars or {5} for exactly 5 or {,5} for up to 5 or {5,} at least 5 (I think - test those last two)
Other random stuff you should look up:
(ph|f) - ph or f - helpful for finding phish and fish (when a class won't do, basically)
^ - represents beginning of a string - think of as a condition for the next character more than a character itself. Yes, it also negates character classes.
$ - represents end of a string - same caveat as above but on the previous character.
\ - used to escape special symbols. Note: a lot of special symbols that have no meaning in character classes require no \ inside []
\s\w\d - These represent commonly used sets of characters. The first is pretty much all whitespace (js-style escapes typically have regEx equivalents) followed by w for word characters (class equivalent [a-zA-Z0-9_]) and d for digits [0-9]. Capitalize any of these for the exact opposite.
There's more, like back-references, and lookaheads whose use-case scenarios are worth knowing but this is the commonly used stuff I actually remember from regular experience (bwaahaahaa).
I assume you're looking for non US since you have that A-Z concern and I'm sure there's plenty of US phone-numbers regExes out there but I'd probably do something like this for US numbers:
/\(?\d{3}[)\-. ]?\d{3}[\-. ]?\d{4}/
to match:
123-456-7890
(123)456-7890
123.456.7890
123 456 7890
1234567890
But also perhaps messily allows:
(123456.7890
...which I'm willing to live with for the sake of avoiding complexity. Resist the temptation to do it all with one expression. Sometimes it's much cleaner to eliminate trailing/leading whitespace for instance, and then hit something with an expression. Split and join methods are very powerful for tokenizing
If this goes like a usual regEx conversation, somebody will shortly point out something I missed in my pattern. So yeah, test 'em out on stuff. There's sites that let you set the expression and then just plug in characters to try and break them.

Insert EBCDIC character into javascript string

I need to create an EBCDIC string within my javascript and save it into an EBCDIC database. A process on the EBCDIC system then uses the data. I haven't had any problems until I came across the character '¬'. In EBCDIC it is hex value of 5F. All of the usual letters and symbols seem to automagically convert with no problem. Any idea how I can create the EBCDIC value for '¬' within javascript so I can store it properly in the EBCDIC db?
Thanks!
If "all of the usual letters and symbols seem to automagically convert", then I very strongly suspect that you do not have to create an EBCDIC string in Javascript. The character codes for Latin letters and digits are completely different in EBCDIC than they are in Unicode, so something in your server code is already converting the strings.
Thus what you need to determine is how that process works, and specifically you need to find out how the translation maps character codes from Unicode source into the EBCDIC equivalents. Once you know that, you'll know what Unicode character to use in your Javascript code.
As a further note: every single time I've been told by an IT organization that their mainframe software requires that data be supplied in EBCDIC, that advice has been dead wrong. The fact that there's some external interface means that something in the pile of iron that makes up the mainframe and it's tentacles, something the IT people have forgotten about and probably couldn't find if they needed to, is already mapping "real world" character encodings like Unicode into EBCDIC. How does it work? Well, it may be impossible to figure out.
You might try whether this works: var notSign = "\u00AC";
edit: also: here's a good reference for HTML entities and Unicode glyphs: http://www.elizabethcastro.com/html/extras/entities.html The HTML/XML syntax uses decimal numbers for the character codes. For Javascript, you have to convert those to hex, and the notation in Javascript strings is "\u" followed by a 4-digit hex constant. (That reference isn't complete, but it's pretty easy to read and it's got lots of useful symbols.)

How to detect what allowed character in current Regular Expression by using JavaScript?

In my web application, I create some framework that use to bind model data to control on page. Each model property has some rule like string length, not null and regular expression. Before submit page, framework validate any binded control with defined rules.
So, I want to detect what character that is allowed in each regular expression rule like the following example.
"^[0-9]+$" allow only digit characters like 1, 2, 3.
"^[a-zA-Z_][a-zA-Z_\-0-9]+$" allow only a-z, - and _ characters
However, this function should not care about grouping, positioning of allowed character. It just tells about possible characters only.
Do you have any idea for creating this function?
PS. I know it easy to create specified function like numeric only for allowing only digit characters. But I need share/reuse same piece of code both data tier(contains all model validator) and UI tier without modify anything.
Thanks
You can't solve this for the general case. Regexps don't generally ‘fail’ at a particular character, they just get to a point where they can't match any more, and have to backtrack to try another method of matching.
One could make a regex implementation that remembered which was the farthest it managed to match before backtracking, but most implementations don't do that, including JavaScript's.
A possible way forward would be to match first against ^pattern$, and if that failed match against ^pattern without the end-anchor. This would be more likely to give you some sort of match of the left hand part of the string, so you could count how many characters were in the match, and say the following character was ‘invalid’. For more complicated regexps this would be misleading, but it would certainly work for the simple cases like [a-zA-Z0-9_]+.
I must admit that I'm struggling to parse your question.
If you are looking for a regular expression that will match only if a string consists entirely of a certain collection of characters, regardless of their order, then your examples of character classes were quite close already.
For instance, ^[A-Za-z0-9]+$ will only allow strings that consist of letters A through Z (upper and lower case) and numbers, in any order, and of any length.

Categories

Resources