Using RegEx in Javascript to match similar strings that have characters swapped - javascript

I am using regular expressions to find the frequency of occurrences of certain string values in a large data set. This was working fine until I found that several years' worth of the data had been entered with a typo, meaning two characters have been swapped around. It is not feasible to edit the data sets to correct the typo. Therefore, is it possible to define a RegEx that will match the strings regardless of the order of just two characters within them?
The strings in question are:
"gcse/o-level/cse" and "gsce/o-level/cse"
I am aware I can simply search by the characters found after the typo, but I would like to know if there is a RegEx method to deal with this sort of occurrence as I could not find any mention of a solution anywhere else, and thought it posed an interesting challenge.

You can just use
/g(cs|sc)e\/o-level\/cse/
| here means "or", as you're used to.
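As a minimal sketch of how the alternation could be used for the frequency count, the snippet below counts both spellings in one pass. The sample text and the helper name countOccurrences are made up for illustration; the group is written as non-capturing, (?:...), only because the captured value is not needed.

```javascript
// Matches both the correct spelling and the swapped-character typo.
const pattern = /g(?:cs|sc)e\/o-level\/cse/g;

// Hypothetical helper: counts every occurrence of either spelling in a string.
function countOccurrences(text) {
  const matches = text.match(pattern); // null when nothing matches
  return matches ? matches.length : 0;
}

// Made-up sample data containing two correct entries and one typo.
const sample = "gcse/o-level/cse; gsce/o-level/cse; a-level; gcse/o-level/cse";
console.log(countOccurrences(sample)); // 3
```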

Related

What is the safest/most reliable separator/delimiter to join string in javascript?

I have tried special characters such as #, | and ~, but they also appear as data in some values, which breaks my approach of joining the values and splitting them again later in the code.
There isn't any "safest" or "most reliable" separator for joining strings in any language, not just JavaScript.
It depends entirely on your dataset, meaning the "safest" choice will be different for every different set of data.
For example, if your dataset is guaranteed to contain only integers, then any non-numeric characters can be the safest choice.
However, if your dataset is free text, then there is no "safest" choice, because even if you choose an arbitrary combination of characters as the separator, e.g. %%%, an end user can still supply it in a legitimate sentence like My preferred pronoun is "%%%", however unlikely that is. Using %%% as the separator would then still break your logic.
Because of this, you can only choose a separator that gives you the least risk.
Depending on your use case, there are probably other, simpler solutions that do not require separators.
Generally, avoid joining strings if you need to separate them again later. Serializing the data as JSON is usually a good compromise and has the best interoperability.
CSV can work well too, but don't just insert commas; make sure you properly escape the values that need it.
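A small sketch of the JSON approach, using only the built-in JSON.stringify and JSON.parse. The sample values are invented to include the characters mentioned in the question; the last two lines show how a plain join/split on | miscounts the same data.

```javascript
// Made-up values that contain several would-be separators.
const values = ["a|b", "c#d~e", 'say "%%%"', "x,y"];

// Serialize to a single string; JSON escapes whatever needs escaping.
const packed = JSON.stringify(values);

// Parse it back: the original values come out unchanged.
const unpacked = JSON.parse(packed);
console.log(unpacked.length); // 4

// For comparison: joining on "|" and splitting again yields one value too many,
// because "a|b" itself contains the separator.
const naive = values.join("|").split("|");
console.log(naive.length); // 5
```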
If JSON or CSV isn't appropriate, then a sequence of special characters is more likely to be unique. You could use || (double pipe), as that is very unlikely to occur in anything except C-based code.
You could use other special characters, but I would avoid $ or % as these are commonly used in replacement tokens. Also avoid any form of brackets, as they are used for other container-based replacements.
A 3-character code using multiple symbols, such as |:|, is even less likely to collide. Just pick something that visually looks like a barrier between values and can't be confused with tokens.

Reversed regex machine implementation

I'm trying to match a string starting from the last character so that the match fails as soon as possible. This way I can fail a match against a custom string cstr (see the specification below) with the least number of operations (4th property).
From a theoretical perspective, the regex can be represented as a finite state machine and its arrows can be flipped, creating the reversed regex.
I'm looking for an implementation of this: a library or program to which I can give the string and the pattern. cstr is implemented in Python, so a Python module if possible. (For the curious: the i-th character is not calculated until needed.) With anything else I would need to do much more work, because cstr's calculation is hard to port to another language.
The implementation doesn't have to cover all latex syntax. I'm looking for the basics. No lookaheads or fancy stuff. See specification below.
I may be lacking common knowledge. Please do comment obvious things, too.
Specification
The custom string cstr has the following properties:
String can be calculated in finite time.
String has finite length
The last character is known
Every previous character requires a costly calculation
Until the string is calculated fully, length is unknown
When the string is calculated fully, I want to match it with a simple regex which may contain the following syntax. No lookaheads or fancy stuff.
alphanumeric characters
Unicode characters
., *, +, ?, \w, \W, [], |, escape char \, range specification with { , }
PS: This is not a homework question. I'm trying to formulate my question as clearly as possible.
OP here. Here are some thoughts:
Since I'm looking for an unoptimized regex machine, I have to build it myself, which takes time.
Alternatively, we can define an upper bound for the length of cstr and generate all strings shorter than that bound that match the given regex. Then we put all the solutions into a trie data structure and match against it. This depends on the use case, and maybe a cache can be involved.
What I'm going with is the Python module greenery:
from greenery import parse
pattern = parse.Pattern(...)
pattern.reversed()
...
This sometimes provides a good matching experience, sometimes not, but that is OK for me.

Regex to match certain characters and exclude certain characters but without negative lookahead

I want a regex that matches all emojis (or most of them) but excludes certain characters (such as “|”|‘|’|…|—).
This regex does the job via negative lookahead:
/(?!\u201C|\u201D|\u2018|\u2019|\u2026|\u2014)(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])/
But apparently Google Scripts doesn't support this. Error:
Invalid regular expression pattern
(?!“|”|‘|’|…|—)(©|®|[ -㌀]|?[퀀-?]|?[퀀-?]|?[퀀-?])
Is there another way to achieve my goal (a regex that works with Google Script's findText)?
Option 1
Maybe,
[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]
might be working OK for your desired emojis.
Option 2
Otherwise, you might want to negate those undesired chars using char classes, such as:
[these unicode ranges &&[^these unicodes]]
which would become pretty complicated, yet possible.
Option 3
Using this option you can most likely solve your problem much more simply. I guess your problem is that those undesired punctuation marks already fall within the desired Unicode ranges. Check whether that is the case. For example, in
[\u0100-\u0200]
you might have \u0150 and \u0175 as undesired characters that you want removed from the ranges you already have.
You can then simply cut them out of the range (the bounds are hexadecimal, so the code point just before \u0150 is \u014F):
[\u0100-\u014F\u0151-\u0174\u0176-\u0200]
and just like that the problem would be solved.
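To illustrate cutting code points out of a range, here is a sketch applied to the question's own [\u2000-\u3300] range and its six unwanted punctuation characters. The helper classWithout is hypothetical and handles only BMP code points. The result is checked here as an ordinary JavaScript regex; whether Google Apps Script's findText accepts the same pattern is an assumption that would need testing.

```javascript
// Hypothetical helper: builds a character class for [lo, hi] minus the excluded
// code points. Works for BMP code points only (\uXXXX escapes).
function classWithout(lo, hi, excluded) {
  const esc = (cp) => "\\u" + cp.toString(16).padStart(4, "0");
  const skip = [...new Set(excluded)]
    .filter((cp) => cp >= lo && cp <= hi)
    .sort((a, b) => a - b);
  const parts = [];
  let start = lo;
  for (const cp of skip) {
    if (cp > start) {
      // Emit the sub-range that ends just before the excluded code point.
      parts.push(start === cp - 1 ? esc(start) : esc(start) + "-" + esc(cp - 1));
    }
    start = cp + 1;
  }
  if (start <= hi) {
    parts.push(start === hi ? esc(start) : esc(start) + "-" + esc(hi));
  }
  return "[" + parts.join("") + "]";
}

// The six characters from the question: curly quotes, ellipsis, em dash.
const unwanted = [0x201c, 0x201d, 0x2018, 0x2019, 0x2026, 0x2014];
const bmpClass = classWithout(0x2000, 0x3300, unwanted);
console.log(bmpClass);
// [\u2000-\u2013\u2015-\u2017\u201a-\u201b\u201e-\u2025\u2027-\u3300]

// No lookahead needed: the unwanted characters are simply not in the class.
const emojiLike = new RegExp(bmpClass);
console.log(emojiLike.test("\u2600")); // true  (a symbol inside the range)
console.log(emojiLike.test("\u2014")); // false (em dash, excluded)
```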
Source
javascript unicode emoji regular expressions

Getting parts of a URL in JavaScript

I have to match URLs in a text, linkify them, and then display only the host--domain name or IP address--to the user. How can I proceed with JavaScript?
Thanks.
PS: please don't tell me about this; those regular expressions are so buggy they can't match http://google.com
If you don't want to use regular expressions, then you'll need to use things like indexOf and such instead. For instance, search for "://" in the text of every element and if you find it and the bit in front of it looks like a protocol (or "scheme"), grab it and the following characters that are valid URI characters (RFC2396). If the result ends in a dot or question mark, remove the dot or question mark (it probably ends a sentence). There's not really a lot more to say.
Update: Ah, I see from your edit that you don't have a problem with regular expressions, just the ones in the answers to that question. Fair enough.
This may well be one of those places where trying to do it all with a regular expression is more work than it should be, but using regular expressions as part of the solution is helpful. For instance,
/[a-zA-Z][a-zA-Z0-9+\-.]*:\/\//
...may well be a helpful way to find the beginning of a URL, since the scheme portion must start with an alpha and then can have zero or more alpha, digit, +, -, or . prior to the : (section 3.1).
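A rough sketch of how that scheme pattern might be combined with the built-in URL class to pull out just the host. The tail of the pattern ([^\s<>"]+), the trailing-punctuation stripping, and the helper name hostsIn are assumptions added for illustration, not part of the answer above, and this is not a complete URL matcher.

```javascript
// Scheme pattern from above, followed by an assumed "run of non-space characters".
const urlLike = /[a-zA-Z][a-zA-Z0-9+\-.]*:\/\/[^\s<>"]+/g;

// Hypothetical helper: returns the host (domain name or IP) of each URL-like match.
function hostsIn(text) {
  const hosts = [];
  for (const match of text.matchAll(urlLike)) {
    // Drop trailing punctuation that most likely ends a sentence.
    const candidate = match[0].replace(/[.?,!;:)]+$/, "");
    try {
      hosts.push(new URL(candidate).hostname);
    } catch {
      // Not parseable as a URL; skip it.
    }
  }
  return hosts;
}

console.log(hostsIn("See http://google.com. Or https://192.168.0.1:8080/path?x=1"));
// [ 'google.com', '192.168.0.1' ]
```

Letting the URL class do the parsing keeps the regular expression small: it only has to find candidates, not validate them.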

How to detect what allowed character in current Regular Expression by using JavaScript?

In my web application, I created a framework that binds model data to controls on the page. Each model property has rules such as string length, not null, and a regular expression. Before the page is submitted, the framework validates every bound control against the defined rules.
So, I want to detect which characters are allowed by each regular expression rule, as in the following examples.
"^[0-9]+$" allows only digit characters such as 1, 2, 3.
"^[a-zA-Z_][a-zA-Z_\-0-9]+$" allows only letters, digits, - and _.
However, this function should not care about grouping or the position of the allowed characters. It should only report which characters are possible.
Do you have any idea how to create this function?
PS. I know it is easy to write a specific function, such as a numeric-only one that allows only digits. But I need to share the same piece of code between the data tier (which contains all the model validators) and the UI tier without modifying anything.
Thanks
You can't solve this for the general case. Regexps don't generally ‘fail’ at a particular character, they just get to a point where they can't match any more, and have to backtrack to try another method of matching.
One could make a regex implementation that remembered which was the farthest it managed to match before backtracking, but most implementations don't do that, including JavaScript's.
A possible way forward would be to match first against ^pattern$, and if that failed match against ^pattern without the end-anchor. This would be more likely to give you some sort of match of the left hand part of the string, so you could count how many characters were in the match, and say the following character was ‘invalid’. For more complicated regexps this would be misleading, but it would certainly work for the simple cases like [a-zA-Z0-9_]+.
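The two-step idea above might look like the following sketch. The function name firstInvalidIndex is hypothetical, and the pattern is passed as a source string without anchors so that ^ and $ can be added around it. As noted, the result is only meaningful for simple patterns.

```javascript
// Hypothetical helper: returns -1 if the whole input matches, otherwise the index
// of the first character that the anchored-at-start match could not consume.
function firstInvalidIndex(patternSource, input) {
  // Step 1: full match against ^pattern$.
  if (new RegExp("^(?:" + patternSource + ")$").test(input)) {
    return -1;
  }
  // Step 2: match against ^pattern only and measure how far it got.
  const partial = new RegExp("^(?:" + patternSource + ")").exec(input);
  return partial ? partial[0].length : 0;
}

console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc_def")); // -1 (valid)
console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc-def")); // 3  (the "-")
console.log(firstInvalidIndex("[0-9]+", "x12"));            // 0  (the "x")
```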
I must admit that I'm struggling to parse your question.
If you are looking for a regular expression that will match only if a string consists entirely of a certain collection of characters, regardless of their order, then your examples of character classes were quite close already.
For instance, ^[A-Za-z0-9]+$ will only allow strings that consist of letters A through Z (upper and lower case) and numbers, in any order, and of any length.
