Javascripts String.split - how does it work internally? - javascript

I've recently discussed with a colleague how the separator of String.split is treated internally by JavaScript.
Is the separator always converted into a regular expression? E.g. will calling String.split(",", myvar) convert the "," into a regualar expression matching that string?

Well the answer for your question: "Is the separator always converted into a regular expression?" is:
It depends solely on the implementation. For example if you look at WebKit implementation http://svn.webkit.org/repository/webkit/trunk/Source/JavaScriptCore/runtime/StringPrototype.cpp (find stringProtoFuncSplit) then you see it is not always converted to RegEx. However, this does not imply anything, it is just a matter of implementation

Here's the official writeup over at ecma, but the relevant part is around this section:
8.If separator is a RegExp object (its [[Class]] is "RegExp"), let R = separator; otherwise let R = ToString(separator).
That being said it is the ecma spec, and as Anthony Grist mentioned in the comments, browsers can implement as they want, for instance V8 implements ecma262.
Edit: expanded thought on browser/js engines implementation, it appears the majority implement versions of ecma, as seen on this wiki

Yes, the javascript function split allow you to use regex:
EX:
var str = "I am confused";
str.split(/\s/g)
Str then contains ["I","am","confused"]

separator specifies the character(s) to use for separating the string. The separator is treated as a string or a regular expression. If separator is omitted, the array returned contains one element consisting of the entire string. If separator is an empty string, str is converted to an array of characters.
Please see the below link to know more about this, hope it will help you:
String.prototype.split()

Related

ES6 / JS: Regex for replacing delve with conditional chaining

How to replace delve with conditional chaining in a vs code project?
e.g.
delve(seo,'meta')
delve(item, "image.data.attributes.alternativeText")
desired result
seo?.meta
item?.image.data.attributes.alternativeText
Is it possible using find/replace in Visual Studio Code?
I propose the following RegEx:
delve\(\s*([^,]+?)\s*,\s*['"]([^.]+?)['"]\s*\)
and the following replacement format string:
$1?.$2
Explanation: Match delve(, a first argument up until the first comma (lazy match), and then a second string argument (no care is taken to ensure that the brackets match as this is rather quick'n'dirty anyways), then the closing bracket of the call ). Spacing at reasonable places is accounted for.
which will work for simple cases like delve(someVar, "key") but might fail for pathological cases; always review the replacements manually.
Note that this is explicitly made incapable of dealing with delve(var, "a.b.c") because as far as I know, VSC format strings don't support "joining" variable numbers of captures by a given string. As a workaround, you could explicitly create versions with two, three, four, five... dots and write the corresponding replacements. The version for two dots for example looks as follows:
delve\(([^,]+?)\s*,\s*['"]([^.]+?)\.([^.]+?)['"]\s*\)
and the format string is $1?.$2?.$3.
You write:
e.g.
delve(seo,'meta')
delve(item, "image.data.attributes.alternativeText")
desired result
seo?.meta
item?.image.data.attributes.alternativeText
but I highly doubt that this is intended, because delve(item, "image.data.attributes.alternativeText") is in fact equivalent to item?.image?.data?.attributes?.alternativeText rather than the desired result you describe. To make it handle it that way, simply replace [^.] with . to make it accept strings containing any characters (including dots).

"String does not match the pattern of \"^(?:#[a-z0-9-*~][a-z0-9-*._~]*/)?[a-z0-9-~][a-z0-9-._~]*$\"." [duplicate]

How can I make the following regex ignore case sensitivity? It should match all the correct characters but ignore whether they are lower or uppercase.
G[a-b].*
Assuming you want the whole regex to ignore case, you should look for the i flag. Nearly all regex engines support it:
/G[a-b].*/i
string.match("G[a-b].*", "i")
Check the documentation for your language/platform/tool to find how the matching modes are specified.
If you want only part of the regex to be case insensitive (as my original answer presumed), then you have two options:
Use the (?i) and [optionally] (?-i) mode modifiers:
(?i)G[a-b](?-i).*
Put all the variations (i.e. lowercase and uppercase) in the regex - useful if mode modifiers are not supported:
[gG][a-bA-B].*
One last note: if you're dealing with Unicode characters besides ASCII, check whether or not your regex engine properly supports them.
Depends on implementation
but I would use
(?i)G[a-b].
VARIATIONS:
(?i) case-insensitive mode ON
(?-i) case-insensitive mode OFF
Modern regex flavors allow you to apply modifiers to only part of the regular expression. If you insert the modifier (?im) in the middle of the regex then the modifier only applies to the part of the regex to the right of the modifier. With these flavors, you can turn off modes by preceding them with a minus sign (?-i).
Description is from the page:
https://www.regular-expressions.info/modifiers.html
regular expression for validate 'abc' ignoring case sensitive
(?i)(abc)
The i flag is normally used for case insensitivity. You don't give a language here, but it'll probably be something like /G[ab].*/i or /(?i)G[ab].*/.
Just for the sake of completeness I wanted to add the solution for regular expressions in C++ with Unicode:
std::tr1::wregex pattern(szPattern, std::tr1::regex_constants::icase);
if (std::tr1::regex_match(szString, pattern))
{
...
}
JavaScript
If you want to make it case insensitive just add i at the end of regex:
'Test'.match(/[A-Z]/gi) //Returns ["T", "e", "s", "t"]
Without i
'Test'.match(/[A-Z]/g) //Returns ["T"]
In JavaScript you should pass the i flag to the RegExp constructor as stated in MDN:
const regex = new RegExp('(abc)', 'i');
regex.test('ABc'); // true
As I discovered from this similar post (ignorecase in AWK), on old versions of awk (such as on vanilla Mac OS X), you may need to use 'tolower($0) ~ /pattern/'.
IGNORECASE or (?i) or /pattern/i will either generate an error or return true for every line.
C#
using System.Text.RegularExpressions;
...
Regex.Match(
input: "Check This String",
pattern: "Regex Pattern",
options: RegexOptions.IgnoreCase)
specifically: options: RegexOptions.IgnoreCase
[gG][aAbB].* probably simples solution if the pattern is not too complicated or long.
Addition to the already-accepted answers:
Grep usage:
Note that for greping it is simply the addition of the -i modifier. Ex: grep -rni regular_expression to search for this 'regular_expression' 'r'ecursively, case 'i'nsensitive, showing line 'n'umbers in the result.
Also, here's a great tool for verifying regular expressions: https://regex101.com/
Ex: See the expression and Explanation in this image.
References:
man pages (man grep)
http://droptips.com/using-grep-and-ignoring-case-case-insensitive-grep
In Java, Regex constructor has
Regex(String pattern, RegexOption option)
So to ignore cases, use
option = RegexOption.IGNORE_CASE
Kotlin:
"G[a-b].*".toRegex(RegexOption.IGNORE_CASE)
You also can lead your initial string, which you are going to check for pattern matching, to lower case. And using in your pattern lower case symbols respectively .
You can practice Regex In Visual Studio and Visual Studio Code using find/replace.
You need to select both Match Case and Regular Expressions for regex expressions with case. Else [A-Z] won't work.enter image description here

Efficiently remove common patterns from a string

I am trying to write a function to calculate how likely two strings are to mean the same thing. In order to do this I am converting to lower case and removing special characters from the strings before I compare them. Currently I am removing the strings '.com' and 'the' using String.replace(substring, '') and special characters using String.replace(regex, '')
str = str.toLowerCase()
.replace('.com', '')
.replace('the', '')
.replace(/[&\/\\#,+()$~%.'":*?<>{}]/g, '');
Is there a better regex that I can use to remove the common patterns like '.com' and 'the' as well as the special characters? Or some other way to make this more efficient?
As my dataset grows I may find other common meaningless patterns that need to be removed before trying to match strings and would like to avoid the performance hit of chaining more replace functions.
Examples:
Fish & Chips? => fish chips
stackoverflow.com => stackoverflow
The Lord of the Rings => lord of rings
You can connect the replace calls to a single one with a rexexp like this:
str = str.toLowerCase().replace(/\.com|the|[&\/\\#,+()$~%.'":*?<>{}]/g, '');
The different strings to remove are inside parentheses () and separated by pipes |
This makes it easy enough to add more string to the regexp.
If you are storing the words to remove in an array, you can generate the regex using the RegExp constructor, e.g.:
var words = ["\\.com", "the"];
var rex = new RegExp(words.join("|") + "|[&\\/\\\\#,+()$~%.'\":*?<>{}]", "g");
Then reuse rex for each string:
str = str.toLowerCase().replace(rex, "");
Note the additional escaping required because instead of a regular expression literal, we're using a string, so the backslashes (in the words array and in the final bit) need to be escaped, as does the " (because I used " for the string quotes).
The problem with this question is that im sure you have a very concrete idea in your mind of what you want to do, but the solution you have arrived at (removing un-informative letters before making a is-identical comparison) may not be the best for the comparison you want to do.
I think perhaps a better idea would be to use a different method comparison and a different datastructure than a string. A very simple example would be to condense your strings to sets with set('string') and then compare set similarity/difference. Another method might be to create a Directed Acyclic Graph, or sub-string Trei. The main point is that it's probably ok to reduce the information from the original string and store/compare that - however don't underestimate the value of storing the original string, as it will help you down the road if you want to change the way you compare.
Finally, if your strings are really really really long, you might want to use a perceptual hash - which is like an MD5 hash except similar strings have similar hashes. However, you will most likely have to roll your own for short strings, and define what you think is important data, and what is superfluous.

What is the function of .source in context of this new RegExp

I ran into the below monster of a regex in the wild today. The regex is meant to validate a url.
function superUrlValidation(url) {
return new RegExp(/^/.source + "((.+):\/\/)?" + /(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/.source, "i")
.test(url);
}
I've never seen .source used in a regex like this so I looked it up.
The MDN docs for RegExp.prototype.source states:
The source property returns a String containing the source text of the regexp object, and it doesn't contain the two forward slashes on both sides and any flags.
... and gives this example:
var regex = /fooBar/ig;
console.log(regex.source); // "fooBar", doesn't contain /.../ and "ig".
I understand the MDN example (you're getting the source text of the regex object after it is created, makes sense), but I dont understand how this is being used in the superUrlValidation regex above.
How is the source being used before the regex object is completed and what does this accomplish? I cant find any documentation showing .source being used in this way.
Note that .source is used twice in the regex, at the beginning and the end
Use of .source everywhere in your regex seems totally unnecessary, may be just a trick to avoid double escaping. In fact even use of new RegExp is not needed and you can get away with just the regex literal as this:
var re = /^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i;
/^/ is a regex literal, meaning it's a valid regex object in it's own right. This means that /^/.source === "^".
This seems like an arbitrary example of using the source property as this means the author could have just placed a "^" in it's place, or even just put a ^ at the beginning of the next string, and it would have the same effect.
The .source property returns the content of the regex between the forward slashes as you say. so the result of the above is equivalent to this string:
/^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i
In JavaScript you can write regexes like this: /matchsomething/ or using the RegExp function/constructor above. It looks like the code you found is the result of someone not know what they were doing. They seem to have taken a few regexes using the literal syntax (i.e /match_here/) and plugged it into the constructor version and stuck them all together.
I can't see any benefit in using the source property this way. I would just use the string version or the constructor version. Or better, find out what the original author intended and write it again or find a respected regex library with the criteria you need.
And, yeah, wow. It's massive.

Regex not detecting swear words inside string?

I have a script here:
http://jsfiddle.net/d2rcx/
It has an array of badWords to compare input strings with.
This script works fine if the input string matches exactly as the swear word string but it does not pick up any variations where there is more characters in the string e.g. whitespace before the swear word.
Using this site as reference : http://www.zytrax.com/tech/web/regex.htm
It said the following regex would detect a string within a string.
var regex = new RegExp("/" + badWords[i] + "/g");
if (fieldValue.match(regex) == true)
return true;
However that does not seem to be the case.
What do I need to change to the regex to make it work.
Thanks
Also any good links to explain Regex than what google turns up would be appreciated.
Here's a corrected JSFiddle:
http://jsfiddle.net/d2rcx/5/
See the following documentation for RegExp:
https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/RegExp
Note that the second parameter is where you should specify your flags (e.g. 'g' or 'i'). For example:
new RegExp(badWords[i], 'gi');
Please review http://www.w3schools.com/jsref/jsref_match.asp and note that fieldValue.match() will not return a boolean but an array of matches
Rather than using Regex to do this you are probably better off just looping through an array of badwords and looking for instances within a string using [indexOf].(https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/String/indexOf)
Otherwise you could make a regex like...
\badword1|badword2|badword3\
and just check for any match.
A word boundary in Regex is \b so you could say
\\b(badword1|badword2|badword3)\b\
which will match only whole words - ie Scunthorpe will be ok :)
var rx = new RegExp("\\b(donkey|twerp|idiot)\\b","i"); // i = case insenstive option
alert(rx.test('you are a twerp')); //true
alert(rx.test('hello idiotstick')); //fasle -not whole word
alert(rx.test('nice "donkey"')); //true
http://jsfiddle.net/F8svC/
Changing this, which requires a loop:
var regex = new RegExp("/" + badWords[i] + "/g");
for this:
var regex = new RegExp("/" + badWords.join("|") + "/g");
would be a start. This will do all the matches in one go because the array becomes one string with each element separated by pipes.
P.S.
Reference guide for RegEx here. But there isn't a lot of clear information online about what is and isn't possible with respect to certain functions nor what's good code. I've found a couple of books most useful: David Flanagan's latest JavaScript: The Definitive Guide and Douglas Crockford's JavaScript: The Good Parts for the most usable subset of JavaScript, including RegEx. The railroad diagrams by Crockford are especially good but I'm not sure if they're available online anywhere.
EDIT: Here's an online copy of the relevant chapter including some of those railroad diagrams I mentioned, in case it helps.

Categories

Resources