is there an algorithm that indexes spaces instead of characters? - javascript

I'm new to this and was wondering, after reading up on indexing and character counts: wouldn't it be more applicable to index spaces instead of characters to improve matching of words?
Looking at the example below, it selects/counts the white spaces at the end of every word. But I want it to count or recognize the space at the end of a word and the beginning of the following word, essentially noticing/collating white space characters. Does that make any sense?
var str = 'This is a string',
index = 0,
res = [];
while ((index = str.indexOf(' ', index + 1)) > 0) {
res.push(index);
}
console.log(res)

Short answer: No, but you can split it into an array of strings by spaces with String.prototype.split().
Long answer:
You could think about this in a few ways... and this applies to more languages than just JS:
Implementation
Strings are more akin to an array of characters. This is easier visualized if you think about how a string would be recognized by a computer: as a number (wait what?).
As I'm sure you know, computers can really only see 1's and 0's, so some smart people came up with a way to represent each character with a number... and then defined some special things for when certain characters come in certain orders, but that's a whole other can of worms. These definitions are what we call charsets. Note that the number of bits available per character is defined by the encoding, e.g. UTF-8 or UTF-16.
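To make the "string as numbers" idea concrete, here is a minimal sketch in JS. The variable names are just for illustration; charCodeAt returns the UTF-16 code unit at a given index, which is what JS strings are indexed by.

```javascript
// A string behaves like a sequence of numbered characters.
const text = 'Hi!';

// Look up the number stored behind each character (UTF-16 code units).
const codes = Array.from(text, (ch) => ch.charCodeAt(0));
console.log(codes);   // [72, 105, 33]

// Indexing by character position gives back a single character.
console.log(text[1]); // "i"
```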
So returning to the original question, "Why index by characters and not spaces?": because it's actually pretty convenient, for reasons I'm about to elaborate on.
Mathematics
Let's be honest with ourselves, this is Computer Science, so we should probably back up our reasons with math... which is where Formal Languages come in.
A formal language L over an alphabet Σ is a subset of Σ*, that is, a set of words over that alphabet.
A formal grammar is a set of production rules for strings in a formal language.
A word over an alphabet can be any finite sequence (i.e., string) of letters.
In mathematics a sequence is an enumerated collection of objects in which repetitions are allowed.
Note: Be aware that this post is in no way a proper description of formal languages, just a collection of generalized descriptions provided by Wikipedia.
Which leads me to grammars, your case of a string being indexed by a " " is really more of something that is defined as a grammar and is really a concept derived from those nonsensical human languages.
So what this means is if you boil down what a string really is and what defines it, you can see that it is defined all the way down on the level of mathematics.
Practicality
But wait, I am human why does that still apply?
Well think about it this way, a string can hold more than just a sentence right? Take JSON for example "{\"key1\":\"hello world\",\"key2\":0}" would it make sense to have this string indexed by spaces? There's also the issue that substrings are a lot more tricky because we can't reference the individual characters anymore, so even iterating over a string would become a complicated task.
So why not make another data type?
Honestly this is just a reiteration of before: is it really necessary? Is it common enough of a problem to warrant an entirely new datatype when the key difference is more or less splitting the string and not allowing the programmer to look up the characters individually?
An Algorithm
Well, this is probably the easiest part of the answer... now that we have a general understanding of what a string is. As with all programming, there are a ton of ways to go about it. I'm specifically sticking with JS functions, because that was tagged in the question.
As @Redu mentioned, there's the String.prototype.split() method (the easiest way I know of), which allows you to split a string on another string or on a regular expression. (Regular expressions are also defined in formal languages; most programming languages that have them offer much more "featured" versions, and they can also be used to describe some grammars.) So to split a string on a ' ' we can go one of three ways (two are regular expression approaches).
console.log("Hello World!".split(" ")); // Split on all instances of a single space
console.log("Foo bar".split(/ /g)); // RegEx split on all single spaces
console.log("A   RegEx".split(/\s+/g)); // RegEx split on one or more whitespaces (not necessarily just a space)
console.log("A   RegEx".split(" ")); // For comparison (multiple spaces)
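If what you are really after is where each word begins and ends (i.e. the positions on either side of the white space), a rough sketch along these lines may help. wordSpans is a hypothetical helper name, not a built-in; it relies on String.prototype.matchAll, which needs a reasonably modern engine.

```javascript
// Hypothetical helper: record where each whitespace-delimited word
// starts and ends (end is exclusive, like String.prototype.slice).
function wordSpans(str) {
  const spans = [];
  for (const match of str.matchAll(/\S+/g)) {
    spans.push({
      word: match[0],
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return spans;
}

console.log(wordSpans('This is a string'));
// [ { word: 'This', start: 0, end: 4 },
//   { word: 'is', start: 5, end: 7 },
//   { word: 'a', start: 8, end: 9 },
//   { word: 'string', start: 10, end: 16 } ]
```

The string itself is still indexed by character; the word positions are just derived data on top of it.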
TL;DR
There are quite a few reasons, but essentially it really doesn't make sense to. Why not split the string?

Related

VueJS detect if string contains words

I'm looking to detect whether or not a string has any word.
"DD8BC606-E0C0-41A2-8E7E-FCB2A1D66D76.jpeg" // false
"Image of Logo JPG" // true
I'd imagine there needs to be some sort of reference to the English dictionary on what constitutes a word, but in my scenario it could be in any language.
I've tried to split the string into an array of words and check the word length for non absurd lengths (though some words especially in German are absurdly long).
string_to_array = function (str) {
return str.trim().split(" ");
};
In the same vein, I tried counting how many words there are in the array and requiring at least 2 (which would be a strong indicator that the string has been typed by a human, i.e. that it contains words), but this invalidates one-word strings (although this isn't a stringent requirement).
What's the fastest ad-hoc way to make sure that there is at least a word in a string?
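For reference, the "at least two words" heuristic described above could be sketched like this. looksTyped and its minWords parameter are made-up names, and this is only the ad-hoc check from the question, not a dictionary-based test.

```javascript
// Hypothetical helper: treat a string as "typed by a human" when it
// splits into at least `minWords` whitespace-separated tokens.
function looksTyped(str, minWords = 2) {
  const tokens = str.trim().split(/\s+/).filter((t) => t.length > 0);
  return tokens.length >= minWords;
}

console.log(looksTyped('DD8BC606-E0C0-41A2-8E7E-FCB2A1D66D76.jpeg')); // false
console.log(looksTyped('Image of Logo JPG'));                         // true
```

As noted in the question, a genuine one-word string is still rejected by this check.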

Need a regular expression in javascript [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 2 years ago.
I know that I can negate a group of chars as in [^bar], but I need a regular expression where the negation applies to a specific word. So in my example, how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
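A quick way to try the pattern above in JavaScript (the variable name and sample strings are just for illustration):

```javascript
// Matches only strings that do not contain "bar" anywhere.
const noBar = /^(?!.*bar).*$/;

console.log(noBar.test('foo'));       // true
console.log(noBar.test('foobarbaz')); // false
console.log(noBar.test('b a r'));     // true
```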
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
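A rough sketch of that two-pass idea in JavaScript, with a made-up list of lines and a made-up first-pass pattern:

```javascript
// Hypothetical input; the first pass keeps what we want,
// the second pass drops anything containing the unwanted word.
const lines = ['foo', 'bar', 'foobar', 'baz', 'barrier', 'qux', '42'];

const wanted = lines.filter((line) => /^[a-z]+$/.test(line)); // first pass
const kept = wanted.filter((line) => !line.includes('bar'));  // second pass

console.log(kept); // [ 'foo', 'baz', 'qux' ]
```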
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (which is what is desired)
xxxSTRING2xxx KO (which is what is desired)
xxxSTRING3xxx KO (which is what is desired)
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
These all match anything that does not contain bar.
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'".
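Checking that regex against the examples from the statement, as a small JavaScript sketch (the variable name is arbitrary):

```javascript
// Matches every string except the exact string "bar".
const notExactlyBar = /^(bar.+|(?!bar).*)$/;

console.log(notExactlyBar.test('bar'));     // false
console.log(notExactlyBar.test('barrier')); // true
console.log(notExactlyBar.test('disbar'));  // true
console.log(notExactlyBar.test('foo'));     // true
```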
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exists. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
If it's truly a word, bar, that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar on a word boundary, that is to say, bar delimited by non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
^(?![\s\S]*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
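In JavaScript specifically, the inline (?s) syntax is not available, but the s (dotAll) flag does the same job on engines that support ES2018. A small sketch, with made-up sample strings:

```javascript
// The `s` flag lets `.` match newlines, so the lookahead scans every line.
const noWordBar = /^(?!.*\bbar\b).*$/s;

console.log(noWordBar.test('foo\nbar baz'));     // false: "bar" is a word on line 2
console.log(noWordBar.test('barrier\ndisbar'));  // true: "bar" only appears inside words
console.log(noWordBar.test('bar'));              // false
```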
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
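If the list of excluded words is only known at runtime, the same pattern can be assembled from an array. This is just a sketch: escapeRegExp and notExactly are hypothetical helper names, and the escaping is there so that words containing regex metacharacters are treated literally.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal word.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build ^(?!word1$|word2$|...).* from a word list.
function notExactly(words) {
  const alternatives = words.map((w) => escapeRegExp(w) + '$').join('|');
  return new RegExp('^(?!' + alternatives + ').*');
}

const re = notExactly(['bar', 'foo', 'banana']);
console.log(re.source);          // ^(?!bar$|foo$|banana$).*
console.log(re.test('bar'));     // false
console.log(re.test('barrier')); // true
```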
I wish to complement the accepted answer and contribute to the discussion with my late answer.
@ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
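For what it's worth, JavaScript's String.prototype.split() behaves the same way on this input, empty strings included. A quick sketch using the same text:

```javascript
const text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf';

// Splitting on the word to negate keeps empty strings between repeats.
const parts = text.split('bar');
console.log(parts);
// [ '', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf' ]

// Drop the empty strings if they are not wanted.
const nonEmpty = parts.filter((part) => part !== '');
console.log(nonEmpty);
// [ 'asd', ' 1234egb ar ', '32 sdf', 'aadf' ]
```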
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

If-else condition in Regex [duplicate]

This is what I have so far...
var regex_string = "s(at)?u(?(1)r|n)day"
console.log("Before: "+regex_string)
var regex_string = regex_string.replace(/\(\?\((\d)\)(.+?\|)(.+?)\)/g,'((?!\\$1)$2\\$1$3)')
console.log("After: "+regex_string)
var rex = new RegExp(regex_string)
var arr = "thursday tuesday thuesday tursday saturday sunday surday satunday monday".split(" ")
for(i in arr){
var m
if(m = arr[i].match(rex)){
console.log(m[0])
}
}
I am swapping (?(n)a|b) for ((?!\n)a|\nb) where n is a number, and a and b are strings. This seems to work fine - however, I am aware that it is a big fat hack.
Is there a better way to approach this problem?
In the specific case of your regex, it is much simpler and more readable to use alternation:
(?:sunday|saturday)
Or you can create alternation only between the 2 positions where the conditional regex is involved (this is more useful in the case where there are many such conditional expressions, but they only refer to the nearby capturing group). Using your case as an example, we will only create the alternation for un and atur since only those are involved in the condition:
s(?:un|atur)day
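Running that alternation over the word list from the question, as a short sketch:

```javascript
const rex = /s(?:un|atur)day/;
const words = 'thursday tuesday thuesday tursday saturday sunday surday satunday monday'.split(' ');

// Keep only the words the alternation accepts.
const matched = words.filter((word) => rex.test(word));
console.log(matched); // [ 'saturday', 'sunday' ]
```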
There are 2 common types of conditional regex. (There is more exotic stuff supported by Perl regular expressions, but it requires features that JavaScript regular expressions and other common regex engines don't have.)
The first type is where an explicit pattern is provided as condition. This type can be mimicked in JavaScript regex. In the language that supports conditional regex, the pattern will be:
(?(conditional-pattern)yes-pattern|no-pattern)
In JavaScript, you can mimic it with look-ahead, with the (obvious) assumption that the original conditional-pattern is a look-ahead:
((?=conditional-pattern)yes-pattern|(?!conditional-pattern)no-pattern)
The negative look-ahead is necessary to prevent cases where the input string passes the conditional-pattern and fails the yes-pattern, but can still match the no-pattern. It is safe to do so, because positive and negative look-arounds are exact logical opposites of each other.
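A made-up illustration of that emulation: suppose the condition is "the input starts with a digit", the yes-pattern is "digits only" and the no-pattern is "lowercase letters only". None of these patterns come from the question; they are only here to show the shape.

```javascript
// Emulates (?(?=\d)\d+|[a-z]+) with a pair of opposite look-aheads.
const conditional = /^(?:(?=\d)\d+|(?!\d)[a-z]+)$/;

console.log(conditional.test('123')); // true  (condition holds, yes-pattern matches)
console.log(conditional.test('abc')); // true  (condition fails, no-pattern matches)
console.log(conditional.test('12a')); // false (condition holds, yes-pattern fails)
```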
The second type is where a reference to a capturing group is provided (name or number), and the condition will be evaluated to true when the capturing group has a match. In such case, there is no simple solution.
The only way I can think of is by duplication, as I have done with your case as an example. This of course reduces maintainability. It is possible to compose your regex by writing it in parts (as literal RegExp objects), retrieving each part's string with the source property, then concatenating them together; this allows changes to propagate to the duplicated parts, but makes it harder to understand the regex and/or make major modifications to it.
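A small sketch of composing a regex from parts via the source property. The fragment and variable names are invented for illustration; the point is only that a change to the shared fragment shows up in every regex built from it.

```javascript
// Shared fragment, written once as a literal RegExp.
const weekendStem = /un|atur/;

// Reuse the fragment's source string inside larger patterns.
const weekend = new RegExp('s(?:' + weekendStem.source + ')day');
const weekendCapitalized = new RegExp('S(?:' + weekendStem.source + ')day');

console.log(weekend.source);                    // s(?:un|atur)day
console.log(weekend.test('saturday'));          // true
console.log(weekendCapitalized.test('Sunday')); // true
console.log(weekend.test('surday'));            // false
```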
References
Alternation Constructs in Regular Expression - .NET - Microsoft
re package in Python: Ctrl+F for (?(
perlre - Perl regular expression: Ctrl+F for (?(

Is regex a costly operation? It seems so, at least

I was writing a regex pattern for a string that has a constant type/structure. What I mean is, it looks like this (the format itself is not so important; have a look at the example next):
[Data Code Format]: POP N/N/N (N: object): JSON object data
Here N represents a number or digit. and what's inside [ ] is a set of string block. But, this format is constant.
So, I wrote a regex-
\s*((?:\S+\s+){1}\S+)\s*(?:\((?:.*?)\))?:\s*(\S*)\s*(\w+)
Keeping this string example in mind-
%DATA-4-JSON: POP 0/1/0 (1: object): JSON object data
It works perfectly, but, what I see on regex101.com is that there is a successful match. But, it has undergone 330 steps to achieve this.
My question is: it's taking 330 steps to achieve this (at least in my mind that feels pretty heavy), which I guess could be achieved with if-else and other comparisons in fewer steps, right?
I mean, is regex string parsing so heavy? 330 steps for each of the 10,000s of strings I need to parse is going to be heavy, right?
Regexps can be costly when they backtrack a lot. When you use quantifiers on consecutive patterns that can match the same text (and when the patterns between them may match empty strings, usually declared with the *, {0,x} or ? quantifiers), backtracking might play a bad trick on you.
What is bad about your regex?
A slight performance issue is caused by an unnecessary non-capturing group (?:\S+\s+){1}. It can be written as \S+\s+ and it will already decrease the number of steps (a bit, 302 steps). However, the worst part is that \S matches the : delimiter. Thus, the regex engine has to try a lot of possible routes to match the beginning of your expected match. Replace that \S+\s+\S+ with [^:\s]+\s+[^:\s]+, and the step count will decrease to 159!
Now, coming to (?:\((?:.*?)\))? - removing the unnecessary inner non-capturing group gives another slight improvement (148 steps), and replacing .*? with a negated character class gives another boost (139 steps).
I'd use this for now:
\s*([^:\s]+\s+[^:\s]+)\s*(?:\([^()]*\))?:\s*(\S*)\s*(\w+)
If the initial \s* was obligatory, it would be another enhancement, but you would have different matches.
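The step counts above come from regex101 and depend on the engine, so they are hard to reproduce exactly elsewhere. What can be checked easily is that the rewritten pattern still captures the same groups as the original on the sample string. A minimal JavaScript sketch (variable names are arbitrary):

```javascript
const sample = '%DATA-4-JSON: POP 0/1/0 (1: object): JSON object data';

// Original pattern from the question.
const original = /\s*((?:\S+\s+){1}\S+)\s*(?:\((?:.*?)\))?:\s*(\S*)\s*(\w+)/;
// Rewritten pattern with negated character classes.
const optimized = /\s*([^:\s]+\s+[^:\s]+)\s*(?:\([^()]*\))?:\s*(\S*)\s*(\w+)/;

// Compare only the capture groups (index 0 is the whole match).
const originalGroups = sample.match(original).slice(1);
const optimizedGroups = sample.match(optimized).slice(1);

console.log(originalGroups);  // [ 'POP 0/1/0', 'JSON', 'object' ]
console.log(optimizedGroups); // [ 'POP 0/1/0', 'JSON', 'object' ]
```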

Regular Expression to identify all camel cased strings in a document

I am rusty on regular expressions and need some help. A JS code base I inherited uses a mix of camel case and snake case for things like variable names and object properties.
I am trying to formulate a regular expression I can use that will identify all the camel cased strings, and then be able to replace those strings with snake casing. The part I am struggling with is identifying the camel cased strings under the conditions I have.
Identifying which strings are camel case: In this document, all camel cased strings start off with either a lower case letter, an underscore, or a $, and then use a capital letter at some point later in the string. Examples are: someCamelCasedString & _someCamelCasedString & $someCamelCasedString. The regular expression would need to take into account that some of the strings I am trying to match may be object properties, so it should be able to identify things like: Foo._someCamelCasedString.bar or Foo[_someCamelCasedString].bar
This identifies all occurrences of "strict" camel case (only letters). Whether they start with _ or $ or foofoo doesn't matter.
[a-z]+[A-Z][a-zA-Z]*
An edge case is cameL. Is that proper camel case? I have assumed it is, but we can change that.
See demo
If you want to allow other characters in the string (digits etc) then we can add them in the character classes. So this is a starting point to be refined depending on your requirements.
For instance if you know that you're happy with digits and underscores, you can go with this:
[a-z]\w*?[A-Z]\w*
If you also want to allow dollars in the name (a character that @Jongware says js strings allow) you can go with this:
[a-z][\w$]*[A-Z][\w$]*
Then there is the question of what constitutes the boundary of a valid string, so that we can perhaps devise some anchor (perhaps with sneaky lookaheads, since JS engines before ES2018 don't support lookbehinds) in order to avoid false positives.
Maybe something like this:
/(\w|\$)+([A-Z])\w+/gm
You can play around with it here and see the examples: http://regexr.com/38qkq The site also explains what each piece means in regular expressions.
/(?:^|\s|[^\w$])([a-z_$][a-zA-Z]*[A-Z][a-zA-Z]*)/gm
Test http://regex101.com/r/pH1aB7
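For the replacement half of the question, here is a rough sketch of turning the matched identifiers into snake case. It assumes an engine with lookbehind support (ES2018+), uses the letters-only "strict" definition from above, and toSnakeCase / convert are made-up helper names. Identifiers containing digits are left untouched by this version.

```javascript
// Camel-cased identifier: starts with a-z, _ or $, is not preceded by
// another identifier character, and contains a capital letter later on.
const camelCased = /(?<![\w$])[a-z_$][a-zA-Z]*[A-Z][a-zA-Z]*/g;

// Hypothetical helper: someCamelCase -> some_camel_case
function toSnakeCase(identifier) {
  return identifier.replace(/[A-Z]/g, (ch) => '_' + ch.toLowerCase());
}

// Hypothetical helper: rewrite every camel-cased identifier in a source string.
function convert(source) {
  return source.replace(camelCased, toSnakeCase);
}

console.log(convert('Foo._someCamelCasedString.bar'));
// Foo._some_camel_cased_string.bar
console.log(convert('Foo[_someCamelCasedString].bar'));
// Foo[_some_camel_cased_string].bar
console.log(convert('$someCamelCasedString'));
// $some_camel_cased_string
```

The lookbehind is what keeps Foo (and the tail of any PascalCase name) from being picked up as a match.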
