difference between ruby regex and javascript regex - javascript

I made this regular expression: /.net.(\w*)/
I'm trying to capture the qa in a string like this:
https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG
I'm calling .replace on it like so: location.replace(/.net.(\w*)/, data.newName);
But when I run the code in JavaScript, it captures .net instead of capturing qa.
According to this online regex tool made for Ruby, it captures qa as intended:
http://rubular.com/r/ItrG7BRNRn
What's the difference between Javascript regexes and Ruby regexes, and how can I make my regex work as intended in javascript?
Edit:
I changed my code to this:
var str = `https://xxxxxxxxxx.cloudfront.net/qa/club`;
var re = /\.net\/([^\/]*)\//;
console.log(data2.files[i].location.replace(re,'$1'+ "test"));
And instead of
https://dm7svtk8jb00c.cloudfront.net/test/club
I get this:
https://dm7svtk8jb00c.cloudfrontqatestclub
If I remove the $1 I get https://dm7svtk8jb00c.cloudfronttestclub, which is closer, but I want to keep the slashes.

This would be a better regex:
/\.net\/([^\/]*)\//
Remember that an unescaped . will match any character, not only the period character. To match a literal period you need to escape it with a leading backslash: \.
Also, \w will only match numbers, letters and underscores. You could quite legitimately have a dash in that part of the URL. Therefore you're far better off matching anything that isn't a forward slash.
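As a rough sketch of how that kind of pattern could be applied to the case in the question's edit (replacing the segment while keeping the slashes), one option is to capture the ".net/" prefix and put it back in the replacement. The url and newName values below are stand-ins for the asker's data, not part of the original code.

```javascript
// Stand-in values; in the question these come from the file location and data.newName.
const url = 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG';
const newName = 'test';

// Capture ".net/" so it can be restored, then match the segment up to the next slash.
// The trailing slash is not part of the match, so it stays where it is.
const result = url.replace(/(\.net\/)[^\/]*/, '$1' + newName);

console.log(result); // https://xxxxxx.cloudfront.net/test/club/Slide1.PNG
```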

I am not sure how Ruby works, but JavaScript replace will not just replace the capture group, it replaces the whole matched string. By adding another capture group, you can use $1 to add back in the string you want to keep.
...replace(/(.net.)(\w*)/, "$1" + data.newName);
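To see the difference between the whole match and the capture group, here is a small sketch using the question's original pattern. The variable names are only for illustration.

```javascript
const url = 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG';
const m = url.match(/.net.(\w*)/);

console.log(m[0]); // .net/qa  (the whole match)
console.log(m[1]); // qa       (capture group 1)

// replace() swaps out m[0], not m[1], so ".net/" disappears as well.
const replaced = url.replace(/.net.(\w*)/, 'test');
console.log(replaced); // https://xxxxxx.cloudfronttest/club/Slide1.PNG
```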

You have to do that like this:
location.replace(/(\.net.)(\w*)/, '$1' + data.newName)
replace replaces the whole matched substring, not a particular group. Ruby works exactly in the same way:
ruby -e "puts 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG'.sub(/.net.(\w*)/, '##')"
https://xxxxxx.cloudfront##/club/Slide1.PNG
ruby -e "puts 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG'.sub(/(.net.)(\w*)/, '\\1' + '##')"
https://xxxxxx.cloudfront.net/##/club/Slide1.PNG

There's no difference (at least with the pattern you've provided). In both cases, the expression matches ".net/qa", with qa being the first capture group within the expression. Notice that even in your linked example the entire match is highlighted.
I'd recommend something like this:
location.replace(/(.net.)\w*/, "$1" + data.newName);
Or this, to be a bit safer:
location.replace(/(.net.)\w*/, function(m, a) { return a + data.newName; });
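One way to read "a bit safer": with a string replacement, any $ sequences inside the new name are interpreted as replacement patterns, while a function's return value is inserted literally. The sketch below assumes a hypothetical new name containing "$&" and uses a tightened version of the pattern; neither comes from the original answer.

```javascript
const url = 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG';
const newName = 'cost$&more'; // hypothetical name that happens to contain "$&"

// String replacement: "$&" is expanded to the whole match (".net/qa").
const viaString = url.replace(/(\.net\/)[^\/]*/, '$1' + newName);

// Function replacement: the returned string is used as-is.
const viaFunction = url.replace(/(\.net\/)[^\/]*/, (match, prefix) => prefix + newName);

console.log(viaString);   // https://xxxxxx.cloudfront.net/cost.net/qamore/club/Slide1.PNG
console.log(viaFunction); // https://xxxxxx.cloudfront.net/cost$&more/club/Slide1.PNG
```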

It's not so much a difference between JavaScript's and Ruby's implementations of regular expressions; it's your pattern that needs a bit of work. It's not tight enough.
You can use something like /\.net\/([^\/]+)/, which you can see in action at Rubular.
That returns the characters delimited by / following .net.
Regex patterns are very powerful, but they're also fraught with dangerous side-effects that open up big holes easily, causing false-positives, which can ruin results unexpectedly. Until you know them well, start simply, and test them every imaginable way. And, once you think you know them well, keep doing that. Patterns in the code we write where I work are a particular hot-button for me, and I'm always finding holes in them in our code reviews and requiring them to be tightened until they do exactly what the developer meant, not what they thought they meant.
While the pattern above works, I'd probably do it a bit differently in Ruby. Using the tools made for the job:
require 'uri'
URL = 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG'
uri = URI.parse(URL)
path = uri.path # => "/qa/club/Slide1.PNG"
path.split('/')[1] # => "qa"
Or, more succinctly:
URI.parse(URL).path.split('/')[1] # => "qa"
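On the JavaScript side, the same idea can be sketched with the built-in URL class instead of a regex. This is an illustration rather than part of the original answer, and the replacement segment 'test' is a placeholder.

```javascript
const url = 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG';

// Split the path into segments; the first element is empty because the path starts with "/".
const parsed = new URL(url);
const segments = parsed.pathname.split('/');
const firstSegment = segments[1];
console.log(firstSegment); // qa

// Replace only that segment and rebuild the URL.
segments[1] = 'test';
parsed.pathname = segments.join('/');
console.log(parsed.href); // https://xxxxxx.cloudfront.net/test/club/Slide1.PNG
```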

Related

Invalid regular expression in javascript

I'm trying to find out if a string contains css code with this expression:
var pattern = new RegExp('\s(?[a-zA-Z-]+)\s[:]{1}\s*(?[a-zA-Z0-9\s.#]+)[;]{1}');
But I get "invalid regular expression" error on the line above...
What's wrong with it?
found the regex here: http://www.catswhocode.com/blog/10-regular-expressions-for-efficient-web-development
It's for PHP but it should work in javascript too, right?
What are the ? at the start of the two [a-zA-Z-] blocks for? They look wrong to me.
The ? is unfortunately somewhat overloaded in regexp syntax. It can have three different meanings that I know of, and none of them match what I see in your example.
Also, your \s sequences need the backslash escaped because this is a string: they should look like \\s. To avoid the escaping, just use the /.../ syntax instead of new RegExp("...").
That said, even that is insufficient - the regexp still produces an Invalid Group error in Chrome, probably related to the {1} sequences.
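A short sketch of the escaping point, using a simplified pattern rather than the one from the question: inside a string literal, "\s" is just the letter s, so the backslash has to be doubled to reach the regex engine.

```javascript
// In a string literal the unknown escape "\s" collapses to a plain "s".
console.log('\s' === 's'); // true

const fromString = new RegExp('\\s[a-z]+'); // backslash doubled inside the string
const fromLiteral = /\s[a-z]+/;             // regex literal, no doubling needed
const lost = new RegExp('\s[a-z]+');        // actually compiles to /s[a-z]+/

console.log(fromString.source === fromLiteral.source); // true
console.log(fromString.test(' abc')); // true
console.log(lost.test(' abc'));       // false
```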
The ?'s are messing it up. I'm not sure what they are for.
/\s[a-zA-Z\-]+\s*:\s*[a-zA-Z0-9\s.#]+;/
worked for me (as far as compiling. I didn't test to see if it properly detected a CSS string).
Replace the quotes with / (slashes):
var pattern = /\s([a-zA-Z-]+)\s[:]{1}\s*([a-zA-Z0-9\s.#]+)[;]{1}/;
You don't need the new RegExp() part either, which is why it's been removed. Instead of quotes, which denote a string, JavaScript uses slashes /.../ to denote a regular expression literal, which isn't a normal string.
That regular expression is very bad and I would avoid its source in the future. That said, I cleaned it up a bit and got the following result:
var pattern = /\s(?:[a-zA-Z-]+)\s*:\s*(?:[^;\n\r]+);/;
This matches something that looks like CSS, for example:
background-color: red;
Here's the fiddle to prove it, though I'd recommend finding a different solution to your problem. This is a very simple regex and it's not safe to say that it is reliable.
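A few quick checks of that pattern, as a sketch with made-up sample strings. They illustrate both the match and the reliability caveat: the leading \s means a declaration at the very start of the string is missed, and ordinary text with a colon and a semicolon is accepted.

```javascript
const pattern = /\s(?:[a-zA-Z-]+)\s*:\s*(?:[^;\n\r]+);/;

const inRule = pattern.test('p { background-color: red; }'); // declaration preceded by whitespace
const plain = pattern.test('just some plain text');          // no colon or semicolon at all
const atStart = pattern.test('background-color: red;');      // nothing before the property for \s to match
const notCss = pattern.test('Note: time: 5pm;');              // not CSS, but has the same shape

console.log(inRule);  // true
console.log(plain);   // false
console.log(atStart); // false
console.log(notCss);  // true
```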

Javascript regexp lets undesirable characters

I'm using a RegExp in my project, but somehow some undesirable characters are getting through.
My RegExp looks like this:
new RegExp("[א-ת,A-z,',','(',')','.','-',''']");
It is supposed to reject characters like \ or [],
but let me use one or more of (, ), -, letters, etc.
Unfortunately that doesn't happen.
Which pattern accepts the desirable characters and rejects the undesirable ones?
Thanks for your help.
Well your regular expression just says to match one "good" character (and incorrectly at that).
I think something closer to this would be what you want, though I'm not sure about the higher-page Unicode characters:
var regexp = /^[א-תA-Za-z,()\-']*$/;
If the alefbet part doesn't work (it looks backwards to me, but I guess that's kind of a conundrum :-), try:
var regexp = /^[\u05D0-\u05EAA-Za-z,()\-']*$/;
Might be good to tack an "i" (ignore case) modifier on the end too:
var regexp = /^[\u05D0-\u05EAA-Za-z,()\-']*$/i;
This also does not handle the various diacritical marks; I don't know if you need those matched or not.
First of all, you don't need all those single quotes and commas. Second, you want A-Za-z, not A-z. The latter includes the ASCII characters between "Z" and "a". Also, because the pattern is written as a string, the backslash in \s has to be doubled.
var re = new RegExp("[א-תA-Za-z,().'\\s-]");
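Putting the two answers together, a minimal sketch of whole-string validation might look like the following. It assumes the Hebrew letters of interest are U+05D0 (alef) through U+05EA (tav) and that whitespace should be allowed; the sample inputs are invented.

```javascript
// Anchored with ^ and $ so every character must be in the allowed set.
const allowed = /^[\u05D0-\u05EAA-Za-z,().'\s-]*$/;

const hebrewOk = allowed.test('\u05E9\u05DC\u05D5\u05DD (abc)'); // "shalom (abc)"
const backslash = allowed.test('a\\b');                          // contains a backslash
const brackets = allowed.test('a[b]');                           // contains square brackets

// Why A-z is too wide: it also covers [ \ ] ^ _ and the backtick.
const tooWide = /^[A-z]*$/.test('[\\]^_`');

console.log(hebrewOk);  // true
console.log(backslash); // false
console.log(brackets);  // false
console.log(tooWide);   // true
```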

Regex to match all instances not inside quotes

From this q/a, I deduced that matching all instances of a given regex not inside quotes is impossible. That is, it can't match escaped quotes (ex: "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.
If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come up with any elegant solutions that would work in most, if not all, cases.
Specifically, I just need the alternative to work with .split() and .replace() methods, but if it could be more generalized, that would be the best.
For Example:
An input string of: +bar+baz"not+or\"+or+\"this+"foo+bar+
replacing + with #, not inside quotes, would return: #bar#baz"not+or\"+or+\"this+"foo#bar#
Actually, you can match all instances of a regex not inside quotes for any string where each opening quote is closed again. Say, as in your example above, you want to match \+.
The key observation here is, that a word is outside quotes if there are an even number of quotes following it. This can be modeled as a look-ahead assertion:
\+(?=([^"]*"[^"]*")*[^"]*$)
Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. After you arrive at either a backslash or a quote, you need to ignore the next character if you encountered a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at
\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
I admit it is a little cryptic. =)
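To check the combined look-ahead against the example from the question, here is a small sketch (the variable names are mine, not from the answer). Since the pattern contains capture groups, it is shown with .replace() only; with .split() the captured text would also appear in the result array.

```javascript
// The question's input; backslashes are doubled because this is a string literal.
const input = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';

// A "+" that is followed only by balanced quoted strings (escaped quotes skipped).
const outsideQuotes = /\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

const output = input.replace(outsideQuotes, '#');
console.log(output); // #bar#baz"not+or\"+or+\"this+"foo#bar#
```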
Azmisov, resurrecting this question because you said you were looking for any efficient alternative that could be used in JavaScript and any elegant solutions that would work in most, if not all, cases.
There happens to be a simple, general solution that wasn't mentioned.
Compared with alternatives, the regex for this solution is amazingly simple:
"[^"]+"|(\+)
The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture all the + that were not neutralized into Group 1, and the replace function examines Group 1. Here is full working code:
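<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
var replaced = subject.replace(regex, function(m, group1) {
    // A quoted string matched without setting Group 1: return it unchanged.
    if (!group1) return m;
    // A + outside quotes was captured in Group 1: replace it.
    else return "#";
});
document.write(replaced);
</script>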
Online demo
You can use the same principle to match or split. See the question and article in the reference, which will also point you to code samples.
Hope this gives you a different idea of a very general way to do this. :)
What about Empty Strings?
The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:
"[^"]*"|(\+)
See demo.
What about Escaped Quotes?
Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex be refined to your needs, you can also add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.
Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*"
The resulting expression has three branches:
\\" to match and ignore
"(?:\\"|[^"])*" to match and ignore
(\+) to match, capture and handle
Note that in other regex flavors, we could do this job more easily with lookbehind, which JavaScript did not support when this was written (lookbehind assertions were added in ES2018).
The full regex becomes:
\\"|"(?:\\"|[^"])*"|(\+)
See regex demo and full script.
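As a sketch of how the three-branch regex behaves on the input from the question (which does contain escaped quotes), the same replace callback can be reused. The function and variable names here are illustrative only.

```javascript
// The question's input; backslashes are doubled because this is a string literal.
const subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
const regex = /\\"|"(?:\\"|[^"])*"|(\+)/g;

function replaceOutsideQuotes(text, replacement) {
  return text.replace(regex, (match, group1) =>
    // Group 1 is only set by the (\+) branch; the other branches are returned unchanged.
    group1 === undefined ? match : replacement
  );
}

const replaced = replaceOutsideQuotes(subject, '#');
console.log(replaced); // #bar#baz"not+or\"+or+\"this+"foo#bar#
```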
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
You can do it in three steps.
Use a regex global replace to extract all string body contents into a side-table.
Do your comma translation
Use a regex global replace to swap the string bodies back
Code below
// Step 1
var sideTable = [];
myString = myString.replace(
/"(?:[^"\\]|\\.)*"/g,
function (_) {
var index = sideTable.length;
sideTable[index] = _;
return '"' + index + '"';
});
// Step 2, replace commas with newlines
myString = myString.replace(/,/g, "\n");
// Step 3, swap the string bodies back
myString = myString.replace(/"(\d+)"/g,
function (_, index) {
return sideTable[index];
});
If you run that after setting
myString = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';
you should get
{:a "ab,cd, efg"
:b "ab,def, egf,"
:c "Conjecture"}
It works, because after step 1,
myString = '{:a "0", :b "1", :c "2"}'
sideTable = ["ab,cd, efg", "ab,def, egf,", "Conjecture"];
so the only commas in myString are outside strings. Step 2, then turns commas into newlines:
myString = '{:a "0"\n :b "1"\n :c "2"}'
Finally we replace the strings that only contain numbers with their original content.
Although the answer by zx81 seems to be the cleanest and best-performing one, it needs these fixes to correctly handle the escaped quotes:
var subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
and
var regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
Also use the already mentioned "group1 === undefined" or "!group1" check.
The second fix (the regex) in particular seems important in order to take everything asked in the original question into account.
It should be mentioned, though, that this method implicitly requires the string to have no escaped quotes outside of unescaped quote pairs.

Breaking a String into Chunks based on Pattern

I have one string, that looks like this:
a[abcdefghi,2,3,jklmnopqr]
The beginning "a" is fixed and non-changing, but the content within the brackets varies and can follow a pattern. It will always be an alphabetical string, possibly followed by numbers separated by commas, or by more strings and/or numbers.
I'd like to be able to break it into chunks, each consisting of a string and any numbers that follow it, until the "]" or another string is met.
Probably best explained through examples and expected ideal results:
a[abcdefghi] -> "abcdefghi"
a[abcdefghi,2] -> "abcdefghi,2"
a[abcdefghi,2,3,jklmnopqr] -> "abcdefghi,2,3" and "jklmnopqr"
a[abcdefghi,2,3,jklmnopqr,stuvwxyz] -> "abcdefghi,2,3" and "jklmnopqr" and "stuvwxyz"
a[abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz] -> "abcdefghi,2,3" and "jklmnopqr,1,9" and "stuvwxyz"
a[abcdefghi,1,jklmnopqr,2,stuvwxyz,3,4] -> "abcdefghi,1" and "jklmnopqr,2" and "stuvwxyz,3,4"
Ideally a malformed string would be partially caught (but this is a nice extra):
a[2,3,jklmnopqr,1,9,stuvwxyz] -> "jklmnopqr,1,9" and "stuvwxyz"
I'm using JavaScript and I realize a regex won't bring me all the way to the solution I'd like, but it could be a big help. The alternative is to do a lot of manual string parsing, which I can do, but it doesn't seem like the best answer.
Advice, tips appreciated.
UPDATE: Yes, I did mean alphabetical (A-Za-z) instead of alphanumeric. Edited to reflect that. Thanks for letting me know.
You'd probably want to do this in 2 steps. First, match against:
a\[([^[\]]*)\]
and extract group 1. That'll be the stuff in the square brackets.
Next, repeatedly match against:
[a-z]+(,[0-9]+)*
That'll match things like "abcdefghi,2,3". After the first match you'll need to see if the next character is a comma and if so skip over it. (BTW: if you really meant alphanumeric rather than alphabetic like your examples, use [a-z0-9]*[a-z][a-z0-9]* instead of [a-z]+.)
Alternatively, split the string on commas and reassemble into your word with number groups.
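The two steps above can be sketched as follows. The function name is hypothetical, and the i flag is an assumption based on the question's A-Za-z update. A global match() does the comma skipping automatically, because the separating comma simply isn't part of any match.

```javascript
function chunkBracketed(input) {
  // Step 1: pull out whatever sits between "a[" and "]".
  const outer = /a\[([^[\]]*)\]/.exec(input);
  if (!outer) return [];
  // Step 2: collect each word together with the numbers that follow it.
  return outer[1].match(/[a-z]+(?:,[0-9]+)*/gi) || [];
}

console.log(chunkBracketed('a[abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz]'));
// [ 'abcdefghi,2,3', 'jklmnopqr,1,9', 'stuvwxyz' ]
console.log(chunkBracketed('a[2,3,jklmnopqr,1,9,stuvwxyz]'));
// [ 'jklmnopqr,1,9', 'stuvwxyz' ]
```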
Why wouldn't a regex bring you all the way to a solution?
The following regex works against the given data, but it makes a few assumptions (at least two alphas followed by comma separated single digits).
([a-z]{2,}(?:,\d)*)
Example:
re = new RegExp('[a-z]{2,}(?:,\\d)*', 'g')
matches = "a[abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz]".match(re)
Assuming you can easily break out the string between the brackets, something like this might be what you're after:
> re = new RegExp('[a-z]+(?:,\\d)*(?:,?)', 'gi')
> while (match = re.exec("abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz")) { print(match[0]) }
abcdefghi,2,3,
jklmnopqr,1,9,
stuvwxyz
This has the advantage of working partially in your malformed case:
> while (match = re.exec("2,3,jklmnopqr,1,9,stuvwxyz")) { print(match[0]) }
jklmnopqr,1,9,
stuvwxyz
The first character class [a-z] can be modified if you meant for it to be truly alphanumeric.

Need a regex for acceptable file names

I'm using Fancy Upload 3 and onSelect of a file I need to run a check to make sure the user doesn't have any bad characters in the filename. I'm currently getting people uploading files with hieroglyphics and such in the names.
What I need is to check if the filename only contains:
A-Z
a-z
0-9
_ (underscore)
- (minus)
SPACE
ÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöü (as single and double byte)
Obviously you can see the difficult thing there. The non-english single and double byte chars.
I've seen this:
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]
And this:
[\x80-\xA5]
But neither of them fully cover the situation right.
Examples that should work:
fást.zip
abc.zip
ABC.zip
Über.zip
Examples that should NOT work:
∑∑ø∆.zip
¡wow!.zip
•§ªº¶.zip
The following is close, but I'm NO RegEx'pert, not even close.
var filenameReg = /^[A-Za-z0-9-_]|[\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF]+$/;
Thanks in advance.
Solution from Zafer mostly works, but it does not catch all of the other symbols, see below.
Uncaught:
¡£¢§¶ª«ø¨¥®´åß©¬æ÷µç
Caught:
™∞•–≠'"πˆ†∑œ∂ƒ˙∆˚…≥≤˜∫√≈Ω
Regex:
var filenameReg = /^([A-Za-z0-9\-_. ]|[\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF])+$/;
Alternation between two character classes (ie. [abc]|[def]) can be simplified to a single character class ([abcdef]) -- the first can be read as "(a or b or c) OR (d or e or f)"; the second as "(a or b or c or d or e or f)". What probably tripped up your regular expression is the unescaped dash in the first class -- if you want a literal dash, it should be the last character in the class.
So we'll modify your expression to get it working:
var filenameReg = /^[A-Za-z0-9_\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF-]+$/;
The problem now is that you're not accounting for the file extension, but that is an easy modification (assuming you're always getting .zip files):
var filenameReg = /^[A-Za-z0-9_\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF-]+\.zip$/;
Replace zip with another pattern if the extension differs.
It looks like it is the character ranges that are causing the problem, because they include some unallowable characters in between. Since you already have the list of allowable characters, the best thing would be to just use that directly:
var filenameReg = /^[A-Za-z0-9_\-\ ÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöü]+$/;
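A minimal sketch of that whitelist approach, checked against the examples from the question. Two things here are my assumptions rather than part of the answer: the dot is added to the class so that the ".zip" extension passes, and both the accented list and the input are normalized to NFC so that decomposed accents (a base letter plus a combining mark) are treated like their single-character forms.

```javascript
// The accented characters listed in the question, normalized to composed form.
const accented = 'ÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöü'.normalize('NFC');

// Letters, digits, underscore, minus, dot, space, plus the accented list.
const filenameReg = new RegExp('^[A-Za-z0-9_\\-. ' + accented + ']+$');

function isAllowedFilename(name) {
  return filenameReg.test(name.normalize('NFC'));
}

for (const name of ['fást.zip', 'abc.zip', 'ABC.zip', 'Über.zip', '∑∑ø∆.zip', '¡wow!.zip', '•§ªº¶.zip']) {
  console.log(name, isAllowedFilename(name));
}
```

With this class the first four names pass and the last three are rejected.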
The following should work:
var filenameReg = /^([A-Za-z0-9\-_. ]|[\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF])+$/;
I've escaped the - with a \ and grouped the two expressions; otherwise the + sign doesn't apply to the first expression.
EDIT 1: I've also put . in the expression.
We have different rules for different platforms. But I think you mean long file names in Windows. For that you can use the following RegEx:
var longFilenames = #"^[^\./:*\?\""<>\|]{1}[^\/:*\?\""<>\|]{0,254}$";
NOTE: Instead of saying which characters are allowed, you say which ones are not allowed!
But keep in mind that this is not a 100% complete RegEx. If you really want to make it complete, you have to add exceptions for reserved names as well.
You can find more information about filename rules here:
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx
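The pattern above is not a valid JavaScript literal as written; the doubled quotes suggest it was taken from a C#-style string, which is an assumption on my part. A rough JavaScript translation of the same blacklist might look like this. It mirrors the original character lists as they stand and does not handle reserved names.

```javascript
// First character: none of . / : * ? " < > |
// Remaining characters (up to 254 more): none of / : * ? " < > |
const longFilename = /^[^.\/:*?"<>|][^\/:*?"<>|]{0,254}$/;

console.log(longFilename.test('report.txt'));     // true
console.log(longFilename.test('.hidden'));        // false (starts with a dot)
console.log(longFilename.test('what?.zip'));      // false (contains ?)
console.log(longFilename.test('x'.repeat(256)));  // false (longer than 255 characters)
```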
