I am developing a small character encoder generator where the user inputs their text and, on the click of a button, it outputs the encoded version.
I've defined an object of the characters that need to be encoded like so:
map = {
'©' : '&copy;',
'&' : '&amp;'
};
And here is the loop that gets the values from the map and replaces them:
Object.keys(map).forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
I am then simply outputting the result to a textarea. This all works fine; however, the problem I'm facing is this.
© is replaced with &copy;, but the & symbol at the beginning of that entity is then converted to &amp;, so it ends up being &amp;copy;.
I see why this is happening, but I'm not sure how to ensure that & is not replaced inside strings that are already character-encoded.
Here is a JSFiddle for a live preview of what I mean:
http://jsfiddle.net/4m3nw/1/
Any help would be much appreciated
Prelude: Apart from regex, an idea worth considering is something like this JS function that already handles html entities. Now, on to the regex question.
HTML Special Characters, Negative Lookahead
In HTML, special characters can look not only like &copy; but also like &#8212;, and they can have upper-case characters.
To replace ampersands that are not immediately followed by a hash or word characters and a semicolon, you can use something like this:
&(?!(?:#[0-9]+|[a-z]+);)
See the demo.
Make sure to use the i flag to activate case-insensitive mode
& matches the literal ampersand
The negative lookahead (?!(?:#[0-9]+|[a-z]+);) asserts that it is not followed by...
(?:#[0-9]+|[a-z]+) a hash and digits, | OR letters...
then a semicolon.
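As a minimal sketch of how this pattern could be plugged into the encoder, the lookahead lets you encode only the ampersands that do not already start an entity. The function name encodeBareAmpersands and the sample text are made up for illustration:

```javascript
// Encode only "bare" ampersands; anything that already looks like
// an entity (&name; or &#123;) is left untouched.
function encodeBareAmpersands(text) {
  return String(text).replace(/&(?!(?:#[0-9]+|[a-z]+);)/gi, '&amp;');
}

console.log(encodeBareAmpersands('Tom & Jerry &copy; 2014 &#8212; R&D'));
// Tom &amp; Jerry &copy; 2014 &#8212; R&amp;D
```

Note that [a-z]+ only covers purely alphabetic entity names, so a name containing digits such as &frac12; would still have its ampersand encoded by this sketch.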
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
The problem is that, since you process the same string on every pass, you also replace the & in &copy;. If you re-order your map then that seemingly solves the problem. However, according to the ECMAScript specifications, this is not a given, so you would be relying on implementation details of the ECMAScript engine used.
What you can do to make sure it will always work is to swap the keys so that & is always processed first:
map = {
'©' : '&copy;',
'&' : '&amp;'
};
var keys = Object.keys(map);
keys[keys.indexOf('&')] = keys[0];
keys[0] = '&';
keys.forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
Obviously you need to add checks for the &'s existence if it isn't always there.
jsFiddle Demo.
Probably the simplest code change is to reorder your map by putting the ampersand on top.
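To see why the order matters, here is a small sketch that runs the same replacement loop with the keys in both orders. The encode helper and its explicit keys parameter are assumptions added for the demonstration; they are not part of the original code:

```javascript
var map = { '©': '&copy;', '&': '&amp;' };

// Apply the replacements in the order given by `keys`.
function encode(raw, keys) {
  keys.forEach(function (ch) {
    // Escape regex metacharacters in the key before building the RegExp.
    var escaped = ch.replace(/[.?*+^$[\]\\(){}|-]/g, '\\$&');
    raw = raw.replace(new RegExp(escaped, 'g'), map[ch]);
  });
  return raw;
}

console.log(encode('© & co', ['©', '&'])); // &amp;copy; &amp; co  (double-encoded)
console.log(encode('© & co', ['&', '©'])); // &copy; &amp; co      (ampersand first)
```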
I am passing a URL to a block of code in which I need to insert a new element into the regex. Pretty sure the regex is valid and the code seems right but no matter what I can't seem to execute the match for regex!
//** Incoming url's
//** url e.g. api/223344
//** api/11aa/page/2017
//** Need to match to the following
//** dir/api/12ab/page/1999
//** Hence the need to add dir at the front
var url = req.url;
//** pass in: /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var re = myregex.toString();
//** Insert dir into regex: /^dir\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var regVar = re.substr(0, 2) + 'dir' + re.substr(2);
var matchedData = url.match(regVar);
matchedData === null ? console.log('NO') : console.log('Yay');
I hope I am just missing the obvious, but can anyone see why it never matches and always logs NO?
Thanks
Let's break down your regex
^\/api\/ this matches the beginning of a string, and then exactly the string "/api/"
([a-zA-Z0-9-_~ %]+) this is a capturing group: it captures any of the characters inside those brackets, with the + meaning one or more, so for example, this section will match abAB25-_ %
(?:\/page\/([a-zA-Z0-9-_~ %]+)) this groups multiple tokens together as well, but does not create a capturing group like above (the ?: makes it non-capturing). You are first matching a string exactly like "/page/" followed by a group exactly like the one in the paragraph above (that matches a-z, A-Z, 0-9, etc.)
?$ is at the end; the ? makes the preceding group optional (zero or one occurrence), and the $ matches the end of the string
This regex will match this string, for example: /api/abAB25-_ %/page/abAB25-_ %
You may be able to take advantage of capturing groups, however, and use something like this instead to get similar results: ^\/api\/([a-zA-Z0-9-_~ %]+)\/page\/\1?$. Here, we are using \1 to reference that first capturing group and match exactly the same tokens it is matching. EDIT: actually, this probably won't work, since the text after /api/ and the text after /page/ will most likely be different, carrying on...
Afterwards, you are adding "dir" to the beginning of your search, so you can now match something like this: dir/api/abAB25-_ %/page/abAB25-_ %
You have also now converted the regex to a string, so, like Crayon Violent pointed out in their comment, this will break your expected functionality: .match() treats the whole string, slashes included, as the pattern. You can fix this by starting from .source (the pattern text without the delimiters) instead of toString(), and building a new RegExp from the result: var re = myregex.source; var regVar = new RegExp(re.substr(0, 1) + 'dir' + re.substr(1)); var matchedData = url.match(regVar); https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/source
Now you can properly match a string like this: dir/api/11aa/page/2017. See this example: https://repl.it/Mj8h
As mentioned by Crayon Violent in the comments, it seems you're passing a String rather than a regular expression to the .match() function. Maybe try the following, which also strips the leading and trailing / that toString() added:
url.match(new RegExp(regVar.slice(1, -1), "i"));
to convert the string to a regular expression. The "i" is for ignore case; I don't know whether that's what you want. Learn more here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
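Putting both answers together, one way to sketch the fix is to take the pattern text from .source (which has no surrounding slashes), splice the prefix in after the leading ^, and build a new RegExp. The helper name prefixPattern is hypothetical, and the sketch assumes the incoming pattern always starts with ^ and that the prefix contains no regex metacharacters:

```javascript
// Hypothetical helper: insert `prefix` right after the leading ^ of a regex.
function prefixPattern(regex, prefix) {
  var source = regex.source; // pattern text without the / delimiters
  if (source.charAt(0) !== '^') {
    throw new Error('expected a pattern anchored with ^');
  }
  // Reuse the original flags so behaviour is otherwise unchanged.
  return new RegExp('^' + prefix + source.slice(1), regex.flags);
}

var myregex = /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/;
var prefixed = prefixPattern(myregex, 'dir');

var matchedData = 'dir/api/11aa/page/2017'.match(prefixed);
console.log(matchedData === null ? 'NO' : 'Yay'); // Yay
console.log(matchedData[1], matchedData[2]);      // 11aa 2017
```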
With the code below, I have converted the following names into URLs such as
Love & Relationships to http://domain.org/love-relationships
Career & Guidance to http://domain.org/career-guidance
filter('ampToDash', function(){
return function(text){
return text ? String(text).replace(/ & /g,'-'): '';
};
}).filter('dashToAmp', function(){
return function(text){
return text ? String(text).replace(/-/g,' & '): '';
};
})
However, I have a new set of names and I can't figure out how to do both at the same time.
Being Human to http://domain.org/being-human
Competitive Exams to http://domain.org/competitive-exams
filter('ampToDash', function(){
return function(text){
return text ? String(text).replace(/ /g,'-'): '';
};
}).filter('dashToAmp', function(){
return function(text){
return text ? String(text).replace(/-/g,' '): '';
};
})
How do I combine both the regex codes so it can work hand in hand?
You may also want to extend your replacement criteria to cover all "non-word" characters, instead of just accounting for the ones you're currently aware of (& and space). This would be more future-proof, and perhaps easier to reason about:
String(text).replace(/\W+/g, '-')
(\W+ means any sequence of non-word characters.)
Example:
'Jack & Jill went up the #$%#! hill'.replace(/\W+/g, '-')
Yields:
Jack-Jill-went-up-the-hill
And because there's loss of information (i.e. you don't know what exactly leads to a '-' by looking at the transformed string), a way you can find the original string is to simply store it and look up by the transformed string. To elaborate: You're probably going to be looking up some document from this new string (a "slug", as others pointed out). Store the slug along with the document and just look up the document (and its original title) from your database.
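A minimal sketch of that "store the slug and look it up" idea follows. The slugify helper, the lowercasing step (added because the example URLs in the question are lowercase), and the in-memory titleBySlug table standing in for a database are all assumptions:

```javascript
// Turn a title into a URL slug: lowercase, runs of non-word characters
// become a single '-', and leading/trailing dashes are trimmed.
function slugify(text) {
  return String(text).toLowerCase().replace(/\W+/g, '-').replace(/^-+|-+$/g, '');
}

// Hypothetical in-memory stand-in for a database, keyed by slug.
var titles = ['Love & Relationships', 'Career & Guidance', 'Being Human', 'Competitive Exams'];
var titleBySlug = {};
titles.forEach(function (title) {
  titleBySlug[slugify(title)] = title;
});

console.log(slugify('Love & Relationships'));  // love-relationships
console.log(titleBySlug['competitive-exams']); // Competitive Exams
```

The original title is recovered by lookup, not by trying to reverse the regex.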
It looks like you simply want to change any ampersand surrounded by white-space, or any run of white-space on its own, to a single hyphen. If so, you could just use the following expression:
// Replace an ampersand surrounded by whitespace, or any run of whitespace, with a hyphen
return text ? String(text).replace(/(\s+&\s+|\s+)/g, '-') : '';
Example
var input = ['Love & Relationships', 'Career & Guidance', 'Being Human', 'Competitive Exams'];
for (var i in input) {
var phrase = input[i];
console.log(phrase + ' -> ' + phrase.replace(/(\s+&\s+|\s+)/g, '-'));
}
I think you are looking for a lib that converts a string into a slug.
You can do this manually, but you'll probably have a hard time covering other edge cases.
I would suggest using something like:
https://github.com/dodo/node-slug
Or check out this gist if you really want to stay with the regex way : https://gist.github.com/mathewbyrne/1280286
You have two separate problems:
1. how to 'slugify' a string
2. how to undo / reverse the slugify.
To answer 1: A generic slugify method would be something like: text.replace(/\W+/g, '-')
To answer 2: you can't. You have a function (ampToDash) that can produce the same output given different inputs. i.e. there is NO equivalent of dashToAmp any more.
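The loss of information is easy to demonstrate. In this sketch, ampToDash is rewritten as a plain function (outside any filter registration) using the generic \W+ replacement; the two sample titles are made up:

```javascript
// Generic slugify: every run of non-word characters becomes one dash.
function ampToDash(text) {
  return text ? String(text).replace(/\W+/g, '-') : '';
}

var a = ampToDash('Love & Relationships');
var b = ampToDash('Love Relationships');

console.log(a);       // Love-Relationships
console.log(b);       // Love-Relationships
console.log(a === b); // true: two different inputs, one output
```

Since both titles collapse to the same string, no function of the output alone can tell which one it came from.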
I have this regular expression
// Look for /en/ or /en-US/ or /en_US/ on the URL
var matches = req.url.match( /^\/([a-zA-Z]{2,3}([-_][a-zA-Z]{2})?)(\/|$)/ );
Now the above regular expression causes a problem with URLs such as:
http://mydomain.com/css/bootstrap.css
or
http://mydomain.com/js/jquery.js
because my regular expression strips off any 2-3 characters from A-Z or a-z.
My question is: how would I change this regular expression so that it does not strip off
js or img or css or ext
without impacting the original behaviour?
I'm not much of an expert on regular expressions :(
Negative lookahead?
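var matches = req.url.match(/^\/(?!(?:js|img|css|ext)(?:\/|$))([a-zA-Z]{2,3}([-_][a-zA-Z]{2})?)(\/|$)/ );
/ not followed by js, img, css or ext (the group inside the lookahead is non-capturing, so the original capture group numbers stay the same)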
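Here is a quick sketch for checking a negative lookahead of this kind against a few paths. The pattern below is one possible variant that excludes all four directory names from the question and uses a non-capturing group inside the lookahead; the localeOf helper and the sample paths are assumptions:

```javascript
// Leading slash, then NOT one of the asset directories, then a locale code.
var localeRe = /^\/(?!(?:js|img|css|ext)(?:\/|$))([a-zA-Z]{2,3}([-_][a-zA-Z]{2})?)(\/|$)/;

// Return the locale segment, or null when the path has none.
function localeOf(path) {
  var m = path.match(localeRe);
  return m ? m[1] : null;
}

console.log(localeOf('/en/home'));           // en
console.log(localeOf('/en-US/'));            // en-US
console.log(localeOf('/en_US'));             // en_US
console.log(localeOf('/css/bootstrap.css')); // null
console.log(localeOf('/js/jquery.js'));      // null
```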
First of all you have not defined what exactly you are searching for.
Define an array with lowercased common language codes (Common language codes)
This way you'll know what to look for.
After that, convert your url to lowercase and replace all '_' with '-' and search for every member of the array in the resulting string using indexOf().
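A rough sketch of this list-based approach follows. The LANGUAGE_CODES array is a short illustrative sample, not a complete list, and the detectLanguage helper is hypothetical. It also compares only the first path segment against the list, which is a slightly stricter variation on searching the whole string with indexOf():

```javascript
// Illustrative sample only; a real list would hold every supported code.
var LANGUAGE_CODES = ['en', 'en-us', 'en-gb', 'fr', 'de', 'es'];

// Normalise the path, then check whether its first segment is a known code.
function detectLanguage(path) {
  var normalized = String(path).toLowerCase().replace(/_/g, '-');
  var firstSegment = normalized.split('/')[1] || '';
  return LANGUAGE_CODES.indexOf(firstSegment) !== -1 ? firstSegment : null;
}

console.log(detectLanguage('/en_US/products'));    // en-us
console.log(detectLanguage('/fr'));                // fr
console.log(detectLanguage('/css/bootstrap.css')); // null
```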
Since you said you're using the regex to replace text, I changed it to a replace function. Also, you forced the regex to match the start of the string; I don't see how it would match anything with that. Anyway, here's my approach:
var result = req.url.replace(/\/([a-z]{2,3}([-_][a-z]{2})?)(?=\/|$)/i,
function(s,t){
switch(t){case"js":case"img":case"css":case"ext":return s;}
return "";
}
);
In the following input string:
{$foo}foo bar \\{$blah1}oh{$blah2} even more{$blah3} but not{$blarg}{$why_not_me}
I am trying to match all instances of {$SOMETHING_HERE} that are not preceded by an unescaped backslash.
Example:
I want it to match {$SOMETHING} but not \{$SOMETHING}.
But I do want it to match \\{$SOMETHING}
Attempts:
All of my attempts so far will match what I want except for tags right next to each other like {$SOMETHING}{$SOMETHING_ELSE}
Here is what I currently have:
var input = '{$foo}foo bar \\{$blah1}oh{$blah2} even more{$blah3} but not{$blarg}{$why_not_me}';
var results = input.match(/(?:[^\\]|^)\{\$[a-zA-Z_][a-zA-Z0-9_]*\}/g);
console.log(results);
Which outputs:
["{$foo}", "h{$blah2}", "e{$blah3}", "t{$blarg}"]
Goal
I want it to be :
["{$foo}", "{$blah2}", "{$blah3}", "{$blarg}", "{$why_not_me}"]
Question
Can anybody point me in the right direction?
The problem here is that you need a lookbehind, which JavaScript regexes did not support before ES2018 (and older engines still reject).
Basically you need "{$whatever} if it is preceded by a double backslash but not a single backslash", which is what the lookbehind does.
You can mimic simple cases of lookbehinds, but not sure if it will help in this example. Give it a go: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
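For engines that do implement ES2018 lookbehind, a sketch of the direct approach could look like this. It assumes that an even number of backslashes (including zero) before the brace means the tag is unescaped; the variable names are made up:

```javascript
// Requires ES2018 lookbehind support.
// The lookbehind accepts zero or more backslash PAIRS that are not
// themselves preceded by another backslash, i.e. an even run.
var tagRe = /(?<=(?<!\\)(?:\\\\)*)\{\$[a-zA-Z_][a-zA-Z0-9_]*\}/g;

var sample = '{$foo}foo bar \\{$blah1}oh{$blah2} even more\\\\{$blah3} but not{$blarg}{$why_not_me}';
var found = sample.match(tagRe);

console.log(found);
// [ '{$foo}', '{$blah2}', '{$blah3}', '{$blarg}', '{$why_not_me}' ]
```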
edit
Btw, I don't think you can do this a 'stupid way' either because if you have [^\\]\{ you'll match any character that is not a backslash before the brace. You really need the lookbehind to do this cleanly.
Otherwise you can do
(\\*{\$[a-zA-Z_][a-zA-Z0-9_]*\})
Then just count the number of backslashes in the resulting tokens.
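The counting idea can be sketched as follows, assuming that an even number of leading backslashes (including none) means the tag itself is not escaped. The function name findUnescapedTags is hypothetical:

```javascript
// Capture the run of backslashes (group 1) and the tag (group 2),
// then keep the tag only when the backslash count is even.
function findUnescapedTags(input) {
  var re = /(\\*)(\{\$[a-zA-Z_][a-zA-Z0-9_]*\})/g;
  var results = [];
  var m;
  while ((m = re.exec(input)) !== null) {
    if (m[1].length % 2 === 0) {
      results.push(m[2]);
    }
  }
  return results;
}

var input = '{$foo}foo bar \\{$blah1}oh{$blah2} even more\\\\{$blah3} but not{$blarg}{$why_not_me}';
console.log(findUnescapedTags(input));
// [ '{$foo}', '{$blah2}', '{$blah3}', '{$blarg}', '{$why_not_me}' ]
```

This works without lookbehind, so it also runs on older engines.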
When all else fails, split, join/replace the crap out of it.
Note: the first split/join is actually the cleanup portion. That kills \{<*>}
Also, I didn't account for the stuff inside the brackets since there's code for that already.
var input = '{$foo}foo bar \\{$blah1}oh{$blah2} even more\\\\{$blah3} but not{$blarg}{$why_not_me}';
console.log(input.split(/(?:[^\\])\\\{[^\}]*\}/).join('').replace(/\}[^\{]*\{/g,'},{').split(/,/));
This seems to do what I want:
var input = '{$foo}foo bar \\{$blah1}oh{$blah2} even more\\\\{$blah3} but not{$blarg}{$why_not_me}';
var results = [];
input.replace(/(\\*)\{\$[a-z_][a-z0-9_]*\}/g, function($0,$1){
$0 = $0.replace(/^\\\\/g,'');
var result = ($0.indexOf('\\') === 0 ? false : $0);
if(result) {
results.push(result);
}
})
console.log(results);
Which gives:
["{$foo}", "{$blah2}", "{$blah3}", "{$blarg}", "{$why_not_me}"]
I have this RegExp expression I found a couple of weeks ago
/([\r\n])|(?:\[([a-z\*]{1,16})(?:=([^\x00-\x1F"'\(\)<>\[\]]{1,256}))?\])|(?:\[\/([a-z]{1,16})\])/ig
And it's working to find the BBCode tags such as [url] and [code].
However, if I try [url="http://www.google.com"] it won't match. I'm not very good at RegExp and I can't figure out how to keep it valid while making the ="http://www.google.com" part optional.
This also fails for [color="red"], but I figure it is the same issue the url tag is having.
This part: [^\x00-\x1F"'\(\)<>\[\]] says that after the = there must not be a ". That means your regexp matches [url=http://stackoverflow.com]. If you want to have quotes you can simply put them around your capturing group:
/([\r\n])|(?:\[([a-z\*]{1,16})(?:="([^\x00-\x1F"'\(\)<>\[\]]{1,256})")?\])|(?:\[\/([a-z]{1,16})\])/gi
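If both the quoted and the unquoted form should be accepted, one possible variant is to make each quote optional with "?. This is a sketch, not a full BBCode parser: it would also accept a value with only one of the two quotes, and the openingTags helper is made up for the demonstration:

```javascript
// Same pattern, but with an optional quote on each side of the value.
var bbcodeRe = /([\r\n])|(?:\[([a-z\*]{1,16})(?:="?([^\x00-\x1F"'\(\)<>\[\]]{1,256})"?)?\])|(?:\[\/([a-z]{1,16})\])/ig;

// Collect every opening tag together with its (optional) attribute value.
function openingTags(text) {
  var found = [];
  text.replace(bbcodeRe, function (match, newline, tag, attr) {
    if (tag) {
      found.push({ tag: tag, attr: attr });
    }
    return match;
  });
  return found;
}

var tagsFound = openingTags('[url="http://www.google.com"]Google[/url] [color=red]hi[/color] [code]x[/code]');
console.log(tagsFound.map(function (t) { return t.tag + ':' + t.attr; }));
// [ 'url:http://www.google.com', 'color:red', 'code:undefined' ]
```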
I think you would benefit from explicitly enumerating all the tags you want to match, since it should allow matching the closing tag more specifically.
Here's some sample code:
var tags = [ 'url', 'code', 'b' ]; // add more tags
var regParts = tags.map(function (tag) {
return '(\\[' + tag + '(?:="[^"]*")?\\](?=.*?\\[\\/' + tag + '\\]))';
});
var re = new RegExp(regParts.join('|'), 'g');
You might notice that the regular expression is composed from a set of smaller ones, each representing a single tag with a possible attribute ((?:="[^"]*")?, see explanation below) of variable length, like [url="google.com"], and separated with the alternation operator |.
(="[^"]*")? means an = symbol, then a double quote, followed by any symbol other than double quote ([^"]) in any quantity, i.e. 0 or more, (*), followed by a closing quote. The final ? means that the whole group may not be present at all.
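To see the composed expression in action, here is the same construction applied to a made-up sample string. Because of the lookahead, an opening tag is only matched when its closing tag appears later on the same line:

```javascript
var tags = ['url', 'code', 'b'];

// One alternative per tag: opening tag, optional quoted attribute,
// and a lookahead requiring the matching closing tag further on.
var regParts = tags.map(function (tag) {
  return '(\\[' + tag + '(?:="[^"]*")?\\](?=.*?\\[\\/' + tag + '\\]))';
});
var re = new RegExp(regParts.join('|'), 'g');

var text = 'See [url="google.com"]Google[/url] and [b]bold[/b] and [code]x';
var matched = text.match(re);

console.log(matched);
// [ '[url="google.com"]', '[b]' ]  ([code] has no closing tag, so it is skipped)
```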