Converting ampersand (&) and blank space to a dash (-) in URLs using regex - javascript

With the code below, I have converted names such as the following into URLs:
Love & Relationships to http://domain.org/love-relationships
Career & Guidance to http://domain.org/career-guidance
filter('ampToDash', function(){
return function(text){
return text ? String(text).replace(/ & /g,'-'): '';
};
}).filter('dashToAmp', function(){
return function(text){
return text ? String(text).replace(/-/g,' & '): '';
};
})
However, I have a new set of names and I can't figure out how to do both at the same time.
Being Human to http://domain.org/being-human
Competitive Exams to http://domain.org/competitive-exams
filter('ampToDash', function(){
return function(text){
return text ? String(text).replace(/ /g,'-'): '';
};
}).filter('dashToAmp', function(){
return function(text){
return text ? String(text).replace(/-/g,' '): '';
};
})
How do I combine both the regex codes so it can work hand in hand?

You may also want to extend your replacement criteria to cover all "non-word" characters, instead of just accounting for the ones you're currently aware of (& and space). This would be more future-proof, and perhaps easier to reason with:
String(text).replace(/\W+/g, '-')
(\W+ means any sequence of non-word characters.)
Example:
'Jack & Jill went up the #$%#! hill'.replace(/\W+/g, '-')
Yields:
Jack-Jill-went-up-the-hill
And because there's loss of information (i.e. you don't know what exactly leads to a '-' by looking at the transformed string), a way you can find the original string is to simply store it and look up by the transformed string. To elaborate: You're probably going to be looking up some document from this new string (a "slug", as others pointed out). Store the slug along with the document and just look up the document (and its original title) from your database.
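A minimal sketch of the "store the slug and look it up" idea follows. The names slugify and titlesBySlug are made up for this illustration, the plain object stands in for whatever database is actually used, and the lowercasing step is an assumption based on the lowercase URLs in the question. Note that \W treats accented letters as non-word characters, so this only suits ASCII titles.

```javascript
// Hypothetical helper: collapse every run of non-word characters into one dash.
function slugify(title) {
  return String(title)
    .toLowerCase()                 // assumption: URLs are lowercase, as in the question
    .replace(/\W+/g, '-')          // any run of non-word characters becomes a single dash
    .replace(/^-+|-+$/g, '');      // trim dashes left at either end
}

// Stand-in for a database: remember the original title under its slug.
var titlesBySlug = {};
['Love & Relationships', 'Career & Guidance', 'Being Human', 'Competitive Exams']
  .forEach(function (title) {
    titlesBySlug[slugify(title)] = title;
  });

console.log(slugify('Love & Relationships')); // love-relationships
console.log(titlesBySlug['being-human']);     // Being Human
```

Looking the title up by slug replaces the dashToAmp filter entirely, so nothing has to be reversed.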

It looks like you simply want to replace either an ampersand surrounded by whitespace, or a plain run of whitespace, with a single hyphen. If so, you could use the following expression:
// Replace an ampersand surrounded by whitespace, or any run of whitespace, with a hyphen
return text ? String(text).replace(/(\s+&\s+|\s+)/g, '-') : '';
Example
var input = ['Love & Relationships', 'Career & Guidance', 'Being Human', 'Competitive Exams'];
for (var i = 0; i < input.length; i++) {
var phrase = input[i];
console.log(phrase + ' -> ' + phrase.replace(/(\s+&\s+|\s+)/g, '-'));
}

I think you are looking for a lib that converts a string into a slug.
You can do this manually, but you'll probably have a hard time covering other edge cases.
I would suggest using something like:
https://github.com/dodo/node-slug
Or check out this gist if you really want to stay with the regex way : https://gist.github.com/mathewbyrne/1280286

You have two separate problems:
how to 'slugify' a string
how to undo / reverse the slugify.
To answer 1: A generic slugify method would be something like: text.replace(/\W+/g, '-')
To answer 2: you can't. You have a function (ampToDash) that can produce the same output given different inputs. i.e. there is NO equivalent of dashToAmp any more.
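A small sketch of why the reverse direction is lost: two different titles produce the same slug, so no function can map that slug back to a single original. The helper name toSlug is made up for this illustration.

```javascript
// Hypothetical helper using the generic pattern from point 1.
function toSlug(text) {
  return String(text).replace(/\W+/g, '-');
}

var a = toSlug('Love & Relationships');
var b = toSlug('Love Relationships');

console.log(a);       // Love-Relationships
console.log(b);       // Love-Relationships
console.log(a === b); // true, so the '&' cannot be recovered from the slug alone
```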

Related

Use only one of the characters in regular expression javascript

I guess this should be something very easy, but I've been stuck on it for at least 2 hours, so I think it's better to ask the question here.
So, I've got a regular expression /&t=(\d*)$/g, and it works fine as long as the URL contains &t rather than ?t. I've tried different combinations such as /\?|&t=(\d*)$/g ; /\?t=(\d*)$|/&t=(\d*)$/g ; /(&|\?)t=(\d*)$/g and various others, but none of them gave the expected result, which is to match the URL part that /\?t=(\d*)$/g or /&t=(\d*)$/g would match (whichever form is entered in the input).
Thanks for the responses. I think I need to add some details here. I'm actually working on this piece of code:
var formValue = $.trim($("#v").val());
var formValueTime = /&t=(\d*)$/g.exec(formValue);
if (formValueTime && formValueTime.length > 1) {
formValueTime = parseInt(formValueTime[1], 10);
formValue = formValue.replace(/&t=\d*$/g, "");
}
and I want to get the t value whether the link is passed with &t or ?t, as in youtu.be/hTWKbfoikeg?t=82 or the similar youtu.be/hTWKbfoikeg&t=82
To replace, you may use
var formValue = "some?some=more&t=1234"; // $.trim($("#v").val());
var formValueTime;
formValue = formValue.replace(/[&?]t=(\d*)$/g, function($0,$1) {
formValueTime = parseInt($1,10);
return '';
});
console.log(formValueTime, formValue);
To grab the value, you may use
/[?&]t=(\d*)$/g.exec(formValue);
Pattern details
[?&] - a character class matching ? or &
t= - t= substring
(\d*) - Group 1 matching zero or more digits
$ - end of string
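The pattern above could be wrapped in a small helper along these lines. This is only a sketch: the function name splitTimeParam and the shape of the returned object are made up for illustration, and it uses \d+ instead of \d* so that an empty "t=" is not treated as a time.

```javascript
// Hypothetical helper: separate a trailing ?t=N or &t=N from the rest of the link.
function splitTimeParam(url) {
  var match = /[?&]t=(\d+)$/.exec(url);
  if (!match) {
    return { url: url, seconds: null }; // no trailing time parameter found
  }
  return {
    url: url.slice(0, match.index),     // everything before the ? or &
    seconds: parseInt(match[1], 10)     // group 1 holds the digits
  };
}

console.log(splitTimeParam('youtu.be/hTWKbfoikeg?t=82'));
console.log(splitTimeParam('youtu.be/hTWKbfoikeg&t=82'));
```

Both calls give the same url and seconds values, because the character class accepts either separator in that one position.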
/\?t=(\d*)|\&t=(\d*)$/g
you inverted the escape character for the second RegEx.
http://regexr.com/3gcnu
I want to thank you all for trying to help. Special thanks to @Wiktor Stribiżew, who gave the closest answer.
Now the piece of code I needed looks exactly like this:
/[?&]t=(\d*)$/g.exec(formValue);
So that's the [?&] part that solved the problem.
I use the array later, so /\?t=(\d*)|\&t=(\d*)$/g doesn't help: I get an array like [&t=50,,50] when the link uses & and the correct [?t=50,50] when the link uses ?, simply because of the order of the alternatives in the RegExp.
Now, if you're looking for a piece of RegExp that accepts either of two characters in one position while the rest of the RegExp stays the same, you can use a character class such as [?&], where the wanted characters are ? and &.
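To illustrate the difference described above, here is a short sketch comparing the result arrays of the two patterns (without the g flag, which exec does not need here). The variable names are made up for this example.

```javascript
var twoGroups = /\?t=(\d*)|&t=(\d*)$/; // one capture group per alternative
var oneGroup = /[?&]t=(\d*)$/;         // a single capture group for both cases

var withTwoGroups = twoGroups.exec('youtu.be/hTWKbfoikeg&t=50');
var withOneGroup = oneGroup.exec('youtu.be/hTWKbfoikeg&t=50');

// With two groups, the digits land in index 2 and index 1 is undefined.
console.log(withTwoGroups[1], withTwoGroups[2]); // undefined '50'

// With the character class, the digits are always in index 1.
console.log(withOneGroup[1]);                    // '50'
```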

How to split a string by a character not directly preceded by a character of the same type?

Let's say I have a string: "We.need..to...split.asap". What I would like to do is to split the string by the delimiter ., but I only wish to split by the first . and include any recurring .s in the succeeding token.
Expected output:
["We", "need", ".to", "..split", "asap"]
In other languages, I know that this is possible with a look-behind /(?<!\.)\./ but Javascript unfortunately does not support such a feature.
I am curious to see your answers to this question. Perhaps there is a clever use of look-aheads that presently evades me?
I was considering reversing the string, then re-reversing the tokens, but that seems like too much work for what I am after... plus controversy: How do you reverse a string in place in JavaScript?
Thanks for the help!
Here's a variation of the answer by guest271314 that handles more than two consecutive delimiters:
var text = "We.need.to...split.asap";
var re = /(\.*[^.]+)\./;
var items = text.split(re).filter(function(val) { return val.length > 0; });
It uses the detail that if the split expression includes a capture group, the captured items are included in the returned array. These capture groups are actually the only thing we are interested in; the tokens are all empty strings, which we filter out.
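To make the capture-group behaviour visible, this sketch prints the array returned by split before and after filtering. The variable name rawParts is only for illustration.

```javascript
var text = 'We.need.to...split.asap';
var re = /(\.*[^.]+)\./;

// split() inserts each captured group between the (here empty) tokens.
var rawParts = text.split(re);
console.log(JSON.stringify(rawParts));
// ["","We","","need","","to","","..split","asap"]

// Dropping the empty strings leaves the captured groups plus the final token.
var items = rawParts.filter(function (val) { return val.length > 0; });
console.log(JSON.stringify(items));
// ["We","need","to","..split","asap"]
```

The last element, "asap", is an ordinary token rather than a capture, because no delimiter follows it.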
EDIT: Unfortunately there's perhaps one slight bug with this. If the text to be split starts with a delimiter, that will be included in the first token. If that's an issue, it can be remedied with:
var re = /(?:^|(\.*[^.]+))\./;
var items = text.split(re).filter(function(val) { return !!val; });
(I think this regex is ugly and would welcome an improvement.)
You can do this without any lookaheads:
var subject = "We.need.to....split.asap";
var regex = /\.?(\.*[^.]+)/g;
var matches, output = [];
while(matches = regex.exec(subject)) {
output.push(matches[1]);
}
document.write(JSON.stringify(output));
It seemed like it'd work in one line, as it did on https://regex101.com/r/cO1dP3/1, but had to be expanded in the code above because the /g option by default prevents capturing groups from returning with .match (i.e. the correct data was in the capturing groups, but we couldn't immediately access them without doing the above).
See: JavaScript Regex Global Match Groups
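In engines that support String.prototype.matchAll (added in ES2020, so this is an assumption about the runtime), the capture groups of a global regex can be read without the explicit exec loop. A minimal sketch:

```javascript
var subject = 'We.need.to....split.asap';

// matchAll yields one match object per hit, each with its capture groups;
// the mapping function passed to Array.from keeps only group 1.
var output = Array.from(subject.matchAll(/\.?(\.*[^.]+)/g), function (m) {
  return m[1];
});

console.log(JSON.stringify(output));
// ["We","need","to","...split","asap"]
```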
An alternative solution with the original one liner (plus one line) is:
document.write(JSON.stringify(
"We.need.to....split.asap".match(/\.?(\.*[^.]+)/g)
.map(function(s) { return s.replace(/^\./, ''); })
));
Take your pick!
Note: This answer can't handle more than 2 consecutive delimiters, since it was written according to the example in the revision 1 of the question, which was not very clear about such cases.
var text = "We.need.to..split.asap";
// split "." if followed by "."
var res = text.split(/\.(?=\.)/).map(function(val, key) {
// if `val[0]` does not begin with "." split "."
// else split "." if not followed by "."
return val[0] !== "." ? val.split(/\./) : val.split(/\.(?!.*\.)/)
});
// concat arrays `res[0]` , `res[1]`
res = res[0].concat(res[1]);
document.write(JSON.stringify(res));
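For completeness: lookbehind assertions were added to the language in ES2018, so in engines that implement them the look-behind from the question can be used directly. This sketch assumes such an engine; older browsers will throw a syntax error on the regex literal.

```javascript
var text = 'We.need..to...split.asap';

// Split on a "." only when the character before it is not also a ".".
var tokens = text.split(/(?<!\.)\./);

console.log(JSON.stringify(tokens));
// ["We","need",".to","..split","asap"]
```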

Regex converting & to &amp;

I am developing a small character encoder generator where the user inputs their text and, on the click of a button, it outputs the encoded version.
I've defined an object of the characters that need to be encoded like so:
map = {
'©' : '&copy;',
'&' : '&amp;'
},
And here is the loop that gets the values from the map and replaces them:
Object.keys(map).forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
I am then simply outputting the result to a textarea. This all works fine; however, the problem I'm facing is this.
© is replaced with &copy;, but the & symbol at the beginning of that entity is then converted to &amp;, so it ends up being &amp;copy;.
I see why this is happening however I'm not sure how to go about ensuring that & is not replaced within character encoded strings.
Here is a JSFiddle for a live preview of what I mean:
http://jsfiddle.net/4m3nw/1/
Any help would be much appreciated
Prelude: Apart from regex, an idea worth considering is something like this JS function that already handles html entities. Now, on to the regex question.
HTML Special Characters, Negative Lookahead
In HTML, special characters can look not only like &copy; but also like &#8212;, and they can have upper-case characters.
To replace ampersands that are not immediately followed by a hash or word characters and a semicolon, you can use something like this:
&(?!(?:#[0-9]+|[a-z]+);)
See the demo.
Make sure to use the i flag to activate case-insensitive mode
& matches the literal ampersand
The negative lookahead (?!(?:#[0-9]+|[a-z]+);) asserts that it is not followed by...
(?:#[0-9]+|[a-z]+) a hash and digits, | OR letters...
then a semicolon.
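As a sketch of how that lookahead could be applied, the helper below encodes only the ampersands that do not already start an entity. The function name encodeAmpersands and the sample string are made up for this illustration.

```javascript
// An ampersand NOT followed by "#digits;" or "letters;" (i flag: any letter case).
var bareAmpersand = /&(?!(?:#[0-9]+|[a-z]+);)/gi;

// Hypothetical helper: encode bare ampersands, leave existing entities alone.
function encodeAmpersands(text) {
  return text.replace(bareAmpersand, '&amp;');
}

console.log(encodeAmpersands('Tom & Jerry &copy; 2014 &#8212; R&D'));
// Tom &amp; Jerry &copy; 2014 &#8212; R&amp;D
```

Running the helper a second time leaves the string unchanged, because every &amp; it produced now looks like an entity.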
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
The problem is that, since you process the same string repeatedly, you also replace the & in &copy;. Re-ordering your map seemingly solves the problem. However, according to the ECMAScript specification (at the time of writing), the key order is not guaranteed, so you would be relying on implementation details of the ECMAScript engine used.
What you can do to make sure it will always work is to swap the keys so that & is always processed first:
map = {
'©' : '&copy;',
'&' : '&amp;'
};
var keys = Object.keys(map);
keys[keys.indexOf('&')] = keys[0];
keys[0] = '&';
keys.forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
Obviously you need to add checks for the &'s existence if it isn't always there.
jsFiddle Demo.
Probably the simplest code change is to reorder your map by putting the ampersand on top.
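A different technique from the reordering suggested above, offered only as a sketch: build one regular expression from all keys and replace in a single pass. Each character of the input is then looked at once, so an already inserted &copy; is never re-scanned and the key order no longer matters. The names entityMap, escapeForRegExp and encode are made up for this example.

```javascript
var entityMap = {
  '©': '&copy;',
  '&': '&amp;'
};

// Escape characters that have a special meaning inside a regular expression.
function escapeForRegExp(s) {
  return s.replace(/[.?*+^$[\]\\(){}|-]/g, '\\$&');
}

// One alternation of all keys, e.g. /©|&/g.
var pattern = new RegExp(Object.keys(entityMap).map(escapeForRegExp).join('|'), 'g');

// Replace every key in one pass, looking up its entity in the map.
function encode(raw) {
  return raw.replace(pattern, function (ch) {
    return entityMap[ch];
  });
}

console.log(encode('© 2014 Tom & Jerry'));
// &copy; 2014 Tom &amp; Jerry
```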

extracting middle OR final part of a string

I want to extract only the first fontname out of a URL-string from the Google Webfont Directory. Here are some examples of possible strings and what part should be returned:
fonts.googleapis.com/css?family=Raleway // "Raleway"
fonts.googleapis.com/css?family=Caesar+Dressing // "Caesar Dressing"
fonts.googleapis.com/css?family=Raleway:300,400 // "Raleway"
fonts.googleapis.com/css?family=Raleway|Fondamento // "Raleway"
fonts.googleapis.com/css?family=Caesar+Dressing|Raleway:300,400|Fondamento // "Caesar Dressing"
So sometimes it's just one fontname, sometimes it has a weight indicated by a colon (:) and sometimes there are more fontnames divided by a pipe (|).
I have tried /family=(\S*)[:|]/ but it only matches the strings that contain : or |. I could do it like this, but it's not a nice solution:
var fontUrl = "fonts.googleapis.com/css?family=Caesar+Dressing|Raleway:300,400|Fondamento";
var fontName = /family=(\S*)/.exec(fontUrl)[1].replace(/\+/, " ");
if (fontName.indexOf(':') != -1){
fontName = fontName.split(':')[0];
}
if (fontName.indexOf('|') != -1){
fontName = fontName.split('|')[0];
}
console.log(fontName);
Is there a nice regex solution to this?
Instead of matching the character that (might) follow the string you want, match only the string you want except those characters:
/family=([^\s:|]*)/
Alternatively, you'd use a lookahead like this:
/family=(\S*?)(?=$|[:|])/
That should be better:
/family=([^:|]*)/
Of course for the + case, you'll have to replace it afterwards (or before maybe).
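Putting the two steps together, a minimal sketch could look like this. The function name firstFontName is made up, and the pattern is the /family=([^\s:|]*)/ variant from above followed by the + replacement.

```javascript
// Hypothetical helper: first font family from a Google Fonts URL, or null.
function firstFontName(url) {
  var match = /family=([^\s:|]*)/.exec(url); // stop at whitespace, ":" or "|"
  return match ? match[1].replace(/\+/g, ' ') : null; // "+" encodes a space
}

console.log(firstFontName('fonts.googleapis.com/css?family=Raleway:300,400'));
// Raleway
console.log(firstFontName('fonts.googleapis.com/css?family=Caesar+Dressing|Raleway:300,400|Fondamento'));
// Caesar Dressing
```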
You can use (with the i and m modifiers in all cases):
family=([a-z]+\+?[a-z]+)
or more simply
family=([a-z+]+)
or to avoid matching the + char:
family=([a-z]+)\+?([a-z]+)?
but it is easier to use the second solution and replace the + chars with a space afterwards.
try this:
/family\=(\S+?)[\:\|,]{0,2}\S*/ims
No regex is required in this case. Unless you are good with regexes or test them thoroughly, you are likely to make mistakes.
var fontUrls = [];
fontUrls.push("fonts.googleapis.com/css?family=Raleway");
fontUrls.push("fonts.googleapis.com/css?family=Caesar+Dressing");
fontUrls.push("fonts.googleapis.com/css?family=Raleway:300,400");
fontUrls.push("fonts.googleapis.com/css?family=Raleway|Fondamento");
fontUrls.push("fonts.googleapis.com/css?family=Caesar+Dressing|Raleway:300,400|Fondamento");
function getFirstFont(url) {
return url.split("=")[1].split("|")[0].split(":")[0].replace(/\+/g, " ");
}
fontUrls.forEach(function (fontUrl) {
console.log(getFirstFont(fontUrl));
});
on jsfiddle

JavaScript RegEx to match punctuation NOT part of any HTML tags

Okay, I know there's much controversy about matching and parsing HTML with a RegEx, but I was wondering if I could have some help. Case in point:
I need to match any punctuation characters, e.g. . , " ' but I don't want to ruin any HTML, so ideally the match should occur between a > and a <. Essentially my query isn't so much about parsing HTML as about avoiding it.
I'm going to attempt to wrap each instance in a <span></span>, but having absolutely no experience with RegEx, I'm not sure I'm able to do it.
I've figured out the character set [\.\,\'\"\?\!] but I'm not sure how to match character sets that only occur between certain characters. Can anybody help?
To start off, here's a X-browser dom-parser function:
var parseXML = (function(w,undefined)
{
'use strict';
var parser,ie = false;
switch (true)
{
case w.DOMParser !== undefined:
parser = new w.DOMParser();
break;
case w.ActiveXObject !== undefined:
parser = new w.ActiveXObject("Microsoft.XMLDOM");
parser.async = false;
ie = true;
break;
default :
throw new Error('No parser found');
}
return function(xmlString)
{
if (ie === true)
{//return DOM
parser.loadXML(xmlString);
return parser;
}
return parser.parseFromString(xmlString,'text/xml');
};
})(this);
//usage:
var newDom = parseXML(yourString);
var allTags = newDom.getElementsByTagName('*');
for(var i=0;i<allTags.length;i++)
{
if (allTags[i].tagName.toLowerCase() === 'span')
{//if all you want to work with are the spans:
if (allTags[i].hasChildNodes())
{
//this span has nodes inside, don't apply regex:
continue;
}
allTags[i].innerHTML = allTags[i].innerHTML.replace(/[.,?!'"]+/g,'');
}
}
This should help you on your way. You still have access to the DOM, so whenever you find a string that needs filtering/replacing, you can reference the node using allTags[i] and replace the contents. Note that looping through all elements isn't recommended, but I didn't really feel like doing all of the work for you ;-). You'll have to check what kind of node you're handling:
if (allTags[i].tagName.toLowerCase() === 'span')
{//do certain things
}
if (allTags[i].tagName.toLowerCase() === 'html')
{//skip
continue;
}
And that sort of stuff... Note that this code is not tested, but it's a simplified version of my answer to a previous question. The parser bit should work just fine; in fact, here's a fiddle I've set up for that other question, which also shows you how you might want to alter this code to better suit your needs.
Edit: As Elias pointed out, native JavaScript (before ES2018) doesn't support lookbehinds. I'll leave this up in case someone else looks for something similar; just be aware.
Here is the regex I got to work. It requires lookaheads and lookbehinds, and I'm not familiar enough with JavaScript to know whether those are supported. Either way, here is the regex:
(?<=>.*?)[,."'](?=.*<)
Breakdown:
1. (?<=>.*?) --> The match must be preceded by ">" followed by any characters
2. [,."'] --> Matches one of the characters: , . " '
3. (?=.*<) --> The match must be followed by any characters and then "<"
This essentially means it will match any of the characters you want in between a set of > <.
That being said, I would suggest as Point mentioned in the comments to parse the HTML with a tool designed for that, and search through the results with the regex [,."'].
Dan, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
The Dom parser solution was great. With all the disclaimers about using regex to parse html, I'd like to add a simple way to do what you wanted with regex in Javascript.
The regex is very simple:
<[^>]*>|([.,"'])
The left side of the alternation matches complete tags. We will ignore these matches. The right side matches and captures punctuation to Group 1, and we know they are the right punctuation because they were not matched by the expression on the left.
On this demo, looking at the lower right pane, you can see that only the right punctuation is captured to Group 1.
You said you wanted to embed the punctuation in a <span>. This Javascript code will do it.
I've replaced the <tags> with {tags} to make sure the example displays in the browser.
<script>
var subject = 'true ,she said. {tag \" . ,}';
var regex = /{[^}]*}|([.,"'])/g;
var replaced = subject.replace(regex, function(m, group1) {
// group1 is undefined (or "" in some old engines) when a {tag} was matched
if (!group1) return m;
else return "<span>" + group1 + "</span>";
});
document.write(replaced);
</script>
Here's a live demo
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
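The same alternation trick with real angle-bracket tags could be sketched as a function like the one below. The name wrapPunctuation and the sample markup are made up for this illustration, and the sketch assumes that no attribute value contains a ">" character, which would end the tag match too early.

```javascript
// Hypothetical helper: wrap punctuation outside of tags in <span> elements.
function wrapPunctuation(html) {
  return html.replace(/<[^>]*>|([.,"'])/g, function (match, punct) {
    // punct is undefined when the left alternative (a whole tag) matched.
    return punct === undefined ? match : '<span>' + punct + '</span>';
  });
}

console.log(wrapPunctuation('<p class="a.b">Hi, there.</p>'));
// <p class="a.b">Hi<span>,</span> there<span>.</span></p>
```

The quotes and the dot inside the class attribute are left alone because the whole tag is consumed by the left side of the alternation first.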
