regular expression to split string avoiding double tokens - javascript

In order to split a string value into an array using javascript I need to split using delimiters. Repeated delimiters indicate a sub-value within the array, so for example
abc+!+;def+!+!+;123+!+;xyz
should split into abc, [def, 123], xyz
My nearest expression is ((?:+!(?!+!))+\;|$) but thinking about it that may be the one I first started with, as I've gone through so many many variations since then.
There is probably a blindingly obvious answer, but after what seems an eternity I'm now stumped. I took a look at regex to parse string with escaped characters, and similar articles which were close although not the same problem, but basically came to a stop with ideas.
Somewhere out there someone will know regular expressions far better than I do, and I hope they have an answer.

I got this to work by using .split() with this basic pattern:
\b\+!\+;\b
And then:
\b\+!\+!\+;\b
And so on, and so forth. You will need to turn this into a recursive function, but I made a basic JSFiddle to get you started. First we split the string using our first expression. Then we create our newer expression by adding !\+ (this can easily be done dynamically). Now we can loop through our initial array, see if the string matches our new expression and if it does split it again.
var string = 'abc+!+;def+!+!+;123+!+;xyz',
    data = string.split(/\b\+!\+;\b/); // first pass: split on the single delimiter
var regex = /\b\+!\+!\+;\b/; // second pass: the doubled delimiter
for (var i = 0; i < data.length; i++) {
    var item = data[i];
    if (regex.test(item)) {
        // this element holds a sub-array, so split it again
        data[i] = item.split(regex);
    }
}
console.log(data);
// ["abc", ["def", "123"], "xyz"]
I'm leaving the task of making this a recursive function up to OP. If you want some direction, I can try to provide some more insight.
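For readers who want that direction, here is one possible recursive sketch of the idea described above. The name splitNested and the depth parameter are made up for illustration, and it assumes the delimiter always follows the +!+; / +!+!+; pattern, with one extra !+ per nesting level.

```javascript
// Hypothetical helper: splits on the delimiter for the current depth,
// then recurses into any piece that still contains a deeper delimiter.
function splitNested(input, depth = 1) {
  // depth 1 -> \b\+!\+;\b, depth 2 -> \b\+!\+!\+;\b, and so on
  const delimiter = new RegExp('\\b\\+(?:!\\+){' + depth + '};\\b');
  // matches a delimiter of any depth
  const anyDelimiter = /\b\+(?:!\+)+;\b/;
  return input.split(delimiter).map(part =>
    anyDelimiter.test(part) ? splitNested(part, depth + 1) : part
  );
}

console.log(JSON.stringify(splitNested('abc+!+;def+!+!+;123+!+;xyz')));
// ["abc",["def","123"],"xyz"]
```

If a piece skips a level (for example, it contains a triple delimiter but no double one), this sketch wraps it in an extra single-element array, so adjust it if your data can do that.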

Related

Get an example matched text from a regex pattern [duplicate]

Is there any way of generating random text which satisfies provided regular expression.
I am looking for a function which works like below
var reg = Some Regular Expression
var str = RandString(reg)
I have seen fairly good solutions in perl and ruby on github, but I think there are technical issues that make a complete solution impossible. For example, /[0-9]+/ has an infinite upper bound, which is not practical for selecting random numbers from.
Never seen it in JavaScript, but you could translate.
EDIT: After googling for a few seconds...
https://github.com/fent/randexp.js
If you know what the regular expression is, you can just generate random strings, then use a function that references the index of each letter and changes it as needed. Regular expressions vary widely, so it will be difficult to find one approach that handles every possible regex.
Your question is pretty open so hopefully this steers you to the right solution. Get the current time (in seconds), MD5 it, check it against a REGEX, return the match.
Running Example: http://jsfiddle.net/MattLo/3gKrb/
Usage: RandString(/([A-Za-z])/ig); // expected to be a string
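The generate-and-test idea can be sketched without any hashing library. This is only an illustration and not the code behind the linked fiddle: randString, maxAttempts, and the use of Math.random().toString(36) as the source of candidate text are all assumptions, and it can only ever find matches made of lowercase letters and digits.

```javascript
// Hypothetical generate-and-test helper: produce random text, return the
// first piece of it that the regex matches, give up after maxAttempts.
function randString(regex, maxAttempts = 1000) {
  // Drop the g and y flags so match() returns only the first match
  const flags = regex.flags.replace(/[gy]/g, '');
  const re = new RegExp(regex.source, flags);
  for (let i = 0; i < maxAttempts; i++) {
    // Random run of characters from 0-9 and a-z
    const candidate = Math.random().toString(36).slice(2);
    const match = candidate.match(re);
    if (match) return match[0];
  }
  return null; // nothing matched within the attempt budget
}

console.log(randString(/[a-z][0-9a-z]{2}/));
```

For patterns that random text is unlikely to hit by chance, a library that walks the regex itself, such as randexp.js, is a better fit.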
For JavaScript, the following modules can generate a random match to a regex:
pxeger
randexp.js
regexgen

Match between simple delimiters, but not delimiters themselves

I was looking at JSON data that was just in a text file. I don't want to do anything aside from just use regex to get the values in between quotes. I'm just using this as a way to help practice regex and got to this point that seems like it should be simple, but it turns out it's not (at least to me and a few other people at the office). I've matched complicated urls with ease in regex so I'm not completely new to regex. This just seems like a weird case for me.
I've tried:
/(?:")(.*?)(?:")/
/"(.*?)"/
and several others but these got me the closest.
Basically we can forget that it's JSON and just say I want to match the words value and stuff out of "value" and "stuff". Everything I try includes the quotes, so I'd have to clean the strings afterwards of the delimiters or else the string is literally "value" with the quotes.
Any help would be much appreciated, whether this is simple or complicated, I'd love to know! Thanks
Update: Alright so I think I'll go with (?<=")(.*?)(?=") and read things by line without the global setting on so I just get the first match on each line. In my code I was just plopping in a huge string into a var in the code instead of actually opening a file with ajax/filereader or having a form setup to input data. I think I'll mark this as solved, much appreciated!
You have two choices to solve this problem:
Use capturing groups
You can match the delimiters and use capturing groups to get the text within. In this case your two regexes will work, but you need to access capturing group 1 to get the results (demo). See How do you access the matched groups in a JavaScript regular expression? for how to do that.
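As a minimal sketch of reading capturing group 1, assuming an engine that supports String.prototype.matchAll (ES2020) and a made-up sample string:

```javascript
// Sample input is invented for illustration
const text = '{"value": "stuff"}';

// Each match object holds the whole match at index 0 (with quotes)
// and the first capturing group at index 1 (without quotes).
const values = Array.from(text.matchAll(/"(.*?)"/g), m => m[1]);

console.log(values); // ["value", "stuff"]
```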
Use zero-width assertions
You can use zero-width assertions to match only the text within, require delimiters around them without actually matching them (demo):
(?<=")(.*?)(?=")
Note, however, that because the quotes are not consumed, this will find matches between every two neighbouring quotes, not just between pairs of quotes: e.g., a"b"c" would find b and c.
As for getting just the first match, I think that'll happen by default in JavaScript. You'd have to ask for repeated matching before you see the subsequent ones. So if you process your file one line at a time, you should get what you want.
get the values in between quotes
One thing to keep in mind is that valid JSON accepts escaped quotes inside the quoted values. Therefore, the RegEx should take this into account when capturing the groups which is done with the “unrolling-the-loop” pattern.
var pattern = /"[^"\\]*(?:\\.[^"\\]*)*"/g; // a quoted string, allowing backslash escapes
var data = {
    "value": "This is \"stuff\".",
    "empty": "",
    "null": null,
    "number": 50
};
var dataString = JSON.stringify(data);
console.log(dataString);
var matched = dataString.match(pattern);
// each match still includes its quotes, so JSON.parse unwraps and unescapes it
matched.forEach(item => console.log(JSON.parse(item)));

Why would the replace with regex not work even though the regex does?

There may be a very simple answer to this, probably because of my familiarity (or possibly lack thereof) of the replace method and how it works with regex.
Let's say I have the following string: abcdefHellowxyz
I just want to strip the first six characters and the last four, to return Hello, using regex... Yes, I know there may be other ways, but I'm trying to explore the boundaries of what these methods are capable of doing...
Anyway, I've tinkered on http://regex101.com and got the following Regex worked out:
/^(.{6}).+(.{4})$/
Which seems to pass the string well and shows that abcdef is captured as group 1, and wxyz captured as group 2. But when I try to run the following:
"abcdefHellowxyz".replace(/^(.{6}).+(.{4})$/,"")
to replace those captured groups with "" I receive an empty string as my final output... Am I doing something wrong with this syntax? And if so, how does one correct it, keeping my original stance on wanting to use Regex in this manner...
Thanks so much everyone in advance...
The code below does what you want:
"abcdefHellowxyz".replace(/^.{6}(.+).{4}$/,"$1")
Use () to capture only the text you want to keep; in the second parameter of replace() you can then use $1, $2, ... to refer to group 1, group 2, and so on.
You can also pass a function as the second parameter of replace() and transform the captured text into whatever you want inside that function.
For more detail, as @Akxe recommended, see the documentation at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace.
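A small sketch of the function form, using the same string; the upper-casing step is just an arbitrary transformation chosen for illustration:

```javascript
const input = "abcdefHellowxyz";

// The callback receives the whole match first, then each captured group.
// Whatever it returns replaces the whole match.
const shouted = input.replace(/^.{6}(.+).{4}$/, (whole, middle) => middle.toUpperCase());

console.log(shouted); // "HELLO"
```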
You are replacing any substring that matches /^(.{6}).+(.{4})$/, with this line of code:
"abcdefHellowxyz".replace(/^(.{6}).+(.{4})$/,"")
The regex matches the whole string "abcdefHellowxyz"; thus, the whole string is replaced. Instead, if you are strictly stripping by the lengths of the extraneous substrings, you could simply use substring or substr.
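For comparison, the substring route mentioned above might look like this (a sketch that assumes the prefix and suffix lengths are always 6 and 4):

```javascript
const source = "abcdefHellowxyz";

// Keep everything from index 6 up to, but not including, the last 4 characters
const middle = source.substring(6, source.length - 4);

console.log(middle); // "Hello"
```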
Edit
The answer you're probably looking for is capturing the middle token, instead of the outer ones:
var str = "abcdefHellowxyz";
var matches = str.match(/^.{6}(.+).{4}$/);
str = matches[1]; // index 0 is entire match
console.log(str);

regex between two character positions with known start and end indices

In regex, generally speaking, is there a way to select data between two line positions? I'm not even sure the correct terminology (character/line position, index, column?) after a few days of reading up on regex, but what I mean is...
Select the data between two indices, what is between ^.{4} and ^.{7}, for example:
TESTINGREGEX
ISNTTHEBEST!
or
TESTINGREGEXCANBEFUN
ISNTTHEBEST!ANDFARFROMFUN
the results I'm looking for would be:
TESTREGEX
ISNTBEST!
and
TESTREGEXCANBEFUN
ISNTBEST!ANDFARFROMFUN
I'm wondering, so I can learn if it's possible, how to achieve it? I'm very familiar with other ways to do this using other tools, but I'm curious how to achieve this using regex.
I've tried working with non capturing groups, and wondering if maybe I'm being limited by the fact that I'm attempting to apply this regex within the atom editor find and replace regex feature (falling victim to: Avoiding Common Pitfalls), so I'm hoping to get a few suggestions to broaden my knowledge and try out. I'm guessing javascript, and/or sed style regex answers would be acceptable...really anything would help!
EDIT:
.{3}(?=.{5}$) from Mark's answer works for me and with the example text I gave in the OP. And it's a good thing to know when able to count from the $ end of line. But I'm realizing I actually need the opposite... I need to count out from the ^ start of line. Is this not possible; re: comments on there being no support for lookbehind?
With regex alone it's possible, just not in JavaScript engines that lack lookbehind. The regex (?<=^.{4}).+(?=.{5}$) matches the text between the 4th character and the 5th-to-last character. Lookbehind assertions were only added to JavaScript in ES2018, so on older engines you'll have to use some amount of JavaScript beyond a simple .replace(regex, "") to remove those characters.
The next closest regex possible without lookbehind would be .{3}(?=.{5}$), which matches the 3 characters before the 5th-to-last character.
Without lookbehind, matching only something that starts a few characters after the start of a string is impossible with pure regex in JavaScript.
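On an engine that implements ES2018 lookbehind (an assumption; current Node.js and recent browsers do), counting from the start of each line can be sketched like this, using the sample lines from the question:

```javascript
const lines = "TESTINGREGEX\nISNTTHEBEST!";

// (?<=^.{4}) asserts that exactly four characters precede the match on this line;
// the m flag makes ^ match at every line start, g applies it to every line.
const trimmed = lines.replace(/(?<=^.{4}).{3}/gm, "");

console.log(trimmed);
// TESTREGEX
// ISNTBEST!
```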
The regex ^(.{4}).{3}(.{5})$ (expressed in JavaScript's dialect, but the features used in it are quite common) will give you two capture groups you can combine to get the output you describe:
function test(str) {
    var match = str.match(/^(.{4}).{3}(.{5})$/);
    console.log(str, '=>', match[1] + match[2]);
}
test("TESTINGREGEX");
test("ISNTTHEBEST!");
If the lines are of varying length and you want to ignore everything after the end of what you want, just drop the $ assertion at the end.
If the purpose is to get the text between two character offsets then regular expressions are overkill. Just use slice:
function exclude(str, i, j) {
    return str.slice(0, i) + str.slice(j);
}
console.log(exclude("TESTINGREGEX", 4, 7));
console.log(exclude("ISNTTHEBEST!", 4, 7));
If you really need to do this with regular expressions then proceed as follows:
function exclude(str, i, j) {
    return str.replace(new RegExp(`^(.{${i}})(.{${j-i}})`), "$1");
}
console.log(exclude("TESTINGREGEX", 4, 7));
console.log(exclude("ISNTTHEBEST!", 4, 7));

Efficiently remove common patterns from a string

I am trying to write a function to calculate how likely two strings are to mean the same thing. In order to do this I am converting to lower case and removing special characters from the strings before I compare them. Currently I am removing the strings '.com' and 'the' using String.replace(substring, '') and special characters using String.replace(regex, '')
str = str.toLowerCase()
    .replace('.com', '')
    .replace('the', '')
    .replace(/[&\/\\#,+()$~%.'":*?<>{}]/g, '');
Is there a better regex that I can use to remove the common patterns like '.com' and 'the' as well as the special characters? Or some other way to make this more efficient?
As my dataset grows I may find other common meaningless patterns that need to be removed before trying to match strings and would like to avoid the performance hit of chaining more replace functions.
Examples:
Fish & Chips? => fish chips
stackoverflow.com => stackoverflow
The Lord of the Rings => lord of rings
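You can combine the replace calls into a single one with a regexp like this:
str = str.toLowerCase().replace(/\.com|the|[&\/\\#,+()$~%.'":*?<>{}]/g, '');
The different strings to remove are alternatives separated by pipes |, with the special characters kept in the character class [...] as the last alternative. Note that, unlike replace with a string pattern, which removes only the first occurrence, the g flag removes every occurrence.
This makes it easy to add more strings to the regexp.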
If you are storing the words to remove in an array, you can generate the regex using the RegExp constructor, e.g.:
var words = ["\\.com", "the"];
var rex = new RegExp(words.join("|") + "|[&\\/\\\\#,+()$~%.'\":*?<>{}]", "g");
Then reuse rex for each string:
str = str.toLowerCase().replace(rex, "");
Note the additional escaping required because instead of a regular expression literal, we're using a string, so the backslashes (in the words array and in the final bit) need to be escaped, as does the " (because I used " for the string quotes).
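If you would rather keep the word list as plain text and let the code do the escaping, something along these lines could work. The names escapeRegExp and buildCleaner are made up for this sketch, and the whitespace collapsing at the end is an extra step added here so the output matches the examples in the question.

```javascript
// Escape characters that have a special meaning in a regular expression
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical factory: builds the combined regex once and returns a cleaner function
function buildCleaner(words) {
  const alternatives = words.map(escapeRegExp).join('|');
  const rex = new RegExp(alternatives + "|[&\\/\\\\#,+()$~%.'\":*?<>{}]", 'g');
  return str => str.toLowerCase()
    .replace(rex, '')       // remove listed words and special characters
    .replace(/\s+/g, ' ')   // collapse the gaps left behind (added assumption)
    .trim();
}

const clean = buildCleaner(['.com', 'the']);
console.log(clean('Fish & Chips?'));         // "fish chips"
console.log(clean('stackoverflow.com'));     // "stackoverflow"
console.log(clean('The Lord of the Rings')); // "lord of rings"
```

As with the original chained calls, "the" is removed wherever it appears, including inside longer words, so add \b word boundaries if that matters for your data.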
The problem with this question is that I'm sure you have a very concrete idea in your mind of what you want to do, but the solution you have arrived at (removing uninformative letters before making an is-identical comparison) may not be the best for the comparison you want to do.
I think perhaps a better idea would be to use a different method of comparison and a different data structure than a string. A very simple example would be to condense your strings to sets with set('string') and then compare set similarity/difference. Another method might be to create a directed acyclic graph, or a substring trie. The main point is that it's probably OK to reduce the information from the original string and store/compare that. However, don't underestimate the value of storing the original string, as it will help you down the road if you want to change the way you compare.
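To make the set idea concrete in JavaScript, here is a rough sketch that compares the sets of characters in two strings. The Jaccard index (shared items divided by total distinct items) is one possible similarity measure chosen for illustration; the answer does not name a specific one, and the function name is made up.

```javascript
// Hypothetical similarity measure: Jaccard index over character sets.
// Returns 1 for identical sets, 0 for sets with nothing in common.
function jaccard(a, b) {
  const setA = new Set(a.toLowerCase());
  const setB = new Set(b.toLowerCase());
  let shared = 0;
  for (const ch of setA) {
    if (setB.has(ch)) shared++;
  }
  const unionSize = setA.size + setB.size - shared;
  return unionSize === 0 ? 1 : shared / unionSize; // two empty strings count as identical
}

console.log(jaccard('Fish & Chips?', 'fish chips'));
console.log(jaccard('abc', 'xyz')); // 0
```

Character sets ignore order and repetition, so in practice sets of words or of short substrings may discriminate better.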
Finally, if your strings are really really really long, you might want to use a perceptual hash - which is like an MD5 hash except similar strings have similar hashes. However, you will most likely have to roll your own for short strings, and define what you think is important data, and what is superfluous.
