RegEx for detecting a string and a path in one go


Here is an example of the kind of line I need a regex for.
I have many of these lines in a file:
build test/testfoo/CMakeFiles/testfoo2.dir/testfoo2.cpp.o: CXX_COMPILER__testfoo2_Debug /home/juxeii/projects/gtest-cmake-example/test/testfoo/testfoo2.cpp || cmake_object_order_depends_target_testfoo2
I need to detect the string between CXX_COMPILER__ and _Debug, which here is testfoo2.
At the same time, I also need to detect the entire file path /home/juxeii/projects/gtest-cmake-example/test/testfoo/testfoo2.cpp, which always comes after the first match.
I could not figure out a regex for this. So far I have .*CXX_COMPILER__(.\w+)_\w+|(\/[a-zA-Z_0-9-]+)+\.\w+ and I am using it in typescript like so:
const fileAndTargetRegExp = new RegExp('.*CXX_COMPILER__(.\w+)_\w+|(\/[a-zA-Z_0-9-]+)+\.\w+', 'gm');
let match;
while (match = fileAndTargetRegExp.exec(fileContents)) {
//do something
}
But I get no matches. Is there an easy way to do this?

Will it always have the || <stuff here> at the end? If so, this regex based on the one you provided should work:
/.*CXX_COMPILER__(\w+)_.+?((?:\/.+)+) \|\|.*/g
As the regex101 breakdown shows, the first capturing group should contain the string between CXX_COMPILER__ and _Debug, while the second should contain the path, using the space and pipes to detect where the latter ends.
let line = 'build test/testfoo/CMakeFiles/testfoo2.dir/testfoo2.cpp.o: CXX_COMPILER__testfoo2_Debug /home/juxeii/projects/gtest-cmake-example/test/testfoo/testfoo2.cpp || cmake_object_order_depends_target_testfoo2';
const matches = line.match(/.*CXX_COMPILER__(\w+)_.+?((?:\/.+)+) \|\|.*/).slice(1); //slice(1) just to not include the first complete match returned by match!
for (let match of matches) {
console.log(match);
}
If the pipes won't always be there, then this version should work instead (regex101):
.*CXX_COMPILER__(\w+)_.+?((?:\/(?:\w|\.|-)+)+).*
But it requires you to add all of the valid path characters individually every time you realize a new one might be there, and you'll need to make sure the paths don't have spaces because adding space to the regex would make it detect the stuff after the path too.
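To apply the first pattern to a whole file rather than a single line, one option is String.prototype.matchAll with the g flag. The sketch below assumes every build line ends with the || separator; the second sample line and the result field names target and path are made up for illustration.

```javascript
// Sample input: the first line is from the question, the second is a hypothetical sibling line.
const fileContents = [
  'build test/testfoo/CMakeFiles/testfoo2.dir/testfoo2.cpp.o: CXX_COMPILER__testfoo2_Debug /home/juxeii/projects/gtest-cmake-example/test/testfoo/testfoo2.cpp || cmake_object_order_depends_target_testfoo2',
  'build test/testfoo/CMakeFiles/testfoo1.dir/testfoo1.cpp.o: CXX_COMPILER__testfoo1_Debug /home/juxeii/projects/gtest-cmake-example/test/testfoo/testfoo1.cpp || cmake_object_order_depends_target_testfoo1',
].join('\n');

// Same pattern as in the answer, with the g flag so matchAll can iterate over every line.
const fileAndTargetRegExp = /.*CXX_COMPILER__(\w+)_.+?((?:\/.+)+) \|\|.*/g;

const results = [];
for (const match of fileContents.matchAll(fileAndTargetRegExp)) {
  // match[1] is the target name, match[2] is the source file path.
  results.push({ target: match[1], path: match[2] });
}

console.log(results);
```

Because . does not match a newline by default, each match stays within a single line of the file.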

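Looks good, but you need delimiters. Add "/" before and after your regex, with no quotation marks. Inside a quoted string, an escape such as \w loses its backslash before the RegExp constructor ever sees it.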
let fileContents = 'build test/testfoo/CMakeFiles/testfoo2.dir/testfoo2.cpp.o: CXX_COMPILER__testfoo2_Debug /home/juxeii/projects/gtest-cmake-example/test/testfoo/testfoo2.cpp || cmake_object_order_depends_target_testfoo2';
const fileAndTargetRegExp = new RegExp(/.*CXX_COMPILER__(.\w+)_\w+|(\/[a-zA-Z_0-9-]+)+\.\w+/, 'gm');
let match;
while (match = fileAndTargetRegExp.exec(fileContents)) {
console.log(match);
}
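The reason the quoted version finds nothing can be checked by printing the pattern each form actually produces. This is a small sketch using a shortened version of the question's pattern; the variable names are illustrative.

```javascript
// In a string literal, '\w' is just 'w', so the constructor receives the wrong pattern.
const fromString = new RegExp('CXX_COMPILER__(\w+)_\w+');
// Doubling the backslashes keeps them in the string.
const fromEscapedString = new RegExp('CXX_COMPILER__(\\w+)_\\w+');
// A regex literal needs no doubling.
const fromLiteral = /CXX_COMPILER__(\w+)_\w+/;

console.log(fromString.source);
console.log(fromEscapedString.source);
console.log(fromLiteral.source);

const sample = 'CXX_COMPILER__testfoo2_Debug';
console.log(fromString.test(sample), fromEscapedString.test(sample), fromLiteral.test(sample));
```

So there are two ways out: use a regex literal, or keep the string and double every backslash.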

Here's my way of doing it with replace.
The question says: "I need to detect the string between CXX_COMPILER__ and _Debug, which is here testfoo2."
Replace the whole line with just the first captured group, $1, which is the text between CXX_COMPILER__ and _Debug:
/.*CXX_COMPILER__(\w+)_Debug.*/
(the group (\w+) captures testfoo2)
The question also says: "I need to also detect the entire file path /home/juxeii/projects/gtest-cmake-example/test/testfoo/testfoo2.cpp"
Same idea, but this time keep only the second captured group, $2, which is everything after the first captured group up to the || separator:
/.*CXX_COMPILER__(\w+)_Debug\s+(.*?)(?=\s*\|\|).*/
(the group (.*?) captures /home/.../testfoo2.cpp)
let line = 'build test/testfoo/CMakeFiles/testfoo2.dir/testfoo2.cpp.o: CXX_COMPILER__testfoo2_Debug /home/juxeii/projects/gtest-cmake-example/test/testfoo/testfoo2.cpp || cmake_object_order_depends_target_testfoo2'
console.log(line.replace(/.*CXX_COMPILER__(\w+)_Debug.*/gm,'$1'))
console.log(line.replace(/.*CXX_COMPILER__(\w+)_Debug\s+(.*?)(?=\s*\|\|).*/gm,'$2'))

Related

How to capture one particular instance of a string only if it occurs twice in regex?

I have a regex expression:
/diff\\left\((...*?\\right\){0,1})\\right\)/gm
and the string I want to match is
diff\left(5x^2\right) + diff\left(5x^2+\tan\left(x\right)\right)
I want to match in such a way that there are two matches: diff\left(5x^2\right) and diff\left(5x^2+\tan\left(x\right)\right),
with captured groups 5x^2 and 5x^2+\tan\left(x\right) respectively.
I want to add \right) inside a captured group once only if it occurs twice.
However, I'm only getting a single match with the entire 5x^2\right)+diff\left(5x^2+\tan\left(x\right) inside a captured group.
(The question originally included two regex101 screenshots, one showing the current output and one showing the desired output, with matches in blue and captured groups in green.)
Please help me with this; I'm trying to build a symbolic calculator app. Thanks.

If those two parts are to always be bound by space characters, you could try something like the below:
https://regex101.com/r/Lcsxxv/1
const regex = /diff\\left\(([^ ]*)\\right\)/gm;
const str = `diff\\left(5x^2\\right) + diff\\left(5x^2+\\tan\\left(x\\right)\\right)`;
const matches = [];
const groups = [];
let r;
while ((r = regex.exec(str)) !== null) {
matches.push(r[0]);
groups.push(r[1]);
}
console.log(`matches:\n\t${matches.join('\n\t')}
groups:\n\t${groups.join('\n\t')}`)
The way it works is that it's going to look for the last instance of \right) until either the end of the string or a space character, whichever comes first.
I hope this answers your question.

How can I cut the string after a second underscore?

I'm receiving a list of files in an object and I just need to display a file name and its type in a table.
All files come back from the server in this format: timestamp_id_filename.
Example: 1568223848_12345678_some_document.pdf
I wrote a helper function which cuts the string.
At first I tried String.prototype.split() with a regex, but that caused a problem: file names can themselves contain underscores, so it didn't work and I needed something else. I couldn't come up with a better idea than the function below. I think it looks really dumb, and it's been haunting me the whole day.
The function looks like this:
const shortenString = (attachmentName) => {
const file = attachmentName
.slice(attachmentName.indexOf('_') + 1)
.slice(attachmentName.slice(attachmentName.indexOf('_') + 1).indexOf('_') + 1);
const fileName = file.slice(0, file.lastIndexOf('.'));
const fileType = file.slice(file.lastIndexOf('.'));
return [fileName, fileType];
};
I wonder if there is a more elegant way to solve the problem without using loops.

You can use replace and split: the pattern removes everything from the start of the string up to and including the second _, and then we split on . to get the name and the type.
let nameAndType = (str) => {
let replaced = str.replace(/^(?:[^_]*_){2}/g, '')
let splited = replaced.split('.')
let type = splited.pop()
let name = splited.join('.')
return {name,type}
}
console.log(nameAndType("1568223848_12345678_some_document.pdf"))
console.log(nameAndType("1568223848_12345678_some_document.xyz.pdf"))

function splitString(val){
return val.split('_').slice(2).join('_');
}

const getShortString = (str) => str.replace(/^(?:[^_]*_){2}/g, '')
For input like 1568223848_12345678_some_document.pdf, it returns some_document.pdf

const re = /(.*?)_(.*?)_(.*)/;
const name = "1568223848_12345678_some_document.pdf";
const [, date, id, filename] = re.exec(name);
console.log(date);
console.log(id);
console.log(filename);
Some notes:
You want to create the regular expression only once. If you do this
function getParts(str) {
const re = /expression/;
...
}
then you're making a new regular expression object every time you call getParts.
.*? can be faster than .*
This is because .* is greedy: the moment the regular expression engine sees it, it puts the entire rest of the string into that slot and then checks whether it can continue the expression. If that fails, it backs off one character. If that fails, it backs off another character, and so on. .*? on the other hand is satisfied as soon as possible: it first tries to match nothing, then sees whether the next part of the expression works; if not, it adds one more character and checks again, and so on.
Splitting on '_' works, but it could potentially make many temporary strings, for example if the filename is 1234_1343_a________________________.pdf
You'd have to test to see whether using a regular expression is faster or slower than splitting, assuming speed matters.
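Speed aside, greedy and lazy quantifiers can also split the same input differently, which matters for file names that contain underscores. A small sketch using the sample name from the question (the variable names are illustrative):

```javascript
const sampleName = '1568223848_12345678_some_document.pdf';

// Lazy groups stop at the first underscore they can, so the file name stays whole.
const lazyParts = /(.*?)_(.*?)_(.*)/.exec(sampleName).slice(1);

// Greedy groups grab as much as they can, so the split moves to the last two underscores.
const greedyParts = /(.*)_(.*)_(.*)/.exec(sampleName).slice(1);

console.log(lazyParts);   // [ '1568223848', '12345678', 'some_document.pdf' ]
console.log(greedyParts); // [ '1568223848_12345678', 'some', 'document.pdf' ]
```

Only the lazy form gives the timestamp, id and file name the question asks for.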

You can kinda chain .indexOf to get second offset and any further, although more than two would look ugly. The reason is that indexOf takes start index as second argument, so passing index of the first occurrence will help you find the second one:
var secondUnderscoreIndex = name.indexOf("_",name.indexOf("_")+1);
So my solution would be:
var index = name.indexOf("_", name.indexOf("_") + 1);
// prefix is "timestamp_id", fileName is everything after the second underscore
var [prefix, fileName] = [name.substring(0, index), name.substring(index + 1)];
Alternatively, using regular expression:
var [,number1, number2, filename, extension] = /([0-9]+)_([0-9]+)_(.*?)\.([0-9a-z]+)/i.exec(name)
// Prints: "1568223848 12345678 some_document pdf"
console.log(number1, number2, filename, extension);

I like simplicity...
If you ever need the timestamp and the id, they're in [1] and [2].
var getFilename = function(str) {
return str.match(/(\d+)_(\d+)_(.*)/)[3];
}
var f = getFilename("1568223848_12345678_some_document.pdf");
console.log(f)

If file names always come in the format timestamp_id_filename, you can use a regular expression that skips the first two '_' and captures the rest.
Test:
var filename = '1568223848_12345678_some_document.pdf';
console.log(filename.match(/[^_]+_[^_]+_(.*)/)[1]); // result: 'some_document.pdf'
Explanation:
/[^_]+_[^_]+_(.*)/
[^_]+ : take one or more characters other than '_'
_ : take the '_' character
Repeat, so that two '_' are skipped
(.*) : save the remaining characters in a group
match method: returns an array whose first element is the full match; the following elements are the captured groups.

Split the file name string into an array on underscores.
Discard the first two elements of the array.
Join the rest of the array with underscores.
Now you have your file name.

Match only # and not ## without negative lookbehind

Using JavaScript, I need a regex that matches any instance of #{this-format} in any string. My original regex was the following:
#{[a-z-]*}
However, I also need a way to "escape" those instances. I want it so that if you add an extra #, the match gets escaped, like ##{this}.
I originally used a negative lookbehind:
(?<!#)#{[a-z-]*}
And that would work just fine, except... lookbehinds are an ECMAScript2018 feature, only supported by Chrome.
I read some people suggesting the usage of a negated character set. So my little regex became this:
(?:^|[^#])#{[a-z-]*}
...which would have worked just as well, except it doesn't work if you put two of these together: #{foo}#{bar}
So, does anyone know how I can achieve this? Remember that these conditions need to be met:
Find #{this} anywhere in a string
Be able to escape like ##{this}
Be able to put multiple adjacent, like #{these}#{two}
Lookbehinds must not be used

If you include ## in your regex pattern as an alternate match option, it will consume the ## instead of allowing a match on the subsequent bracketed entity. Like this:
##|(#{[a-z-]*})
You can then evaluate the inner match object in javascript. Here is a jsfiddle to demonstrate, using the following code.
var targetText = '#{foo} in a #{bar} for a ##{foo} and #{foo}#{bar} things.'
var reg = /##|(#{[a-z-]*})/g;
var result;
while((result = reg.exec(targetText)) !== null) {
if (result[1] !== undefined) {
alert(result[1]);
}
}
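The same alternation also works with replace, if the goal is to substitute the placeholders rather than list them. The sketch below is only an illustration: fillTemplate and the values object are hypothetical names, the capture group is narrowed to the name inside the braces, and turning ## into a single # is an assumption about how the escape should be rendered.

```javascript
// Hypothetical helper: fills #{name} placeholders from a plain object.
function fillTemplate(template, values) {
  return template.replace(/##|#\{([a-z-]*)\}/g, (whole, key) => {
    if (whole === '##') {
      // Escaped marker: consume both characters and emit one literal '#'.
      return '#';
    }
    // Unknown names are left untouched.
    return Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : whole;
  });
}

const filled = fillTemplate('#{foo} and ##{foo} and #{foo}#{bar}', { foo: 'A', bar: 'B' });
console.log(filled); // A and #{foo} and AB
```

Because ## is consumed as its own match, the { that follows it can never start a placeholder, which is exactly the escaping behaviour asked for.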

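You could use (?:^|[^#]) to match the start of the pattern, and capture the following #{<sometext>} in a group. Since you don't want the initial (possible) [^#] to be in the result, you'll have to iterate over the matches manually and extract the group that contains the substring you want. In the code below the whole pattern sits inside a lookahead followed by a single ., so each match consumes only one character and adjacent placeholders are both found. For example: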
function test(str) {
const re = /(?=(?:^|[^#])(#{[a-z-]*}))./g;
let match;
const matches = [];
while (match = re.exec(str)) {
matches.push(match[1]); // extract the captured group
}
return matches;
}
console.log(test('##{this}'))
console.log(test('#{these}#{two}'))

javascript regex insert new element into expression

I am passing a URL to a block of code in which I need to insert a new element into the regex. Pretty sure the regex is valid and the code seems right but no matter what I can't seem to execute the match for regex!
//** Incoming url's
//** url e.g. api/223344
//** api/11aa/page/2017
//** Need to match to the following
//** dir/api/12ab/page/1999
//** Hence the need to add dir at the front
var url = req.url;
//** pass in: /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var re = myregex.toString();
//** Insert dir into regex: /^dir\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var regVar = re.substr(0, 2) + 'dir' + re.substr(2);
var matchedData = url.match(regVar);
matchedData === null ? console.log('NO') : console.log('Yay');
I hope I am just missing the obvious, but can anyone see why the match fails and it always logs NO?
Thanks

Let's break down your regex:
^\/api\/ matches the beginning of the string, followed by exactly the string "/api/".
([a-zA-Z0-9-_~ %]+) is a capturing group: it captures one or more (the +) of the characters listed inside the square brackets, so this section will match, for example, abAB25-_ %
(?:\/page\/([a-zA-Z0-9-_~ %]+)) also groups multiple tokens together, but does not create a capturing group like the one above (the ?: makes it non-capturing). It first matches exactly the string "/page/", followed by a capturing group just like the one described above (a-z, A-Z, 0-9, etc.).
?$ is at the end: the ? means match zero or one of the preceding group, which makes the /page/ part optional, and the $ matches the end of the string.
This regex will match this string, for example: /api/abAB25-_ %/page/abAB25-_ %
You may be able to take advantage of capturing groups, however, and use something like this instead to get similar results: ^\/api\/([a-zA-Z0-9-_~ %]+)\/page\/\1?$. Here, \1 refers back to the first capturing group and matches exactly the same text it matched. EDIT: actually, this probably won't work, since the text after /api/ and the text after /page/ will most likely be different. Carrying on...
Afterwards, you are adding "dir" to the beginning of the pattern, so you can now match something like this: dir/api/abAB25-_ %/page/abAB25-_ %
You have also converted the regex to a string, so, as Crayon Violent pointed out in their comment, this will break your expected functionality: toString() includes the surrounding / delimiters, and a string passed to .match() is compiled as a pattern with those slashes taken literally. You can fix this by starting from .source on your regex instead of toString(), inserting "dir" there, and passing the result to new RegExp(): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/source
Now you can properly match a string like this: dir/api/11aa/page/2017 (see this example: https://repl.it/Mj8h)
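One way the .source approach could look is sketched below. The helper name withPrefix is hypothetical, and the sketch assumes the incoming pattern starts with ^, as the one in the question does.

```javascript
const myregex = /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/;

// Hypothetical helper: inserts a prefix right after the leading ^ of an anchored pattern.
function withPrefix(regex, prefix) {
  if (!regex.source.startsWith('^')) {
    throw new Error('expected a pattern anchored with ^');
  }
  // regex.source has no surrounding / delimiters, so it can be passed to new RegExp directly.
  return new RegExp('^' + prefix + regex.source.slice(1), regex.flags);
}

const prefixed = withPrefix(myregex, 'dir');
const matchedData = 'dir/api/11aa/page/2017'.match(prefixed);

console.log(matchedData === null ? 'NO' : 'Yay');
console.log(matchedData[1], matchedData[2]);
```

The prefix is inserted as pattern text, so a prefix containing regex metacharacters would need escaping first.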

As mentioned by Crayon Violent in the comments, it seems you're passing a string rather than a regular expression to the .match() function. Maybe try the following:
url.match(new RegExp(regVar, "i"));
to convert the string to a regular expression. The "i" flag ignores case; I don't know whether that's what you want. Note that the string must not include the leading and trailing / that toString() adds. Learn more here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp

Regex: Getting content from URL

I want to get "the-game" using regex from URLs like
http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/
http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/
http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/

What parts of the URL could vary and what parts are constant? The following regex will always match whatever is in the slashes following "/en/" - the-game in your example.
(?<=/en/).*?(?=/)
This one will match the contents of the 2nd set of slashes of any URL containing "webdev", assuming the first set of slashes contains a 2 or 3 character language code.
(?<=.*?webdev.*?/.{2,3}/).*?(?=/)
Hopefully you can tweak these examples to accomplish what you're looking for.

// subject is the URL string to search
var myregexp = /^(?:[^\/]*\/){4}([^\/]+)/;
var match = myregexp.exec(subject);
var result;
if (match != null) {
result = match[1];
} else {
result = "";
}
This matches whatever lies between the fourth and fifth slash and stores it in the variable result.

You probably should use some kind of url parsing library rather than resorting to using regex.
In Python:
# Python 3; in Python 2 the import was "from urlparse import urlparse"
from urllib.parse import urlparse
url = urlparse('http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/')
print(url.path)
Which would yield:
/en/the-game/another-one/another-one/another-one/
From there, you can do simple things like stripping /en/ from the beginning of the path. Otherwise, you're bound to do something wrong with a regular expression. Don't reinvent the wheel!
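The same idea carries over to JavaScript, where the built-in URL class (available in browsers and Node.js) does the parsing. A minimal sketch, assuming the wanted segment is always the one right after the language code; the variable names are illustrative:

```javascript
const url = new URL('http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/');

// pathname is '/en/the-game/another-one/another-one/'; drop the empty pieces around the slashes.
const segments = url.pathname.split('/').filter((part) => part !== '');

// segments[0] is the language code, segments[1] is the part that follows it.
const afterLanguage = segments[1];
console.log(afterLanguage); // the-game
```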
