Parsing file names with javascript


I have file names like the following:
SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT
SEM_VSE_SECURITY_555001881_181002_1559_37072093.DAT
SEM_VSE_MEDICALCONDEMERGENCIES_555001881_181002_1559_37072093.DAT
SEM_REASONS_555001881_181002_1414_37072093.DAT
SEM_PSE_NPI_SECURITY_555001881_181002_1412_37072093.DAT
and I need to strip the numbers from the end. This will happen daily and the numbers will change. I HAVE to do it in javascript. The problem is, I know really nothing about javascript. I've looked at both split and slice and I'm not sure either will work. These files come from a government entity, which means the file name will probably not be consistent.
expected output:
SEM_VSE_SKINSHARPS
SEM_VSE_SECURITY
SEM_VSE_MEDICALCONDEMERGENCIES
SEM_REASONS
SEM_PSE_NPI_SECURITY
Any help is greatly appreciated.

This is a good use case for regular expressions. For example,
var oldFileName = 'SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT',
newFileName;
newFileName = oldFileName.replace(/[_0-9]+(?=\.DAT$)/, ''); // SEM_VSE_SKINSHARPS.DAT
This says to replace as many characters as it can from the set _ and 0-9, with the requirement that the replaced portion must be followed by .DAT and the end of the string. The dot is escaped as \. so that it matches a literal period rather than any character.
If you want to strip the .DAT as well, use /[_0-9]+\.DAT$/ as the regular expression instead of the one above.
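Applied to the whole list from the question, the idea could look like the minimal sketch below. The helper name stripTrailingNumbers is hypothetical, and the sketch assumes every name ends in a run of digits and underscores followed by .DAT (matched case-insensitively here, which is an added assumption).

```javascript
// Hypothetical helper (not from the original answer): removes the trailing
// run of underscores/digits together with the .DAT extension.
function stripTrailingNumbers(fileName) {
  // [_0-9]+  one or more underscores or digits
  // \.DAT$   a literal ".DAT" at the very end; the i flag also accepts ".dat"
  return fileName.replace(/[_0-9]+\.DAT$/i, '');
}

const names = [
  'SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT',
  'SEM_REASONS_555001881_181002_1414_37072093.DAT',
  'SEM_PSE_NPI_SECURITY_555001881_181002_1412_37072093.DAT'
];

const stripped = names.map(stripTrailingNumbers);
console.log(stripped);
```

A name that does not fit the assumed pattern is returned unchanged, which makes unexpected input easy to spot.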

If all the files end in .XYZ and follow the given pattern, this might also work:
var filename = "SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT"
filename.slice(0,-4).split("_").filter(x => !+x).join("_")
results in:
"SEM_VSE_SKINSHARPS"
This is how it works:
drop the last 4 chars (.DAT)
split by _
filter out the numbers
join what is remaining with another _
You can also create a function out of this solution (or the other ones) and use it to process all the files provided they are in an array:
var fileTrimmer = filename => filename.slice(0,-4).split("_").filter(x => !+x).join("_")
var result = array_of_filenames.map(fileTrimmer)
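One caveat with the !+x test: it relies on numeric coercion, so a segment such as "0" (which coerces to 0) is kept, while segments like "1e5" or "0x10" are dropped. A stricter variant, sketched below under the assumption that only all-digit segments should be removed, tests each segment with a regular expression and finds the extension with lastIndexOf instead of assuming it is four characters long. The name trimNumericParts is hypothetical.

```javascript
// Hypothetical variant of the split/filter/join approach above.
function trimNumericParts(fileName) {
  // Drop the extension, if there is one, whatever its length.
  const dot = fileName.lastIndexOf('.');
  const base = dot === -1 ? fileName : fileName.slice(0, dot);

  return base
    .split('_')                           // break into segments
    .filter(part => !/^\d+$/.test(part))  // drop segments made only of digits
    .join('_');                           // put the rest back together
}

console.log(trimNumericParts('SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT'));
```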

Below is a solution that assumes your file name strings are stored in an array. It builds a new array of formatted file names by calling Array.prototype.map on the original array. The map callback first grabs the extension so that it can be appended to the file name later. Next, it splits the fileName string into an array on the _ character. Finally, the filter callback returns true for every part that contains no digit, which keeps that part in the new array; for a part that contains a digit it returns false, and that part is dropped.
var fileNames = ['SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT', 'SEM_VSE_SECURITY_555001881_181002_1559_37072093.DAT', 'SEM_VSE_MEDICALCONDEMERGENCIES_555001881_181002_1559_37072093.DAT', 'SEM_REASONS_555001881_181002_1414_37072093.DAT', 'SEM_PSE_NPI_SECURITY_555001881_181002_1412_37072093.DAT'];
var formattedFileNames = fileNames.map(fileName => {
var ext = fileName.substring(fileName.indexOf('.'), fileName.length);
var parts = fileName.split('_');
return parts.filter(part => !part.match(/[0-9]/g)).join('_') + ext;
});
console.log(formattedFileNames);

Related

get last 6 characters of a match (javascript regex)

I am trying to parse txt files with js + regex and my problem is as follows:
I have multiple txt files, and inside each one I need to search for an Id, made by 6 characters (numb + letters)
this is the string inside one of those files:
**IFCPROPERTYSINGLEVALUE('codice sito',$,IFCTEXT('I013FR'),$);**
I need to extract the I013FR only, and so far the closest js-regex I wrote is:
(codice sito',\$,IFCTEXT\('[a-zA-Z\d]{6})
using that, I get in return:
codice sito',$,IFCTEXT('I372TO
now I need to "add something" at the end of the regex, in order to only take the last 6 characters from the match.
Is that possible? am I on the right way? or maybe there is another better way to do that?

To extract the sequence of symbols, you need to put it in parentheses. This pattern is called a "capturing group".
/codice sito',\$,IFCTEXT\('([a-zA-Z\d]{6})/g
And then you can get your id using RegExp.exec() method.
const str = "**IFCPROPERTYSINGLEVALUE('codice sito',$,IFCTEXT('I013FR'),$);**";
const regex = /codice sito',\$,IFCTEXT\('([a-zA-Z\d]{6})/g;
const id = regex.exec(str)[1];
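Since the asker is scanning several files, note that exec() returns null when the pattern is not found, so indexing [1] directly would throw for a file without the id. A minimal sketch with a guard follows; the names SITE_CODE_RE and extractSiteCode are hypothetical, and requiring the closing quote right after the six characters is an added assumption.

```javascript
// No g flag: exec() on a non-global regex always searches from the start,
// so the same regex object can be reused safely for every file.
const SITE_CODE_RE = /codice sito',\$,IFCTEXT\('([a-zA-Z\d]{6})'/;

// Returns the 6-character id, or null when the text does not contain one.
function extractSiteCode(text) {
  const match = SITE_CODE_RE.exec(text);
  return match === null ? null : match[1];
}

const line = "**IFCPROPERTYSINGLEVALUE('codice sito',$,IFCTEXT('I013FR'),$);**";
console.log(extractSiteCode(line));
console.log(extractSiteCode('no id in this text'));
```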

How can I cut the string after a second underscore?

I'm receiving a list of files in an object and I just need to display a file name and its type in a table.
All files come back from a server in such format: timestamp_id_filename.
Example: 1568223848_12345678_some_document.pdf
I wrote a helper function which cuts the string.
At first, I did it with String.prototype.split() method, I used regex, but then again - there was a problem. Files can have underscores in their names so that didn't work, so I needed something else. I couldn't come up with a better idea. I think it looks really dumb and it's been haunting me the whole day.
The function looks like this:
const shortenString = (attachmentName) => {
const file = attachmentName
.slice(attachmentName.indexOf('_') + 1)
.slice(attachmentName.slice(attachmentName.indexOf('_') + 1).indexOf('_') + 1);
const fileName = file.slice(0, file.lastIndexOf('.'));
const fileType = file.slice(file.lastIndexOf('.'));
return [fileName, fileType];
};
I wonder if there is a more elegant way to solve the problem without using loops.

You can use replace and split. The pattern removes everything up to and including the second _ from the start of the string, and then we split on . to get the name and type.
let nameAndType = (str) => {
let replaced = str.replace(/^(?:[^_]*_){2}/g, '')
let splited = replaced.split('.')
let type = splited.pop()
let name = splited.join('.')
return {name,type}
}
console.log(nameAndType("1568223848_12345678_some_document.pdf"))
console.log(nameAndType("1568223848_12345678_some_document.xyz.pdf"))

function splitString(val){
return val.split('_').slice(2).join('_');
}

const getShortString = (str) => str.replace(/^(?:[^_]*_){2}/g, '')
For input like
1568223848_12345678_some_document.pdf, it should give you something like some_document.pdf

const re = /(.*?)_(.*?)_(.*)/;
const name = "1568223848_12345678_some_document.pdf";
const [, date, id, filename] = re.exec(name);
console.log(date);
console.log(id);
console.log(filename);
some notes:
you want to make the regular expression 1 time. If you do this
function getParts(str) {
const re = /expression/;
...
}
Then you're making a new regular expression object every time you call getParts.
.*? can be faster than .* here
This is because .* is greedy: as soon as the regular expression engine reaches it, it puts the entire rest of the string into that slot and then checks whether it can continue the expression. If that fails it backs off one character. If that fails it backs off another character, and so on. .*? on the other hand is satisfied as soon as possible: it adds one character and then checks whether the next part of the expression works; if not, it adds one more character and checks again, and so on.
splitting on '_' works but it could potentially make many temporary strings
for example if the filename is 1234_1343_a________________________.pdf
you'd have to test to see if using a regular expression is faster or slower than splitting, assuming speed matters.
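The first note can be sketched as follows: the regular expression is created once, outside the function, and reused on every call. The names PARTS_RE and getParts are illustrative, and the ^ and $ anchors plus the null check for names that do not fit the format are additions not present in the snippet above.

```javascript
// Created once, when the script loads, instead of on every call.
const PARTS_RE = /^(.*?)_(.*?)_(.*)$/;

// Returns { date, id, filename }, or null if the name has fewer than two underscores.
function getParts(str) {
  const match = PARTS_RE.exec(str);
  if (match === null) {
    return null;
  }
  const [, date, id, filename] = match;
  return { date, id, filename };
}

const parts = getParts('1568223848_12345678_some_document.pdf');
console.log(parts);
```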

You can kinda chain .indexOf to get second offset and any further, although more than two would look ugly. The reason is that indexOf takes start index as second argument, so passing index of the first occurrence will help you find the second one:
var secondUnderscoreIndex = name.indexOf("_",name.indexOf("_")+1);
So my solution would be:
var index = name.indexOf("_",name.indexOf("_")+1);
var [timestamp, name] = [name.substring(0, index), name.substring(index+1)];
Alternatively, using regular expression:
var [,number1, number2, filename, extension] = /([0-9]+)_([0-9]+)_(.*?)\.([0-9a-z]+)/i.exec(name)
// Prints: "1568223848 12345678 some_document pdf"
console.log(number1, number2, filename, extension);

I like simplicity...
If you ever need the timestamp and the id, they're in [1] and [2]
var getFilename = function(str) {
return str.match(/(\d+)_(\d+)_(.*)/)[3];
}
var f = getFilename("1568223848_12345678_some_document.pdf");
console.log(f)

If the file names always come in the format timestamp_id_filename, you can use a regular expression that skips the first two '_' characters and captures what follows.
test:
var filename = '1568223848_12345678_some_document.pdf';
console.log(filename.match(/[^_]+_[^_]+_(.*)/)[1]); // result: 'some_document.pdf'
Explanation:
/[^_]+_[^_]+_(.*)/
[^_]+ : take one or more characters other than '_'
_ : take the '_' character
Repeat, so two '_' are skipped
(.*) : save the remaining characters in a group
match method: returns an array; its first element is the text that matched the whole expression, and the following elements are the saved groups.

Split the file name string into an array on underscores.
Discard the first two elements of the array.
Join the rest of the array with underscores.
Now you have your file name.
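The steps above can be sketched as below. The function name splitAttachmentName is hypothetical, and the final split into name and type is an addition that mirrors the [fileName, fileType] pair returned by the asker's original helper; it assumes the file has an extension.

```javascript
// Hypothetical helper: "timestamp_id_some_name.ext" -> ["some_name", ".ext"]
function splitAttachmentName(attachmentName) {
  const file = attachmentName
    .split('_')   // split on underscores
    .slice(2)     // discard the timestamp and the id
    .join('_');   // restore underscores that belong to the file name

  const dot = file.lastIndexOf('.');
  return [file.slice(0, dot), file.slice(dot)];
}

console.log(splitAttachmentName('1568223848_12345678_some_document.pdf'));
```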

Regex one-liner for splitting string at nth character where n is a variable length

I've found a few similar questions, but none of them are clean one-liners, which I feel should be possible. I want to split a string at the last instance of specific character (in my case .).
var img = $('body').attr('data-bg-img-url'); // the string http://sub.foo.com/img/my-img.jpg
var finalChar = img.split( img.split(/[.]+/).length-1 ); // returns int 3 in above string example
var dynamicRegex = '/[.$`finalChar`]/';
I know I'm breaking some rules here, wondering if someone smarter than me knows the correct way to put that together and compress it?
EDIT - The end goal here is to split and store http://sub.foo.com/img/my-img and .jpg as separate strings.

In regex, .* is greedy, meaning it will match as much as possible. Therefore, if you want to match up to the last ., you could do:
/^.*\./
And from the looks, you are trying to get the file extension, so you would want to add capture:
var result = /^.*\.(.*)$/.exec( str );
var extension = result[1];
And for both parts:
var result = /^(.*)\.(.*)$/.exec( str );
var path = result[1];
var extension = result[2];

You can use the lastIndexOf() method on the period and then use the substring method to obtain the first and second string. The split() method is better used in a foreach scenario where you want to split at all instances. Substring is preferable for these types of cases where you are breaking at a single instance of the string.
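A minimal sketch of that approach, using the URL from the question. The name splitAtLastDot is hypothetical, and returning an empty extension when there is no period is an assumption about the desired behavior.

```javascript
// Hypothetical helper: splits a string at its last period.
function splitAtLastDot(str) {
  const index = str.lastIndexOf('.');
  if (index === -1) {
    return [str, ''];  // no period at all
  }
  // The first part stops before the period; the second part keeps it.
  return [str.substring(0, index), str.substring(index)];
}

const [path, extension] = splitAtLastDot('http://sub.foo.com/img/my-img.jpg');
console.log(path);
console.log(extension);
```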

string replace with jquery assistance

I have a string like this
"/folder1/folder2/folder3/IMG_123456_PP.jpg"
I want to use JavaScript / jQuery to replace the 123456 in the above string with 987654. The entire string is dynamic, so I can't do a simple string replace. For example, the string could also be
"/folder1/folder2/folder3/IMG_143556_TT.jpg"
"/folder1/folder2/folder3/IMG_1232346_RR.jpg"
Any tips on this?

"/folder1/folder2/folder3/IMG_123456_PP.jpg".replace(/\_\d{2,}/,'_987654');
Edit :
"/fo1/fo2/fol3/IMG_123456fgf_PP.jpg".replace(/\_\d{2,}[A-Za-z]*/,'_987654');

I am sure there is a better way to do this, but if you are trying to always replace the numbers of that file regardless of what they may be you could use a combination of splits/joins like this:
str = "/folder1/folder2/folder3/IMG_143556_TT.jpg" //store image src in string
strAry = str.split('/') //split up the string by folders and file (as last array position) into array.
lastPos = strAry.length-1; //find the index of the last array position (the file name)
fileNameAry = strAry[lastPos].split('_'); //take the file name and split it into an array based on the underscores.
fileNameAry[1] = '987654'; //rename the part of the file name you want to rename.
strAry[lastPos] = fileNameAry.join('_'); //rejoin the file name array back into a string and over write the old file name in the original string array.
newStr = strAry.join('/'); //rejoin the original string array back into a string.
What this does is let you change the file name based on the string's structure, regardless of the directory or the original name of the file. So as long as the file naming convention stays the same (with underscores), this script will work.
please excuse my vocab, I know it's not very good heh.

Use a regular expression
var str = '/folder1/folder2/folder3/IMG_123456_PP.jpg';
var newstr = str.replace(/(img_)(\d+)(?=_)/gi,function($0, $1){
return $1 ? $1 + '987654' : $0;
});
example at http://www.jsfiddle.net/MZXhd/
Perhaps more comprehensible is
var str = '/folder1/folder2/folder3/IMG_123456_PP.jpg';
var replacewith = '987654';
var newstr = str.replace(/(img_)(\d+)(?=_)/gi,'$1'+replacewith);
example at http://www.jsfiddle.net/CXAq6/
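When the replacement is built by concatenating '$1' with a value that itself starts with digits, the resulting string (here '$1987654') depends on how the engine reads the $ sequence. Passing a function as the replacement sidesteps that question. The sketch below also limits the match to the file name by requiring that no slash follows; the name replaceImageNumber and that extra restriction are assumptions, not part of the answer above.

```javascript
// Hypothetical helper: swaps the digits after "IMG_" in the last path segment.
function replaceImageNumber(path, newNumber) {
  // (IMG_)           keep the prefix
  // \d+              the digits to replace
  // (?=_[^\/]*$)     must be followed by "_" and no further "/" up to the end
  return path.replace(/(IMG_)\d+(?=_[^\/]*$)/i, (match, prefix) => prefix + newNumber);
}

console.log(replaceImageNumber('/folder1/folder2/folder3/IMG_123456_PP.jpg', '987654'));
console.log(replaceImageNumber('/folder1/folder2/folder3/IMG_1232346_RR.jpg', '987654'));
```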

Regex to extract substring, returning 2 results for some reason

I need to do a lot of regex things in javascript but am having some issues with the syntax and I can't seem to find a definitive resource on this.. for some reason when I do:
var tesst = "afskfsd33j"
var test = tesst.match(/a(.*)j/);
alert (test)
it shows
"afskfsd33j, fskfsd33"
I'm not sure why it's giving this output of the original and the matched string. I am wondering how I can get it to give just the match (essentially extracting the part I want from the original string).
Thanks for any advice

match returns an array.
The default string representation of an array in JavaScript is the elements of the array separated by commas. In this case the desired result is in the second element of the array:
var tesst = "afskfsd33j"
var test = tesst.match(/a(.*)j/);
alert (test[1]);

Each group defined by parentheses () is captured during processing, and each captured group's content is pushed into the result array in the same order as the groups start within the pattern. See more at http://www.regular-expressions.info/brackets.html and http://www.regular-expressions.info/refcapture.html (choose the right language to see the supported features)
var source = "afskfsd33j"
var result = source.match(/a(.*)j/);
result: ["afskfsd33j", "fskfsd33"]
The reason why you received this exact result is the following:
The first value in the array is the first string found that matches the entire pattern. So it starts with "a", followed by any number of any characters, and ends with the last "j" char after the starting "a" (.* is greedy).
The second value in the array is the captured group defined by the parentheses. In your case the group contains the entire pattern match without the content defined outside the parentheses, so exactly "fskfsd33".
If you want to get rid of the second value in the array you may define the pattern like this:
/a(?:.*)j/
where "?:" means that the chars matched by the content in parentheses will not be part of the resulting array.
Another option in this simple case is to write the pattern without any group, because it is not necessary to use a group at all:
/a.*j/
If you just want to check whether the source text matches the pattern and do not care about which text it found, then you may try:
var result = /a.*j/.test(source);
The result will then be only true or false. For more info see http://www.javascriptkit.com/javatutors/re3.shtml
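The difference between the three patterns above can be checked directly; this is a small sketch using the string from the question, with illustrative variable names.

```javascript
const source = "afskfsd33j";

// Capturing group: full match plus one captured group.
const capturing = source.match(/a(.*)j/);
console.log(capturing.length, capturing[0], capturing[1]);

// Non-capturing group: only the full match is in the array.
const nonCapturing = source.match(/a(?:.*)j/);
console.log(nonCapturing.length, nonCapturing[0]);

// test() only reports whether the pattern matches.
const matched = /a.*j/.test(source);
console.log(matched);
```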

I think your problem is that the match method is returning an array. The 0th item in the array is the full text that matched, and the 1st through nth items correspond to the 1st through nth matched parenthesised items. Your "alert()" call is showing the entire array.

Just get rid of the parentheses and that will give you an array with one element:
Change this line
var test = tesst.match(/a(.*)j/);
To this
var test = tesst.match(/a.*j/);
If you add parentheses, the match() function will return two items: one for the whole expression and one for the expression inside the parentheses.
Also according to developer.mozilla.org docs :
If you only want the first match found, you might want to use
RegExp.exec() instead.
You can use the below code:
RegExp(/a.*j/).exec("afskfsd33j")

I've just had the same problem.
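You get the text twice in your result when a match group (in brackets) wraps the whole pattern: once as the full match and once as the group. With the 'g' (global) modifier, match(reg) returns only the full matches and leaves the groups out.
The first item is always the full match, which is normally all you need when using match(reg) on a short string. However, when using a construct like: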
while ((result = reg.exec(string)) !== null){
console.log(result);
}
the results are a little different.
Try the following code:
var regEx = new RegExp('([0-9]+ (cat|fish))','g'), sampleString="1 cat and 2 fish";
var result = sampleString.match(regEx);
console.log(JSON.stringify(result));
// ["1 cat","2 fish"]
var reg = new RegExp('[0-9]+ (cat|fish)','g'), sampleString="1 cat and 2 fish";
while ((result = reg.exec(sampleString)) !== null) {
console.dir(JSON.stringify(result))
};
// '["1 cat","cat"]'
// '["2 fish","fish"]'
var reg = new RegExp('([0-9]+ (cat|fish))','g'), sampleString="1 cat and 2 fish";
while ((result = reg.exec(sampleString)) !== null){
console.dir(JSON.stringify(result))
};
// '["1 cat","1 cat","cat"]'
// '["2 fish","2 fish","fish"]'
(tested on recent V8 - Chrome, Node.js)
The best answer is currently a comment which I can't upvote, so credit to #Mic.
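As an alternative to the exec() loop, String.prototype.matchAll (available in current browsers and Node.js) yields one result per match with its groups included. The sketch below is an addition, not part of the answer above; the variable names and the object shape are illustrative.

```javascript
const sampleString = "1 cat and 2 fish";

// matchAll requires the g flag and returns an iterator of match arrays,
// each with the full match at [0] and the groups after it.
const pattern = /([0-9]+) (cat|fish)/g;

const pairs = Array.from(
  sampleString.matchAll(pattern),
  match => ({ count: Number(match[1]), animal: match[2] })
);

console.log(JSON.stringify(pairs));
```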
