Split string with regex with numbers problems - javascript

Having a list of strings like:
Client Potential XSS2Medium
Client HTML5 Insecure Storage41Medium
Client Potential DOM Open Redirect12Low
I would like to split every string into three strings like:
["Client Potential XSS", "2", "Medium"]
I use this regular expression:
/[a-zA-Z ]+|[0-9]+/g
But with strings that contain other numbers, it obviously doesn't work. For example with:
Client HTML5 Insecure Storage41Medium
the result is:
["Client HTML", "5", " Insecure Storage", "41", "Medium"]
I can't find the regex that produces:
["Client HTML5 Insecure Storage", "41", "Medium"]
This regex works on regex101.com:
(.+[ \t][A-z]+)+([0-9]+)+([A-z]+)
Using it in my code:
data.substring(startIndex, endIndex)
.split("\r\n") // Split the vulnerabilities
.filter(item => !item.match(/(-+)Page \([0-9]+\) Break(-+)/g) // Remove page break
&& !item.match(/PAGE [0-9]+ OF [0-9]+/g) // Remove pagination
&& item !== '') // Remove blank strings
.map(v => v.match(/(.+[ \t][A-z]+)+([0-9]+)+([A-z]+)/g));
doesn't work.
Any help would be greatly appreciated!
EDIT:
All strings end with High, Medium or Low.

The problem is with your g global flag.
Remove that flag from this line: .map(v => v.match(/(.+[ \t][A-z]+)+([0-9]+)+([A-z]+)/g)); to make it:
.map(v => v.match(/(.+[ \t][A-z]+)+([0-9]+)+([A-z]+)/));
Also, you could make the regex much simpler, as shown by @bhmahler:
.map(v => v.match(/(.*?)(\d+)(low|medium|high)/i));
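To see why the g flag matters here, a minimal sketch (the variable names are illustrative and not from the question): with g, String.prototype.match returns only the full matches and drops the capture groups; without it, the result carries the groups at indices 1 to 3.

```javascript
const line = 'Client HTML5 Insecure Storage41Medium';

// With the g flag, match() returns every full match and discards the groups.
const withGlobal = line.match(/(.*?)(\d+)(low|medium|high)/gi);
console.log(withGlobal); // [ 'Client HTML5 Insecure Storage41Medium' ]

// Without g, match() returns the full match followed by the capture groups.
const withoutGlobal = line.match(/(.*?)(\d+)(low|medium|high)/i);
console.log(withoutGlobal.slice(1)); // [ 'Client HTML5 Insecure Storage', '41', 'Medium' ]
```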

The following regex should give you what you are looking for.
/(.*?)(\d+)(low|medium|high)/gi
Here is an example https://regex101.com/r/AS9mvf/1
Here is an example of it working with map
var entries = [
'Client Potential XSS2Medium',
'Client HTML5 Insecure Storage41Medium',
'Client Potential DOM Open Redirect12Low'
];
var matches = entries.map(v => {
var result = /(.*?)(\d+)(low|medium|high)/gi.exec(v);
return [
result[1],
result[2],
result[3]
];
});
console.log(matches);

You could use a workaround (that is match vs. capture, then replace):
let strings = ['Client Potential XSS2Medium', 'Client HTML5 Insecure Storage41Medium', 'Client Potential DOM Open Redirect12Low', 'Client HTML5 Insecure Storage41Medium'];
let regex = /(?:HTML5|or_other_string)|(\d+)/g;
strings.forEach(function(string) {
string = string.replace(regex, function(match, g1) {
if (typeof(g1) != "undefined") {
return "###" + g1 + "###";
}
return match;
});
string = string.split("###");
console.log(string);
});
See an additional demo on regex101.com.

let arr = ["Client Potential XSS2Medium",
"Client HTML5 Insecure Storage41Medium",
"Client Potential DOM Open Redirect12Low"];
let re = /^.+[a-zA-Z](?=\d+)|\d+(?=[A-Z])|[^\d]+\w+$/g;
arr.forEach(str => console.log(str.match(re)))
^.+[a-zA-Z](?=\d+) Match beginning of string followed by a-zA-Z followed by one or more digit characters
\d+(?=[A-Z]) Match one or more digit characters followed by uppercase letter character
[^\d]+\w+$ Negate digit characters followed by matching word characters until end of string

Here is one solution that wraps the number before High, Low or Medium in a custom token using String.replace() and then splits the resulting string on that token:
const inputs = [
"Client Potential XSS2High",
"Client HTML5 Insecure Storage41Medium",
"Client Potential DOM Open Redirect12Low"
];
let token = "-#-";
let regexp = /(\d+)(High|Low|Medium)$/;
let res = inputs.map(
x => x.replace(regexp, `${token}$1${token}$2`).split(token)
);
console.log(res);
Another solution is to use this regexp: /^(.*?)(\d+)(High|Low|Medium)$/i
const inputs = [
"Client Potential XSS2High",
"Client HTML5 Insecure Storage41Medium",
"Client Potential DOM Open Redirect12Low"
];
let regexp = /^(.*?)(\d+)(High|Low|Medium)$/i;
let res = inputs.map(
x => x.match(regexp).slice(1)
);
console.log(res);
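Since match() returns null for a line that does not fit the pattern (a leftover pagination line, for instance), calling .slice(1) on it directly would throw. A hedged sketch of a guard around the anchored regex above; parseFinding and the shape of the returned object are hypothetical, not part of the original answer:

```javascript
// Hypothetical helper: parse one report line into its three parts.
function parseFinding(line) {
  const match = line.match(/^(.*?)(\d+)(High|Low|Medium)$/i);
  if (match === null) {
    return null; // the line does not follow the "<name><count><severity>" layout
  }
  const [, name, count, severity] = match;
  return { name, count: Number(count), severity };
}

console.log(parseFinding('Client HTML5 Insecure Storage41Medium'));
console.log(parseFinding('PAGE 1 OF 3')); // null
```

Callers can then drop unparseable lines with .filter(item => item !== null) instead of relying on every line being well formed.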

const text = `Client Potential XSS2Medium
Client HTML5 Insecure Storage41Medium
Client Potential DOM Open Redirect12Low`
const res = text.split("\n").map(el => el.replace(/\d+/g, a => ' ' + a + ' ') );
console.log(res)

Related

regex get by multiple separator?

I want to separate the sentence
hey ! there you are
into
["hey!","there you are"]
in JS. I found that
(?<=\!)
keeps the separator with the preceding element.
But what if I want to apply the same rule to "!!" or "!!!"?
My goal is to split
hey! there!!! you are!!!!
into
["hey!","there!!!", "you are!!!!"]
Is that possible?
I tried \(?<=!+) and \(?<=\+!), but both failed.
I don't even know whether it is possible to match !, !!, ... (any number of !) in one go.
In addition to the elegant solution by The fourth bird, this specific requirement can be met by simply splitting on the regex, (?<=!)\s+|$ which can be explained as "One or more whitespace characters, or end of line, preceded by a !".
const regex = /(?<=!)\s+|$/;
[
"hey! there!!! you are!!!!",
"hey ! there you are"
].forEach(s =>
console.log(
s.split(regex)
.map(s => s.replace(/\s+!/, "!").trim())
)
);
Based on your needs my solution was to first get the exclamations, then get the strings (split by exclamations).
This method creates two arrays, one of exclamations and one of the strings.
Then I just loop over them, concatenate them, and push into a new array.
As a starting point it should be enough; you can always modify it and build on top of it.
const str = 'hey! there!!! you are!!!!';
const exclamations = str.match(/!+/g);
const characters = str.split(/!+/).filter(s => s.trim());
let newArr = [];
for (let i = 0; i < characters.length; i++) {
newArr.push(characters[i].trim() + exclamations[i]);
}
console.log(newArr); // ["hey!","there!!!","you are!!!!"]
You could use split with 2 lookarounds, asserting ! to the left and not ! to the right. If you want to remove the leading whitespace chars before the exclamation mark you could do some sanitizing:
const regex = /(?<=!)(?!!)/g;
[
"hey! there!!! you are!!!!",
"hey ! there you are"
].forEach(s =>
console.log(
s.split(regex)
.map(s => s.replace(/\s+!/, "!").trim())
)
);
Another option could be to match the parts instead of splitting:
[^\s!].*?(?:!(?!!)|$)
See a regex demo.
const regex = /[^\s!].*?(?:!(?!!)|$)/g;
[
"hey! there!!! you are!!!!",
"hey ! there you are"
].forEach(s =>
console.log(s.match(regex))
);
Here is yet another solution using a .split() and .reduce(). This does not use a lookbehind, for those concerned about Safari and other browsers not supporting it:
[
'hey ! there you are',
'hey! there!!! you are!!!!'
].forEach(str => {
let result = str
.split(/( *!+ *)/) // split and keep split pattern because of parenthesis
.filter(Boolean) // filter out empty items
.reduce((acc, val, idx) => {
if(idx % 2) {
// !+ split pattern => combine with previous array item
acc[acc.length - 1] += val.trim();
} else {
acc.push(val);
}
return acc;
}, []);
console.log(str, ' => ', result);
});
Output:
hey ! there you are => [
"hey!",
"there you are"
]
hey! there!!! you are!!!! => [
"hey!",
"there!!!",
"you are!!!!"
]
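If lookbehind support is the concern, one possible approach is to detect it at runtime and fall back to a lookbehind-free pattern. This is only a sketch: supportsLookbehind and splitAfterExclamations are hypothetical names, and the fallback regex is my own variation on the match-based idea above, not taken from any of the answers.

```javascript
// Returns true when the engine can compile a lookbehind assertion.
// An unsupported pattern makes the RegExp constructor throw a SyntaxError.
function supportsLookbehind() {
  try {
    new RegExp('(?<=!)');
    return true;
  } catch (err) {
    return false;
  }
}

// Hypothetical wrapper; the second parameter exists only to make both paths easy to try.
function splitAfterExclamations(text, useLookbehind = supportsLookbehind()) {
  if (useLookbehind) {
    // Built from a string so that older engines can still parse this file.
    const boundary = new RegExp('(?<=!)(?!!)');
    return text.split(boundary).map(part => part.replace(/\s+!/, '!').trim());
  }
  // Fallback: a non-space start, the text up to the next "!", then the whole run of "!".
  const parts = text.match(/[^\s!][^!]*!*/g) || [];
  return parts.map(part => part.replace(/\s+(?=!)/, '').trim());
}

console.log(splitAfterExclamations('hey! there!!! you are!!!!'));
console.log(splitAfterExclamations('hey ! there you are', false));
```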

Regular expression to match environment

I'm using JavaScript and I'm looking for a regex to match the placeholder "environment", which will be a different value like "production" or "development" in "real" strings.
The regex should match "environment" in both strings:
https://company-application-environment.company.local
https://application-environment.company.local
I have tried:
[^-]+$ which matches environment.company.local
\.[^-]+$ which matches .company.local
How do I get environment?
You may use this regex based on a positive lookahead:
/[^.-]+(?=\.[^-]+$)/
Details:
[^.-]+: Match 1+ of any char that is not - and .
(?=\.[^-]+$): Lookahead to assert that we have a dot and 1+ of non-hyphen characters till end.
RegEx Demo
Code:
const urls = [
"https://company-application-environment.company.local",
"https://application-environment.company.local",
"https://application-production.any.thing",
"https://foo-bar-baz-development.any.thing"
]
const regex = /[^.-]+(?=\.[^-]+$)/;
urls.forEach(url =>
console.log(url.match(regex)[0])
)
Not the fanciest reg exp, but gets the job done.
const urls = [
"https://company-application-environment.company.local",
"https://application-environment.company.local",
"https://a-b-c-d-e-f.foo.bar"
]
urls.forEach(url =>
console.log(url.match(/-([^-.]+)\./)[1])
)
As an alternative you might use URL, split on - and get the last item from the array. Then split on a dot and get the first item.
[
"https://company-application-environment.company.local",
"https://application-environment.company.local"
].forEach(s => {
let env = new URL(s).host.split('-').pop().split('.')[0];
console.log(env);
})
Match for known environments
var tests = [
'https://company-application-development.company.local',
'https://application-production.company.local',
'https://appdev.company.local',
'https://appprod.company.local'
];
tests.forEach(test => {
var pattern = /(development|dev|production|prod)/g;
var match = test.match(pattern);
console.log(`environment = ${match}`);
});
In this case, the best way to match is to literally use the word you are looking for.
And if you need to match multiple values in the environment position, use regex alternation (the | operator). See MDN.
(production|development)

Ensure that regex moves to the second OR element only if the first one doesn't exist

I'm trying to match a certain word in a string, and only if it doesn't exist do I want to match another one, using the OR | operator. But the match ignores that order. How can I ensure that this behavior works:
const str = 'Soraka is an ambulance 911'
const regex = RegExp('('+'911'+'|'+'soraka'+')','i')
console.log(str.match(regex)[0]) // should get 911 instead
911 occurs late in the string, whereas Soraka occurs earlier, and the regex engine iterates character-by-character, so Soraka gets matched first, even though it's on the right-hand side of the alternation.
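A small sketch that makes the leftmost-match behaviour visible by printing where the match starts (the variable names are illustrative):

```javascript
const sentence = 'Soraka is an ambulance 911';
const match = sentence.match(/(911|soraka)/i);

// The engine tries every alternative at index 0 before moving on to index 1,
// so the leftmost candidate wins regardless of its place in the alternation.
console.log(match[0], match.index); // Soraka 0
console.log(sentence.indexOf('911')); // 23
```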
One option would be to match Soraka or 911 in captured lookaheads instead, and then with the regex match object, alternate between the two groups to get the one which is not undefined:
const check = (str) => {
const regex = /^(?=.*(911)|.*(Soraka))/;
const match = str.match(regex);
console.log(match[1] || match[2]);
};
check('Soraka is an ambulance 911');
check('foo 911');
check('foo Soraka');
You can use includes and find
Pass the strings in priority order; as soon as find encounters a string that occurs in the original string, it returns that string.
const str = 'Soraka is an ambulance 911'
const findStr = (...arg) => {
return [...arg].find(toCheck => str.includes(toCheck))
}
console.log(findStr("911", "Soraka"))
You can extend findStr to make the match case-insensitive, something like this:
const str = 'Soraka is an ambulance 911'
const findStr = (...arg) => {
return [...arg].find(toCheck => str.toLowerCase().includes(toCheck.toLowerCase()))
}
console.log(findStr("Soraka", "911"))
If you want to match whole words rather than partial words, you can build a dynamic regex and test the string against it:
const str = '911234 Soraka is an ambulance 911'
const findStr = (...arg) => {
return [...arg].find(toCheck =>{
let regex = new RegExp(`\\b${toCheck}\\b`,'i')
return regex.test(str)
})
}
console.log(findStr("911", "Soraka"))
Just use a greedy dot before a capturing group that matches 911 or Soraka:
/.*(911)|(Soraka)/
See the regex demo
The .* (or, if there are line breaks, use /.*(911)|(Soraka)/s in Chrome/Node, or /[^]*(911)|(Soraka)/ to support legacy ECMAScript versions) will ensure the regex index advances to the rightmost position when matching 911 or Soraka.
JS demo (borrowed from @CertainPerformance's answer):
const check = (str) => {
const regex = /.*(911)|(Soraka)/;
const match = str.match(regex) || ["","NO MATCH","NO MATCH"];
console.log(match[1] || match[2]);
};
check('Soraka is an ambulance 911');
check('Ambulance 911, Soraka');
check('foo 911');
check('foo Soraka');
check('foo oops!');

Regex optimization and best practice

I need to parse information out from a legacy interface. We do not have the ability to update the legacy message. I'm not very proficient at regular expressions, but I managed to write one that does what I want it to do. I just need peer-review and feedback to make sure it's clean.
The message from the legacy system returns values resembling the example below.
%name0=value
%name1=value
%name2=value
Expression: /\%(.*)\=(.*)/g;
var strBody = body_text.toString();
var myRegexp = /\%(.*)\=(.*)/g;
var match = myRegexp.exec(strBody);
var objPair = {};
while (match != null) {
if (match[1]) {
objPair[match[1].toLowerCase()] = match[2];
}
match = myRegexp.exec(strBody);
}
This code works, and I can add partial matches in the middle of the name/values without anything breaking. I have to assume that any combination of characters could appear in the "values" match, meaning it could contain equals and percent signs.
Is this clean enough?
Is there something that could break the expression?
First of all, don't escape characters that don't need escaping: %(.*)=(.*)
The problem with your expression: an equals sign in the value would break your parser. Because the first .* is greedy, %name0=val=ue would be split into the name name0=val and the value ue, instead of the name name0 and the value val=ue.
One possible fix is to make the first repetition lazy by appending a question mark: %(.*?)=(.*)
But this is not optimal due to unneeded backtracking. You can do better by using a negated character class: %([^=]*)=(.*)
And finally, if empty names should not be allowed, replace the first asterisk with a plus: %([^=]+)=(.*)
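Putting the steps together, a minimal sketch of a parser built around the final pattern. parseLegacyMessage is a hypothetical name, and the sketch assumes one pair per line, which is why the name class also excludes line breaks and the pattern is anchored with ^ and $ under the m flag:

```javascript
// Hypothetical helper built on the suggested pattern. Assumes one pair per line.
function parseLegacyMessage(body) {
  const pairs = {};
  // ^ and $ with the m flag keep each match on a single line;
  // [^=\r\n]+ stops the name at the first "=", (.*) takes the rest of the line.
  const pairPattern = /^%([^=\r\n]+)=(.*)$/gm;
  for (const [, name, value] of body.matchAll(pairPattern)) {
    pairs[name.toLowerCase()] = value;
  }
  return pairs;
}

console.log(parseLegacyMessage('%Name0=val=ue\n%name1=50%\n%name2='));
// { name0: 'val=ue', name1: '50%', name2: '' }
```

String.prototype.matchAll requires the g flag on the pattern and avoids the manual exec loop from the question.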
This is a good resource: Regex Tutorial - Repetition with Star and Plus
Your expression is fine, and wrapping it with two capturing groups is simple to get your desired variables and values.
You likely may not need to escape some chars and it would still work.
You can use this tool and test/edit/modify/change your expressions if you wish:
%(.+)=(.+)
Since your data is pretty structured, you can also do so with string split and get the same desired outputs, if you want.
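A rough sketch of that split-based alternative, assuming one %name=value pair per line; parsePairsWithSplit is a hypothetical name. Cutting at the first "=" keeps any later equals signs inside the value:

```javascript
function parsePairsWithSplit(body) {
  const pairs = {};
  for (const line of body.split(/\r?\n/)) {
    const separator = line.indexOf('=');
    // Skip lines that do not look like "%name=value" (no "%", no "=", or an empty name).
    if (!line.startsWith('%') || separator < 2) continue;
    pairs[line.slice(1, separator).toLowerCase()] = line.slice(separator + 1);
  }
  return pairs;
}

console.log(parsePairsWithSplit('%name0=value\n%name1=a=b\nnoise'));
// { name0: 'value', name1: 'a=b' }
```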
JavaScript Test
const regex = /%(.+)=(.+)/gm;
const str = `%name0=value
%name1=value
%name2=value`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Performance Test
This JavaScript snippet shows the performance of that expression using a simple 1-million times for loop.
const repeat = 1000000;
const start = Date.now();
for (var i = repeat; i >= 0; i--) {
const string = '%name0=value';
const regex = /(%(.+)=(.+))/gm;
var match = string.replace(regex, "\nGroup #1: $1 \n Group #2: $2 \n Group #3: $3 \n");
}
const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

Get first letter of each word in a string, in JavaScript

How would you go about collecting the first letter of each word in a string, so as to get an abbreviation?
Input: "Java Script Object Notation"
Output: "JSON"
I think what you're looking for is the acronym of a supplied string.
var str = "Java Script Object Notation";
var matches = str.match(/\b(\w)/g); // ['J','S','O','N']
var acronym = matches.join(''); // JSON
console.log(acronym)
Note: this will fail for hyphenated or apostrophe'd words: "Help-me I'm Dieing" will become HmImD. If that's not what you want, the split-on-space, grab-first-letter approach might be what you want.
Here's a quick example of that:
let str = "Java Script Object Notation";
let acronym = str.split(/\s/).reduce((response, word) => response + word.slice(0, 1), '')
console.log(acronym);
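To make the difference between the two approaches concrete, a short sketch that runs both on the phrase from the note above (the variable names are illustrative):

```javascript
const phrase = "Help-me I'm Dieing";

// \b also fires after "-" and "'", so letters inside the words are picked up.
const byBoundary = phrase.match(/\b\w/g).join('');
console.log(byBoundary); // HmImD

// Splitting on whitespace treats "Help-me" and "I'm" as single words.
const bySplit = phrase.split(/\s+/).map(word => word.charAt(0)).join('');
console.log(bySplit); // HID
```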
I think you can do this with
'Aa Bb'.match(/\b\w/g).join('')
Explanation: Obtain all /g the alphanumeric characters \w that occur after a non-alphanumeric character (i.e: after a word boundary \b), put them on an array with .match() and join everything in a single string .join('')
Depending on what you want to do you can also consider simply selecting all the uppercase characters:
'JavaScript Object Notation'.match(/[A-Z]/g).join('')
Easiest way without regex
var abbr = "Java Script Object Notation".split(' ').map(function(item){return item[0]}).join('');
This is made very simple with ES6
string.split(' ').map(i => i.charAt(0)) //Inherit case of each letter
string.split(' ').map(i => i.charAt(0).toUpperCase()) //Uppercase each letter
string.split(' ').map(i => i.charAt(0).toLowerCase()) //Lowercase each letter
This ONLY works with spaces or whatever is defined in the .split(' ') method
ie, .split(', ') .split('; '), etc.
string.split(' ') .map(i => i.charAt(0)) .toString() .toUpperCase().split(',')
To add to the great examples, you could do it like this in ES6
const x = "Java Script Object Notation".split(' ').map(x => x[0]).join('');
console.log(x); // JSON
and this works too but please ignore it, I went a bit nuts here :-)
const [j,s,o,n] = "Java Script Object Notation".split(' ').map(x => x[0]);
console.log(`${j}${s}${o}${n}`);
A flaw in @BotNet's answer: for input such as
==> I'm a an animal
the word boundary \b also catches the m of I'm.
After an excruciating three days of regular expression tutorials, I think I solved it by matching a letter that follows whitespace or the start of the string; it seems to work for me:
/(\s|^)([a-z])/gi
Try -
var text = '';
var arr = "Java Script Object Notation".split(' ');
for (var i = 0; i < arr.length; i++) {
text += arr[i].substr(0,1)
}
alert(text);
Demo - http://jsfiddle.net/r2maQ/
Using map (from functional programming)
'use strict';
function acronym(words)
{
if (!words) { return ''; }
var first_letter = function(x){ if (x) { return x[0]; } else { return ''; }};
return words.split(' ').map(first_letter).join('');
}
Alternative 1:
you can also use this regex to return an array of the first letter of every word
/(?<=(\s|^))[a-z]/gi
(?<=(\s|^)) is a positive lookbehind, which makes sure the matched letter is preceded by (\s|^), that is, by whitespace or the start of the string.
so, for your case:
// in case the input is lowercase & there's a word with apostrophe
const toAbbr = (str) => {
return str.match(/(?<=(\s|^))[a-z]/gi)
.join('')
.toUpperCase();
};
toAbbr("java script object notation"); //result JSON
(by the way, there are also negative lookbehind, positive lookahead, negative lookahead, if you want to learn more)
Alternative 2:
match all the words and use replace() method to replace them with the first letter of each word and ignore the space (the method will not mutate your original string)
// in case the input is lowercase & there's a word with apostrophe
const toAbbr = (str) => {
return str.replace(/(\S+)(\s*)/gi, (match, p1, p2) => p1[0].toUpperCase());
};
toAbbr("java script object notation"); //result JSON
// word = not space = \S+ = p1 (p1 is the first pattern)
// space = \s* = p2 (p2 is the second pattern)
It's important to trim the word before splitting it, otherwise, we'd lose some letters.
const getWordInitials = (word: string): string => {
const bits = word.trim().split(' ');
return bits
.map((bit) => bit.charAt(0))
.join('')
.toUpperCase();
};
$ getWordInitials("Java Script Object Notation")
$ "JSON"
How about this:
var str = "", abbr = "";
str = "Java Script Object Notation";
str = str.split(' ');
for (var i = 0; i < str.length; i++) {
abbr += str[i].substr(0,1);
}
alert(abbr);
Working Example.
If you came here looking for how to do this that supports non-BMP characters that use surrogate pairs:
const initials = str.split(' ')
.map(s => s ? String.fromCodePoint(s.codePointAt(0)).toUpperCase() : '') // skip empty items instead of emitting a NUL character
.join('');
Works in all modern browsers with no polyfills (not IE though)
Getting first letter of any Unicode word in JavaScript is now easy with the ECMAScript 2018 standard:
/(?<!\p{L}\p{M}*)\p{L}/gu
This regex finds any Unicode letter (see the last \p{L}) that is not preceded with any other letter that can optionally have diacritic symbols (see the (?<!\p{L}\p{M}*) negative lookbehind where \p{M}* matches 0 or more diacritic chars). Note that u flag is compulsory here for the Unicode property classes (like \p{L}) to work correctly.
To emulate a fully Unicode-aware \b, you'd need to add a digit matching pattern and connector punctuation:
/(?<!\p{L}\p{M}*|[\p{N}\p{Pc}])\p{L}/gu
It works in Chrome, Firefox (since June 30, 2020), Node.js, and the majority of other environments (see the compatibility matrix here), for any natural language including Arabic.
Quick test:
const regex = /(?<!\p{L}\p{M}*)\p{L}/gu;
const string = "Żerard Łyżwiński";
// Extracting
console.log(string.match(regex)); // => [ "Ż", "Ł" ]
// Extracting and concatenating into string
console.log(string.match(regex).join("")) // => ŻŁ
// Removing
console.log(string.replace(regex, "")) // => erard yżwiński
// Enclosing (wrapping) with a tag
console.log(string.replace(regex, "<span>$&</span>")) // => <span>Ż</span>erard <span>Ł</span>yżwiński
console.log("_Łukasz 1Żukowski".match(/(?<!\p{L}\p{M}*|[\p{N}\p{Pc}])\p{L}/gu)); // => null
In ES6:
function getFirstCharacters(str) {
// Drop empty items (from repeated or leading spaces), then take each first character
return str.split(' ').filter(word => word !== '').map(word => word.charAt(0));
}
const str1 = "Hello4 World65 123 !!";
const str2 = "123and 456 and 78-1";
const str3 = " Hello World !!";
console.log(getFirstCharacters(str1));
console.log(getFirstCharacters(str2));
console.log(getFirstCharacters(str3));
Output:
[ 'H', 'W', '1', '!' ]
[ '1', '4', 'a', '7' ]
[ 'H', 'W', '!' ]
This should do it.
var s = "Java Script Object Notation",
a = s.split(' '),
l = a.length,
i = 0,
n = "";
for (; i < l; ++i)
{
n += a[i].charAt(0);
}
console.log(n);
JavaScript regular expressions are not Unicode-aware before ECMAScript 6, so anyone who needs to support characters such as "å" in older environments has to rely on the non-regex approaches.
Even on ES6, you need to opt in to Unicode mode with the u flag.
More details: https://mathiasbynens.be/notes/es6-unicode-regex
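A brief sketch of what the u flag changes; the sample characters are my own choice and are written as escapes so the snippet does not depend on file encoding:

```javascript
// Without the u flag, "." works on UTF-16 code units, not on code points.
const face = '\u{1F600}'; // one code point, stored as a surrogate pair
console.log(/^.$/.test(face));  // false
console.log(/^.$/u.test(face)); // true

// \w stays ASCII-only, while \p{L} (which needs the u flag) matches any letter.
const aRing = '\u00E5'; // "å"
console.log(/^\w$/.test(aRing));     // false
console.log(/^\p{L}$/u.test(aRing)); // true
```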
Yet another option using reduce function:
var value = "Java Script Object Notation";
var result = value.split(' ').reduce(function(previous, current){
return {v : previous.v + current[0]};
},{v:""});
$("#output").text(result.v);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<pre id="output"/>
This is similar to others, but (IMHO) a tad easier to read:
const getAcronym = title =>
title.split(' ')
.map(word => word[0])
.join('');
ES6 reduce way:
const initials = inputStr.split(' ').reduce((result, currentWord) =>
result + currentWord.charAt(0).toUpperCase(), '');
alert(initials);
Try This Function
const createUserName = function (name) {
const username = name
.toLowerCase()
.split(' ')
.map((elem) => elem[0])
.join('');
return username;
};
console.log(createUserName('Anisul Haque Bhuiyan'));
