I'm parsing the body text from incoming emails, looking for key/value pairs.
Example Email Body
First Name: John
Last Name:Smith
Email : john#example.com
Comments = Just a test comment that
may span multiple lines.
I tried using a RegEx ([\w\d\s]+)\s?[=|:]\s?(.+) in multiline mode. This works for most emails, but fails when there's a line break that should be part of the value. I don't know enough about RegEx to go any further.
I have another parser that goes line-by-line looking for the key/value pairs and simply folds a line into the last matched value if a key/value pair is NOT found. It's implemented in Scala.
val lines = text.split("\\r?\\n").toList
var lastLabelled: Int = -1
val linesBuffer = mutable.ListBuffer[(String, String)]()
// only parse lines until the first blank line
// the null_? method checks for empty strings and nulls
lines.takeWhile(!_.null_?).foreach(line => {
  line.splitAt(delimiter) match {
    case Nil if line.nonEmpty => {
      val l = linesBuffer(lastLabelled)
      linesBuffer(lastLabelled) = (l._1, l._2 + "\n" + line)
    }
    case pair :: Nil => {
      lastLabelled = linesBuffer.length
      linesBuffer += pair
    }
    case _ => // skip this line
  }
})
I'm trying to use RegEx so that I can save the parser to the db and change it on a per-sender basis at runtime (implement different parsers for different senders).
Can my RegEx be modified to match values that contain newlines?
Do I need to just forget about using RegEx and use some JavaScript? I already have a JavaScript parser that lets me store the JS in the DB and essentially do everything that I want to do with the RegEx parser.
I think this should work...
((.+?)((\s*)(:|=)(\s*)))(((.|\n)(?!((.+?)(:|=))))+)
...as tested here http://regexpal.com/. If you loop through the matches you should be able to pull out the key and value.
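As a rough illustration of looping through the matches, here is a minimal JavaScript sketch. It does not use the expression above; it swaps in a simpler pattern of my own (`pairPattern`), and the names `body` and `parsePairs` are hypothetical. It assumes keys never contain a colon, an equals sign, or a newline, and that a continuation line never contains a colon or equals sign (otherwise it would be mistaken for a new key).

```javascript
// Sample body from the question, joined with newlines.
const body = [
  "First Name: John",
  "Last Name:Smith",
  "Email : john#example.com",
  "Comments = Just a test comment that",
  "may span multiple lines.",
].join("\n");

// Key: text without ':', '=' or a newline, at the start of the input or of a line.
// Value: everything up to the next "key:" / "key=" line, or the end of the input.
const pairPattern = /(?:^|\n)([^\n:=]+?)\s*[:=]\s*([\s\S]*?)(?=\n[^\n:=]+[:=]|$)/g;

function parsePairs(text) {
  const pairs = {};
  for (const match of text.matchAll(pairPattern)) {
    pairs[match[1].trim()] = match[2].trim();
  }
  return pairs;
}

const parsed = parsePairs(body);
console.log(parsed);
```

Because the whole parser is a single pattern string, it could be stored per sender and compiled at runtime with `new RegExp(source, "g")`.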
I am very new to regex. I have a multi-dimensional array and want to search for a text in each of its sub-arrays. Some elements are ranges, i.e. anything may follow a certain prefix (AA and A in this example). So I wrote something like this. I put AA* in the array so that only the first 2 characters have to match, and A* so that only the first character has to match.
arr = [
  ["AA*","ABC","XYZ"] ,
  ["A*","AXY","AAJ"]
]
var text = "AA3";
for ($i=0; $i<arr.length; $i++ ){
  var new_array = [];
  new_array = arr[$i];
  new_array.filter(function(array_element) {
    var result = new RegExp(array_element).test(text);
    if( result == true){
      console.log(arr[$i]);
    }
  });
}
What I want: when text = "AA3", or anything that starts with a double A (AA[anything]), the output should be the first array, ["AA*","ABC","XYZ"], but I am getting both arrays. When text = "A3", the output should be the second array, ["A*","AXY","AAJ"], but again I am getting both arrays. If text = "ABC" or text = "AAJ", it should output the first or the second array respectively. I don't know how to write regex. Is there any other way I can implement this?
Thanks in advance; any advice will be helpful.
Summary
In short, the issue is "*"! The * found in the elements of the arrays is why you're getting both arrays each time.
Detailed Info
Regexp is one concept most developers find hard to understand (I am one of them, btw 😅).
I'll start off with an excerpt from the intro to Regexp on MDN:
Regexp are patterns used to match character combinations in strings - MDN
With that in mind, you want to understand what goes on in your code.
When you create a regex like /A*/ to test "AA3", it matches zero or more A's anywhere in the string, so the empty string, A, and AA all count as matches and test() returns true for any input. You would want stricter matching with ^ and $, or an explicit digit match with \d.
I rewrote your code as a function below:
const arr = [
  ["AA*", "ABC", "XYZ"],
  ["A*", "AXY", "AAJ"],
];
findInArray(arr, "AA3") // returns [] because no element contains "AA3"
findInArray(arr, "AAJ") // returns the second array
findInArray(arr, "ABC") // returns the first array
// Note: here the search text is the pattern and each element is the string being tested.
function findInArray(array, value) {
  return array.filter((subArray) =>
    subArray.some((item) => {
      const check = new RegExp(value);
      return check.test(item);
    })
  );
}
Problem
The problem lies in the fact you use each of the strings as a regex.
For a string with a * wildcard, the * means zero or more repetitions of the immediately preceding item, so A* matches every string (zero A's is enough) and AA* matches any string that contains an A.
For a string consisting solely of alphanumerics, the regex is an unanchored substring search, so it is true whenever the text contains that string, not only when the two are equal.
For strings containing characters that are part of regex syntax, this could result in errors or unintended behavior.
MDN article on RegExp quantifiers
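The three cases above can be checked directly. This is only a small demonstration; the variable names are mine and the inputs are made up for illustration.

```javascript
// Array elements used directly as patterns, as in the question.
const starOnly = new RegExp("A*");    // zero or more "A"
const prefixStar = new RegExp("AA*"); // one "A", then zero or more "A"

const checks = {
  starOnlyMatchesUnrelatedText: starOnly.test("XYZ"), // zero A's is still a match
  starOnlyMatchesEmptyString: starOnly.test(""),
  prefixStarMatchesABC: prefixStar.test("ABC"),       // a single "A" is enough
  prefixStarMatchesXYZ: prefixStar.test("XYZ"),       // no "A" at all
  plainIsSubstringSearch: new RegExp("AAJ").test("xxAAJxx"),
};

// Pattern syntax characters in the data can make construction fail outright.
let constructionFailed = false;
try {
  new RegExp("A(");
} catch (error) {
  constructionFailed = error instanceof SyntaxError;
}

console.log(checks, constructionFailed);
```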
Rewrite
Assumptions:
The value with a * wildcard is always only at the 0th position,
There is only one such wildcard in its string,
The question mentions that for text = 'AAJ' only the 2nd array shall be returned, but both the AA* from the 1st array and AAJ from the 2nd would seem to match this text.
As such, I assume the wildcard can only stand for a number (as other examples seem to suggest).
Code:
const abc = (arrs, text) => {
  return arrs.filter(arr => {
    const regex = new RegExp(`^${arr[0].replace('*', '\\d+')}$`);
    return regex.test(text) || arr.includes(text);
  })
}
const arr = [
  ["AA*", "ABC", "XYZ"],
  ["A*", "AXY", "AAJ"]
];
console.log(
`1=[${abc(arr, "AA3")}]
2=[${abc(arr, "ABC")}]
3=[${abc(arr, "AAJ")}]`);
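If the wildcard may appear in any element, or the fixed part may contain regex syntax characters, a variant along the following lines might be safer. It is a sketch under the same assumption as above (the wildcard stands for one or more digits); `escapeRegExp`, `wildcardToRegExp`, `findMatchingArrays`, and `sample` are hypothetical names introduced here.

```javascript
// Escape characters that have a special meaning in a regular expression.
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Turn "AA*" into /^AA\d+$/: literal parts are escaped, each "*" becomes \d+.
const wildcardToRegExp = (pattern) =>
  new RegExp("^" + pattern.split("*").map(escapeRegExp).join("\\d+") + "$");

// Wildcard elements are matched as patterns, all other elements by equality.
const findMatchingArrays = (arrays, text) =>
  arrays.filter((items) =>
    items.some((item) =>
      item.includes("*") ? wildcardToRegExp(item).test(text) : item === text
    )
  );

const sample = [
  ["AA*", "ABC", "XYZ"],
  ["A*", "AXY", "AAJ"],
];
console.log(findMatchingArrays(sample, "AA3")); // first array only
console.log(findMatchingArrays(sample, "A3"));  // second array only
console.log(findMatchingArrays(sample, "AAJ")); // second array only
```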
I have file names like the following:
SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT
SEM_VSE_SECURITY_555001881_181002_1559_37072093.DAT
SEM_VSE_MEDICALCONDEMERGENCIES_555001881_181002_1559_37072093.DAT
SEM_REASONS_555001881_181002_1414_37072093.DAT
SEM_PSE_NPI_SECURITY_555001881_181002_1412_37072093.DAT
and I need to strip the numbers from the end. This will happen daily and the numbers will change. I HAVE to do it in JavaScript. The problem is, I know next to nothing about JavaScript. I've looked at both split and slice and I'm not sure either will work. These files come from a government entity, which means the file names will probably not be consistent.
expected output:
SEM_VSE_SKINSHARPS
SEM_VSE_SECURITY
SEM_VSE_MEDICALCONDEMERGENCIES
SEM_REASONS
SEM_PSE_NPI_SECURITY
Any help is greatly appreciated.
This is a good use case for regular expressions. For example,
var oldFileName = 'SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT',
    newFileName;
newFileName = oldFileName.replace(/[_0-9]+(?=\.DAT$)/, ''); // SEM_VSE_SKINSHARPS.DAT
This says to replace as many characters as it can from the set _ and 0-9, with the requirement that the replaced portion must be followed by .DAT and the end of the string. The dot is escaped (\.) so that it matches a literal period rather than any character.
If you want to strip the .DAT as well, use /[_0-9]+\.DAT$/ as the regular expression instead of the one above.
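Applied to all five names from the question, a sketch could look like this. It uses a slightly stricter pattern of my own, which only removes underscore-separated groups made up entirely of digits; the helper name `stripNumericSuffix` is hypothetical.

```javascript
const fileNames = [
  "SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT",
  "SEM_VSE_SECURITY_555001881_181002_1559_37072093.DAT",
  "SEM_VSE_MEDICALCONDEMERGENCIES_555001881_181002_1559_37072093.DAT",
  "SEM_REASONS_555001881_181002_1414_37072093.DAT",
  "SEM_PSE_NPI_SECURITY_555001881_181002_1412_37072093.DAT",
];

// Remove one or more trailing "_<digits>" groups together with the .DAT extension.
const stripNumericSuffix = (name) => name.replace(/(?:_\d+)+\.DAT$/, "");

const stripped = fileNames.map(stripNumericSuffix);
console.log(stripped);
```

A name that does not end in numeric groups followed by .DAT is returned unchanged, which may be useful if the file names turn out to be inconsistent.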
If all the files end in a three-letter extension (.XYZ) and follow the given pattern, this might also work:
var filename = "SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT"
filename.slice(0,-4).split("_").filter(x => !+x).join("_")
results in:
"SEM_VSE_SKINSHARPS"
This is how it works:
drop the last 4 chars (.DAT)
split by _
filter out the numbers
join what is remaining with another _
You can also create a function out of this solution (or the other ones) and use it to process all the files provided they are in an array:
var fileTrimmer = filename => filename.slice(0,-4).split("_").filter(x => !+x).join("_")
var result = array_of_filenames.map(fileTrimmer)
Below is a solution that assumes you have your file name strings stored in an array. It creates a new array of properly formatted file names by calling Array.prototype.map on the original array. The map callback first grabs the extension part of the string so it can be tacked back onto the file name later. Next, it splits the fileName string into an array of parts delimited by the _ character. Finally, the filter callback returns true for a part that contains no digit, which keeps that part in the new array; for a part that contains a digit it returns false, so that part is dropped.
var fileNames = ['SEM_VSE_SKINSHARPS_555001881_181002_1559_37072093.DAT', 'SEM_VSE_SECURITY_555001881_181002_1559_37072093.DAT', 'SEM_VSE_MEDICALCONDEMERGENCIES_555001881_181002_1559_37072093.DAT', 'SEM_REASONS_555001881_181002_1414_37072093.DAT', 'SEM_PSE_NPI_SECURITY_555001881_181002_1412_37072093.DAT'];
var formattedFileNames = fileNames.map(fileName => {
  var ext = fileName.substring(fileName.indexOf('.'), fileName.length);
  var parts = fileName.split('_');
  return parts.filter(part => !part.match(/[0-9]/g)).join('_') + ext;
});
console.log(formattedFileNames);
In python there exists ast.literal_eval(x) where if x is "['a','b','c']" then it will return the list ['a','b','c']. Does something similar exist in Javascript / jQuery where I can take the array that is stored in the table cell as [x,y,z] and turn that into a literal JavaScript array?
I'd prefer to avoid complex solutions that might be error-prone, such as ones that involve splitting on the comma or escaping characters.
Edit: I should have given some better examples:
['la maison', "l'animal"] is an example of one that hits an error because doing a replace of a single or double quote can cause an issue since there's no guarantee on which one it'll be.
One could leverage String.prototype.replace() and JSON.parse().
See below for a rough example.
// String.prototype.replace() + JSON.parse() Strategy.
const input = "['a','b','c']" // Input.
const array = JSON.parse(input.replace(/'/g, '"')) // Array.
console.log(array) // Proof.
Although, given your update/more complex use case, eval() might be more appropriate.
// eval() Strategy.
const input = `['la maison', "l'animal"]` // Input.
const dangerousarray = eval(input) // Array.
const safearray = eval(`new Array(${input.replace(/^\[|\]$/g, '')})`)
console.log(dangerousarray) // Proof.
console.log(safearray) // Proof.
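However, the MDN docs discourage use of eval() because of its security and performance problems; both calls above execute the input as code, so neither is safe for untrusted input.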
As a result, one may opt for an approach similar to the following:
// Heavy Replacement Strategy.
const input = `['la maison', 'l\'animal']` // Input.
const array = input
  .replace(/^\[|\]$/g, '') // Remove leading and ending square brackets ([]).
  .split(',') // Split by comma.
  .map((phrase) => // Iterate over each phrase.
    phrase.trim() // Remove leading and ending whitespace.
      .replace(/"/g, '') // Remove all double quotes (").
      .replace(/^\'|\'$/g, '') // Remove leading and ending single quotes (').
  )
console.log(array) // Proof.
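Another option that avoids both eval() and splitting on commas is to scan for quoted strings directly. The following is a minimal sketch, assuming the cell only ever contains a flat list of single- or double-quoted strings; `parseStringList` is a hypothetical helper, it does not validate the brackets, and a backslash escape is reduced to the escaped character itself (so \n is not turned into a newline).

```javascript
function parseStringList(text) {
  // Either a '...' or a "..." literal; a backslash escapes the next character.
  const token = /'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)"/g;
  const items = [];
  for (const match of text.matchAll(token)) {
    const raw = match[1] !== undefined ? match[1] : match[2];
    items.push(raw.replace(/\\(.)/g, "$1")); // drop the escaping backslashes
  }
  return items;
}

const mixedQuotes = parseStringList(`['la maison', "l'animal"]`);
const withComma = parseStringList(`['x, y', 'z']`);
console.log(mixedQuotes, withComma);
```

Because quotes are matched in pairs, an apostrophe inside a double-quoted string and a comma inside either kind of string are left alone.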
In JavaScript you can use the eval() function, as in the sample below:
// define the string to evaluate
var str_to_evaluate = 'new Array("Saab", "Volvo", "BMW")';
// retrieve the result in an array
var cars = eval(str_to_evaluate);
// print the array
console.log(cars);
I would like to extract the Twitter handle names from a text string, using a regex. I believe I am almost there, except for the ">" that I am including in my output. How can I improve my regex so that it drops the ">" from my output?
Here is an example of a text string value:
"PlaymakersZA, Absa, DiepslootMTB"
The desired output would be an array consisting of the following:
PlaymakersZA, Absa, DiepslootMTB
Here is an example of my regex:
var array = str.match(/>[a-z-_]+/ig)
Thank you!
You can use match groups in your regex to indicate the part you wish to extract.
I set up this JSFiddle to demonstrate.
Basically, you surround the part of the regex that you want to extract in parentheses: />([a-z-_]+)/ig, save it as a RegExp object, and call .exec() as long as there are still matches. Index 1 of the resulting array holds the first match group's result. Index 0 is the whole match, and the following indices are subsequent match groups, if any.
var str = "PlaymakersZA, Absa, DiepslootMTB";
var regex = />([a-z-_]+)/ig;
var array = regex.exec(str);
while (array != null) {
  alert(array[1]);
  array = regex.exec(str);
}
}
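In newer JavaScript engines the same loop can be written with String.prototype.matchAll. The sketch below collects the first capture group of every match into an array; the markup in `html` is a made-up stand-in (an assumption, not the asker's real input), chosen so that each handle directly follows a ">".

```javascript
// Hypothetical input: each handle is the text of a link.
const html =
  '<a href="#">PlaymakersZA</a>, <a href="#">Absa</a>, <a href="#">DiepslootMTB</a>';

// Group 1 captures the handle; the leading ">" is matched but not captured.
const handlePattern = />([a-z_-]+)/gi;

const handles = Array.from(html.matchAll(handlePattern), (match) => match[1]);
console.log(handles);
```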
You could just strip all the HTML
var str = "PlaymakersZA, Absa, DiepslootMTB";
$handlers = str.replace(/<[^>]*>|\s/g,'').split(",");
A continuation of my previous question...
After testing the text format, if it is not in the correct format I would like to figure out which pairs of hex values are incorrect (i.e. any pair that contains value(s) other than [0-9A-Fa-f]).
if( validFormat ) {
  // do processing
}
else {
  // find invalid hex value pairs
}
What is the most efficient way to obtain a list of incorrect (invalid) hex pairs so that I can report back the errors and their associated hex pairs?
Edit for additional question
Also, how would I go about testing to ensure there is not a "double space" anywhere? That also constitutes an invalid format, even though the hex pairs may be valid.
Thanks!
The easiest is to find all values and scan for those that are not valid:
var isHexPair = /^[0-9a-f]{2}$/i;
var allPairs = myTextArea.value.split(/\s+/);
var notHex = [];
for (var i=allPairs.length;i--;){
  if (!isHexPair.test(allPairs[i])){
    notHex.push(allPairs[i]);
  }
}
That regex says:
^ starting at the beginning of the string
[0-9a-f] find any character that is a digit or a-f
{2} find exactly two of them
$ making sure that we are now at the end of the string
i and make it case-insensitive (allow A-F as well as a-f)
With the above you can then do:
if (notHex.length){
  // There is at least one invalid entry
}else{
  // all is well
}
Edit: If you explicitly want to test that the string contains nothing but single-byte hex strings separated by a single space, the simplest test would just be:
if (/^([0-9a-f]{2} )+[0-9a-f]{2}$/i.test(myStr)){ /* valid! */ }
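Both checks can be combined so that the invalid pairs and the double-space problem are reported separately. This is only a sketch; `checkHexInput` and the shape of its result are hypothetical, and it works on a plain string rather than on the text area.

```javascript
function checkHexInput(text) {
  const isHexPair = /^[0-9a-f]{2}$/i;
  // Split on any run of whitespace; ignore empty tokens from leading/trailing space.
  const invalidPairs = text
    .split(/\s+/)
    .filter((token) => token !== "" && !isHexPair.test(token));
  // Two or more consecutive spaces are reported separately from bad pairs.
  const hasDoubleSpace = / {2,}/.test(text);
  return { invalidPairs, hasDoubleSpace };
}

const badPairs = checkHexInput("0A ff zz 1 G7");
const badSpacing = checkHexInput("0A  ff");
console.log(badPairs, badSpacing);
```

Keeping the two results apart lets the error message say whether the problem is the spacing, the pairs, or both.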
Take the value of the text area and store it in a var. Since the pairs are space-separated, call .split(" ") on it (which splits on single spaces) and you will end up with an array of hex pairs. Then just iterate through the array, comparing each element inside your loop against the regex from your last question, store the invalid pairs in a new var, and print that out to the user.