Convert Java tokenizing regex into Javascript


As an answer to my question Tokenizing an infix string in Java, I got the regex (?<=[^\.a-zA-Z\d])|(?=[^\.a-zA-Z\d]). However, now I'm writing the same code in JavaScript, and I'm stuck on how to get a JavaScript regex to do the same thing.
For example, if I have the string sin(4+3)*2, I would need it parsed into ["sin","(","4","+","3",")","*","2"]
What regex would I use to tokenize the string into its individual parts?
Previously, I did a string replace on every possible token to put spaces around it, then split on that whitespace. However, that code quickly became very bloated.
The operators I need to split on are the standard math operators (+,-,*,/,^), as well as function names (sin, cos, tan, abs, etc.), and commas.
What is a fast, efficient way to do this?

You can take advantage of regular expression grouping to do this. You need a regex that combines the different possible tokens, and you apply it repeatedly.
I like to separate out the different parts; it makes it easier to maintain and extend:
var tokens = [
"sin",
"cos",
"tan",
"\\(",
"\\)",
"\\+",
"-",
"\\*",
"/",
"\\d+(?:\\.\\d*)?"
];
You glue those all together into a big regular expression with | between each token:
var rtok = new RegExp( "\\s*(?:(" + tokens.join(")|(") + "))\\s*", "g" );
You can then tokenize using regex operations on your source string:
function tokenize( expression ) {
  var toks = [], p;
  rtok.lastIndex = p = 0; // reset the regex
  while (rtok.lastIndex < expression.length) {
    var match = rtok.exec(expression);
    // Make sure we found a token, and that we found
    // one without skipping garbage
    if (!match || rtok.lastIndex - match[0].length !== p)
      throw "Oops - syntax error";
    // Figure out which token we matched by finding the non-null group
    for (var i = 1; i < match.length; ++i) {
      if (match[i]) {
        toks.push({
          type: i,
          txt: match[i]
        });
        // remember the new position in the string
        p = rtok.lastIndex;
        break;
      }
    }
  }
  return toks;
}
That just repeatedly matches the token regex against the string. The regular expression was created with the "g" flag, so the regex machinery will automatically keep track of where to start matching after each match we make. When it doesn't see a match, or when it does but has to skip invalid stuff to find it, we know there's a syntax error. When it does match, it records in the token array which token it matched (the index of the non-null group) and the matched text. By remembering the matched token index, it saves you the trouble of having to figure out what each token string means after you've tokenized; you just have to do a simple numeric comparison.
Thus calling tokenize( "sin(4+3) * cos(25 / 3)" ) returns:
[ { type: 1, txt: 'sin' },
{ type: 4, txt: '(' },
{ type: 10, txt: '4' },
{ type: 6, txt: '+' },
{ type: 10, txt: '3' },
{ type: 5, txt: ')' },
{ type: 8, txt: '*' },
{ type: 2, txt: 'cos' },
{ type: 4, txt: '(' },
{ type: 10, txt: '25' },
{ type: 9, txt: '/' },
{ type: 10, txt: '3' },
{ type: 5, txt: ')' } ]
Token type 1 is the sin function, type 4 is left paren, type 10 is a number, etc.
edit — if you want to match identifiers like "x" and "y", then I'd probably use a different set of token patterns, with one just to match any identifiers. That'd mean that the parser would not find out directly about "sin" and "cos" etc. from the lexer, but that's OK. Here's an alternative list of token patterns:
var tokens = [
"[A-Za-z_][A-Za-z_\\d]*",
"\\(",
"\\)",
"\\+",
"-",
"\\*",
"/",
"\\d+(?:\\.\\d*)?"
];
Now any identifier will be a type 1 token.
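With a generic identifier pattern, deciding whether a name is a known function moves from the lexer to the parser. A minimal sketch of that step follows; KNOWN_FUNCTIONS and classifyIdentifier are hypothetical names introduced for illustration (they are not part of the answer above), and the sketch assumes tokens of the shape { type, txt } where type 1 is an identifier:

```javascript
// Hypothetical set of function names the parser accepts (an assumption for illustration).
const KNOWN_FUNCTIONS = new Set(["sin", "cos", "tan", "abs"]);

// Label identifier tokens (type 1) as either a known function or a variable.
// Tokens of any other type are returned unchanged.
function classifyIdentifier(token) {
  if (token.type !== 1) return token;
  const kind = KNOWN_FUNCTIONS.has(token.txt) ? "function" : "variable";
  return { ...token, kind };
}

console.log(classifyIdentifier({ type: 1, txt: "sin" }));
console.log(classifyIdentifier({ type: 1, txt: "x" }));
```

Keeping the function list in a set rather than in the token patterns means new functions can be added without renumbering the token types.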

I don't know if this will do everything of what you want to achieve, but it works for me:
'sin(4+3)*2'.match(/\d+\.?\d*|[a-zA-Z]+|\S/g);
// ["sin", "(", "4", "+", "3", ")", "*", "2"]
You may replace [a-zA-Z]+ part with sin|cos|tan|etc to support only math functions.
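A small sketch of that substitution, assuming a hypothetical makeTokenizer helper (not from the answer) and function names that contain no regex metacharacters:

```javascript
// Hypothetical helper: builds a match-based tokenizer restricted to the given function names.
function makeTokenizer(functionNames) {
  // Try longer names first so that, for example, "sinh" would win over "sin".
  const names = [...functionNames].sort((a, b) => b.length - a.length).join("|");
  // Same shape as the pattern above: number | function name | any other non-space character.
  const pattern = new RegExp("\\d+\\.?\\d*|" + names + "|\\S", "g");
  return expression => expression.match(pattern) || [];
}

const tokenizeMath = makeTokenizer(["sin", "cos", "tan", "abs"]);
console.log(tokenizeMath("sin(4+3)*2"));
// ["sin", "(", "4", "+", "3", ")", "*", "2"]
```

Note that an unknown name is not rejected by this pattern: its letters fall through to \S and come out as single-character tokens, so a later validation step would still be needed.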

Just to offer up another possibility:
[a-zA-Z]+|\d+(?:\.\d+)?|.

Related

use named regex groups to output an array of matches

I'm trying to get the hang of named capturing groups.
Given a string
var a = '{hello} good {sir}, a [great] sunny [day] to you.';
I'd like to output an array which maintains the integrity of the sentence (complete with punctuation, spaces, etc) so I can reassemble the sentence at a later time:
[
{
group: "braces",
word: "hello"
},
{
group: "other",
word: " good " <-- space on either side is maintained
},
{
group: "braces",
word: "sir"
},
{
group: "other",
word: ", a "
},
{
group: "brackets",
word: "great"
},
{
group: "other",
word: " sunny "
},
{
group: "brackets",
word: "day"
},
{
group: "other",
word: " to you."
},
]
I'm using named capturing groups to try and output this. <braces> captures any text within {}, <brackets> captures any text within [], and <others> captures anything else (\s,.\w+):
var regex = /(?<braces>\{(.*?)\})(?<brackets>\[(.*?)\])(?<others>\s,.\w+)?/g;
console.log(a.match(regex)); outputs nothing.
If I remove <others> group,
var regex = /(?<braces>\{(.*?)\})(?<brackets>\[(.*?)\])?/g;
console.log(a.match(regex)); outputs ["{hello}", "{sir}"]
Question: How do I use capturing groups to find all instances of named groups and output them like the above desired array?

A regex match object will only contain one string for a given named capture group. For what you're trying to do, you'll have to do it in two steps: first separate out the parts of the input, then map it to the array of objects while checking which group was captured to identify the sort of group it needs:
const str = '{hello} good {sir}, a [great] sunny [day] to you.';
// Three alternatives: {braces}, [brackets], or a run of any other text
const matches = [...str.matchAll(/{([^}]+)}|\[([^\]]+)\]|([^[{]+)/g)]
  .map(match => ({
    group: match[1] ? 'braces' : match[2] ? 'brackets' : 'other',
    word: match[1] || match[2] || match[3]
  }));
console.log(matches);
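Since the question asks specifically about named groups, the same two-step idea can also be written with them. This is only a sketch: the splitGroups helper is a hypothetical name, and it relies on match.groups holding undefined for every named group that did not take part in a given match.

```javascript
// One alternative per kind of segment; each alternative carries its own named group.
const segmentPattern = /\{(?<braces>[^}]+)\}|\[(?<brackets>[^\]]+)\]|(?<other>[^[{]+)/g;

// Hypothetical helper: returns [{ group, word }, ...] in the order the segments appear.
function splitGroups(input) {
  return [...input.matchAll(segmentPattern)].map(match => {
    // Exactly one named group is defined per match; find it.
    const [group, word] = Object.entries(match.groups)
      .find(([, value]) => value !== undefined);
    return { group, word };
  });
}

const sentence = '{hello} good {sir}, a [great] sunny [day] to you.';
console.log(splitGroups(sentence));
```

The group name now comes straight from the regex, so adding a new kind of delimiter only means adding another named alternative.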

Convert user input string to an object to be accessed by function

I have data in the format (input):
doSomething({
type: 'type',
Unit: 'unit',
attributes: [
{
attribute: 'attribute',
value: form.first_name
},
{
attribute: 'attribute2',
value: form.family_name
}
],
groups: [
{
smth: 'string1',
smth2: 'string2',
start: timeStart.substring(0, 9)
}
]
})
I managed to take out the doSomething part with the parenthesis as to load the function from the corresponding module with
expression.split('({',1)[0]
However using the loaded function with the rest, obtained with:
expression.split(temp+'(')[1].trim().replace(/\n+/g, '').slice(0, -1)
does not work because it should be an object and not a string. Hardcoding the data in does work as it is automatically read as an object.
My question is whether there is any way to convert the string that I get from the user into an object. I have tried to convert it to a JSON object with JSON.parse, but I get an unexpected character t at position 3. I have also tried new Object(myString), but that did not work either.
What I would like is to have the body of the provided function as an object as if I would hard code it, so that the function can evaluate the different fields properly.
Is there any way to easily achieve that?
EDIT: the "output" would be:
{
type: 'type',
Unit: 'unit',
attributes: [
{
attribute: 'attribute',
value: form.first_name
},
{
attribute: 'attribute2',
value: form.family_name
}
],
groups: [
{
smth: 'string1',
smth2: 'string2',
start: timeStart.substring(0, 9)
}
]
}
as an object. This is the critical part because I have this already but as a string. However the function that uses this, is expecting an object. Like previously mentioned, hard coding this would work, as it is read as an object, but I am getting the input mentioned above as a string from the user.

Aside: I know eval is evil, because it lets the user inject arbitrary code. This is only one way to do it; there are others.
I cut the leading "doSomething(" and the trailing ")" from the input string and put "output = " in front of what remains. That gives a normal statement which can be executed with eval.
I strongly recommend against using eval this way, especially since you don't know what the user will enter, so you can't know what might happen to your code and data.
let form = {first_name: 'Mickey', family_name: 'Mouse'};
let timeStart = (new Date()).toString();
let input = `doSomething({
type: 'type',
Unit: 'unit',
attributes: [
{
attribute: 'attribute',
value: form.first_name
},
{
attribute: 'attribute2',
value: form.family_name
}
],
groups: [
{
smth: 'string1',
smth2: 'string2',
start: timeStart.substring(0, 9)
}
]
})`;
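let pos = "doSomething(".length;
let output; // declared explicitly so the eval'd assignment does not create an implicit global
// Drop the leading "doSomething(" and the trailing ")" so only the object literal remains
input = 'output = ' + input.slice(pos, -1);
eval(input);
console.log(output);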
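A variant of the same idea uses the Function constructor, so the variables the literal refers to are passed in explicitly instead of being picked up from the surrounding scope. This is a sketch only: parseObjectLiteral is a hypothetical helper, and the string is still compiled and run as code, so it is just as unsafe as eval for untrusted input.

```javascript
// Hypothetical helper: evaluates an object-literal string, passing the names in
// `scope` as function parameters. Global variables remain reachable from the
// evaluated code, so this is NOT a sandbox.
function parseObjectLiteral(source, scope) {
  const names = Object.keys(scope);
  const values = Object.values(scope);
  const factory = new Function(...names, "return (" + source + ");");
  return factory(...values);
}

const literal = "{ type: 'type', value: form.first_name, start: timeStart.substring(0, 9) }";
const parsed = parseObjectLiteral(literal, {
  form: { first_name: "Mickey", family_name: "Mouse" },
  timeStart: "Mon Jan 01 2024 10:00:00",
});
console.log(parsed);
```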

Regex to NOT match a string with 10 consecutive digits. The digits may be separated by white space. All other string return a match

I have a codepen with 5/7 unit tests passing. Stuck on strings starting with non-digit characters.
https://codepen.io/david-grieve/pen/pBpGoO?editors=0012
var regexString = /^\D*(?!(\s*\d\s*){10,}).*/;
var tests = [{
text: 'abc123',
ismatch: true
}, {
text: '1234567890',
ismatch: false
}, {
text: '123456789',
ismatch: true
}, {
text: 'abc1234567890efg',
ismatch: false
}, {
text: '123 456 789 123',
ismatch: false
},
{
text: 'abc1234567890',
ismatch: false
}, {
text: '1234567890efg',
ismatch: false
}
];
console.log(new Date().toString());
tests.map(test => console.log(test.text, regexString.test(test.text) == test.ismatch));
With this regex the following strings pass the unit tests
"abc123" true
"1234567890" true
"123456789" true
"123 456 789 123" true
"1234567890efg" true
These fail the unit tests
"abc1234567890" false
"abc1234567890efg" false
Note: /^\D{3,}(?!(\s*\d\s*){10,}).*/ passes all the tests but is obviously wrong.

The problem with ^\D*(?! is that, even if a long digit/space string is found in the negative lookahead, the part matched by \D will simply backtrack one character once the negative lookahead matches. Eg, when
^\D*(?!\d{10,}).*
matches
abc1234567890
the \D* matches ab, and the .* matches c1234567890. The position between the b and the c is not immediately followed by a long number/space substring, so the match does not fail.
Also, because some digits may come before the 10 consecutive digits, the ^\D* at the beginning won't be enough - for example, what if the input is 1a01234567890? Instead, try
^(?!.*(\d\s*){10}).*
This ensures that every position is not followed by (10 digits, possibly separated by spaces).
https://regex101.com/r/v7t4IC/1
If the digits can only come in a single block (possibly separated by spaces) in the string, your pattern would've worked if you were in an environment which supports possessive quantifiers, which prevent backtracking, eg:
^\D*+(?!(\s*\d\s*){10,}).*
    ^
https://regex101.com/r/eGdw2l/1
(but Javascript does not support such syntax, unfortunately)
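The general pattern above can be checked against the seven test cases from the question. The sketch below also includes a commonly used way to emulate the possessive \D*+ in JavaScript: a lookahead captures the non-digit prefix and a backreference consumes it, which works because the engine does not backtrack into a lookahead once it has matched. The variable names are assumptions made for illustration, and the emulation carries the same single-block caveat as the possessive version.

```javascript
// Pattern from the answer: fail if 10 digits (optionally separated by whitespace)
// appear anywhere in the string.
const noTenDigits = /^(?!.*(\d\s*){10}).*/;

// Emulated possessive \D*+ : capture the prefix inside a lookahead, then consume it with \1.
const possessiveLike = /^(?=(\D*))\1(?!(?:\s*\d\s*){10,}).*/;

// [text, expected match result], taken from the question's test list.
const cases = [
  ["abc123", true],
  ["1234567890", false],
  ["123456789", true],
  ["abc1234567890efg", false],
  ["123 456 789 123", false],
  ["abc1234567890", false],
  ["1234567890efg", false],
];

const results = cases.map(([text, expected]) => ({
  text,
  expected,
  general: noTenDigits.test(text),
  singleBlock: possessiveLike.test(text),
}));
console.table(results);
```

For an input such as 1a01234567890, where digits appear before the long run, only the first pattern rejects the string.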

Mongoose: Sorting

what's the best way to sort the following documents in a collection:
{"topic":"11.Topic","text":"a.Text"}
{"topic":"2.Topic","text":"a.Text"}
{"topic":"1.Topic","text":"a.Text"}
I am using the following
find({topic: req.body.topic}).sort({topic: 1})
but is not working (because the fields are strings and not numbers so I get):
{"topic":"1.Topic","text":"a.Text"},
{"topic":"11.Topic","text":"a.Text"},
{"topic":"2.Topic","text":"a.Text"}
but i'd like to get:
{"topic":"1.Topic","text":"a.Text"},
{"topic":"2.Topic","text":"a.Text"},
{"topic":"11.Topic","text":"a.Text"}
I read another post here that this will require complex sorting which mongoose doesn't have. So perhaps there is no real solution with this architecture?
Your help is greatly appreciated

I suggest you make your topic field of type Number, and create another field, topic_text.
Your Schema would look like:
var documentSchema = new mongoose.Schema({
topic : Number,
topic_text : String,
text : String
});
Normal document would look something like this:
{document1:[{"topic":11,"topic_text" : "Topic" ,"text":"a.Text"},
{"topic":2,"topic_text" : "Topic","text":"a.Text"},
{"topic":1,"topic_text" : "Topic","text":"a.Text"}]}
Thus, you will be able to use .sort({topic: 1}) and get the result you want.
When you need the original topic string, append topic_text to the number:
find({topic: req.body.topic}).sort({topic: 1}).exec(function (err, result) {
  // Rebuild e.g. "11.Topic"; use index i to extract each value from the result array.
  var topic = result[0].topic + '.' + result[0].topic_text;
})

If you do not want to (or cannot) change the shape of your documents to include a numeric field for the topic number, you can achieve the desired sorting with the aggregation framework.
The following pipeline essentially splits topic strings like '11.Topic' at the dot '.' and then prefixes the first part of the resulting array with a fixed number of leading zeros, so that sorting by those strings results in 'emulated' numeric sorting.
Note, however, that this pipeline uses the $split and $strLenBytes operators, which are pretty new, so you may have to update your MongoDB instance. I used version 3.3.10.
db.getCollection('yourCollection').aggregate([
{
$project: {
topic: 1,
text: 1,
tmp: {
$let: {
vars: {
numStr: { $arrayElemAt: [{ $split: ["$topic", "."] }, 0] }
},
in: {
topicNumStr: "$$numStr",
topicNumStrLen: { $strLenBytes: "$$numStr" }
}
}
}
}
},
{
$project: {
topic: 1,
text: 1,
topicNumber: { $substr: [{ $concat: ["_0000", "$tmp.topicNumStr"] }, "$tmp.topicNumStrLen", 5] },
}
},
{
$sort: { topicNumber: 1 }
},
{
$project: {
topic: 1,
text: 1
}
}
])

How to make a MongoDB query sort on strings with -number postfix?

I have a query:
ownUnnamedPages = Entries.find( { author : this.userId, title : {$regex: /^unnamed-/ }}, {sort: { title: 1 }}).fetch()
That returns the following array sorted:
[ {
title: 'unnamed-1',
text: '<p>sdaasdasdasd</p>',
tags: [],
_id: 'Wkxxpapm8bbiq59ig',
author: 'AHSwfYgeGmur9oHzu',
visibility: 'public' },
{
title: 'unnamed-10',
text: '',
author: 'AHSwfYgeGmur9oHzu',
visibility: 'public',
_id: 'aDSN2XFjQPh9HPu4c' },
{
title: 'unnamed-2',
text: '<p>kkhjk</p>',
tags: [],
_id: 'iM9FMCsyzehQvYGKj',
author: 'AHSwfYgeGmur9oHzu',
visibility: 'public' },
{
title: 'unnamed-3',
text: '',
tags: [],
_id: 'zK2w9MEQGnwsm3Cqh',
author: 'AHSwfYgeGmur9oHzu',
visibility: 'public' }]
The problem is that it seems to sort on the first numeric character so it thinks the proper sequence is 1, 10, 2, 3, etc....
what I really want is for it to sort on both the whole numerical part so that 10 would be at the end.
I'd prefer not to do this by having additional numbers such as 01 or 001 for the numbers.
How would I do that?

You can use
db.collectionName.find().sort({title: 1}).collation({locale: "en_US", numericOrdering: true})
The numericOrdering flag is an optional boolean that determines whether numeric strings are compared as numbers or as strings.
If true, compare as numbers; i.e. "10" is greater than "2".
If false, compare as strings; i.e. "10" is less than "2".
Default is false.
See mongo's collation documentation for an updated explanation of those fields.

MongoDB can't sort by numbers stored as strings. You either have to store the number as an integer in its own field, pad with leading zeroes, or sort the results after they've been returned from the database.

If you zero-pad the numbers, a string sort will return them in the right order. So instead of 0,1,2,3,4,5,6,7,8,9,10,11...
use 01,02,03,04,05,06,07,08,09,10,11...
and a string sort will return them in order.
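A minimal sketch of the padding step in JavaScript, assuming titles that end in a number and a hypothetical padTitle helper; the width of 3 is an arbitrary choice that must be large enough for the biggest number you expect:

```javascript
// Hypothetical helper: rewrites "unnamed-2" as "unnamed-002" so that
// plain string order matches numeric order.
function padTitle(title, width = 3) {
  return title.replace(/\d+$/, digits => digits.padStart(width, "0"));
}

const titles = ["unnamed-1", "unnamed-10", "unnamed-2", "unnamed-3"];
// The arrow function matters: passing padTitle directly to map would feed the
// array index in as the width.
const padded = titles.map(title => padTitle(title)).sort();
console.log(padded);
// ["unnamed-001", "unnamed-002", "unnamed-003", "unnamed-010"]
```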

The MongoDB documentation says you can use collation for this.
As @Eugene Kaurov said, you can use
.collation({locale: "en_US", numericOrdering: true})
See the official MongoDB collation reference for details.
Be aware that the accepted answer is no longer correct.

In MongoDB this is not possible by default (strings are sorted in ASCII order), but you can sort with the function below after you fetch all documents from the collection:
const sortString = (a, b) => {
const AA = a.title.split('-');
const BB = b.title.split('-');
if (parseInt(AA[1], 10) === parseInt(BB[1], 10)) {
return 0;
}
return (parseInt(AA[1], 10) < parseInt(BB[1], 10)) ? -1 : 1;
};
documents.sort(sortString); // documents: the array fetched from the collection
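If the results are sorted on the client anyway, a numeric-aware collator avoids splitting the title by hand. This is an alternative sketch rather than the method above; sortByTitle is a hypothetical helper, and it assumes a runtime with Intl support (current browsers and Node.js).

```javascript
// numeric: true makes the collator compare runs of digits by their numeric value.
const titleCollator = new Intl.Collator("en", { numeric: true });

// Hypothetical helper: returns a new array sorted by title, leaving the input untouched.
function sortByTitle(entries) {
  return [...entries].sort((a, b) => titleCollator.compare(a.title, b.title));
}

const fetched = [
  { title: "unnamed-1" },
  { title: "unnamed-10" },
  { title: "unnamed-2" },
  { title: "unnamed-3" },
];
console.log(sortByTitle(fetched).map(entry => entry.title));
// ["unnamed-1", "unnamed-2", "unnamed-3", "unnamed-10"]
```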

In my case we work with aggregations. The approach was to sort using the length of our string; only works when the text part is always the same (unnamed- in your case)
db.YourCollection.aggregate([
{
$addFields: {
"TitleSize": { $strLenCP: "$Title" }
}
},
{
$sort: {
"TitleSize": 1,
"Title": 1
}
}
]);
The sort uses the length first, then the content as the second key.
Example:
"unnamed-2", TitleSize: 9
"unnamed-7", TitleSize: 9
"unnamed-30", TitleSize: 10
"unnamed-1", TitleSize: 9
Sorting by length alone puts the ids in this order: 2, 7, 1, 30. The second sort key then puts them in the correct order: 1, 2, 7, 30.
