Ambiguous interface of RegExp

Something very strange.
var body="Received: from ([195.000.000.0])\r\nReceived: from ([77.000.000.000]) by (6.0.000.000)"
var lastMath="";
var subExp = "[\\[\\(](\\d+\\.\\d+\\.\\d+\\.\\d+)[\\]\\)]"
var re = new RegExp("Received\\: from.*?"+subExp +".*", "mg");
var re1 = new RegExp(subExp , "mg");
while(ares= re.exec(body))
{
print(ares[0])
while( ares1 = re1.exec(ares[0]))
{
if(!IsLocalIP(ares1[1]))
{
print(ares1[1])
lastMath=ares1[1];
break ;
}
}
}
print(lastMath)
It outputs:
Received: from ([195.000.000.0])
195.000.000.0
Received: from ([77.000.000.000]) by (6.0.000.000)
6.0.000.000
6.0.000.000
But I think it should be:
Received: from ([195.000.000.0])
195.000.000.0
Received: from ([77.000.000.000]) by (6.0.000.000)
77.000.000.000
77.000.000.000
Because obviously "77.000.000.000" comes first. If I comment out "break", the output is correct.
What's wrong with my code?

Note that capture groups in JavaScript (and most languages) do not behave in an obvious way when they are repeated with the * or + operators. For example:
js>r = /^(ab[0-9])+$/
/^(ab[0-9])+$/
js>"ab1ab2ab3ab4".match(r)
ab1ab2ab3ab4,ab4
In this case, you get the last group that matches and that's it. I'm not sure where this behavior is specified, but it can vary from language to language.
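To make the point concrete, here is a small sketch that contrasts the repeated group with a global match on the repeated unit itself. The variable names are only illustrative.

```javascript
// A repeated capture group keeps only its last iteration.
var whole = "ab1ab2ab3ab4".match(/^(ab[0-9])+$/);
// whole[0] is the full match; whole[1] is the last repetition of the group.

// To collect every repetition, match the unit itself with the g flag.
var units = "ab1ab2ab3ab4".match(/ab[0-9]/g);

console.log(whole[1]);
console.log(units.join(","));
```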
edit: What does IsLocalIP() do?
OK, I think the problem has to do with exec's statefulness (which may be why I don't use it; I use String.match()). If you're going to do this, you need to manually reset the regex's lastIndex property to 0; otherwise you get this behavior:
function weird(dobreak)
{
var s = "Received: from ([77.000.000.000]) by (6.0.000.000)"
var re1 = /[\[\(](\d+\.\d+\.\d+\.\d+)[\]\)]/mg
while (s2 = re1.exec(s))
{
writeln("s2="+s2);
if (dobreak)
break;
}
}
produces this result:
js>weird(true)
js>weird(true)
s2=[77.000.000.000],77.000.000.000
js>weird(true)
s2=(6.0.000.000),6.0.000.000
js>weird(true)
js>
You'll note that the same function gets three different results, which implies that state is being carried over between calls for some bizarre reason (is JavaScript caching/interning the regex literal somehow? I'm using JSDB, which uses SpiderMonkey, Firefox's JavaScript engine).
So if I change the code to the following:
function notweird(dobreak)
{
var s = "Received: from ([77.000.000.000]) by (6.0.000.000)"
var re1 = /[\[\(](\d+\.\d+\.\d+\.\d+)[\]\)]/mg
re1.lastIndex = 0;
while (s2 = re1.exec(s))
{
writeln("s2="+s2);
if (dobreak)
break;
}
}
Then I get the expected behavior:
js>notweird(true)
s2=[77.000.000.000],77.000.000.000
js>notweird(true)
s2=[77.000.000.000],77.000.000.000
js>notweird(true)
s2=[77.000.000.000],77.000.000.000
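The same effect can be reproduced without calling a function repeatedly, which is closer to the code in the question: one regex object with the g flag is reused on two different strings, and stopping after the first match leaves lastIndex pointing into the middle of the next string. A minimal sketch, where the helper name firstIp and its reset parameter are made up for illustration:

```javascript
// A regex with the g flag remembers where its last match ended (lastIndex).
var ipRe = /[\[\(](\d+\.\d+\.\d+\.\d+)[\]\)]/g;

// Hypothetical helper: returns the first bracketed IP found by one exec call.
function firstIp(line, reset) {
  if (reset) {
    ipRe.lastIndex = 0; // start scanning at the beginning of the new string
  }
  var m = ipRe.exec(line); // one exec, like breaking out of the loop early
  return m ? m[1] : null;
}

var line1 = "Received: from ([195.000.000.0])";
var line2 = "Received: from ([77.000.000.000]) by (6.0.000.000)";

// Without a reset, the second call resumes where the first match ended.
var staleFirst = firstIp(line1, true);
var staleSecond = firstIp(line2, false);

// With a reset, each call scans its string from index 0.
var freshFirst = firstIp(line1, true);
var freshSecond = firstIp(line2, true);

console.log(staleFirst, staleSecond);
console.log(freshFirst, freshSecond);
```

Here staleSecond comes out as the second address in line2, which is the symptom described in the question, while freshSecond is the first one.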

Related

Is there an equivalent of find_first_of c++ string method in javascript

I come from C++ background and currently working on node.js server app.
I want to know if there exists an equivalent of find_first_of C++ string class method in Javascript string.
Basically I'll have a string like
var str ="abcd=100&efgh=101&ijkl=102&mnop=103". The order of the &-separated words could be random. So, I wanted to do something like the following:
str.substr(str.find("mnop=") + string("mnop=").length, str.find_first_of("&,\n'\0'"))
Is there a way to do it in a single line like above?

You may find the search function useful.
"string find first find second".search("find"); // 7
In addition, you may also find this question useful.

There's no direct equivalent, but you always can employ regular expressions:
var str ="abcd=100&efgh=101&ijkl=102&mnop=103";
console.log(str.match(/&mnop=([^&]+)/)[1]);
However, in this specific case, it's better to use the dedicated module:
var qs = require('querystring');
var vars = qs.parse(str);
console.log(vars.mnop);
If you really want a method that behaves like find_first_of, it can be implemented like this:
String.prototype.findFirstOf = function(chars, start) {
var idx = -1;
[].some.call(this.slice(start || 0), function(c, i) {
if(chars.indexOf(c) >= 0)
return idx = i, true;
});
return idx >= 0 ? idx + (start || 0) : -1;
}
console.log("abc?!def??".findFirstOf('?!')); // 3
console.log("abc?!def??".findFirstOf('?!', 6)); // 8
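If you would rather not extend String.prototype, the same idea works as a plain function. The sketch below is only one possible shape: findFirstOf and getParam are hypothetical helper names, and the delimiter set "&,\n" is taken loosely from the question.

```javascript
// Hypothetical helper: index of the first character of `str` (at or after
// `start`) that appears in `chars`, or -1 if there is none.
function findFirstOf(str, chars, start) {
  for (var i = start || 0; i < str.length; i++) {
    if (chars.indexOf(str.charAt(i)) >= 0) {
      return i;
    }
  }
  return -1;
}

// Hypothetical helper: value of `key` in an &-separated string, or null.
function getParam(str, key) {
  var needle = key + "=";
  var from = str.indexOf(needle);
  // Skip hits that are only the tail of a longer key (e.g. "xmnop=").
  while (from > 0 && str.charAt(from - 1) !== "&") {
    from = str.indexOf(needle, from + 1);
  }
  if (from < 0) {
    return null;
  }
  from += needle.length;
  var to = findFirstOf(str, "&,\n", from);
  return str.substring(from, to < 0 ? str.length : to);
}

var str = "abcd=100&efgh=101&ijkl=102&mnop=103";
console.log(getParam(str, "mnop"));
console.log(getParam(str, "abcd"));
```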

Javascript splitting string using only last splitting parameter

An example of what I'm trying to get:
String1 - 'string.co.uk' - would return 'string' and 'co.uk'
String2 - 'random.words.string.co.uk' - would return 'string' and 'co.uk'
I currently have this:
var split= [];
var tld_part = domain_name.split(".");
var sld_parts = domain_name.split(".")[0];
tld_part = tld_part.slice(1, tld_part.length);
split.push(sld_parts);
split.push(tld_part.join("."));
My current code splits at the first separator; I want it to work from the end instead, if possible. Currently it does this:
String1 - 'string.co.uk' - returns 'string' and 'co.uk'
String2 - 'random.words.string.co.uk' - returns 'random' and 'words.string.co.uk'
Any suggestions?

To expand upon elclanrs comment:
function getParts(str) {
var temp = str.split('.').slice(-3) // grabs the last 3 elements
return {
tld_parts : [temp[1],temp[2]].join("."),
sld_parts : temp[0]
}
}
getParts("foo.bar.baz.co.uk") would return { tld_parts : "co.uk", sld_parts : "baz" }
and
getParts("i.got.99.terms.but.a.bit.aint.one.co.uk") would return { tld_parts : "co.uk", sld_parts : "one" }

Try this:
var str = 'string.co.uk'; // or 'random.words.string.co.uk'
var part = str.split('.');
// join the last two labels to get the suffix, e.g. 'co.uk'
var result = part[part.length - 2] + '.' + part[part.length - 1];
alert(result);

One way that comes to mind is the following:
var tld_part = domain_name.split(".");
var name = tld_part[tld_part.length - 3]; // third label from the end, e.g. 'string'
var tld = tld_part[tld_part.length - 2] + "." + tld_part[tld_part.length - 1]; // e.g. 'co.uk'

Depending on your use case, performing direct splits might not be a good idea. For example, how would the above code handle .com or even just localhost? In this respect I would go down the RegExp route:
function stripSubdomains( str ){
var regs; return (regs = /([^.]+)(\.co)?(\.[^.]+)$/i.exec( str ))
? regs[1] + (regs[2]||'') + regs[3]
: str
;
};
Before the Regular Expression Police reprimand me for not being specific enough, a disclaimer:
The above can be tightened as a check against domain names: rather than matching [^.], match only the specific characters allowed in a domain at that point. However, my own perspective on matters like these is to be more open at the point of capture and stricter when filtering later on. This lets you keep an eye on what people might be trying, because you can never be 100% certain your validation isn't blocking valid requests unless you have an army of user testers at your disposal. At the end of the day, it all depends on where this code is being used, so the above is an illustrative example only.
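Since the question asks for the two parts separately, a variant of the same pattern can return them as an object. This is only a sketch under the same loose assumptions: splitDomain is a hypothetical name, and only ".co.<something>" is treated as a two-label suffix.

```javascript
// Hypothetical helper built on the same idea: returns the name and the
// suffix separately. Only ".co.<tld>" counts as a two-label suffix here.
function splitDomain(str) {
  var m = /([^.]+)((?:\.co)?\.[^.]+)$/i.exec(str);
  return m
    ? { name: m[1], suffix: m[2].slice(1) } // drop the leading dot
    : { name: str, suffix: "" };            // no dot at all, e.g. localhost
}

console.log(splitDomain("string.co.uk"));
console.log(splitDomain("random.words.string.co.uk"));
console.log(splitDomain("example.com"));
console.log(splitDomain("localhost"));
```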

Javascript RegEx Mysteries (for a poor C programmer)

This is clearly a RTFM issue, but after I did so repeatedly I just can't get the damn thing to work so there are times when asking for help makes sense:
var text = "KEY:01 VAL:1.10,KEY:02 VAL:2.20,KEY:03 VAL:3.30";
var pattern = '/KEY:(\S+) VAL:([^,]+)/g';
//var pattern = '/KEY:(\S+) VAL:(.?+)(?:(?=,KEY:)|$)/g';
//var pattern = '/KEY:(\S+) VAL:(.+)$/g';
//pattern.compile(pattern);
var kv = null;
var row = 0, col = 0;
while((kv = pattern.exec(text) != null))
{
row = kv[1].charAt(0) - '0';
col = kv[1].charAt(1) - '0';
e = document.getElementById('live').rows[row].cells;
e[col].innerHTML = kv[2].slice(0, kv[2].indexOf(","));
}
kv[1] is supposed to give "01"
kv[2] is supposed to give "1.10"
...and of course kv[] should list all the values of 'text'
to fill the table called 'live'.
But I can't get to have pattern.exec() succeed in doing that.
Where is the glitch?

First, a RegExp literal is delimited by /s and must not be wrapped in ' quotes; in quotes it is just a string, which has no exec method. To get your exec to run properly you should have:
var pattern = /KEY:(\S+) VAL:([^,]+)/g;
Second, you're assigning a boolean to kv which you don't want. The while will obviously only evaluate to true if it's not null so that's redundant. Instead you just need:
while (kv = pattern.exec(text)) {
That should get your code to work as you desire.
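One more thing to watch in the question's loop: with ([^,]+) the second group never contains a comma, so kv[2].indexOf(",") returns -1 and kv[2].slice(0, -1) drops the last character of the value. Putting the pieces together, a sketch of the loop might look like this. It collects the values into an array (the cells name is made up) instead of writing to the 'live' table, so it can run outside a browser.

```javascript
var text = "KEY:01 VAL:1.10,KEY:02 VAL:2.20,KEY:03 VAL:3.30";
var pattern = /KEY:(\S+) VAL:([^,]+)/g; // regex literal, not a quoted string

var cells = []; // collected { row, col, value } entries (illustrative only)
var kv;
while ((kv = pattern.exec(text)) !== null) {
  cells.push({
    row: Number(kv[1].charAt(0)),
    col: Number(kv[1].charAt(1)),
    // [^,]+ already stops before the comma, so kv[2] is used as-is.
    value: kv[2]
  });
}

console.log(cells);
```

In the browser, the body of the loop would assign kv[2] to the innerHTML of the matching cell instead of pushing to an array.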

Regex literals are not quoted. The pattern should be written like this:
var pattern=/KEY:(\S+) VAL:([^,]+)/g;
http://www.w3schools.com/jsref/jsref_regexp_exec.asp

It should be
var pattern = /KEY:(\S+) VAL:([^,]+)/g;
http://www.regular-expressions.info/ is a good place to start with.

javascript and string manipulation w/ utf-16 surrogate pairs

I'm working on a twitter app and just stumbled into the world of utf-8(16). It seems the majority of javascript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide character aware.
I've got this function to parse strings into arrays while preserving the surrogate pairs. Then I'll recode several functions to deal with the arrays rather than strings.
function sortSurrogates(str){
var cp = []; // array to hold code points
while(str.length){ // loop till we've done the whole string
if(/[\uD800-\uDFFF]/.test(str.substr(0,1))){ // test the first character
// High surrogate found low surrogate follows
cp.push(str.substr(0,2)); // push the two onto array
str = str.substr(2); // clip the two off the string
}else{ // else BMP code point
cp.push(str.substr(0,1)); // push one onto array
str = str.substr(1); // clip one from string
}
} // loop
return cp; // return the array
}
My question is, is there something simpler I'm missing? I see so many people reiterating that javascript deals with utf-16 natively, yet my testing leads me to believe, that may be the data format, but the functions don't know it yet. Am I missing something simple?
EDIT:
To help illustrate the issue:
var a = "0123456789"; // U+0030 - U+0039 2 bytes each
var b = "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"; // U+1D7D8 - U+1D7E1 4 bytes each
alert(a.length); // javascript shows 10
alert(b.length); // javascript shows 20
Twitter sees and counts both of those as being 10 characters long.

Javascript uses UCS-2 internally, which is not UTF-16. It is very difficult to handle Unicode in Javascript because of this, and I do not suggest attempting to do so.
As for what Twitter does, you seem to be saying that it is sanely counting by code point not insanely by code unit.
Unless you have no choice, you should use a programming language that actually supports Unicode, and which has a code-point interface, not a code-unit interface. Javascript isn't good enough for that as you have discovered.
It has The UCS-2 Curse, which is even worse than The UTF-16 Curse, which is already bad enough. I talk about all this in my OSCON talk, 🔫 Unicode Support Shootout: 👍 The Good, the Bad, & the (mostly) Ugly 👎.
Due to its horrible Curse, you have to hand-simulate UTF-16 with UCS-2 in Javascript, which is simply nuts.
Javascript suffers from all kinds of other terrible Unicode troubles, too. It has no support for graphemes or normalization or collation, all of which you really need. And its regexes are broken, sometimes due to the Curse, sometimes just because people got it wrong. For example, Javascript is incapable of expressing regexes like [𝒜-𝒵]. Javascript doesn’t even support casefolding, so you can’t write a pattern like /ΣΤΙΓΜΑΣ/i and have it correctly match στιγμας.
You can try to use the XRegExp plugin, but you won’t banish the Curse that way. Only changing to a language with Unicode support will do that, and 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 just isn’t one of those.

I've knocked together the starting point for a Unicode string handling object. It creates a function called UnicodeString() that accepts either a JavaScript string or an array of integers representing Unicode code points and provides length and codePoints properties and toString() and slice() methods. Adding regular expression support would be very complicated, but things like indexOf() and split() (without regex support) should be pretty easy to implement.
var UnicodeString = (function() {
function surrogatePairToCodePoint(charCode1, charCode2) {
return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
}
function stringToCodePointArray(str) {
var codePoints = [], i = 0, charCode;
while (i < str.length) {
charCode = str.charCodeAt(i);
if ((charCode & 0xF800) == 0xD800) {
codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
} else {
codePoints.push(charCode);
}
++i;
}
return codePoints;
}
function codePointArrayToString(codePoints) {
var stringParts = [];
for (var i = 0, len = codePoints.length, codePoint, offset, codePointCharCodes; i < len; ++i) {
codePoint = codePoints[i];
if (codePoint > 0xFFFF) {
offset = codePoint - 0x10000;
codePointCharCodes = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
} else {
codePointCharCodes = [codePoint];
}
stringParts.push(String.fromCharCode.apply(String, codePointCharCodes));
}
return stringParts.join("");
}
function UnicodeString(arg) {
if (this instanceof UnicodeString) {
this.codePoints = (typeof arg == "string") ? stringToCodePointArray(arg) : arg;
this.length = this.codePoints.length;
} else {
return new UnicodeString(arg);
}
}
UnicodeString.prototype = {
slice: function(start, end) {
return new UnicodeString(this.codePoints.slice(start, end));
},
toString: function() {
return codePointArrayToString(this.codePoints);
}
};
return UnicodeString;
})();
var ustr = UnicodeString("f𝌆𝌆bar");
document.getElementById("output").textContent = "String: '" + ustr + "', length: " + ustr.length + ", slice(2, 4): " + ustr.slice(2, 4);
<div id="output"></div>

Here are a couple scripts that might be helpful when dealing with surrogate pairs in JavaScript:
ES6 Unicode shims for ES3+ adds the String.fromCodePoint and String.prototype.codePointAt methods from ECMAScript 6. The ES3/5 fromCharCode and charCodeAt methods do not account for surrogate pairs and therefore give wrong results.
Full 21-bit Unicode code point matching in XRegExp with \u{10FFFF} allows matching any individual code point in XRegExp regexes.

JavaScript string iterators (ES2015 and later) give you whole characters (code points) instead of individual surrogate code units:
>>> [..."0123456789"]
["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"]
["𝟘", "𝟙", "𝟚", "𝟛", "𝟜", "𝟝", "𝟞", "𝟟", "𝟠", "𝟡"]
>>> [..."0123456789"].length
10
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"].length
10
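Building on the iterator, a couple of small helpers cover the length and slicing cases from the question. This is a sketch that assumes an ES2015-capable engine; codePointLength and codePointSlice are hypothetical names, not built-ins.

```javascript
var digits = "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"; // U+1D7D8 to U+1D7E1, two UTF-16 code units each

// Hypothetical helper: count code points rather than UTF-16 code units.
function codePointLength(str) {
  return Array.from(str).length;
}

// Hypothetical helper: slice by code point index instead of code unit index.
function codePointSlice(str, start, end) {
  return Array.from(str).slice(start, end).join("");
}

// Collect the numeric code points with for...of and codePointAt.
var codePoints = [];
for (var ch of digits) {
  codePoints.push(ch.codePointAt(0));
}

console.log(digits.length, codePointLength(digits));
console.log(codePointSlice(digits, 2, 4));
console.log(codePoints[0].toString(16));
```

Note that this works per code point only; characters built from several code points (combining marks, many emoji sequences) still span more than one array element.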

This is along the lines of what I was looking for. It needs better support for the different string functions. As I add to it I will update this answer.
function wString(str){
var T = this; //makes 'this' visible in functions
T.cp = []; //code point array
T.length = 0; //length attribute
T.wString = true; // (item.wString) tests for wString object
//member functions
var sortSurrogates = function(s){ //returns array of utf-16 code points (var avoids leaking a global)
var chrs = [];
while(s.length){ // loop till we've done the whole string
if(/[\uD800-\uDFFF]/.test(s.substr(0,1))){ // test the first character
// High surrogate found low surrogate follows
chrs.push(s.substr(0,2)); // push the two onto array
s = s.substr(2); // clip the two off the string
}else{ // else BMP code point
chrs.push(s.substr(0,1)); // push one onto array
s = s.substr(1); // clip one from string
}
} // loop
return chrs;
};
//end member functions
//prototype functions
T.substr = function(start,len){
if(len){
return T.cp.slice(start,start+len).join('');
}else{
return T.cp.slice(start).join('');
}
};
T.substring = function(start,end){
return T.cp.slice(start,end).join('');
};
T.replace = function(target,str){
//allow wStrings as parameters
if(str.wString) str = str.cp.join('');
if(target.wString) target = target.cp.join('');
return T.toString().replace(target,str);
};
T.equals = function(s){
if(!s.wString){
s = sortSurrogates(s);
T.cp = s;
}else{
T.cp = s.cp;
}
T.length = T.cp.length;
};
T.toString = function(){return T.cp.join('');};
//end prototype functions
T.equals(str)
};
Test results:
// plain string
var x = "0123456789";
alert(x); // 0123456789
alert(x.substr(4,5)) // 45678
alert(x.substring(2,4)) // 23
alert(x.replace("456","x")); // 0123x789
alert(x.length); // 10
// wString object
x = new wString("𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡");
alert(x); // 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
alert(x.substr(4,5)) // 𝟜𝟝𝟞𝟟𝟠
alert(x.substring(2,4)) // 𝟚𝟛
alert(x.replace("𝟜𝟝𝟞","x")); // 𝟘𝟙𝟚𝟛x𝟟𝟠𝟡
alert(x.length); // 10

JavaScript string.format function does not work in IE

I have a JavaScript from this source in a comment of a blog: frogsbrain
It's a string formatter, and it works fine in Firefox, Google Chrome, Opera and Safari.
Only problem is in IE, where the script does no replacement at all. The output in both test cases in IE is only 'hello', nothing more.
Please help me get this script working in IE as well; I'm no JavaScript guru and I just don't know where to start searching for the problem.
I'll post the script here for convenience. All credits go to Terence Honles for the script so far.
// usage:
// 'hello {0}'.format('world');
// ==> 'hello world'
// 'hello {name}, the answer is {answer}.'.format({answer:'42', name:'world'});
// ==> 'hello world, the answer is 42.'
String.prototype.format = function() {
var pattern = /({?){([^}]+)}(}?)/g;
var args = arguments;
if (args.length == 1) {
if (typeof args[0] == 'object' && args[0].constructor != String) {
args = args[0];
}
}
var split = this.split(pattern);
var sub = new Array();
var i = 0;
for (;i < split.length; i+=4) {
sub.push(split[i]);
if (split.length > i+3) {
if (split[i+1] == '{' && split[i+3] == '}')
sub.push(split[i+1], split[i+2], split[i+3]);
else {
sub.push(split[i+1], args[split[i+2]], split[i+3]);
}
}
}
return sub.join('')
}

I think the issue is with this.
var pattern = /({?){([^}]+)}(}?)/g;
var split = this.split(pattern);
JavaScript's regex split function acts differently in IE than in other browsers.
Please take a look at my other post on SO.

var split = this.split(pattern);
string.split(regexp) is broken in many ways on IE (JScript) and is generally best avoided. In particular:
it does not include match groups in the output array
it omits empty strings
alert('abbc'.split(/(b)/)) // IE: a,c (other browsers: a,b,,b,c)
It would seem simpler to use replace rather than split:
String.prototype.format= function(replacements) {
return this.replace(String.prototype.format.pattern, function(all, name) {
return name in replacements? replacements[name] : all;
});
}
String.prototype.format.pattern= /{?{([^{}]+)}}?/g;
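For comparison, here is a replace-based sketch that also keeps two behaviors of the original script: doubled braces act as escapes ({{name}} yields {name}), and positional values work when passed as an array. It is written as a standalone function named format, which is an assumption made here to avoid touching String.prototype; it is not the original author's code.

```javascript
// Hypothetical standalone formatter (not attached to String.prototype).
// `values` may be an array for {0}, {1}, ... or an object for {name}.
function format(template, values) {
  return template.replace(/\{\{|\}\}|\{([^{}]+)\}/g, function (all, name) {
    if (all === "{{") return "{"; // escaped opening brace
    if (all === "}}") return "}"; // escaped closing brace
    // Leave unknown placeholders untouched.
    return Object.prototype.hasOwnProperty.call(values, name)
      ? String(values[name])
      : all;
  });
}

console.log(format("hello {0}", ["world"]));
console.log(format("hello {name}, the answer is {answer}.", { answer: "42", name: "world" }));
console.log(format("{{0}} is {0}", ["x"]));
```

Because it relies only on String.prototype.replace with a callback, it sidesteps the split differences described above.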
