Javascript respecting backslashes in input: negative lookbehind

In Javascript, I have a situation where I get input which I .split(/[ \n\t]/g) into an array. The point is that if a space is directly preceded by a backslash, I don't want the split to happen there.
E.g. is_multiply___spaced_text -> ['is','multiply','','','spaced','text']
But: is\_multiply\___spaced_text -> ['is multiply ','','spaced','text']
(Underscores used for spaces for clarity)
If this weren't JavaScript (whose regexes don't support lookbehinds), I'd just use /(?<!\\)[ \n\t]/g. That doesn't work, so what would be the best way to handle this?

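You can reverse the string, split it using a negative lookahead, and then reverse each string in the resulting array:
// is\ multiply\ followed by three spaces, then "spaced text"
var input = "is\\ multiply\\   spaced text";
var pre_results = input.split('').reverse().join('').split(/[ \t](?!\\)/);
var results = [];
for (var i = 0; i < pre_results.length; i++) {
    // undo the reversal for each piece
    results.push(pre_results[i].split('').reverse().join(''));
}
for (var i = 0; i < results.length; i++) {
    document.write(results[i] + "<br>");
}
In this example, the result should be:
['text', 'spaced', '', 'is\\ multiply\\ ']
The pieces come out in reverse order, so call results.reverse() if you need them in the original order.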
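Since ES2018, JavaScript regular expressions do support lookbehind assertions, so in engines that implement them the pattern from the question can be used directly. The following is a minimal sketch under that assumption. The function names splitUnescaped and unescapeParts are made up for illustration, and the second step (dropping the escaping backslashes) is a guess at the desired output.

```javascript
// Split on space, newline, or tab unless the character is escaped by a
// backslash immediately before it (needs lookbehind support, ES2018+).
function splitUnescaped(input) {
  return input.split(/(?<!\\)[ \n\t]/);
}

// Optionally drop the backslash in front of each escaped whitespace character.
function unescapeParts(parts) {
  return parts.map(part => part.replace(/\\([ \n\t])/g, "$1"));
}

// is\ multiply\ followed by three spaces, then "spaced text"
const parts = splitUnescaped("is\\ multiply\\   spaced text");
console.log(parts);                // ["is\\ multiply\\ ", "", "spaced", "text"]
console.log(unescapeParts(parts)); // ["is multiply ", "", "spaced", "text"]
```

This sketch does not treat a doubled backslash as an escaped backslash, so a space after \\ is still considered escaped.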

"is\_multiply\___spaced_text".replace(/\_/, " ").replace(/_/, " ").split("_");

Related

Without using the reverse() method. How do I maintain the original string order, space and punctuation on string that was reverse?

I am able to use a for loop without using a helper method to reverse the string. But, how do I maintain the original order, space, and punctuation on the string?
Without using the reverse() helper method I am able to reverse the string but I cannot maintain the order of the words and punctuations.
// Reverse preserving the order, punctuation without using a helper
function reverseWordsPreserveOrder(words) {
    let reverse = '';
    for (let i = words.length - 1; i >= 0; i--) {
        reverse += words[i];
    }
    return reverse;
}
console.log(reverseWordsPreserveOrder('Javascript, can be challenging.'))
// output-> .gnignellahc eb nac ,tpircsavaJ
I expect the result to be like this:
// output-> tpircsavaJ, nac eb gnignellahc.

I'd use a regular expression and a replacer function instead: match consecutive word characters with \w+, and in the replacer function, use your for loop to reverse the substring, and return it:
function reverseSingleWord(word) {
    let reverse = '';
    for (let i = word.length - 1; i >= 0; i--) {
        reverse += word[i];
    }
    return reverse;
}
const reverseWordsPreserveOrder = str => str.replace(/\w+/g, reverseSingleWord);
console.log(reverseWordsPreserveOrder('Javascript, can be challenging.'))

If you are trying to do it manually, with no reverse() or regexes, you could:
• Define what you mean by punctuation. This can just be a set, or an ASCII range for letters, etc. But somehow you need to be able to tell letters from non-letters.
• Maintain a cache of the current word, because you are not reversing the whole sentence, just the words, so you need to treat them individually.
With that you can loop through once with something like:
function reverseWordsPreserveOrder(s) {
    // some way to know what is a letter and what is punctuation
    let punct = new Set([',', ' ', '.', '?'])
    // current word reversed
    let word = ''
    // sentence so far
    let sent = ''
    for (let l of s) {
        if (punct.has(l)) {
            sent += word + l
            word = ''
        } else {
            word = l + word
        }
    }
    sent += word
    return sent
}
console.log(reverseWordsPreserveOrder('Javascript, can be challenging.'))
Having said this, it's probably more efficient to use a regex.

If you are only averse to reverse() because you think it can't do the job, here is a more semantic version (based on @CertainPerformance's). In ES6 you can use the spread syntax (...) with the word string, as strings are iterable:
function reverseSingleWord(word) {
    return [...word].reverse().join('');
}
const reverseWordsPreserveOrder = str => str.replace(/\w+/g, reverseSingleWord);
console.log(reverseWordsPreserveOrder('Javascript, can be challenging.'))
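One caveat with the \w-based versions above: \w only matches ASCII letters, digits, and the underscore, so accented letters would be treated as punctuation. A possible variation, assuming an engine with Unicode property escapes (ES2018+), is sketched below. The name reverseUnicodeWords is made up for this example.

```javascript
// Reverse each run of letters or digits in place; everything else stays put.
// \p{L} and \p{N} require the u flag (Unicode property escapes, ES2018+).
function reverseUnicodeWords(str) {
  // Spreading the word iterates by code point, so surrogate pairs stay intact.
  return str.replace(/[\p{L}\p{N}]+/gu, word => [...word].reverse().join(''));
}

console.log(reverseUnicodeWords('Javascript, can be challenging.'));
// tpircsavaJ, nac eb gnignellahc.
console.log(reverseUnicodeWords('Caf\u00E9, na\u00EFve.'));
// éfaC, evïan.
```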

Javascript: Get first number substring for each semi-colon separated substring

I am creating a script of time calculation from MySQL as I don't want to load the scripts on server-side with PHP.
I am getting the data and parsing it using JSON, which gives me a string of values for column and row data. The format of this data looks like:
1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day
I need to split this string by semi-colon, and then extract the first VARCHAR number from before each comma to use that in subsequent calculation.
So for example, I would like to extract the following from the data above:
[1548145153, 1548145209, 1548148072, 1548161279, 1548145161, 1548148082, 1548161291]
I used the following type of for-loop but is not working as I wanted to:
for (var i=0; i < words.length; i++) {
var1 = words[i];
console.log(var1);
}
The string and the for-loop together are like following:
var processData = function(data) {
for(var a = 0; a < data.length; a++) {
var obj = data[a];
var str= obj.report // something like 1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day
words = str.split(',');
words = str.split(';');
for (var i=0; i < words.length; i++) {
var1 = words[i];
var2 = var1[0];
console.log(var2);
}

Here is an approach based on a regular expression:
const str = "1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day";
const ids = str.match(/(?<=;)(\d+)|(^\d+(?=,))/gi)
console.log(ids)
The general idea here is to classify the first VARCHAR value as either:
1. a number sequence directly preceded by a ; character, or, for the edge case,
2. the very first number sequence of the input string, directly followed by a , character.
These two cases are expressed as follows:
1. Match any number sequence that is preceded by a ; using the positive lookbehind rule (?<=;)(\d+), where ; is the character that must directly precede a number sequence \d+ for it to match.
2. Match any number sequence that is the first number sequence of the input string and that has a , directly following it, using the lookahead rule (^\d+(?=,)), where \d+ is the number sequence and , is the character that must directly follow it for it to match.
Building blocks 1 and 2 are combined using the | operator to achieve the final result.
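If lookbehind support is a concern (it arrived with ES2018), the same two cases can be folded into one pattern with a capture group. This is only a sketch: the variable names are illustrative, the sample string is shortened from the question's data, and String.prototype.matchAll needs an ES2020 engine.

```javascript
// Shortened sample in the same format as the question's data.
const reportText = "1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day";

// (?:^|;) anchors each match at the start of the string or right after a ';'.
// The capture group then holds the first number of each group.
const leadingIds = Array.from(
  reportText.matchAll(/(?:^|;)(\d+)/g),
  match => match[1]
);

console.log(leadingIds); // ["1548145153", "1548145209", "1548148072"]
```

The results are strings here, just as with str.match(); convert them with Number() if they are needed for arithmetic.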

The first problem is that you overwrite words with the result of str.split(';'), so it won't hold what you expect. To split the string into chunks, split by ; first, then iterate over the resulting array and, within the loop, split by ,.
const str= "1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day";
const lines = str.split(';');
lines.forEach(line => {
    const parts = line.split(',');
    console.log(parts[0]);
});
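To end up with the array of numbers the question asks for, rather than just logging each value, the same split-then-split idea can be written with map. This is a minimal sketch. The sample is shortened from the question's data, and converting with Number() is an assumption about how the values will be used afterwards.

```javascript
// Shortened sample in the same format as the question's data.
const report = "1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day";

const firstNumbers = report
  .split(';')                                  // one entry per group
  .map(group => Number(group.split(',')[0]));  // first field of each group

console.log(firstNumbers); // [1548145153, 1548145209, 1548148072]
```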

What you are doing is not correct: you have to split the string twice, as there are two separators, a comma and a semicolon.
I think you need a nested loop for that.
var str = "1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day"
let words = str.split(';');
for (var i = 0; i < words.length; i++) {
    let varChars = words[i].split(',');
    // varChars[0] is the first number of the group
    for (var j = 0; j < varChars.length; j++) {
        console.log(varChars[j]);
    }
}
I hope this helps. Please don't forget to mark the answer.

How do I determine the width of the result of codePointAt?

I'm trying to loop over the Unicode characters in a Javascript string, that I assume is encoded with UTF-16.
It is my understanding that UTF-16 is variable width. That is, a single Unicode character may be split across multiple 16-bit code units. I can use s.codePointAt(i) to get the Unicode code point beginning at a given index. But once I have it, how do I know how far to advance i?
Roughly, what is getWidth here? Is it simply c > Math.pow(2, 16)?
for (var i = 0; i < s.length;) {
    var c = s.codePointAt(i);
    // do some operation with c
    i = i + getWidth(c)
}
Is there a standard library function I can use to determine how far to advance? Or a way to iterate over the Unicode code points in a string?

Is there a standard […] way to iterate over the Unicode code points in a string?
Yes, since ES6 you can simply iterate any string to get its code points:
for (const character of string) {
    const codepoint = character.codePointAt(0);
    // do some operation with codepoint
}

A simple approach:
for (var i = 0; i < s.length; ++i) {
    var c = s.codePointAt(i);
    // do some operation with c
    if (s.charCodeAt(i) != c) {
        ++i; // step past the next sixteen bits of the surrogate pair
    }
}
(where the value of c is the Unicode codepoint, not the character).
If you want to split the string into an array of Unicode characters you can make use of the string iterator invoked by the spread operator introduced in ES6:
var array = [...s];
In pre-ES6 browsers the start of a surrogate pair can be identified in order to skip the second part:
for (var i = 0; i < s.length; ++i) {
    var k = s.charCodeAt(i);
    if (k < 0xD800 || k > 0xDBFF) {
        var c = s[i]; // character in BMP
    }
    else {
        c = s.substring(i, i + 2); // use surrogate pair
        ++i;
    }
    // do something with c
    console.log(c)
}

See: http://www.unicode.org/glossary/#supplementary_code_point
Basically, if your code point is 0x010000 or above, you are dealing with a supplementary character, which takes two UTF-16 code units (a surrogate pair).
const MIN_SUPPLEMENTARY_CODE_POINT = 0x010000;
function charCount(codePoint) {
    return codePoint >= MIN_SUPPLEMENTARY_CODE_POINT ? 2 : 1;
}
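Plugged into the loop from the question, that width check could look roughly like this. It is a sketch only: codePointWidth and codePointsOf are illustrative names, not standard library functions.

```javascript
// Number of UTF-16 code units needed for a given code point.
function codePointWidth(codePoint) {
  return codePoint >= 0x10000 ? 2 : 1;
}

// Collect every code point of s, advancing by 1 or 2 code units each step.
function codePointsOf(s) {
  const result = [];
  for (let i = 0; i < s.length; ) {
    const c = s.codePointAt(i);
    result.push(c);
    i += codePointWidth(c);
  }
  return result;
}

// "a", U+1F4A9 (stored as a surrogate pair), "b"
console.log(codePointsOf("a\u{1F4A9}b")); // [97, 128169, 98]
```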

JavaScript predates UTF-16 and was originally built around an older encoding called UCS-2, which is very similar but has no surrogate pairs and so cannot represent any character that needs more than two bytes.
If you are stepping through a string looking at code points, you can look at the code point value itself: if the value is 2^16 (0x10000) or greater, you have to advance 2 string characters, otherwise advance 1 string character.
You might try the ES6 spread syntax, which works really well at splitting strings into characters, even if those characters are high-order.
// High-order Unicode character
const k = '💩';
// Takes two UTF-16 code units (four bytes), so this logs 2
console.log(k.length);
const chars = [...k];
// But it's only one character, so this logs 1
console.log(chars.length);

Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")

Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode).
JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16) but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane).
To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively.
I'm looking for how to split a js string by codepoint, whether the codepoints require one or two JavaScript "characters" (code units).
Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.
For the purposes of this question I do not require splitting by grapheme cluster.

@bobince's answer has (luckily) become a bit dated; you can now simply use
var chars = Array.from( text )
to obtain a list of single-codepoint strings which does respect astral / 32bit / surrogate Unicode characters.

Along the lines of @John Frazer's answer, one can use this even more succinct form of string iteration:
const chars = [...text]
e.g., with:
const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
const chars = [...text] // ["A", "𝑨", "B", "𝑩", "C", "𝑪"]

In ECMAScript 6 you'll be able to use a string as an iterator to get code points, or you could search a string for /./ug, or you could call codePointAt(i) repeatedly.
Unfortunately for..of syntax and regexp flags can't be polyfilled, and calling a polyfilled codePointAt() would be super slow (O(n²)), so we can't realistically use this approach for a while yet.
So doing it the manual way:
String.prototype.toCodePoints = function() {
    var chars = [];
    for (var i = 0; i < this.length; i++) {
        var c1 = this.charCodeAt(i);
        // high surrogate with at least one more code unit after it
        if (c1 >= 0xD800 && c1 < 0xDC00 && i + 1 < this.length) {
            var c2 = this.charCodeAt(i + 1);
            // followed by a low surrogate: combine the pair into one code point
            if (c2 >= 0xDC00 && c2 < 0xE000) {
                chars.push(0x10000 + ((c1 - 0xD800) << 10) + (c2 - 0xDC00));
                i++;
                continue;
            }
        }
        chars.push(c1);
    }
    return chars;
}
For the inverse to this see https://stackoverflow.com/a/3759300/18936

Another method using codePointAt:
String.prototype.toCodePoints = function () {
    var arCP = [];
    for (var i = 0; i < this.length; i += 1) {
        var cP = this.codePointAt(i);
        arCP.push(cP);
        if (cP >= 0x10000) {
            i += 1;
        }
    }
    return arCP;
}
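For comparison, here is a sketch of the same conversion without touching String.prototype, together with the inverse direction via the standard String.fromCodePoint. The helper names toCodePoints and fromCodePoints are chosen for this example only.

```javascript
// String -> array of numeric code points.
function toCodePoints(text) {
  // Array.from uses the string iterator, which yields one code point per step.
  return Array.from(text, ch => ch.codePointAt(0));
}

// Array of numeric code points -> string.
function fromCodePoints(codePoints) {
  return String.fromCodePoint(...codePoints);
}

const text = 'A\uD835\uDC68B';
const codePoints = toCodePoints(text);
console.log(codePoints.map(cp => cp.toString(16))); // ["41", "1d468", "42"]
console.log(fromCodePoints(codePoints) === text);   // true
```

Spreading a very long array into String.fromCodePoint can exceed the engine's argument limit, so large inputs may need to be converted in chunks.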

URL extraction from string

I found a regular expression that is supposed to capture URLs, but it doesn't capture some URLs.
$("#links").change(function() {
    //var matches = new array();
    var linksStr = $("#links").val();
    var pattern = new RegExp("^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$","g");
    var matches = linksStr.match(pattern);
    for(var i = 0; i < matches.length; i++) {
        alert(matches[i]);
    }
})
It doesn't capture this url (I need it to):
http://www.wupload.com/file/63075291/LlMlTL355-EN6-SU8S.rar
But it captures this
http://www.wupload.com

Several things:
1. The main reason it didn't work is that when passing a string to RegExp(), you need to escape the backslashes. So this:
"^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$"
Should be:
"^(https?:\/\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\/\\w \\.-]*)*\/?$"
2. Next, you said that FF reported, "Regular expression too complex". This suggests that linksStr is several lines of URL candidates. Therefore, you also need to pass the m flag to RegExp().
3. The existing regex is blocking legitimate values, e.g. "HTTP://STACKOVERFLOW.COM". So, also use the i flag with RegExp().
4. Whitespace always creeps in, especially in multiline values. Use a leading \s* and $.trim() to deal with it.
5. Relative links, e.g. /file/63075291/LlMlTL355-EN6-SU8S.rar, are not allowed?
Putting it all together (except for item 5), it becomes:
var linksStr = "http://www.wupload.com/file/63075291/LlMlTL355-EN6-SU8S.rar \n"
+ " http://XXXupload.co.uk/fun.exe \n "
+ " WWW.Yupload.mil ";
var pattern = new RegExp (
"^\\s*(https?:\/\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\/\\w \\.-]*)*\/?$"
, "img"
);
var matches = linksStr.match(pattern);
for (var J = 0, L = matches.length; J < L; J++) {
    console.log ( $.trim (matches[J]) );
}
Which yields:
http://www.wupload.com/file/63075291/LlMlTL355-EN6-SU8S.rar
http://XXXupload.co.uk/fun.exe
WWW.Yupload.mil

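Why not just do:
URLS = str.match(/https?:[^\s]+/ig);

(https?:\/\/)([a-zA-Z0-9\/._%&=-]*)
This will locate URLs in text, as long as they use only the characters listed in the class (the hyphen goes last so that it is taken literally).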
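Another option is to keep the regex loose and let the built-in URL parser decide what is valid. The sketch below assumes that only absolute http(s) URLs are wanted and that the global URL constructor is available (modern browsers and Node.js). The function name extractUrls is made up for this example.

```javascript
// Pull http(s) URL candidates out of free text and keep only those that the
// built-in URL parser accepts.
function extractUrls(text) {
  const candidates = text.match(/https?:\/\/\S+/gi) || [];
  return candidates.filter(candidate => {
    try {
      new URL(candidate); // throws a TypeError on an invalid URL
      return true;
    } catch (err) {
      return false;
    }
  });
}

const sample = "See http://www.wupload.com/file/63075291/LlMlTL355-EN6-SU8S.rar and http://www.wupload.com for details";
console.log(extractUrls(sample));
```

Because \S+ runs up to the next whitespace, trailing punctuation such as a closing parenthesis or a full stop ends up inside the match and would need to be trimmed separately.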
