Finding punctuation marks in text with string-methods

Finding punctuation marks in text with string-methods - javascript

How can I find out where a punctuation mark (?!;.) or a "<" character occurs in a string? I don't want to use an array or compare individual letters; I'd like to solve it with string methods. Something like this:
var text = corpus.substr(0, corpus.indexOf("."));
If I explicitly specify a character such as a period, it works fine. The problem is that when I parse a long text in a loop, I don't know in advance how each sentence ends, whether with a question mark or an exclamation point. I tried the following, but it doesn't work:
var text = corpus.substr(0, corpus.indexOf(corpus.search(".")));
I want to loop through a long string and treat every punctuation mark I find as an end-of-sentence character.
Do you know how I can solve my problem?

You can start with a RegExp and weigh it against going character by character and comparing character codes. split is another way (posted in another answer).
RegExp solution
function getTextUpToPunc( text ) {
const regExp = /^.+(\!|\?|\.)/mg;
for (let match; (match = regExp.exec( text )) !== null;) {
console.log(match);
}
}
getTextUpToPunc(
"what a chunky funky monkey! this is really someting else"
)
The key advantage here is that you do not have to process the entire string at once: each call to regExp.exec( text ) returns the next match, so you control the iteration and can stop early.
The split solution posted in the other answer would work, but split always processes the entire string. Typically that is not an issue, but if your strings are many thousands of characters long and you do this operation a lot, then it makes sense to think about performance.
And if this function will be run many times, a small performance improvement would be to memoize the RegExp creation:
const regExp = /^.+(\!|\?|\.)/mg;
Into something like this
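function getTextUpToPunc( text ) {
  // Cache the RegExp on the function object; `this` is undefined in strict-mode functions.
  if( !getTextUpToPunc._regExp ) getTextUpToPunc._regExp = /^.+(\!|\?|\.)/mg;
  const regExp = getTextUpToPunc._regExp;
  for (let match; (match = regExp.exec( text )) !== null;) {
    console.log(match);
  }
}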

Use a regular expression:
var text = corpus.split(/[?!;.<]/);
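To keep each sentence together with whichever mark ended it, as the question asks, one option is to match the sentences instead of splitting on the delimiters. The following is a minimal sketch, not the approach from either answer; the function name splitSentences and the set of terminators (. ! ? ; and <) are assumptions taken from the question's wording.

```javascript
// Hypothetical helper: returns each sentence together with its terminating mark.
function splitSentences(corpus) {
  // One or more non-terminator characters, then an optional terminator
  // (the last fragment of the text may not have one).
  const sentence = /[^.!?;<]+[.!?;<]?/g;
  return (corpus.match(sentence) || [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

console.log(JSON.stringify(splitSentences("Is it late? Yes! Go home. Now")));
// ["Is it late?","Yes!","Go home.","Now"]
```

The last character of each returned item tells you how that sentence ended, so no second lookup with indexOf is needed.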

Related

RegEx loop behaviour in TypeScript - Angular

I have a strange problem with a regex. I'm trying to check the user input in a contentEditable div after each keydown, and if it matches, for example, "hello" or "status", the function should return the text with the match wrapped as <span style="color: purple">hello</span>. It works properly with unique words and phrases, or on paste, but when I declare both "hello" and "hello world" as keywords and type them into the contentEditable, the regex matches only "hello", even if "hello world" comes first in my array of strings.
Here is the code of my function:
searchByRegEx(wordsArr: string[], sentence: string): string {
let matchingWords = []; // matching words array
wordsArr.forEach((label) => {
const regEx = new RegExp(label, 'gi');
regEx.lastIndex = 0
let match = regEx.exec(sentence);
while (match) {
// console.log(match) - results of this console.log below
matchingWords.push(match[0]);
match = regEx.exec(sentence);
}
});
matchingWords = matchingWords.sort(function (a, b) {
return b.length - a.length;
});
matchingWords.forEach((word) => {
sentence = sentence.replaceAll(
word,
`<span style='color:${InputColorsHighlightValue.PURPLE}'>${word}</span>`
);
});
return sentence
}
}
And here is how I use it:
if (this.labels) {
textToShow = this.searchByRegEx(['hello', 'hello world'], textToShow)
}
This is how it looks in DevTools; the regex matches properly, but ONLY on paste:
And here is what happens when I type it manually: it checks on every keydown, but can't match both hello and hello world. As you can see, the input to the regex is the same as above:
I am struggling with this functionality and would appreciate any helpful advice.
Live version in stack blitz

There are a few things wrong with the actual regex in your example. If /hello?/gi is your input for a regex, consider the following:
Your function already applies the gi flags (new RegExp(label, 'gi')).
There are two ways to create a regex in JavaScript: new RegExp('hello') or the literal /hello/. When you put a regex between slashes, you are using the second one. Don't combine the two: a string such as '/hello/gi' passed to new RegExp is treated as a pattern containing literal slashes, so it won't match what you expect.
The question mark makes the character before it optional, so hello? matches either 'hell' or 'hello'. If you want to match exactly 'hello', remove the ?.
If you just want to match a plain string, not a pattern, you don't need a regex. You can use one, but it's unnecessary obfuscation and probably computationally more demanding (i.e. lower performance).
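When one keyword is a prefix of another, as with "hello" and "hello world" in the question, one common approach is to build a single alternation with the longer keywords first, so that the engine tries "hello world" before "hello" at each position. The sketch below assumes that approach; highlightKeywords, escapeRegExp, and the default color value are hypothetical names and values, not taken from the question's code.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wraps every keyword occurrence in a colored <span>.
function highlightKeywords(keywords, sentence, color = "purple") {
  if (keywords.length === 0) return sentence;
  // Longest first, so "hello world" wins over "hello" at the same position.
  const alternation = keywords
    .slice()
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");
  const pattern = new RegExp(alternation, "gi");
  // A single pass never re-wraps text inside a span that was just inserted.
  return sentence.replace(
    pattern,
    (match) => `<span style='color:${color}'>${match}</span>`
  );
}

console.log(highlightKeywords(["hello", "hello world"], "Hello world, hello!"));
// <span style='color:purple'>Hello world</span>, <span style='color:purple'>hello</span>!
```

Doing the replacement in one pass also sidesteps the situation where a second replaceAll for "hello" runs over text that was already wrapped for "hello world".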

regex to remove certain characters at the beginning and end of a string

Let's say I have a string like this:
...hello world.bye
But I want to remove the first three dots and replace .bye with !
So the output should be
hello world!
it should only match if both conditions apply (... at the beginning and .bye at the end)
And I'm trying to use js replace method. Could you please help? Thanks

First match the dots, capture and lazy-repeat any character until you get to .bye, and match the .bye. Then, you can replace with the first captured group, plus an exclamation mark:
const str = '...hello world.bye';
console.log(str.replace(/\.\.\.(.*?)\.bye/, '$1!'));
The lazy-repeat is there to ensure you don't match too much, for example:
const str = `...hello world.bye
...Hello again! Goodbye.`;
console.log(str.replace(/\.\.\.(.*?)\.bye/g, '$1!'));
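The difference between a greedy (.*) and a lazy (.*?) group shows up when a single line contains more than one candidate. A small sketch of that case follows; the sample string is made up for illustration.

```javascript
const sample = "...one.bye ...two.bye";

// Greedy: (.*) runs to the last ".bye" on the line and swallows the middle.
const greedy = sample.replace(/\.\.\.(.*)\.bye/g, "$1!");

// Lazy: (.*?) stops at the first ".bye" it reaches, so each item is handled on its own.
const lazy = sample.replace(/\.\.\.(.*?)\.bye/g, "$1!");

console.log(greedy); // one.bye ...two!
console.log(lazy);   // one! two!
```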

You don't actually need a regex to do this. Although it's a bit inelegant, the following should work fine (obviously the function can be called whatever makes sense in the context of your application):
function manipulate(string) {
  if (string.slice(0, 3) === "..." && string.slice(-4) === ".bye") {
    // Drop the three leading dots and the trailing ".bye".
    return string.slice(3, -4) + "!";
  }
  return string;
}
(Apologies if I made any stupid errors with indexing there, but the basic idea should be obvious.)
This, to me at least, has the advantage of being easier to reason about than a regex. Of course if you need to deal with more complicated cases you may reach the point where a regex is best - but I personally wouldn't bother for a simple use-case like the one mentioned in the OP.

Your regex would be
const rx = /\.\.\.([\s\S]*?)\.bye/g
const out = '\n\nfoobar...hello world.bye\nfoobar...ok.bye\n...line\nbreak.bye\n'.replace(rx, `$1!`)
console.log(out)
In English: find three dots, then anything (matched lazily) in a capture group, ending with .bye.
The replacement uses the first captured group, $1, and appends !; the replacement string is written as a template literal.

An arguably simpler solution:
const str = '...hello world.bye'
const newStr = /\.\.\.(.+)\.bye/.exec(str)
const formatted = newStr ? newStr[1] + '!' : str
console.log(formatted)
If the string doesn't match the regex it will just return the string.

Uppercase for each new word swedish characters and html markup

I was pointed to this post, which does not seem to meet my criteria:
Replace a Regex capture group with uppercase in Javascript
I am trying to make a regex that will:
format a string by adding uppercase for the first letter of each word and lower case for the rest of the characters
ignore HTML markup
Accept swedish characters (åäöÅÄÖ)
Say I've got this string:
<b>app</b>le store östersund
Then I want it to be (changes marked by uppercase characters)
<b>App</b>le Store Östersund
I've been playing around with it and the closest I've got is the following:
(?!([^<])*?>)[åäöÅÄÖ]|\s\b\w
Resulted in
<b>app</b>le Store Östersund
Or this
/(?!([^<])*?>)[åäöÅÄÖ]|\S\b\w/g
Resulted in
<B>App</B>Le store Östersund
Here's a fiddle:
http://refiddle.com/refiddles/598aabef75622d4a531b0000
Any help or advice is much appreciated.

It is not possible to do this with regexp alone, since regexp doesn't understand HTML structure. [*] Instead, we need to process each text node, and carry through our logic for what is the beginning of the word in case a word continues across different text nodes. A character is at start of the word if it is preceded by a whitespace, or if it is at the start of the string and it is either the first text node, or the previous text node ended in whitespace.
function htmlToTitlecase(html, letters) {
let div = document.createElement('div');
let re = new RegExp("(^|\\s)([" + letters + "])", "gi");
div.innerHTML = html;
let treeWalker = document.createTreeWalker(div, NodeFilter.SHOW_TEXT);
let startOfWord = true;
while (treeWalker.nextNode()) {
let node = treeWalker.currentNode;
node.data = node.data.replace(re, function(match, space, letter) {
if (space || startOfWord) {
return space + letter.toUpperCase();
} else {
return match;
}
});
startOfWord = node.data.match(/\s$/);
}
return div.innerHTML;
}
console.log(htmlToTitlecase("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
[*] Maybe possible, but even if so, it would be horribly ugly, since it would need to cover an awful amount of corner cases. Also might need a stronger RegExp engine than JavaScript's, like Ruby's or Perl's.
EDIT: In reply to this comment:
"Even if just specifying really simple HTML tags? The only ones I actually need to cover at the moment are <b> and </b>."
This was not specified in the question. The solution is general enough to work for any markup (including simple tags). But...
function simpleHtmlToTitlecaseSwedish(html) {
return html.replace(/(^|\s)(<\/?b>|)([a-zåäö])/gi, function(match, space, tag, letter) {
return space + tag + letter.toUpperCase();
});
}
console.log(simpleHtmlToTitlecaseSwedish("<b>app</b>le store östersund"));
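Instead of listing the Swedish letters by hand, engines that support ES2018 Unicode property escapes can match any letter with \p{L} under the u flag. The following is a sketch of the simplified variant rewritten that way; the name simpleHtmlToTitlecase is hypothetical, and like the version above it assumes the only markup present is <b> and </b>.

```javascript
// Hypothetical variant: \p{L} matches any Unicode letter, so å, ä, ö need no special listing.
function simpleHtmlToTitlecase(html) {
  return html.replace(/(^|\s)(<\/?b>|)(\p{L})/gu, function (match, space, tag, letter) {
    // Keep the whitespace and optional tag, uppercase only the first letter.
    return space + tag + letter.toUpperCase();
  });
}

console.log(simpleHtmlToTitlecase("<b>app</b>le store östersund"));
// <b>App</b>le Store Östersund
```

Like the original, this only uppercases first letters; it does not lowercase the remaining characters of each word.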

I have a solution that uses almost only regex. It may not be the most intuitive way to do it, but it should be effective, and I find it fun :)
You have to append to the end of your string every lowercase character followed by its uppercase counterpart, like this (it must also be preceded by a space for my regex):
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ
(I don't know which letters are missing; I know nothing about the Swedish alphabet, sorry. I'm counting on you to correct that!)
Then you can use the following regex:
(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$
Replace with:
$1$3
Test it here
Here is working JavaScript code:
// Initialization
var regex = /(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$/g;
var string = "test <b when=\"2>1\">ap<i>p</i></b>le store östersund";
// Processing
var result = string + " aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ";
result = result.replace(regex, "$1$3");
// Display result
console.log(result);
Edit : I forgot to handle first word of the string, it's corrected :)

How to split a string by a character not directly preceded by a character of the same type?

Let's say I have a string: "We.need..to...split.asap". What I would like to do is to split the string by the delimiter ., but I only wish to split by the first . and include any recurring .s in the succeeding token.
Expected output:
["We", "need", ".to", "..split", "asap"]
In other languages, I know that this is possible with a look-behind /(?<!\.)\./ but Javascript unfortunately does not support such a feature.
I am curious to see your answers to this question. Perhaps there is a clever use of look-aheads that presently evades me?
I was considering reversing the string, then re-reversing the tokens, but that seems like too much work for what I am after... plus controversy: How do you reverse a string in place in JavaScript?
Thanks for the help!

Here's a variation of the answer by guest271314 that handles more than two consecutive delimiters:
var text = "We.need.to...split.asap";
var re = /(\.*[^.]+)\./;
var items = text.split(re).filter(function(val) { return val.length > 0; });
It uses the detail that if the split expression includes a capture group, the captured items are included in the returned array. These capture groups are actually the only thing we are interested in; the tokens are all empty strings, which we filter out.
EDIT: Unfortunately there's perhaps one slight bug with this. If the text to be split starts with a delimiter, that will be included in the first token. If that's an issue, it can be remedied with:
var re = /(?:^|(\.*[^.]+))\./;
var items = text.split(re).filter(function(val) { return !!val; });
(I think this regex is ugly and would welcome an improvement.)

You can do this without any lookaheads:
var subject = "We.need.to....split.asap";
var regex = /\.?(\.*[^.]+)/g;
var matches, output = [];
while(matches = regex.exec(subject)) {
output.push(matches[1]);
}
document.write(JSON.stringify(output));
It seemed like it'd work in one line, as it did on https://regex101.com/r/cO1dP3/1, but it had to be expanded into the loop above because, with the /g flag, .match returns only the full matches and not the capturing groups (i.e. the correct data was in the capturing groups, but we couldn't access it without the loop).
See: JavaScript Regex Global Match Groups
An alternative solution with the original one liner (plus one line) is:
document.write(JSON.stringify(
"We.need.to....split.asap".match(/\.?(\.*[^.]+)/g)
.map(function(s) { return s.replace(/^\./, ''); })
));
Take your pick!
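In engines that provide String.prototype.matchAll (ES2020), the capture groups of a global regex can be read directly, which removes the need for the explicit exec loop. A minimal sketch that reuses the regex from this answer; the function name splitKeepingRepeats is an assumption.

```javascript
// Hypothetical helper: returns group 1 of every match of the answer's regex.
function splitKeepingRepeats(subject) {
  const regex = /\.?(\.*[^.]+)/g;
  // matchAll yields one match object per hit, each with its capture groups.
  return Array.from(subject.matchAll(regex), (m) => m[1]);
}

console.log(JSON.stringify(splitKeepingRepeats("We.need.to....split.asap")));
// ["We","need","to","...split","asap"]
```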

Note: This answer can't handle more than 2 consecutive delimiters, since it was written according to the example in the revision 1 of the question, which was not very clear about such cases.
var text = "We.need.to..split.asap";
// split "." if followed by "."
var res = text.split(/\.(?=\.)/).map(function(val, key) {
// if `val[0]` does not begin with "." split "."
// else split "." if not followed by "."
return val[0] !== "." ? val.split(/\./) : val.split(/\.(?!.*\.)/)
});
// concat arrays `res[0]` , `res[1]`
res = res[0].concat(res[1]);
document.write(JSON.stringify(res));
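The question observes that JavaScript lacked look-behind. Engines that implement ES2018 do support it, so where that can be assumed, the split the question had in mind can be written directly. A minimal sketch under that assumption:

```javascript
// Split on a "." that is not immediately preceded by another ".".
const tokens = "We.need..to...split.asap".split(/(?<!\.)\./);

console.log(JSON.stringify(tokens));
// ["We","need",".to","..split","asap"]
```

In older engines without look-behind, the regex literal itself is a syntax error, so the workarounds above remain relevant there.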

change regex to match some words instead of all words containing PRP

This regex matches all characters between whitespace if the word contains PRP.
How can I get it to match all words (runs of characters between whitespace) that contain PRP, but not those that contain "me" in any case?
So match all words containing PRP, but not containing ME or me.
Here is the regex to match words containing PRP: \S*PRP\S*

You can use negative lookahead for this:
(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)
Working Demo
PS: Use group #1 for your matched word.
Code:
var re = /(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)/;
var s = 'word abcPRP def';
var m = s.match(re);
if (m) console.log(m[1]); //=> abcPRP
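The pattern above rejects only the exact spellings ME and me, and the snippet returns just the first match. If mixed case such as Me should also be rejected and every matching word is wanted, the same lookahead idea can be extended. The sketch below rests on those two assumptions; the names prpNotMe and findPrpWords are hypothetical.

```javascript
// Assumption: "me" should be rejected in any letter case (me, ME, Me, mE).
const prpNotMe = /(?:^|\s)((?!\S*[mM][eE])\S*PRP\S*)/g;

// Hypothetical helper: collects group 1 (the word itself) from every match.
function findPrpWords(text) {
  return Array.from(text.matchAll(prpNotMe), (m) => m[1]);
}

console.log(JSON.stringify(findPrpWords("MELOLPRP PRPRP PRPthemeok PRPmhm aPRPMe")));
// ["PRPRP","PRPmhm"]
```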

Instead of using complicated regular expressions which would be confusing for almost anyone who's reading it, why don't you break up your code into two sections, separating the words into an array and filtering out the results with stuff you don't want?
function prpnotme(w) {
var r = w.match(/\S+/g);
if(r == null)
return [];
var i=0;
while(i<r.length) {
if(!r[i].includes('PRP') || r[i].toLowerCase().includes('me'))
r.splice(i,1);
else
i++;
}
return r;
}
console.log(prpnotme('whattttttt ok')); // []
console.log(prpnotme('MELOLPRP PRPRP PRPthemeok PRPmhm')); // ['PRPRP', 'PRPmhm']
For a very good reason why this is important, imagine that you later want to add more logic. You're much more likely to make a mistake when modifying a complicated regex to make it even more complicated. This way it's done with simple logic that makes perfect sense when reading each predicate, no matter how much you add on.
