How to detect sentences without comments and markdown using JavaScript regex?

Problem
I have a piece of text. It can contain every character from ASCII 32 (space) to ASCII 126 (tilde) and including ASCII 9 (horizontal tab).
The text may contain sentences. Every sentence ends with a dot, question mark, or exclamation mark, directly followed by a space.
The text may contain basic Markdown styling, that is: bold text (**, also __), italic text (*, also _) and strikethrough (~~). Markdown may occur inside sentences (e.g. **this** is a sentence.) or outside them (e.g. **this is a sentence!**). Markdown may not span a sentence boundary, that is, there may not be a situation like this: **sentence. sente** nce.. Markdown may enclose more than one whole sentence, that is, there may be a situation like this: **sentence. sentence.**.
It can also contain two sequences of characters: <!-- and -->. Everything between these sequences is treated as a comment (like in HTML). Comments can occur at any position in the text, but cannot contain newline characters (I hope that on Linux it is just ASCII 10).
I want to detect all sentences in JavaScript, and for each of them put its length after the sentence in a comment, like this: sentence.<!-- 9 -->. I do not mind much whether the length includes the Markdown tags or not, but it would be nice if it did not.
What have I done so far?
So far, with help of this answer, I have prepared the following regex for detecting sentences. It mostly fits my needs – except that it includes comments.
const basicSentence = /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?]/gi;
I have also prepared the following regex for detecting comments. It also works as expected, at least in my own tests.
const comment = /<!--.*?-->/gi;
Example
To better see what I want to achieve, let us have an example. Say, I have the following piece of text:
foo0
b<!-- comment -->ar.
foo1 bar?
<!-- comment -->
foo2bar!
(There is also a newline at the end of it, but I do not know how to add an empty line in Stack Overflow markdown.)
And the expected result is:
foo0
b<!-- comment -->ar.<!-- 10 -->
foo1 bar?<!-- 9 -->
<!-- comment -->
foo2bar!<!-- 12 -->
(This time, there is also no newline at the end.)
UPDATE: Sorry, I have corrected the expected result in the example.

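Pass a callback to .replace that removes all comments from each match and then appends the length of what is left. Note that the match can start with the space or newline that precedes the sentence; call .trim() on it first if that character should not be counted: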
const input = `foo0
b<!-- comment -->ar.
foo1 bar?
<!-- comment -->
foo2bar!
`;
const output = input.replace(
  /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?]/g,
  (match) => {
    const matchWithoutComments = match.replace(/<!--.*?-->/g, '');
    return `${match}<!-- ${matchWithoutComments.length} -->`;
  }
);
console.log(output);
Of course, you can use a similar pattern to replace markdown notation with the inner text content as well, if you wish:
.replace(/([*_]{1,2}|~~)((.|\n)*?)\1/g, '$2')
(due to nested and possibly unbalanced tags, which regex is not very good at working with, you may have to repeat that line until no further replacements can be found)
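The repetition mentioned above can be sketched as a small loop. This is only an illustration: the helper name stripMarkdown is hypothetical, and it simply reapplies the same pattern until the text stops changing.

```javascript
// Hypothetical helper (not part of the original answer): reapply the
// tag-stripping pattern until a pass makes no further replacement.
function stripMarkdown(text) {
  const tagPattern = /([*_]{1,2}|~~)((.|\n)*?)\1/g;
  let previous;
  let current = text;
  do {
    previous = current;
    // Replace each tagged run with its inner text only.
    current = previous.replace(tagPattern, '$2');
  } while (current !== previous);
  return current;
}

console.log(stripMarkdown('**_nested_** and ~~struck~~ text.'));
```

Calling stripMarkdown on the comment-free match before reading its length would give a length that ignores the markdown tags. Unbalanced markers are left in place, because the pattern only matches a marker that has a matching closing marker.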
Also, per comment, your current regular expression expects every sentence to end in ., !, or ?, so the ! in a comment's <!-- can be treated as the end of a (short) sentence. One option is to add a lookahead at the very end of the regex for whitespace (a space or a newline), the end of the input, or a markdown marker (*, _ or ~):
const input = `foo0
b<!-- comment -->ar.
foo1 bar?
<!-- comment -->
foo2bar!
<!-- comment -->`;
const output = input.replace(
  /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?](?=\s|$|[*_~])/g,
  (match) => {
    const matchWithoutComments = match.replace(/<!--.*?-->/g, '');
    return `${match}<!-- ${matchWithoutComments.length} -->`;
  }
);
console.log(output);
https://regex101.com/r/RaTIOi/1

Related

Regex in Google Apps Script practical issue. Forms doesn't read regex as it should

I hope it's just something I'm not doing right.
I've been using a simple script to create a form out of a spreadsheet. The script seems to be working fine. The output form is going to get some inputs from third parties so I can analyze them in my consulting activity.
Creating the form was not a big deal; the structure is good to go. However, after getting the form creator script working, I started working on its validations, and that's where I'm stuck.
For text validations, I will need to use specific regexes. Many of the inputs my clients need to give me are going to be places' and/or people's names; therefore, I should only allow them to use A-Z, single spaces, apostrophes and dashes.
My resulting regexes are:
//Regex allowing a **single name** with the first letter capitalized and the occasional use of "apostrophes" or "dashes".
const reg1stName = /^[A-Z]([a-z\'\-])+/
//Should allow (a single name/surname) like Paul, D'urso, Mac'arthur, Saint-Germaine, etc.
//Regex allowing **composite names and place names** with the first letter capitalized and the occasional use of "apostrophes" or "dashes". It must avoid double spaces, however.
const regNamesPlaces = /^[^\s]([A-Z]|[a-z]|\b[\'\- ])+[^\s]$/
//This should allow (names/surnames/place names) like Giulius Ceasar, Joanne D'arc, Cosimo de'Medici, Cosimo de Medici, Jean-jacques Rousseau, Firenze, Friuli Venezia-giulia, L'aquila, etc.
Further in the script, these regexes are used as validation patterns for the form's text items, according to each case.
//Validation for single names
var val1stName = FormApp.createTextValidation()
.setHelpText("Only the person First Name Here! Use only (A-Z), a single apostrophe (') or a single dash (-).")
.requireTextMatchesPattern(reg1stName)
.build();
//Validation for composite names and places names
var valNamesPlaces = FormApp.createTextValidation()
.setHelpText(("Careful with double spaces, ok? Use only (A-Z), a single apostrophe (') or a single dash (-)."))
.requireTextMatchesPattern(regNamesPlaces)
.build();
Further yet, I have a "for" loop that creates the form based on the spreadsheet's fields. Up to this point, things are working just fine.
for (var i = 0; i < numberRows; i++) {
  var questionType = data[i][0];
  if (questionType == '') {
    continue;
  }
  else if (questionType == 'TEXTNamesPlaces') {
    form.addTextItem()
      .setTitle(data[i][1])
      .setHelpText(data[i][2])
      .setValidation(valNamesPlaces)
      .setRequired(false);
  }
  else if (questionType == 'TEXT1stName') {
    form.addTextItem()
      .setTitle(data[i][1])
      .setHelpText(data[i][2])
      .setValidation(val1stName)
      .setRequired(false);
  }
}
The problem appears when I run the script and test the resulting form.
Both validation types get imported just fine (as can be seen in the form's edit mode), but when testing the form in preview mode I get an error, as if the regex wasn't matching (sorry, the error message is in Portuguese; I forgot to translate it as I did with the code above):
A screenshot of the form in edit mode
A screenshot of the form in preview mode
However, if I manually remove the bars (the enclosing / characters) from this regex, it starts working!
A screenshot of the form in edit mode, Regex without bars
A screenshot of the form in preview mode, Regex without bars
What am I doing wrong? I'm no professional dev, but in my understanding it makes no sense to write a regex without bars.
If this is some Google Forms convention for reading regexes, I still need all of this to be read by the Apps Script that creates the form, after all. If I try to pass the regex without the bars there, the script cannot even be parsed.
const reg1stName = ^[A-Z]([a-z\'])+
const regNamesPlaces = ^[^\s]([A-Z]|[a-z]|\b[\'\- ])+[^\s]$
//Can't even be saved. Returns: SyntaxError: Unexpected token '^' (line 29, file "Code.gs")
Passing manually all the validations is not an option. Can anybody help me?
Thanks so much
This
/^[A-Z]([a-z\'\-])+/
will not work because the form treats the pattern as a plain string, so the enclosing / characters are matched literally.
This
^[A-Z]([a-z\'\-])+
also will not work, because if the name is hyphenated, you will only match up to the hyphen. This will match the 'Some-' in 'Some-Name', for example. Also, perhaps you want a name like 'Saint John' to pass also?
I recommend the following :)
^[A-Z][a-z]*[-\.' ]?[A-Z]?[a-z]*
^ anchors to the start of the string
[A-Z] matches exactly 1 capital letter
[a-z]* matches zero or more lowercase letters (this enables you to match a name like D'Urso)
[-\.' ]? matches zero or 1 instances of - (hyphen), . (period), ' (apostrophe) or a single space (the . (period) needs to be escaped with a backslash because . is special to regex)
[A-Z]? matches zero or 1 capital letter (in case there's a second capital in the name, like D'Urso, St John, Saint-Germaine)
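One way to keep the regex literal in the script and still hand the form a pattern without slashes is RegExp's source property. The sketch below is an assumption-laden illustration, not the asker's code: it assumes requireTextMatchesPattern wants the pattern as a plain string (which is what the screenshots suggest), and the trailing $ is an addition so that the whole name has to match rather than just a prefix.

```javascript
// Pattern from the answer above, plus a trailing $ (an assumption) so the
// entire input must match, not only its beginning.
const reg1stName = /^[A-Z][a-z]*[-\.' ]?[A-Z]?[a-z]*$/;

// .source is the pattern text without the enclosing slashes or flags.
const reg1stNamePattern = reg1stName.source;
console.log(reg1stNamePattern);

// Quick local check of a few names before wiring the string into the form.
const samples = ['Paul', "D'Urso", 'Saint-Germaine', 'paul', 'Jean  Paul'];
for (const name of samples) {
  console.log(name, new RegExp(reg1stNamePattern).test(name));
}

// In Apps Script, the string (not the RegExp object) would then be passed on:
//   FormApp.createTextValidation().requireTextMatchesPattern(reg1stNamePattern)
```

The first three sample names pass; the lowercase name and the one with a double space do not.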

How to add '>' to every new line in a string in javascript?

I have a text area on a UI and I need the user to type in Markdown. I need to make sure that each line they type starts with >, as I want to show everything they typed as a blockquote when they preview it.
So for example if they type in:
> some text user <b>typed</b>
another line
When the markdown is rendered, only the first line is a blockquote. The rest is plain text outside the blockquote.
Is there a way I can check each line and add the > if it is missing?
Things I have tried:
I tried removing all > characters and replacing each \n with a \n>. This, however, messed up the markdown, as the user can also type in <b>bold text</b>.
I have a loop that checks for the > character after every new line. I just don't know how to insert the > if it's missing.
Loop code:
var match = /\r|\n/.exec(theString);
if (match) {
  if (theString.charAt(match.index) != '>') {
    // don't know how to add the character
  }
}
I also thought that maybe I could enforce the > in the textarea, but that research got me nowhere. As in, I don't think that is possible.
I also thought, what if the user types multiple >>>>. At that stage I was thinking about it too much and said I'd leave out cases like that as maybe that is the user's intention.
If anyone has any suggestions and/or alternative solutions it would be very much appreciated. Thank you :)
You can use a regular expression to insert > to the beginning of each line, if it doesn't exist:
const input = `> some text user <b>typed</b>
another line
another line 2
> another line 3`;
const output = input.replace(/^(?!>)/gm, '> ');
console.log(output);
The pattern ^(?!>) means: match the beginning of a line, which is not followed by >.
If you only want to insert >s where lines have text already, then also lookahead for non-whitespace in the line:
const input = `> some text user <b>typed</b>
another line
another line 2
> another line 3`;
const output = input.replace(/^(?!>)(?=[^\n]*\S)/gm, '> ');
console.log(output);
I'd go with replace (first thing you tried). In order to insert literal > in HTML, you have to escape it.
Just replace \n with \n> and you're all set.
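A line-by-line variant of the same idea, as a rough sketch: split the text, prefix only the lines that do not already start with >, and join them again. The helper name quoteLines is hypothetical, and the sketch leaves tags such as <b> untouched because it only looks at the first character of each line.

```javascript
// Hypothetical helper: prefix every line that does not already start with '>'.
function quoteLines(text) {
  return text
    .split(/\r?\n/) // accept both \n and \r\n line endings
    .map((line) => (line.startsWith('>') ? line : `> ${line}`))
    .join('\n');
}

console.log(quoteLines('> some text user <b>typed</b>\nanother line'));
```

Lines that already begin with one or more > characters are kept as they are, which matches the asker's decision to treat >>>> as intentional.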

Extract specific text in between 2 strings

Assume we have text such as the following.
Title: (some text)
My Title [abc]
Content: (some test)
My long content paragraph. With multiple sentences. [abc]
Short Content: (some text)
Short content [abc]
Using Javascript and RegEx, is it possible to extract the text so that it would be as follows.
Title: My Title
Content: My long content paragraph. With multiple sentences.
Short Content: Short content
Basically ignoring new lines and text in the () and [] brackets?
I've tried to use regex but I can't get it to do exactly what I'd like. I'm also having the issue that when I match Content: I'm getting a match for both Content: and Short Content:, whereas I only want to match the occurrence that is an exact match.
EDIT:
I'm new to RegEx. So far to extract the titles such as Title:, Content: and so on I have
/[A-Za-z]+:|[A-Za-z]+ [A-Za-z]+:|[A-Za-z]+ [A-Za-z]+ [A-Za-z]+:|[A-Za-z]+ [A-Za-z]+ [0-9]+:/g
And then I loop through and use this
[TITLENAME]:.*\n.*
I'm struggling to get past this. My next step would be to loop through the text that is matched above and then remove the bracket stuff. I'm sure there is a better way to do this!
You could use String.replace( /(\(|\)|\[|\])/g , '')
If you call the replace method on a string with these two arguments, it returns a string with the ()[] characters removed. I have escaped them all with \ since they are special characters in regex. It might be a little overzealous.
Also, g makes the regular expression global, so it will remove all instances.
If the text within the parentheses and brackets (e.g. 'abc') is fixed and has a special meaning, you can also go with: '/(\(some text\)\n|\(some test\)\n|(\[abc\]))|(^$\n)/gm'.
This way you allow parentheses in the real text that you want to preserve, e.g. some text (this I want to preserve) and other text.
Please note the multiline m flag.
https://regex101.com/r/cS3pRR/1
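For the exact layout in the question, a sketch that pairs each label line with the line after it may be easier to follow than one large pattern. It assumes (this is not stated in the question) that non-empty lines strictly alternate between a label ending in a parenthesised note and a value ending in a bracketed tag; extractFields is a hypothetical name.

```javascript
// Hypothetical helper: pair each "Label: (note)" line with the value line
// that follows it, dropping the trailing (...) and [...] parts.
function extractFields(text) {
  const lines = text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '');
  const result = [];
  for (let i = 0; i + 1 < lines.length; i += 2) {
    const label = lines[i].replace(/\s*\([^)]*\)\s*$/, '');
    const value = lines[i + 1].replace(/\s*\[[^\]]*\]\s*$/, '');
    result.push(`${label} ${value}`);
  }
  return result.join('\n');
}

const sampleText = `Title: (some text)
My Title [abc]
Content: (some test)
My long content paragraph. With multiple sentences. [abc]
Short Content: (some text)
Short content [abc]`;

console.log(extractFields(sampleText));
```

Because each label is taken from its own line, Content: and Short Content: cannot be confused with each other, and parentheses in the middle of a value are preserved since only a trailing group is removed.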

Remove multiple line breaks (\n) in JavaScript

We have an onboarding form for new employees with multiple newlines (4-5 between lines) that need to be stripped. I want to get rid of the extra newlines but still space out the blocks with one \n.
example:
New employee<br/>
John Doe
Employee Number<br/>
1234
I'm currently using text = text.replace(/(\r\n|\r|\n)+/g, '$1'); but that gets rid of all newlines without spacing.
text = text.replace(/(\r\n|\r|\n){2,}/g, '$1\n');
Use this; it collapses any run of two or more newlines.
Update:
At the OP's specific request, I have edited the answer a bit.
text = text.replace(/(\r\n|\r|\n){2}/g, '$1').replace(/(\r\n|\r|\n){3,}/g, '$1\n');
We can tidy up the regex as follows:
text = text.replace(/[\r\n]{2,}/g, "\n");
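Note that [\r\n]{2,} also matches a single Windows-style \r\n pair, since that is two characters from the class. If that matters, one possible approach (a sketch, with a hypothetical helper name) is to normalize the line endings first and only then squeeze the runs:

```javascript
// Hypothetical helper: normalize line endings, then collapse runs of breaks.
function collapseLineBreaks(text) {
  return text
    .replace(/\r\n?/g, '\n') // turn \r\n and lone \r into \n
    .replace(/\n{2,}/g, '\n'); // squeeze each run of breaks down to one \n
}

const form = 'New employee\n\n\n\nJohn Doe\r\n\r\n\r\nEmployee Number\n1234';
console.log(JSON.stringify(collapseLineBreaks(form)));
```

To keep one blank line between blocks instead, the second replacement string could be '\n\n'.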

Remove a long dash from a string in JavaScript?

I've come across an error in my web app that I'm not sure how to fix.
Text boxes are sending me the long dash as part of their content (you know, the special long dash that MS Word automatically inserts sometimes). However, I can't find a way to replace it; since if I try to copy that character and put it into a JavaScript str.replace statement, it doesn't render right and it breaks the script.
How can I fix this?
The specific character that's killing it is —.
Also, if it helps, I'm passing the value as a GET parameter, and then encoding it in XML and sending it to a server.
This code might help:
text = text.replace(/\u2013|\u2014/g, "-");
It replaces all – (en dash, U+2013) and — (em dash, U+2014) characters with simple hyphens (-).
DEMO: http://jsfiddle.net/F953H/
That character is called an em dash. You can replace it like so:
str.replace('\u2014', '');
Here is an example Fiddle: http://jsfiddle.net/x67Ph/
The \u2014 is called a Unicode escape sequence. These allow you to specify a Unicode character by its code. U+2014 happens to be the em dash.
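One detail worth checking when using a string as the first argument: replace then changes only the first occurrence. A small sketch of the difference, using a made-up sample string:

```javascript
// Made-up sample containing two em dashes (U+2014).
const sample = 'one\u2014two\u2014three';

// A string pattern replaces only the first occurrence.
const firstOnly = sample.replace('\u2014', '-');

// A regex with the g flag replaces every occurrence.
const allDashes = sample.replace(/\u2014/g, '-');

console.log(firstOnly);
console.log(allDashes);
```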
There are three unicode long-ish dashes you need to worry about: http://en.wikipedia.org/wiki/Dash
You can replace unicode characters directly by using the unicode escape:
'—my string'.replace( /[\u2012\u2013\u2014\u2015]/g, '' )
There may be more characters behaving like this, and you may want to reuse them in HTML later. A more generic way to deal with it could be to replace all 'extended characters' with their HTML-encoded equivalents. You could do that like this:
yourString.replace(/[\u0080-\uC350]/g,
  function (a) {
    return '&#' + a.charCodeAt(0) + ';';
  }
);
With the ECMAScript 2018 standard, JavaScript RegExp supports Unicode property escapes. One of them, \p{Dash}, matches any Unicode code point that is a dash (the u flag is required):
/\p{Dash}/gu
In ES5, the equivalent expression is:
/[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g
See the Unicode Utilities reference.
Here are some JavaScript examples:
const text = "Dashes: \uFF0D\uFE63\u058A\u1400\u1806\u2010-\u2013\uFE32\u2014\uFE58\uFE31\u2015\u2E3A\u2E3B\u2053\u2E17\u2E40\u2E5D\u301C\u30A0\u2E1A\u05BE\u2212\u207B\u208B\u3030𐺭";
const es5_dash_regex = /[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g;
console.log(text.replace(es5_dash_regex, '-')); // Normalize each dash to ASCII hyphen
// => Dashes: ----------------------------
To match one or more dashes and replace with a single char (or remove in one go):
/\p{Dash}+/gu
/(?:[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD)+/g
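A short usage sketch of the \p{Dash}+ form, assuming a runtime with ES2018 property escapes (the helper name normalizeDashes is made up for illustration):

```javascript
// Hypothetical helper: turn every run of dash-like characters into one hyphen.
function normalizeDashes(text) {
  return text.replace(/\p{Dash}+/gu, '-');
}

console.log(normalizeDashes('2010\u20132020 \u2014\u2014 done'));
```

Here the en dash becomes a hyphen and the two consecutive em dashes collapse into a single hyphen.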
