Complicated split of string - javascript

I have a group of similarly structured strings that I'm trying to break up into separate pieces via JavaScript.
Sample string:
Jr. Kevin Hooks, Irene Cara, Moses Gunn, Robert Hooks, Ernestine Jackson, José Feliciano. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur ullamcorper sodales nulla id hendrerit.
Ideal output:
[
"Jr. Kevin Hooks","Irene Cara",…
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur
ullamcorper sodales nulla id hendrerit."
]
My first thought was to do a split at '. ' to separate the names from the block of text towards the end, then split the group of names at ', ', but because some names are like 'Jr. Kevin Hooks' and the block of text also contains '. ' that approach fails. Using ', ' as the key also fails because the block of text contains ', '.
Any suggestions on how to accomplish this?
Many thanks!

If we can assume that:
There is no text coming before the first name occurrence
A point in a name only occurs at the end of a word of at most 3 letters
If the last occurring name ends with such an abbreviation, then it still needs to be followed by a point to end the list (e.g. "Abram Lincoln, John Johnsen Jr.. Lorem ipsum dolor"), as otherwise there is no way to know whether the next word belongs to the name or not.
Then you could use this regular expression:
/([a-z]{1,3}\.|[^\s,.]+)(\s+([a-z]{1,3}\.|[^\s,.]+))*(?=[,.])|\..*$/ig
var text = 'Jr. Kevin Hooks, Irene Cara, Moses Gunn, Robert Hooks, Ernestine Jackson, José Feliciano. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur ullamcorper sodales nulla id hendrerit.'
var result = text.match(/([a-z]{1,3}\.|[^\s,.]+)(\s+([a-z]{1,3}\.|[^\s,.]+))*(?=[,.])|\..*$/ig);
// Optionally remove the point at the start of the last match:
if (result) result.push(result.pop().replace(/^\.\s*/, ''));
console.log(result);
Explanation:
[a-z]{1,3}\. matches one to three Latin characters, followed by a point
[^\s,.]+ matches one or more characters that are not white-space, comma or point
( | ): one of the two alternatives must match. The two patterns above are combined this way, so a word in a name is either one to three Latin letters followed by a point, or one or more characters that are not white-space, a comma or a point.
(\s+([a-z]{1,3}\.|[^\s,.]+))*: optionally (*) allow for more words like that: match one or more white spaces, and repeat the pattern as at the start.
(?=[,.]) that series of words must end with a comma or a point, which is not grabbed (look ahead only): by not grabbing the point, we know for sure that the pattern at the start cannot match anymore, and that is when the next pattern will do the job:
\..*$ matches a literal point and then any characters up to the end of the string ($)
The point preceding the final text block is also included in the last match, so you may want to remove it separately (see snippet).
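The steps above can be wrapped in a small helper. This is only a sketch under the same assumptions as the answer; the function name splitNamesAndText and the { names, text } return shape are illustrative and not part of the original.

```javascript
// Hypothetical helper: splits "Name, Name, Name. Free text..." into its parts.
// Relies on the same assumptions as the regular expression above.
function splitNamesAndText(input) {
  const pattern = /([a-z]{1,3}\.|[^\s,.]+)(\s+([a-z]{1,3}\.|[^\s,.]+))*(?=[,.])|\..*$/ig;
  const matches = input.match(pattern);
  if (!matches) return { names: [], text: '' };

  const last = matches[matches.length - 1];
  // A name match can never start with a point, so a leading point
  // identifies the trailing text block.
  if (last.startsWith('.')) {
    return {
      names: matches.slice(0, -1),
      text: last.replace(/^\.\s*/, ''),
    };
  }
  // No trailing text block was found: every match is a name.
  return { names: matches, text: '' };
}
```

Returning the names and the text separately avoids having to remember that the last array element is special.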

Related

Creating a regex that targets only the first line with less than 8 words

I'm having trouble creating a regular expression that combines two statements.
What I want is to create a regex which targets only the first line of something - ^(.*)$ - and only if it has 8 words or fewer - /^(?:\s*\S+(?:\s+\S+){0,24})?\s*$/.
I have the individual expressions but I can't seem to join them. Can anyone point me in the direction of where I'm going wrong?
You could try this pattern: /^\s*(\b\w+\b\W*){0,8}\n/gi (finds 8 words or fewer, followed by a line feed)
let text = `one two three four five six seven eight
nine ten eleven twelve`;
let pattern = /^\s*(\b\w+\b\W*){0,8}\n/gi;
let matching = text.match(pattern);
Is there any specific reason for trying to solve this with regular expressions? I feel this could be achieved more easily without regex at all, in two steps:
const text = `Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua.`
const firstLine = text.split("\n")[0]
if (firstLine.split(" ").length <= 8) {
console.log("First line has 8 or less words")
} else {
console.log("First line has more than 8 words")
}
The main issue with doing this the way you described is actually counting the words, and I can't see how a regex helps with that. Is using a regex a hard requirement?
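For completeness, here is a slightly more defensive version of the two-step idea, as a sketch. The helper name firstLineHasAtMostWords is made up for illustration, and it assumes a "word" is any run of non-whitespace characters. Splitting on /\s+/ after trimming avoids miscounting when words are separated by several spaces or tabs.

```javascript
// Hypothetical helper: true when the first line of `text` has at most `maxWords` words.
function firstLineHasAtMostWords(text, maxWords) {
  // Take everything before the first line break (handles \n and \r\n).
  const firstLine = text.split(/\r?\n/)[0].trim();
  // An empty first line contains zero words.
  const wordCount = firstLine === '' ? 0 : firstLine.split(/\s+/).length;
  return wordCount <= maxWords;
}

console.log(firstLineHasAtMostWords('one two three\nfour five', 8)); // true
```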

can I use Template literals in javascript for long text which must be single row, and for *readability only* do it multi-line? [duplicate]

In es6 template literals, how can one wrap a long template literal to multiline without creating a new line in the string?
For example, if you do this:
const text = `a very long string that just continues
and continues and continues`
This adds a newline character to the string, because the line break inside the literal is kept as part of the value. How can one wrap a long template literal across multiple lines without creating the newline?
If you introduce a line continuation (\) at the point of the newline in the literal, it won't create a newline on output:
const text = `a very long string that just continues\
and continues and continues`;
console.log(text); // a very long string that just continuesand continues and continues
This is an old question, but it came up again. Any spaces you leave in the editor, such as indentation on the continued line, will end up in the string.
If you have:
const text = `a very long string that just continues\
and continues and continues`;
just use the normal + operator instead:
const text = `a very long string that just continues` +
`and continues and continues`;
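To make the whitespace caveat concrete, here is a small sketch comparing the two approaches. The variable names are illustrative only. With a line continuation, the indentation of the continued line becomes part of the string; with concatenation, it does not.

```javascript
// Line continuation: the backslash removes the line break,
// but the four spaces of indentation on the next line are kept.
const continued = `a very long string that just continues \
    and continues`;

// Concatenation: the indentation sits outside the literals, so it is not included.
const concatenated = `a very long string that just continues ` +
    `and continues`;

console.log(continued === concatenated); // false
console.log(concatenated); // a very long string that just continues and continues
```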
You could just eat the line breaks inside your template literal.
// Thanks to https://twitter.com/awbjs for introducing me to the idea
// here: https://esdiscuss.org/topic/multiline-template-strings-that-don-t-break-indentation
const printLongLine = continues => {
const text = `a very long string that just ${continues}${''
} and ${continues} and ${continues}`;
return text;
}
console.log(printLongLine('continues'));
Another option is to use Array.join, like so:
[
'This is a very long string. ',
'It just keeps going ',
'and going ',
'and going ',
'and going ',
'and going ',
'and going ',
'and going',
].join('')
EDIT: I've made a tiny npm module with this utility. It works on the web and in Node, and I highly recommend it over the code in the answer below, as it's far more robust. It also allows for preserving newlines in the result if you manually input them as \n, and provides functions for when you already use template literal tags for something else: https://github.com/iansan5653/compress-tag
I know I'm late to answer here, but the accepted answer still has the drawback of not allowing indents after the line break, which means you still can't write very nice-looking code just by escaping newlines.
Instead, why not use a tagged template literal function?
function noWhiteSpace(strings, ...placeholders) {
// Build the string as normal, combining all the strings and placeholders:
let withSpace = strings.reduce((result, string, i) => (result + placeholders[i - 1] + string));
let withoutSpace = withSpace.replace(/\s\s+/g, ' ');
return withoutSpace;
}
Then you can just tag any template literal you want to have line breaks in:
let myString = noWhiteSpace`This is a really long string, that needs to wrap over
several lines. With a normal template literal you can't do that, but you can
use a template literal tag to allow line breaks and indents.`;
This does have the drawback of possibly having unexpected behavior if a future developer isn't used to the tagged template syntax or if you don't use a descriptive function name, but it feels like the cleanest solution for now.
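One caveat with the tag above: /\s\s+/g only replaces runs of two or more whitespace characters, so a line break that is not followed by indentation stays in the string. A variant that collapses every whitespace run might look like the sketch below. The name collapseWhitespace is hypothetical, and trimming both ends is an extra assumption about what you want.

```javascript
// Hypothetical tag function: joins the template and squeezes all whitespace runs.
function collapseWhitespace(strings, ...values) {
  // Interleave the literal chunks with the interpolated values.
  const joined = strings.reduce(
    (result, chunk, i) => result + String(values[i - 1]) + chunk
  );
  // Replace every run of whitespace (spaces, tabs, line breaks) with one space.
  return joined.replace(/\s+/g, ' ').trim();
}

const who = 'world';
const greeting = collapseWhitespace`Hello,
${who}!
    This stays on one line.`;
console.log(greeting); // Hello, world! This stays on one line.
```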
Use the old and the new. Template literals are great but if you want to avoid lengthy literals so as to have compact lines of code, concatenate them and ESLint won't cause a fuss.
const text = `a very long string that just continues`
+` and continues and continues`;
console.log(text);
Similar to Doug's answer this is accepted by my TSLint config and remains untouched by my IntelliJ auto-formatter:
const text = `a very long string that just ${
continues
} and ${continues} and ${continues}`
The common-tags npm package allows you to do the following:
import { oneLine } from 'common-tags';
const foo = oneLine`foo
bar
baz`;
console.log(foo); // foo bar baz
The solution proposed by @CodingIntrigue does not work for me on Node 7. It works if I do not use a line continuation on the first line, but fails otherwise.
This is probably not the best solution, but it works without problems:
(`
border:1px solid blue;
border-radius:10px;
padding: 14px 25px;
text-decoration:none;
display: inline-block;
text-align: center;`).replace(/\n/g,'').trim();
I'm a bit late to the party, but for anyone visiting this question in the future: I found this solution the best fit for my use case.
I'm running a Node.js server and wanted to return HTML in string format. This is how I solved it:
My response object:
const httpResponse = {
message: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent ultrices et odio eget blandit. Donec non tellus diam. Duis massa augue, cursus a ornare vel, pharetra ac turpis.',
html: `
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>Praesent ultrices et odio eget blandit.</p>
<ul>
<li>Donec non tellus diam</li>
<li>Duis massa augue</li>
</ul>
`,
}
This would translate into the following when sent over HTTP:
{
"message": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent ultrices et odio eget blandit. Donec non tellus diam. Duis massa augue, cursus a ornare vel, pharetra ac turpis.",
"html": "\n\t\t<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>\n\t\t<p>Praesent ultrices et odio eget blandit.</p>\n\t\t<ul>\n\t\t\t<li>Donec non tellus diam</li>\n\t\t\t<li>Duis massa augue</li>\n\t\t</ul>\n\t"
}
This is of course ugly and hard to work with, so before sending the response I trim every line of the string.
httpResponse.html = httpResponse.html.split('\n').map(line => line.trim()).join('')
This is what the result looks like after that simple line of code.
{
"message": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent ultrices et odio eget blandit. Donec non tellus diam. Duis massa augue, cursus a ornare vel, pharetra ac turpis.",
"html": "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p><p>Praesent ultrices et odio eget blandit.</p><ul><li>Donec non tellus diam</li><li>Duis massa augue</li></ul>"
}
If your problem is the opposite and you need to keep the line breaks but for some reason, they are not being respected, just add the css property in the text container:
#yourTextContainer {
white-space: pre-line;
}

Translating multi-line strings in javascript with django

I am working on a Django application which needs to support multiple languages. This application involves some amount of JavaScript code. In this JavaScript code, there are some multi-line strings which need to be translated.
We have tried this structure:
var $text = gettext('Lorem ipsum dolor sit amet, consectetur adipisicing ' +
'elit, sed do eiusmod tempor incididunt ut labore et ' +
'dolore magna aliqua. Ut enim ad minim veniam, quis ');
This does not work. makemessages stops at the first + sign, so in the .po file it shows up as:
msgid "Lorem ipsum dolor sit amet, consectetur adipisicing "
A bit of searching on the net led to a style guide, which recommends the format we are already using for multi-line strings. But that style is not supported by makemessages.
I tried removing the + characters at the end of the lines. Without the + characters, makemessages can find the full string, but it no longer works in the browser.
Does there exist a style for multi-line strings, which is both supported by makemessages and can be expected to work in all major browsers?
So far I have found that what makemessages actually does is replace all single-quoted strings with double-quoted strings and run the result through xgettext, claiming it to be C code.
The reason it doesn't work automatically is that makemessages doesn't use a real JavaScript parser. It does a minor transformation and applies a C parser. To concatenate strings in JavaScript you need a + character, whereas in C you must not have any tokens between the strings to be concatenated.
I finally found a workaround, that works:
var $text = gettext('Lorem ipsum dolor sit amet, consectetur adipisicing ' //\
+
'elit, sed do eiusmod tempor incididunt ut labore et ' //\
+
'dolore magna aliqua. Ut enim ad minim veniam, quis ');
The JavaScript parser in the browser will see //\ as a comment, and find + characters between each string as needed. When using makemessages, the \ character is parsed as a line continuation, so //\ and the + on the following line are together considered a single comment. The parser therefore sees string constants separated by just a comment, and implicit string concatenation is performed.
I found this workaround by accident as I came across this piece of code from a fellow developer:
// IE8 only allows string, identifier and number keys between {}s
var parse_bool = {"null": null, "true": true, "false": false}
parse_bool[undefined] = null
parse_bool[null] = null // using null/true/false *this* way works
parse_bool[true] = true // _______
parse_bool[false] = false // ( WAT?! )
// ¯¯¯¯¯¯¯ o ^__^
var render_bool = {} // o (oo)\_______
render_bool[null] = '--' // (__)\ )\/\
render_bool[true] = gettext('yes') // ||----w |
render_bool[false] = gettext('no') // || ||
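When makemessages processed this piece of JavaScript code, it missed the yes string: the comment on the preceding line ends with a \ character, so the C parser treats the line containing gettext('yes') as a continuation of that comment.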

How to manipulate regex to return array of URLs from text?

I am new to regex usage, and have been searching for some time for a suitable regex to retrieve URLs from a paragraph of text.
The current regex I am using:
text.match(/(((ftp|https?):\/\/)(www\.)?|www\.)([\da-z-_\.]+)([a-z\.]{2,7})([\/\w\.-_\?\&]*)*\/?/g);
Returns 'www.mik' as a valid URL from a paragraph of text like '...my webpage is www.mikealbert.com...' and is unsuitable for my purposes.
--
So far, the following regex gives me the best result for matching URLs ('www.mik' is not matched, but 'www.mikealbert.com' is matched)
/(https:[/][/]|http:[/][/]|www.)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?\/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*$/.test("www.google.com");
However, it can only be used to match single URLs. How should I modify the above regex to return an array of matching URLs? I will also need the regex to handle urls with paths, such as www.facebook.com/abc123?apple=pie&blueberry=cake
Thanks for any help!
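Remove the dollar sign from the end of the regex and add the g flag. Keep the * after the last group so that paths and query strings are still matched:
var regex = /(https:[/][/]|http:[/][/]|www.)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?\/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*/g;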
var input = "https://stackoverflow.com/ lorem ipsum dolor sit amet http://google.com dolor sit amet www.foo.com";
if(regex.test(input)) {
console.log(input.match(regex));
}
output
[ 'https://stackoverflow.com/',
'http://google.com',
'www.foo.com' ]
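A sketch of a reusable helper along these lines, assuming the same character classes as the pattern above. The function name extractUrls is made up for illustration. It escapes the dot in www., keeps the * on the final group so that paths and query strings are captured, and returns an empty array instead of null when nothing matches. Note that the final character class includes . and , so punctuation directly after a URL would be captured too.

```javascript
// Hypothetical helper: returns every URL-like substring found in `text`.
function extractUrls(text) {
  const urlPattern = /(https?:\/\/|www\.)[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?\/?([a-zA-Z0-9\-._?,'\/\\+&%$#=~])*/g;
  // String.prototype.match returns null when a global pattern finds nothing.
  return text.match(urlPattern) || [];
}

const sample = 'see www.facebook.com/abc123?apple=pie&blueberry=cake and http://google.com now';
console.log(extractUrls(sample).join('\n'));
// www.facebook.com/abc123?apple=pie&blueberry=cake
// http://google.com
```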

Javascript regular expression to extract text from string

Let's say I have the following string:
[lorem]{lorem;ipsum;solor;sit;amet}[ipsum]<i>Lorem</i> ipsum <b>dolor</b> sit amet
What I want is an object that contains the following:
{lorem: "{lorem;ipsum;solor;sit;amet}", ipsum: "<i>Lorem</i> ipsum <b>dolor</b> sit amet"}
How would my regular expression look? Is there any way to get the inverse of this?
/\[\w+\]/g
Thanks...
Instead of creating the inverse of /\[\w+\]/g, just use .split():
var string = '[lorem]{lorem;ipsum;solor;sit;amet}[ipsum]<i>Lorem</i> ipsum <b>dolor</b> sit amet'
console.log(string.split(/\[\w+\]/));
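If the goal is the object from the question rather than a plain array, one possible sketch is to add a capture group to the split pattern, so that the key names are kept in the result, and then pair each key with the text that follows it. The function name parseSections is hypothetical, and the sketch assumes the string starts with a [key] marker (any text before the first marker is ignored).

```javascript
// Hypothetical helper: turns "[key]value[key2]value2" into { key: value, key2: value2 }.
function parseSections(input) {
  // With a capture group, split() keeps the captured key names in the output:
  // ["", "lorem", "{...}", "ipsum", "<i>Lorem</i> ..."]
  const parts = input.split(/\[(\w+)\]/);
  const sections = {};
  // Keys sit at odd indexes, each followed by its value.
  for (let i = 1; i < parts.length; i += 2) {
    sections[parts[i]] = parts[i + 1];
  }
  return sections;
}

const parsed = parseSections('[lorem]{lorem;ipsum;solor;sit;amet}[ipsum]<i>Lorem</i> ipsum <b>dolor</b> sit amet');
console.log(parsed.lorem); // {lorem;ipsum;solor;sit;amet}
console.log(parsed.ipsum); // <i>Lorem</i> ipsum <b>dolor</b> sit amet
```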
