I have a string that (potentially) contains HTML tags.
I want to split it into smaller valid HTML strings based on (text) character length. The use case is essentially pagination. I know the length of text that can fit on a single page. So I want to divide the target string into "chunks" or pages based on that character length. But I need each of the resulting pages to contain valid HTML without unclosed tags, etc.
So for example:
const pageCharacterSize = 10
const testString = 'some <strong>text with HTML</strong> tags
function paginate(string, pageSize) { //#TODO }
const pages = paginate(testString, pageCharacterSize)
console.log(pages)
// ['some <strong>text </strong>', '<strong>with HTML</strong> ', 'tags']
I think this is possible to do with a DocumentFragment or Range but I can't figure out how slice the pages based on character offsets.
This MDN page has a demo that does something close to what I need. But it uses caretPositionFromPoint() which takes X, Y coordinates as arguments.
Update
For the purposes of clarity, here are the tests I'm working with:
import { expect, test } from 'vitest'
import paginate from './paginate'
// 1
test('it should chunk plain text', () => {
// a
const testString = 'aa bb cc dd ee';
const expected = ['aa', 'bb', 'cc', 'dd', 'ee']
expect(paginate(testString, 2)).toStrictEqual(expected)
// b
const testString2 = 'a a b b c c';
const expected2 = ['a a', 'b b', 'c c']
expect(paginate(testString2, 3)).toStrictEqual(expected2)
// c
const testString3 = 'aa aa bb bb cc cc';
const expected3 = ['aa aa', 'bb bb', 'cc cc']
expect(paginate(testString3, 5)).toStrictEqual(expected3)
// d
const testString4 = 'aa bb cc';
const expected4 = ['aa', 'bb', 'cc']
expect(paginate(testString4, 4)).toStrictEqual(expected4)
// e
const testString5 = 'a b c d e f g';
const expected5 = ['a b c', 'd e f', 'g']
expect(paginate(testString5, 5)).toStrictEqual(expected5)
// f
const testString6 = 'aa bb cc';
const expected6 = ['aa bb', 'cc']
expect(paginate(testString6, 7)).toStrictEqual(expected6)
})
// 2
test('it should chunk an HTML string without stranding tags', () => {
const testString = 'aa <strong>bb</strong> <em>cc dd</em>';
const expected = ['aa', '<strong>bb</strong>', '<em>cc</em>', '<em>dd</em>']
expect(paginate(testString, 3)).toStrictEqual(expected)
})
// 3
test('it should handle tags that straddle pages', () => {
const testString = '<strong>aa bb cc</strong>';
const expected = ['<strong>aa</strong>', '<strong>bb</strong>', '<strong>cc</strong>']
expect(paginate(testString, 2)).toStrictEqual(expected)
})
Here is a solution that assumes and supports the following:
tags without attributes (you could tweak the regex to support that)
well formed tags assumed, e.g. not: <b><i>wrong nesting</b></i>, missing <b>end tag, missing start</b> tag
tags may be nested
tags are removed & later restored for proper characters per page count
page split is done by looking backwards for first space
function paginate(html, pageSize) {
let splitRegex = new RegExp('\\s*[\\s\\S]{1,' + pageSize + '}(?!\\S)', 'g');
let tagsInfo = []; // saved tags
let tagOffset = 0; // running offset of tag in plain text
let pageOffset = 0; // page offset in plain text
let openTags = []; // open tags carried over to next page
let pages = html.replace(/<\/?[a-z][a-z0-9]*>/gi, (tag, pos) => {
let obj = { tag: tag, pos: pos - tagOffset };
tagsInfo.push(obj);
tagOffset += tag.length;
return '';
}).match(splitRegex).map(page => {
let nextOffset = pageOffset + page.length;
let prefix = openTags.join('');
tagsInfo.slice().reverse().forEach(obj => {
if(obj.pos >= pageOffset && obj.pos < nextOffset) {
// restore tags in reverse order to maintain proper position
page = page.substring(0, obj.pos - pageOffset) + obj.tag + page.substring(obj.pos - pageOffset);
}
});
tagsInfo.forEach(obj => {
let tag = obj.tag;
if(obj.pos >= pageOffset && obj.pos < nextOffset) {
if(tag.match(/<\//)) {
// remove tag from openTags list
tag = tag.replace(/<\//, '<');
let index = openTags.indexOf(tag);
if(index >= 0) {
openTags.splice(index, 1);
}
} else {
// add tag to openTags list
openTags.push(tag);
}
}
});
pageOffset = nextOffset;
let postfix = openTags.slice().reverse().map(tag => tag.replace(/</, '</')).join('');
page = prefix + page.trim() + postfix;
return page.replace(/<(\w+)><\/\1>/g, ''); // remove tags with empty content
});
return pages;
}
[
{ str: 'some <strong>text <i>with</i> HTML</strong> tags, and <i>some <b>nested tags</b> sould be <b>supported</b> as well</i>.', size: 16 },
{ str: 'a a b b c c', size: 3 },
{ str: 'aa aa bb bb cc cc', size: 5 },
{ str: 'aa bb cc', size: 4 },
{ str: 'aa <strong>bb</strong> <em>cc dd</em>', size: 3 },
{ str: '<strong>aa bb cc</strong>', size: 2 }
].forEach(o => {
let pages = paginate(o.str, o.size);
console.log(pages);
});
Output:
[
"some <strong>text <i>with</i></strong>",
"<strong> HTML</strong> tags, and",
"<i>some <b>nested tags</b></i>",
"<i> sould be</i>",
"<i><b>supported</b> as</i>",
"<i>well</i>."
]
[
"a a",
"b b",
"c c"
]
[
"aa aa",
"bb bb",
"cc cc"
]
[
"aa",
"bb",
"cc"
]
[
"aa",
"<strong>bb</strong>",
" <em>cc</em>",
"<em>dd</em>"
]
[
"<strong>aa</strong>",
"<strong>bb</strong>",
"<strong>cc</strong>"
]
Update
Based on new request in comment I fixed the split regex from '[\\s\\S]{1,' + pageSize + '}(?!\\S)' to '\\s*[\\s\\S]{1,' + pageSize + '}(?!\\S)', e.g. added \\s* to catch leading spaces. I also added a page.trim() to remove leading spaces. Finally I added a few of the OP examples.
In PHP, I use Kuwamoto's class to pluralize nouns in my strings. I didn't find something as good as this script in javascript except for some plugins. So, it would be great to have a javascript function based on Kuwamoto's class.
http://kuwamoto.org/2007/12/17/improved-pluralizing-in-php-actionscript-and-ror/
Simple version (ES6):
const pluralize = (count, noun, suffix = 's') =>
`${count} ${noun}${count !== 1 ? suffix : ''}`;
Typescript:
const pluralize = (count: number, noun: string, suffix = 's') =>
`${count} ${noun}${count !== 1 ? suffix : ''}`;
Usage:
pluralize(0, 'turtle'); // 0 turtles
pluralize(1, 'turtle'); // 1 turtle
pluralize(2, 'turtle'); // 2 turtles
pluralize(3, 'fox', 'es'); // 3 foxes
This obviously doesn't support all english edge-cases, but it's suitable for most purposes
Use Pluralize
There's a great little library called Pluralize that's packaged in npm and bower.
This is what it looks like to use:
import Pluralize from 'pluralize';
Pluralize( 'Towel', 42 ); // "Towels"
Pluralize( 'Towel', 42, true ); // "42 Towels"
And you can get it here:
https://github.com/blakeembrey/pluralize
So, I answer my own question by sharing my translation in javascript of Kuwamoto's PHP class.
String.prototype.plural = function(revert){
var plural = {
'(quiz)$' : "$1zes",
'^(ox)$' : "$1en",
'([m|l])ouse$' : "$1ice",
'(matr|vert|ind)ix|ex$' : "$1ices",
'(x|ch|ss|sh)$' : "$1es",
'([^aeiouy]|qu)y$' : "$1ies",
'(hive)$' : "$1s",
'(?:([^f])fe|([lr])f)$' : "$1$2ves",
'(shea|lea|loa|thie)f$' : "$1ves",
'sis$' : "ses",
'([ti])um$' : "$1a",
'(tomat|potat|ech|her|vet)o$': "$1oes",
'(bu)s$' : "$1ses",
'(alias)$' : "$1es",
'(octop)us$' : "$1i",
'(ax|test)is$' : "$1es",
'(us)$' : "$1es",
'([^s]+)$' : "$1s"
};
var singular = {
'(quiz)zes$' : "$1",
'(matr)ices$' : "$1ix",
'(vert|ind)ices$' : "$1ex",
'^(ox)en$' : "$1",
'(alias)es$' : "$1",
'(octop|vir)i$' : "$1us",
'(cris|ax|test)es$' : "$1is",
'(shoe)s$' : "$1",
'(o)es$' : "$1",
'(bus)es$' : "$1",
'([m|l])ice$' : "$1ouse",
'(x|ch|ss|sh)es$' : "$1",
'(m)ovies$' : "$1ovie",
'(s)eries$' : "$1eries",
'([^aeiouy]|qu)ies$' : "$1y",
'([lr])ves$' : "$1f",
'(tive)s$' : "$1",
'(hive)s$' : "$1",
'(li|wi|kni)ves$' : "$1fe",
'(shea|loa|lea|thie)ves$': "$1f",
'(^analy)ses$' : "$1sis",
'((a)naly|(b)a|(d)iagno|(p)arenthe|(p)rogno|(s)ynop|(t)he)ses$': "$1$2sis",
'([ti])a$' : "$1um",
'(n)ews$' : "$1ews",
'(h|bl)ouses$' : "$1ouse",
'(corpse)s$' : "$1",
'(us)es$' : "$1",
's$' : ""
};
var irregular = {
'move' : 'moves',
'foot' : 'feet',
'goose' : 'geese',
'sex' : 'sexes',
'child' : 'children',
'man' : 'men',
'tooth' : 'teeth',
'person' : 'people'
};
var uncountable = [
'sheep',
'fish',
'deer',
'moose',
'series',
'species',
'money',
'rice',
'information',
'equipment'
];
// save some time in the case that singular and plural are the same
if(uncountable.indexOf(this.toLowerCase()) >= 0)
return this;
// check for irregular forms
for(word in irregular){
if(revert){
var pattern = new RegExp(irregular[word]+'$', 'i');
var replace = word;
} else{ var pattern = new RegExp(word+'$', 'i');
var replace = irregular[word];
}
if(pattern.test(this))
return this.replace(pattern, replace);
}
if(revert) var array = singular;
else var array = plural;
// check for matches using regular expressions
for(reg in array){
var pattern = new RegExp(reg, 'i');
if(pattern.test(this))
return this.replace(pattern, array[reg]);
}
return this;
}
Easy to use:
alert("page".plural()); // return plural form => pages
alert("mouse".plural()); // return plural form => mice
alert("women".plural(true)); // return singular form => woman
DEMO
Based on #pmrotule answer with some typescript magic and some additions to the uncountable array. I add here the plural and singular functions.
The plural version:
/**
* Returns the plural of an English word.
*
* #export
* #param {string} word
* #param {number} [amount]
* #returns {string}
*/
export function plural(word: string, amount?: number): string {
if (amount !== undefined && amount === 1) {
return word
}
const plural: { [key: string]: string } = {
'(quiz)$' : "$1zes",
'^(ox)$' : "$1en",
'([m|l])ouse$' : "$1ice",
'(matr|vert|ind)ix|ex$' : "$1ices",
'(x|ch|ss|sh)$' : "$1es",
'([^aeiouy]|qu)y$' : "$1ies",
'(hive)$' : "$1s",
'(?:([^f])fe|([lr])f)$' : "$1$2ves",
'(shea|lea|loa|thie)f$' : "$1ves",
'sis$' : "ses",
'([ti])um$' : "$1a",
'(tomat|potat|ech|her|vet)o$': "$1oes",
'(bu)s$' : "$1ses",
'(alias)$' : "$1es",
'(octop)us$' : "$1i",
'(ax|test)is$' : "$1es",
'(us)$' : "$1es",
'([^s]+)$' : "$1s"
}
const irregular: { [key: string]: string } = {
'move' : 'moves',
'foot' : 'feet',
'goose' : 'geese',
'sex' : 'sexes',
'child' : 'children',
'man' : 'men',
'tooth' : 'teeth',
'person' : 'people'
}
const uncountable: string[] = [
'sheep',
'fish',
'deer',
'moose',
'series',
'species',
'money',
'rice',
'information',
'equipment',
'bison',
'cod',
'offspring',
'pike',
'salmon',
'shrimp',
'swine',
'trout',
'aircraft',
'hovercraft',
'spacecraft',
'sugar',
'tuna',
'you',
'wood'
]
// save some time in the case that singular and plural are the same
if (uncountable.indexOf(word.toLowerCase()) >= 0) {
return word
}
// check for irregular forms
for (const w in irregular) {
const pattern = new RegExp(`${w}$`, 'i')
const replace = irregular[w]
if (pattern.test(word)) {
return word.replace(pattern, replace)
}
}
// check for matches using regular expressions
for (const reg in plural) {
const pattern = new RegExp(reg, 'i')
if (pattern.test(word)) {
return word.replace(pattern, plural[reg])
}
}
return word
}
And the singular version:
/**
* Returns the singular of an English word.
*
* #export
* #param {string} word
* #param {number} [amount]
* #returns {string}
*/
export function singular(word: string, amount?: number): string {
if (amount !== undefined && amount !== 1) {
return word
}
const singular: { [key: string]: string } = {
'(quiz)zes$' : "$1",
'(matr)ices$' : "$1ix",
'(vert|ind)ices$' : "$1ex",
'^(ox)en$' : "$1",
'(alias)es$' : "$1",
'(octop|vir)i$' : "$1us",
'(cris|ax|test)es$' : "$1is",
'(shoe)s$' : "$1",
'(o)es$' : "$1",
'(bus)es$' : "$1",
'([m|l])ice$' : "$1ouse",
'(x|ch|ss|sh)es$' : "$1",
'(m)ovies$' : "$1ovie",
'(s)eries$' : "$1eries",
'([^aeiouy]|qu)ies$' : "$1y",
'([lr])ves$' : "$1f",
'(tive)s$' : "$1",
'(hive)s$' : "$1",
'(li|wi|kni)ves$' : "$1fe",
'(shea|loa|lea|thie)ves$': "$1f",
'(^analy)ses$' : "$1sis",
'((a)naly|(b)a|(d)iagno|(p)arenthe|(p)rogno|(s)ynop|(t)he)ses$': "$1$2sis",
'([ti])a$' : "$1um",
'(n)ews$' : "$1ews",
'(h|bl)ouses$' : "$1ouse",
'(corpse)s$' : "$1",
'(us)es$' : "$1",
's$' : ""
}
const irregular: { [key: string]: string } = {
'move' : 'moves',
'foot' : 'feet',
'goose' : 'geese',
'sex' : 'sexes',
'child' : 'children',
'man' : 'men',
'tooth' : 'teeth',
'person' : 'people'
}
const uncountable: string[] = [
'sheep',
'fish',
'deer',
'moose',
'series',
'species',
'money',
'rice',
'information',
'equipment',
'bison',
'cod',
'offspring',
'pike',
'salmon',
'shrimp',
'swine',
'trout',
'aircraft',
'hovercraft',
'spacecraft',
'sugar',
'tuna',
'you',
'wood'
]
// save some time in the case that singular and plural are the same
if (uncountable.indexOf(word.toLowerCase()) >= 0) {
return word
}
// check for irregular forms
for (const w in irregular) {
const pattern = new RegExp(`${irregular[w]}$`, 'i')
const replace = w
if (pattern.test(word)) {
return word.replace(pattern, replace)
}
}
// check for matches using regular expressions
for (const reg in singular) {
const pattern = new RegExp(reg, 'i')
if (pattern.test(word)) {
return word.replace(pattern, singular[reg])
}
}
return word
}
The new intl API spec from ECMA will provide the plural rules function,
https://github.com/tc39/proposal-intl-plural-rules
Here's the polyfill that can be used today https://github.com/eemeli/IntlPluralRules
Taken from my blog: https://sergiotapia.me/pluralizing-strings-in-javascript-es6-b5d4d651d403
You can use the pluralize library for this.
NPM:
npm install pluralize --save
Yarn:
yarn add pluralize
Wherever you want to use the lib, you can require it easily.
var pluralize = require('pluralize')
I like to add it to the window object so I can just invoke pluralize() wherever I need it. Within my application.js root file:
window.pluralize = require('pluralize')
Then you can just use it anywhere, React components, or just plain Javascript:
<span className="pull-left">
{`${item.score} ${pluralize('point', item.score)}`}
</span>
console.log(pluralize('point', item.score))
I use this simple inline statement
const number = 2;
const string = `${number} trutle${number === 1 ? "" : "s"}`; //this one
console.log(string)
function pluralize( /* n, [ n2, n3, ... ] str */ ) {
var n = Array.prototype.slice.call( arguments ) ;
var str = n.pop(), iMax = n.length - 1, i = -1, j ;
str = str.replace( /\$\$|\$(\d+)/g,
function( m, p1 ) { return m == '$$' ? '$' : n[+p1-1] }
) ;
return str.replace( /[(](.*?)([+-])(\d*)(?:,([^,)]*))?(?:,([^)]*))?[)]/g,
function( match, one, sign, abs, not1, zero ) {
// if abs, use indicated element in the array of numbers
// instead of using the next element in sequence
abs ? ( j = +abs - 1 ) : ( i < iMax && i++, j = i ) ;
if ( zero != undefined && n[j] == 0 ) return zero ;
return ( n[j] != 1 ) == ( sign == '+' ) ? ( not1 || 's' ) : one ;
}
) ;
}
console.log( pluralize( 1, 'the cat(+) live(-) outside' ) ) ;
// the cat lives outside
console.log( pluralize( 2, 'the child(+,ren) (is+,are) inside' ) ) ;
// the children are inside
console.log( pluralize( 0, '$1 dog(+), ($1+,$1,no) dog(+), ($1+,$1,no) dog(+,,)' ) ) ;
// 0 dogs, no dogs, no dog
console.log( pluralize( 100, 1, '$1 penn(y+,ies) make(-1) $$$2' ) ) ;
// 100 pennies make $1
console.log( pluralize( 1, 0.01, '$1 penn(y+,ies) make(-1) $$$2' ) ) ;
// 1 penny makes $0.01
I’ve created a very simple library that can be used for words pluralization in JavaScript. It transparently uses CLDR database for multiple locales, so it supports almost any language you would like to use. It’s API is very minimalistic and integration is extremely simple. It’s called Numerous.
I’ve also written a small introduction article to it: «How to pluralize any word in different languages using JavaScript?».
Feel free to use it in your project. I will also be glad for your feedback on it.
To provide a simple and readable option (ES6):
export function pluralizeAndStringify(value, word, suffix = 's'){
if (value == 1){
return value + ' ' + word;
}
else {
return value + ' ' + word + suffix;
}
}
If you gave something like pluralizeAndStringify(5, 'dog') you'd get "5 dogs" as your output.
Using #sarink's answer, I made a function to create a string using key value pairs data and pluralizing the keys. Here's the snippet:
// Function to create a string from given key value pairs and pluralize keys
const stringPluralize = function(data){
var suffix = 's';
var str = '';
$.each(data, function(key, val){
if(str != ''){
str += val>0 ? ` and ${val} ${key}${val !== 1 ? suffix : ''}` : '';
}
else{
str = val>0 ? `${val} ${key}${val !== 1 ? suffix : ''}` : '';
}
});
return str;
}
var leftDays = '1';
var leftHours = '12';
var str = stringPluralize({day:leftDays, hour:leftHours});
console.log(str) // Gives 1 day and 12 hours
Use -ies or -s (depending on the second-to-last letter) if the word ends in a y, use -es if the word ends in a ‑s, -ss, -sh, -ch, -x, or -z, use a lookup table if the world is an irregular plural, and use -s otherwise.
var pluralize = (function () {
const vowels = "aeiou";
const irregulars = { "addendum": "addenda", "aircraft": "aircraft", "alumna": "alumnae", "alumnus": "alumni", "analysis": "analyses", "antenna": "antennae", "antithesis": "antitheses", "apex": "apices", "appendix": "appendices", "axis": "axes", "bacillus": "bacilli", "bacterium": "bacteria", "basis": "bases", "beau": "beaux", "bison": "bison", "bureau": "bureaux", "cactus": "cacti", "château": "châteaux", "child": "children", "codex": "codices", "concerto": "concerti", "corpus": "corpora", "crisis": "crises", "criterion": "criteria", "curriculum": "curricula", "datum": "data", "deer": "deer", "diagnosis": "diagnoses", "die": "dice", "dwarf": "dwarves", "ellipsis": "ellipses", "erratum": "errata", "faux pas": "faux pas", "fez": "fezzes", "fish": "fish", "focus": "foci", "foot": "feet", "formula": "formulae", "fungus": "fungi", "genus": "genera", "goose": "geese", "graffito": "graffiti", "grouse": "grouse", "half": "halves", "hoof": "hooves", "hypothesis": "hypotheses", "index": "indices", "larva": "larvae", "libretto": "libretti", "loaf": "loaves", "locus": "loci", "louse": "lice", "man": "men", "matrix": "matrices", "medium": "media", "memorandum": "memoranda", "minutia": "minutiae", "moose": "moose", "mouse": "mice", "nebula": "nebulae", "nucleus": "nuclei", "oasis": "oases", "offspring": "offspring", "opus": "opera", "ovum": "ova", "ox": "oxen", "parenthesis": "parentheses", "phenomenon": "phenomena", "phylum": "phyla", "quiz": "quizzes", "radius": "radii", "referendum": "referenda", "salmon": "salmon", "scarf": "scarves", "self": "selves", "series": "series", "sheep": "sheep", "shrimp": "shrimp", "species": "species", "stimulus": "stimuli", "stratum": "strata", "swine": "swine", "syllabus": "syllabi", "symposium": "symposia", "synopsis": "synopses", "tableau": "tableaux", "thesis": "theses", "thief": "thieves", "tooth": "teeth", "trout": "trout", "tuna": "tuna", "vertebra": "vertebrae", "vertex": "vertices", "vita": "vitae", "vortex": "vortices", "wharf": "wharves", "wife": "wives", "wolf": "wolves", "woman": "women", "guy": "guys", "buy": "buys", "person": "people" };
function pluralize(word) {
word = word.toLowerCase();
if (irregulars[word]) {
return irregulars[word];
}
if (word.length >= 2 && vowels.includes(word[word.length - 2])) {
return word + "s";
}
if (word.endsWith("s") || word.endsWith("sh") || word.endsWith("ch") || word.endsWith("x") || word.endsWith("z")) {
return word + "es";
}
if (word.endsWith("y")) {
return word.slice(0, -1) + "ies";
}
return word + "s";
}
return pluralize;
})();
////////////////////////////////////////
console.log(pluralize("dog"));
console.log(pluralize("cat"));
console.log(pluralize("fox"));
console.log(pluralize("dwarf"));
console.log(pluralize("guy"));
console.log(pluralize("play"));
Obviously, this can't support all English edge-cases, but it has the most common ones.
The concise version:
`${x} minute${x - 1 ? 's' : ''}`