Uppercase for each new word, Swedish characters and HTML markup - JavaScript

I was pointed to this post, which does not seem to cover the criteria I have:
Replace a Regex capture group with uppercase in Javascript
I am trying to make a regex that will:
format a string by adding uppercase for the first letter of each word and lower case for the rest of the characters
ignore HTML markup
Accept Swedish characters (åäöÅÄÖ)
Say I've got this string:
<b>app</b>le store östersund
Then I want it to be (changes marked by uppercase characters)
<b>App</b>le Store Östersund
I've been playing around with it and the closest I've got is the following:
(?!([^<])*?>)[åäöÅÄÖ]|\s\b\w
Resulted in
<b>app</b>le Store Östersund
Or this
/(?!([^<])*?>)[åäöÅÄÖ]|\S\b\w/g
Resulted in
<B>App</B>Le store Östersund
Here's a fiddle:
http://refiddle.com/refiddles/598aabef75622d4a531b0000
Any help or advice is much appreciated.

It is not possible to do this with regexp alone, since regexp doesn't understand HTML structure. [*] Instead, we need to process each text node, and carry through our logic for what is the beginning of the word in case a word continues across different text nodes. A character is at start of the word if it is preceded by a whitespace, or if it is at the start of the string and it is either the first text node, or the previous text node ended in whitespace.
function htmlToTitlecase(html, letters) {
let div = document.createElement('div');
let re = new RegExp("(^|\\s)([" + letters + "])", "gi");
div.innerHTML = html;
let treeWalker = document.createTreeWalker(div, NodeFilter.SHOW_TEXT);
let startOfWord = true;
while (treeWalker.nextNode()) {
let node = treeWalker.currentNode;
node.data = node.data.replace(re, function(match, space, letter) {
if (space || startOfWord) {
return space + letter.toUpperCase();
} else {
return match;
}
});
startOfWord = node.data.match(/\s$/);
}
return div.innerHTML;
}
console.log(htmlToTitlecase("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
[*] Maybe possible, but even if so, it would be horribly ugly, since it would need to cover an awful amount of corner cases. Also might need a stronger RegExp engine than JavaScript's, like Ruby's or Perl's.
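If there is no DOM available (for example under Node), the same carry-over idea can be sketched by splitting the string on tags instead of walking text nodes. This is only a rough illustration, not the answer's method: titlecaseAcrossTags is a made-up name, and the naive <[^>]*> split assumes that no attribute value contains a > character.

```javascript
// Hypothetical DOM-free variant: split on tags, process only the text parts.
// Assumption: tags never contain ">" inside attribute values.
function titlecaseAcrossTags(html, letters) {
  const re = new RegExp("(^|\\s)([" + letters + "])", "gi");
  let startOfWord = true; // true until we have seen text not ending in whitespace
  return html
    .split(/(<[^>]*>)/) // the capturing group keeps the tags at odd indices
    .map(function (part, index) {
      if (index % 2 === 1 || part === "") return part; // tag or empty text: untouched
      const out = part.replace(re, function (match, space, letter) {
        // A letter starts a word if whitespace precedes it, or if it opens
        // this text part and the previous text part ended a word.
        return (space || startOfWord) ? space + letter.toUpperCase() : match;
      });
      startOfWord = /\s$/.test(part);
      return out;
    })
    .join("");
}

console.log(titlecaseAcrossTags("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
```

Like the DOM version, this only uppercases word starts; it does not lowercase the remaining letters.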
EDIT:
Even if just specifying really simple html tags? The only ones I am actually in need of covering is <b> and </b> at the moment.
This was not specified in the question. The solution is general enough to work for any markup (including simple tags). But...
function simpleHtmlToTitlecaseSwedish(html) {
return html.replace(/(^|\s)(<\/?b>|)([a-zåäö])/gi, function(match, space, tag, letter) {
return space + tag + letter.toUpperCase();
});
}
console.log(simpleHtmlToTitlecaseSwedish("<b>app</b>le store östersund"));
// <b>App</b>le Store Östersund

I have a solution which uses almost nothing but regex. It may not be the most intuitive way to do it, but it should be effective and I find it fun :)
You have to append to the end of your string every lowercase character followed by its uppercase counterpart, like this (the appended list must also be preceded by a space for my regex to work):
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ
(I don't know which letters are missing, I know nothing about the Swedish alphabet, sorry... I'm counting on you to correct that!)
Then you can use the following regex:
(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$
Replace with:
$1$3
Here is working JavaScript code:
// Initialization
var regex = /(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$/g;
var string = "test <b when=\"2>1\">ap<i>p</i></b>le store östersund";
// Processing
var result = string + " aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ";
result = result.replace(regex, "$1$3");
// Display result
console.log(result);
Edit: I forgot to handle the first word of the string; that is now corrected :)

Related

JavaScript not removing text when a uppercase letter involved

So I have a text box on my website and I have coded this to prevent certain words from being used.
window.onload = function() {
var banned = ['MMM', 'XXX'];
document.getElementById('input_1_17').addEventListener('keyup', function(e) {
var text = document.getElementById('input_1_17').value;
for (var x = 0; x < banned.length; x++) {
if (text.toLowerCase().search(banned[x]) !== -1) {
alert(banned[x] + ' is not allowed!');
}
var regExp = new RegExp(banned[x]);
text = text.replace(regExp, '');
}
document.getElementById('input_1_17').value = text;
}, false);
}
The code works perfectly and removes the text from the text box when all the letters typed are lowercase. The problem is when the text contained an uppercase letter it will give the error but the word will not be removed from the text box.
The RegExp is a good direction; you just need some flags (to make it case-insensitive and global, so that it replaces all occurrences):
var text="Under the xxx\nUnder the XXx\nDarling it's MMM\nDown where it's mmM\nTake it from me";
console.log("Obscene:",text);
var banned=["XXX","MMM"];
banned.forEach(nastiness=>{
text=text.replace(new RegExp(nastiness,"gi"),"");
});
console.log("Okay:",text);
Normally you should call .toLowerCase() on both sides when comparing strings so that they can actually match.
But the problem actually comes from the regex you are using, which is case-sensitive by default; you just need to add the i flag (and g to replace every occurrence):
var regExp = new RegExp(banned[x], 'gi');
text = text.replace(regExp, '');
Note: using an alert() in a loop is not recommended; you can change your logic to report all the matched items in a single alert().
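As a rough sketch of that suggestion, the removal and the reporting could be separated so the caller shows one message at the end. The name stripBanned and the returned { text, found } shape are made up for illustration; the escaping step is an extra precaution in case a banned word ever contains regex metacharacters.

```javascript
// Hypothetical helper: removes banned words case-insensitively and reports
// which ones were found, so the caller can show a single alert afterwards.
function stripBanned(text, banned) {
  const found = [];
  for (const word of banned) {
    // Escape regex metacharacters so the word is matched literally.
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const cleaned = text.replace(new RegExp(escaped, "gi"), "");
    if (cleaned !== text) found.push(word);
    text = cleaned;
  }
  return { text: text, found: found };
}

const result = stripBanned("Under the XXx, it's mmM", ["XXX", "MMM"]);
console.log(result.text);  // Under the , it's 
console.log(result.found); // [ 'XXX', 'MMM' ]
```

In the keyup handler you would then call alert(result.found.join(', ') + ' not allowed!') once, and only if result.found is not empty.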
You seem to have been expecting something unreasonable. Lowercase strings will never match strings containing uppercase letters.
Either convert both for comparison or use lowercase banned strings. The former would be more reliable, taking future human error out of the process.
What you can do is actually convert both variables to either all caps or all lowercase.
if (text.toLowerCase().includes(banned[x].toLowerCase())) {
alert(banned[x] + ' is not allowed!');
}
Not tested, but it should work. There is no need to use search since you don't need the index anyway; using includes is cleaner.

Finding punctuation marks in text with string-methods

How can I find out when a punctuation mark (?!;.) or a "<" character occurs in a string? I don't want to use an array or compare letter by letter, but rather solve it with string methods. Something like this:
var text = corpus.substr(0, corpus.indexOf("."));
OK, if I explicitly specify a character such as a period, it works fine. The problem with my parsing is that, when looping over a long text, I no longer know how a sentence ends, whether with a question mark or an exclamation point. I tried the following, but it doesn't work:
var text = corpus.substr(0, corpus.indexOf(corpus.search(".")));
I want to loop through a long string and use every punctuation found to use it as the end-of-sentence character.
Do you know how I can solve my problem?
You can start with a RegExp and weigh it against going character by character and comparing ASCII codes. Split is another way (just posted above).
RegExp solution
function getTextUpToPunc( text ) {
const regExp = /^.+(\!|\?|\.)/mg;
for (let match; (match = regExp.exec( text )) !== null;) {
console.log(match);
}
}
getTextUpToPunc(
"what a chunky funky monkey! this is really something else"
)
The key advantage here is that you do not have to loop through the entire string yourself: each call to regExp.exec( text ) returns the next match, so you keep control over the iteration.
The split solution posted earlier would work, but split will loop over the entire string. Typically that would not be an issue, but if your strings are thousands upon thousands of characters long and you do this operation a lot, then it would make sense to think about performance.
And if this function will be run many times, a small performance improvement would be to memoize the RegExp creation:
const regExp = /^.+(\!|\?|\.)/mg;
Into something like this
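function getTextUpToPunc( text ) {
// Cache the RegExp on the function itself; `this` is undefined in strict mode.
if( !getTextUpToPunc._regExp ) getTextUpToPunc._regExp = /^.+(\!|\?|\.)/mg;
const regExp = getTextUpToPunc._regExp;
for (let match; (match = regExp.exec( text )) !== null;) {
console.log(match);
}
}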
Use a regular expression with a character class (inside the brackets, ?, !, ; and . are matched literally):
var text = corpus.split(/[?!;.<]/);
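Since split throws the delimiter away, here is a small sketch of a variant that keeps whichever character ended each sentence, using match with the same character class. The function name splitSentences is made up, and treating ; and < as sentence terminators simply follows the question.

```javascript
// Hypothetical helper: returns each sentence together with its terminator.
function splitSentences(corpus) {
  // One or more non-terminator characters, then an optional terminator.
  const parts = corpus.match(/[^?!;.<]+[?!;.<]?/g) || [];
  return parts
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s.length > 0; });
}

console.log(splitSentences("Is it here? Yes! Done."));
// [ 'Is it here?', 'Yes!', 'Done.' ]
```

The last character of each entry then tells you how that sentence ended, e.g. sentence.slice(-1).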

Mixed results with White Spaces, and add a dash in Javascript?

How do you combine eliminating white-spaces and special characters with only a single '-' character?
Here's a little Background:
When publishing a job to the career section for my company, the ATS turns the job title into part of the URL, e.g. if a job title is:
Olympia, WA: SLP Full or Part Time it will become olympia-wa-slp-full-or-part-time
I've experimented from other similar questions, but have only come close with this bit of code:
function newTitle(str) {
var x = str.replace(/[\W]/g, '-').toLowerCase();
return x;
}
Now if I run it, the output generated is olympia--wa--slp-full-or-part-time
(has 2 dashes from the extra spaces). What am I not getting right?
I've also tried the following bits:
str.replace(/\s+/g, '');
and
str.replaceAll("[^a-zA-Z]+", " ");
but neither gets close to the desired format.
Thanks!
You got pretty close in your first example, just add + after [\W] to match one or more non-word characters. You can also give it a try in Regexr
function newTitle(str) {
var x = str.replace(/[\W]+/g, '-').toLowerCase();
return x;
}
alert(newTitle('Olympia, WA: SLP Full or Part Time'));
What you actually want, it looks like, is to create a slug from a string.
Here is a nice reusable function that also takes care of multiple dashes:
function slugify(s) {
s = s.replace(/[^\w\s-]/g, '').trim().toLowerCase();
s = s.replace(/[-\s]+/g, '-');
return s;
}
console.log(
slugify("Olympia, WA: SLP Full or Part Time")
);
Your last example [^a-zA-Z]+ almost works if you use a dash as the replacement. It uses a negated character class to match anything other than what you listed, which includes whitespace and special characters.
Note that if a job title contains, for example, a digit or an underscore, that would also be replaced. You could expand the character class with what you don't want to be replaced, like [^a-zA-Z0-9]+, or, if you also want to keep the underscore, use \W+, as that matches [^a-zA-Z0-9_]
function newTitle(str) {
return str.replace(/[^a-zA-Z]+/g, '-').toLowerCase();
}
console.log(newTitle("Olympia, WA: SLP Full or Part Time"));
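One case the versions above leave open is a title that starts or ends with punctuation, which would produce a leading or trailing dash. A minimal sketch that also trims those, assuming digits should be kept (toSlug is a made-up name):

```javascript
// Hypothetical variant: collapse runs of non-alphanumerics into one dash,
// then strip any dash left at the start or end.
function toSlug(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

console.log(toSlug("Olympia, WA: SLP Full or Part Time"));
// olympia-wa-slp-full-or-part-time
console.log(toSlug("  RN (Night Shift)!  "));
// rn-night-shift
```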

Javascript/Jquery - how to replace a word but only when not part of another word?

I am currently doing a regex comparison to remove words (rude words) from a text field when written by the user. At the moment it performs the check when the user hits space and removes the word if matches. However it will remove the word even if it is part of another word. So if you type apple followed by space it will be removed, that's ok. But if you type applepie followed by space it will remove 'apple' and leave pie, that's not ok. I am trying to make it so that in this instance if apple is part of another word it will not be removed.
Is there any way I can perform the comparison on the whole word only or ignore the comparison if it is combined with other characters?
I know that this allows people to write many rude things with no space. But that is the desired effect by the people that give me orders :(
Thanks for any help.
function rude(string) {
var regex = /apple|pear|orange|banana/ig;
// example words because I'm sure you don't need to read profanity
var updatedString = string.replace( regex, function(s) {
var blank = "";
return blank;
});
return updatedString;
}
$(input).keyup(function(event) {
var text;
if (event.keyCode == 32) {
var text = rude($(this).val());
$(this).val(text);
$("someText").html(text);
}
});
You can use word boundaries (\b), which match 0 characters, but only at the beginning or end of a word. I'm also using grouping (the parentheses), so it's easier to read and write such expressions.
var regex = /\b(apple|pear|orange|banana)\b/ig;
BTW, in your example you don't need to use a function. This is sufficient:
function rude(string) {
var regex = /\b(apple|pear|orange|banana)\b/ig;
return string.replace(regex, '');
}
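If the word list lives in an array rather than being typed into the pattern, the same \b(...)\b expression can be built at runtime. This is only a sketch under that assumption; removeWholeWords is a made-up name, and the escaping step guards against list entries that contain regex metacharacters.

```javascript
// Hypothetical helper: build /\b(word1|word2|...)\b/gi from an array.
function removeWholeWords(text, words) {
  const escaped = words.map(function (w) {
    return w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  });
  const re = new RegExp("\\b(" + escaped.join("|") + ")\\b", "gi");
  return text.replace(re, "");
}

console.log(JSON.stringify(removeWholeWords("apple applepie Pear", ["apple", "pear"])));
// " applepie "
```

Keep in mind that \b is defined in terms of \w, i.e. [A-Za-z0-9_], so letters such as å, ä and ö count as non-word characters and can create boundaries in the middle of a word.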

Javascript regular expression to replace word but not within curly brackets

I have some content, for example:
If you have a question, ask for help on StackOverflow
I have a list of synonyms:
a={one typical|only one|one single|one sole|merely one|just one|one unitary|one small|this solitary|this slight}
ask={question|inquire of|seek information from|put a question to|demand|request|expect|inquire|query|interrogate}
I'm using JavaScript to:
Split synonyms based on =
Looping through every synonym, if found in content replace with {...|...}
The output should look like:
If you have {one typical|only one|one single|one sole|merely one|just one|one unitary|one small|this solitary|this slight} question, {question|inquire of|seek information from|put a question to|demand|request|expect|inquire|query|interrogate} for help on StackOverflow
Problem:
Instead of replacing the entire word, it's replacing every character found. My code:
for(syn in allSyn) {
var rtnSyn = allSyn[syn].split("=");
var word = rtnSyn[0];
var synonym = (rtnSyn[1]).trim();
if(word && synonym){
var match = new RegExp(word, "ig");
postProcessContent = preProcessContent.replace(match, synonym);
preProcessContent = postProcessContent;
}
}
It should replace content word with synonym which should not be in {...|...}.
When you build the regexps, you need to include word boundary anchors at both the beginning and the end to match whole words (beginning and ending with characters from [a-zA-Z0-9_]) only:
var match = new RegExp("\\b" + word + "\\b", "ig");
Depending on the specific replacements you are making, you might want to apply your method to individual words (rather than to the entire text at once) matched using a regexp like /\w+/g to avoid replacing words that themselves are the replacements for others. Something like:
content = content.replace(/\w+/g, function(word) {
for(var i = 0, L = allSyn.length; i < L; ++i) {
var rtnSyn = allSyn[i].split("=");
var synonym = (rtnSyn[1]).trim();
if(synonym && rtnSyn[0].toLowerCase() == word.toLowerCase()) return synonym;
}
return word; // no synonym found: keep the original word
});
Regular expressions include something called a "word-boundary", represented by \b. It is a zero-width assertion (it just checks something, it doesn't "eat" input) that says in order to match, certain word boundary conditions have to apply. One example is a space followed by a letter; given the string ' X', this regex would match it: / \bX/. So to make your code work, you just have to add word boundaries to the beginning and end of your word regex, like this:
for(syn in allSyn) {
var rtnSyn = allSyn[syn].split("=");
var word = rtnSyn[0];
var synonym = (rtnSyn[1]).trim();
if(word && synonym){
var match = new RegExp("\\b"+word+"\\b", "ig");
postProcessContent = preProcessContent.replace(match, synonym);
preProcessContent = postProcessContent;
}
}
[Note that there are two backslashes in each of the word boundary matchers because in JavaScript strings, the backslash introduces escape sequences -- two backslashes turn into a literal backslash.]
For optimization, don't create a new RegExp on each iteration. Instead, build up a big regex like [^{A-Za-z](a|ask|...)[^}A-Za-z] and a hash with a value for each key specifying what to replace it with. I'm not familiar enough with JavaScript to create the code on the fly.
Note the separator regex which says the match cannot begin with { or end with }. This is not terribly precise, but hopefully acceptable in practice. If you genuinely need to replace words next to { or } then this can certainly be refined, but I'm hoping we won't have to.
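For what it's worth, here is a rough sketch of the single-regex idea in JavaScript. It swaps the [^{A-Za-z]...[^}A-Za-z] separators for a slightly different technique: the pattern first matches any existing {...} group and returns it unchanged, and only otherwise matches a whole word via \b. The name applySynonyms and the assumption that allSyn is an array of "word={...}" strings are mine.

```javascript
// Hypothetical helper: one pass, one regex, one lookup table.
// Assumes each entry looks like "word={syn1|syn2}".
function applySynonyms(content, synonymLines) {
  const table = new Map();
  for (const line of synonymLines) {
    const idx = line.indexOf("=");
    if (idx < 1) continue; // no "=" or empty word: skip the entry
    const word = line.slice(0, idx).trim().toLowerCase();
    const synonym = line.slice(idx + 1).trim();
    if (word && synonym) table.set(word, synonym);
  }
  if (table.size === 0) return content;
  const escaped = Array.from(table.keys()).map(function (w) {
    return w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  });
  // First alternative swallows existing {...} groups so they are never edited.
  const re = new RegExp("\\{[^}]*\\}|\\b(" + escaped.join("|") + ")\\b", "gi");
  return content.replace(re, function (match, word) {
    return word ? table.get(word.toLowerCase()) : match;
  });
}

console.log(applySynonyms(
  "If you have a question, ask {a|b}",
  ["a={one|single}", "ask={inquire|query}"]
));
// If you have {one|single} question, {inquire|query} {a|b}
```

Because replace works through the input in a single pass and never rescans inserted text, a synonym that itself contains another listed word is not replaced a second time.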
