JS conditional RegEx that removes different parts of a string between two delimiters - javascript

I have a string of text with HTML line breaks. Some of the <br> immediately follow a number between two delimiters «...» and some do not.
Here's the string:
var str = ("«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>");
I’m looking for a conditional regex that’ll remove the number and delimiters (ex. «1») as well as the line break itself without removing all of the line breaks in the string.
So for instance, at the beginning of my example string, when the script encounters »<br> it’ll remove everything between and including the first « to the left, to »<br> (ex. «1»<br>). However it would not remove «2»some text<br>.
I’ve had some help removing the entire number/delimiters (ex. «1») using the following:
var regex = new RegExp(UsedKeys.join('|'), 'g');
var nextStr = str.replace(/«[^»]*»/g, " ");
I sure hope that makes sense.
Just to be super clear, when the string is rendered in a browser, I’d like to go from this…
«1»
«2»some text
«3»
«4»more text
«5»
«6»even more text
To this…
«2»some text
«4»more text
«6»even more text
Many thanks!

Maybe I'm missing a subtlety here, if so I apologize. But it seems that you can just replace with the regex: /«\d+»<br>/g. This will replace all occurrences of a number between « & » followed by <br>
var str = "«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>"
var newStr = str.replace(/«\d+»<br>/g, '')
console.log(newStr)
To match letters and digits you can use \w instead of \d
var str = "«a»<br>«b»some text<br>«hel»<br>«4»more text<br>«5»<br>«6»even more text<br>"
var newStr = str.replace(/«\w+?»<br>/g, '')
console.log(newStr)

This snippet assumes that the input within the brackets will always be a number but I think it solves the problem you're trying to solve.
const str = "«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>";
console.log(str.replace(/(«(\d+)»<br>)/g, ""));
/(«(\d+)»<br>)/g
«(\d+)» matches a pair of brackets containing one or more digits in a row.
If you would prefer to match alphanumerics you could use «(\w+)», or for any characters including symbols you could use «([^»]+)».
<br> matches a literal line-break tag.
The g flag (the trailing /g) makes the match global, so that it finds every instance of the substring.
Basically we are only removing the bracketed numbers if they are immediately followed by a line break.
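As a minimal sketch of that idea, the helper below (the name stripEmptyMarkers is made up for illustration) uses the broader «[^»]+» variant mentioned above, so the marker may contain anything except ». It assumes every line break is written exactly as <br>.

```javascript
// Hypothetical helper: drop every «...» marker that is immediately followed by <br>.
function stripEmptyMarkers(str) {
  // [^»]+ cannot run past the closing », so «2»some text<br> is left alone.
  return str.replace(/«[^»]+»<br>/g, "");
}

const markerInput = "«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>";
const markerOutput = stripEmptyMarkers(markerInput);
console.log(markerOutput);
// → «2»some text<br>«4»more text<br>«6»even more text<br>
```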

Related

Regex Match End of Line Unless it Ends with a Closed Bracket

I'm trying to write a JavaScript Regex that will grab the end of a line unless said line ends with a closing bracket, example:
[word]
lengthy text line
[other word]
even lengthier text line! Whoo!
That part I have down pat writing up this Regex new RegExp(/[\n]\n|(?![^\]])$/gm)
But I also need to be able to grab the end of the line even where there isn't a double space, and that is proving to be SUPER difficult since I don't really know a ton about Regex.
-- [word]
These two lines need to be grouped -- lengthy text line
-- [other word]
These two lines need to be grouped -- even lengthier text line! Whoo!
This needs to be its own group -- This text line is the longest of them all!
-- [more words]
These two lines need to be grouped -- The last guy can win...
What's annoying is that there is a very simple regex that accomplishes this goal, a negative lookbehind assertion, (?<!])\n, but it's not currently supported in Firefox, and that's a problem.
EDIT: The method used for the information is splitting, it splits the value placed into a textarea and matches it to array[i].match(/^\[(.*?)\]\n/). It'd look something like this:
var regex = new RegExp(/[\n]\n|(?![^\]])$/gm);
var array = $('#textar').val().split(regex);
for (var i = 0; i < array.length; i++) {
var match = array[i].match(/^\[(.*?)\]\n/)
}
but with a lot more code taking those variables and using them.
SOLUTION:
Wiktor Stribiżew had the solution. Changing .split(regex) to .match(regex) and adding their regex fixed the problem
var regex = new RegExp(/^.*[^\]\n](?:\]\n.*[^\]\n])*$/gm);
var array = $('#textar').val().match(regex);
for (var i = 0; i < array.length; i++) {
var match = array[i].match(/^\[(.*?)\]\n/)
}
You may use String#match:
text.match(/^.*[^\]\n](?:\]\n.*[^\]\n])*$/gm)
Regex details
^ - start of a line
.*[^\]\n] - 0 or more chars other than line break chars, as many as possible and then a char other than a newline and ]
(?:\]\n.*[^\]\n])* - 0 or more repetitions of
\]\n - ] and a newline, LF, char
.*[^\]\n] - 0 or more chars other than line break chars, as many as possible and then a char other than a newline and ]
$ - end of a line.
See the JS demo:
var text = "[word]\nlengthy text line\n\n[other word]\neven lengthier text line! Whoo!\nThis text is the longest of them all!\n[more words]\nThe last gyu can win...";
console.log(text.match(/^.*[^\]\n](?:\]\n.*[^\]\n])*$/gm));
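To tie this back to the loop in the question, here is a rough sketch that runs the same pattern and then separates the optional [label] from the rest of each group. The function name groupLines and the label/body object shape are assumptions made for this example only.

```javascript
// Hypothetical helper: split text into groups and pull out the optional [label].
function groupLines(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const groups = text.match(/^.*[^\]\n](?:\]\n.*[^\]\n])*$/gm) || [];
  return groups.map(function (group) {
    const m = group.match(/^\[(.*?)\]\n/);
    return {
      label: m ? m[1] : null,                      // text inside the brackets, if any
      body: m ? group.slice(m[0].length) : group   // the line that follows the label
    };
  });
}

const sampleText = "[word]\nlengthy text line\n\n[other word]\neven lengthier text line! Whoo!\nThis text line is the longest of them all!\n[more words]\nThe last guy can win...";
const grouped = groupLines(sampleText);
console.log(grouped);
```

With this sample, the unlabeled "longest of them all" line ends up in its own group with a null label.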
You are looking for a regex like this:
/^\[.+(\n+[^\[]+)/gm
^ at the beginning of a line (because of the m flag),
look for [
.+ followed by one or more characters (other than line breaks)
(\n+[^\[]+) then one or more newlines, followed by one or more characters that are not [ (this part can span several lines)
Demo: https://regex101.com/r/c1giqu/3
For your convenience, the full match keeps the text between brackets. The first group includes only the text without the brackets.

How to check if a string contains specific words in different languages [duplicate]

I have a simple regex which finds a word in text:
var patern = new RegExp("\\bsomething\\b", "gi");
This matches the word in text when it has spaces or punctuation around it.
So it matches:
I have something.
But doesn't match:
I havesomething.
which is fine and exactly what I need.
But I have an issue with, for example, Arabic. If I have the regex:
var patern = new RegExp("\\bرياضة\\b", "gi");
and text:
رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي
The keyword which I am looking for is at the end of the text.
But this doesn't work, it just doesn't find it.
It works if I remove \b from regex:
var patern = new RegExp("رياضة", "gi");
But that is not what I want, because I don't want to find it if it's part of another word, like in the English example above:
I havesomething.
I don't know much about regex, so I'd appreciate any help making this work with both English and languages like Arabic.
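First we have to understand what \b means:
\b is an anchor that matches at a position that is called a "word boundary", that is, a position between a word character (\w) and a non-word character. In JavaScript, \w only covers ASCII letters, digits and the underscore, so Arabic letters never count as word characters and \b does not match around them.
In your case, the word boundary you are looking for is a position with no other Arabic letter next to the word.
To match only Arabic letters in regex, we use their Unicode range: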
[\u0621-\u064A]+
Or we can simply use Arabic letters directly
[ء-ي]+
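The code above will match any Arabic letters. To make a word boundary out of it, we can simply negate it on both sides:
[^ء-ي]ARABIC TEXT[^ء-ي]
The code above means: require a character that is not an Arabic letter on each side of the Arabic word, which will work in your case. Note that it needs an actual character on both sides, so a word at the very start or end of the string only matches if the text is padded (the example below has a space at each end).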
Consider this example that you gave us which I modified a little bit:
أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا
If we are trying to match only رياض, this word will make our search match also رياضة, رياضيات, and رياضتي. However, if we add the code above, the match will successfully be on رياض only.
var x = " أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا ";
x = x.replace(/([^ء-ي]رياض[^ء-ي])/g, '<span style="color:red">$1</span>');
document.write (x);
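If padding the text is inconvenient, one possible variation is to accept the start or end of the string as a boundary too. The sketch below is only an illustration: the names WORD_CHARS, escapeRegExp and containsWholeWord are made up here, and treating ASCII word characters plus U+0621 to U+064A as "word characters" is an assumption so that the same check covers the English and Arabic cases from the question.

```javascript
// Characters treated as "part of a word": ASCII word characters plus the
// Arabic letter range used above (an assumption; extend it as needed).
const WORD_CHARS = "A-Za-z0-9_\\u0621-\\u064A";

// Escape regex metacharacters so the search word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: true if `word` occurs in `text` as a whole word.
function containsWholeWord(text, word) {
  const pattern = new RegExp(
    "(?:^|[^" + WORD_CHARS + "])" +      // start of string or a non-word character
    escapeRegExp(word) +
    "(?=[^" + WORD_CHARS + "]|$)",       // followed by a non-word character or the end
    "i"                                  // no g flag, so test() keeps no state
  );
  return pattern.test(text);
}

const arabicText = "رياضة أنا أحب رياضتي";
console.log(containsWholeWord("I have something.", "something")); // true
console.log(containsWholeWord("I havesomething.", "something"));  // false
console.log(containsWholeWord(arabicText, "رياضة"));              // true
console.log(containsWholeWord(arabicText, "رياض"));               // false
```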
If you would like to account for أآإا with one pattern, you could use something like this [\u0622\u0623\u0625\u0627] or simply list them all between square brackets [أآإا]. Here is a complete example:
var x = "أنا هنا وانا هناك .. آنا هنا وإنا هناك";
x = x.replace(/([أآإا]نا)/g, '<span style="color:red">$1</span>');
document.write (x);
Note: If you want to match every possible Arabic characters in Regex including all Arabic letters أ ب ت ث ج, all diacritics َ ً ُ ٌ ِ ٍ ّ, and all Arabic numbers ١٢٣٤٥٦٧٨٩٠, use this regex: [،-٩]+
Useful link about the ranking of Arabic characters in Unicode: https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
This doesn't work because \b in the regex engine is defined in terms of ASCII word characters, so it does not recognize Arabic letters.
You could search for the Unicode characters in the text instead (Unicode ranges).
Or you could convert the text into Unicode escapes and build the regex from those (I have never tried this, but it should work).
I used this ء-ي٠-٩ and it works for me
If you don't need a complicated RegEx (for instance, because you're looking for a particular word or a short list of words), then I've found that it's actually easier to tokenize the search text and find it that way:
>>> text = 'رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي '
>>> tokens = text.split()
>>> print(tokens)
['رياضة', 'أنا', 'أحب', 'رياضتي', 'وأنا', 'سعيد', 'حقا', 'هنا', 'لها', 'حبي']
>>> search_words = ['رياضة', 'رياضت']
>>> found = [w for w in tokens if w in search_words]
>>> print(found)
['رياضة'] # returns only full-word match
I'm sure that this is slower than RegEx, but not enough that I've ever noticed.
If your text had punctuation, you could do a more sophisticated tokenization (so it would find things like 'رياضة؟') using NLTK.
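For readers working in JavaScript, like the question, the same tokenize-and-look-up idea could be sketched as follows. This is not NLTK: it is a plain split on anything outside the Arabic letter range U+0621 to U+064A, which assumes the text carries no diacritics (those sit outside that range and would break words apart). The name tokenizeArabic is hypothetical.

```javascript
// Hypothetical helper: split on anything that is not an Arabic letter
// (U+0621 to U+064A), so punctuation such as ؟ is dropped from the tokens.
function tokenizeArabic(text) {
  // filter(Boolean) removes the empty strings split() leaves at the edges.
  return text.split(/[^\u0621-\u064A]+/).filter(Boolean);
}

const arabicTokens = tokenizeArabic("رياضة؟ أنا أحب رياضتي");
const searchWords = new Set(["رياضة", "رياضت"]);
// Keep only tokens that are exactly one of the search words.
const foundWords = arabicTokens.filter(function (t) { return searchWords.has(t); });
console.log(foundWords);
```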

Test if a sentence is matching a text declaration using regex

I want to test if a sentence like type var1,var2,var3 is matching a text declaration or not.
So, I used the following code :
var text = "int a1,a2,a3",
reg = /int ((([a-z_A-Z]+[0-9]*),)+)$/g;
if (reg.test(text)) console.log(true);
else console.log(false)
The problem is that this regular expression returns false on text that is supposed to be true.
Could someone help me find a good regular expression matching expressions as in the example above?
You have a couple of mistakes.
As written, your regex requires a comma after the last name at the end of the line.
I suppose you also want to accept names like int abc123 as a correct string, so after the first character you should allow letters, digits and underscores in any order.
Avoid using capturing groups for just testing strings.
const str = 'int a1,a2,a3';
const regex = /int (?:[a-zA-Z_](?:[a-zA-Z0-9_])*(?:\,|$))+/g
console.log(regex.test(str));
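A somewhat stricter variant, offered only as a sketch: it anchors the pattern at both ends so nothing may precede int, and it rejects a trailing comma by requiring an identifier after every comma. The names IDENT, DECLARATION and isDeclaration are made up for this example, and only the int type from the question is handled.

```javascript
// One identifier: a letter or underscore, then letters, digits or underscores.
const IDENT = "[a-zA-Z_][a-zA-Z0-9_]*";

// ^...$ anchors the whole string; there is no g flag, so test() keeps no
// lastIndex state between calls.
const DECLARATION = new RegExp("^int " + IDENT + "(?:," + IDENT + ")*$");

// Hypothetical helper: true for strings shaped like "int a1,a2,a3".
function isDeclaration(text) {
  return DECLARATION.test(text);
}

console.log(isDeclaration("int a1,a2,a3")); // true
console.log(isDeclaration("int abc123"));   // true
console.log(isDeclaration("int a1,a2,"));   // false (trailing comma)
console.log(isDeclaration("int 1a"));       // false (starts with a digit)
```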
You will need to add ? after the comma.
The ? quantifier matches the preceding token zero or one time.
Notice that the last name in your text, a3, does not have a , after it.
int ((([a-z_A-Z]+[0-9]*),?)+)$

Javascript Regular expression to remove unwanted <br>,

I have a JS string like this
<div id="grouplogo_nav"><br> <ul><br> <li><a class="group_hlfppt" target="_blank" href="http://www.hlfppt.org/">&nbsp;</a></li><br> </ul><br> </div>
I need to remove all <br> and &nbsp; that are only between > and <. I tried to write a regular expression, but didn't get it right. Does anybody have a solution?
EDIT :
Please note I want to remove only the tags between > and <
Avoid using regex on html!
Try creating a temporary div from the string, and using the DOM to remove any br tags from it. This is much more robust than parsing html with regex, which can be harmful to your health:
var tempDiv = document.createElement('div');
tempDiv.innerHTML = mystringwithBRin;
// getElementsByTagName finds <br> elements at any depth, not only direct children
var nodes = tempDiv.getElementsByTagName('br');
for (var nodeId = nodes.length - 1; nodeId >= 0; --nodeId) {
    // remove each <br> from its own parent element
    nodes[nodeId].parentNode.removeChild(nodes[nodeId]);
}
var newStr = tempDiv.innerHTML;
Note that we iterate in reverse over the nodes so that the indexes remain valid after removing a given node (the collection is live and shrinks with each removal).
http://jsfiddle.net/fxfrt/
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
... where /.../ denotes a regular expression, ^ denotes start of string, (&nbsp;|<br>) denotes "&nbsp; or <br>", and + denotes "one or more occurrences of the previous expression". And then simply replace that full match with an empty string.
s.replace(/(>)(?:&nbsp;|<br>)+(\s?<)/g,'$1$2');
Don't use this in production. See the answer from Phil H.
Edit: I'll try to explain it a bit and hope my English is good enough.
Basically we have two different kinds of parentheses here. The first pair and third pair () are normal parentheses. They are used to remember the characters that are matched by the enclosed pattern and group the characters together. For the second pair, we don't need to remember the characters for later use, so we disable the "remember" functionality by using the form (?:) and only group the characters to make the + work as expected. The + quantifier means "one or more occurrences", so &nbsp; or <br> must be there one or more times. The last part (\s?<) matches a whitespace character (\s), which can be missing or occur one time (?), followed by the character <. $1 and $2 are a kind of variable that is replaced by the remembered characters of the first and third parentheses.
MDN provides a nice table, which explains all the special characters.
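To see the pattern described above in action, here is a small run on a shortened input. The wrapper name stripBetweenTags and the sample string are invented for this illustration, and the earlier warning about not using regex on HTML in production still applies.

```javascript
// Hypothetical wrapper around the pattern explained above.
function stripBetweenTags(html) {
  // $1 restores the ">" and $2 restores the optional whitespace plus "<".
  return html.replace(/(>)(?:&nbsp;|<br>)+(\s?<)/g, "$1$2");
}

const cleaned = stripBetweenTags("<ul><br> <li>&nbsp;</li><br></ul>");
console.log(cleaned);
// → <ul> <li></li></ul>
```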
You need to replace globally. Also don't forget that the <br> can be written self-closed as <br />. Try this:
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/g, '');
This worked for me on multi-line input (note that the m flag only changes how ^ and $ behave, so it makes no difference to the first pattern):
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/gm, '');
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
hope this helps

How do I read a list from a textarea with Javascript?

I am trying to read in a list of words separated by spaces from a textbox with Javascript. This will eventually be in a website.
Thank you.
This should pretty much do it:
<textarea id="foo">some text here</textarea>
<script>
var element = document.getElementById('foo');
var wordlist = element.value.split(' ');
// wordlist now contains 3 values: 'some', 'text' and 'here'
</script>
A more robust way to do this is to use regular expressions to collapse extra whitespace first, and then use @Aron's method. Otherwise, if you have something like "a b c d e" with several spaces between the letters, you will get an array with a lot of empty string elements, which I'm sure you don't want.
Therefore, you should use:
<textarea id="foo">
this is some very bad
formatted text a
</textarea>
<script>
var str = document.getElementById('foo').value;
// collapse runs of whitespace (including newlines) into single spaces, then trim both ends
str = str.replace(/\s+/g, ' ').replace(/^\s+|\s+$/g, '');
var words = str.split(' ');
// words will have exactly 8 items (the 8 words in the textarea)
</script>
The first .replace() function replaces all consecutive spaces with 1 space and the second one trims the whitespace from the start and the end of the string, making it ideal for word parsing :)
Instead of splitting by whitespaces, you can also try matching sequences of non-whitespace characters.
var words = document.getElementById('foo').value.match(/\S+/g);
The problem with the splitting method is that when there is leading or trailing whitespace, you will get an empty element for it. For example, " hello world " would give you ["", "hello", "world", ""].
You may strip the whitespace before and after the text, but there is another problem: the empty string. For example, splitting "" will give you [""].
Instead of finding what we don't want and splitting on it, I think it is better to look for what we want.
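One detail worth handling with the matching approach: match() with the g flag returns null, not an empty array, when nothing matches. A minimal sketch with a fallback (the function name splitWords is made up for this example):

```javascript
// Hypothetical helper: return the whitespace-separated words of a string.
function splitWords(text) {
  // match() gives null for "" or whitespace-only input, so default to [].
  return text.match(/\S+/g) || [];
}

console.log(splitWords(" hello world ")); // [ 'hello', 'world' ]
console.log(splitWords(""));              // []
```

In the browser this would be called as splitWords(document.getElementById('foo').value).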
