I need to get rid of any text inside < and >, including the two delimiters themselves.
So for example, from string
<brev-y>th</brev-y><sw-ex>a</sw-ex><sl>t</sl>
I would like to get this one
that
This is what I've tried so far:
var str = annotation.split(' ');
str.substring(str.lastIndexOf("<") + 1, str.lastIndexOf(">"))
But it doesn't work for every < and >.
I'd rather not use RegEx if possible, but I'm happy to hear if it's the only option.
You can simply use the replace method with /<[^>]*>/g. It matches a <, followed by any number of characters other than > ([^>]*), up to the closing >, and the g flag applies it globally.
var str = '<brev-y>th</brev-y><sw-ex>a</sw-ex><sl>t</sl>';
str = str.replace(/<[^>]*>/g, "");
alert(str);
A regular expression is perfectly fine for removing the tags from a string:
"<brev-y>th</brev-y><sw-ex>a</sw-ex><sl>t</sl>".replace(/<\/?[^>]+>/g, "")
Since the text you want always comes after a > character, you could split the string at that point; the text you need is then at the start of each String in the array, up to the next <. For example (in Java):
String[] strings = stringName.split(">");
StringBuilder word = new StringBuilder();
for (String part : strings) {
    // Keep only the text before the next "<" (the whole part if there is none).
    int lt = part.indexOf('<');
    word.append(lt >= 0 ? part.substring(0, lt) : part);
}
This is probably still rough, but I think it would work. You don't need to actually remove the text between the "<>"; just collect the text right after each '>'.
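Since the question is about JavaScript, here is a minimal sketch of the same split-based idea in that language. The function name textOutsideTags is made up for illustration, and the sketch assumes every < in the input opens a tag that is later closed by a >.

```javascript
// Hypothetical helper: keep the text between each ">" and the next "<".
function textOutsideTags(input) {
  return input
    .split('>')
    .map(function (part) {
      var lt = part.indexOf('<');
      // Drop everything from "<" onward; keep the whole part if there is no "<".
      return lt === -1 ? part : part.slice(0, lt);
    })
    .join('');
}

console.log(textOutsideTags('<brev-y>th</brev-y><sw-ex>a</sw-ex><sl>t</sl>')); // "that"
```

It does the same job as the regex replace, only without a regular expression.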
Using a regular expression is not the only option, but it's a pretty good option.
You can easily parse the string to remove the tags, for example by using a state machine where the < and > characters turn on and off a state of ignoring characters. There are other methods of course, some shorter, some more efficient, but they will all be a few lines of code, while a regular expression solution is just a single replace.
Example:
function removeHtml1(str) {
return str.replace(/<[^>]*>/g, '');
}
function removeHtml2(str) {
var result = '';
var ignore = false;
for (var i = 0; i < str.length; i++) {
var c = str.charAt(i);
switch (c) {
case '<': ignore = true; break;
case '>': ignore = false; break;
default: if (!ignore) result += c;
}
}
return result;
}
var s = "<brev-y>th</brev-y><sw-ex>a</sw-ex><sl>t</sl>";
console.log(removeHtml1(s));
console.log(removeHtml2(s));
There are several ways to do this, some better than others. I haven't written one lately for these two specific characters, so I took a minute and wrote some code that may work. Here is how it works. Create a function containing a loop that copies the incoming string, character by character, to an outgoing string. Give the function a string return type so it returns the modified string. The loop scans the incoming string from index 0 while the index is less than string.length(). Within the loop, add an if statement: when it sees a "<" character in the incoming string, copying stops, but the loop keeps examining every character until it sees the ">" character. When the ">" is found, copying starts again. It's that simple.
The following code may need some refinement, but it should get you started on the method described above. It's not the fastest and not the most elegant but the basic idea is there. This did compile, and it ran correctly, here, with no errors. In my test program it produced the correct output. However, you may need to test it further in the context of your program.
string filter_on_brackets(string str1)
{
string str2 = "";
int copy_flag = 1;
for (size_t i = 0 ; i < str1.length();i++)
{
if(str1[i] == '<')
{
copy_flag = 0;
}
if(str1[i] == '>')
{
copy_flag = 2;
}
if(copy_flag == 1)
{
str2 += str1[i];
}
if(copy_flag == 2)
{
copy_flag = 1;
}
}
return str2;
}
Given a comma-separated string such as
'UserName,Email,[a,b,c]'
I want an array of all the outermost elements, so the expected result is
['UserName','Email', '[a,b,c]']
string.split(',') splits on every comma, so that won't work. Any suggestions? This is breaking a CSV reader I have.
I wrote 2 similar answers, so I might as well make it a 3rd instead of referring you there. It's a stateful split. This doesn't support nested arrays, but it can easily be made to.
var str = 'UserName,Email,[a,b,c]'
function tokenize(str) {
var state = "normal";
var tokens = [];
var current = "";
for (var i = 0; i < str.length; i++) {
var c = str[i]; // declare c locally instead of leaking a global
if (state == "normal") {
if (c == ',') {
if (current) {
tokens.push(current);
current = "";
}
continue;
}
if (c == '[') {
state = "quotes";
current = "";
continue;
}
current += c;
}
if (state == "quotes") {
if (c == ']') {
state = "normal";
tokens.push(current);
current = "";
continue;
}
current += c;
}
}
if (current) {
tokens.push(current);
current = "";
}
return tokens;
}
console.log(tokenize(str))
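As a rough sketch of how nested arrays could be supported, the single "quotes" state can be swapped for a depth counter. This is a variant, not the code above: the name splitTopLevel is hypothetical, and unlike tokenize it keeps the brackets in the output (as the question's expected result does) and keeps empty fields.

```javascript
// Hypothetical variant: split on commas at bracket depth 0, keeping the brackets.
function splitTopLevel(str) {
  var tokens = [];
  var current = '';
  var depth = 0; // number of "[" currently open
  for (var i = 0; i < str.length; i++) {
    var c = str[i];
    if (c === '[') {
      depth++;
    } else if (c === ']' && depth > 0) {
      depth--;
    }
    if (c === ',' && depth === 0) {
      // A comma outside all brackets ends the current token.
      tokens.push(current);
      current = '';
    } else {
      current += c;
    }
  }
  tokens.push(current);
  return tokens;
}

console.log(splitTopLevel('UserName,Email,[a,b,c]'));
console.log(splitTopLevel('x,[a,[b,c]],y'));
```

The first call should give the three elements asked for in the question. The second shows that a nested array stays in one piece.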
You can do this by matching the string to this Regex:
/(^|(?<=,))\[[^[]+\]|[^,]+((?=,)|$)/
let string = '[a,b,c],UserName,[1,2],Email,[a,b,c],password'
let regex = /(^|(?<=,))\[[^[]+\]|[^,]+((?=,)|$)/g
let output = string.match(regex);
console.log(output)
The regex can be summarized as:
Match either an array or a string that's enclosed by commas or at the start/end of our input
The key token we're using is the alternation operator |, which works as a sort of "either this, or that". Since the regex engine is eager, once one alternative matches, it moves on. So if we match an array, we move on and don't consider what's inside it.
We can break it down to 3 main sections:
(^|(?<=,))
^ Match from the beginning of our string
| Alternatively
(?<=,) Match a string that's preceded by a comma without returning the comma. Read more about positive lookaround here.
\[[^[]+\] | [^,]+
\[[^[]+\] Match a string that starts with [ and ends with ] and can contain a string of one or more characters that aren't [
This is because in [1,2],[a,b] it could otherwise match the whole string at once, since it starts with [ and ends with ]. Our condition prevents that by rejecting matches that also contain a [, which indicates that the text belongs to the second array.
| Alternatively
[^,]+ Match a string of any length that doesn't contain a comma, for the same reason as the brackets above since with ,asd,qwe, technically all of asd,qwe is enclosed with commas.
((?=,)|$)
(?=,) Match any string that's followed by a comma
| Alternatively
$ Match a string that ends with the end of the main string. Read here for a better explanation.
Input = ABCDEF ((3) abcdef),GHIJKLMN ((4)(5) Value),OPQRSTUVW((4(5)) Value (3))
Expected Output = ABCDEF,GHIJKLMN,OPQRSTUVW
Tried so far
Output = Input.replace(/ *\([^)]*\)*/g, "");
Using a regex here probably won't work, or scale, because you expect nested parentheses in your input string. Regex works well when there is a known and fixed structure to the input. Instead, I would recommend that you approach this using a parser. In the code below, I iterate over the input string, one character at a time, and I use a counter to keep track of how many open parentheses there are. If we are inside a parenthesized term, we don't record those characters. I also have one simple replacement at the end to remove whitespace, an additional step which your output implies but which you never explicitly mentioned.
var pCount = 0;
var Input = "ABCDEF ((3) abcdef),GHIJKLMN ((4)(5) Value),OPQRSTUVW((4(5)) Value (3))";
var Output = "";
for (var i=0; i < Input.length; i++) {
if (Input[i] === '(') {
pCount++;
}
else if (Input[i] === ')') {
pCount--;
}
else if (pCount == 0) {
Output += Input[i];
}
}
Output = Output.replace(/ /g,'');
console.log(Output);
If you need to remove nested parentheses, you may use a trick from Remove Nested Patterns with One Line of JavaScript.
var Input = "ABCDEF ((3) abcdef),GHIJKLMN ((4)(5) Value),OPQRSTUVW((4(5)) Value (3))";
var Output = Input;
while (Output != (Output = Output.replace(/\s*\([^()]*\)/g, "")));
console.log(Output);
Or, you could use a recursive function:
function remove_nested_parens(s) {
let new_s = s.replace(/\s*\([^()]*\)/g, "");
return new_s == s ? s : remove_nested_parens(new_s);
}
console.log(remove_nested_parens("ABCDEF ((3) abcdef),GHIJKLMN ((4)(5) Value),OPQRSTUVW((4(5)) Value (3))"));
Here, \s*\([^()]*\) matches 0+ whitespaces, (, 0+ chars other than ( and ) and then a ), and the replace operation is repeated until the string does not change.
I need something that takes a string, and divides it into an array.
I want to split it after every space, so that this -
"Hello everybody!" turns into ---> ["Hello", "everybody!"]
However, I want it to ignore spaces in between apostrophes. So for example -
"How 'are you' today?" turns into ---> ["How", "'are you'", "today?"]
Now I wrote the following code (which works), but something tells me that what I did is pretty much horrible and that it can be done with probably 50% less code.
I'm also pretty new to JS so I guess I still don't adhere to all the idioms of the language.
function getFixedArray(text) {
var textArray = text.split(' '); //Create an array from the string, splitting by spaces.
var finalArray = [];
var bFoundLeadingApostrophe = false;
var bFoundTrailingApostrophe = false;
var leadingRegExp = /^'/;
var trailingRegExp = /'$/;
var concatenatedString = "";
for (var i = 0; i < textArray.length; i++) {
var text = textArray[i];
//Found a leading apostrophe
if(leadingRegExp.test(text) && !bFoundLeadingApostrophe && !trailingRegExp.test(text)) {
concatenatedString =concatenatedString + text;
bFoundLeadingApostrophe = true;
}
//Found the trailing apostrophe
else if(trailingRegExp.test(text ) && !bFoundTrailingApostrophe) {
concatenatedString = concatenatedString + ' ' + text;
finalArray.push(concatenatedString);
concatenatedString = "";
bFoundLeadingApostrophe = false;
bFoundTrailingApostrophe = false;
}
//Found no trailing apostrophe even though the leading flag indicates true, so we want this string.
else if (bFoundLeadingApostrophe && !bFoundTrailingApostrophe) {
concatenatedString = concatenatedString + ' ' + text;
}
//Regular text
else {
finalArray.push(text);
}
}
return finalArray;
}
I would deeply appreciate it if somebody could go through this and teach me how this should be rewritten, in a more correct & efficient way (and perhaps a more "JS" way).
Thanks!
Edit -
Well I just found a few problems, some of which I fixed, and some I'm not sure how to handle without making this code too complex (for example the string "hello 'every body'!" doesn't split properly....)
You could try matching instead of splitting:
string.match(/(?:['"].+?['"])|\S+/g)
The above regex will match anything in between quotes (including the quotes), or anything that's not a space otherwise.
If you want to also match characters after the quotes, like ? and ! you can try:
/(?:['"].+?['"]\W?)|\S+/g
For "hello 'every body'!" it will give you this array:
["hello", "'every body'!"]
Note that \W matches a space as well. If you only want to match punctuation, you can be explicit and use a character class in place of \W:
[,.?!]
Or simply trim the strings after matching:
string.match(regex).map(function(x){return x.trim()})
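For comparison, here is a minimal sketch of a regex-free version that walks the string once and toggles a flag on each apostrophe. The name splitOutsideQuotes is made up, and the sketch assumes apostrophes are only used as quote marks, so a contraction such as don't would confuse it.

```javascript
// Hypothetical helper: split on spaces, except inside '...' pairs.
function splitOutsideQuotes(text) {
  var tokens = [];
  var current = '';
  var inQuote = false;
  for (var i = 0; i < text.length; i++) {
    var c = text[i];
    if (c === "'") {
      // Entering or leaving a quoted run; the apostrophe itself is kept.
      inQuote = !inQuote;
      current += c;
    } else if (c === ' ' && !inQuote) {
      // A space outside quotes ends the current token (empty tokens are skipped).
      if (current) tokens.push(current);
      current = '';
    } else {
      current += c;
    }
  }
  if (current) tokens.push(current);
  return tokens;
}

console.log(splitOutsideQuotes("How 'are you' today?"));
console.log(splitOutsideQuotes("hello 'every body'!"));
```

Because punctuation right after a closing apostrophe is simply appended to the current token, the "hello 'every body'!" case from the question's edit comes out as two elements.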
I am trying to create a regex which will ultimately be used with Google Forms to validate a textarea input.
The rule is,
Input area can have one or more URLs (http or https)
URLs must be separated from each other by one or more new lines
Each line which has text, must be a single valid URL
Last URL may have or may not have new line character/s after it
Till now, I have written this regex ^(https?://.+[\r\n]+)*(https?://.+[\r\n]+?)$ but the problem is that if a line has more than 1 url, it validates that too.
Here is my testing playground: http://goo.gl/YPdvBH.
Here is what you are looking for
function validate(ele) {
var str = ele.value; // declare str locally instead of leaking a global
str = str.replace(/\r/g, "");
while (/\s\n/.test(str)) {
str = str.replace(/\s\n/g, "\n");
}
while (/\n\n/.test(str)) {
str = str.replace(/\n\n/g, "\n");
}
ele.value = str;
str = str.replace(/\n/g, "_!_&_!_").split("_!_&_!_")
var result = [], counter = 0;
for (var i = 0; i < str.length; i++) {
str[i] = str[i].replace(/(?:(?:^|\n)\s+|\s+(?:$|\n))/g, '').replace(/\s+/g, ' ');
if(str[i].length !== 0){
if (isValidAddress(str[i])) {
result.push(str[i]);
}
counter += 1;
}
}
function isValidAddress(s) {
return /^(https?|ftp):\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i.test(s)
}
return (result.length === str.length);
}
var ele = document.getElementById('urls');
validate(ele);
This is closer to the regex you are looking for:
^(https?://[\S]+[\r\n]+)*(https?://[\S]+[\r\n]+?)$
The difference between your regex and this one is that you use .+ which will match all characters except newline whereas I use [\S]+ (note it is a capital S) which will match all non-whitespace characters. So, this doesn't match more than one token on one line. Hence, on each line you can match at max one token and that must be of the form that you have defined.
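Where the check can run in a script rather than in a single pattern field, the same rule can be sketched line by line. This is only an illustration: allLinesAreUrls and SINGLE_URL are made-up names, and "valid URL" is assumed to mean nothing more than http:// or https:// followed by non-whitespace characters, as in the regex above.

```javascript
// Assumption: a "valid URL" is just http(s):// followed by non-whitespace.
var SINGLE_URL = /^https?:\/\/\S+$/;

// Hypothetical helper: true when every non-blank line is exactly one URL.
function allLinesAreUrls(text) {
  var lines = text.split(/\r?\n/).filter(function (line) {
    return line.trim() !== ''; // blank lines between URLs are allowed
  });
  return lines.length > 0 && lines.every(function (line) {
    return SINGLE_URL.test(line.trim());
  });
}

console.log(allLinesAreUrls('http://a.com\nhttps://b.org\n'));  // true
console.log(allLinesAreUrls('http://a.com http://b.com'));      // false
```

Like the [\S]+ pattern, this rejects two URLs separated by a space on one line, but it would not notice two URLs glued together without whitespace.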
For a regex to match a single URL, look at this question on StackOverflow:
What is the best regular expression to check if a string is a valid URL?
I don't know whether Google Forms has a length limit on the pattern, but if it does, a regex like that is almost sure to hit it.
If I understand correctly, your regexp is missing the m (multiline) flag, so you need something like this:
/^(https?://.+this your reg exp for one url)$/m
sample with regexp from Javascript URL validation regex
/^(ht|f)tps?:\/\/[a-z0-9-\.]+\.[a-z]{2,4}\/?([^\s<>\#%"\,\{\}\\|\\\^\[\]`]+)?$/m
How would I split a javascript string such as foo\nbar\nbaz to an array of lines, while preserving the newlines? I'd like to get ['foo\n', 'bar\n', 'baz'] as output;
I'm aware there are numerous possible answers - I'm just curious to find a stylish one.
With perl I'd use a zero-width lookbehind assertion: split /(?<=\n)/, but they are not supported in javascript regexs.
PS. Extra points for handling different line endings (at least \r\n) and handling the missing last newline (as in my example).
You can perform a global match with this pattern: /[^\n]+(?:\r?\n|$)/g
It matches one or more non-newline characters, followed by either an optional \r and a \n, or the end of the string. Note that blank lines are skipped, because [^\n]+ needs at least one character.
var input = "foo\r\n\nbar\nbaz";
var result = input.match(/[^\n]+(?:\r?\n|$)/g);
Result: ["foo\r\n", "bar\n", "baz"]
How about this?
"foo\nbar\nbaz".split(/^/m);
Result
["foo
", "bar
", "baz"]
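For comparison, the Perl-style lookbehind split mentioned in the question can be sketched directly, assuming an engine that implements ES2018 lookbehind assertions (older engines will reject the pattern). The function name splitKeepingNewlines is made up.

```javascript
// Assumes ES2018 lookbehind support; splits after every "\n" without consuming it.
function splitKeepingNewlines(s) {
  return s.split(/(?<=\n)/);
}

console.log(splitKeepingNewlines("foo\nbar\nbaz")); // ["foo\n", "bar\n", "baz"]
console.log(splitKeepingNewlines("foo\r\nbar"));    // ["foo\r\n", "bar"]
```

Because the split point sits after the \n, a preceding \r stays attached to its line, so \r\n endings need no special handling.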
The other answers and answers in comments are all flawed in different ways. I needed a function that works correctly on any string or file.
Here is a simple and correct answer:
function split_lines(s) {
return s.match(/[^\n]*\n|[^\n]+/g);
}
input = "foo\r\n\nbar\n\r\nba\rz\r\r\r";
a = split_lines(input);
Array(5) [ "foo\r\n", "\n", "bar\n", "\r\n", "ba\rz\r\r\r" ]
It effectively splits at each newline \n but includes the \n, and includes a final line without trailing \n if and only if it is not empty. It includes all input characters in the output. We don't need any special treatment for \r.
I've tested this on a large chunk of random data: it preserves all input characters, and \n only occurs at the end of a line.
Here's a test script:
function split_lines(s) {
return s.match(/[^\n]*\n|[^\n]+/g);
}
function gen_random_string(n, ncharset=256, nlprob=0.05, crprob=0.05) {
var s = "";
for (let i = 0; i < n; ++i) {
var r = Math.random();
if (r < nlprob)
s += "\n";
else if (r < nlprob + crprob)
s += "\r";
else {
var cc = Math.floor((r - nlprob - crprob) / (1 - nlprob - crprob) * ncharset); // rescale r to [0, 1) so cc stays below ncharset
var c = String.fromCharCode(cc);
s += c;
}
}
return s;
}
function test(...args) {
var s = gen_random_string(...args);
console.log(`generated random string of length ${s.length} with args:`, ...args);
var ok = true, ok1;
var a = split_lines(s);
console.log(`split into ${a.length} lines`);
ok1 = s === a.join('');
ok = ok && ok1;
console.log("split lines combine to give the original string?", ok1 ? "OK" : "FAIL");
for (var i = 0; i < a.length; ++i) {
var s1 = a[i];
ok1 = s1.endsWith("\n") || i == a.length-1;
ok = ok && ok1;
ok1 = !s1.slice(0, -1).includes("\n");
ok = ok && ok1;
}
console.log("tested each line other than the last ends with \\n");
console.log("tested each line does not contain \\n before the last character");
console.log("Final result", ok ? "OK" : "FAIL");
}
test(10000, 256);
test(10000, 65536);
I'd stay away from split with regular expressions, since IE has a broken implementation of it. Use match instead.
"foo\nbar\nbaz".match(/^.*(\r?\n|$)/mg)
Result: ["foo\n", "bar\n", "baz"]
One simple but crude method would be to first replace each "\n" with two special characters, split on the second one, and then replace the first one with "\n" after splitting. Not efficient and not elegant, but it definitely works.
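A minimal sketch of that crude method, assuming the two control characters \u0000 and \u0001 never occur in the input. The names KEEP, CUT and splitWithSentinels are made up for illustration.

```javascript
// Assumption: these two control characters never occur in the input.
var KEEP = '\u0000'; // stands in for "\n" and stays attached to its line
var CUT = '\u0001';  // marks where to split

function splitWithSentinels(s) {
  var parts = s.replace(/\n/g, KEEP + CUT).split(CUT);
  // A trailing "\n" leaves an empty last piece; drop it.
  if (parts.length > 1 && parts[parts.length - 1] === '') parts.pop();
  // Turn the first sentinel back into the newline it replaced.
  return parts.map(function (p) { return p.split(KEEP).join('\n'); });
}

console.log(splitWithSentinels("foo\nbar\nbaz")); // ["foo\n", "bar\n", "baz"]
```

If the input could contain either sentinel, this approach silently corrupts it, which is the main reason to prefer one of the regex-based answers.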