Conditional regex to retrieve text - javascript

I'm new to regex and want to parse my input as explained below using JavaScript:
There are 3 types of inputs that I might get:
someEmail#domain.com;anotherEmail#domain.com;
and
some name<someEmail#domain.com>;another name<anotherEmail#domain.com>;
or it might be like
someEmail#domain.com;another name<anotherEmail#domain.com>;
What I'm trying to do is split the whole input on ; which gives me an array of entries, then check each of those array items. If an item:
has < and > then retrieve the text between < and > as value.
doesn't have < and > then take the whole text as value.
I'm already trying to learn regex. If anyone gives me the regex, I would appreciate if it comes with an explanation so I can understand and learn.
Cheers

Try something like this as a starter - avoid the complicated regex - it's not required if your inputs are in the form stated:
var str = 'someEmail#domain.com;another name<anotherEmail#domain.com>;someEmail#domain.com;anotherEmail#domain.com;some name<someEmail#domain.com>;another name<anotherEmail#domain.com>;test#test.com';
var splits = str.split(';');
for (var i = 0; i < splits.length; i++) {
    if (splits[i].indexOf('<') == -1) {
        // no angle brackets: the whole entry is the address
        $('#output').append(splits[i] + '<br>');
    } else {
        // take the text between < and >
        var address = splits[i].match(/<(.*?)>/)[1];
        $('#output').append(address + '<br>');
    }
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id='output'></div>

Splitting is not a bad idea. You can even go further and avoid regular expressions altogether as mentioned in another answer, but since you specifically asked about them:
You could then check each entry in your array with a regular expression like
/^([^<]+)$/
which will match any entry that consists only of characters that are not < (the ; separators are already gone after the split), and
/^.*<(.*)>$/
which will match anything of the second form your entries may have.
You can combine these into a single regular expression using |, but I suggest you simply test twice to avoid having to deal with too many capturing groups. You can even avoid the splitting part by using the global modifier, but again, it would make matters a lot more complicated, especially if you're new to regular expressions.
Please note that these examples will match a lot more than email addresses, but checking if they are actually valid is not easy. If you want to look into it, there are plenty of questions on SO about it.
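The split-then-test approach discussed in the answers above could be sketched as follows. This is only an illustration: the function name parseRecipients is made up here, and the pattern /<([^<>]*)>/ is one possible way to capture the text between < and >, not the only one.

```javascript
// Hypothetical helper: returns the address part of every entry in a
// ;-separated list, whether or not the entry has a "name<address>" form.
function parseRecipients(input) {
  return input
    .split(';')
    .map(function (entry) { return entry.trim(); })
    // a trailing ; produces an empty last entry, so drop empties
    .filter(function (entry) { return entry.length > 0; })
    .map(function (entry) {
      // <        a literal opening bracket
      // ([^<>]*) group 1: any run of characters that are not brackets
      // >        a literal closing bracket
      var match = /<([^<>]*)>/.exec(entry);
      return match ? match[1].trim() : entry;
    });
}

console.log(parseRecipients('someEmail#domain.com;another name<anotherEmail#domain.com>;'));
```

Entries without brackets are returned whole, which matches the two rules stated in the question.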

Use only one of the characters in regular expression javascript

I guess this should be something very easy, but I've been stuck on it for at least 2 hours, so I think it's better to ask the question here.
So, I've got a regular expression /&t=(\d*)$/g and it works fine as long as the URL contains &t rather than ?t. I've tried different combinations like /\?|&t=(\d*)$/g ; /\?t=(\d*)$|/&t=(\d*)$/g ; /(&|\?)t=(\d*)$/g and various others, but I haven't got the expected result, which is the URL part matched by /\?t=(\d*)$/g or /&t=(\d*)$/g (whichever is entered in the input).
Thanks for the responses. I think I need to add some details here. I'm actually working on this piece of code
var formValue = $.trim($("#v").val());
var formValueTime = /&t=(\d*)$/g.exec(formValue);
if (formValueTime && formValueTime.length > 1) {
formValueTime = parseInt(formValueTime[1], 10);
formValue = formValue.replace(/&t=\d*$/g, "");
}
and I want to get the t value whether the link passes it with &t or ?t, as in youtu.be/hTWKbfoikeg?t=82 or the similar youtu.be/hTWKbfoikeg&t=82
To replace, you may use
var formValue = "some?some=more&t=1234"; // $.trim($("#v").val());
var formValueTime;
formValue = formValue.replace(/[&?]t=(\d*)$/g, function($0,$1) {
formValueTime = parseInt($1,10);
return '';
});
console.log(formValueTime, formValue);
To grab the value, you may use
/[?&]t=(\d*)$/g.exec(formValue);
Pattern details
[?&] - a character class matching ? or &
t= - t= substring
(\d*) - Group 1 matching zero or more digits
$ - end of string
/\?t=(\d*)|\&t=(\d*)$/g
you inverted the escape character for the second RegEx.
http://regexr.com/3gcnu
I want to thank you all for trying to help. Special thanks to @Wiktor Stribiżew, who gave the closest answer.
Now the piece of code I needed looks exactly like this:
/[?&]t=(\d*)$/g.exec(formValue);
So that's the [?&] part that solved the problem.
I use the array later, so /\?t=(\d*)|\&t=(\d*)$/g doesn't help: I get an array like [&t=50,,50] when the link uses & and the expected array [?t=50,50] only when the link uses ?, just because of the order of the alternatives in the RegExp.
Now, if you're looking for a piece of RegExp that picks either character in one place while the rest of the RegExp remains the same, you may use something like [?&] for the example where the wanted characters are ? and &.
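As a rough sketch of how the [?&] character class can be used for both extracting and stripping the time parameter, assuming the parameter is always at the end of the string: the function name extractStartTime and the shape of the returned object are invented for this example, and \d+ is used instead of \d* so that an empty value is not treated as a match.

```javascript
// Hypothetical helper: splits "link?t=82" or "link&t=82" into the link
// without the time suffix and the time as a number.
function extractStartTime(url) {
  // [?&] matches exactly one character, either ? or &, so there is a
  // single capturing group no matter which separator was used
  var match = /[?&]t=(\d+)$/.exec(url);
  if (!match) {
    return { url: url, seconds: null };
  }
  return {
    url: url.slice(0, match.index), // everything before ?t= or &t=
    seconds: parseInt(match[1], 10)
  };
}

console.log(extractStartTime('youtu.be/hTWKbfoikeg?t=82'));
console.log(extractStartTime('youtu.be/hTWKbfoikeg&t=82'));
```

Because the group index is always 1, the calling code does not need to know which separator was used.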

Split Kannada word into syllabic clusters

We are wondering if there is any method to split a Kannada word to get the syllabic clusters using JavaScript.
For example, I want to split the word ಕನ್ನಡ into the syllabic clusters ["ಕ", "ನ್ನ", "ಡ"]. But when I split it with split, the actual array obtained is ["ಕ", "ನ", "್", "ನ", "ಡ"]
I cannot say that this is a complete solution. But works to an extent with some basic understanding of how words are formed:
var k = 'ಕನ್ನಡ';
var arr = [];
for (var i = 0; i < k.length; i++) {
    var s = k.charAt(i);
    // keep appending while the next char is not a swara/vyanjana, or the current char is a virama
    // (the parentheses make sure the bounds check guards all three tests)
    while ((i + 1) < k.length &&
           (k.charCodeAt(i + 1) < 0xC85 || k.charCodeAt(i + 1) > 0xCB9 || k.charCodeAt(i) == 0xCCD)) {
        s += k.charAt(i + 1);
        i++;
    }
    arr.push(s);
}
console.log(arr);
As the comments in the code say, we keep appending chars to previous char as long as they are not swara or vyanjana or previous char was a virama. You might have to work with different words to make sure you cover different cases. This particular case doesn't cover the numbers.
For Character codes you can refer to this link:
http://www.unicode.org/charts/PDF/U0C80.pdf
Consider using the "inSC" property associated with Unicode characters--you can get this from a database--which indicates the Indic Syllabic Character. (You might also want to consult the "category", to see if it is "non-spacing mark"). For instance, "್" has the type "Virama" (see http://graphemica.com/0CCD). To take another example, "ಿ" (KANNADA VOWEL SIGN I) has an InSC of "Vowel_Dependent" (and is also in the "non-spacing mark" category). You could potentially then detect which individual graphemes need to be combined with others, and put back together complete characters, as follows:
const graphemes = [..."ಕನ್ನಡ"];
console.log("graphemes are", graphemes);
const rebuild = [graphemes[0], graphemes.slice(1, 4).join(''), graphemes[4]];
console.log(rebuild);
Even if you can make this work, you'll have more work to do. It's unclear to me how you would detect that the three characters "ನ", "್", and "ನ" are to be combined, rather than treated as the two characters "ನ್" and "ನ". The problem is that in this case the virama is used to indicate a consonant cluster, so you would need to identify the X-V-X pattern (where V is virama) and treat that as one combined character. There are probably many, many other such special cases.
This might be of interest: https://www.microsoft.com/typography/OpenTypeDev/kannada/intro.htm. It talks about finding "syllable clusters", in this particular case as a prelude for rendering the characters graphically. You may also want to take a look at http://www.unicode.org/L2/L2003/03068-kannada.pdf.
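One way to express the X-V-X idea (consonant, virama, consonant treated as one unit) is a single regular expression. The sketch below is an assumption-heavy illustration, not a complete Kannada syllable model: the code point ranges are read off the Unicode Kannada chart, the names KANNADA_CLUSTER and splitKannadaClusters are invented here, and it has only been reasoned through for simple words such as ಕನ್ನಡ.

```javascript
// Assumed ranges from the Unicode Kannada block:
//   U+0C85..U+0C94  independent vowels (swara)
//   U+0C95..U+0CB9  consonants (vyanjana)
//   U+0CCD          virama
//   U+0CBC, U+0CBE..U+0CCC, U+0CD5, U+0CD6, U+0C81..U+0C83  dependent signs
//
// A cluster is: zero or more "consonant + optional nukta + virama" prefixes,
// then a base letter, then any dependent signs, then an optional final virama.
// Anything that is not Kannada falls through to the last alternative and is
// returned as a single character.
var KANNADA_CLUSTER = /(?:[\u0C95-\u0CB9]\u0CBC?\u0CCD)*[\u0C85-\u0CB9][\u0CBC\u0CBE-\u0CCC\u0CD5\u0CD6\u0C81-\u0C83]*\u0CCD?|[\s\S]/gu;

function splitKannadaClusters(word) {
  return word.match(KANNADA_CLUSTER) || [];
}

console.log(splitKannadaClusters('\u0C95\u0CA8\u0CCD\u0CA8\u0CA1')); // the word ಕನ್ನಡ
```

Real text will contain cases this does not handle (zero-width joiners, for example), so it should be checked against a wider set of words before being relied on.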

Regex character sets - and what they contain

I'm working on a pretty crude sanitizer for string input in Node(express):
I have glanced at some plugins and libraries, but it seems most of them are either too complex or too heavy. Therefore I decided to write a couple of simple sanitizer functions on my own.
One of them is this one, for hard-sanitizing most strings (not numbers...)
function toSafeString( str ){
str = str.replace(/[^a-öA-Ö0-9\s]+/g, '');
return str;
}
I'm from Sweden, therefore I need the åäö letters. And I have noticed that this regex also accepts other characters as well... for example á or é....
Question 1)
Is there some kind of list or similar where I can see WHICH characters are actually accepted by, say, this regex: /[^a-ö]+/g
Question 2)
I'm working in Node and Express... I'm thinking this simple function is going to stop attacks through input fields. Am I wrong?
Question 1: Find out. :)
var accepted = [];
for (var i = 0; i <= 0xFFFF /* the Unicode BMP */; i++) {
    var s = String.fromCharCode(i);
    if (/[a-ö]/.test(s)) accepted.push(s);
}
// print the collected characters (not the loop variable s)
console.log(accepted.join(""));
outputs
abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³
´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö
on my system.
Question 2: What attacks are you looking to stop? Either way, the answer is "No, probably not".
Instead of mangling user data (I'm sure your, say, French or Japanese customers will have some beef with your validation), make sure to sanitize your data whenever it's going into customer view or out thereof (HTML escaping, SQL parameter escaping, etc.).
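To make the "escape on output" advice concrete, here is a minimal sketch of HTML escaping at the point where a value is written into a page. The function name escapeHtml is just a name chosen for this example, and in a real Express application a template engine's built-in escaping would normally do this job; SQL parameters need their own mechanism (parameterized queries) and are not covered here.

```javascript
// Hypothetical helper: replaces the five characters that have special
// meaning in HTML text and attribute values with their entities.
function escapeHtml(text) {
  var replacements = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;'
  };
  return String(text).replace(/[&<>"']/g, function (ch) {
    return replacements[ch];
  });
}

// The user's input is stored untouched and only escaped when rendered.
console.log(escapeHtml('<b>Tom & "Jerry"</b>'));
```

Letters such as á, é or Japanese characters pass through unchanged, so no user data is lost.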
[x-y] matches characters whose unicode numbers are between that of x and that of y:
charsBetween = function(a, b) {
var a = a.charCodeAt(0), b = b.charCodeAt(0), r = "";
while(a <= b)
r += String.fromCharCode(a++);
return r
}
charsBetween("a", "ö")
> "abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö"
See character tables for the reference.
For your validation, you probably want something like this instead:
[^a-zA-Z0-9ÅÄÖåäö\s]
This matches ranges of latin letters and digits + individual characters from a list.
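Plugged back into the function from the question, the explicit character list might look like the sketch below. The sample input is made up, and whether stripping characters is the right approach at all is a separate question.

```javascript
// Same function as in the question, but with a-z / A-Z ranges plus the
// Swedish letters listed one by one instead of the a-ö range.
function toSafeString(str) {
  return str.replace(/[^a-zA-Z0-9ÅÄÖåäö\s]+/g, '');
}

// é and < are now removed, while Å is kept
console.log(toSafeString('Åsa café <3'));
```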
There is a lot of characters that we actually have no idea about, like Japanese or Russian and many more.
So to take them in account we need to use Unicode ranges rather than ASCII ranges in regular expressions.
I came with this regular expression that covers almost all written letters of the whole Unicode table, plus a bit more, like numbers, and few other characters for punctuation (Chinese punctuation is already included in Unicode ranges).
It is hard to cover everything and probably this ranges might include too many characters including "exotic" ones (symbols):
/^[\u0040-\u1FE0\u2C00-\uFFC00-9 ',.?!]+$/i
So I was using it this way to test (have to be not empty):
function validString(str) {
return str && typeof(str) == 'string' && /^[\u0040-\u1FE0\u2C00-\uFFC00-9 ',.?!]+$/i.test(str);
}
Bear in mind that this is missing characters like:
:*()&#'\-:%
And many more others.
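An alternative to hand-picked code point ranges, swapped in here as a different technique from the one in the answer above, is Unicode property escapes, which need the u flag and a reasonably recent JavaScript engine (ES2018 or later). The sketch below keeps the same punctuation list as the answer; the variable name lettersDigitsPunctuation is invented for the example.

```javascript
// \p{L} = any letter, \p{M} = combining marks, \p{N} = any number.
// The trailing characters are the same punctuation as in the answer above.
var lettersDigitsPunctuation = /^[\p{L}\p{M}\p{N} ',.?!]+$/u;

function validString(str) {
  // + requires at least one character, so the empty string is rejected
  return typeof str === 'string' && lettersDigitsPunctuation.test(str);
}

console.log(validString('Привет, мир!'));
console.log(validString('<script>'));
```

This avoids pulling in symbol blocks by accident, but it still rejects characters such as : or ( unless they are added to the class.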

Javascript string validation using the regex object

I am a complete novice at regex and JavaScript. I have the following problem: I need to check a text field for the existence of one (1) or many (n) consecutive * (asterisk) characters, e.g. * or ** or *** or any number (n) of *. Allowed strings are e.g. *tomato or tomato* or **tomato or tomato** or many(n)*tomato many(n)*. So far I have tried the following:
var str = 'a string'
var value = encodeURIComponent(str);
var reg = /([^\s]\*)|(\*[^\s])/;
if (reg.test(value) == true ) {
alert ('Watch out your asterisks!!!')
}
By your question it's hard to decipher what you're after... But let me try:
Only allow asterisks at beginning or at end
If you only allow an arbitrary number (at least one) of asterisks either at the beginning or at the end (but not on both sides) like:
*****tomato
tomato******
but not **tomato*****
Then use this regular expression:
reg = /^(?:\*+[^*]+|[^*]+\*+)$/;
Match front and back number of asterisks
If you require that the number of asterisks at the beginning matches the number of asterisks at the end, like
*****tomato*****
*tomato*
but not **tomato*****
then use this regular expression:
reg = /^(\*+)[^*]+\1$/;
Results?
It's unclear from your question what the result should be when each of these regular expressions matches. Whether strings that test positive against the above regular expressions are valid or invalid depends on you and your requirements. As long as you have the correct regular expressions, you're good to go and can provide the functionality you require.
I've also written my regular expressions to just exclude asterisks within the string. If you also need to reject spaces or anything else simply adjust the [^...] parts of above expressions.
Note: both regular expressions are untested but should get you started to build the one you actually need and require in your code.
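Since the answer marks both expressions as untested, a quick check against its own sample strings could look like this. The variable names eitherSide and sameCount are made up for the sketch; the two patterns are copied unchanged from above.

```javascript
var eitherSide = /^(?:\*+[^*]+|[^*]+\*+)$/; // asterisks in front OR at the back
var sameCount = /^(\*+)[^*]+\1$/;           // same number in front AND at the back

var samples = ['*****tomato', 'tomato******', '**tomato*****', '*tomato*'];

samples.forEach(function (sample) {
  // prints the sample followed by the result of each pattern
  console.log(sample, eitherSide.test(sample), sameCount.test(sample));
});
```

The \1 in the second pattern is a backreference: it has to match exactly the text that group 1 (the leading asterisks) captured.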
If I understand correctly you're looking for a pattern like this:
var pattern = /\**[^\s*]+\**/;
this won't match strings like ***** or ** ***, but will match ***d*** *d or all of your examples that you say are valid (***tomatos etc.). If I misunderstood, let me know and I'll see what I can do to help. PS: we all started out as newbies at some point, nothing to be ashamed of, let alone apologize for :)
After the edit to your question I gather the use of an asterisk is required, either at the beginning or end of the input, but the string must also contain at least 1 other character, so I propose the following solution:
var pattern = /^\*+[^\s*]+|[^\s*]+\*+$/;
pattern.test('****'); // false
pattern.test(' ***tomato**'); // true
If, however *tomato* is not allowed, you'll have to change the regex to:
var pattern = /^\*+[^\s*]+$|^[^\s*]+\*+$/;
Here's a handy site to help you find your way in the magical world of regular expressions.
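The difference between the last two patterns comes down to where ^ and $ sit relative to the | operator: in the first pattern each anchor applies to only one alternative, in the second both alternatives are anchored at both ends. A small comparison, with the variable names loose and strict invented for the sketch:

```javascript
var loose = /^\*+[^\s*]+|[^\s*]+\*+$/;     // ^ binds to the left branch, $ to the right
var strict = /^\*+[^\s*]+$|^[^\s*]+\*+$/;  // both branches anchored at both ends

['*tomato', 'tomato*', '*tomato*', '****'].forEach(function (sample) {
  console.log(sample, loose.test(sample), strict.test(sample));
});
```

For *tomato* the loose pattern succeeds because its left branch only has to match the beginning of the string, while the strict pattern rejects it.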

regex to get contents between <b> tag

I have used following regex to get only contents between <b> and </b> tags.
var bonly = defaultVal.match("<b>(.*?)</b>");
but it did not work. I'm not getting the proper result. Sample strings I'm using the regex on:
<b>Item1</b>: This is item 1 description.
<b>Item1</b>: This is item 1 description.<b>Item2</b>: This is item 2 description.
<b>Item1</b>: <b>Item2</b>: This is item 2 description. <b>Item3</b>: This is item 3 description.<b>Item4</b>:
<b>Item1</b>: This is item 1 description.<b>Item2</b>: This is item 2 description. <b>Item3</b>: This is item 3 description.<b>Item4</b>:
Here item name is compulsory but it may have description or may not have description.
Why don't you skip regex and try...
var div = document.createElement('div');
div.innerHTML = str;
var b = div.getElementsByTagName('b');
for (var i = 0, length = b.length; i < length; i++) {
console.log(b[i].textContent || b[i].innerText);
}
There are a zillion questions/answers here on SO about using regex to match HTML tags. You can probably learn a lot with some appropriate searching.
You may want to start by turning your regular expression into a regular expression:
var defaultVal = "<b>Item1</b>: This is item 1 description.";
var bonly = defaultVal.match(/<b>(.*?)<\/b>/);
if (bonly && (bonly.length > 1)) {
alert(bonly[1]); // alerts "Item1"
}
You may also need to note that regular expressions are not well suited for HTML matching because there can be arbitrary strings as attributes on HTML tags that can contain characters that can really mess up the regex match. Further, line breaks can be an issue in some regex engines. Further capitalization can mess you up. Further, an extra space here or there can mess you up. Some of this can be accounted for with a more complicated regex, but it still may not be the best tool.
Depending upon the context of what you're trying to do, it may be easier to create actual HTML objects with this HTML (letting the browser do all the complex parsing) and then use DOM access methods to fetch the info you want.
It works here: http://jsfiddle.net/jfriend00/Man2J/.
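The sample strings in the question contain several <b> elements, and match without the g flag only returns the first one. A sketch of collecting all of them with a global pattern and an exec loop follows; the function name boldContents is invented here, and the caveats above about parsing HTML with regular expressions still apply.

```javascript
// Hypothetical helper: returns the text of every <b>...</b> pair, in order.
function boldContents(html) {
  var pattern = /<b>(.*?)<\/b>/g; // g makes exec continue after the previous match
  var results = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    results.push(match[1]); // group 1 is the text between the tags
  }
  return results;
}

var sample = '<b>Item1</b>: This is item 1 description.<b>Item2</b>: This is item 2 description.';
console.log(boldContents(sample));
```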
try this regexp
var bonly = defaultVal.match(/<([A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/)
