Regex match all href in string, except if containing a word


I'm trying to match every href in a string but exclude (I believe with a negative lookahead) any href that contains a specific word, such as login. For example:
const str = `This is some a string google and this is another that should not be found login`
const match = str.match(/href="(.*?)"/g)
console.log(match)
This matches all of the href, but doesn't factor in the exclusion of login being found in one. I've tried a few different variations, but really haven't gotten anywhere. Any help would be greatly appreciated!

You can use this regex, which applies a negative lookbehind just before the closing quote:
href="(.*?)(?<!login)"
Demo:
https://regex101.com/r/15DwZE/1
Edit 1:
As the fourth bird pointed out, the regex above may not work in general. Rather than devising a complicated regex that covers every position where login can appear in the URL, here is a JavaScript solution.
var myString = 'This is some a string google and this is another that should not be found login';
var myRegexp = /href="(.*?)"/g;
var match = myRegexp.exec(myString);
while (match != null) {
  // Keep only hrefs whose value does not contain "login"
  if (match[1].indexOf('login') == -1) {
    console.log(match[1]);
  }
  match = myRegexp.exec(myString);
}
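If a regex-only answer is still wanted, one possible sketch is a tempered negative lookahead that checks every character of the href value, so login is rejected wherever it appears and not only right before the closing quote. The sample markup and the names sampleHtml, hrefWithoutLogin and allowedHrefs are assumptions made for illustration (the anchors in the question were lost), and the pattern only handles double-quoted, case-sensitive values.

```javascript
// Hypothetical sample input; the anchor markup in the original question was lost.
const sampleHtml = [
  '<a href="https://google.com">google</a>',
  '<a href="https://example.com/login">login</a>',
  '<a href="https://example.com/login?next=home">login with query</a>',
].join(" ");

// (?:(?!login)[^"])* consumes one non-quote character at a time, and only
// when the word "login" does not start at that position.
const hrefWithoutLogin = /href="((?:(?!login)[^"])*)"/g;

const allowedHrefs = Array.from(sampleHtml.matchAll(hrefWithoutLogin), m => m[1]);
console.log(allowedHrefs); // [ 'https://google.com' ]
```

Filtering in plain JavaScript, as above, is still easier to read and to extend to several excluded words.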

You could do this without a regex by parsing the string with a DOMParser and then checking each href with, for example, includes.
let parser = new DOMParser();
let html = `This is some a string google and this is another that should not be found login`;
let doc = parser.parseFromString(html, "text/html");
let anchors = doc.querySelectorAll("a");
anchors.forEach(a => {
  if (!a.href.includes("login")) {
    console.log(a.href);
  }
});

You can create a temporary HTML node, collect all <a> elements from it, and then filter them by href. Sample code:
const str = `This is some a string google and this is another that should not be found login`;
const d = document.createElement('div');
d.innerHTML = str;
Array.from(d.getElementsByTagName("a")).filter(a => !/login/.test(a.href))

You can use this regex to do that
/<[\w:]+(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])(?:(?!\1|login)[\S\s])*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>/
https://regex101.com/r/LEQL7h/1
More info
< [\w:]+ # Any tag
(?= \s )
(?= # Assertion (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s href \s* = \s* # href attribute
(?:
( ['"] ) # (1), Quote
(?:
(?! \1 | login ) # href cannot contain login
[\S\s]
)*
\1
)
)
# Have href that does not contain login, match the rest of tag
\s+
(?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
>

Related

regex for ignoring character if inside () parenthesis?

I was working on a regex and ran into this bug.
Given the string "+1/(1/10)+(1/30)+1/50", I used the regex /\+.[^\+]*/g
and it works fine, since it gives me ['+1/(1/10)', '+(1/30)', '+1/50'].
The real problem appears when a + is inside the parentheses (),
like this: "+1/(1+10)+(1/30)+1/50"
because it gives ['+1/(1', '+10)', '+(1/30)', '+1/50'],
which isn't what I want. What I want is ['+1/(1+10)', '+(1/30)', '+1/50'],
so when the regex sees \(.*\) it should skip it as if it weren't there.
How can I make the regex ignore it?
My code (JS):
const tests = {
  correct: "1/(1/10)+(1/30)+1/50",
  wrong  : "1/(1+10)+(1/30)+1/50"
}
function getAdditionArray(string) {
  const REGEX = /\+.[^\+]*/g; // change this to ignore the () even if they have the + sign
  const firstChar = string[0];
  if (firstChar !== "-") string = "+" + string;
  return string.match(REGEX);
}
console.log(
  getAdditionArray(tests.correct),
  getAdditionArray(tests.wrong),
)

You can exclude parentheses from the character class, and then optionally match a (...) group:
\+[^+()]*(?:\([^()]*\))?
The pattern matches:
\+ Match a +
[^+()]* Match optional chars other than + ( )
(?: Non capture group to match as a whole part
\([^()]*\) Match from (...)
)? Close the non capture group and make it optional
See a regex101 demo.
Another option could be to be more specific about the digits and the + and / and use a character class to list the allowed characters.
\+(?:\d+[+/])?(?:\(\d+[/+]\d+\)|\d+)
See another regex101 demo.
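As a rough check, the first pattern can be dropped into the getAdditionArray function from the question. The names TERM, plain and withInnerPlus are illustrative only. This sketch assumes no nested parentheses and nothing following the closing parenthesis inside a term; for example, in "(1+2)/3" the "/3" would not be part of any match.

```javascript
function getAdditionArray(expression) {
  // A "+", then characters that are not "+", "(" or ")",
  // then optionally one non-nested (...) group.
  const TERM = /\+[^+()]*(?:\([^()]*\))?/g;
  if (expression[0] !== "-") expression = "+" + expression;
  return expression.match(TERM);
}

const plain = getAdditionArray("1/(1/10)+(1/30)+1/50");
const withInnerPlus = getAdditionArray("1/(1+10)+(1/30)+1/50");

console.log(plain);         // [ '+1/(1/10)', '+(1/30)', '+1/50' ]
console.log(withInnerPlus); // [ '+1/(1+10)', '+(1/30)', '+1/50' ]
```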

Multiple OR conditions for words in JavaScript regular expression

I'm trying to write a regular expression that finds the text between two words, where the second word is not fixed.
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303
This is my text. I'm trying to find the word between Soyadı and Sınıfı, in this case ERTANĞA, but the word Sınıfı can also be no, numara or any number. This is what I did:
soyad[ıi](.*)S[ıi]n[ıi]f[ıi]|no|numara|[0-9]
[ıi] is there to handle a Turkish character issue; don't mind that.

You can use something like the following:
/.*Soyad(ı|i)|S(ı|i)n(ı|i)f(ı|i).*|no.*|numara.*|[0-9]/gmi
Here is the link I worked on: https://regex101.com/r/QXLjLF/1
In JS code:
const regex = /.*Soyad(ı|i)|S(ı|i)n(ı|i)f(ı|i).*|no.*|numara.*|[0-9]/gmi;
var str = `2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303`;
var newStr = str.replace(regex, '');
console.log(newStr);

You can use a single capture group to get the word ERTANĞA. Keep the character class [ıi] instead of the alternation (ı|i), and group the alternatives at the end of the pattern in a non-capturing group (?:...)
soyad[ıi](.+?)(?:S[ıi]n[ıi]f[ıi]|n(?:o|umara)|[0-9])
soyad[ıi] Match soyadı or soyadi
(.+?) Capture group 1, match 1 or more chars, as few as possible
(?: Non capture group
S[ıi]n[ıi]f[ıi] Match S and then ı or i etc..
| Or
n(?:o|umara) Match either no or numara
| Or
[0-9] Match a digit 0-9
) Close non capture group
Note that you don't need the /m flag as there are no anchors in the pattern.
Regex demo
const regex = /soyad[ıi](.+?)(?:S[ıi]n[ıi]f[ıi]|n(?:o|umara)|[0-9])/gi;
const str = "2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303\n";
console.log(Array.from(str.matchAll(regex), m => m[1]));
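To see why the grouping matters, here is a small comparison of the pattern from the question with the grouped one. Alternation has the lowest precedence, so the ungrouped version is read as four independent alternatives and simply matches the first digit. The variable names are illustrative only.

```javascript
const text = "2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303";

// Parsed as: (soyad...Sınıfı) OR no OR numara OR a digit
const ungrouped = /soyad[ıi](.*)S[ıi]n[ıi]f[ıi]|no|numara|[0-9]/i;
// The alternatives only apply to the terminator after the captured name
const grouped = /soyad[ıi](.+?)(?:S[ıi]n[ıi]f[ıi]|n(?:o|umara)|[0-9])/i;

const ungroupedMatch = text.match(ungrouped);
const groupedMatch = text.match(grouped);

console.log(ungroupedMatch[0], ungroupedMatch[1]); // 2 undefined
console.log(groupedMatch[1]);                      // ERTANĞA
```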

This might do it
const str = `2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnumaraE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnoE10/ENo303`
const re = /(?:Soyad(ı|i))(.*?)(?:S(ı|i)n(ı|i)f(ı|i)|no|numara)/gmi
console.log([...str.matchAll(re)].map(x => x[2]))
ES5
const str = `2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnumaraE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnoE10/ENo303`
const re = /(?:Soyad(ı|i))(.*?)(?:S(ı|i)n(ı|i)f(ı|i)|no|numara)/gmi
const res = []
let match;
while ((match = re.exec(str)) !== null) res.push(match[2])
console.log(res)

Reg Exp for finding hashtag words

I have the following sentence as a test:
This is a test with #shouldshow and to see if there #show
#yes this#shouldnotshow what is going on here
I have figured out most of the Reg Exp I need. Here's what I have so far: /((?<=#)([A-Z]*))/gi
This matches every tag but also matches the shouldnotshow portion. I want to not match words that are prefixed by anything but # (excluding whitespace & \n).
So the only matched words I should get are: shouldshow show yes.
Note: after #show is a newline

You just need to check whether the hash is preceded by whitespace or starts the string:
https://regex101.com/r/JDuGvr/1
/(\s|^)#(\w+)/gm
With a positive lookbehind, as the OP used:
https://regex101.com/r/06X3ZX/1
/(?<=(\s|^)#)(\w+)/gm
Use [a-zA-Z0-9] instead of \w if you do not want to match an underscore.
const re1 = /(\s|^)#(\w+)/gm;
const re2 = /(?<=(\s|^)#)(\w+)/gm;
const str = `This is a test with #shouldshow and to see if there #show
#yes this#shouldnotshow what is going on here`;
const res1 = [...str.matchAll(re1)].map(match => match[2]); // here the match is the third item
console.log(res1)
const res2 = [...str.matchAll(re2)].map(match => match[0]); // match is the first item
console.log(res2)

Another option is to keep your pattern but assert a # on the left that is not preceded by a non-whitespace character, using (?<!\S)#, and get the match itself without capture groups.
Match a character A-Z one or more times to prevent matching an empty string.
(?<=(?<!\S)#)[A-Z]+
Regex demo
const regex = /(?<=(?<!\S)#)[A-Z]+/gi;
const str = `This is a test with #shouldshow and to see if there #show
#yes this#shouldnotshow what is going on here`;
console.log(str.match(regex));
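For comparison, a short sketch that runs the pattern from the question and the stricter one side by side on the same text. The names sample, original and stricter are illustrative only, and lookbehind support in the JavaScript engine is assumed.

```javascript
const sample = `This is a test with #shouldshow and to see if there #show
#yes this#shouldnotshow what is going on here`;

// Pattern from the question: any letters directly after a "#"
const original = /((?<=#)([A-Z]*))/gi;
// The "#" must additionally not be preceded by a non-whitespace character
const stricter = /(?<=(?<!\S)#)[A-Z]+/gi;

const originalTags = sample.match(original);
const stricterTags = sample.match(stricter);

console.log(originalTags); // [ 'shouldshow', 'show', 'yes', 'shouldnotshow' ]
console.log(stricterTags); // [ 'shouldshow', 'show', 'yes' ]
```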

Regex - ignoring text between quotes / HTML(5) attribute filtering

So I have this regular expression, which is meant to parse a given string into a list of HTML(5)-style attributes. It currently doesn't do what I need, but that's about to change! (I hope so)
What I'm trying to achieve is that whenever an attribute name is found, the text up to the next occurrence OR the end of the string is selected as the second match. Take a look at the current regular expression:
/([a-zA-Z]+|[a-zA-Z]+-[a-zA-Z0-9]+)=["']/g
A string like this: hey="hey world" hey-heyhhhhh3123="Hello world" data-goed="hey"
Would be filtered / matched out like this:
MATCH 1. [0-3] `hey`
MATCH 2. [16-32] `hey-heyhhhhh3123`
MATCH 3. [47-56] `data-goed`
This has to be seen as the attribute-name(s), and now.. we just have to fetch the attribute's value(s). So the mentioned string has to have an outcome like this:
MATCH 1.
1 [0-3] `hey`
2 [6-14] `hey world`
MATCH 2.
1 [16-32] `hey-heyhhhhh3123`
2 [35-45] `Hello world`
MATCH 3.
1 [47-56] `data-goed`
2 [59-61] `hey`
Could anyone help me get this working? It would be appreciated a lot!

You can use
/([^\s=]+)=(?:"([^"\\]*(?:\\.[^"\\]*)*)"|(\S+))/g
See regex demo
Pattern details:
([^\s=]+) - Group 1 capturing 1 or more characters other than whitespace and = symbol
= - an equal sign
(?:"([^"\\]*(?:\\.[^"\\]*)*)"|(\S+)) - a non-capturing group of 2 alternatives (one more '([^'\\]*(?:\\.[^'\\]*)*)' alternative can be added to account for single quoted string literals)
"([^"\\]*(?:\\.[^"\\]*)*)" - a double quoted string literal pattern:
" - a double quote
([^"\\]*(?:\\.[^"\\]*)*) - Group 2 capturing 0+ characters other than \ and ", followed by 0+ sequences of any escaped symbol followed by 0+ characters other than \ and "
" - a closing double quote
| - or
(\S+) - Group 3 capturing one or more non-whitespace characters
JS demo (no single quoted support):
var re = /([^\s=]+)=(?:"([^"\\]*(?:\\.[^"\\]*)*)"|(\S+))/g;
var str = 'hey="hey world" hey-heyhhhhh3123="Hello \\"world\\"" data-goed="hey" more=here';
var res = [];
var m;
while ((m = re.exec(str)) !== null) {
  if (m[3]) {
    // Unquoted value
    res.push([m[1], m[3]]);
  } else {
    // Double quoted value
    res.push([m[1], m[2]]);
  }
}
console.log(res);
JS demo (with single quoted literal support)
var re = /([^\s=]+)=(?:"([^"\\]*(?:\\.[^"\\]*)*)"|'([^'\\]*(?:\\.[^'\\]*)*)'|(\S+))/g;
var str = 'pseudoprefix-before=\'hey1"\' data-hey="hey\'hey" more=data and="more \\"here\\""';
var res = [];
var m;
while ((m = re.exec(str)) !== null) {
  if (m[2]) {
    // Double quoted value
    res.push([m[1], m[2]])
  } else if (m[3]) {
    // Single quoted value
    res.push([m[1], m[3]])
  } else if (m[4]) {
    // Unquoted value
    res.push([m[1], m[4]])
  }
}
console.log(res);
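Note that the truthiness checks in the second demo skip an attribute whose quoted value is empty, such as empty="". A possible variation, sketched here with a hypothetical parseAttributes helper and a made-up input string, picks whichever group participated in the match with the nullish coalescing operator and collects the result into an object. It assumes an engine that supports ?? and matchAll.

```javascript
// Hypothetical helper: returns an object mapping attribute names to values.
function parseAttributes(source) {
  const ATTRIBUTE = /([^\s=]+)=(?:"([^"\\]*(?:\\.[^"\\]*)*)"|'([^'\\]*(?:\\.[^'\\]*)*)'|(\S+))/g;
  const attributes = {};
  for (const m of source.matchAll(ATTRIBUTE)) {
    // Groups that did not participate are undefined, so ?? falls through
    // to the next one while still keeping an empty string value.
    attributes[m[1]] = m[2] ?? m[3] ?? m[4];
  }
  return attributes;
}

const parsed = parseAttributes('hey="hey world" data-goed=\'hey\' empty="" more=here');
console.log(parsed);
```

Escape sequences such as \" are left in the value as written; unescaping them would be a separate step.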

Inverting a rather complex set of regexes

I'm sort of new to regular expressions, and none of the solutions I found online helped/worked.
I'm dealing with a one-line String in JavaScript, it'll contain five types of data mixed in.
A "#" followed by six numbers/letters (HTML color) (/#....../g)
A forward slash followed by any of a few specific characters (/\/(\+|\^|\-|#|!\+|_|#|\*|%|&|~)/g)
A "$" followed by a sequence of letters and a "|" (/\$([^\|]+)/g)
A "|" alone (/\|/g)
Alphanumeric characters that do not fall under any of these categories
The thing is, I have regexes to match the first four categories, that are important.
The problem is that I need a single Regex that I'll use to replace all the characters that DO NOT match for the first four regexes with a single character, such as "§".
Example:
This#00CC00 is green$Courier| and /^mono|spaced
§§§§#00CC00§§§§§§§§§$Courier|§§§§§/^§§§§|§§§§§§
I know I may be attacking this problem the wrong way, I'm rather new to regular expressions.
Essentially, how do I make a regex that means "anything that doesn't have any matches for regexes x, y, or z"?
Thank you for your time.

Use this pattern:
((#\w{6}|\/[\/\(\+\^\-]|\$\w+\||\|)*).
and replace with $1§
The downside is that each preserved token has to be followed by at least one character.
Demo
( # Capturing Group (1)
( # Capturing Group (2)
# # "#"
\w # <ASCII letter, digit or underscore>
{6} # (repeated {6} times)
| # OR
\/ # "/"
[\/\(\+\^\-] # Character Class [\/\(\+\^\-]
| # OR
\$ # "$"
\w # <ASCII letter, digit or underscore>
+ # (one or more)(greedy)
\| # "|"
| # OR
\| # "|"
) # End of Capturing Group (2)
* # (zero or more)(greedy)
) # End of Capturing Group (1)
. # Any character except line break
Code copied from Regex101
var re = /((#\w{6}|\/[\/\(\+\^\-]|\$\w+\||\|)*)./gm;
var str = 'This#00CC00 is green$Courier| and /^mono|spaced|\n';
var subst = '$1§';
var result = str.replace(re, subst);

This isn't as efficient as a working regular expression but it works. Basically it gets all of the matches and fills the parts between with § characters. One nice thing is you don't have to be a regular expression genius to update it, so hopefully more people can use it.
var str = 'This#00CC00 is green$Courier| and /^mono|spaced';
var patt = /#(\d|\w){6}|\/(\+|\^|\-|#|!\+|_|#|\*|%|&|~)|\$([^\|]+)\||\|/g;
var ret = "";
var pos = []; // flat list of start/end index pairs, one pair per match
var match;
while ((match = patt.exec(str))) {
  pos.push(match.index);
  pos.push(patt.lastIndex);
  console.log(match.index + ' ' + patt.lastIndex);
}
for (var i = 0; i < pos.length; i += 2) {
  // Fill the gap before this match with §, then copy the match itself
  ret += Array(1 + pos[i] - (i == 0 ? 0 : pos[i - 1])).join("§");
  ret += str.substring(pos[i], pos[i + 1]);
}
// Fill whatever follows the last match
ret += Array(1 + str.length - pos[pos.length - 1]).join("§");
document.body.innerHTML = str + "<br>" + ret;
console.log(str);
console.log(ret);
demo here
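Another way to express "everything that does not match x, y or z" is a single alternation: capture the tokens to keep in group 1, fall back to any single character, and decide in a replacement callback. The sketch below uses the token rules from the question; the names KEEP_OR_ANY and maskUnmatched are hypothetical, and without the u flag each UTF-16 code unit counts as one character.

```javascript
// Group 1: tokens to preserve. Otherwise: any single character.
const KEEP_OR_ANY = /(#\w{6}|\/(?:!\+|[+^\-#_*%&~])|\$[^|]+\||\|)|[\s\S]/g;

// Hypothetical helper: replaces every character outside a preserved token.
function maskUnmatched(input, filler = "§") {
  return input.replace(KEEP_OR_ANY, (whole, kept) =>
    kept !== undefined ? kept : filler
  );
}

const masked = maskUnmatched("This#00CC00 is green$Courier| and /^mono|spaced");
console.log(masked);
```

For the example string from the question this gives the expected output shown there, and it does not require a preserved token to be followed by another character.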
