Regex Negative Lookahead in Web Page - javascript

I am trying to come up with a regex that will match all code in a web page unless it contains a certain phrase.
I'm testing it on this string:
<html> This is a web page </html>
It should look at the entire string before the word 'is', see that 'is' is present, and return a non-match. The negative lookahead portion of this will be much more specific in my implementation; I just wanted to give a simple example.
The regex I'm trying to use looks like this:
^[\s\S]+(?!is)[\s\S]+$
This consists of the beginning of the string:
(^)
literally anything:
([\s\S]+)
negative lookahead:
((?!is))
another literally anything:
([\s\S]+)
and end of string:
($)
I'm using a scanning tool that takes a Selenium authentication script. When the tool runs the script, it uses a regex to find a value on the web page after authentication to verify that the login script ran correctly. This regex value is different for every single site I scan. But all of the sites it's visiting use the same authentication method, which will always show the same page if authentication fails. So basically I need to come up with a regex that will fail if this bad login page is displayed. I'm currently trying to employ a negative lookahead to accomplish this. The scanner is kind of dumb, so the regex is the only way I can interact with this authentication verification process.

The first alternative matches non-whitespace characters, while the second alternative matches any whitespace not followed by 'is' plus a whitespace.
^(\S|\s(?!is\s))*$
Taken together, as above, they should achieve the desired result.
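As a rough illustration of why the original attempt still matches, here is a small sketch that runs the question's pattern, the pattern from this answer, and a commonly used alternative (a single lookahead placed at the start of the string) against the test string. The second page string is a made-up example, and the third pattern is an assumption about what would suit the scanner, not something taken from the thread.

```javascript
const page = "<html> This is a web page </html>";
const okPage = "<html> Welcome back </html>"; // hypothetical page without the phrase

// Original attempt: the lookahead only has to succeed at one position,
// so the pattern still matches a page that contains "is".
const attempt = /^[\s\S]+(?!is)[\s\S]+$/;

// Pattern from this answer: rejects whitespace followed by "is" plus whitespace.
const wordBased = /^(\S|\s(?!is\s))*$/;

// Common alternative: one lookahead at the start scans the whole input.
// Note that it also rejects "is" inside other words, such as "This".
const anywhere = /^(?![\s\S]*is)[\s\S]+$/;

console.log(attempt.test(page));   // true
console.log(wordBased.test(page)); // false
console.log(anywhere.test(page));  // false
console.log(wordBased.test(okPage), anywhere.test(okPage)); // true true
```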

I was able to solve my problem and wanted to share it here in case anyone ever stumbles across this thread. I ended up using a negative lookbehind to achieve the targeted functionality. Since the regex is being executed in multiline mode, I had to make my match string look at the very last bit of text on the page. Then the lookbehind would search everything that came before the match string. Since this was happening at the very end of the page, the engine couldn't advance any further and the result from the last line was retained.
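The exact pattern was not posted, so the following is only a guess at what such a lookbehind could look like. It assumes the page ends with a closing html tag, that the failed-login page contains the made-up phrase "Invalid username or password", and that the regex engine supports variable-length lookbehind (JavaScript engines implementing ES2018 do; the scanner's engine may not).

```javascript
// Hypothetical phrase that only appears on the failed-login page.
// The lookbehind inspects everything before the final tag and blocks the
// match if the phrase occurs anywhere in it.
const loginOk = /(?<!Invalid username or password[\s\S]*)<\/html>/;

const goodPage = "<html>\nWelcome back\n</html>";
const badPage = "<html>\nInvalid username or password\n</html>";

console.log(loginOk.test(goodPage)); // true
console.log(loginOk.test(badPage));  // false
```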

Related

Is there any way to simulate a "Did you mean" in JavaScript?

So I'm creating a bot with an API, and the list is case-sensitive and only allows exact matches.
For example, I have the word "ENCHANTED_GLISTERING_MELON". It is all caps, has underscores and a complicated spelling, and the site does not accept anything that is not an exact match. That is not very user-friendly. Is there any way to make it so that when a user inputs something, it will auto-capitalize, replace spaces with underscores, and, most importantly, check for misspellings and then pick the closest word? I have a dictionary of what the site accepts.
It is not a simple task to disallow some words with typos.
To avoid reinventing the wheel, I would recommend using one of the open-source engines like RASA to enable natural language processing in your chat.
https://rasa.com/
However, it's not so easy to use if you are having trouble parsing strings in JavaScript.
For word similarity, check the Levenshtein distance algorithm:
https://www.npmjs.com/package/autocorrect
https://www.npmjs.com/package/string-similarity
Getting the closest string match
For a simple solution you can just replace your disallowed words:
How to replace several words in javascript
Also, if it's just a filter for bad words in your chat, you can use an existing library like bad-words:
https://www.npmjs.com/package/bad-words
And you can capitalize everything for your particular strange case:
'enchanted glistering melon'.trim().replace(/ /g,'_').toLocaleUpperCase()
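Putting the pieces together, here is a minimal sketch of a "did you mean" lookup that normalizes the input as above and then picks the dictionary entry with the smallest Levenshtein distance. The function names, the sample dictionary, and the maxDistance cutoff are all assumptions made for illustration; one of the packages linked above could replace the hand-written distance function.

```javascript
// Classic dynamic-programming Levenshtein distance (single-row variant).
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = row[0]; // value diagonally up-left of the current cell
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j];
      row[j] = Math.min(
        row[j] + 1,                             // deletion
        row[j - 1] + 1,                         // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
      prev = tmp;
    }
  }
  return row[b.length];
}

// Upper-case the input and replace runs of whitespace with underscores.
function normalize(input) {
  return input.trim().replace(/\s+/g, "_").toUpperCase();
}

// Returns the closest dictionary entry, or null if nothing is within maxDistance.
function didYouMean(input, dictionary, maxDistance = 3) {
  const wanted = normalize(input);
  let best = null;
  let bestDistance = Infinity;
  for (const entry of dictionary) {
    const d = levenshtein(wanted, entry);
    if (d < bestDistance) {
      best = entry;
      bestDistance = d;
    }
  }
  return bestDistance <= maxDistance ? best : null;
}

// Hypothetical dictionary of accepted names.
const accepted = ["ENCHANTED_GLISTERING_MELON", "ENCHANTED_GOLDEN_CARROT", "GLISTERING_MELON"];
console.log(didYouMean("enchanted glistering mellon", accepted)); // ENCHANTED_GLISTERING_MELON
```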

How can I prevent extra characters at the end of my string in Regex?

Background
I'm working on a JavaScript application where users have to use a specific email domain to sign up (either @a.com or @b.com; anything else gets rejected).
I've created a regex that makes sure the user doesn't enter @a.com with nothing in front of it and limits users to only @a.com and @b.com. The last step is to make sure the user doesn't add extra characters to the end of @a.com by doing something like @a.com.gmail.com
This is the regex I currently have:
\b[a-zA-Z0-9\.]*@(a.com|b.com)
Question
What can I use at the end to prevent anything from being added after a.com or b.com? I'm very novice at regex and have no idea where to start.
To solve your problem, add $ to the end of the regex. $ means the match must end at the end of the string.
Also, you can reduce (a.com|b.com) to [ab]\.com. Note that I've also escaped the dot.
The character class [ab] means one of its characters should be matched.
Check this demo.
As stated in the comments, be sure to use the (m)ultiline flag; this way the regex engine will treat each line as a separate string.
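A short sketch of the anchored pattern in use, assuming the separator is the usual @ of an email address and that a single address is validated at a time (so no multiline flag is needed). The leading ^ and the + quantifier are additions of this sketch so that an empty local part is rejected; the function name is made up.

```javascript
// ^ and $ pin the pattern to the whole input; [ab]\.com allows only the two domains.
const ALLOWED_EMAIL = /^[a-zA-Z0-9.]+@[ab]\.com$/;

function isAllowedEmail(address) {
  return ALLOWED_EMAIL.test(address);
}

console.log(isAllowedEmail("user@a.com"));           // true
console.log(isAllowedEmail("user@a.com.gmail.com")); // false
console.log(isAllowedEmail("@a.com"));               // false
```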

Trying to make URL-Matching RegEx faster for an IRC bot

Hello fellow programmers, long time lurker here xD
So I was writing this IRC bot on Node.js, and one of the main functionalities is to automatically timeout users that post links without having permissions.
After much testing and researching I came up with this regex that would match almost any URLs, considering that users will often try to circumvent the bot to post links without permission.
/((?!\w+\.+\s\w+\b)\w+\W*(\.|dot|d0t)\W*(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|post|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b)/i
It takes into consideration users adding spaces between dots, replacing dots with "dot" or adding special characters between the dots, while ignoring matches when users type something like "word. It was good" (since "it" is a valid URL extension).
This regex takes care of almost all cases of users trying to circumvent the URL protection while matching almost no false positives. However, my concern is that it might be a bit slow.
Does anyone know of a better regex that has the same function that runs faster or maybe know how to make improvements for this one to run faster?
Regex explanation:
Full regex:
/((?!\w+\.+\s\w+\b)\w+\W*(\.|dot|d0t)\W*(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|post|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b)/i
Groups:
(?!\w+\.+\s\w+) - Negative lookahead - Checks if the user typed a word (\w+) followed by one or more dots (\.+) and a space (\s), and if so, checks whether the next characters are a word (\w+). If this group matches, the user is most likely ending a sentence with a full stop or ellipsis, followed by another sentence, so the negative lookahead stops the URL matching, even if the second sentence starts with a possible URL extension such as "is" or "so".
\w+ - A word - this is the first part of the url considering a url such as google.com (this ignores the url protocol, if present, and the first part of the url, usually www, since our goal is just to detect urls, and not actually extract them for some other purpose).
\W*(\.|dot|d0t)\W* - Any number of non-alphanumerical characters followed by a dot (or ways to circumvent dot) followed by any number of non-alphanumerical characters - This prevents users from circumventing the filter by typing urls such as google(dot)com as well as spacing between the url words and the dots such as google . com.
(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|post|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw) - Matches any possible url domain extension - Not much to say here, this prevents false positives from users that do weird punctuations such as "phrase . Next phrase"
\b - A boundary match (boundary character or end-of-string)
Edit: I've made the obvious improvement from (ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au...) to (a[cdefgilmnoqrstuwxz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnorsuvxyz]|d[dejkmoz]|e[ceghrstu]|f[ijkmor]|g[abdefghilmnpqrstuwy]|h[kmnrtu]|i[delmnoqrst]|j[emop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdeghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrstwy]|qa|r[eosuw]|s[abcdeghijklmnorstuvxyz]|t[cdfghjklmnoprtvwz]|u[agksyz]|v[aceginu]|w[fs]|y[etu]|z[amw]). Does anyone know of any more improvements, or a better way to do this?
Thanks in advance,
Gabriel.
Possibly faster:
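domainExtTable = { aero: true, asia: true, biz: true, ... }; // init just once
// \w{2,6} so that six-letter extensions such as "museum" and "travel" still fit
results = text.match(/((?!\w+\.+\s\w+\b)\w+\W*(\.|dot|d0t)\W*(\w{2,6})\b)/i);
// group 3 is the extension (group 1 is the whole match, group 2 the dot); match() returns null on no match
domainExt = results ? results[3].toLowerCase() : null;
if (domainExt && domainExt in domainExtTable) { ... } // this is a match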
It is hard to say, depends on how good the regexp compiler is.
Removing the lookahead is likely to speed this up much more. Just to be sure, you do NOT want to match "google. com"?
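A slightly fuller sketch of the lookup-table idea, under a few assumptions: the extension list is abbreviated and hypothetical, a Set replaces the plain object, and the function name is made up. One detail that is easy to miss is that a generic candidate such as "foo.bar" can hide a real link like "bar.com" in "foo.bar.com", so the loop retries from the next character whenever a candidate is rejected. Whether this is actually faster than the long alternation would need to be measured.

```javascript
// Hypothetical, abbreviated list; a real bot would load every extension it cares about.
const DOMAIN_EXTS = new Set(["com", "net", "org", "info", "io", "is", "it", "museum", "travel"]);

// Same shape as the original pattern, but the extension is a generic \w{2,6}
// that is validated against the Set instead of a long alternation.
const CANDIDATE = /(?!\w+\.+\s\w+\b)\w+\W*(?:\.|dot|d0t)\W*(\w{2,6})\b/gi;

function containsLink(text) {
  CANDIDATE.lastIndex = 0;
  let m;
  while ((m = CANDIDATE.exec(text)) !== null) {
    if (DOMAIN_EXTS.has(m[1].toLowerCase())) return true;
    // Rejected candidate: retry one character later so overlapping
    // candidates are still examined.
    CANDIDATE.lastIndex = m.index + 1;
  }
  return false;
}

console.log(containsLink("visit google.com"));         // true
console.log(containsLink("google(dot)com"));           // true
console.log(containsLink("foo.bar.com"));              // true
console.log(containsLink("It was good. It was fine")); // false
```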

Regular Expression problem

I need a regular expression for javascript that will get "jones.com/ca" from "Hello we are jones.com/ca in Tampa". The "jones.com/ca" could be any web url extension (example: .net, .co, .gov, etc), and any name. So the regular expression needs to find all instances of say ".com" and all the text to the last white space or beginning of line and to the last white space or end of line (minus any ending punctuation).
Right now I have as an example line: "jones.com/ca some text", using a javascript regular expression of: "\\(.+?^\\s).com?([^\\s]+)?\\", and all I get is ".com/ca" as the output.
This example will capture specific domains com,org and gov
\b\w+\.(?:com|org|gov)/[a-z]{2}\b
And this will capture almost any domain
\b\w+\.[a-z]{2,3}/[a-z]{2}\b
It uses word boundaries so that it does not capture white space.
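For illustration, both patterns can be tried against the sentence from the question. In a JavaScript regex literal the forward slash has to be escaped; the variable names are made up for this sketch.

```javascript
const text = "Hello we are jones.com/ca in Tampa";

// Specific domains only (com, org, gov).
const specific = /\b\w+\.(?:com|org|gov)\/[a-z]{2}\b/;

// Almost any two- or three-letter extension.
const general = /\b\w+\.[a-z]{2,3}\/[a-z]{2}\b/;

const specificMatch = text.match(specific);
const generalMatch = text.match(general);

console.log(specificMatch[0]); // jones.com/ca
console.log(generalMatch[0]);  // jones.com/ca
```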
Matching URLs is a bit of a dark art. The following site has a fairly well-designed regex for this purpose: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
A comprehensive regex for this is going to be much more complicated than you think. The list of top-level domains is fairly long (.gov, .info, .edu, .museum, etc.), and there are "special" domains like localhost as well. Also, many domains end in a two-letter country abbreviation (google.com.br for Google Brazil, for example, or del.icio.us).
The easiest thing would be to look for http(s):// or www at the beginning and just assume what comes after is a domain name. If you don't, you're going to either miss a lot, or get a lot of false positives.
You could try the following, but the last option (after the last |) is going to be open to a significant number of false positives:
/https?:\/\/\S+|www\.\S+|([-a-z0-9_]+\.)+(com|org|edu|gov|mil|info|[a-z]{2})(\/\S*)?|([-a-z0-9_]+\.)+[-a-z0-9_]+\/\S*/ig
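As a usage sketch, the pattern can be applied with String.prototype.match and the results trimmed of trailing punctuation, which the question asked to exclude. The function name and the set of punctuation characters being stripped are assumptions of this example.

```javascript
const URL_LIKE = /https?:\/\/\S+|www\.\S+|([-a-z0-9_]+\.)+(com|org|edu|gov|mil|info|[a-z]{2})(\/\S*)?|([-a-z0-9_]+\.)+[-a-z0-9_]+\/\S*/ig;

function findUrls(text) {
  // With the g flag, match() returns every match, or null when there is none.
  const matches = text.match(URL_LIKE) || [];
  // Drop sentence punctuation that \S+ or \S* swallowed at the end.
  return matches.map((m) => m.replace(/[.,;:!?)]+$/, ""));
}

console.log(findUrls("Hello we are jones.com/ca in Tampa"));    // [ 'jones.com/ca' ]
console.log(findUrls("See www.example.org, or jones.com/ca.")); // [ 'www.example.org', 'jones.com/ca' ]
```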

Javascript/Regex for finding just the root domain name without sub domains

I had a search and found lots of similar regex examples, but not quite what I need.
I want to be able to pass in the following urls and return the results:
www.google.com returns google.com
sub.domains.are.cool.google.com returns google.com
doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com
sub.domain.google.com/no/thanks returns google.com
Hope that makes sense :)
Thanks in advance!-James
You can't do this with a regular expression because you don't know how many blocks are in the suffix.
For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.
If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.
You will actually need to look up the suffix from a list like http://publicsuffix.org/
Don't use regex, use the .split() method and work from there.
var s = domain.split('.');
If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:
return s.slice(-2).join('.');
It'll make your eyes bleed less than any regex solution.
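A minimal sketch of the split approach, assuming a tiny hand-picked set of two-part suffixes that stands in for a real list such as the one at publicsuffix.org. The function name and the set contents are hypothetical, and ports or userinfo in the URL are not handled.

```javascript
// Hypothetical, deliberately tiny list of two-part suffixes.
const TWO_PART_SUFFIXES = new Set(["co.uk", "com.br", "com.au"]);

function rootDomain(url) {
  // Drop an optional scheme and anything after the first slash.
  const host = url.replace(/^[a-z]+:\/\//i, "").split("/")[0].toLowerCase();
  const parts = host.split(".");
  // Keep three segments when the last two form a known suffix, otherwise two.
  const lastTwo = parts.slice(-2).join(".");
  const take = TWO_PART_SUFFIXES.has(lastTwo) ? 3 : 2;
  return parts.slice(-take).join(".");
}

console.log(rootDomain("www.google.com"));                  // google.com
console.log(rootDomain("sub.domain.google.com/no/thanks")); // google.com
console.log(rootDomain("subdomain.google.co.uk"));          // google.co.uk
```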
I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
EDIT:
To clarify, it's looking for:
one or more alpha-numeric characters or dashes, followed by a literal dot
and then one of three things...
three or more alpha characters (e.g. com/net/mil/coop)
two alpha characters, followed by a literal dot, followed by two more alphas (e.g. co.uk)
two alpha characters (e.g. us/uk/to)
and at the end of that, a word boundary (\b) meaning the end of the string, a space, or a non-word character (in regex word characters are typically alpha-numerics, and underscore).
As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
If you have limited subset of data, I suggest to keep the regex simple, e.g.
(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))
This will match:
www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com
In my case, I know that all relevant URLs will be matched using this regex.
Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.
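Such a test script could be as small as the sketch below. It reuses the three sample URLs listed above, writes the pattern with both dots in co.uk escaped, and uses made-up names (runSamples, samples) for the helper and the dataset.

```javascript
const ROOT_DOMAIN = /(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))/;

// Each entry pairs an input URL with the root domain we expect back.
const samples = [
  ["www.google.com", "google.com"],
  ["www.google.co.uk", "google.co.uk"],
  ["www.foo-bar.com", "foo-bar.com"],
];

// Returns the list of cases where the regex did not produce the expected value.
function runSamples(regex, cases) {
  const failures = [];
  for (const [input, expected] of cases) {
    const m = input.match(regex);
    const actual = m ? m[1] : null;
    if (actual !== expected) failures.push({ input, expected, actual });
  }
  return failures;
}

const failures = runSamples(ROOT_DOMAIN, samples);
console.log(failures.length === 0 ? "all samples passed" : failures);
```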
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
This is an improvement upon theracoonbear's answer.
I did a quick bit of testing and noticed that if you give it a domain where the subdomain has a subdomain, it will fail. I also wanted to point out that the "90%" was definitely not generous. It will be a lot closer to 100% than you think. It works on all subdomains of the top 50 most visited websites which accounts for a huge chunk of worldwide internet activity. The only time it would fail is potentially with unicode domains, etc.
My solution starts off working the same way that theracoonbear's does. Instead of checking for a word boundary, it uses a negative lookahead to check if there is not something that could be a TLD at the end (just copied the TLD checking part over into a negative lookahead).
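To make the behaviour concrete, here is a small sketch that runs the lookahead version against the URLs from the question. The helper name is made up, and the character classes are written as [A-Za-z] throughout. A domain such as example.com.br would still be cut incorrectly, since "com.br" does not fit the two-letter-dot-two-letter branch.

```javascript
const ROOT = /([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b/;

// Returns the first capture group (the root domain) or null when nothing matches.
function extractRootDomain(url) {
  const m = url.match(ROOT);
  return m ? m[1] : null;
}

console.log(extractRootDomain("www.google.com"));                  // google.com
console.log(extractRootDomain("sub.domains.are.cool.google.com")); // google.com
console.log(extractRootDomain("sub.domain.google.com/no/thanks")); // google.com
console.log(extractRootDomain("www.google.co.uk"));                // google.co.uk
```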
Without testing the validity of top level domain, I'm using an adaptation of stormsweeper's solution:
domain = 'sub.domains.are.cool.google.com'
s = domain.split('.')
tld = s.slice(-2).join('.')
EDIT: Be careful of issues with three part TLDs like domain.co.uk.
