Why does this regex execute slowly? - javascript

So I've got a regex that identifies URLs:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
But when I use it to check URLs that a user has entered, simply calling .test slows the page down considerably, even though according to MDN it's supposed to be faster than exec. Am I using an outdated method of testing regular expressions? Is there a faster method that I don't know about? Or is my regex just really long and complicated?
Here's a JSFiddle.
Edit:
Takes 20.7 seconds in Chrome v24
1 minute 48.5 seconds in Internet Explorer 9

So it seems that the regex only lags when it processes a URL that carries a query string, for example the Product.aspx?Item=N82E16811139009 part of the URL in the JSFiddle. When that part of the URL is removed, the regex performs correctly, and quickly.
However, removing the last star from ([\/\w \.-]*)* makes the regex perform incorrectly, so using ([\/\w \.-]*) is not an option.
Rather, for the regex to be able to handle URLs with query strings, the anchored tail needs to be removed:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
to
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*/
This is because the regex is designed to detect file types or a trailing slash at the end of the URL, not a question mark followed by a query string. Removing the trailing \/?$ fixes the problem, and the regex runs correctly and quickly.
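For reference, here is a minimal sketch of the failure mode (the host in the test URL is a made-up stand-in; only the Product.aspx?Item=N82E16811139009 part comes from the question):

// The nested quantifier ([\/\w \.-]*)* lets the engine split the path into
// exponentially many arrangements, so when the "?" forces the anchored
// pattern to fail, it backtracks through all of them (catastrophic backtracking).
var anchored = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;
var relaxed = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*/;

// Hypothetical URL modeled on the query string quoted above.
var url = 'http://www.example.com/Product/Product.aspx?Item=N82E16811139009';

// Uncommenting the next line can hang the tab for many seconds:
// anchored.test(url);

console.log(relaxed.test(url)); // true, and it returns immediately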

Related

Is there a length limitation when using replace method of a string?

I have a big string (1,116,902 characters) that I want to process with a pretty simple regex. I get a response from a SOAP server that is encoded in Base64, so I just grab the result between the appropriate XML tags and then decode it.
This works for small requests, but when I get a big response back, the callback function of the replace() method is never called. I have tried the string against the regex on the regex101 website and it finds the result, so I wonder if there is a limitation in my JavaScript engine. I'm working on a Wakanda Server v10, which uses WebKit as its JavaScript engine. I cannot provide the string because it contains some enterprise information.
Here is my regex: /xsd:base64Binary">((.|\n)*?)<\/responseData>/
I thought it might be a special character that is not covered by the ((.|\n)*?) group. But then why does regex101 find the result? (So maybe it is the JavaScript engine after all.)
Can anybody help me?
Thanks
If you can guarantee that there are no tags between your start and end delimiters, which sounds like it might be the case, you could just change your RE to
/xsd:base64Binary">([^<]*)<\/responseData>/
which shouldn't require any backtracking and might work for you.
[^<] simply means everything but the < character. Since there shouldn't be any tags between the opening and closing tags of your section (at least that's what I understand), that will accept everything until you hit your closing tag. The important thing is that the RE engine can tell immediately whether something matches or not, so no branching or backtracking is required.
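For illustration, a short sketch of that pattern in use (the one-line payload is a hypothetical stand-in for the real, confidential SOAP response):

// Hypothetical stand-in for the SOAP response described above.
var xml = '<responseData xsi:type="xsd:base64Binary">SGVsbG8sIHdvcmxkIQ==</responseData>';

// [^<]* can never run past the closing tag, so no backtracking is needed.
var match = xml.match(/xsd:base64Binary">([^<]*)<\/responseData>/);
if (match) {
  // atob() decodes Base64 in browsers; a server-side engine may need its own decoder.
  console.log(atob(match[1])); // "Hello, world!"
}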

Dynamically created JavaScript function not working with long parameter

I have several HTML <a> tags generated programmatically in ASP.NET, each with an href that calls a JavaScript function with a long parameter. One of them is over 20K characters when it gets assigned in the backend, but I am seeing that the actual link on the browser side has only 5,239 characters and the JavaScript function call has no closing, so the link never works. I am thinking about workarounds for this implementation, since it's not a good idea to put this much data in links, but now I'm just curious about the cause of the issue.
Example of the code assigning a value to the link:
HtmlAnchor.HRef = "javascript:doSomething('Import','" + strHeader_LineIds + "');"
In this case the variable strHeader_LineIds carries a string of over 20K characters.
Example of what I'm actually seeing in client side:
<a id=anchor1 class=class1 href="javascript:doSomething('Import', 'blahblahblahblah....">Link Text</a>
Please note that the JavaScript function call has no closing here, although when I'm debugging in the backend I do see the closing of the call.
I guess this issue may have something to do with the browser's URL limit? I am using IE, and I learned from here that IE has a maximum URL length of 2,083 characters. But how can the link show up with 5,239 characters?
I've had a similar issue with dynamic JavaScript functions created in code and then called. I found that I had to play with swapping out single quotes in the JavaScript function with double quotes, or escaping the quotes.
Then again, just reading your post, it could be a limit issue.
Have you tried assigning the long value to an element in the background and then referencing that as part of the JavaScript? I know IE gets funny with spaces in passed-in parameters.
I think I found an answer to the issue, though. According to this article:
JavaScript URIs
The JavaScript protocol is used for bookmarklets (aka favlets), a lightweight form of extensibility that permits a user to click a button and run some stored JavaScript on the currently loaded page. In IE9, the team did some work to relax the length limit (from ~260 characters, if I recall correctly) to something significantly larger (~5kb, if I recall correctly).
So I just hit the ~5kb limit.
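One workaround, sketching the earlier suggestion to assign the long value to a background element (the element ids and the doSomething() signature are assumptions):

// Stash the 20K-character payload in a hidden field rendered next to the
// link, and wire the call up in script instead of a javascript: URI.
var anchor = document.getElementById('anchor1');
var payload = document.getElementById('lineIdsField'); // e.g. an <input type="hidden">

anchor.href = '#';
anchor.onclick = function () {
  doSomething('Import', payload.value); // read the data from the DOM, not the URL
  return false; // stop the browser from following the href
};

This keeps the href a few characters long no matter how large the parameter grows.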

string.substring(1) give different results on different sites with same html

I'm using substring(1) to get the last character off of an href attribute.
I'm talking JavaScript. On my development site everything works great: the href http://www.mysite.com/page/#B returns B and all is well.
On the live version of the site, the same substring(1) on the same href returns ttp://www.mysite.com/page/#B; it takes off the first character and returns the rest.
I have no idea why it does this. I've used this same script on multiple sites with no issues. The only thing possibly related is that the dev site is running a newer version of jQuery (1.8.1) while the live site is on 1.7.1.
I used substr(-1,1) instead and everything works fine, but I'd like to know if someone can tell me why I got two different results from the same input.
Thanks!
Extra Info
I've been using the OrganicTabs script from CSS-Tricks for a while now (http://css-tricks.com/organic-tabs/) and never had a problem with it until recently. I tracked it down to the substring line:
var curList = base.$el.find("a.current").attr("href").substring(1);
An example of one of the links this script targets is an in-page anchor like <a href="#B">Link</a>.
I have modified the script to use substr instead, but I don't know why I would get inconsistent results with substring.
The href you are reading locally stores only the hash part of the URL, while for some reason the live website's hrefs have the whole URL:
<a href="#B">Link</a> vs. <a href="http://www.mysite.com/page/#B">Link</a>
The substring function is working as expected, since substring(1) returns everything but the first character.
Check how you generate your URLs.
EDIT: The best solution is to take the substring starting from the hash (excluding it), which covers both cases, as well as cases where the hash part is longer than one character:
href.substring(href.indexOf("#")+1)
String.prototype.substring(start) returns the part of the string from index start to the end; it does not count from the end of the string. You need substr(-1) to get the last character.
See:
MDN: substr vs. substring
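A small sketch contrasting the calls discussed above, using the #B link from the question:

var hashOnly = '#B';                            // what the dev site's attr() returned
var fullUrl = 'http://www.mysite.com/page/#B';  // what the live site returned

console.log(hashOnly.substring(1));  // "B"
console.log(fullUrl.substring(1));   // "ttp://www.mysite.com/page/#B"
console.log(fullUrl.substr(-1));     // "B" (just the last character)

// The indexOf-based form from the EDIT handles both shapes:
console.log(fullUrl.substring(fullUrl.indexOf('#') + 1));   // "B"
console.log(hashOnly.substring(hashOnly.indexOf('#') + 1)); // "B"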

Detect FQDN and URL REGEX MATCH (Javascript)

This is not related to a previous question I posted. I need a regex to detect FQDNs such as google.ca/ and www.google.ca/ (it must detect the trailing slash) as well as URLs such as http://www.google.ca and https://www.stackoverflow.com. Can someone help me with this? I am using match (in JavaScript) to detect these FQDNs and URLs. Sorry if this seems to repeat my previous question, but it doesn't (this one is more specific).
I am using this to match Twitter's character count. When they detect a URL or FQDN, they compress it to 21 characters (if it's https) or 20 characters otherwise, no matter how long it is.
Is "google.ca/" an FQDN? I guess it is, given that even http://uz/ resolves.
The question really is what exactly are you searching for? :)
Check if this one works for you: http://regexlib.com/redetails.aspx?regexp_id=1735&AspxAutoDetectCookieSupport=1
If not, regexlib.com is a good source, but I would suggest defining your requirements more precisely/explicitly.
You could just detect anything with a . and no spaces, but it's likely to cause false positives.
Eg.
var s = "This is not related to a previous question I posted. I need a regex to detect FQDN such as google.ca/ and www.google.ca/ (must detect the forward slash) as well as urls such as http://www.google.ca and https://www.stackoverflow.com. Can someone help me with this";
console.log(s.match(/(https?\:\/\/)?([a-z0-9\-]+\.)+[a-z0-9\-]{2,8}\/?/ig))
Output
[
"google.ca/",
"www.google.ca/",
"http://www.google.ca",
"https://www.stackoverflow.com"
]
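To tie this back to the character-count use case, here is a hedged sketch of the weighting rule described in the question (tweetLength is a hypothetical helper; the 20/21 figures come from the question, not from Twitter's actual API):

// Replace every detected URL/FQDN with a fixed-width placeholder before
// counting: 21 characters for https links, 20 for everything else.
function tweetLength(text) {
  var urlPattern = /(https?\:\/\/)?([a-z0-9\-]+\.)+[a-z0-9\-]{2,8}\/?/ig;
  return text.replace(urlPattern, function (url) {
    var width = /^https:/i.test(url) ? 21 : 20;
    return new Array(width + 1).join('x'); // ES5-era stand-in for 'x'.repeat(width)
  }).length;
}

console.log(tweetLength('See https://www.stackoverflow.com and google.ca/'));
// the two links count as 21 and 20 characters respectively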

Javascript/Regex for finding just the root domain name without sub domains

I had a search and found lots of similar regex examples, but not quite what I need.
I want to be able to pass in the following urls and return the results:
www.google.com returns google.com
sub.domains.are.cool.google.com returns google.com
doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com
sub.domain.google.com/no/thanks returns google.com
Hope that makes sense :)
Thanks in advance! - James
You can't do this with a regular expression because you don't know how many blocks are in the suffix.
For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.
If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.
You will actually need to look up the suffix from a list like http://publicsuffix.org/
Don't use regex; use the .split() method and work from there.
var s = domain.split('.');
If your use case is fairly narrow, you could then check the TLDs as needed and return the last two or three segments as appropriate:
return s.slice(-2).join('.');
It'll make your eyes bleed less than any regex solution.
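Combining the split with a small hand-maintained suffix set handles the co.uk-style cases; the set below is an assumption, not the full publicsuffix.org data:

// Keep one extra label when the last two labels form a known multi-part suffix.
var twoPartSuffixes = { 'co.uk': true, 'com.au': true, 'co.jp': true };

function rootDomain(url) {
  var host = url.split('/')[0];  // drop any path, e.g. /no/thanks
  var s = host.split('.');
  var lastTwo = s.slice(-2).join('.');
  var keep = twoPartSuffixes[lastTwo] ? 3 : 2;
  return s.slice(-keep).join('.');
}

console.log(rootDomain('sub.domains.are.cool.google.com'));   // "google.com"
console.log(rootDomain('sub.domain.google.co.uk/no/thanks')); // "google.co.uk"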
I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
EDIT:
To clarify, it's looking for:
one or more alpha-numeric characters or dashes, followed by a literal dot
and then one of three things...
three or more alpha characters (e.g. com/net/mil/coop)
two alpha characters, followed by a literal dot, followed by two more alphas (e.g. co.uk)
two alpha characters (e.g. us/uk/to)
and at the end of that, a word boundary (\b), meaning the end of the string, a space, or a non-word character (in regex, word characters are typically alphanumerics and underscore).
As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
If you have limited subset of data, I suggest to keep the regex simple, e.g.
(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))
This will match:
www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com
In my case, I know that all relevant URLs will be matched by this regex.
Collect a sample dataset and test it against your regex. While prototyping, you can do that with a tool such as https://regex101.com/r/aG9uT0/1; in development, automate it with a test script.
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
This is an improvement upon theracoonbear's answer.
I did a quick bit of testing and noticed that if you give it a domain whose subdomain itself has a subdomain, it will fail. I also wanted to point out that the "90%" estimate was definitely not generous: it will be a lot closer to 100% than you think. It works on all subdomains of the top 50 most visited websites, which account for a huge chunk of worldwide internet activity. The only time it would fail is potentially with Unicode domains, etc.
My solution starts off working the same way theracoonbear's does, but instead of checking for a word boundary, it uses a negative lookahead to check that the match is not followed by something that could itself be a TLD (I just copied the TLD-checking part into a negative lookahead), as the sanity check below illustrates.
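A quick sanity check of the two variants against one of the question's examples:

// Word-boundary version vs. negative-lookahead version.
var boundary = /([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b/;
var lookahead = /([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b/;

var host = 'sub.domains.are.cool.google.com';
console.log(boundary.exec(host)[1]);  // "sub.domains" -- stops at the first word boundary
console.log(lookahead.exec(host)[1]); // "google.com"  -- rejects any match followed by a TLD-like block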
Without testing the validity of the top-level domain, I'm using an adaptation of stormsweeper's solution:
var domain = 'sub.domains.are.cool.google.com';
var s = domain.split('.');
var tld = s.slice(-2).join('.');
EDIT: Be careful of issues with multi-part TLDs like the co.uk in domain.co.uk.
