I have a regex that matches iframe urls, and captures various components. The regex is given below
/(<iframe.*?src=['|"])((?:https?:\/\/|\/\/)[^\/]*)(?:.*?)(['|"][^>]*some-token:)([a-zA-Z0-9]+)(.*?>)/igm
To be clear my actual requirement is to transforms in a html string, such strings
<iframe src="http://somehost.com/somepath1/path2" class="some-token:abc123">
to
<iframe src="http://somehost.com/newpath?token=abc123" class="some-token:abc123">
The regex works as it is supposed to be, but for normal length html, it takes around 2 seconds to execute, which i think is very, high.
I would really appreciate if someone could point me how to optimise this regex, i am sure i am doing something terribly wrong, because before i used this regex
/(<iframe.*?src=['|"])(?:.*?)(['|"][^>]*some-token:)([a-zA-Z0-9]+)(.*?>)/igm
to completely replace the source url and just add the paramter, it was taking just 100 ms
You do not need to (and should not) parse the iframe element as a string; you just need to access its attributes, and retrieve information from them and rewrite them.
function fix_iframe_src(iframe) {
var src = iframe.getAttribute('src');
var klass = iframe.getAttribute('class');
var token = get_token(klass);
src = fix_src(src, token);
iframe.setAttribute('src', src);
}
Writing get_token and fix_src are left as an exercise.
If you want to find a bunch of iframes and fix them all up, then
var iframes = document.querySelectorAll('iframe');
for (var i = 0; i < iframes.length; i++) {
fix_iframe_src(iframes[i]);
}
By the way, the value of your class attribute seems to be broken. I doubt if it will match any CSS rules, if that's the intent. Are you using it for something other than to provide the token? In that case, you would be best off using a data attribute such as data-token.
Minor point about regexp flags: the g and m flags are going to do nothing for you. m is about matching anchors like ^ and $ to the beginning and end of lines within the source string, which is not an issue for you. g is about matching multiple times, which is also not an issue.
The reason your regexp is taking so long is most likely that you are throwing the entire DOM at it. Hard to tell unless you show us the code from which you are calling it.
Related
Background
I am getting an HTML string which I am getting from a ajax call to a web api which I created. I have seen examples which load the DOM however, I am working with a string, and I think there can be a regex, jquery, substring approach to my problem.
Question
I will need the ability to go through an HTML object which will contain IMG tags that I will need to change to predefined URL, which will be triggered through a button action. Currently I am trying a Jquery approach to this problem however I am having in issue changing the URL. I am getting the html as a string and I am tokenizing it with ~~/picture~~ so in the example string I will provide I will have an array of 2 which contain two different img tags. Would a regex or substring approach be more efficient since I am tokenizing this HTML string.
Sample Code
var newUrl = "http://maps.google.com/mapfiles/kml/shapes/bus.png";
var imageTag = contentObject.Content.split("~~/picture~~");
for (var i = 0; i < imageTag.length; i++) {
var checkValue = imageTag[i]; // checking the array value for accuracy
var testChangeSrc = $(imageTag[i]).attr("src", newUrl);
}
Issue with the Code
This code is not changing the html as I expected. From what I see it is loading other none img objects.
Sample String
<p>~~/Document Text~~When it comes to accounting and advisory services, is the unique firm of choice. As a trusted advisor to our clients, we bring an integrated client service approach with dedicated industry experience. respects the value of every client relationship and provides clients throughout the U.S. with an unwavering commitment to hands-on, personal attention from our partners and senior-level professionals.</p><p>~~/Document Text~~in search of a trusted advisor to deal with their state and local tax needs. Through our leading best practices and experience, our SALT professionals offer quality and ease to the client engagement. We are proud to provide highly comprehensive services.</p><p>~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/map-al.jpg" alt="map al" style="width:611px;height:262px;" /> ~~end~~<br></p><p><br></p><p>~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/Firm_Telescope_Illustration.jpg" alt="Firm_Telescope_Illustration.jpg" style="margin:5px;width:155px;height:155px;" /> </p><p>~~end~~<br></p><p>~~/Document Heading 1~~Image Test 3<br></p>
<img alt="base64 test" />
</div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div>
IntendedGoal
My intended Goal is to change the URL in the SRC in every img tag. The example string will be the standard format which comes from my controller. since I am tokenizing this string would a substring approach work ? or regex?
I found this to be the easiest way to do it, and also more efficient than jQuery. basically you are injecting the string into a div and using JavaScript to parse through the Html string which is now in a div. By placing the string in the div it give more functionality and flexibility if you want to add more logic/filtering to your solution. I placed a very simple block of code anyone can follow and change to accommodate their solution. Good luck
function parseHtmlString(yourHtmlString) {
var element = document.createElement('div');
element.innerHTML = yourHtmlString;
var imgSrcUrls = element.getElementsByTagName("img");
for (var i = 0; i < imgSrcUrls.length; i++) {
var urlValue = imgSrcUrls[i].getAttribute("src");
if (urlValue) {
imgSrcUrls[i].setAttribute("src", "You Desired Change");
}
}
}
I'm trying to make a bit of a crude ad-blocker with javascript
The code I currently have:
var pattern = '<iframe(.*?)</iframe>|<object(.*?)</object>';
if (document.body.parentNode.innerHTML.match(pattern))
{
document.body.parentNode.innerHTML =
document.body.parentNode.innerHTML.replace(pattern, '<b>AD BLOCKED</b>');
}
The problem is that the page reloads. Is there a way I can stop the page from reloading? (My main target is adsense)
This does not seem right, since you're just wanting to replace the html on the page. I can't imagine what that will do. To answer your Regex question, though, try this.
var pattern = /<iframe.*<\/iframe>/gi;
document.body.innerHTML =
document.body.innerHTML.replace(pattern, '<strong>bye iframe</strong>');
replace() will swap out all the matches found by the RegExp with the second parameter.
/<iframe.*<\/iframe>/ is a regular expression matching anything within iframe tags.
gi modifies the regex telling it to be global and case-insensitive.
Again, you will probably have some unexpected behavior rewriting the innerHTML of the body, so I'd rethink your approach. Perhaps you could use jQuery to find the tags you don't want and hide or remove them. (example here)
what I would like to do is to the html page for a specific string and read in a certain amount of characters after it and present those characters in an anchor tag.
the problem I'm having is figuring out how to search the page for a string everything I've found relates to by tag or id. Also hoping to make it a greasemonkey script for my personal use.
function createlinks(srchstart,srchend){
var page = document.getElementsByTagName('html')[0].innerHTML;
page = page.substring(srchstart,srchend);
if (page.search("file','http:") != -1)
{
var begin = page.search("file','http:") + 7;
var end = begin + 79;
var link = page.substring(begin,end);
document.body.innerHTML += 'LINK | ';
createlinks(end+1,page.length);
}
};
what I came up with unfortunately after finding the links it loops over the document again
Assisted Direction
Lookup JavaScript Regex.
Apply your regex to the page's HTML (see below).
Different regex functions do different things. You could search the document for the string, as suggested, but you'd have to do it recursively, since the string you're searching for may be listed in multiple places.
To Get the Text in the Page
JavaScript: document.getElementsByTagName('html')[0].innerHTML
jQuery: $('html').html()
Note:
IE may require the element to be capitalized (eg 'HTML') - I forget
Also, the document may have newline characters \n that might want to take out, since one could be between the string you're looking for.
Okay, so in javascript you've got the whole document in the DOM tree. You an search for your string by recursively searching the DOM for the string you want. This is striaghtforward; I'll put in pseudocode because you want to think about what libraries (if any) you're using.
function search(node, string):
if node.innerHTML contains string
-- then you found it
else
for each child node child of node
search(child,string)
rof
fi
I have some text in an element in my page, and i want to scrap the price on that page without any text beside.
I found the page contain price like that:
<span class="discount">now $39.99</span>
How to filter this and just get "$39.99" just using JavaScript and regular expressions.
The question may be too easy or asked by another way before but i know nothing about regular expressions so asked for your help :).
<script language="javascript">
window.onload = function () {
// Get all of the elements with class name "discount"
var elements = document.getElementsByClassName('discount');
// Loop over each <span class="discount">
for (var i=0; i < elements.length; i++) {
// get the text, e.g. "now $39.99"
var rawText = elements[i].innerHTML;
// Here's a regular expression to match one or more digits (\d+)
// followed by a period (\.) and one or more digits again (\d+)
var priceAsString = rawText.match(/\d+\.\d+/)
// You'll want to make the price a floating point number if you
// intend to do any calculations with it.
var price = parseFloat(priceAsString);
// Now what do you want to do with the price? I'll just write it out
// to the console (using FireBug or something similar)
console.log(price);
}
}
</script>
document.evaluate("//span[#class='discount']",
document,
null,
XPathResult.ANY_UNORDERED_NODE_TYPE,
null).singleNodeValue.textContent.replace("now $", "");
EDIT: This is standard XPath. I'm not sure what kind of explanation you're seeking. For outdated browsers, you will need a third-party library like Sarissa and/or Java-line.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
Patrick McElhaney's and Matthew Flaschen's answers are both good ways to solve the problem.
as Matthew Flaschen suggested, XPATH is a better way to go, if you know something about the node structure of the target document (and since you provided an example, you seem to). If you don't know the node structure, regexes are still lousy for parsing XML.
some more resources to kick-start you:
XPath in Javascript: Introduction
DOM Parsing With XPath and JavaScript
Mozilla dev-center: Introduction to using XPath in JavaScript
I've also found the FireFox extension combo of DOM Inspector and XPather to be an invaluable tool for deriving and testing XPath expressions on a given page. (If you're using another browser -- well, I don't know).
We have a project that generates a code snippet that can be used on various other projects. The purpose of the code is to read two parameters from the query string and assign them to the "src" attribute of an iframe.
For example, the page at the URL http://oursite/Page.aspx?a=1&b=2 would have JavaScript in it to read the "a" and "b" parameters. The JavaScript would then set the "src" attribute of an iframe based on those parameters. For example, "<iframe src="http://someothersite/Page.aspx?a=1&b=2" />"
We're currently doing this with server-side code that uses Microsoft's Anti Cross-Scripting library to check the parameters. However, a new requirement has come stating that we need to use JavaScript, and that it can't use any third-party JavaScript tools (such as jQuery or Prototype).
One way I know of is to replace any instances of "<", single quote, and double quote from the parameters before using them, but that doesn't seem secure enough to me.
One of the parameters is always a "P" followed by 9 integers.
The other parameter is always 15 alpha-numeric characters.
(Thanks Liam for suggesting I make that clear).
Does anybody have any suggestions for us?
Thank you very much for your time.
Upadte Sep 2022: Most JS runtimes now have a URL type which exposes query parameters via the searchParams property.
You need to supply a base URL even if you just want to get URL parameters from a relative URL, but it's better than rolling your own.
let searchParams/*: URLSearchParams*/ = new URL(
myUrl,
// Supply a base URL whose scheme allows
// query parameters in case `myUrl` is scheme or
// path relative.
'http://example.com/'
).searchParams;
console.log(searchParams.get('paramName')); // One value
console.log(searchParams.getAll('paramName'));
The difference between .get and .getAll is that the second returns an array which can be important if the same parameter name is mentioned multiple time as in /path?foo=bar&foo=baz.
Don't use escape and unescape, use decodeURIComponent.
E.g.
function queryParameters(query) {
var keyValuePairs = query.split(/[&?]/g);
var params = {};
for (var i = 0, n = keyValuePairs.length; i < n; ++i) {
var m = keyValuePairs[i].match(/^([^=]+)(?:=([\s\S]*))?/);
if (m) {
var key = decodeURIComponent(m[1]);
(params[key] || (params[key] = [])).push(decodeURIComponent(m[2]));
}
}
return params;
}
and pass in document.location.search.
As far as turning < into <, that is not sufficient to make sure that the content can be safely injected into HTML without allowing script to run. Make sure you escape the following <, >, &, and ".
It will not guarantee that the parameters were not spoofed. If you need to verify that one of your servers generated the URL, do a search on URL signing.
Using a whitelist-approach would be better I guess.
Avoid only stripping out "bad" things. Strip out anything except for what you think is "safe".
Also I'd strongly encourage to do a HTMLEncode the Parameters. There should be plenty of Javascript functions that can this.
you can use javascript's escape() and unescape() functions.
Several things you should be doing:
Strictly whitelist your accepted values, according to type, format, range, etc
Explicitly blacklist certain characters (even though this is usually bypassable), IF your whitelist cannot be extremely tight.
Encode the values before output, if youre using Anti-XSS you already know that a simple HtmlEncode is not enough
Set the src property through the DOM - and not by generating HTML fragment
Use the dynamic value only as a querystring parameter, and not for arbitrary sites; i.e. hardcode the name of the server, target page, etc.
Is your site over SSL? If so, using a frame may cause inconsistencies with SSL UI...
Using named frames in general, can allow Frame Spoofing; if on a secure site, this may be a relevant attack vector (for use with phishing etc.)
You can use regular expressions to validate that you have a P followed by 9 integers and that you have 15 alphanumeric values. I think that book that I have at my desk of RegEx has some examples in JavaScript to help you.
Limiting the charset to only ASCII values will help, and follow all the advice above (whitelist, set src through DOM, etc.)