Need help with modifying an existing regex search extension - javascript

I would like to ramp up on extension development by modifying an existing extension.
I have zero experience with JavaScript, but I do have experience with C, C++, Java and Python.
I chose the Regular Expression Search extension by bizsimon.
Here is the JavaScript code of the content script which I am trying to understand.
chrome.extension.onRequest.addListener(function(request, sender, sendResponse) {
    sendResponse(chrome_regex_search(request.exp));
});
function chrome_regex_search(exp) {
    var tw=document.createTreeWalker(document.getElementsByTagName("body")[0], NodeFilter.SHOW_TEXT, null, false);
    while (node = tw.nextNode()) {
        node.parentNode.innerHTML=node.parentNode.innerHTML.replace(/<font class="chrome_search_highlight"[^>]*>(.+)<\/font>/igm, '$1');
    }
    try {
        var pattern=eval("/(>[^<]*)("+exp+")([^<]*<)/igm");
        var tw=document.createTreeWalker(document.getElementsByTagName("body")[0], NodeFilter.SHOW_TEXT, null, false);
        while(node=tw.nextNode()) {
            node.parentNode.innerHTML=node.parentNode.innerHTML.replace(pattern, '$1<font class="chrome_search_highlight" style="background: yellow">$2</font>$3');
        }
        return {"count": document.getElementsByClassName("chrome_search_highlight").length};
    } catch(e) {
        return {"count": 0};
    }
}
And here are my questions:
What does this code do?
node.parentNode.innerHTML=node.parentNode.innerHTML.replace(/<font class="chrome_search_highlight"[^>]*>(.+)<\/font>/igm, '$1');
I would like to add navigation buttons which enable the user to move from one search result to another. What changes are required in the script? I assume that I will now need to save some state which remembers which search result is currently being visited. How do I make the browser jump from one search result to another?
Any useful comments which would help me understand the code, or even a code walkthrough, would be very much appreciated.

Question #1: as Jason S said, it's stripping the <font> tag: specifically those that are of the "chrome_search_highlight" class. In other words, it's walking the node tree of the body element and removing previous search hit highlights.
Then in the second tree-walking loop, it's adding those same font tags around occurrences of the supplied regexp. The cryptic (>[^<]*) group before, and similar group after, the regexp is there to help ensure that you're matching actual page text, not the name of an HTML element or an attribute name or value. I.e. the regexp search hit must be preceded by a > that is not followed by a < until after the search hit.
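To make the two replace calls concrete, here is a small sketch that applies the same two patterns to plain strings instead of innerHTML. The sample markup and the variable names are made up for illustration, and "foo" stands in for whatever expression the user typed.

```javascript
// Pattern from the first loop: unwraps highlights left by an earlier search.
const stripPattern = /<font class="chrome_search_highlight"[^>]*>(.+)<\/font>/igm;
// Pattern from the second loop, with "foo" standing in for the user's expression.
const highlightPattern = /(>[^<]*)(foo)([^<]*<)/igm;
const wrapper = '$1<font class="chrome_search_highlight" style="background: yellow">$2</font>$3';

// Hypothetical markup that already contains one highlight.
const highlighted = '<p>a <font class="chrome_search_highlight" style="background: yellow">hit</font> b</p>';
const stripped = highlighted.replace(stripPattern, '$1');
// stripped is now '<p>a hit b</p>'

// Hypothetical markup where "foo" appears both in an attribute and in the text.
const link = '<a title="foo">foo bar</a>';
const marked = link.replace(highlightPattern, wrapper);
// Only the "foo" in the text is wrapped; the title attribute is left alone,
// because a match has to start at a ">" with no "<" before the hit.
```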
Off to bed...

For your question #1: That code looks like it's trying to strip a <font> tag from HTML, e.g. change <font ...>real content here</font> to real content here.
Just a side comment: prefer using new RegExp(somestring) to eval("/"+somestring+"/"), as the eval can lead to a possible security hole. (See the MDC docs for syntax particulars.)
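A minimal sketch of that swap, assuming the same pattern text as in the question. The helper names buildHighlightPattern and tryBuildHighlightPattern are made up here and are not part of the extension.

```javascript
// Hypothetical helper: builds the highlight pattern without eval.
function buildHighlightPattern(exp) {
  // Same pattern text and flags as the eval version, but exp is only ever
  // treated as regular expression source, never as code.
  return new RegExp("(>[^<]*)(" + exp + ")([^<]*<)", "igm");
}

// Hypothetical wrapper that mirrors the try/catch in the original script.
function tryBuildHighlightPattern(exp) {
  try {
    return buildHighlightPattern(exp);
  } catch (e) {
    // An invalid expression such as "(" throws a SyntaxError.
    return null;
  }
}
```

A side effect is that an expression containing a slash, such as a/b, works as typed, whereas inside the eval string the slash would end the regex literal early.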

Related

Using grease/tampermonkey (or another extension) to omit certain characters inside web elements

Firstly, I apologize if my terminology here isn't the most accurate; I'm very much a novice when it comes to programming. A forum I frequent has added a bunch of unnecessary, "glitchy" images and text to the page as a part of some promotion, but the result is that the forum is now difficult to use and read. I was able to script out most of it using adblock, but there's one last bit that shows up inside the forum elements themselves, and adblock wants to remove the whole element (which breaks the forum). This is part of the code in question, with the URLs changed:
<td class="windowbg" valign="middle" width="42%">&blk34;&blk34;&blk34;&blk34;&blk34;
Thread title <span class="smalltext"></span><img src="example.com/forumicon.gif"></td>
As you can see, the ▓ character shows up a bunch of times for no reason. Is there a way to make my browser ignore this character when it's inside of an element? If there's a way to do this using AdBlock, I am not smart enough to see it.
Here's one way to do it, using a NodeIterator:
var iter = document.createNodeIterator( document.body, NodeFilter.SHOW_TEXT );
var node;
while (node = iter.nextNode()) {
    node.textContent = node.textContent.replace( /[\u2580-\u259f]+/g, '' );
}
This is just plain JavaScript code; you can paste it into the Firefox / Chrome JS console to test it. The regexp /[\u2580-\u259f]+/ matches any sequence of characters in the "Block Elements" Unicode block, including U+2593 Dark Shade (▓). You may want to tweak the regexp to match the characters you want to remove. (Tip: If you don't know what the codes for those characters are, copy and paste them into the "UTF8 String" box on this page.)
Ps. If these characters that you want to remove occur only in a certain part of the document, you can make this code a bit more efficient by replacing the root node (document.body above) with the specific DOM node that you want to remove the characters from. To find the nodes you want, you can use e.g. document.getElementById() or, more generally, document.querySelector() (or even document.querySelectorAll() and loop over the results).
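As a rough sketch of that narrowing, assuming the unwanted characters only appear inside cells like the td.windowbg one shown in the question. The function names are made up for illustration.

```javascript
// Removes every character in the "Block Elements" range from a string.
function stripBlockElements(text) {
  return text.replace(/[\u2580-\u259f]+/g, "");
}

// Cleans the text nodes under each element that matches the selector.
// doc is passed in so the function does not depend on a global document.
function cleanMatching(doc, selector) {
  for (const root of doc.querySelectorAll(selector)) {
    const iter = doc.createNodeIterator(root, NodeFilter.SHOW_TEXT);
    let node;
    while ((node = iter.nextNode())) {
      node.textContent = stripBlockElements(node.textContent);
    }
  }
}

// In the browser console or a userscript you would call, for example:
// cleanMatching(document, "td.windowbg");
```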

How can spaces be converted to &nbsp without breaking HTML tags?

I've inherited some pretty complex code for a web forum, and one of the features I'm trying to implement is the ability for spaces to not be truncated into only one. This is mainly because our users often want to include ASCII art, tables etc in their posts.
I first did this using a simple search and replace in JavaScript, which had the side effect of breaking HTML tags (e.g. <a href=....> became <a&nbsp;href=....>).
I then tried doing this on server side, when the strings are retrieved, by having spaces converted before links and code people insert is converted to HTML. This works to a degree but it causes some issues with other parts of the code, for example where a message is truncated to appear on the home page, it might leave some of the space code, such as
Here is a message&nb
I think there may be a way to just alter the original javascript to achieve this - it just needs to only match spaces that are not inside a HTML tag.
The script I was using originally was message = message.replace(/\s/g, "&nbsp;").
Thanks for any help you can provide with this.
You can use the pre element to include preformatted text, which renders spaces as-is. See http://www.w3.org/TR/html5-author/the-pre-element.html
Those docs specifically say one of the best uses of the pre element is "Displaying ASCII art".
Example: http://jsbin.com/owuruz/edit#preview
<pre>
/\_/\
____/ o o \
/~____ =ø= /
(______)__m_m)
</pre>
In your case, just put your message inside a pre tag.
Yes, but you need to process the text content of elements, not all of the HTML document content. Moreover, you need to exclude style and script element content. As you can limit yourself to things inside the body element, you could use a recursive function like the following, calling it with process(document.body) to apply it to the entire document (but you probably want to apply it to a specific element only):
function process(element) {
    var children = element.childNodes;
    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        if (child.nodeType === 3) {
            // Text node: replace each space with a no-break space
            if (child.data) {
                child.data = child.data.replace(/[ ]/g, "\xa0");
            }
        } else if (child.tagName != "SCRIPT" && child.tagName != "STYLE") {
            // Recurse into everything except script and style elements
            process(child);
        }
    }
}
(No reason to use the entity reference here; you can use the no-break space character U+00A0 itself, referring to it as "\xa0" in JavaScript.)
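If the message is only available as an HTML string, as in the original message.replace call, a rough sketch of the same idea is to split the string on tags and replace spaces only in the pieces between them. The function name is made up, and the sketch assumes simple markup: no ">" inside attribute values, and no script, style or comment content in the message.

```javascript
// Replaces spaces with &nbsp; in the text between tags, leaving tags untouched.
function nbspOutsideTags(html) {
  // Splitting on a capturing group keeps the tags in the result at the odd
  // indices; the even indices hold the text between tags.
  return html
    .split(/(<[^>]*>)/)
    .map((part, i) => (i % 2 === 1 ? part : part.replace(/ /g, "&nbsp;")))
    .join("");
}
```

For example, nbspOutsideTags('a  <a href="x y">b c</a>') leaves the space inside the href attribute alone and converts the other three.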
One way is to use <pre> tags to wrap your users' posts so that their ASCII art is preserved. But why not use Markdown (like Stack Overflow does)? There are a couple of different ports of Markdown to JavaScript:
Showdown
WMD
uedit

Interactive string manipulation via javascript

I have a webapp that must allow users to interactively manipulate strings (words, phrases and so on...)
Example:
given a foobar string, if the user clicks on b the string is split in two and a whitespace is added, resulting in foo bar.
I could put each single character inside a span element, but I fear this would be troublesome for long strings.
Any advice?
This version using jQuery (not necessary) should pretty much do what you need if I understood you correctly:
// Given a textarea with the content
var text = $('textarea').val().split('');
$('textarea').click(function(){
    // Insert a space at the clicked caret position
    text.splice(this.selectionStart, 0, " ");
    this.value = text.join('');
});
It's a very simple and not cross browser enabled example, but it should get you started.
Yes, that will be OK, but set up your event handler on the whole container rather than on the individual spans, and then see here: http://en.wikipedia.org/wiki/Flyweight_pattern
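A minimal sketch of the single-handler idea, assuming one span per character with its position stored in a data-index attribute. All of the function names and the attribute name are made up for illustration, and the text is assumed to contain no HTML special characters.

```javascript
// Inserts a space before the character at index: ("foobar", 3) gives "foo bar".
function insertSpaceAt(text, index) {
  return text.slice(0, index) + " " + text.slice(index);
}

// Builds one span per character, each tagged with its position in the string.
function renderSpans(text) {
  return text
    .split("")
    .map((ch, i) => '<span data-index="' + i + '">' + ch + "</span>")
    .join("");
}

// Attaches ONE click listener to the container; clicks on any span bubble up to it.
function attach(container, initialText) {
  let text = initialText;
  container.innerHTML = renderSpans(text);
  container.addEventListener("click", function (event) {
    const index = event.target.dataset && event.target.dataset.index;
    if (index === undefined) return; // the click did not land on a character span
    text = insertSpaceAt(text, Number(index));
    container.innerHTML = renderSpans(text);
  });
}
```

In a page you would call something like attach(document.getElementById("word"), "foobar"), where "word" is a placeholder id. The spans are rebuilt after every click, but the listener is registered only once.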

Looking for a way to search an html page with javascript

What I would like to do is search the HTML page for a specific string, read in a certain number of characters after it, and present those characters in an anchor tag.
The problem I'm having is figuring out how to search the page for a string; everything I've found relates to searching by tag or id. I'm also hoping to make it a Greasemonkey script for my personal use.
function createlinks(srchstart,srchend){
    var page = document.getElementsByTagName('html')[0].innerHTML;
    page = page.substring(srchstart,srchend);
    if (page.search("file','http:") != -1)
    {
        var begin = page.search("file','http:") + 7;
        var end = begin + 79;
        var link = page.substring(begin,end);
        document.body.innerHTML += 'LINK | ';
        createlinks(end+1,page.length);
    }
};
This is what I came up with. Unfortunately, after finding the links it loops over the document again.
Assisted Direction
Look up JavaScript Regex.
Apply your regex to the page's HTML (see below).
Different regex functions do different things. You could search the document for the string, as suggested, but you'd have to do it recursively, since the string you're searching for may be listed in multiple places.
To Get the Text in the Page
JavaScript: document.getElementsByTagName('html')[0].innerHTML
jQuery: $('html').html()
Note:
IE may require the element name to be capitalized (e.g. 'HTML') - I forget.
Also, the document may contain newline characters \n that you might want to take out, since one could fall in the middle of the string you're looking for.
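As one possible sketch of collecting every occurrence, here is a plain loop over a snapshot of the HTML. It uses indexOf rather than a regex or recursion, and the function name and parameters are made up for illustration. Reading the HTML once and only moving the search position forward also avoids rescanning the document after it has been modified.

```javascript
// Returns the `length` characters that start `skip` characters into each
// occurrence of `marker` in `html`.
function extractAfterMarker(html, marker, skip, length) {
  const results = [];
  let from = 0;
  while (true) {
    const idx = html.indexOf(marker, from);
    if (idx === -1) break;          // no further occurrences
    const begin = idx + skip;
    results.push(html.substring(begin, begin + length));
    from = begin + length;          // continue after the piece just taken
  }
  return results;
}

// With the values from the question this would be called as:
// extractAfterMarker(document.documentElement.innerHTML, "file','http:", 7, 79);
```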
Okay, so in JavaScript you've got the whole document in the DOM tree. You can search for your string by recursively searching the DOM for the string you want. This is straightforward; I'll put it in pseudocode because you want to think about what libraries (if any) you're using.
function search(node, string):
    if node.innerHTML contains string
        -- then you found it
    else
        for each child node child of node
            search(child,string)
        rof
    fi
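A library-free sketch of that recursion in plain JavaScript might look like the following. It differs from the pseudocode in one respect: it checks text nodes rather than innerHTML, so a hit is reported at the deepest node that contains the string instead of at the first ancestor. The function name is made up for illustration.

```javascript
// Collects the text nodes under `node` whose text contains `needle`.
function findTextNodes(node, needle, found = []) {
  for (const child of node.childNodes) {
    if (child.nodeType === 3) {
      // 3 is Node.TEXT_NODE
      if (child.data.includes(needle)) found.push(child);
    } else if (child.nodeType === 1) {
      // 1 is Node.ELEMENT_NODE: descend into it
      findTextNodes(child, needle, found);
    }
  }
  return found;
}

// In a page: findTextNodes(document.body, "some string");
```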

How to filter using Regex and javascript?

I have some text in an element on my page, and I want to scrape the price from that page without any of the text beside it.
I found that the page contains the price like this:
<span class="discount">now $39.99</span>
How can I filter this and get just "$39.99", using only JavaScript and regular expressions?
The question may be too easy or may have been asked another way before, but I know nothing about regular expressions, so I am asking for your help :).
<script language="javascript">
window.onload = function () {
    // Get all of the elements with class name "discount"
    var elements = document.getElementsByClassName('discount');
    // Loop over each <span class="discount">
    for (var i = 0; i < elements.length; i++) {
        // get the text, e.g. "now $39.99"
        var rawText = elements[i].innerHTML;
        // Here's a regular expression to match one or more digits (\d+)
        // followed by a period (\.) and one or more digits again (\d+)
        var match = rawText.match(/\d+\.\d+/);
        // match() returns null when the text contains no price
        if (!match) continue;
        var priceAsString = match[0];
        // You'll want to make the price a floating point number if you
        // intend to do any calculations with it.
        var price = parseFloat(priceAsString);
        // Now what do you want to do with the price? I'll just write it out
        // to the console (using FireBug or something similar)
        console.log(price);
    }
};
</script>
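If you want the string "$39.99" itself, dollar sign included, rather than a number, a small sketch along the same lines is shown here. The function name is made up, and the pattern assumes prices written with a dollar sign, digits and an optional decimal part.

```javascript
// Returns the first price-like substring such as "$39.99", or null if none is found.
function extractPrice(text) {
  // \$ matches a literal dollar sign; (?:\.\d+)? makes the decimal part optional.
  const match = text.match(/\$\d+(?:\.\d+)?/);
  return match ? match[0] : null;
}
```

For the span in the question, extractPrice("now $39.99") gives "$39.99".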
document.evaluate("//span[@class='discount']",
                  document,
                  null,
                  XPathResult.ANY_UNORDERED_NODE_TYPE,
                  null).singleNodeValue.textContent.replace("now $", "");
EDIT: This is standard XPath. I'm not sure what kind of explanation you're seeking. For outdated browsers, you will need a third-party library like Sarissa and/or Java-line.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
Patrick McElhaney's and Matthew Flaschen's answers are both good ways to solve the problem.
As Matthew Flaschen suggested, XPath is a better way to go if you know something about the node structure of the target document (and since you provided an example, you seem to). If you don't know the node structure, regexes are still lousy for parsing XML.
Some more resources to kick-start you:
XPath in Javascript: Introduction
DOM Parsing With XPath and JavaScript
Mozilla dev-center: Introduction to using XPath in JavaScript
I've also found the Firefox extension combo of DOM Inspector and XPather to be an invaluable tool for deriving and testing XPath expressions on a given page. (If you're using another browser -- well, I don't know).
