Looking for a way to search an HTML page with JavaScript

What I would like to do is search the HTML page for a specific string, read in a certain number of characters after it, and present those characters in an anchor tag.
The problem I'm having is figuring out how to search the page for a string; everything I've found relates to searching by tag or id. I'm also hoping to make it a Greasemonkey script for my personal use.
function createlinks(srchstart,srchend){
    var page = document.getElementsByTagName('html')[0].innerHTML;
    page = page.substring(srchstart,srchend);
    if (page.search("file','http:") != -1)
    {
        var begin = page.search("file','http:") + 7;
        var end = begin + 79;
        var link = page.substring(begin,end);
        document.body.innerHTML += 'LINK | ';
        createlinks(end+1,page.length);
    }
};
This is what I came up with. Unfortunately, after finding the links, it loops over the document again.

Assisted Direction
Look up JavaScript regex.
Apply your regex to the page's HTML (see below).
Different regex functions do different things. You could search the document for the string, as suggested, but you'd have to do it repeatedly, since the string you're searching for may appear in multiple places.
To Get the Text in the Page
JavaScript: document.getElementsByTagName('html')[0].innerHTML
jQuery: $('html').html()
Note:
IE may require the element name to be capitalized (e.g. 'HTML'); I forget.
Also, the document may contain newline characters (\n) that you might want to strip out, since one could fall in the middle of the string you're looking for.
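As a rough sketch of the regex approach, the helper below collects a fixed number of characters after every occurrence of a marker string. The names textAfterEach and escapeRegExp are made up for illustration, and the function works on any string, so the page HTML has to be passed in by the caller.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return up to `length` characters following each
// occurrence of `marker` in `html`. The g flag lets matchAll walk through
// every occurrence in a single pass instead of re-searching the page.
function textAfterEach(html, marker, length) {
    const pattern = new RegExp(
        escapeRegExp(marker) + '([\\s\\S]{0,' + length + '})',
        'g'
    );
    return Array.from(html.matchAll(pattern), function (match) {
        return match[1];
    });
}
```

In a Greasemonkey script this could be called as textAfterEach(document.documentElement.innerHTML, "file','", 79). Reading the HTML once, before appending any links to the page, avoids scanning content that the script itself added.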

Okay, so in JavaScript you've got the whole document in the DOM tree. You can search for your string by recursively searching the DOM for the string you want. This is straightforward; I'll put it in pseudocode because you'll want to think about which libraries (if any) you're using.
function search(node, string):
    if node.innerHTML contains string
        -- then you found it
    else
        for each child node child of node
            search(child,string)
        rof
    fi
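One possible plain-JavaScript rendering of that pseudocode is sketched below. It is a variation rather than a literal translation: it inspects text nodes instead of innerHTML, so that it returns the deepest nodes containing the string rather than stopping at the root. The name findTextNodes is made up for illustration.

```javascript
// Hypothetical helper: collect every text node under `node` whose text
// contains `needle`. Only nodeType, data and childNodes are used, so any
// DOM node (for example document.body) can be passed in.
function findTextNodes(node, needle, found) {
    found = found || [];
    if (node.nodeType === 3) { // 3 === text node
        if (node.data.indexOf(needle) !== -1) {
            found.push(node);
        }
        return found;
    }
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
        findTextNodes(children[i], needle, found);
    }
    return found;
}
```

Calling findTextNodes(document.body, 'some text') would return an array of matching text nodes. Note that a string split across several elements will not be found this way.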

Related

cheerio find a text in a script tag

I want to extract the JavaScript inside a script tag.
This is the script tag:
<script>
$(document).ready(function(){
    $("#div1").click(function(){
        $("#divcontent").load("ajax.content.php?p=0&cat=1");
    });
    $("#div2").click(function(){
        $("#divcontent").load("ajax.content.php?p=1&cat=1");
    });
});
</script>
I have an array of ids like ['div1', 'div2'], and I need to extract the URL loaded by each id's click handler.
So if I call a function:
getUrlOf('div1');
it will return ajax.content.php?p=0&cat=1
If you're using a newer version of cheerio (1.0.0-rc.2), you'll need to use .html() instead of .text()
const cheerio = require('cheerio');
const $ = cheerio.load('<script>script one</script> <script> script two</script>');
// For the first script tag
console.log($('script').html());
// For all script tags
console.log($('script').map((idx, el) => $(el).html()).toArray());
https://github.com/cheeriojs/cheerio/issues/1050
With Cheerio, it is very easy to get the text of the script tag:
const cheerio = require('cheerio');
const $ = cheerio.load("the HTML of the webpage you are scraping");
// If there's only one <script>
console.log($('script').text());
// If there are multiple scripts, wrap each element with $() before calling .text()
$('script').each((idx, elem) => console.log($(elem).text()));
From here, you're really just asking "how do I parse a generic block of javascript and extract a list of links". I agree with Patrick above in the comments, you probably shouldn't. Can you craft a regex that will let you find each link in the script and deduce the page it links to? Yes. But very likely, if anything about this page changes, your script will immediately break - the author of the page might switch to inline <a> tags, refactor the code, use live events, etc.
Just be aware that relying on the exact contents of this script tag will make your application very brittle -- even more brittle than page scraping generally is.
EDIT: Sure, here's an example of a loose but effective regex:
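let html = "incoming html";
// The g flag is required: without it, exec() restarts from index 0 on every call and the loop never ends
let regex = /\$\("(#.+?)"\)\.click(?:.|\n)+?\.load\("(.+?)"/g;
let match;
while ((match = regex.exec(html)) !== null) {
    console.log(match[1] + ': ' + match[2]);
}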
In case you are new to regex: this expression contains two capture groups, in parens (the first is the div id, the second is the link text), as well as a non-capturing group in the middle, which exists only to make sure the regex will continue through a line break. I say it's "loose" because the match it is looking for looks like this:
$("***").click***ignored chars***.load("***"
So, depending on how much javascript there is and how similar it is, you might have to tighten it up to avoid false positives.
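To connect this back to the getUrlOf call in the question, here is a minimal sketch built on the same kind of loose regex. It assumes the script text has already been extracted (for example with cheerio) and passes it as an extra first argument, which the question's one-argument getUrlOf did not have; buildUrlMap is a made-up helper name.

```javascript
// Hypothetical helper: map each clicked element id to the URL passed to .load().
function buildUrlMap(scriptText) {
    // Group 1: the id without "#"; group 2: the URL inside .load("...").
    const pattern = /\$\("#([^"]+)"\)\.click[\s\S]+?\.load\("([^"]+)"/g;
    const urls = new Map();
    for (const match of scriptText.matchAll(pattern)) {
        urls.set(match[1], match[2]);
    }
    return urls;
}

// Returns the URL for `id`, or undefined when no handler was found.
function getUrlOf(scriptText, id) {
    return buildUrlMap(scriptText).get(id);
}
```

The same brittleness warning applies: this only works while the page keeps the $("#id").click(...) and .load("...") shape shown in the question.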

Optimising regex for matching domain name in url

I have a regex that matches iframe urls, and captures various components. The regex is given below
/(<iframe.*?src=['|"])((?:https?:\/\/|\/\/)[^\/]*)(?:.*?)(['|"][^>]*some-token:)([a-zA-Z0-9]+)(.*?>)/igm
To be clear, my actual requirement is to transform, within an HTML string, strings such as
<iframe src="http://somehost.com/somepath1/path2" class="some-token:abc123">
to
<iframe src="http://somehost.com/newpath?token=abc123" class="some-token:abc123">
The regex works as it is supposed to, but for HTML of normal length it takes around 2 seconds to execute, which I think is very high.
I would really appreciate it if someone could point out how to optimise this regex. I am sure I am doing something terribly wrong, because before this I used the following regex
/(<iframe.*?src=['|"])(?:.*?)(['|"][^>]*some-token:)([a-zA-Z0-9]+)(.*?>)/igm
to completely replace the source URL and just add the parameter, and it was taking just 100 ms.
You do not need to (and should not) parse the iframe element as a string; you just need to access its attributes, and retrieve information from them and rewrite them.
function fix_iframe_src(iframe) {
    var src = iframe.getAttribute('src');
    var klass = iframe.getAttribute('class');
    var token = get_token(klass);
    src = fix_src(src, token);
    iframe.setAttribute('src', src);
}
Writing get_token and fix_src is left as an exercise.
If you want to find a bunch of iframes and fix them all up, then
var iframes = document.querySelectorAll('iframe');
for (var i = 0; i < iframes.length; i++) {
    fix_iframe_src(iframes[i]);
}
By the way, the value of your class attribute seems to be broken. I doubt if it will match any CSS rules, if that's the intent. Are you using it for something other than to provide the token? In that case, you would be best off using a data attribute such as data-token.
Minor point about regexp flags: the g and m flags are going to do nothing for you. m is about matching anchors like ^ and $ to the beginning and end of lines within the source string, which is not an issue for you. g is about matching multiple times, which is also not an issue.
The reason your regexp is taking so long is most likely that you are throwing the entire DOM at it. Hard to tell unless you show us the code from which you are calling it.
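For reference, the get_token and fix_src helpers left as an exercise above might look something like the following. This is only a sketch: it assumes the token always appears in the class value as some-token:<alphanumerics> and that the new path is always /newpath, as in the question's example.

```javascript
// Pull the token out of a class value such as "some-token:abc123".
// Returns null when no token is present.
function get_token(klass) {
    var match = /(?:^|\s)some-token:([a-zA-Z0-9]+)/.exec(klass || '');
    return match ? match[1] : null;
}

// Keep the scheme and host of `src`, and replace the rest with the new
// path plus the token. Protocol-relative URLs (//host/...) are supported.
function fix_src(src, token) {
    var match = /^((?:https?:)?\/\/[^\/]+)/.exec(src || '');
    if (!match || token === null) {
        return src; // nothing recognisable to rewrite, so leave it alone
    }
    return match[1] + '/newpath?token=' + encodeURIComponent(token);
}
```

Because these regexes run on short attribute values rather than on the whole HTML string, their cost should be negligible compared with the original approach.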

Remove remote content links in HTML using javascript

I have to scan an HTML document for remote content (iframe tags, img tags, script tags, etc.) and remove the links present in them based on a blacklist.
I am able to remove iframe, img, and script tags whose src points to a blacklisted URL.
var mySpan = document.createElement("span");
mySpan.innerHTML = "";
var block = p[key];
var re = new RegExp(block);
a = document.getElementsByTagName('iframe');
for(i=0;i<a.length;i++)
{
    var str = a.item(i).src;
    if(str.match(re))
    {
        a[i].parentNode.replaceChild(mySpan, a[i]);
        // + "a.item(i).src = '';
    }
}
Similarly for script and img tags. But there can be many more such tags. Can I have a generic solution that traverses all tags in the HTML and finds/replaces links that are blacklisted?
I am very new to JavaScript, so I am a bit weak in its basics. Can this solution work in my case?
I don't want to use libraries such as jQuery, as I am doing this on Android.
Get all elements in the document with document.getElementsByTagName('*').
Once you have them, use whatever code you find suitable to check each element for your condition.
This makes sure that you have checked everything. If you were using jQuery, I could make things simpler.
But much respect for being a pure JavaScripter!
Don't use any regexp on HTML; use the DOM.
Review the HTML standard for the list of attributes on tags that can contain external links.
Loop over the collections returned from document.getElementsByTagName(tagname).
Check each attribute against the blacklist and clean up with .getAttribute and .removeAttribute (bonus: you will have normalized data, so no need to worry about people trying to sneak by with funky escaping!).
Many of those attributes will be called src, so you might want to loop over tag name "*" with this attribute just to be a little future-proof/paranoid. Or just loop over all attributes on all elements. This will be very slow, though, and still doesn't guarantee that somebody won't get around it by using URLs that are hard to distinguish from plain text (like an IP or a domain name without a protocol), so I recommend against a full scan.
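The steps above could be sketched roughly as follows, without any library. The attribute list is an assumption and deliberately incomplete (check the HTML standard for the full set), and stripBlacklistedLinks is a made-up name. The blacklist is taken to be an array of RegExp objects without the g flag, since test() on a global regex keeps state between calls.

```javascript
// Assumed, incomplete list of attributes that can carry a remote URL.
var LINK_ATTRIBUTES = ['src', 'href', 'data', 'poster', 'action'];

// Hypothetical helper: remove every listed attribute whose value matches
// one of the blacklist patterns. Returns the number of attributes removed.
function stripBlacklistedLinks(elements, blacklist) {
    var removed = 0;
    for (var i = 0; i < elements.length; i++) {
        var el = elements[i];
        for (var j = 0; j < LINK_ATTRIBUTES.length; j++) {
            var name = LINK_ATTRIBUTES[j];
            var value = el.getAttribute(name);
            if (value === null || value === undefined) {
                continue; // attribute not present on this element
            }
            var blocked = blacklist.some(function (re) {
                return re.test(value);
            });
            if (blocked) {
                el.removeAttribute(name);
                removed++;
            }
        }
    }
    return removed;
}
```

On a page this could be called as stripBlacklistedLinks(document.getElementsByTagName('*'), [/ads\.example\.com/]), where the pattern is a placeholder. Removing attributes, unlike replacing elements, does not change the live collection while it is being iterated.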

How can spaces be converted to &nbsp without breaking HTML tags?

I've inherited some pretty complex code for a web forum, and one of the features I'm trying to implement is the ability for consecutive spaces not to be collapsed into one. This is mainly because our users often want to include ASCII art, tables, etc. in their posts.
I first did this using a simple search and replace in JavaScript, which had the side effect of breaking HTML tags (e.g. <a href=....> became <a&nbsp;href=.....>).
I then tried doing this on server side, when the strings are retrieved, by having spaces converted before links and code people insert is converted to HTML. This works to a degree but it causes some issues with other parts of the code, for example where a message is truncated to appear on the home page, it might leave some of the space code, such as
Here is a message&nb
I think there may be a way to just alter the original JavaScript to achieve this: it just needs to match only spaces that are not inside an HTML tag.
The script I was using originally was message = message.replace(/\s/g, "&nbsp;").
Thanks for any help you can provide with this.
You can use the pre element to include preformatted text, which renders spaces as-is. See http://www.w3.org/TR/html5-author/the-pre-element.html
Those docs specifically say one of the best uses of the pre element is "Displaying ASCII art".
Example: http://jsbin.com/owuruz/edit#preview
<pre>
         /\_/\
    ____/ o o \
  /~____  =ø= /
 (______)__m_m)
</pre>
In your case, just put your message inside a pre tag.
Yes, but you need to process the text content of elements, not all of the HTML document content. Moreover, you need to exclude style and script element content. As you can limit yourself to things inside the body element, you could use a recursive function like the following, calling it with process(document.body) to apply it to the entire document (but you probably want to apply it to a specific element only):
function process(element) {
    var children = element.childNodes;
    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        if (child.nodeType === 3) {
            // Text node: swap ordinary spaces for no-break spaces
            if (child.data) {
                child.data = child.data.replace(/[ ]/g, "\xa0");
            }
        } else if (child.tagName != "SCRIPT" && child.tagName != "STYLE") {
            // Recurse into everything except script and style content
            process(child);
        }
    }
}
(No reason to use the entity reference here; you can use the no-break space character U+00A0 itself, referring to it as "\xa0" in JavaScript.)
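If the message is only available as an HTML string (as in the question's original replace call), one rough alternative is a regex whose first alternative consumes whole tags, so that only spaces outside tags are rewritten. This is a sketch under the assumption that no attribute value contains a > character; preserveSpaces is a made-up name.

```javascript
// Hypothetical helper: replace spaces with &nbsp; everywhere except
// inside tags. The alternation matches a complete tag first and returns
// it unchanged, so spaces within it are never seen on their own.
function preserveSpaces(html) {
    return html.replace(/<[^>]*>|[ ]/g, function (match) {
        return match === ' ' ? '&nbsp;' : match;
    });
}
```

It matches only the plain space character rather than \s, so line breaks in the message are left alone.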
One way is to use <pre> tags to wrap your users' posts so that their ASCII art is preserved. But why not use Markdown (like Stack Overflow does)? There are a couple of different ports of Markdown to JavaScript:
Showdown
WMD
uedit

Interactive string manipulation via javascript

I have a webapp that must allow users to interactively manipulate strings (words, phrases and so on...)
Example:
given a foobar string, if the user clicks on b the string is split in two and a whitespace is added, resulting in foo bar.
I could put each single character inside a span element, but I fear this would be troublesome for long strings.
Any advice?
This version uses jQuery (not strictly necessary) and should pretty much do what you need, if I understood you correctly:
// Given a textarea with the content
$('textarea').click(function(){
    // Read the current value on each click so that earlier edits are kept
    var text = this.value.split('');
    text.splice(this.selectionStart, 0, " ");
    this.value = text.join('');
});
It's a very simple example and not cross-browser, but it should get you started.
Yes, that will be OK, but set up your event handler on the whole container rather than on the individual spans, and then see the flyweight pattern: http://en.wikipedia.org/wiki/Flyweight_pattern
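A minimal sketch of the span-per-character idea with a single delegated handler might look like this. All function names are made up for illustration; the string logic lives in a pure function so it can be checked separately from the DOM code.

```javascript
// Pure helper: insert a space before the character at `index`.
function splitAt(text, index) {
    if (index <= 0 || index >= text.length) {
        return text; // nothing to split at the edges
    }
    return text.slice(0, index) + ' ' + text.slice(index);
}

// Render one span per character, tagging each span with its position.
function render(container, text) {
    container.textContent = '';
    for (let i = 0; i < text.length; i++) {
        const span = document.createElement('span');
        span.textContent = text[i];
        span.dataset.index = String(i);
        container.appendChild(span);
    }
}

// A single listener on the container handles clicks on every span.
function attachSplitter(container, initialText) {
    let text = initialText;
    render(container, text);
    container.addEventListener('click', function (event) {
        const index = event.target.dataset.index;
        if (index === undefined) {
            return; // the click landed on the container itself
        }
        text = splitAt(text, Number(index));
        render(container, text);
    });
}
```

With a container element holding the string foobar, clicking the b (index 3) calls splitAt with that index and re-renders the result. For very long strings, re-rendering every span on each click may become the bottleneck rather than the number of handlers.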
