Use JavaScript string operations to cut out exact text - javascript

I'm trying to cut some text out of a scraped site and I'm not sure which functions or libraries would make this easier.
Here is an example of the code I run from PhantomJS:
var latest_release = page.evaluate(function () {
// everything inside this function is executed inside our
// headless browser, not PhantomJS.
var links = $('[class="interesting"]');
var releases = {};
for (var i=0; i<links.length; i++) {
releases[links[i].innerHTML] = links[i].getAttribute("href");
}
// It's important to note that page.evaluate needs
// to return a simple object, meaning DOM elements won't work.
return JSON.stringify(releases);
});
The elements with class interesting contain what I need, surrounded by newlines, tabs and whatnot.
Here it is:
{"\n\t\t\t\n\t\t\t\tI_Am_Interesting\n\t\t\t\n\t\t":null,"\n\t\t\t\n\t\t\t\tI_Am_Interesting\n\t\t\t\n\t\t":null,"\n\t\t\t\n\t\t\t\tI_Am_Interesting\n\t\t\t\n\t\t":null}
I tried string.slice("\n"); and nothing happened. I really want an effective way to cut out strings like this, based on their position relative to those \n's and \t's.
By the way, this was my split code:
var x = latest_release.split('\n');
Cheers.

It's a simple case of stripping out all the whitespace, a job that regexes do beautifully.
var s = " \n\t\t\t\n\t\t\t\tI Am Interesting\n\t\t \t \n\t\t";
s = s.replace(/[\r\t\n]+/g, ''); // remove all non-space whitespace
s = s.replace(/^\s+/, ''); // remove all whitespace from the front
s = s.replace(/\s+$/, ''); // remove all whitespace at the end :)
console.log(s);
Further reading: https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/RegExp
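Applied to the question's data, the same idea can be sketched as follows. This is only an illustration: trimKeys and the scraped sample object are hypothetical names, and it assumes String.prototype.trim is available in the environment. Trimming before JSON.stringify matters because stringify escapes each newline into a backslash followed by n, so splitting the returned string on "\n" finds nothing.

```javascript
// Hypothetical helper: returns a copy of `releases` whose keys have
// leading and trailing whitespace removed.
function trimKeys(releases) {
  var cleaned = {};
  Object.keys(releases).forEach(function (key) {
    cleaned[key.trim()] = releases[key];
  });
  return cleaned;
}

// Sample data shaped like one entry of the scraped result in the question.
var scraped = { "\n\t\t\t\n\t\t\t\tI_Am_Interesting\n\t\t\t\n\t\t": null };

console.log(JSON.stringify(trimKeys(scraped))); // {"I_Am_Interesting":null}
```

Inside page.evaluate, the same effect could be had by calling .trim() on links[i].innerHTML before using it as a key.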

var interesting = {
"\n\t\t\t\n\t\t\t\tI_Am_Interesting1\n\t\t\t\n\t\t":null,
"\n\t\t\t\n\t\t\t\tI_Am_Interesting2\n\t\t\t\n\t\t":null,
"\n\t\t\t\n\t\t\t\tI_Am_Interesting3\n\t\t\t\n\t\t":null
}
var found = [];
for (var x in interesting) {
found.push(x.match(/\w+/g));
}
alert(found);

Could you try "\\n" as the pattern? The \n in your JSON string may be a literal backslash followed by n rather than a newline character.

// a string pattern replaces only the first occurrence, so use regexes with the g flag
var new_string = string.replace(/\n/g, "").replace(/\t/g, "");

Related

Matching css selectors with RegExp doesn't work in browser

I'm trying to match CSS selectors, as can be seen here: https://regex101.com/r/kI3rW9/1. It matches the test string as desired; however, when I load a .js file to test it in the browser, it fails in both Firefox and Chrome.
The .js file:
window.onload = function() {
main();
}
main = function() {
var regexSel = new RegExp('([\.|#][a-zA-Z][a-zA-Z0-9.:_-]*) ?','g');
var text = "#left_nav .buildings #rfgerf .rtrgrgwr .rtwett.ww-w .tw:ffwwwe";
console.log(regexSel.exec(text));
}
In the browser it returns: ["#left_nav ", "#left_nav", index: 0, input: "#left_nav .buildings #rfgerf .rtrgrgwr .rtwett.ww-w .tw:ffwwwe"]
So it appears to capture only the first selector, with and without the whitespace, despite the whitespace being outside the () and the global flag being set.
Edit:
So either looping over RegExp.exec(text) or just using String.match(str) will lead to the correct solution. Thanks to Wiktor's answer, I was able to implement a convenient way of calling this functionality:
function Selector(str){
this.str = str;
}
with(Selector.prototype = new String()){
toString = valueOf = function () {
return this.str;
};
}
Selector.prototype.constructor = Selector;
Selector.prototype.parse = function() {
return this.match(/([\.|#][a-zA-Z][a-zA-Z0-9.:_-]*) ?/g);
}
//Using it the following way:
var text = new Selector("#left_nav .buildings #rfgerf .rtrgrgwr .rtwett.ww-w .tw:ffwwwe");
console.log(text.parse());
However, I decided to use
/([\.|#][a-zA-Z][a-zA-Z0-9.:_-]*) ?/g over the suggested
/([.#][a-zA-Z][a-zA-Z0-9.:_-]*)(?!\S)/g because it matches in 44 vs. 60 steps on regex101.com on my test string.
You ran exec once, so you got one match object. You'd need to run it inside a loop.
var regexSel = new RegExp('([\.|#][a-zA-Z][a-zA-Z0-9.:_-]*) ?','g');
var text = "#left_nav .buildings #rfgerf .rtrgrgwr .rtwett.ww-w .tw:ffwwwe";
var m;
while ((m = regexSel.exec(text)) !== null) {
console.log(m[1]);
}
A regex with a (?!\S) negative lookahead at the end (which fails the match if a non-whitespace character immediately follows your main consuming pattern) will allow simpler code:
var text = "#left_nav .buildings #rfgerf .rtrgrgwr .rtwett.ww-w .tw:ffwwwe";
console.log(text.match(/[.#][a-zA-Z][a-zA-Z0-9.:_-]*(?!\S)/g));
Note that you should consider using regex literal notation when defining static regexps. Prefer the RegExp constructor only when your patterns are dynamic, that is, built from variables, or contain so many / characters that escaping them becomes tedious.
Also look at [.#]: inside a character class the dot does not have to be escaped, and | is treated as a literal pipe symbol (not as an alternation operator).
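The two points above can be checked with a short snippet. It is a sketch on a shortened test string; the variable names withSpace and withLookahead are made up for the illustration.

```javascript
var text = "#left_nav .buildings #rfgerf";

// Inside a character class, | is a literal pipe, so the original class also accepts "|".
console.log(/[\.|#]/.test("|")); // true
console.log(/[.#]/.test("|"));   // false

// With the optional trailing space in the pattern, a match may include that space.
var withSpace = text.match(/([\.|#][a-zA-Z][a-zA-Z0-9.:_-]*) ?/g);

// With the negative lookahead, the match stops before the whitespace.
var withLookahead = text.match(/[.#][a-zA-Z][a-zA-Z0-9.:_-]*(?!\S)/g);

console.log(JSON.stringify(withSpace));     // ["#left_nav ",".buildings ","#rfgerf"]
console.log(JSON.stringify(withLookahead)); // ["#left_nav",".buildings","#rfgerf"]
```

If the trailing spaces are unwanted, the lookahead version avoids a separate trimming step.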

JavaScript/jQuery manipulate and then replace all links in my HTML content

I am trying to write a script that, after the page load, will replace all my existing links with links in a different format.
However, while I've managed to work out how to do the link string manipulation, I'm stuck on how to actually replace it on the page.
I have the following code, which gets all the links from the page and loops through them, using a regular expression to check whether each one matches my pattern. If it does, the code takes the name information out of the link and builds the new link structure. This part all works; it's the next stage, doing the replacement, where I'm stuck.
var str;
var fn;
var ln;
var links = document.getElementsByTagName("a");
for(var i=0; i<links.length; i++) {
str = links[i].href.match(/\/Services\/(.*?)\/People\/(.*?(?=\.aspx))/gi);
if (links[i].href.match(/\/Services\/(.*?)\/People\/(.*?(?=\.aspx))/gi)) {
var linkSplit = links[i].href.split("/");
// Get the last one (so the .aspx and then split again).
// Now split again on the .
var fileNameSplit = linkSplit[linkSplit.length-1].split(".");
var nameSplit = fileNameSplit[0].split(/(?=[A-Z])/);
fn = nameSplit[0];
ln = nameSplit[1];
if(nameSplit[2]){
ln += nameSplit[2];
}
// Build replacement string
var replacementUrl = 'https://www.testsite.co.uk/services/people.aspx?fn='+fn+'&sn='+ln;
// Do the actual replacement
links[i].href.replace(links[i].href, replacementUrl);
}
}
I've tried a couple of different solutions to make it do the actual replacement, .replace, .replaceWith, and I've tried using a split/join to replace a string with an array that I found here - Using split/join to replace a string with an array
var html = document.getElementsByTagName('html')[0];
var block = html.innerHTML;
var replace_str = links[i].href;
var replace_with = replacementUrl;
var rep_block = block.split(replace_str).join(replace_with);
I've read these, but had no success applying the same logic:
Javascript: How do I change every word visible on screen?
jQuery replace all href="" with onclick="window.location="
How can I fix this problem?
It's simpler than that:
links[i].href = replacementUrl;
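String.prototype.replace returns a new string and never changes the element, which is why assigning to the href property is what updates the link. Put together, the whole thing might look like the sketch below. buildReplacementUrl is a hypothetical helper name, the use of encodeURIComponent is an addition, and joining every name part after the first into the surname is an assumption that slightly generalises the question's code.

```javascript
// Hypothetical helper: returns the rewritten URL, or null when href does not match.
function buildReplacementUrl(href) {
  var m = /\/Services\/.*?\/People\/(.*?)\.aspx/i.exec(href);
  if (m === null) {
    return null;
  }
  // Split "JohnSmith" into ["John", "Smith"] before each capital letter.
  var nameParts = m[1].split(/(?=[A-Z])/);
  var fn = nameParts[0];
  var ln = nameParts.slice(1).join('');
  return 'https://www.testsite.co.uk/services/people.aspx?fn=' +
    encodeURIComponent(fn) + '&sn=' + encodeURIComponent(ln);
}

// In the browser, assign the result back to each anchor's href property.
if (typeof document !== 'undefined') {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    var replacement = buildReplacementUrl(links[i].href);
    if (replacement !== null) {
      links[i].href = replacement;
    }
  }
}
```

Keeping the string manipulation in a separate function makes it possible to test it without a page.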

removing BBcode from textarea with Javascript

I'm creating a small JavaScript snippet for a phpBB3 forum that counts how many characters you have typed in.
But I need to remove the special characters (which I managed to do) and one BBCode: quote.
My problem lies with the quote, and with the fact that I don't know much about regex.
This is what I have managed to do so far, but I'm stuck:
http://jsfiddle.net/emjkc/
var text = '';
var char = 0;
text = $('textarea').val();
text = text.replace(/[&\/\\#,+()$~%.'":*?<>{}!?(\r\n|\n|\r)]/gm, '');
char = text.length;
$('div').text(char);
$('textarea').bind('input propertychange', function () {
text = $(this).val();
text = text.replace(/[&\/\\#,+()$~%.'":*?<>{}!?\-\–_;(\r\n|\n|\r)]/gm, '');
char = text.length;
$('div').text(char);
});
You'd be better off writing a parser for that; however, if you want to try with regexes, this should do the trick:
text = $('textarea').val();
while (text.match(/\[quote.*\[\/quote\]/i) != null) {
// remove the innermost [quote]...[/quote] pair found on each line
text = text.replace(/^(.*)\[quote.*?\[\/quote\](.*)$/gmi, '$1$2');
}
// now strip anything that is not a letter or a digit
text = text.replace(/[^a-z0-9]/gmi, '');
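For comparison, the parser route mentioned above does not need to be large. The following is a minimal sketch, not a full BBCode parser: stripQuotes is a hypothetical name, it only tracks the nesting depth of quote tags, and it assumes that text after an unclosed [quote] should be dropped as well.

```javascript
// Hypothetical helper: removes everything inside [quote]...[/quote],
// including nested quotes and quotes that span several lines.
function stripQuotes(text) {
  var tagRe = /\[(\/?)quote[^\]]*\]/gi;
  var out = '';
  var depth = 0; // how many quote tags are currently open
  var last = 0;  // index just after the previous tag
  var m;
  while ((m = tagRe.exec(text)) !== null) {
    if (depth === 0) {
      out += text.slice(last, m.index); // keep text outside any quote
    }
    if (m[1] === '/') {
      depth = Math.max(0, depth - 1); // closing tag; stray closers are ignored
    } else {
      depth++;
    }
    last = tagRe.lastIndex;
  }
  if (depth === 0) {
    out += text.slice(last); // keep the tail only when no quote is left open
  }
  return out;
}

var sample = "a [quote=x]b [quote]c[/quote] d[/quote] e";
console.log(stripQuotes(sample));                                // "a  e"
console.log(stripQuotes(sample).replace(/[^a-z0-9]/gi, '').length); // 2
```

Because the scanner only looks at tags, it is not affected by newlines inside a quote.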
I'm not sure if this would work, but I think you can replace all bbcodes with a regex like this:
var withoutBBCodes = message.replace(/\[[^\]]*\]/g,"");
It just replaces everything like [any char != ']' goes here]
EDIT: sorry, didn't see that you only want to replace [quote] and not all bbcodes:
var withoutBBQuote = message.replace(/\[[\/]*quote[^\]]*\]/g,"");
EDIT: ok, you also want quoted content removed:
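var previous;
do {
previous = message;
// remove quote blocks that contain no nested quote tags; repeat until nothing changes
message = message.replace(/\[quote[^\]]*\]((?!\[\/?quote)[\s\S])*\[\/quote\]/gi, "");
} while (message !== previous);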
I know you already got a solution thanks to @guido, but I didn't want to leave this answer wrong.

Javascript to extract *.com

I am looking for a javascript function/regex to extract *.com from a URI... (to be done on client side)
It should work for the following cases:
siphone.com = siphone.com
qwr.siphone.com = siphone.com
www.qwr.siphone.com = siphone.com
qw.rock.siphone.com = siphone.com
<http://www.qwr.siphone.com> = siphone.com
Much appreciated!
Edit: Sorry, I missed a case:
http://www.qwr.siphone.com/default.htm = siphone.com
I guess this regex should work for a few cases:
/[\w]+\.(com|ca|org|net)/
I'm not good with JavaScript, but there should be a library for splitting URIs out there, right?
According to that link, here's a "strict" regex:
/^(?:([^:\/?#]+):)?(?:\/\/((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?))?((((?:[^?#\/]*\/)*)([^?#]*))(?:\?([^#]*))?(?:#(.*))?)/
As you can see, you're better off just using the "library". :)
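In current browsers and Node.js, the built-in URL constructor can play the role of that library. The sketch below is an illustration only: lastTwoLabels is a hypothetical name, the URL API is not part of the original answers, and taking the last two host labels is an assumption that gives the wrong result for suffixes such as co.uk.

```javascript
// Hypothetical helper: returns the last two labels of the host name.
function lastTwoLabels(input) {
  // Drop the angle brackets that wrap one of the sample inputs.
  var cleaned = input.replace(/^<|>$/g, '');
  // The URL constructor needs a scheme, so add one when it is missing.
  if (!/^[a-z][a-z0-9+.-]*:\/\//i.test(cleaned)) {
    cleaned = 'http://' + cleaned;
  }
  var hostname = new URL(cleaned).hostname; // throws a TypeError on invalid input
  return hostname.split('.').slice(-2).join('.');
}

var samples = [
  'siphone.com',
  'qwr.siphone.com',
  'www.qwr.siphone.com',
  'qw.rock.siphone.com',
  '<http://www.qwr.siphone.com>',
  'http://www.qwr.siphone.com/default.htm'
];
samples.forEach(function (s) {
  console.log(s + ' = ' + lastTwoLabels(s)); // each line ends in siphone.com
});
```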
This should do it. I added a few cases for some nonmatches.
var cases = [
"siphone.com",
"qwr.siphone.com",
"www.qwr.siphone.com",
"qw.rock.siphone.com",
"<http://www.qwr.siphone.com>",
"hamstar.corm",
"cheese.net",
"bro.at.me.come",
"http://www.qwr.siphone.com/default.htm"];
var grabCom = function(str) {
// require a non-word character or the end of the string after ".com"
var result = str.match(/(\w+\.com)(?:\W|$)/);
if (result !== null) {
return result[1];
}
return null;
};
for(var i = 0; i < cases.length; i++) {
console.log(grabCom(cases[i]));
}
var myStrings = [
'siphone.com',
'qwr.siphone.com',
'www.qwr.siphone.com',
'qw.rock.siphone.com',
'<http://www.qwr.siphone.com>'
];
for (var i = 0; i < myStrings.length; i++) {
document.write( myStrings[i] + '=' + myStrings[i].match(/[\w]+\.(com)/gi) + '<br><br>');
}
I've placed the given demo strings in the myStrings array.
i is the index used to iterate through this array. The following line does the matching trick:
myStrings[i].match(/[\w]+\.(com)/gi)
and returns the value siphone.com. If you'd like to match .net and so on, use (com|net|other) instead of just (com).
You may also find the following link useful: Regular expressions Cheat Sheet
Update: the missed case works too %)
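You could split the string and then search the parts for com, like so:
function hasCom(url) {
var parts = url.split('.');
for (var i = 0; i < parts.length; i++) {
// compare the part itself; for...in would yield the index, not the value
if (parts[i] == 'com') {
return true;
}
}
return false;
}
hasCom('music.google.com'); // true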
var uri = "foo.bar.baz.com";
uri.split(".").slice(-2).join("."); // returns baz.com
This assumes that you want just the domain name and TLD. It also assumes that there is no path information.
Update: now that you also need to handle URIs with paths, strip the scheme and the path first, so that dots in the path are not counted:
uri.replace(/^\w+:\/\//, "").split("/")[0].split(".").slice(-2).join(".");
Use a regexp to do that. This way, the detection is easy to modify later.
var url = 'www.siphone.com';
var domain = url.match(/[^.]+\.com/i)[0];
If you use url.match(/(([^.]+)\.com)(?:[^a-z]|$)/i)[1] instead, you can ensure that the ".com" is not followed by any other letters.

Matching all excerpts which starts and ends with specific words

I have a text which looks like:
some non interesting part
trans-top
body of first excerpt
trans-bottom
next non interesting part
trans-top
body of second excerpt
trans-bottom
non interesting part
And I want to extract all excerpts starting with trans-top and ending with trans-bottom into an array. I tried this:
match(/(?=trans-top)(.|\s)*/g)
to find strings that start with trans-top, and it works. Now I want to specify the end:
match(/(?=trans-top)(.|\s)*(?=trans-bottom)/g)
and it doesn't. Firebug gives me an error:
regular expression too complex
I tried many other ways, but I can't find a working solution... I'm sure I made some stupid mistake :(.
This works pretty well, but it's not all in one regex:
var test = "some non interesting part\ntrans-top\nbody of first excerpt\ntrans-bottom\nnext non interesting part\ntrans-top\nbody of second excerpt\ntrans-bottom\nnon interesting part";
var matches = test.match(/(trans-top)([\s\S]*?)(trans-bottom)/gm);
for(var i=0; i<matches.length; i++) {
matches[i] = matches[i].replace(/^trans-top|trans-bottom$/gm, '');
}
console.log(matches);
If you don't want the leading and trailing linebreaks, change the line inside the loop to:
matches[i] = matches[i].replace(/^trans-top[\s\S]|[\s\S]trans-bottom$/gm, '');
That should eat the linebreaks.
This tested function uses one regex and loops through the matches, collecting the contents of each one in an array, which it returns:
function getParts(text) {
var a = [];
var re = /trans-top\s*([\S\s]*?)\s*trans-bottom/g;
var m = re.exec(text);
while (m != null) {
a.push(m[1]);
m = re.exec(text);
}
return a;
}
It also filters out any leading and trailing whitespace surrounding each match's contents.
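Where String.prototype.matchAll is available (ES2020 and later, which is an assumption about the environment), the same regex can be used without the explicit loop. The function name getPartsWithMatchAll is made up for this sketch. The lazy [\s\S]*? also avoids the (.|\s)* construct from the question, which is the likely cause of the "too complex" error because both alternatives can match the same whitespace characters.

```javascript
// Hypothetical variant of getParts that relies on matchAll.
function getPartsWithMatchAll(text) {
  var re = /trans-top\s*([\s\S]*?)\s*trans-bottom/g;
  // matchAll yields one match object per excerpt; keep only capture group 1.
  return Array.from(text.matchAll(re), function (m) {
    return m[1];
  });
}

var sampleText = "some non interesting part\ntrans-top\nbody of first excerpt\n" +
  "trans-bottom\nnext non interesting part\ntrans-top\nbody of second excerpt\n" +
  "trans-bottom\nnon interesting part";

console.log(getPartsWithMatchAll(sampleText));
// [ 'body of first excerpt', 'body of second excerpt' ]
```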
