Changing text node contents globally in document - javascript

So I have this script that I'm using to change text node contents in JS. I run this script in Greasemonkey:
(function() {
var replacements, regex, key, textnodes, node, s;
replacements = {
"facebook": "channelnewsasia",
"Teh": "The",
"TEH": "THE",
};
regex = {};
for (key in replacements) {
regex[key] = new RegExp(key, 'g');
}
textnodes = document.evaluate( "//body//text()", document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i = 0; i < textnodes.snapshotLength; i++) {
node = textnodes.snapshotItem(i);
s = node.data;
for (key in replacements) {
s = s.replace(regex[key], replacements[key]);
}
node.data = s;
}
})();
Which works really well.
My problem is that I'm trying to change the value 0 to 75, but the script also changes other 0s on the page, such as those inside dates (today's date, for example).
That is not what I want: I ONLY want to change 0s that stand by themselves. How do I go about this?
Thanks for any help.

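Well, you need a regular expression for that. Your script already treats every key as a regex, so you only have to add a suitable pattern to the list of replacements.
A regular expression that matches a single zero looks like this: ([^0-9]|^)0([^0-9]|$). It requires a non-digit, or the beginning/end of the string, on each side of the zero. Note that those neighbouring characters are part of the match, so they are captured here and have to be written back by the replacement.
You can put this into your list of replacements ($01 and $2 re-insert the captured neighbours; the two-digit form $01 keeps the group reference from running into the digits of 75):
replacements = {
"([^0-9]|^)0([^0-9]|$)": "$0175$2",
};
Alternatively, if the zeroes are always separated by spaces, the word-boundary expression \b0\b is enough (written as "\\b0\\b" inside a string; the replacement is then simply "75").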
Notes on your code:
Remember that with your system every replacement key is treated as a regex. So don't forget to escape characters like (, [ or . in the keys when you want to match those characters literally.
Wrapping the whole userscript in a self-invoking function expression is unnecessary; the userscript's variable scope is already hidden from the global scope.
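As a rough illustration of the single-zero pattern, here is a minimal sketch that runs on a plain string rather than on the page's text nodes. The function name replaceLoneZeros is made up for this example, and it uses a lookahead on the trailing side (a small variation on the pattern above) so that the character after the zero is not consumed.

```javascript
// Hypothetical helper: replace zeros that are not part of a longer number.
function replaceLoneZeros(text, replacement) {
  // (^|[^0-9]) captures the character before the zero (or the start of the string);
  // (?![0-9]) only checks that no digit follows, without consuming anything.
  return text.replace(/(^|[^0-9])0(?![0-9])/g, function (match, before) {
    return before + replacement; // write the captured neighbour back
  });
}

console.log(replaceLoneZeros("Score: 0 (updated 2020-10-05)", "75"));
// Score: 75 (updated 2020-10-05)
```

Using a function as the replacement avoids any ambiguity between the group reference and the digits of the new value.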

Related

Trimming whitespace without affecting strings

So, I recently found this example on trimming whitespace, but I've found that it also affects strings in code. For instance, say I'm doing a lesson on string comparison, and to demonstrate that "Hello World!" and "Hello  World!" (with two spaces) are different, I need the code compression to not have any effect on those two strings.
I'm using the whitespace compression so that people with different formatting styles won't be punished for using something that I don't use. For instance, I like to format my functions like this:
function foo(){
return 0;
};
While others may format it like this:
function foo()
{
return 0;
};
So I use whitespace compression around punctuation to make sure it always comes out the same, but I don't want it to affect anything within a string. Is there a way to add exceptions in JavaScript's replace() function?
UPDATE:
check this jsfiddle
var str='dfgdfg fdgfd fd gfd g print("Hello World!"); sadfds dsfgsgdf'
var regex=/(?:(".*"))|(\s+)/g;
var newStr=str.replace(regex, '$1 ');
console.log(newStr);
console.log(str);
This code processes everything except the quoted strings.
To experiment with it more comfortably, you can see how the regex works here:
https://regex101.com/r/tG5qH2/1
I made a jsfiddle here: https://jsfiddle.net/cuywha8t/2/
var stringSplitRegExp = /(".+?"|'.+?')/g;
var whitespaceRegExp = /\s+\{/g;
var whitespaceReplacement = "{"
var exampleCode = `var str = "test test test" + 'asdasd "sd"';\n`+
`var test2 = function()\n{\nconsole.log("This is a string with 'single quotes'")\n}\n`+
`console.log('this is a string with "double quotes"')`;
console.log(exampleCode)
var separatedStrings =(exampleCode.split(stringSplitRegExp))
for(var i = 0; i < separatedStrings.length; i++){
if (i%2 === 1){
continue;
}
var oldString = separatedStrings[i];
separatedStrings[i] = oldString.replace(whitespaceRegExp, whitespaceReplacement)
}
console.log(separatedStrings.join(""))
I believe this is what you are looking for. It handles cases where a string contains the other kind of quotes (double quotes inside single quotes, etc.) without modifying the string. This example only does the formatting of the curly braces, as you mentioned in your post.
Basically, when the separator regex contains a capturing group, split includes the captured separators in the resulting array. And since you know that each captured string always sits between two non-string elements, you can leverage this by looping over the array and modifying only the even-indexed elements.
If you want to do general whitespace replacement you can of course modify the regexp or do multiple passes, etc.
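For general whitespace compression, one alternative to splitting is a single replace() call with a callback: the pattern matches either a whole quoted string or a run of whitespace, and the callback returns strings untouched. This is only a sketch under a few assumptions: the name compressWhitespace is hypothetical, string literals are delimited by ' or " with backslash escapes, and template literals, comments and regex literals are not handled.

```javascript
// Hypothetical helper: collapse whitespace runs outside of quoted strings.
function compressWhitespace(code) {
  // Alternative 1 (captured): a double- or single-quoted string, allowing backslash escapes.
  // Alternative 2: one or more whitespace characters.
  var pattern = /("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')|\s+/g;
  return code.replace(pattern, function (match, quoted) {
    // When the string alternative matched, return it unchanged.
    return quoted !== undefined ? quoted : " ";
  });
}

console.log(compressWhitespace('if (a ==   "Hello   World!")\n{\n  return 0;\n}'));
// if (a == "Hello   World!") { return 0; }
```

Because strings are matched before whitespace at every position, the spaces inside "Hello   World!" never reach the whitespace branch.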

replaceText() RegEx "not followed by"

Any ideas why this simple RegEx doesn't seem to be supported in a Google Docs script?
foo(?!bar)
I'm assuming that Google Apps Script uses the same RegEx as JavaScript. Is this not so?
I'm using the RegEx as such:
DocumentApp.getActiveDocument().getBody().replaceText('foo(?!bar)', 'hello');
This generates the error:
ScriptError: Invalid regular expression pattern foo(?!bar)
As discussed in comments on this question, this is a documented limitation; the replaceText() method doesn't support negative lookaheads or capture groups.
A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.
Serge suggested a work-around, "it should be possible to manipulate your document at a lower level (extracting text from paragraph etc) but it could rapidly become quite cumbersome."
Here's what that could look like. If you don't mind losing all formatting, this example applies capture groups, RegExp flags (i for case-insensitivity) and a negative lookahead to change:
Little rabbit Foo Foo, running through the foobar.
to:
Little rabbit Fred Fred, running through the foobar.
Code:
function myFunction() {
var body = DocumentApp.getActiveDocument().getBody();
var paragraphs = body.getParagraphs();
for (var i=0; i<paragraphs.length; i++) {
var text = paragraphs[i].getText();
paragraphs[i].replaceText(".*", text.replace(/(f)oo(?!bar)/gi, '$1red') );
}
}
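The lookahead itself behaves as expected in plain JavaScript, which is what this workaround relies on. A quick sketch on an ordinary string, outside of Apps Script:

```javascript
var input = "Little rabbit Foo Foo, running through the foobar.";
// (f) captures the first letter so its case is preserved;
// (?!bar) rejects any "foo" that is immediately followed by "bar".
var output = input.replace(/(f)oo(?!bar)/gi, "$1red");
console.log(output);
// Little rabbit Fred Fred, running through the foobar.
```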
You have a sequence which you can match with a regular expression, but that regular expression will also match one or more things which you do not want to change. The generalized solution to this situation is to:
1. Change the text such that you have known sequences of characters which are definitely not used. Effectively, this gives you sequences of characters which you use as variables to hold the values you don't want to change. Personally, I would use:
body.replaceText('Q','Qz');
Which will make it such that there is no sequence in your document which matches /Q[^z]/. This results in you being able to use sequences like Qa to represent some text you don't want to change. I use Q because it has a low frequency of use in English. You can use any character. For efficiency, choose a character which results in a low number of changes within the text you are affecting.
2. Change the things you don't want to end up changing to one of the character sequences you now know are unused. For example:
body.replaceText('foobar','Qa');
Repeat this for whatever additional items you don't want to end up changing.
3. Change the text you really want to change. In this example:
body.replaceText('foo','hello'.replace(/Q/g,'Qz'));
Note that you need to apply to the new replacement text the first substitution which you used to open up known unused sequences.
4. Restore all of the things you did not want to change to their original state:
body.replaceText('Qa','foobar');
5. Restore the text you used to open up unused character sequences:
body.replaceText('Qz','Q');
All together that would be:
var body = DocumentApp.getActiveDocument().getBody();
body.replaceText('Q','Qz'); //Open up unused character sequences
body.replaceText('foobar','Qa'); //Save the things you don't want to change.
//In the general case, you need to apply to the new text the same substitution
// which you used to open up unused character sequences. If you don't you
// may end up with those sequences being changed in the new text.
body.replaceText('foo','hello'.replace(/Q/g,'Qz')); //Make the change you desire.
body.replaceText('Qa','foobar'); //Restore the things you saved.
body.replaceText('Qz','Q'); //Restore the original sequence.
While solving the problem this way does not allow you to use all the features of JavaScript RegExp (e.g. capture groups, look-ahead assertions, and flags), it should preserve the formatting within your document.
You can choose not to perform steps 1 and 5 above by picking a longer sequence of characters to use to represent the text which you do not want to match (e.g. kNoWn1UnUsEd). However, such a longer sequence is something that must be selected based on your knowledge of what already exists in the document. Doing that can save a couple of steps, but you either have to search for an unused string or accept that there is some probability that the string you use is already in the document, which would result in an undesired substitution.
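The five steps can be tried out on an ordinary JavaScript string before touching a document. The sketch below is not Apps Script code: replaceAllLiteral and replaceFooButNotFoobar are hypothetical helpers, and split/join stands in for body.replaceText() with literal search text.

```javascript
// Hypothetical helper: replace every literal occurrence of `search` in `text`.
function replaceAllLiteral(text, search, replacement) {
  return text.split(search).join(replacement);
}

function replaceFooButNotFoobar(text, newText) {
  text = replaceAllLiteral(text, "Q", "Qz");       // 1. open up unused sequences
  text = replaceAllLiteral(text, "foobar", "Qa");  // 2. hide what must not change
  // 3. make the real change; the new text gets the same Q -> Qz substitution
  text = replaceAllLiteral(text, "foo", replaceAllLiteral(newText, "Q", "Qz"));
  text = replaceAllLiteral(text, "Qa", "foobar");  // 4. restore the hidden text
  return replaceAllLiteral(text, "Qz", "Q");       // 5. restore the original Q characters
}

console.log(replaceFooButNotFoobar("Q: is foo in the foobar?", "hello"));
// Q: is hello in the foobar?
```

Passing a replacement such as "Qa" shows why step 3 escapes the new text: without that substitution, step 4 would turn the freshly inserted "Qa" into "foobar".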
I figured out a way to get most of the functionality of JavaScript's str.replace(), including capture groups and smart replacers, in Apps Script without messing up the style. The trick is to use JavaScript's regex.exec() function together with Apps Script's text.deleteText() and text.insertText() functions.
function replaceText(body, regex, replacer, attribute){
var content = body.getText();
const text = body.editAsText();
var match = "";
while (true){
content = body.getText();
var oldLength = content.length;
match = regex.exec(content);
if (match === null){
break;
}
var start = match.index;
var end = regex.lastIndex - 1;
text.deleteText(start, end);
text.insertText(start, replacer(match, regex));
var newLength = body.getText().length;
var replacedLength = oldLength - newLength;
var newEnd = end - replacedLength;
text.setAttributes(start, newEnd, attribute);
regex.lastIndex -= replacedLength;
}
}
Argument explanations:
body: the body of the document you want to operate on
regex: the normal JS regular expression object used as a search pattern
replacer: the replacer function that returns the replacement string; it automatically receives two arguments:
I. match: match object generated by regex.exec() and
II. regex: the regular expression object used as a search pattern
attribute: An Apps Script attribute object
For example, if you want to apply bold style to new strings replacing the old ones, you can create a boldStyle attribute object:
var boldStyle = {};
boldStyle[DocumentApp.Attribute.BOLD] = true;
Tips:
How can I use capture groups in replaceText()?
You can access all capture groups from the replacer function, match[0] is the whole string matched, match[1] is the first capture group, match[2] is the second, etc.
How can I access the index and position of the match in replaceText()?
You can access the start index of the match (match.index) and the end index of the match (regex.lastIndex, which points just past the last matched character) from the replacer function.
For more in-depth reference of JS RegExp, see this excellent tutorial from Javascript.info.
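To make the roles of match.index and regex.lastIndex concrete, here is a small sketch in plain JavaScript with no document involved. The helper name listMatches is hypothetical, and it requires a regex created with the g flag, since lastIndex only advances for global (or sticky) regexes.

```javascript
// Hypothetical helper: collect every match with its capture group and position.
function listMatches(regex, content) {
  if (!regex.global) throw new Error("regex needs the g flag");
  var results = [];
  var match;
  regex.lastIndex = 0; // start from the beginning of the text
  while ((match = regex.exec(content)) !== null) {
    results.push({
      whole: match[0],      // the entire matched text
      group1: match[1],     // the first capture group
      start: match.index,   // index of the first matched character
      end: regex.lastIndex  // index just past the last matched character
    });
    // Avoid an infinite loop on zero-length matches.
    if (match[0] === "") regex.lastIndex++;
  }
  return results;
}

console.log(listMatches(/\*\*(.+?)\*\*/g, "a **b** c **dd**"));
```

For this input the first match starts at index 2 and ends at 7, and the second starts at index 10 and ends at 16.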
Example:
Here's an example use case of the replaceText() function. It's a simple implementation of a Markdown to Google Docs conversion script:
function markdownToDocs() {
const body = DocumentApp.getActiveDocument().getBody();
// Use editAsText to obtain a single text element containing
// all the characters in the document.
const text = body.editAsText();
// e.g. replace "**string**" with "string" (bolded)
var boldStyle = {};
boldStyle[DocumentApp.Attribute.BOLD] = true;
replaceDeliminaters(body, "\\*\\*", boldStyle, false);
// e.g. replace multiline "```line 1\nline 2\nline 3```" with "line 1\nline 2\nline 3" (with gray background highlight)
var blockHighlightStyle = {};
blockHighlightStyle[DocumentApp.Attribute.BACKGROUND_COLOR] = "#EEEEEE";
replaceDeliminaters(body, "```", blockHighlightStyle, true);
// e.g. replace inline "`console.log("hello world")`" with "console.log("hello world")" (in "Times New Roman" font and italic)
var inlineStyle = {};
inlineStyle[DocumentApp.Attribute.FONT_FAMILY] = "Times New Roman";
inlineStyle[DocumentApp.Attribute.ITALIC] = true;
replaceDeliminaters(body, "`", inlineStyle, false);
// feel free to change all the styling and markdown deliminaters as you wish.
}
// replace markdown deliminaters like "**", "`", and "```"
function replaceDeliminaters(body, deliminator, attributes, multiline){
var capture;
if (multiline){
capture = "([\\s\\S]+?)"; // capture newline characters as well
} else{
capture = "(.+?)"; // do not capture newline characters
}
const regex = new RegExp(deliminator + capture + deliminator, "g");
const replacer = function(match, regex){
return match[1]; // return the first capture group
}
replaceText(body, regex, replacer, attributes);
}
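The delimiter patterns built in replaceDeliminaters can be checked in plain JavaScript. The sketch below only mirrors how the regex is assembled (buildDelimiterRegex is a hypothetical name) and strips the delimiters with String.replace; it does not touch a document or apply any styling.

```javascript
// Hypothetical helper mirroring the regex construction used above.
function buildDelimiterRegex(delimiter, multiline) {
  // [\s\S] matches any character including newlines; . stops at line breaks.
  var capture = multiline ? "([\\s\\S]+?)" : "(.+?)";
  return new RegExp(delimiter + capture + delimiter, "g");
}

var fence = "`".repeat(3); // the three-backtick delimiter

var bold = buildDelimiterRegex("\\*\\*", false);
var boldResult = "some **bold** text".replace(bold, "$1");
console.log(boldResult); // some bold text

var block = buildDelimiterRegex(fence, true);
var blockResult = (fence + "line 1\nline 2" + fence).replace(block, "$1");
console.log(blockResult === "line 1\nline 2"); // true

// Without the multiline capture, the same input does not match at all.
var singleLine = buildDelimiterRegex(fence, false);
console.log(singleLine.test(fence + "line 1\nline 2" + fence)); // false
```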

Return the part of the regex that matched

In a regular expression that uses OR (pipe), is there a convenient method for getting the part of the expression that matched?
Example:
/horse|caMel|TORTOISe/i.exec("Camel");
returns Camel. What I want is caMel.
I understand that I could loop through the options instead of using one big regular expression; that would make far more sense. But I'm interested to know if it can be done this way.
Very simply, no.
Regex matches have to do with your input string and not the text used to create the regular expression. Note that that text might well be lost, and theoretically is not even necessary. An equivalent matcher could be built out of something like this:
var test = function(str) {
var text = str.toLowerCase();
return text === "horse" || text === "camel" || text === "tortoise";
};
Another way to think of it is that the compilation of regular expressions can divorce the logic of the function from their textual representation. It's one-directional.
Sorry.
There is no built-in way to do this with the JavaScript RegExp object without changing your expression. The closest you can get is source, which just returns the entire expression as a string.
Since you know your expression is a series of | ORs, you could use capturing groups to figure out which group matched, and combine that with .source to find out the contents of that group:
var exp = /(horse)|(caMel)|(TORTOISe)/i;
var result = exp.exec("Camel");
var match = function(){
for(var i = 1; i < result.length; i++){
if(result[i]){
return exp.source.match(new RegExp('(?:[^(]*\\((?!\\?\\:)){' + i + '}([^)]*)'))[1];
}
}
}();
// match == caMel
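If the alternatives are known up front, the same capture-group idea can be sketched without parsing exp.source: build the regex from an array and map the index of the group that matched back to that array. The name buildAlternativeMatcher is hypothetical, and the sketch assumes the alternatives contain no capture groups of their own.

```javascript
function buildAlternativeMatcher(alternatives, flags) {
  // Wrap each alternative in its own capture group: (horse)|(caMel)|(TORTOISe)
  var source = alternatives.map(function (alt) { return "(" + alt + ")"; }).join("|");
  var regex = new RegExp(source, flags);
  return function (input) {
    regex.lastIndex = 0; // keep results stable if a g or y flag was passed
    var result = regex.exec(input);
    if (result === null) return null;
    for (var i = 1; i < result.length; i++) {
      // Only the group belonging to the alternative that matched is defined.
      if (result[i] !== undefined) return alternatives[i - 1];
    }
    return null;
  };
}

var whichAlternative = buildAlternativeMatcher(["horse", "caMel", "TORTOISe"], "i");
console.log(whichAlternative("Camel")); // caMel
```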
It is also extremely easy (although somewhat impractical) to write a RegExp engine from scratch, to which you could technically add that functionality. It would be much slower than using an actual RegExp object, since the whole engine would have to be interpreted at run-time. It would, however, be able to return exactly the matched portion of the expression for any regular expression and not be limited to one which consists of a series of | ORs.
The best way to solve your problem, however, is probably not to use a loop or a regular expression at all, but instead to create an object where you use a canonical form for the key:
var matches = {
'horse': 'horse',
'camel': 'caMel',
'tortoise': 'TORTOISe'
};
// Test "Camel"
matches['Camel'.toLowerCase()]; // "caMel"
This will give the wanted value without looping:
var foo, pat, tres, res, reg = /horse|caMel|TORTOISe/i;
foo = reg.exec('Camel');
if (foo) {
foo = foo[0].replace(/\./g, '\\.');
pat = new RegExp('\\|' + foo + '\\|', 'i');
tres = '|' + reg.source + '|';
res = tres.match(pat)[0].replace(/\|/g, '');
}
alert(res);
If there's no match, res is undefined, though it's easy to change that to something else.

Why doesn't this RegExp work and which notation is more standards compliant?

Disclaimer: I realize asking "Why doesn't my regular expression work" is pretty amateur.
I have looked at the documentation, though I'm just plain struggling. I have a url (as a string) and what I want is to replace the placeholders (i.e. {objectID} and {queryTerm}).
For a while now, I've been making attempts like this:
var _serviceURL = "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}";
var re1 = new RegExp("/{([A-Za-z])+}","gi");
var re2 = new RegExp("/{([A-Za-z]+)}+","gi");
var re3 = new RegExp("/{([A-Za-z])+}","gi");
var re4 = new RegExp("/({[A-Za-z]+})+","gi");
var re5 = new RegExp("({[A-Za-z]+})+","gi");
var re6 = new RegExp("({[A-Za-z]}+)*","g");
var re6a = new RegExp("({([a-z]+)})+","gi");
var re7 = /{([^}]+)}/g;
var tokens = re6a.exec(_serviceURL);
if (null != tokens.length ){
for(i = 0; i < tokens.length; i++){
var t = tokens[i];
console.log("tokens[i]: " + t);
}
}
else {
console.log("RegEx fail...")
}
re6a above produces an array like this upon execution:
tokens: Array[3]
0: "{objectID}"
1: "{objectID}"
2: "objectID"
Related to the scenario above:
Why is it I'm never getting the queryTerm ?
Does the RegExp 'i' (ignore case) flag mean I can list a character class like [a-z] and also capture [A-Z] ?
Which method of constructing a RegExp object is better? ...new RegExp(...); or var regExp = /{([^}]+)}/g; . In terms of "what's better", what I mean is cross-browser compatibility and as similar to other RegEx implementations (if I'm learning RegEx, I want to get the most value I can out of it).
Does the RegExp i (ignore case) flag mean I can list a character class like [a-z] and also capture [A-Z]?
Yes, it'll capture them all.
Which method of constructing a RegExp object is better? new RegExp(...) or var regExp = /{([^}]+)}/g;? In terms of "what's better", what I mean is cross-browser compatibility and as similar to other RegEx implementations (if I'm learning RegEx, I want to get the most value I can out of it).
You should definitely use the literal notation.
A regex literal is compiled once, when the script is loaded, whereas new RegExp(...) compiles the pattern at run time every time the constructor is called.
They're both equally cross browser compatible.
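One practical difference between the two notations is escaping: a pattern passed to new RegExp(...) is first parsed as a string, so backslashes must be doubled, and no surrounding slashes belong in it. Several of the attempts in the question start the string with "/", which then becomes part of the pattern. A short sketch (the variable names are only for illustration):

```javascript
var literal = /\{([^}]+)\}/g;
var constructed = new RegExp("\\{([^}]+)\\}", "g"); // same pattern, backslashes doubled
console.log(literal.source === constructed.source); // true

// A leading "/" inside the string is not a delimiter; it is matched literally.
var withSlash = new RegExp("/\\{([^}]+)\\}");
console.log(withSlash.test("{objectID}"));  // false: no "/" before the brace
console.log(withSlash.test("/{objectID}")); // true
```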
All that said, I'd use this:
_serviceURL.match(/[^{}]+(?=})/g);
Here's the fiddle: http://jsfiddle.net/CAugU/
Here's an explanation of the above regex:
[ opens the character set
^ negates the set. Will only match whatever is NOT in these brackets
{} match anything that is NOT a curly brace
] close the character set
+ match that as many times as possible
(?= ascertain that it is possible to match the following here (won't be included in the match, this is called a lookahead)
} match a curly brace
) close the lookahead
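As for why only the first placeholder shows up in the question: a single exec() call returns one match (the whole match followed by its capture groups), even with the g flag. To collect every placeholder with a capture-group pattern, exec() has to be called repeatedly. A minimal sketch, using the URL from the question (extractPlaceholders is a hypothetical name):

```javascript
function extractPlaceholders(url) {
  var pattern = /\{([^}]+)\}/g; // group 1 is the name between the braces
  var names = [];
  var match;
  // With the g flag, each exec() call resumes where the previous match ended.
  while ((match = pattern.exec(url)) !== null) {
    names.push(match[1]);
  }
  return names;
}

var _serviceURL = "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}";
console.log(extractPlaceholders(_serviceURL)); // [ 'objectID', 'queryTerm' ]
```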
As you're going to replace the placeholders, it seems more natural to use replace rather than match, for example:
var _serviceURL = "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}"
var values = {
objectID: 1234,
queryTerm: "hello"
}
var result = _serviceURL.replace(/{(.+?)}/g, function($0, $1) {
return values[$1]
})
yields http://my-server.com/rest-services/someObject/1234/entries?term=hello
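A slightly more defensive variant of the same idea, as a sketch: leave unknown placeholders in place instead of inserting "undefined", and URL-encode the substituted values. fillTemplate is a hypothetical name, and whether encoding is wanted depends on where the placeholder sits in the URL.

```javascript
function fillTemplate(template, values) {
  return template.replace(/\{([^}]+)\}/g, function (placeholder, name) {
    // Keep the placeholder as-is when no value is available for it.
    if (!Object.prototype.hasOwnProperty.call(values, name)) return placeholder;
    return encodeURIComponent(String(values[name]));
  });
}

var filledURL = fillTemplate(
  "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}",
  { objectID: 1234, queryTerm: "hello world" }
);
console.log(filledURL);
// http://my-server.com/rest-services/someObject/1234/entries?term=hello%20world
```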

Javascript regular expression to replace word but not within curly brackets

I have some content, for example:
If you have a question, ask for help on StackOverflow
I have a list of synonyms:
a={one typical|only one|one single|one sole|merely one|just one|one unitary|one small|this solitary|this slight}
ask={question|inquire of|seek information from|put a question to|demand|request|expect|inquire|query|interrogate}
I'm using JavaScript to:
Split synonyms based on =
Loop through every synonym and, if the word is found in the content, replace it with {...|...}
The output should look like:
If you have {one typical|only one|one single|one sole|merely one|just one|one unitary|one small|this solitary|this slight} question, {question|inquire of|seek information from|put a question to|demand|request|expect|inquire|query|interrogate} for help on StackOverflow
Problem:
Instead of replacing the entire word, it's replacing every character found. My code:
for(syn in allSyn) {
var rtnSyn = allSyn[syn].split("=");
var word = rtnSyn[0];
var synonym = (rtnSyn[1]).trim();
if(word && synonym){
var match = new RegExp(word, "ig");
postProcessContent = preProcessContent.replace(match, synonym);
preProcessContent = postProcessContent;
}
}
It should replace each whole word in the content with its synonym list, but not words that are already inside a {...|...} group.
When you build the regexps, you need to include word boundary anchors at both the beginning and the end to match whole words (beginning and ending with characters from [a-zA-Z0-9_]) only:
var match = new RegExp("\\b" + word + "\\b", "ig");
Depending on the specific replacements you are making, you might want to apply your method to individual words (rather than to the entire text at once) matched using a regexp like /\w+/g to avoid replacing words that themselves are the replacements for others. Something like:
content = content.replace(/\w+/g, function(word) {
for(var i = 0, L = allSyn.length; i < L; ++i) {
var rtnSyn = allSyn[i].split("=");
var synonym = (rtnSyn[1]).trim();
if(synonym && rtnSyn[0].toLowerCase() == word.toLowerCase()) return synonym;
}
return word; // no synonym found: keep the word unchanged
});
Regular expressions include something called a "word-boundary", represented by \b. It is a zero-width assertion (it just checks something, it doesn't "eat" input) that says in order to match, certain word boundary conditions have to apply. One example is a space followed by a letter; given the string ' X', this regex would match it: / \bX/. So to make your code work, you just have to add word boundaries to the beginning and end of your word regex, like this:
for(syn in allSyn) {
var rtnSyn = allSyn[syn].split("=");
var word = rtnSyn[0];
var synonym = (rtnSyn[1]).trim();
if(word && synonym){
var match = new RegExp("\\b"+word+"\\b", "ig");
postProcessContent = preProcessContent.replace(match, synonym);
preProcessContent = postProcessContent;
}
}
[Note that there are two backslashes in each of the word boundary matchers because in javascript strings, the backslash is for escape characters -- two backslashes turns into a literal backslash.]
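A small sketch of the word-boundary fix on a plain string. The escapeRegExp helper is an addition that is not in the original code: it assumes the words may contain characters that are special in regular expressions and escapes them before the pattern is built. replaceWholeWord is likewise a hypothetical name.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function replaceWholeWord(content, word, synonym) {
  var match = new RegExp("\\b" + escapeRegExp(word) + "\\b", "ig");
  // A function replacement inserts the synonym verbatim ($ has no special meaning).
  return content.replace(match, function () { return synonym; });
}

console.log(replaceWholeWord("If you have a question, ask for help", "a", "{one|just one}"));
// If you have {one|just one} question, ask for help
```

Without the \b anchors, the same call would also replace the a inside "have" and "ask".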
For optimization, don't create a new RegExp on each iteration. Instead, build one big regex like [^{A-Za-z](a|ask|...)[^}A-Za-z] and a hash with a value for each key specifying what to replace it with. I'm not familiar enough with JavaScript to write the code off the top of my head.
Note the separator regex which says the match cannot begin with { or end with }. This is not terribly precise, but hopefully acceptable in practice. If you genuinely need to replace words next to { or } then this can certainly be refined, but I'm hoping we won't have to.
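A sketch of the single-regex idea in JavaScript. It differs from the suggested pattern in one respect, which is an assumption of this example rather than part of the answer: instead of checking the characters around each word, it matches complete {...} groups first and returns them unchanged, so words inside existing groups are skipped. applySynonyms and escapeRegExp are hypothetical names, and the keys of the synonym map are assumed to be lower-case.

```javascript
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function applySynonyms(content, synonyms) {
  var words = Object.keys(synonyms).map(escapeRegExp);
  if (words.length === 0) return content;
  // Alternative 1: an existing {...} group (left untouched).
  // Alternative 2 (captured): any of the words, as a whole word.
  var pattern = new RegExp("\\{[^}]*\\}|\\b(" + words.join("|") + ")\\b", "gi");
  return content.replace(pattern, function (match, word) {
    return word === undefined ? match : synonyms[word.toLowerCase()];
  });
}

var synonyms = { a: "{one typical|only one}", ask: "{question|inquire of}" };
console.log(applySynonyms("If you have a question, ask for help on StackOverflow", synonyms));
// If you have {one typical|only one} question, {question|inquire of} for help on StackOverflow
```

Because everything happens in one pass, text inserted for one word is never rescanned for another word.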
