I need to extract an entire javascript function from a script file. I know the name of the function, but I don't know what the contents of the function may be. This function may be embedded within any number of closures.
I need to have two output values:
The entire body of the named function that I'm finding in the input script.
The full input script with the found named function removed.
So, assume I'm looking for the findMe function in this input script:
function() {
function something(x,y) {
if (x == true) {
console.log ("Something says X is true");
// The regex should not find this:
console.log ("function findMe(z) { var a; }");
}
}
function findMe(z) {
if (z == true) {
console.log ("Something says Z is true");
}
}
findMe(true);
something(false,"hello");
}();
From this, I need the following two result values:
The extracted findMe script
function findMe(z) {
if (z == true) {
console.log ("Something says Z is true");
}
}
The input script with the findMe function removed
function() {
function something(x,y) {
if (x == true) {
console.log ("Something says X is true");
// The regex should not find this:
console.log ("function findMe(z) { var a; }");
}
}
findMe(true);
something(false,"hello");
}();
The problems I'm dealing with:
The body of the script to find could have any valid javascript code within it. The code or regex to find this script must be able to ignore values in strings, multiple nested block levels, and so forth.
If the function definition to find is specified inside of a string, it should be ignored.
Any advice on how to accomplish something like this?
Update:
It looks like regex is not the right way to do this. I'm open to pointers to parsers that could help me accomplish this. I'm looking at Jison, but would love to hear about anything else.
A regex can't do this. What you need is a tool that parses JavaScript in a compiler-accurate way, builds up a structure representing the shape of the JavaScript code, enables you to find the function you want and print it out, and enables you to remove the function definition from that structure and regenerate the remaining javascript text.
Our DMS Software Reengineering Toolkit can do this, using its JavaScript front end. DMS provides general parsing, abstract syntax tree building/navigating/manipulation, and prettyprinting of (valid!) source text from a modified AST. The JavaScript front end provides DMS with a compiler-accurate definition of JavaScript. You can point DMS/JavaScript at a JavaScript file (or even various kinds of dynamic HTML with embedded script tags containing JavaScript) and have it produce the AST.
A DMS pattern can be used to find your function:
pattern find_my_function(r:type,a: arguments, b:body): declaration
" \r my_function_name(\a) { \b } ";
DMS can search the AST for a matching tree with the specified structure; because this is an AST match and not a string match, line breaks, whitespace, comments and other trivial differences won't fool it. [What you didn't say is what to do if you have more than one function of that name in different scopes: which one do you want?]
Having found the match, you can ask DMS to print just that matched code which acts as your extraction step. You can also ask DMS to remove the function using a rewrite rule:
rule remove_my_function(r:type,a: arguments, b:body): declaration->declaration
" \r my_function_name(\a) { \b } " -> ";";
and then prettyprint the resulting AST. DMS will preserve all the comments properly.
What this does not do is check that removing the function doesn't break your code. After all, it may be in a scope where it directly accesses variables defined locally in the scope. Moving it to another scope now means it can't reference its variables.
To detect this problem, you not only need a parser, but you also need a symbol table that maps identifiers in the code to definitions and uses. The removal rule then has to add a semantic condition to check for this. DMS provides the machinery to build such a symbol table from the AST using an attribute grammar.
To fix this problem, when removing the function, it may be necessary to modify the function to accept additional arguments replacing the local variables it accesses, and modify the call sites to pass in what amounts to references to the local variables. This can be implemented with a modest sized set of DMS rewrite rules, that check the symbol tables.
So removing such a function can be a lot more complex than just moving the code.
If the script is included in your page (something you weren't clear about) and the function is publicly accessible, then you can just get the source to the function with:
functionXX.toString();
https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/toString
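As a quick illustration of the toString() route, here is a minimal sketch. It assumes the function is reachable from the code doing the extraction; findMe below is just a stand-in copied from the question, not something the page is guaranteed to expose.

```javascript
// Stand-in for the function you want to extract (taken from the question).
function findMe(z) {
    if (z == true) {
        console.log("Something says Z is true");
    }
}

// Function.prototype.toString returns the source text of the function.
var source = findMe.toString();
console.log(source);
```

Note that this only gives you the function's text; it does nothing about removing it from the original script.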
Other ideas:
1) Look at the open source code that does either JS minification or JS pretty indent. In both cases, those pieces of code have to "understand" the JS language in order to do their work in a fault tolerant way. I doubt it's going to be pure regex as the language is just a bit more complicated than that.
2) If you control the source at the server and want to modify a particular function in it, then just insert some new JS that replaces that function at runtime with your own function. That way, you let the JS compiler identify the function for you and you just replace it with your own version.
3) For regex, here's what I've done which is not foolproof, but worked for me for some build tools I use:
I run multiple passes (using regex in python):
Remove all comments delineated with /* and */.
Remove all quoted strings
Now, all that's left is non-string, non-comment javascript so you should be able to regex directly on your function declaration
If you need the function source with strings and comments back in, you'll have to reconstitute that from the original, now that you know the beginning and end of the function
Here are the regexes I use (expressed in python's multi-line format):
reStr = r"""
( # capture the non-comment portion
"(?:\\.|[^"\\])*" # capture double quoted strings
|
'(?:\\.|[^'\\])*' # capture single quoted strings
|
(?:[^/\n"']|/[^/*\n"'])+ # any code besides newlines or string literals
|
\n # newline
)
|
(/\* (?:[^*]|\*[^/])* \*/) # /* comment */
|
(?://(.*)$) # // single line comment
$"""
reMultiStart = r""" # start of a multiline comment that doesn't terminate on this line
(
/\* # /*
(
[^\*] # any character that is not a *
| # or
\*[^/] # * followed by something that is not a /
)* # any number of these
)
$"""
reMultiEnd = r""" # end of a multiline comment that didn't start on this line
(
^ # start of the line
(
[^\*] # any character that is not a *
| # or
\*+[^/] # * followed by something that is not a /
)* # any number of these
\*/ # followed by a */
)
"""
import re
regExSingleKeep = re.compile("// /") # lines that have single line comments that start with "// /" are single line comments we should keep
regExMain = re.compile(reStr, re.VERBOSE)
regExMultiStart = re.compile(reMultiStart, re.VERBOSE)
regExMultiEnd = re.compile(reMultiEnd, re.VERBOSE)
This all sounds messy to me. You might be better off explaining what problem you're really trying to solve so folks can help find a more elegant solution to the real problem.
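For what it's worth, the multi-pass idea in 3) can be sketched in JavaScript roughly like this. It is only a sketch under assumptions: the helper names (maskStringsAndComments, extractFunction) are made up for illustration, the end of the function is found by counting braces (my choice, not something the passes above spell out), and it does not understand regex literals, template literals, or default parameter values that contain braces. Instead of deleting strings and comments it blanks them out with spaces, so every offset in the masked text is still valid in the original.

```javascript
// Replace the contents of string literals and comments with spaces, keeping
// newlines and overall length so offsets still map onto the original text.
function maskStringsAndComments(src) {
    var re = /"(?:\\[\s\S]|[^"\\\n])*"|'(?:\\[\s\S]|[^'\\\n])*'|\/\*[\s\S]*?\*\/|\/\/[^\n]*/g;
    return src.replace(re, function (m) {
        return m.replace(/[^\n]/g, " ");
    });
}

// Hypothetical helper: returns { extracted, remainder }, or null if the
// function is not found. `name` is assumed to be a plain identifier.
function extractFunction(src, name) {
    var masked = maskStringsAndComments(src);
    var decl = new RegExp("\\bfunction\\s+" + name + "\\s*\\(");
    var m = decl.exec(masked);
    if (!m) return null;
    var start = m.index;
    var i = masked.indexOf("{", start);
    if (i === -1) return null;
    // Count braces in the masked text until the opening brace is balanced.
    var depth = 0;
    for (; i < masked.length; i++) {
        if (masked[i] === "{") depth++;
        else if (masked[i] === "}" && --depth === 0) break;
    }
    if (depth !== 0) return null;
    var end = i + 1;
    // Slice the ORIGINAL text, so strings and comments come back intact.
    return {
        extracted: src.slice(start, end),
        remainder: src.slice(0, start) + src.slice(end)
    };
}

var sample = 'console.log("function findMe(z) { var a; }");\n' +
             'function findMe(z) { return "}"; }\n' +
             'findMe(true);';
var parts = extractFunction(sample, "findMe");
console.log(parts.extracted);
console.log(parts.remainder);
```

The remainder keeps whatever whitespace surrounded the function, so you may want to tidy up the blank line it leaves behind.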
I built a solution in C# using plain old string methods (no regex) and it works for me with nested functions as well. The underlying principle is in counting braces and checking for unbalanced closing braces. Caveat: This won't work for cases where braces are part of a comment but you can easily enhance this solution by first stripping out comments from the code before parsing function boundaries.
I first added this extension method to extract all indices of matches in a string (Source: More efficient way to get all indexes of a character in a string)
/// <summary>
/// Source: https://stackoverflow.com/questions/12765819/more-efficient-way-to-get-all-indexes-of-a-character-in-a-string
/// </summary>
public static List<int> AllIndexesOf(this string str, string value)
{
if (String.IsNullOrEmpty(value))
throw new ArgumentException("the string to find may not be empty", "value");
List<int> indexes = new List<int>();
for (int index = 0; ; index += value.Length)
{
index = str.IndexOf(value, index);
if (index == -1)
return indexes;
indexes.Add(index);
}
}
I defined this struct for easy referencing of function boundaries:
private struct FuncLimits
{
public int StartIndex;
public int EndIndex;
}
Here's the main function where I parse the boundaries:
public void Parse(string file)
{
List<FuncLimits> funcLimits = new List<FuncLimits>();
List<int> allFuncIndices = file.AllIndexesOf("function ");
List<int> allOpeningBraceIndices = file.AllIndexesOf("{");
List<int> allClosingBraceIndices = file.AllIndexesOf("}");
for (int i = 0; i < allFuncIndices.Count; i++)
{
int thisIndex = allFuncIndices[i];
bool functionBoundaryFound = false;
int testFuncIndex = i;
int lastIndex = file.Length - 1;
while (!functionBoundaryFound)
{
//find the next function index or last position if this is the last function definition
int nextIndex = (testFuncIndex < (allFuncIndices.Count - 1)) ? allFuncIndices[testFuncIndex + 1] : lastIndex;
var q1 = from c in allOpeningBraceIndices where c > thisIndex && c <= nextIndex select c;
var qTemp = q1.Skip<int>(1); //skip the first element as it is the opening brace for this function
var q2 = from c in allClosingBraceIndices where c > thisIndex && c <= nextIndex select c;
int q1Count = qTemp.Count<int>();
int q2Count = q2.Count<int>();
if (q1Count == q2Count && nextIndex < lastIndex)
functionBoundaryFound = false; //next function is a nested function, move on to the one after this
else if (q2Count > q1Count)
{
//we found the function boundary... just need to find the closest unbalanced closing brace
FuncLimits funcLim = new FuncLimits();
funcLim.StartIndex = q1.ElementAt<int>(0);
funcLim.EndIndex = q2.ElementAt<int>(q1Count);
funcLimits.Add(funcLim);
functionBoundaryFound = true;
}
testFuncIndex++;
}
}
}
I am almost afraid that regex cannot do this job. I think it is the same as trying to parse XML or HTML with regex, a topic that has already caused various religious debates on this forum.
EDIT: Please correct me if this is NOT the same as trying to parse XML.
I guess you would have to use and construct a String-Tokenizer for this job.
function tokenizer(str){
var stack = []; // stack of opening-tokens
var last = ""; // last opening-token
// token pairs: subblocks, strings, regex
var matches = {
"}":"{",
"'":"'",
'"':'"',
"/":"/"
};
// start with function declaration
// (note: this also matches a declaration that sits inside a string or comment)
var decl = str.match(/function\s+findMe\s*\([^)]*\)[^{]*\{/);
// nothing to remove if the declaration is not present
if(decl === null){
return str;
}
// move everything before the declaration to result
var result = str.slice(0, decl.index);
// everything after the declaration goes to the stream that will be parsed
var stream = str.slice(decl.index + decl[0].length);
// init stack
stack.push("{");
last = "{";
// while still in this function
while(stack.length > 0){
// determine next token
var found = stream.match(/[{}"'\/\\]/);
if(found === null){
break; // unbalanced input: give up instead of looping forever
}
var needle = found[0];
if(needle == "\\"){
// if this is an escape character => remove escaped character
stream = stream.slice(found.index+2);
continue;
}else if(last == matches[needle]){
// if this ends something pop stack and set last
stack.pop();
last = stack[stack.length-1];
}else if(last == "{"){
// if we are not inside a string (last either " or ' or /)
// push needle to stack
stack.push(needle);
last = needle;
}
// cut away including token
stream = stream.slice(found.index+1);
}
return result + stream;
}
Oh, I forgot tokens for comments... but I guess you get the idea of how it works now.
Related
I'm writing some code that rips string literals out of Typescript/JavaScript source as the first stage of a localisation toolchain I have planned.
The fly in the ointment is string interpolation.
I was on the verge of writing a function that rips the expressions out of an interpolation string and replaces the string with a function call that takes those expressions as parameters.
const a = 5;
const b = 7;
const foo = `first value is ${a + b}, second value is ${a * b}`;
becomes
import { interpolate } from "./my-support-code";
...
const a = 5;
const b = 7;
const foo = interpolate("first value is ${0}, second value is ${1}", [a + b, a * b]);
with the interpolate function working through the array values and replacing strings generated from the ordinal position
function interpolate(template: string, expressions: Array<any>): string {
for (let i = 0; i < expressions.length; i++) {
template = template.replace("${" + i + "}", expressions[i].toString());
}
return template;
}
This will probably work (not yet tried) but it occurred to me that this is probably a thoroughly invented wheel. The question is basically is there a well-established package that does a comprehensive job of this?
I know the above doesn't localise anything. The point is to be rid of interpolation strings so the substitution mechanism can assume that all strings are simple literals. The base language string taken from the above would be "first value is ${0}, second value is ${1}" and translators would be expected to place the tokens appropriately in whatever string they produce.
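One thing to watch with the loop-based interpolate above: because it substitutes one token at a time, a value that itself contains something like ${1} could get expanded on a later pass. A single-pass variant avoids that. This is only a sketch in plain JavaScript (no type annotations), and leaving unknown tokens untouched is my own assumption about the desired behaviour.

```javascript
// Single-pass variant: every ${n} token is looked up by index, so a
// substituted value that happens to contain "${1}" is not expanded again.
function interpolate(template, expressions) {
    return template.replace(/\$\{(\d+)\}/g, function (token, index) {
        var i = Number(index);
        // Leave unknown tokens untouched rather than guessing.
        return i < expressions.length ? String(expressions[i]) : token;
    });
}

var a = 5, b = 7;
var foo = interpolate("first value is ${0}, second value is ${1}", [a + b, a * b]);
console.log(foo); // "first value is 12, second value is 35"
```

Translators can then reorder the tokens freely, for example "${1} comes before ${0}".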
If you're going to tackle this on a non-trivial sized code base, the best you can really do is:
Write a regular expression to identify common types of localization targets and identify them, probably by file + line number.
Add comments to your code in these locations using a keyword that's easy to git grep for, or even something that can be added to your editor's syntax highlighting rules. Personally I use things like // LOCALIZE.
If you're feeling ambitious, you could implement a rewriter that attempts to convert from template form to your localization's template requirements. Each conversion can be individually inspected, altered as required, and introduced. Hopefully you have test coverage to verify your code still works after this.
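The first step might look something like the sketch below. The helper name findInterpolations and the shape of the result are made up for illustration, and the regular expression is deliberately rough: it will be confused by backticks inside ordinary strings or comments and by template literals nested inside ${...}, which is exactly why each hit should be inspected by hand.

```javascript
// Hypothetical helper: report template literals that contain ${...},
// identified by file name and 1-based line number.
function findInterpolations(source, fileName) {
    var re = /`(?:\\[\s\S]|[^`\\])*`/g;
    var hits = [];
    var m;
    while ((m = re.exec(source)) !== null) {
        if (m[0].indexOf("${") !== -1) {
            // Count the newlines before the match to get its line number.
            var line = source.slice(0, m.index).split("\n").length;
            hits.push({ file: fileName, line: line, text: m[0] });
        }
    }
    return hits;
}

var demo = 'const a = 5;\nconst b = 7;\nconst foo = `first value is ${a + b}`;\n';
console.log(findInterpolations(demo, "demo.ts"));
```

The output can be turned into // LOCALIZE comments or simply a worklist.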
I'm new to lexing and parsing so sorry if the title isn't clear enough.
Basically, I'm using Jison to parse some text and I am trying to get the lexer to comprehend indentation. Here's the bit in question:
(\r\n|\r|\n)+\s* %{
parser.indentCount = parser.indentCount || [0];
var indentation = yytext.replace(/^(\r\n|\r|\n)+/, '').length;
if (indentation > parser.indentCount[0]) {
parser.indentCount.unshift(indentation);
return 'INDENT';
}
var tokens = [];
while (indentation < parser.indentCount[0]) {
tokens.push('DEDENT');
parser.indentCount.shift();
}
if (tokens.length) {
return tokens;
}
if (!indentation.length) {
return 'NEWLINE';
}
%}
So far, almost all of that works as expected. The one problem is the line where I attempt to return an array of DEDENT tokens. It appears that Jison is just converting that array into a string which causes me to get a parse error like Expecting ........, got DEDENT,DEDENT.
What I'm hoping I can do to get around this is manually push some DEDENT tokens onto the stack. Maybe with a function like this.pushToken('DEDENT') or something along those lines. But the Jison documentation is not so great and I could use some help.
Any thoughts?
EDIT:
I seem to have been able to hack my way around this after looking at the generated parser code. Here's what seems to work...
if (tokens.length) {
var args = arguments;
tokens.slice(1).forEach(function () {
lexer.performAction.apply(this, args);
}.bind(this));
return 'DEDENT';
}
This tricks the lexer into performing another action using the exact same input for each DEDENT we have in the stack, thus allowing it to add in the proper dedents. However, it feels gross and I'm worried there could be unforeseen problems.
I would still love it if anyone had any ideas on a better way to do this.
After a couple of days I ended up figuring out a better answer. Here's what it looks like:
(\r\n|\r|\n)+[ \t]* %{
parser.indentCount = parser.indentCount || [0];
parser.forceDedent = parser.forceDedent || 0;
if (parser.forceDedent) {
parser.forceDedent -= 1;
this.unput(yytext);
return 'DEDENT';
}
var indentation = yytext.replace(/^(\r\n|\r|\n)+/, '').length;
if (indentation > parser.indentCount[0]) {
parser.indentCount.unshift(indentation);
return 'INDENT';
}
var dedents = [];
while (indentation < parser.indentCount[0]) {
dedents.push('DEDENT');
parser.indentCount.shift();
}
if (dedents.length) {
parser.forceDedent = dedents.length - 1;
this.unput(yytext);
return 'DEDENT';
}
return 'NEWLINE';
%}
Firstly, I modified my capture regex to make sure I wasn't inadvertently capturing extra newlines after a series of non-newline spaces.
Next, we make sure there are 2 "global" variables. indentCount will track our current indentation length. forceDedent will force us to return a DEDENT if it has a value above 0.
Next, we have a condition to test for a truthy value on forceDedent. If we have one, we'll decrement it by 1 and use the unput function to make sure we iterate on this same pattern at least one more time, but for this iteration, we'll return a DEDENT.
If we haven't returned, we get the length of our current indentation.
If the current indentation is greater than our most recent indentation, we'll track that on our indentCount variable and return an INDENT.
If we haven't returned, it's time to prepare for possible dedents. We'll make an array to track them.
When we detect a dedent, the user could be attempting to close 1 or more blocks all at once. So we need to include a DEDENT for as many blocks as the user is closing. We set up a loop and say that for as long as the current indentation is less than our most recent indentation, we'll add a DEDENT to our list and shift an item off of our indentCount.
If we tracked any dedents, we need to make sure all of them get returned by the lexer. Because the lexer can only return 1 token at a time, we'll return 1 here, but we'll also set our forceDedent variable to make sure we return the rest of them as well. To make sure we iterate on this pattern again and those dedents can be inserted, we'll use the unput function.
In any other case, we'll just return a NEWLINE.
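To convince yourself the bookkeeping works, the same logic can be modelled in plain JavaScript without Jison at all. This is a simulation, not lexer code: createIndentTracker, matchOnce and tokensFor are names invented for the sketch, each call to matchOnce stands for one match of the rule, and the rematch flag stands in for this.unput(yytext).

```javascript
function createIndentTracker() {
    var indentCount = [0];   // stack of open indentation widths, innermost first
    var forceDedent = 0;     // DEDENTs still owed from an earlier match

    // One call models one match of the lexer rule. `rematch` is true where
    // the real rule would unput the text so the same input is seen again.
    function matchOnce(indentation) {
        if (forceDedent) {
            forceDedent -= 1;
            return { token: "DEDENT", rematch: true };
        }
        if (indentation > indentCount[0]) {
            indentCount.unshift(indentation);
            return { token: "INDENT", rematch: false };
        }
        var dedents = 0;
        while (indentation < indentCount[0]) {
            dedents += 1;
            indentCount.shift();
        }
        if (dedents) {
            forceDedent = dedents - 1;
            return { token: "DEDENT", rematch: true };
        }
        return { token: "NEWLINE", rematch: false };
    }

    // Collect every token produced for a single line break.
    return function tokensFor(indentation) {
        var tokens = [];
        var step;
        do {
            step = matchOnce(indentation);
            tokens.push(step.token);
        } while (step.rematch);
        return tokens;
    };
}

var tokensFor = createIndentTracker();
console.log(tokensFor(4)); // ["INDENT"]
console.log(tokensFor(8)); // ["INDENT"]
console.log(tokensFor(8)); // ["NEWLINE"]
console.log(tokensFor(0)); // ["DEDENT", "DEDENT", "NEWLINE"]
```

Closing two blocks at once yields two DEDENT tokens followed by a NEWLINE, one token per match, which is what the forceDedent counter is there for.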
I can't seem to find an example of anyone using RegEx matches to create an overlay in CodeMirror. The Moustaches example matching one thing at a time seems simple enough, but in the API, it says that the RegEx match returns the array of matches and I can't figure out what to do with it in the context of the structure in the moustaches example.
I have a regular expression which finds all the elements I need to highlight: I've tested it and it works.
Should I be loading up the array outside of the token function and then matching each one? Or is there a way to work with the array?
The other issue is that I want to apply different styling depending on the (biz|cms) option in the regex - one for 'biz' and another for 'cms'. There will be others but I'm trying to keep it simple.
This is as far as I have got. The comments show my confusion.
CodeMirror.defineMode("tbs", function(config, parserConfig) {
var tbsOverlay = {
token: function(stream, state) {
tbsArray = match("^<(biz|cms).([a-zA-Z0-9.]*)(\s)?(\/)?>");
if (tbsArray != null) {
for (i = 0; i < tbsArray.length; i++) {
var result = tbsArray[i];
//Do I need to stream.match each element now to get hold of each bit of text?
//Or is there some way to identify and tag all the matches?
}
}
//Obviously this bit won't work either now - even with regex
while (stream.next() != null && !stream.match("<biz.", false)) {}
return null;
}
};
return CodeMirror.overlayMode(CodeMirror.getMode(config, parserConfig.backdrop || "text/html"), tbsOverlay);
});
It returns the array as produced by RegExp.exec or String.prototype.match (see for example https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/String/match), so you probably don't want to iterate through it, but rather pick out the specific elements that correspond to groups in your regexp (if (result[1] == "biz") ...)
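Here is roughly what picking out a group looks like, using plain String.prototype.match so the sketch runs on its own; inside the overlay's token function the same kind of array would come from stream.match when it is given a regular expression. The regex literal is the question's pattern with the dot after biz/cms escaped (an assumption about the intent), and the style names tbs-biz and tbs-cms are made up.

```javascript
// Regex-literal form of the pattern from the question.
var tagPattern = /^<(biz|cms)\.([a-zA-Z0-9.]*)(\s)?(\/)?>/;

// Hypothetical style names; use whatever classes your theme defines.
function styleFor(text) {
    var result = text.match(tagPattern);
    if (result === null) return null;
    // result[0] is the whole match, result[1] the first group, and so on.
    return result[1] === "biz" ? "tbs-biz" : "tbs-cms";
}

console.log(styleFor("<biz.customer.name>")); // "tbs-biz"
console.log(styleFor("<cms.page />"));        // "tbs-cms"
console.log(styleFor("<div>"));               // null
```

Writing the pattern as a regex literal also avoids the problem that "\s" inside an ordinary string literal is just the letter s.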
Look at the implementation of CodeMirror's match() method and you'll see that it handles two parameter types: string and RegExp.
Your constant in
stream.match("<biz.")
is of string type.
Pass a RegExp instead. The CodeMirror manual says a regular expression given to match() should start with ^:
stream.match(/^<biz\./)
Thus, your stream will be matched against the RegExp.
I want to represent an object that has several text properties, every one representing the same text value but in different languages. In case the user modifies a single field, the other fields should be revised, and I'm thinking of adding a single Unicode character at the beginning of the string of the other fields, and then to check for fields that need attention, I just have to check the value at obj.text_prop[0].
Which Unicode character can I use for this purpose? Ideally, it would be non-printable, supported in JS and JSON.
Such flagging should be done some other way, at a protocol level other than character level. For example, consider making each language version an object rather than just a string; the object could then have properties such as needsAttention in addition to the property that contains the string.
But in case you need to embed such information into a string, then you could use ZERO WIDTH SPACE U+200B. As such it means line break opportunity, but that should not cause trouble here. The main problem is probably that old versions of IE may display it as a small rectangle.
Alternatively, you could use a noncharacter code point such as U+FFFF, if you can make sure that the string is never sent anywhere from the program without removing this code point. As described in Ch. 16 of the Unicode Standard, Special Areas and Format Characters, noncharacter code points are reserved for internal use in an application and should never be used in text interchange.
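If you do go the in-string route, the bookkeeping could look something like this. It is a minimal sketch using U+200B as the marker; the helper names and the text_en / text_de properties are invented for the example.

```javascript
var FLAG = "\u200B"; // ZERO WIDTH SPACE used as the "needs attention" marker

function needsAttention(text) {
    return text.charAt(0) === FLAG;
}
function flag(text) {
    // Avoid stacking several markers on the same string.
    return needsAttention(text) ? text : FLAG + text;
}
function unflag(text) {
    return needsAttention(text) ? text.slice(1) : text;
}

var obj = { text_en: "Hello", text_de: flag("Willkommen") };
// The marker survives a JSON round trip.
var copy = JSON.parse(JSON.stringify(obj));
console.log(needsAttention(copy.text_de)); // true
console.log(needsAttention(copy.text_en)); // false
console.log(unflag(copy.text_de));         // "Willkommen"
```

Remember to call unflag before the text is shown to a user or sent anywhere else.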
I would suggest you not to use strange characters in the beginning of the line. You can implement something like this:
<script type="text/javascript">
function LocalizationSet()
{
// Per-instance maps keyed by language id; on the prototype they would be shared by every instance.
this.localizationItems = {};
this.itemsNeedAttention = {};
}
LocalizationSet.prototype.setLocalization = function(langId, text)
{
this.localizationItems[langId] = text;
this.itemsNeedAttention[langId] = true;
}
LocalizationSet.prototype.getLocalization = function(langId)
{
return this.localizationItems[langId];
}
LocalizationSet.prototype.needsAttention = function(langId)
{
if(this.itemsNeedAttention[langId] == null)
{
return false;
}
return this.itemsNeedAttention[langId];
}
LocalizationSet.prototype.unsetAttentionFlags = function()
{
for(var it in this.itemsNeedAttention)
{
this.itemsNeedAttention[it] = false;
}
}
//Example
var set = new LocalizationSet();
set.setLocalization("en","Hello");
set.setLocalization("de","Willkommen");
alert(set.needsAttention("en"));
alert(set.needsAttention("de"));
set.unsetAttentionFlags();
alert(set.needsAttention("en"));
set.setLocalization("en","Hi");
alert(set.needsAttention("en"));
//Shows true,true,false,true
</script>
2015 Edit Don't do this. Be a good person and Just Use JSON.parse() :)
I am trying to take a string which contains variables and values in a javascript-like syntax, and store them in a global object (gv). My issue is just with the parsing of the string.
String (everything inside the <div>):
<div id="gv">
variableName = "variableValue,NoSpacesThough";
portal = "TheCakeIsALie";
</div>
Script (parses string above, places values into global object):
var s = (document.getElementById("gv").innerHTML).split(';');
for (var i = 0; i < s.length; i++) {
if (s[i] !== "\n" || "") {
s[i] = s[i].replace(/^\s*/gm, "");
var varName = s[i].substr(0, s[i].indexOf('=') - 1),
varValue = (s[i].substr((s[i].indexOf('"') + 1), s[i].length)).replace('"', "");
gv[varName] = varValue;
}
}
Result:
console.log(gv.variableName); //returns: variableValue,NoSpacesThough
console.log(gv.portal); //returns: TheCakeIsALie
Q: How can I modify this script to correctly store these variables:
exampleVariable = { name: "string with spaces", cake:lie };
variableName = "variableValue,NoSpacesThough";
portal = "The Cake Is A Lie";
The directly above has an object containing: A string with spaces (and "), a reference
Thanks.
Four options / thoughts / suggestions:
1. Use JSON
If you're in control of the source format, I'd recommend using JSON rather than rolling your own. Details on that page. JSON is now part of the ECMAScript (JavaScript) standard with standard methods for creating JSON strings from object graphs and vice-versa. With your example:
exampleVariable = { name: "string with spaces", cake:lie };
variableName = "variableValue,NoSpacesThough";
portal = "The Cake Is A Lie";
here's what the JSON equivalent would look like:
{
"exampleVariable": { "name": "string with spaces", "cake": "lie" },
"variableName": "variableValue,NoSpacesThough",
"portal": "The Cake Is A Lie"
}
As you can see, the only differences are:
You wrap the entire thing in curly braces ({}).
You put the "variable" names (property names) in double quotes, including the ones inside nested objects such as name and cake.
You use a colon rather than an equal sign after the property name.
You use a comma rather than a semicolon to separate properties (just as in the object literal you have on your exampleVariable line).
You ensure that any string values use double, rather than single, quotes (JavaScript allows either; JSON is more restrictive). Your example uses double quotes, but I mention it just in case...
You replace any bare reference such as lie with a literal value, because JSON has no way to refer to a variable; above it becomes the string "lie".
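Once the data is valid JSON (quoted property names, no bare references), reading it is a one-liner with the standard JSON object. A minimal sketch; the text is built from string pieces here only to keep the example self-contained, in practice it would come from your div or from the server.

```javascript
var text = '{' +
    '"exampleVariable": { "name": "string with spaces", "cake": "lie" },' +
    '"variableName": "variableValue,NoSpacesThough",' +
    '"portal": "The Cake Is A Lie"' +
    '}';

// JSON.parse turns the text into an object graph...
var gv = JSON.parse(text);
console.log(gv.portal);               // "The Cake Is A Lie"
console.log(gv.exampleVariable.name); // "string with spaces"

// ...and JSON.stringify goes the other way.
var roundTrip = JSON.stringify(gv);
console.log(roundTrip);
```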
2. Pre-process it into JSON with regular expressions
If you're not in control of the source format, but it's exactly as you've shown, you could reformat it as JSON fairly easily via regular expressions, and then deserialize it with the JSON stuff. But if the format is more complicated than you've quoted, that starts getting hairy very quickly.
Here's an example (live copy) of transforming what you've quoted to JSON:
function transformToJSON(str) {
var rexSplit = /\r?\n/g,
rexTransform = /^\s*([a-zA-Z0-9_]+)\s*=\s*(.+);\s*$/g,
rexAllWhite = /\s+/g,
lines,
index,
line,
result;
lines = str.split(rexSplit);
index = 0;
while (index < lines.length) {
line = lines[index];
if (line.replace(rexAllWhite, '').length === 0) {
// Blank line, remove it
lines.splice(index, 1);
}
else {
// Transform it
lines[index] = line.replace(rexTransform, '"$1": $2');
++index;
}
}
result = "{\n" + lines.join(",\n") + "\n}";
return result;
}
...but beware as, again, that relies on the format being exactly as you showed, and in particular it relies on each value being on a single line and any string values being in double quotes (a requirement of JSON). You'll probably need to handle complexities the above doesn't handle, but you can't do it with things like your first line var s = (document.getElementById("gv").innerHTML).split(';');, which will break lines on ; regardless of whether the ; is within quotes...
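To show what "regardless of whether the ; is within quotes" means in practice, here is a small sketch of a split that does respect double-quoted strings. The function name splitStatements is made up, and it only knows about double quotes and backslash escapes; it is not a substitute for real parsing.

```javascript
// Split on semicolons that are outside double-quoted strings.
function splitStatements(text) {
    var parts = [];
    var current = "";
    var inString = false;
    for (var i = 0; i < text.length; i++) {
        var ch = text.charAt(i);
        if (inString && ch === "\\") {
            // Keep the escape and the character it escapes together.
            current += ch + text.charAt(i + 1);
            i++;
        } else if (ch === '"') {
            inString = !inString;
            current += ch;
        } else if (ch === ";" && !inString) {
            if (current.trim() !== "") parts.push(current.trim());
            current = "";
        } else {
            current += ch;
        }
    }
    if (current.trim() !== "") parts.push(current.trim());
    return parts;
}

console.log(splitStatements('portal = "The;Cake"; variableName = "x";'));
// two entries: portal = "The;Cake" and variableName = "x"
```

A plain split(';') on the same input would cut the first value in half.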
3. Actually parse it by modifying a JSON parser to support your format
If you can't change the format, and it's less precise than the examples you've quoted, you'll have to get into actual parsing; there are no shortcuts (well, no reliable ones). Actually parsing JavaScript literals (I'm assuming there are no expressions in your data, other than the assignment expression of course) isn't that bad. You could probably take a JSON parser and modify it to your needs, since it will already have nearly all the logic for literals. There are two on Crockford's github page (Crockford being the inventor of JSON), one using recursive descent and another using a state machine. Take your choice and start hacking.
4. The evil eval
I suppose I should mention eval here, although I don't recommend you use it. eval runs arbitrary JavaScript code from a string. But because it runs any code you give it, it's not a good choice for deserializing things like this, and any free variables (like the ones you've quoted) would end up being globals. Really very ugly, I mostly mention it in order to say: Don't use it. :-)