Minification: Using regex to remove linebreaks from JavaScript code - javascript

For the pure purpose of obfuscation, the first three lines seem to clean up the script pretty nicely from unnecessary enters.
Can anyone tell me what the lines 1 - 4 actually do? Only thing I know from trial and error is that if I comment out the fourth line the site works, if I leave it in place the site breaks.
<?php
header("Content-type: text/javascript; charset=UTF-8");
ob_start("compress");
function compress($buffer)
{
# remove extra or unneccessary new line from javascript
$buffer = preg_replace('/([;])\s+/', '$1', $buffer);
$buffer = preg_replace('/([}])\s+(else)/', '$1else', $buffer);
$buffer = preg_replace('/([}])\s+(var)/', '$1;var', $buffer);
$buffer = preg_replace('/([{};])\s+(\$)/', '$1\$', $buffer);
return $buffer;
}
Is there a better way to remove one or multiple line enters from JavaScript?

Dissection of all four regular expressions
Let's try and dissect each one of the regular expressions.
First regex
$buffer = preg_replace('/([;])\s+/', '$1', $buffer);
Explanation
( # beginning of the first capturing group
[;] # match the literal character ';'
) # ending of the first capturing group
\s+ # one or more whitespace characters (including newlines)
The above regular expression removes any whitespace that occurs immediately following a semicolon. ([;]) is a capturing group, meaning if a match is found, it is stored into a backreference, so we could use it later. For example, if our string was foo; <space><space>, then the expression would match ; and the whitespace characters. The replacement pattern here is $1, which means the entire matched string would be replaced with just a semicolon.
Second regex
$buffer = preg_replace('/([}])\s+(else)/', '$1else', $buffer);
Explanation
( # beginning of the first capturing group
[}] # match the literal character ';'
) # ending of the first capturing group
\s+ # one or more whitespace characters
(else) # match and capture 'else'
The above regex removes any whitespace between a closing curly brace (}) and else. The replacement pattern here is $1else, which means, the string with whitespace will get replaced by what was captured by the first capturing group ([}]) (which is just the semicolon) followed by the keyword else. Nothing much to it.
Third regex
$buffer = preg_replace('/([}])\s+(var)/', '$1;var', $buffer);
Explanation
( # beginning of the first capturing group
[}] # match the literal character ';'
) # ending of the first capturing group
\s+ # one or more whitespace characters
(var) # match and capture 'var'
This is the same as previous regex. The only difference here is the keyword - var instead of else. The semicolon character is optional in JavaScript. But if you want to write multiple statements in a single line, there's no way for the interpreter to know they're multiple lines, so a ; will need to be used to terminate each statement.
Fourth regex
$buffer = preg_replace('/([{};])\s+(\$)/', '$1\$', $buffer);
Explanation
( # beginning of the first capturing group
[{};] # match the literal character '{' or '}' or ';'
) # ending of the first capturing group
\s+ # one or more whitespace characters
( # beginning of the second capturing group
\$ # match the literal character '$'
) # ending of the second capturing group
The replacement pattern here is $1\$, which means the entire matched string would be replaced with what was matched by the first capturing group ([{};]) followed by a literal $ character.
Sidenote
This answer was only meant to explain the four regexes and what it does. The expressions could be improved a lot, but I'm not going into that as it's not the correct approach. As Qtax points out in the comments, you really should use a proper JS minifier to achieve this task. You might want to check out Google's Closure Compiler - it looks pretty neat.
If you're still confused how it works, don't worry. Learning regexes can be difficult in the beginning. I suggest you use this website - http://regularexpressions.info. It is a pretty decent resource for learning regular expressions. If you're looking for a book, you might want to check out Mastering Regular Expressions By Jeffrey Friedl.

Related

Regex consume a character if it matches, but not otherwise

I am trying to write a regex expression which will capture all instances of the '#' character, except when two such characters appear in succession (essentially, an escape sequence). For example:
abd#ajk: # should be matched
abd##ajk: No matches
abd###ajk: The final # should match.
abd####ajk: No matches
This almost works with the negative lookahead expression #(?!#), except that because the second # is not consumed, the last of two # symbols will still be matched. What I think I want to do is to lookahead for an # but consume the character if it is there; otherwise, do not consume it. Is this possible?
Edit: I'm using Javascript which unfortunately rules out several good approaches :(
In JavaScript, to split strings at an unescaped #, you may actually match chunks of text that is either ## (an escaped #) and any chars other than #:
var strs = ['abd#ajk','abd##ajk','abd###ajk','abd####ajk'];
var rx = /(?:[^#]|##)+/g;
for (var s of strs) {
console.log(s, "=>", s.match(rx))
}
The regex is
/(?:[^#]|##)+/g
See its demo
Details
(?: - start of a non-capturing group that matches either of the 2 alternatives:
[^#]- any char other than#`
| - or
## - 2 #s
)+ - repeat matching 1 or more times.
The g modifier finds all matching occurrences inside the input string.
Since you didn't tag a programming language to your question here is my 2 cents for Java:
(?<=(?<!#)(?:##){0,999})#(?!#)
Java doesn't support infinite lookbehinds but bounded so here I explicitly specified max of even occurrences of #: 999.
JavsScript
Lookbehinds in JavaScript are not implemented and supported by many browsers yet. If you are trying to do this in JS then this would be your working solution:
Method 1
((?:[^#]*(?:##)+[^#]*)+)|#
(?:[^#]*(?:##)+[^#]*)+ Match ## occurrences and all its leading / trailing characters
|# Or a single #
JS Code:
str.split(/((?:[^#]*(?:##)+[^#]*)+)|#/).filter(Boolean);
Method 2 (Recommended)
Or if you don't have problem with using match() this is much more cleaner and of course faster:
(?:[^#]*(?:##)+[^#]*)+|[^#]+
JS Code:
console.log(
"aaaa#######bbb#aa###cccc##ddddd#".match(/(?:[^#]*(?:##)+[^#]*)+|[^#]+/g)
);

Allow space in regex when validating file

I've got a text box where I wanted to ensure some goods and bads out of it.
For instance good could include:
GoodString
GoodString88
99GoodString
Some bad things I did not want to include was:
Good*String
Good&String
But one thing I wanted to allow would be to allow spaces between words so this should of been good:
Good String
However my regex/js is stating this is NOT a good string - I want to allow it. I'm using the test routine for this and I'm as dumb as you can get with regexes. I don't know why I can never understand these things...
In any event my validation is as follows:
var rx = /^[\w.-]+$/;
if (!rx.test($("#MainContent_txtNewDocumentTitle").val())) {
//code for bad string
}else{
//code for good string
}
What can I do to this:
var rx = /^[\w.-]+$/;
Such that spaces are allowed?
You can use this regex instead to allow space only in middle (not at start/end):
var rx = /^[\w.-]+(?:[ \t]+[\w.-]+)*$/gm;
RegEx Demo
RegEx Breakup:
^ # line start
[\w.-]+ # match 1 or more of a word character or DOT or hyphen
(?: # start a non-capturing group
[ \t]+ # match one or more space or tab
[\w.-]+ # match 1 or more of a word character or DOT or hyphen
)* # close the non-capturing group. * will allow 0 or more matches of group
$ # line end
/gm # g for global and m for multiline matches
RegEx Reference

regular expression to replace with ','

I have one RegExp, could anyone explain exactly what it does?
Regexp
b=b.replace(/(\d{1,3}(?=(?:\d\d\d)+(?!\d)))/g,"$1 ")
I think it is replacing with space(' ')
if i'm right, i want to replace it with comma(,) instead of space(' ').
To explain the regex, let's break it down:
( # Match and capture in group number 1:
\d{1,3} # one to three digits (as many as possible),
(?= # but only if it's possible to match the following afterwards:
(?: # A (non-capturing) group containing
\d\d\d # exactly three digits
)+ # once or more (so, three/six/nine/twelve/... digits)
(?!\d) # but only if there are no further digits ahead.
) # End of (?=...) lookahead assertion
) # End of capturing group
Actually, the outer parentheses are unnecessary if you use $& instead of $1 for the replacement string ($& contains the entire match).
The regex (\d{1,3}(?=(?:\d\d\d)+(?!\d))) matches any 1-3 digits ((\d{1,3}) that is followed by a multiple of 3 digits ((?:\d\d\d)+), that isn't followed by another digit ((?!\d)). It replaces it with "$1 ". $1 is replaced by the first capture group. The space behind it is... a space.
See regexpressions on mdn for more information about the different syntaxes.
If you want to seperate the numbers with a comma, instead of a space, you'll need to replace it with "$1," instead.
Don't try to solve everything by using regular expressions.
Regular expressions are meant for matching, not to fix non-text-encoded-as-text formatting.
If you want to format numbers differently, extract them and use format strings to reformat them on a character processing level. That is just an ugly hack.
It is okay to use regular expressions to find the numbers in the text, e.g. \d{4,} but trying to do the actual formatting with regexp is a crazy abuse.

Match everything but not quoted strings

I want to match everything but no quoted strings.
I can match all quoted strings with this: /(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))/
So I tried to match everything but no quoted strings with this: /[^(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))]/ but it doesn't work.
I would like to use only regex because I will want to replace it and want to get the quoted text after it back.
string.replace(regex, function(a, b, c) {
// return after a lot of operations
});
A quoted string is for me something like this "bad string" or this 'cool string'
So if I input:
he\'re is "watever o\"k" efre 'dder\'4rdr'?
It should output this matches:
["he\'re is ", " efre ", "?"]
And than I wan't to replace them.
I know my question is very difficult but it is not impossible! Nothing is impossible.
Thanks
EDIT: Rewritten to cover more edge cases.
This can be done, but it's a bit complicated.
result = subject.match(/(?:(?=(?:(?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*'(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*')*(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*$)(?=(?:(?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*"(?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*")*(?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*$)(?:\\.|[^\\'"]))+/g);
will return
, he said.
, she replied.
, he reminded her.
,
from this string (line breaks added and enclosing quotes removed for clarity):
"Hello", he said. "What's up, \"doc\"?", she replied.
'I need a 12" crash cymbal', he reminded her.
"2\" by 4 inches", 'Back\"\'slashes \\ are OK!'
Explanation: (sort of, it's a bit mindboggling)
Breaking up the regex:
(?:
(?= # Assert even number of (relevant) single quotes, looking ahead:
(?:
(?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*
'
(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*
'
)*
(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*
$
)
(?= # Assert even number of (relevant) double quotes, looking ahead:
(?:
(?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*
"
(?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*
"
)*
(?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*
$
)
(?:\\.|[^\\'"]) # Match text between quoted sections
)+
First, you can see that there are two similar parts. Both these lookahead assertions ensure that there is an even number of single/double quotes in the string ahead, disregarding escaped quotes and quotes of the opposite kind. I'll show it with the single quotes part:
(?= # Assert that the following can be matched:
(?: # Match this group:
(?: # Match either:
\\. # an escaped character
| # or
"(?:\\.|[^"\\])*" # a double-quoted string
| # or
[^\\'"] # any character except backslashes or quotes
)* # any number of times.
' # Then match a single quote
(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*' # Repeat once to ensure even number,
# (but don't allow single quotes within nested double-quoted strings)
)* # Repeat any number of times including zero
(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])* # Then match the same until...
$ # ... end of string.
) # End of lookahead assertion.
The double quotes part works the same.
Then, at each position in the string where these two assertions succeed, the next part of the regex actually tries to match something:
(?: # Match either
\\. # an escaped character
| # or
[^\\'"] # any character except backslash, single or double quote
) # End of non-capturing group
The whole thing is repeated once or more, as many times as possible. The /g modifier makes sure we get all matches in the string.
See it in action here on RegExr.
Here is a tested function that does the trick:
function getArrayOfNonQuotedSubstrings(text) {
/* Regex with three global alternatives to section the string:
('[^'\\]*(?:\\[\S\s][^'\\]*)*') # $1: Single quoted string.
| ("[^"\\]*(?:\\[\S\s][^"\\]*)*") # $2: Double quoted string.
| ([^'"\\]*(?:\\[\S\s][^'"\\]*)*) # $3: Un-quoted string.
*/
var re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
var a = []; // Empty array to receive the goods;
text = text.replace(re, // "Walk" the text chunk-by-chunk.
function(m0, m1, m2, m3) {
if (m3) a.push(m3); // Push non-quoted stuff into array.
return m0; // Return this chunk unchanged.
});
return a;
}
This solution uses the String.replace() method with a replacement callback function to "walk" the string section by section. The regex has three global alternatives, one for each section; $1: single quoted, $2: double quoted, and $3: non-quoted substrings, Each non-quoted chunk is pushed onto the return array. It correctly handles all escaped characters, including escaped quotes, both inside and outside quoted strings. Single quoted substrings may contain any number of double quotes and vice-versa. Illegal orphan quotes are removed and serve to divide a non-quoted section into two chunks. Note that this solution requires no lookaround and requires only one pass. It also implements Friedl's "Unrolling-the-Loop" efficiency technique and is quite efficient.
Additional: Here is some code to test the function with the original test string:
// The original test string (with necessary escapes):
var s = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
alert(s); // Show the test string without the extra backslashes.
console.log(getArrayOfNonQuotedSubstrings(s).toString());
You can't invert a regex. What you have tried was making a character class out of it and invert that - but also for doing that you would have to escape all closing brackets "\]".
EDIT: I would have started with
/(^|" |' ).+?($| "| ')/
This matches anything between the beginning or the end of a quoted string (very simple: a quotation mark plus a blank) and the end of the string or the start of a quoted string (a blank plus a quotation mark). Of course this doesn't handle any escape sequences or quotations which don't follow the scheme / ['"].*['"] /. See above answers for more detailed expressions :-)

Escape a white space character using Javascript

I have the following jquery statement. I wish to remove the whitespace as shown below. So if I have a word like:
For example
#Operating/System I would like the
end result to show me
#Operating\/System. (ie with a
escape sequence).
But if I have #Operating/System test then I want to show
#Operating\/System + escape
sequence for space. The .replace(/ /,'')
part is incorrect but .replace("/","\\/") works
well as per my requirements.
Please help!
$("#word" + lbl.eq(i).text().replace("/","\\/").replace(/ /,'')).hide();
$( "#word" + lbl.eq(i).text().replace(/([ /])/g, '\\$1') ).hide();
This matches all spaces and slashes in a string (and saves the respective char in group $1):
/([ /])/g
replacement with
'\\$1'
means a backslash plus the original char in group $1.
"#Operating/System test".replace(/([ /])/g, '\\$1');
-->
"#Operating\/System\ test"
Side advantage - there is only a singe call to replace().
EDIT: As requested by the OP, a short explanation of the regular expression /([ /])/g. It breaks down as follows:
/ # start of regex literal
( # start of match group $1
[ /] # a character class (spaces and slashes)
) # end of group $1
/g # end of regex literal + "global" modifier
When used with replace() as above, all spaces and slashes are replaced with themselves, preceded by a backslash.

Categories

Resources