URL Encoding in JS for meaningful URLs and Rails Page Caching

I'm running a Rails application which gets a lot of traffic at the moment, so I started using page caching to increase the performance. So far everything works like a charm. But when I tried to also cache search results, I ran into a strange problem.
My Approach:
Use meaningful URLs for searching and pagination (/search?query=term&page=3 becomes /search/term/3)
Use Javascript to submit the form - if JS is disabled it falls back to the old form (which also works with my routes, but without caching)
My Code:
// Javascript
function set_search_action() {
window.location = '/search/' + escape(document.getElementById('query').value);
return false;
}
// HTML
<form action="/search" id="search_form" method="get" onSubmit="return set_search_action();">
<input id="query" name="query" title="Search" type="text" />
<input class="submit" name="commit" type="submit" value="Search" />
</form>
The Problem
Everything works for single words like "term". But when I search for "term1 term2" the form is submitted to /search/term1 term2/ or /search/term1 term2/1. It should be submitted to /search/term1+term2. That's what the JS escape function should do, I think.
So far it works also with spaces in development mode. But I guess it will become a problem in production mode with caching enabled (URLs shouldn't contain any whitespaces).
Any ideas on what I did wrong? Thanks!

It should be submitted to /search/term1+term2
Nope. Plus symbols only represent spaces in application/x-www-form-urlencoded content, such as when the query-string part of the URL is used to submit a form. In the path-part of a URL, + simply means plus; space should be encoded to %20 instead.
That's what the JS escape function should do I think.
No, it doesn't, and that's part of the problem. escape() encodes a space as %20 but leaves a literal + untouched, so a plus sign typed by the user reaches the server unencoded and may be read back as a space by decoders that follow the form-submission rules. It also mangles non-ASCII characters into a non-standard %uXXXX format specific to the escape function that standard URL decoders will not be able to read.
As Tomalak said, escape()/unescape() is almost always the wrong thing, and in general should not be used. encodeURIComponent() is usually what you really want, and will produce %20 for spaces, which is safe as it is equally valid in the path part or the query string.

Never use escape()! It's broken and strongly discouraged for what you are doing. Use encodeURIComponent() instead.
To have + instead of %20, append a .replace(/%20/g, "+").
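As a minimal sketch of the advice above, the search path could be built like this. The helper name searchPath is hypothetical and only mirrors the /search/term route from the question; it is not part of the original code.

```javascript
// Hypothetical helper: builds the path-style search URL used in the question.
function searchPath(query) {
  // encodeURIComponent encodes a space as %20 and a literal "+" as %2B,
  // and it encodes non-ASCII characters as UTF-8 percent escapes.
  return '/search/' + encodeURIComponent(query);
}

console.log(searchPath('term1 term2')); // /search/term1%20term2
console.log(searchPath('c++ & ruby'));  // /search/c%2B%2B%20%26%20ruby

// Only if the server really expects "+" for spaces:
console.log(searchPath('term1 term2').replace(/%20/g, '+')); // /search/term1+term2
```

In the form handler from the question, window.location would then be assigned the value returned by searchPath instead of the escape() result.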

Related

"Literally" insert string into HTML input

I want to insert a string "literally" into an HTML input field using jQuery.
The problem is that the string contains escaped characters and character codes, and all of those need to still be there when the string is inserted.
My problem seems to be with escaped characters (thanks for the comments that pointed that out). I can't figure out how I can insert the string without the escaped characters and codes being translated.
The literal strings come from a file data.txt. To clarify, this is just an exemplary string that is used to demonstrate that there can be escaped quotes and character codes etc. in the strings.
TEST\"/**\x3e
They are loaded (in node.js) from the file into an array of strings.
Wrapper code (Node.js) visits the page using the Chrome dev tools.
Here, for each string a script is prepared that is injected and executed on the page.
Therefore the inputString is inserted into the script, before it is injected.
So here is my problem with string escaping. I have the strings in literal format as data and I currently inject them as dynamically generated JavaScript code which is where escaping problems occur.
Injected Code
// this was (currently incorrectly) injected into the page before
// from the array of input strings that was loaded from data
let insertString = "TEST\"/**\x3e"; // <-
let form = $("form").first();
let inputs = form.find(":input").not(":input[type=submit]");
let input = inputs.first();
input.focus().val(insertString);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<!-- Exemplary form code on the page -->
<form action="post" method="post">
<label for="name">Name: </label>
<input id="name" type="text" name="input">
<input type="submit" value="Submit">
</form>
What we got
What I want
The string is not inserted as is.
For example the character code \x3e is translated to >.
Also the escaped \" is translated to ".
It needs to be inserted just as it would be when copying and pasting from the data file.
Thoughts on a potential (manual) solution
So one potential solution is to rework the data.txt file and escape the strings correctly. So the first line might be TEST\\\"/**\\x3e, as @Jamiec and @Barmar have proposed.
// injected before
let insertString = "TEST\\\"/**\\x3e"; // <- manually escaped
let form = $("form").first();
let inputs = form.find(":input").not(":input[type=submit]");
let input = inputs.first();
input.focus().val(insertString);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<!-- Exemplary form code on the page -->
<form action="post" method="post">
<label for="name">Name: </label>
<input id="name" type="text" name="input">
<input type="submit" value="Submit">
</form>
The string will then be inserted as intended, but the solution is still not satisfying, because it would be better for me to not touch the input data.
It would be best to have the input strings in the data.txt file exactly as they will look when they are inserted into the page.
This would require an additional step between loading the input data and inserting each string into the script (that is then injected into the page). Potentially this preprocessing can be done with regexp replacements.

You need to escape all the backslashes and quotes in the string. You can do this using a regular expression.
function escape_string(string) {
return string.replace(/[\\"']/g, "\\$&");
}
console.log('let str = "' + escape_string(prompt("Type a string")) + '";');
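An alternative sketch, not taken from the answer above: since the script is generated in Node.js, JSON.stringify can produce the string literal, including the surrounding quotes. The variable names raw and injected are hypothetical; the sample value is the one from the question.

```javascript
// Raw text as it would be read from data.txt (String.raw keeps the backslashes).
const raw = String.raw`TEST\"/**\x3e`;

// JSON.stringify adds the surrounding double quotes and escapes
// backslashes and double quotes inside the value.
const injected = 'let insertString = ' + JSON.stringify(raw) + ';';

console.log(injected); // let insertString = "TEST\\\"/**\\x3e";
```

The generated line matches the manually escaped version shown in the question, so data.txt can stay untouched.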

This has nothing to do with encoding, nor input fields - it is simply string escapes - so can be demonstrated using the console (or any other way of displaying a string).
In order to see the literal escape character \ in a string you must escape the escape character with \\ - see below:
var text1 = "TEST\"/**\x3e";
console.log(text1)
var text2 = "TEST\\\"/**\\x3e";
console.log(text2)
As you can see, the first output is your exact problem, whereas the second escapes the escape character so you get what you expect in the output.

Partial matching a string against a regex

Suppose that I have this regular expression: /abcd/
Suppose that I wanna check the user input against that regex and disallow entering invalid characters in the input. When the user inputs "ab", it fails as a match for the regex, but I can't disallow entering "a" and then "b", as the user can't enter all 4 characters at once (except for copy/paste). So what I need here is a partial match which checks if an incomplete string can potentially be a match for a regex.
Java has something for this purpose: .hitEnd() (described here http://glaforge.appspot.com/article/incomplete-string-regex-matching). Python doesn't do it natively but has this package that does the job: https://pypi.python.org/pypi/regex.
I didn't find any solution for it in js. It's been asked years ago: Javascript RegEx partial match
and even before that: Check if string is a prefix of a Javascript RegExp
P.S. regex is custom, suppose that the user enters the regex herself and then tries to enter a text that matches that regex. The solution should be a general solution that works for regexes entered at runtime.

Looks like you're lucky, I've already implemented that stuff in JS (which works for most patterns - maybe that'll be enough for you). See my answer here. You'll also find a working demo there.
There's no need to duplicate the full code here, I'll just state the overall process:
Parse the input regex, and perform some replacements. There's no need for error handling as you can't have an invalid pattern in a RegExp object in JS.
Replace abc with (?:a|$)(?:b|$)(?:c|$)
Do the same for any "atoms". For instance, a character group [a-c] would become (?:[a-c]|$)
Keep anchors as-is
Keep negative lookaheads as-is
Had JavaScript had more advanced regex features, this transformation might not have been possible. But with its limited feature set, it can handle most input regexes. It will yield incorrect results on a regex with backreferences, though, if your input string ends in the middle of a backreference match (like matching ^(\w+)\s+\1$ against hello hel).
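To illustrate only the core replacement step described above (abc becomes (?:a|$)(?:b|$)(?:c|$)), here is a minimal sketch that handles plain literal strings. It does not parse real patterns with groups, classes or quantifiers, and the helper names escapeRegExp and toPartialLiteralRegex are hypothetical, not taken from the linked implementation.

```javascript
// Escape regex metacharacters so a character is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: handles literal strings only, not arbitrary patterns.
// Every character becomes an atom that matches either itself or end of input.
function toPartialLiteralRegex(literal) {
  const atoms = Array.from(literal).map(ch => '(?:' + escapeRegExp(ch) + '|$)');
  return new RegExp('^' + atoms.join('') + '$');
}

const partial = toPartialLiteralRegex('abcd');
console.log(partial.source);       // ^(?:a|$)(?:b|$)(?:c|$)(?:d|$)$
console.log(partial.test('ab'));   // true  (could still become "abcd")
console.log(partial.test('abx'));  // false (can never become "abcd")
console.log(partial.test('abcd')); // true
```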

As many have stated, there is no standard library. Fortunately, I have written a JavaScript implementation that does exactly what you require. With some minor limitations it works for regular expressions supported by JavaScript.
see: incr-regex-package.
Further there is also a react component that uses this capability to provide some useful capabilities:
Check input as you type
Auto complete where possible
Make suggestions for possible input values
Demo of the capabilities Demo of use

I think you need two regexes: one for checking while typing, /a?b?c?d?/, and one for the final test on paste or when leaving the input, /abcd/.
This will test for valid phone number:
const input = document.getElementById('input')
let oldVal = ''
input.addEventListener('keyup', e => {
if (/^\d{0,3}-?\d{0,3}-?\d{0,3}$/.test(e.target.value)){
oldVal = e.target.value
} else {
e.target.value = oldVal
}
})
input.addEventListener('blur', e => {
console.log(/^\d{3}-?\d{3}-?\d{3}-?$/.test(e.target.value) ? 'valid' : 'not valid')
})
<input id="input">
And this is the case for a first name and surname:
const input = document.getElementById('input')
let oldVal = ''
input.addEventListener('keyup', e => {
if (/^[A-Z]?[a-z]*\s*[A-Z]?[a-z]*$/.test(e.target.value)){
oldVal = e.target.value
} else {
e.target.value = oldVal
}
})
input.addEventListener('blur', e => {
console.log(/^[A-Z][a-z]+\s+[A-Z][a-z]+$/.test(e.target.value) ? 'valid' : 'not valid')
})
<input id="input">

This is the hard solution for those who think there's no solution at all: implement the python version (https://bitbucket.org/mrabarnett/mrab-regex/src/4600a157989dc1671e4415ebe57aac53cfda2d8a/regex_3/regex/_regex.c?at=default&fileviewer=file-view-default) in js. So it is possible. If someone has a simpler answer, they'll win the bounty.
Example using python module (regular expression with back reference):
$ pip install regex
$ python
>>> import regex
>>> regex.Regex(r'^(\w+)\s+\1$').fullmatch('abcd ab',partial=True)
<regex.Match object; span=(0, 7), match='abcd ab', partial=True>

You guys would probably find this page of interest:
(https://github.com/desertnet/pcre)
It was a valiant effort: make a WebAssembly implementation that would support PCRE. I'm still playing with it, but I suspect it's not practical. The WebAssembly binary weighs in at ~300K; and if your JS terminates unexpectedly, you can end up not destroying the module, and consequently leaking significant memory.
The bottom line is: this is clearly something the ECMAscript people should be formalizing, and browser manufacturers should be furnishing (kudos to the WebAssembly developer into possibly shaming them to get on the stick...)
I recently tried using the "pattern" attribute of an input[type='text'] element. I, like so many others, found it to be a letdown that it would not validate until a form was submitted. So a person would be wasting their time typing (or pasting...) numerous characters and jumping on to other fields, only to find out after a form submit that they had entered that field wrong. Ideally, I wanted it to validate field input immediately, as the user types each key (or at the time of a paste...)
The trick to doing a partial regex match (until the ECMAscript people and browser makers get it together with PCRE...) is to not only specify a pattern regex, but associated template value(s) as a data attribute. If your field input is shorter than the pattern (or input.maxLength...), it can use them as a suffix for validation purposes. YES -this will not be practical for regexes with complex case outcomes; but for fixed-position template pattern matching -which is USUALLY what is needed- it's fine (if you happen to need something more complex, you can build on the methods shown in my code...)
The example is for a bitcoin address [ Do I have your attention now? -OK, not the people who don't believe in digital currency tech... ] The key JS function that gets this done is validatePattern. The input element in the HTML markup would be specified like this:
<input id="forward_address"
name="forward_address"
type="text"
maxlength="90"
pattern="^(bc(0([ac-hj-np-z02-9]{39}|[ac-hj-np-z02-9]{59})|1[ac-hj-np-z02-9]{8,87})|[13][a-km-zA-HJ-NP-Z1-9]{25,34})$"
data-entry-templates="['bc099999999999999999999999999999999999999999999999999999999999','bc1999999999999999999999999999999999999999999999999999999999999999999999999999999999999999','19999999999999999999999999999999999']"
onkeydown="return validatePattern(event)"
onpaste="return validatePattern(event)"
required
/>
[Credit goes to this post: "RegEx to match Bitcoin addresses?
" Note to old-school bitcoin zealots who will decry the use of a zero in the regex here -it's just an example for accomplishing PRELIMINARY validation; the server accepting the address passed off by the browser can do an RPC call after a form post, to validate it much more rigorously. Adjust your regex to suit.]
The exact choice of characters in the data-entry-template was a bit arbitrary; but they had to be ones such that if the input being typed or pasted by the user is still incomplete in length, it will use them as an optimistic stand-in and the input so far will still be considered valid. In the example there, for the last of the data-entry-templates ('19999999999999999999999999999999999'), that was a "1" followed by 39 nines (seeing as how the regex spec "{25,39}" dictates that a maximum of 39 digits in the second character span/group...) Because there were two forms to expect -the "bc" prefix and the older "1"/"3" prefix- I furnished a few stand-in templates for the validator to try (if it passes just one of them, it validates...) In each template case, I furnished the longest possible pattern, so as to insure the most permissive possibility in terms of length.
If you were generating this markup on a dynamic web content server, an example with template variables (a la django...) would be:
<input id="forward_address"
name="forward_address"
type="text"
maxlength="{{MAX_BTC_ADDRESS_LENGTH}}"
pattern="{{BTC_ADDRESS_REGEX}}" {# base58... #}
data-entry-templates="{{BTC_ADDRESS_TEMPLATES}}" {# base58... #}
onkeydown="return validatePattern(event)"
onpaste="return validatePattern(event)"
required
/>
[Keep in mind: I went to the deeper end of the pool here. You could just as well use this for simpler patterns of validation.]
And if you prefer to not use event attributes, but to transparently hook the function to the element's events at document load -knock yourself out.
You will note that we need to specify validatePattern on two events:
The keydown, to intercept delete and backspace keys.
The paste (the clipboard is pasted into the field's value, and if it works, it accepts it as valid; if not, the paste does not transpire...)
Of course, I also took into account when text is partially selected in the field, dictating that a key entry or pasted text will replace the selected text.
And here's a link to the [dependency-free] code that does the magic:
https://gitlab.com/osfda/validatepattern.js
(If it happens to generate interest, I'll integrate constructive and practical suggestions and give it a better readme...)
PS: The incremental-regex package posted above by Lucas Trzesniewski:
Appears not to have been updated? (I saw signs that it was undergoing modification??)
Is not browserified (tried doing that to it, to kick the tires on it -it was a module mess; welcome anyone else here to post a browserified version for testing. If it works, I'll integrate it with my input validation hooks and offer it as an alternative solution...) If you succeed in getting it browserfied, maybe sharing the exact steps that were needed would also edify everyone on this post. I tried using the esm package to fix version incompatibilities faced by browserify, but it was no go...

I strongly suspect (although I'm not 100% sure) that the general case of this problem has no solution, in the same way as Turing's famous "Halting problem" (see Undecidable problem). And even if there is a solution, it most probably will not be what users actually want and thus, depending on your strictness, will result in a bad-to-horrible UX.
Example:
Assume "target RegEx" is [a,b]*c[a,b]* also assume that you produced a reasonable at first glance "test RegEx" [a,b]*c?[a,b]* (obviously two c in the string is invalid, yeah?) and assume that the current user input is aabcbb but there is a typo because what the user actually wanted is aacbbb. There are many possible ways to fix this typo:
remove c and add it before first b - will work OK
remove first b and add after c - will work OK
add c before first b and then remove the old one - Oops, we prohibit this input as invalid and the user will go crazy because no normal human can understand such a logic.
Note also that your hitEnd will have the same problem here unless you prohibit the user from entering characters in the middle of the input box, which would be another way to make a horrible UI.
In real life there would be many much more complicated examples that none of your smart heuristics will be able to account for properly, and they will upset users.
So what to do? I think the only thing you can do and still get a reasonable UX is the simplest thing, i.e. just analyze your "target RegEx" for the set of allowed characters and make your "test RegEx" [set of allowed chars]*. And yes, if the "target RegEx" contains the . wildcard, you will not be able to do any reasonable filtering at all.
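A minimal sketch of that last suggestion, assuming the set of allowed characters has already been worked out by hand. Extracting the set from an arbitrary regex is not attempted here, and the helper name buildCharFilter is hypothetical.

```javascript
// Hypothetical helper: builds the "test RegEx" [set of allowed chars]*
// from a string that lists every allowed character.
function buildCharFilter(allowedChars) {
  // Escape the characters that are special inside a character class.
  const body = allowedChars.replace(/[\\\]^-]/g, '\\$&');
  return new RegExp('^[' + body + ']*$');
}

// Allowed characters for a target such as [ab]*c[ab]*
const filter = buildCharFilter('abc');
console.log(filter.test('aabcbb')); // true
console.log(filter.test('aacbcb')); // true (order and count are not checked)
console.log(filter.test('aabdbb')); // false ("d" is not allowed)
```

The filter only rejects characters that can never appear, so intermediate states while the user fixes a typo remain accepted; the full target regex is still needed for the final check.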

Define allowed characters in text objects HTML

Is there any way I can define the encoding in text areas using HTML and pure JS?
I want to have them not permitting special unicode characters (such as ♣♦♠).
The valid character range (for my purpose) is from Unicode code point U+0000 to U+00FF.
It is OK to silently replace invalid characters with an empty string upon form-submission (without warning to the user).

So, as you have clarified in your comments: you want to replace the characters you deem illegal with empty strings on form-submission without warning.
Given the following example html (body content):
<form action="demo_form.asp">
First name: <input type="text" name="fname" /><br>
Last name: <input type="text" name="lname" /><br>
Likes: <textarea name="txt_a"></textarea><br>
Dislikes: <textarea name="txt_b"></textarea><br>
<input type="submit" value="Submit">
</form>
Here is a basic concept javascript:
function demo(){
for( var elms=this.getElementsByTagName('textarea')
, L=elms.length
; L--
; elms[L].value=elms[L].value.replace(/[^\u0000-\u00FF]/g,'')
);
}
window.onload=function(){
document.forms[0].onsubmit=demo; //hook form's onsubmit use any method you like
};
The basic idea is to force the browser's regex engine to match on Unicode (not local charset) using the \uXXXX notation.
Then we simply make a range: [\u0000-\u00FF] and finally specify we want to match on everything outside that range: [^\u0000-\u00FF].
Everything that matches those criteria will be replaced by '' (an empty string) on form-submission. No warning no nothing.
You can/should freely expand this concept to incorporate this into your code (in a way that fits your code-flow) (and where needed, apply it to input type="text" etc), depending on your further requirements.
This should get you started!
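The replacement itself can be tried outside a form. This is a small sketch of the same regex wrapped in a function; the name stripOutsideLatin1 is hypothetical.

```javascript
// Removes every UTF-16 code unit outside U+0000..U+00FF,
// using the same character class as the demo() handler above.
function stripOutsideLatin1(text) {
  return text.replace(/[^\u0000-\u00FF]/g, '');
}

console.log(stripOutsideLatin1('a♣b€cé')); // abcé
```

Here ♣ (U+2663) and € (U+20AC) are dropped, while é (U+00E9) is kept because it lies inside the range.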
EDIT:
Note that your current valid-range specification (\u0000-\u00FF) will effectively dis-allow all such 'pesky' special characters like:
fancy quotes ‘ ’ “ ”
(that's a great feature for people copying from Word etc.),
€ ™ Œ œ, etc.
But, it will nicely include the full C1 control-block (all 32 control-characters). However on the other hand.. it's consistent with including the full C0 control-block.
Effectively, this is now your (what you requested) valid char-set: http://en.wikipedia.org/wiki/ISO/IEC_8859-1
As you can now see, there is a lot more to this. That is why sane applications (finally) are starting to use Unicode (usually encoded for the web as UTF-8) and just accept what the users provide (within (extremely clearly specified) reason)!
Most common validation-questions are (in the real world) nothing more than a high-school-class example of the concept of validating (and even more to the point: to explain the basics of regular expressions with what is considered to be easily understandable examples, like name/email/address). Sadly they are widely applied even by some government identity-systems (up to passports etc) to people's names, addresses etc. In fact: even the full current Unicode cannot represent every person's name (in native writing) on the planet (that is actually still alive)!! Real world example: try entering and leaving a commercial flight when your boarding-pass has different credentials than your passport (regardless of which one is wrong).. 'Just' an umlaut missing is going to be a problem somewhere; worse example: imagine a woman with a German first name, Thai last name and married to a man with a Mandarin last name..
Source: xkcd.com/1171/
Finally: Please do realize that in most cases this whole exercise is useless (if you do it silently without warning), because:
you may never just accept user-input on the server-side without proper cleanup, so you are already (silently without the user knowing it) cleaning up your input to the form that you require (to a novice programmer (that forgets to think about (for example) users with javascript disabled,) this sometimes feels like repeating the work already done in javascript on the client-side)...
Usually, the only use of replicating the server-side behavior on the client-side (usually using javascript) is so the user dynamically knows what would be dis-allowed by the server (without sending data back and forth) and can adapt accordingly!

You can use form attribute accept-charset
The accept-charset attribute specifies the character encodings that
are to be used for the form submission.
The default value is the reserved string "UNKNOWN" (indicates that the
encoding equals the encoding of the document containing the
element).
See this documentation http://www.w3schools.com/tags/att_form_accept_charset.asp
I cannot say if this will protect the text field but at least it controls what character set is submitted by the form.
Actually this issue has already been answered
javascript to prevent writing into form elements after n utf 8 characters

Change the backslashes in a URI to forward-slashes with javascript

I'm writing an app that will let the user import pictures. I'm running windows, so the file path that is returned when the user selects a picture has backslashes, which is what I believe to be causing javascript to fail when I pass the path to my import method.
I get the file path with a simple html file input and use a submit button with an onclick call to my javascript:
<input type="file" id="photo-to-import" />
<input type="button" value="Submit" onclick="console.log($('#photo-to-import').val().replace('/\\/g','/'))"/>
console.log is normally where the function call would go, I've changed it for debugging. If I hard code in a file path to a picture and go through and manually change the slashes, it imports the picture, for example, I'll copy/paste a path:
C:\Users\Name\Desktop\desktop app\images\imageName.png
into the function and change the slashes I end up with:
<input type="button" value="Submit" onclick="onPhotoURISuccess('C:/Users/Name/Desktop/desktop app/images/imageName.png')"/>
and this works great. I have tried
.replace('\\\\', '/')
.replace('\\', '/')
...
and always get the exact same output, the string is unchanged every time.

Change replace('/\\/g','/') to replace(/\\/g,'/'), with the quotes you will be attempting to replace literal matches of the string '/\\/g' instead of using a regular expression literal.
For example, 'foo /\\/g bar'.replace('/\\/g','/') will give you 'foo / bar', and 'C:\\Users\\Name\\Desktop\\desktop app\\images\\imageName.png'.replace(/\\/g,'/') will give you 'C:/Users/Name/Desktop/desktop app/images/imageName.png'.
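The difference between the two calls can be checked directly. This sketch uses the path from the question; the variable names are hypothetical.

```javascript
// String.raw keeps the backslashes exactly as the file input would report them.
const windowsPath = String.raw`C:\Users\Name\Desktop\desktop app\images\imageName.png`;

// A string pattern is searched for literally: it looks for the text "/\/g".
const unchanged = windowsPath.replace('/\\/g', '/');

// A regex literal with the g flag replaces every backslash.
const converted = windowsPath.replace(/\\/g, '/');

console.log(unchanged === windowsPath); // true
console.log(converted); // C:/Users/Name/Desktop/desktop app/images/imageName.png
```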

What is the correct way to encodeURIComponent non-UTF-8 characters and decode them accordingly?

I have a JavaScript bookmarklet that uses encodeURIComponent to pass the URL of the current page to the server side, and then uses urldecode on the server side to get the characters back.
The problem is that when the encoded character is not in UTF-8 (in my case it's GB2312, but it could be something else) and the server does the urldecode, the decoded characters become squares. Which, obviously, isn't what they looked like before the encoding.
It's a bookmarklet, input could be anything, so I can't just define "encode as gb2312" in the js, or "decode as gb2312" in the php scripts.
So, is there a correct way of using encodeURIComponent which passes the character encoding together with the contents, so that the decoding can pick the right encoding to decode it?

For encoding of browsers, especially for GB2312 charset, check the following docs (in Chinese) first
http://ued.taobao.com/blog/2011/08/26/encode-war/
http://www.ruanyifeng.com/blog/2010/02/url_encoding.html
For your case, %C8%B7%B6%A8 is actually generated from the GB2312 form of '\u786e\u5b9a'. This occurs normally on (legacy?) versions of IE and FF, when user directly inputs Chinese character in location bar,
Or you're using a non-standard link from page content which does not perform IRI-to-URI encoding at all and just renders a binary string like '/tag/\xc8\xb7\xb6\xa8' (douban.com used to do this for tags; now they use correct URI encoding in UTF-8). I'm not quite sure about this because I cannot reproduce it in Chrome (maybe test in FF and IE), but the part about douban is true.
Actually, the correct output of encodeURIComponent should be
> encodeURIComponent('%C8%B7%B6%A8')
"%25C8%25B7%25B6%25A8"
Thus on the server side, when an unquoted string contains non-ASCII bytes, you'd better leave the string as it is, here '%C8%B7%B6%A8'.
Also, you could check on the client side and apply encodeURIComponent again to a value that contains %XX where XX is larger than 0x7F. I'm not quite sure whether this goes against RFC 2396 though.
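As a rough illustration of why '%C8%B7%B6%A8' needs special care: encodeURIComponent always produces UTF-8 escapes, and decodeURIComponent rejects byte sequences that are not valid UTF-8. The sketch below assumes a hypothetical helper tryDecodeUtf8; the same kind of check could be written in the server-side language.

```javascript
// Hypothetical helper: returns the decoded text, or null when the
// percent-encoded bytes are not valid UTF-8 (for example GB2312 bytes).
function tryDecodeUtf8(component) {
  try {
    return decodeURIComponent(component);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err;
  }
}

console.log(encodeURIComponent('确定'));           // %E7%A1%AE%E5%AE%9A
console.log(tryDecodeUtf8('%E7%A1%AE%E5%AE%9A')); // 确定
console.log(tryDecodeUtf8('%C8%B7%B6%A8'));       // null
```

A null result only says the bytes are not UTF-8; it does not reveal which legacy encoding produced them.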
Writing in English is tiring, but when in Rome, do as the Romans do~

Use escape() and then translate the characters to numeric character references before sending them to the server.
From MDN escape() reference:
The hexadecimal form for characters, whose code unit value is 0xFF or
less, is a two-digit escape sequence: %xx. For characters with a
greater code unit, the four-digit format %uxxxx is used.
Thus, it's easy to translate the output of escape() to numeric character reference by using a simple replace() statement:
escape(input_value).replace(/%u([0-9a-fA-F]{4})/g, '&#x$1;');
Or, if your server-side language only supports decimal entities, use:
escape(input_value).replace(/%u([0-9a-fA-F]{4})/g, function(m0, m1) {
return '&#' + parseInt(m1, 16) + ';';
});
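A variant sketch that avoids escape() altogether, in case one reference per code point is wanted: escape() works on UTF-16 code units, so a character outside the Basic Multilingual Plane comes out as two %uXXXX sequences. The helper name toNumericEntities is hypothetical, and unlike escape() it leaves characters up to U+00FF untouched rather than percent-encoding them.

```javascript
// Hypothetical helper: converts every character above U+00FF to a decimal
// numeric character reference, one reference per code point.
function toNumericEntities(text) {
  return Array.from(text, ch => {
    const codePoint = ch.codePointAt(0);
    return codePoint > 0xFF ? '&#' + codePoint + ';' : ch;
  }).join('');
}

console.log(toNumericEntities('确定'));  // &#30830;&#23450;
console.log(toNumericEntities('ok 😀')); // ok &#128512;
```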
Example code in PHP
client.html (file encoding: GB2312):
<html>
<head>
<meta charset="gb2312">
<script>
function processForm(form) {
console.log('BEFORE:', form.test.value);
form.test.value = escape(form.test.value).replace(/%u(\w{4})/g, function(m0, m1) {
return '&#' + parseInt(m1, 16) + ';';
});
console.log('AFTER:', form.test.value);
return true;
}
</script>
</head>
<body>
<form method="post" action="server.php" onsubmit="return processForm(this);">
<input type="text" name="test" value="确定">
<input type="submit">
</form>
</body>
</html>
server.php:
<?php
echo '<script>console.log("',
$_REQUEST['test'], ' --> ',
mb_decode_numericentity($_REQUEST['test'], array(0x80, 0xffff, 0, 0xffff), 'UTF-8'),
'");</script>';
?>
