I have tried comparing two text files. If these contain the same data but there is a difference of even one space the result is showing as ‘different’.
Can anyone tell me how to compare two JavaScript files using C#?
Since JavaScript is whitespace tolerant (tolerates any amount of whitespace as long as the syntax is correct), the simplest thing to do if you want to compare everything but the whitespace is to regex-replace:
Regex _r = new Regex(@"\s+", RegexOptions.Compiled);
string result = _r.Replace(value, " ");
Run this on both files and compare the results; it replaces any sequence of standard whitespace characters (space, tab, carriage return, vertical tab etc.) with a single space. You can then compare with Equals (case sensitive or not, as you require).
Of course, whitespace IS significant inside strings, so this assumes the string handling in all the compared files does not rely on whitespace too much.
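The question asks for C#, so the following is only a rough JavaScript rendering of the same whitespace-collapsing idea. The helper names `normalizeWhitespace` and `sameIgnoringWhitespace` are hypothetical, and the extra `trim()` is an assumption added so that leading and trailing whitespace is ignored too. The second half shows the caveat about string literals: two sources that differ only inside a quoted string are reported as equal.

```javascript
// Collapse every run of whitespace to a single space, then trim the ends.
function normalizeWhitespace(source) {
  return source.replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: true when two sources differ only in runs of whitespace.
function sameIgnoringWhitespace(a, b) {
  return normalizeWhitespace(a) === normalizeWhitespace(b);
}

const fileA = 'function add(a, b) {\n\treturn a + b;\n}\n';
const fileB = 'function add(a, b) { return a + b; }';
console.log(sameIgnoringWhitespace(fileA, fileB));

// Caveat: whitespace inside string literals is collapsed as well,
// so these two different programs compare as equal.
const literalA = 'var s = "a  b";';
const literalB = 'var s = "a b";';
console.log(sameIgnoringWhitespace(literalA, literalB));
```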
However two very different code files can have the same effects, so if that's what you're after you have a hard job ahead of you.
Do you just need to know if they are exactly the same? If so, you could load them into memory and compare their .Length properties first: different lengths prove the files differ, although equal lengths alone do not prove that they match.
Technically, if one file contains an extra space they aren't "the same". I would first compare the lengths, and if those match you'll need to do a byte-by-byte comparison. If you want to ignore leading and trailing whitespace, call Trim() on the contents of both files first (Trim() does not touch whitespace inside the text).
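A minimal sketch of a length check followed by a byte-by-byte comparison, written in JavaScript rather than C# purely for illustration. `bytesEqual` is a hypothetical helper, and the sample contents are made up; in practice the byte arrays would come from reading the two files.

```javascript
// Hypothetical helper: compares two byte arrays (for example, file contents).
function bytesEqual(a, b) {
  // Different lengths can never be equal, so this is the cheap early exit.
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

const encoder = new TextEncoder();
const first = encoder.encode('var x = 1;');
const second = encoder.encode('var x = 1; '); // one trailing space
console.log(bytesEqual(first, first), bytesEqual(first, second));
```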
Here's a link to an old MS post describing how to create a file compare function:
http://support.microsoft.com/kb/320348
I have included special characters like #, |,~ but they also appeared as data in some values which breaks/fails my idea of joining and splitting values later in the code.
There isn't any "safest" or "most reliable" separator for joining strings in any language, not just JavaScript.
It depends entirely on your dataset, meaning the "safest" choice will be different for every different set of data.
For example, if your dataset is guaranteed to contain only integers, then any non-numeric characters can be the safest choice.
However, if your dataset is free text, then there will be no "safest" choice, because even if you choose an arbitrary combination of characters as the separator, e.g. %%%, an end user can still supply that sequence in a legitimate sentence like My preferred pronoun is "%%%", however unlikely that is. Using %%% as the separator here would therefore still break your logic.
Because of this, you can only choose a separator that gives you the least risk.
Depending on your use case, there are probably simpler solutions that do not require separators.
Generally, avoid joining strings if you need to separate them again later. Serializing the data as JSON is usually a good compromise and has the best interoperability.
CSV can work well too, but don't just insert commas, make sure you properly escape the values if they need it.
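A small sketch of the difference, using made-up sample values. The `csvField` helper is hypothetical and only covers the common quoting rule (wrap the field in double quotes and double any embedded quotes); it is not a full CSV writer.

```javascript
const values = ['plain', 'has | pipe', 'has %%% too', 'comma, inside'];

// Joining with a separator breaks as soon as a value contains it:
// splitting again yields more pieces than the 4 values we started with.
const joined = values.join('|');
console.log(joined.split('|').length);

// JSON escapes whatever it needs to, so the round trip is lossless.
const serialized = JSON.stringify(values);
const restored = JSON.parse(serialized);
console.log(restored.length === values.length);

// Hypothetical CSV field escaper: quote fields containing commas, quotes or newlines.
function csvField(value) {
  return /[",\r\n]/.test(value) ? '"' + value.replace(/"/g, '""') + '"' : value;
}
console.log(values.map(csvField).join(','));
```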
If JSON or CSV isn't appropriate, then a sequence of special characters is more likely to be unique. You could use || (double pipe), as that is very unlikely to occur in anything except C-based code.
You could use other special characters but I would avoid $ or % as these are commonly used in replacement tokens. Also avoid any form of brackets as they are used for other container based replacement.
A three-character code using multiple symbols, such as |:|, is even more likely to be unique. Just pick something that visually looks like a barrier between values and can't be confused with tokens.
I am using regular expressions to find the frequency of occurrences of certain string values in a large data set. This was working fine until I found that some years' worth of data have been entered with a typo in which two characters are swapped. It is not feasible to edit the data sets to correct the typo. Is it therefore possible to define a regex that will match the strings regardless of the order of just those two characters?
The strings in question are:
"gcse/o-level/cse" and "gsce/o-level/cse"
I am aware I can simply search by the characters found after the typo, but I would like to know if there is a RegEx method to deal with this sort of occurrence as I could not find any mention of a solution anywhere else, and thought it posed an interesting challenge.
You can just use
/g(cs|sc)e\/o-level\/cse/
| here means "or", as you're used to.
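For instance, counting the matching entries could look roughly like this. The `records` array is made-up sample data, and the non-capturing group `(?:...)` is a small variation on the pattern above that matches the same strings.

```javascript
// Made-up sample data for illustration.
const records = [
  'gcse/o-level/cse',
  'gsce/o-level/cse',
  'a-level',
  'GCSE/O-Level/CSE',
];

// (?:cs|sc) accepts either order of the two swapped characters.
const pattern = /g(?:cs|sc)e\/o-level\/cse/;

const count = records.filter(function (record) {
  return pattern.test(record);
}).length;
console.log(count);
```

The match is case sensitive; adding the `i` flag would also count the upper-case entry.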
I've seen a few JavaScript programmers use this pattern to produce an array:
"test,one,two,three".split(','); // => ["test", "one", "two", "three"]
They're not splitting user input or some variable holding a string value, they're splitting a hard-coded string literal to produce an array. In all of the cases I've seen a line like the above it would seem that it's perfectly reasonable to just use an array literal without relying on split to create an array from a string. Are there any reasons that the above pattern for creating an array makes sense, or is somehow more efficient than simply using an array literal?
When splitting a string at run-time instead of using an array literal, you are trading a small amount of execution time for a small amount of bandwidth.
In most cases I would argue that it is not worth it. If you are minifying and gzipping your code before publishing it, as you should be, using a single comma inside of a string versus a quote-comma-quote from two strings in an array would have little to no impact on bandwidth savings. In fact after minification and gzipping, the version using the split string could possibly be longer due to the addition of the less compressible .split(',').
Splitting a string instead of creating an array literal of strings does mean a little less typing, but we spend more time reading code than writing it. Using the array literal would be more maintainable in the future. If you wanted to add a comma to an item in the array, you just add it as another string in the array literal; using split you would have to re-write the whole string using a different separator.
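A short sketch of that maintainability point, using made-up items: both forms produce the same array, but adding an item that contains a comma is trivial with the literal and forces a separator change with split.

```javascript
const fromSplit = 'test,one,two,three'.split(',');
const fromLiteral = ['test', 'one', 'two', 'three'];
console.log(JSON.stringify(fromSplit) === JSON.stringify(fromLiteral));

// Adding an item that contains a comma is a one-item change in a literal...
const literalWithComma = ['test', 'one', 'two', 'three', 'four, five'];

// ...but the split version has to switch separators for the whole string.
const splitWithComma = 'test|one|two|three|four, five'.split('|');
console.log(JSON.stringify(splitWithComma) === JSON.stringify(literalWithComma));
```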
The only situation in which I use split and a string literal to create an array is when I need an array that consists only of single characters, e.g. the alphabet, digits or alphanumeric characters.
var letters = 'abcdefghijklmnopqrstuvwxyz'.split(''),
numbers = '0123456789'.split(''),
alphanumeric = letters.concat(numbers);
You'll notice that I used concat to create alphanumeric. If I had instead copy-pasted the contents of letters and numbers into one long string this code would compress better. The reason I did not is that that would be a micro-optimization that would hurt future maintainability. If in the future characters with accents, tildes or umlauts need to be added to the list of letters, they can be added in one place; no need to remember to copy-paste them into alphanumeric too.
Splitting a string may be useful for code golf, but in a production environment where minification and gzipping are factors and writing easily readable, maintainable code is important, just using an array literal is almost always the better choice.
For example, in Ruby, the array ["test", "one", "two", "three"]
could also be written as %w(test one two three), which saves you some characters to type.
But JavaScript doesn't support such a notation, so some people use the split method to achieve the same thing.
If a large number of arrays are built manually, it could potentially reduce the load time of your page, since fewer characters are transmitted. But you would need a large number of arrays, or large arrays, to see a considerable difference. Performance-wise, creating the array this way might be slower, but it is faster to type. For large arrays I use a spreadsheet to apply the formatting around each value with a formula like this: ="'"&A1&"',". I stick with the array literal.
It makes more sense to use the "split" method when you can't control the output, i.e. user input or string output from a method. If you're trying to get a specific value in output that is separated by something, it is easier to use the split method. Of course, if you're controlling the values it doesn't always make sense.
I am trying to make JavaScript print all Unicode characters. According to my research, there are 1,114,112 Unicode characters.
A script like the following could work:
for(i = 0; i < 1114112; i++)
console.log(String.fromCharCode(i));
But I found out that only 10% of the 1,114,112 Unicode characters are used.
How can I print only the used Unicode characters?
As Jukka said, JavaScript has no built-in way of knowing whether a given Unicode code point has been assigned a symbol yet or not.
There is still a way to do what you want, though.
I’ve written several scripts that parse the Unicode database and create separate data files for each category, property, script, block, etc. in Unicode. I’ve also created an HTTP API that allows you to programmatically get all code points (i.e. an array of numbers) in a given Unicode category, or all symbols (i.e. an array of strings for each character) with a given Unicode property, or a regular expression that matches any symbols in a certain Unicode script.
For example, to get an array of strings that contains one item for each Unicode code point that has been assigned a symbol in Unicode v6.3.0, you could use the following URL:
http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B
Note that you can prepend and append anything you like to the output by tweaking the URL parameters, to make it easier to reuse the data in your own scripts. An example HTML page that console.log()s all these symbols, as you requested, could be written as follows:
<!DOCTYPE html>
<meta charset="utf-8">
<title>All assigned Unicode v6.3.0 symbols</title>
<script src="http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B"></script>
<script>
window.symbols.forEach(function(symbol) {
// Do what you want to do with `symbol` here, e.g.
console.log(symbol);
});
</script>
Demo. Note that since this is a lot of data, you can expect your DevTools console to become slow when opening this page.
Update: Nowadays, you should use Unicode data packages such as unicode-11.0.0 instead. In Node.js, you can then do the following:
const symbols = require('unicode-11.0.0/Binary_Property/Assigned/symbols.js');
console.log(symbols);
// Or, to get the code points:
require('unicode-11.0.0/Binary_Property/Assigned/code-points.js');
// Or, to get a regular expression that only matches these characters:
require('unicode-11.0.0/Binary_Property/Assigned/regex.js');
There is no direct way in JavaScript to find out whether a code point is assigned to a character or not, which appears to be the question here. You need information extracted from suitable sources, and this information needs to be updated whenever new characters are assigned in new versions of Unicode.
There are 1,114,112 code points in Unicode. The Unicode standard assigns to each code point the property gc, General Category. If the value of this property is anything but Cs, Co, or Cn, then the code point is assigned to a character. (Code points with gc equal to Co are Private Use code points, to which no character is assigned, but they may be used for characters by private agreements.)
What you would need to do is to get a copy of some relevant files in the Unicode character database (just a collection of files in specific formats, really) and write code that reads it and generates information about assigned code points. For the purposes of printing all Unicode characters, it might be best to generate the information as an array of ranges of assigned codepoints. And this would need to be repeated when the standard is updated with new characters.
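As a rough alternative to parsing the Unicode character database files yourself, engines that support ES2018 Unicode property escapes can answer the General Category question directly. The sketch below swaps in that technique: it treats a code point as assigned when its General Category is not Cs, Co or Cn, and collects inclusive ranges. The names `isAssigned` and `assignedRanges` are hypothetical, and the result reflects whatever Unicode version the running engine ships with, not necessarily the latest standard.

```javascript
// Assumption: "assigned" means General_Category is not Cs, Co or Cn,
// judged by the Unicode data built into the running JavaScript engine.
const assignedPattern = /^[^\p{Cs}\p{Co}\p{Cn}]$/u;

function isAssigned(codePoint) {
  return assignedPattern.test(String.fromCodePoint(codePoint));
}

// Collect inclusive [start, end] ranges of assigned code points.
function assignedRanges() {
  const ranges = [];
  let start = -1;
  for (let cp = 0; cp <= 0x10FFFF; cp++) {
    if (isAssigned(cp)) {
      if (start === -1) start = cp;
    } else if (start !== -1) {
      ranges.push([start, cp - 1]);
      start = -1;
    }
  }
  if (start !== -1) ranges.push([start, 0x10FFFF]);
  return ranges;
}

const ranges = assignedRanges();
console.log(ranges.length, ranges[0]);
```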
Even the rest isn’t trivial. You would need to decide what it means to print a character. Some characters are control characters that may have an effect, such as causing a newline, but lack a visible glyph. Some (spaces) have empty glyphs. Some (combining marks) are meant to be rendered as marks attached to the preceding character, though they have conventional renderings as “standalone” characters, too. Some are meant to take essentially different shapes depending on the surrounding context; they may have isolated forms, too, but just writing one character after another by no means guarantees that an isolated form is used.
Then there’s the problem of fonts. No single font can contain all Unicode characters, so you would need to find a collection of fonts that cover all of Unicode when used together, preferably so that they stylistically match somehow.
So if you are just looking for a compilation of all printable Unicode characters, consider using the Unicode code charts.
The trouble here is that JavaScript is not, contrary to popular opinion, a Unicode environment.
Internally, it uses UCS-2, an incompatible 16-bit encoding method that predates UTF-16.
In addition, many Unicode characters are not directly printable by themselves. Some of them are modifiers for the preceding character: for example, the Spanish letter ñ can be written in Unicode either as a single code point (that character itself) or as two code points (n followed by a combining tilde).
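Both points can be observed directly in a console. The sketch below assumes an engine with `String.prototype.normalize` and `String.fromCodePoint` (ES2015 or later); note that the combining mark is U+0303 COMBINING TILDE, not the ASCII character ~.

```javascript
// "ñ" as one precomposed code point versus "n" followed by a combining tilde.
const precomposed = '\u00F1';
const decomposed = 'n\u0303';
console.log(precomposed === decomposed); // false: different code point sequences
console.log(precomposed === decomposed.normalize('NFC')); // true after normalization

// Strings are sequences of 16-bit code units, so code points above U+FFFF take two units.
const emoji = String.fromCodePoint(0x1F600);
console.log(emoji.length); // 2

// String.fromCharCode keeps only the low 16 bits of its argument.
console.log(String.fromCharCode(0x1F600) === '\uF600'); // true
```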
Here are a couple of resources that should really help you in understanding this:
http://mathiasbynens.be/notes/javascript-encoding
http://mathiasbynens.be/notes/javascript-unicode
I am trying to develop a paint brush application using Processing.js.
This API has a function, loadPixels(), that loads the RGB values into an array.
Now I want to store the array in the server database.
The problem is the size of the array: when I convert it to a string, the size is 5 MB.
Is compression at the JavaScript level the best solution? How can I do it?
See http://rosettacode.org/wiki/LZW_compression#JavaScript for an LZW compression example. It works best on longer strings with repeated patterns.
From the Wikipedia article on LZW:
A dictionary is initialized to contain the single-character strings corresponding to all the possible input characters (and nothing else except the clear and stop codes if they're being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary. When such a string is found, the index for the string less the last character (i.e., the longest substring that is in the dictionary) is retrieved from the dictionary and sent to output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the next starting point to scan for substrings.
In this way, successively longer strings are registered in the dictionary and made available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message will see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum.
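The description above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the Rosetta Code implementation: it assumes every input character has a char code below 256, uses no clear or stop codes, and returns the codes as a plain array of numbers (packing them into bytes for storage is left out). The names `lzwCompress` and `lzwDecompress` are hypothetical.

```javascript
// Minimal LZW sketch. Assumes all input characters have char codes below 256.
function lzwCompress(input) {
  const dictionary = new Map();
  for (let i = 0; i < 256; i++) dictionary.set(String.fromCharCode(i), i);
  let nextCode = 256;
  let current = '';
  const output = [];
  for (const ch of input) {
    const candidate = current + ch;
    if (dictionary.has(candidate)) {
      current = candidate; // keep extending the match
    } else {
      output.push(dictionary.get(current)); // emit longest known substring
      dictionary.set(candidate, nextCode++); // register the new string
      current = ch; // restart from the last character
    }
  }
  if (current !== '') output.push(dictionary.get(current));
  return output;
}

function lzwDecompress(codes) {
  if (codes.length === 0) return '';
  const dictionary = new Map();
  for (let i = 0; i < 256; i++) dictionary.set(i, String.fromCharCode(i));
  let nextCode = 256;
  let previous = dictionary.get(codes[0]);
  let result = previous;
  for (let i = 1; i < codes.length; i++) {
    const code = codes[i];
    let entry;
    if (dictionary.has(code)) {
      entry = dictionary.get(code);
    } else if (code === nextCode) {
      entry = previous + previous[0]; // code defined by the step being decoded
    } else {
      throw new Error('Invalid LZW code: ' + code);
    }
    result += entry;
    dictionary.set(nextCode++, previous + entry[0]);
    previous = entry;
  }
  return result;
}

const sample = 'TOBEORNOTTOBEORTOBEORNOT';
const packed = lzwCompress(sample);
console.log(packed.length, lzwDecompress(packed) === sample);
```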
The question "JavaScript implementation of Gzip" has a couple of answers that are relevant.
Also, "Javascript LZW" and "Huffman Coding with PHP and JavaScript" are other implementations I found.