tunable diff algorithm

tunable diff algorithm - javascript

I'm interested in finding a more-sophisticated-than-typical algorithm for finding differences between strings, that can be "tuned" via some parameters, to balance between such things as "maximize count of identical characters" vs. "maximize the length of spans" vs. "try to keep whole words intact".
Ultimately, I want to be able to make the results as human readable as possible. For instance, if a long sentence has been replaced with an entirely new sentence, where the only things it has in common with the original are the words "the" "and" and "a" in that order, I might want it treated as if the whole sentence is changed, rather than just that 4 particular spans are changed --- just like how a reasonable person would see it.
Does such a thing exist? Although I'm working in javascript/node.js, an algorithm in any language would be helpful.
I'm actually ok with something that uses Monte Carlo methods or the like, if its results are better. Computation time is not an issue (within reason), nor is determinism.
Note: although this is beyond the scope of what I'm asking, I'll throw one more thing out there just in case: It would also be great if it could recognize changes that are out of order....for instance if someone changes the order of two paragraphs while leaving them otherwise identical, it would be awesome if it recognized it as a simple move, rather than as one subtraction and and one unrelated addition.

I've had good luck with diff_match_patch. There are some good options for tuning it for readability.

Try http://prettydiff.com/ Its code is already formatted for compatibility with CommonJS, which is the framework Node uses.

Related

Break this regex pattern

Sorry for the somewhat truthful misleading title.
I am currently trying to grasp several formats of dates, within a given sentence, will add more as time permits and neccessity arises, but for the most part this is what I have.
https://regex101.com/r/vV0uZ3/1
The main reason why I said break this, is because I can't think of any other types of dates that could mess this up, but I also feel that it is inefficient, semi new to regexes, and this seems to work, but I feel that there is a better manner in which to do it as well, this is somewhat time insensitive data processing, but speed is always a needed point in the end game.
Shortened scope, are there (base) formats of date inputs that this would not be able to pull in?
Also would it be better to use the case insensitive flag 'i' or pull in the full (base) alphabet into it [A-z], for time, and solution comprehension?
EDIT~~
I may be misrepresenting my question a bit, here goes.
I more or less will be reading in human inputs, and mostly american date formats within a document.
I have seen the comments made, and they are definitely sensible, just not too sure if at the moment those edge cases would be plausible, human randomness is a beautiful thing though, so side question, would having a regex that could catch all of these date formats(n/m/e | n-m-e | n m, e |...) heavily bog down my code in the long ru, and would the trade off for maleability be worth the slight shift in efficiency?

How to test an MD5 implementation?

I am considering using a JS MD5 implementation.
But I noticed that there are only a few tests. Is there a good way of verifying that implementation is correct?
I know I can try it with a few different values and see if it works, but that only means it is correct for some inputs. I would like to see if it is correct for all inputs.

The corresponding RFC has a good description of the algorithm, an example implementation in C, and a handful of test values at the end. All three together let you make a good guess about the quality of the examined implementation and that's all you can get: a good guess.
Testing an applications with an infinite or at least a very large input set as a black box is hard, impossible even in most cases. So you have to check if the code implements the algorithm correctly. The algorithm is described in RFC-3121 (linked to above). This description is sufficient for an implementation. The algorithm itself is well known (in the scientific sense, i.e.: many papers have been written about it and many flaws have been found) and simple enough to skip the formal part, just inspect the implementation.
Problems to expect with MD5 in JavaScript: input of one or more zero bytes (you can check the one and two bytes long inputs thoroughly), endianess (should be no problem but easy to check) and the problem of the unsigned integer used for bit-manipulation in JavaScript (">>" vs. ">>>" but also easy to check for). I would also test with a handful of data with all bits set.
The algorithm needs padding, too, you can check it with all possible input of length shorter than the limit.
Oh, and for all of you dismissing the MD5-hash: it still has its uses as a fast non-cryptographic hash with a low collision-rate and a good mixing (some call the effect of the mixing "avalanche", one bit change in the input changes many bits in the output). I still use it for larger, non-cryptographic Bloom-filters. Yes, one should use a special hash fitting the expected input but constructing such a hash function is a pain in the part of the body Nature gave us to sit on.

Is there a performance penalty using capture groups in RegExp#test?

Disclaimer: my question is not focused on the exercise, it's just an example (although if you have any interesting tips on the example itself, feel free to share!).
Say I'm working with parsing some strings with Regex in JavaScript, and my main focus is performance (speed).
I have a piece of regex which checks for a numeric string, and then parses it using Number if it's numeric:
if (/^\[[0-9]+]$/.test(str)) {
val = Number(str.match(/^\[([0-9]+)$/)[1]);
}
Note how the conditional test does not have a capture group around the digits. This leads to writing out basically the same regex twice, except with a capture group the second time.
What I would like to know is this; does adding a capture group to a regex used alongside test() in a condition affect performance in any way? I'd like to simply use the capture regex in both places, as long as there is no performance hit.
And to the question as why I'm doing test() then match() rather than match() and checking null; I want to keep parsing as fast as possible when there's a miss, but it's ok to be a little slower when there's a hit.
If it's not clear from the above, I'm referring to JavaScript's regex engine - although if this differs across engines it'd be nice to know too. I'm working specifically in Node.js here, should it also differ across JS engines.
Thanks in advance!

Doing 2 regexps - that are very similar in scope - will almost always be slower than doing a single one because regexps are greedy (that means that they will try to match as much as they can, usually meaning take the maximum amount of time possible).
What you're asking is basically: is the cost of fewer memory in the worst case scenario (aka using the .test to save on memory from capture) faster than just using the extra memory? The answer is no, using extra memory speeds up your process.
Don't take my word for it though, here's a jsperf: http://jsperf.com/regex-perf-numbers

Is there a way to match only top level parentheses with regex?

With Javascript, suppose I have a string like (1)(((2)(3))4), can I get a regex to match just (1) and (((2)(3))4), or do I need to do something more complicated?
Ideally the regex would return ["((2)(3))","4"] if you searched ((2)(3))4. Actually that's really a requirement. The point is to group things into the chunks that need to be worked on first, like the way parentheses work in math.

No, there is no way to match only top level parentheses with regex
Looking only at the top level doesn't make the problem easier than general "parsing" of recursive structures. (See this relevant popular SO question with a great answer).
Here's a simple intuitive reason why Regex can't parse arbitrary levels of nesting:
To keep track of the level of nesting, one must count. If one wants to be able to keep track of an arbitrary level of nesting, one needs an arbitrarily large number while running the program.
But regular expressions are exactly those that can be implemented by DFAs, that is Deterministice finite automatons. These have only a finite number of states. Thus they can't keep track of an arbitrarily large number.
This argument works also for your specific concern of being only interested in the top level parentheses.
To recognize the top level parentheses, you must keep track of arbitrary nesting preceding any one of them:
((((..arbitrarily deep nesting...))))((.....)).......()......
^toplevel ^^ ^ ^^
So yes, you need something more powerful than regex.
While if you are very pragmatic, for your concrete application it might be okay to say that you won't encounter any nesting deeper than, say, 1000 (and so you might be willing to go with regex), it's also a very practical fact that any regex recognizing a nesting level of more than 2 is basically unreadable.

Well, here is one way to do it. As Jo So pointed out, you can't really do it in javascript with indefinite amounts of recursion, but you can make something arbitrarily recursive pretty easily. I'm not sure how the performance scales though.
First I figured out that you need recursion. Then I realized that you can just make your regex 'recursive' by just copying and pasting recursively, like so (using curly braces for clarity):
Starting regex
Finds stuff in brackets that isn't itself brackets.
/{([^{}])*}/g
Then copy and paste the whole regex inside itself! (I spaced it out so you can see where it was pasted in.) So now it is basically like a( x | a( x )b )b
/{([^{}] | {([^{}])*} )*}/g
That will get you one level of recursion and you can continue ad nauseum in this fashion and actually double the amount of recursions each time:
//matches {4{3{2{1}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*}/g
//matches {8{7{6{5{4{3{2{1}}}}}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*})*})*})*})*}/g
Finally I just add |[^{}]+ on the end of the expression to match stuff that is completely outside of brackets. Crazy, but it works for my needs. I feel like there is probably some clever way to combine this concept with a recursive function in order to get a truly recursive matcher, but I can't think of it now.

If you can be sure that the parentheses are balanced (I'm sure there are other resources out there that can answer that question for you if required) and if by "top-level" you're happy to find local as well as global maxima then all you need to do is find any content that starts with an open bracket and closes with a close-bracket, with no intermediate open-bracket between the two:
I think the following should do that for you and helpfully group any "top-level" content:
\(([^\(]*?)\)
That content may not all be at the same "level", but if you think of the nested brackets as describing the branching of a tree, the regex will return to you the leaves. If you pre-process your text to be wrapped in parentheses to start with, and the earlier assumptions are met, you can guarantee always getting at least one "leaf".

What does sorting mean in non-alphabetic (i.e, Asian) languages?

I have some code that sorts table columns by object properties. It occurred to me that in Japanese or Chinese (non-alphabetical languages), the strings that are sent to the sort function would be compared the way an alphabetical language would.
Take for example a list of Japanese surnames:
寿拘 (Suzuki)
松坂 (Matsuzaka)
松井 (Matsui)
山田 (Yamada)
藤本 (Fujimoto)
When I sort the above list via Javascript, the result is:
寿拘 (Suzuki)
山田 (Yamada)
松井 (Matsui)
松坂 (Matsuzaka)
藤本 (Fujimoto)
This is different from the ordering of the Japanese syllabary, which would arrange the list phonetically (the way a Japanese dictionary would):
寿拘 (Suzuki)
藤本 (Fujimoto)
松井 (Matsui)
松坂 (Matsuzaka)
山田 (Yamada)
What I want to know is:
Does one double-byte character really get compared against the other in a sort function?
What really goes on in such a sort?
(Extra credit) Does the result of such a sort mean anything at all? Does the concept of sorting really work in Asian (and other) languages? If so, what does it mean and what should one strive for in creating a compare function for those languages?
ADDENDUM TO SUMMARIZE ANSWERS AND DRAW CONCLUSIONS:
First, thanks to all who contributed to the discussion. This has been very informative and helpful. Special shout-outs to bobince, Lie Ryan, Gumbo, Jeffrey Zheng, and Larry K, for their in-depth and thoughtful analyses. I awarded the check mark to Larry K for pointing me toward a solution my question failed to foresee, but I up-ticked every answer I found useful.
The consensus appears to be that:
Chinese and Japanese character strings are sorted by Unicode code points, and their ordering may be predicated on a rationale that may be in some way intelligible to knowledgeable readers but is not likely to be of much practical value in helping users to find the information they're seeking.
The kind of compare function that would be required to make a sort semantically or phonetically useful is far too cumbersome to consider pursuing, especially since the results would probably be less than satisfactory, and in any case the comparison algorithms would have to be changed for each language. Best just to allow the sort to proceed without even attempting a compare function.
I was probably asking the wrong question here. That is, I was thinking too much "inside the box" without considering that the real question is not how do I make sorting useful in these languages, but how do I provide the user with a useful way of finding items in a list. Westerners automatically think of sorting for this purpose, and I was guilty of that. Larry K pointed me to a Wikipedia article that suggests a filtering function might be more useful for Asian readers. This is what I plan to pursue, as it's at least as fast as sorting, client-side. I will keep the column sorting because it's well understood in Western languages, and because speakers of any language would find the sorting of dates and other numerical-based data types useful. But I will also add that filtering mechanism, which would be useful in long lists for any language.

Does one double-byte character really get compared against the other in a sort function?
The native String type in JavaScript is based on UTF-16 code units, and that's what gets compared. For characters in the Basic Multilingual Plane (which all these are), this is the same as Unicode code points.
The term ‘double-byte’ as in encodings like Shift-JIS has no meaning in a web context: DOM and JavaScript strings are natively Unicode, the original bytes in the encoded page received by the browser are long gone.
Does the result of such a sort mean anything at all?
Little. Unicode code points do not claim to offer any particular ordering... for one, because there is no globally-accepted ordering. Even for the most basic case of ASCII Latin characters, languages disagree (eg. on whether v and w are the same letter, or whether the uppercase of i is I or İ). And CJK gets much gnarlier than that.
The main Unicode CJK Unified Ideographs block happens to be ordered by radical and number of strokes (Kangxi dictionary order), which may be vaguely useful. But use characters from any of the other CJK extension blocks, or mix in some kana, or romaji, and there will be no meaningful ordering between them.
The Unicode Consortium do attempt to define some general ordering rules, but it's complex and not generally attempted at a language level. Systems that really need language-sensitive sorting abilities (eg. OSes, databases) tend to have their own collation schemes.
This is different from the ordering of the Japanese syllabary
Yes. Above and beyond collation issues in general, it's a massively difficult task to handle kanji accurately by syllable, because you have to guess at the pronunciation. JavaScript can't realistically know that by ‘藤本’ you mean ‘Fujimoto’ and not ‘touhon’; this sort of thing requires in-depth built-in dictionaries and still-unreliable heuristics... not the sort of thing you want to build in to a programming language.

You could implement the Unicode Collation Algorithm in Javascript if you want something better than the default JS sort for strings. Might improve some things. Though as the Unicode doc states:
Collation is not uniform; it varies
according to language and culture:
Germans, French and Swedes sort the
same characters differently. It may
also vary by specific application:
even within the same language,
dictionaries may sort differently than
phonebooks or book indices. For
non-alphabetic scripts such as East
Asian ideographs, collation can be
either phonetic or based on the
appearance of the character.
The Wikipedia article points out that since collation is so tough in non-alphabetic scripts, now a days the answer is to make it very easy to look up information by entering characters, rather than by looking through a list.
I suggest that you talk to truly knowledgeable end users of your application to see how they would best like it to behave. The problem of ordering Chinese characters is not unique to your application.
Also, if you don't want to implement the collation in your system, another solution would for you to create a Ajax service that stores the names in a MySql or other database, then looks up the data with an order statement.

Strings are compared character by character where the code point value defines the order:
The comparison of strings uses a simple lexicographic ordering on sequences of code point value values. There is no attempt to use the more complex, semantically oriented definitions of character or string equality and collating order defined in the Unicode specification. Therefore strings that are canonically equal according to the Unicode standard could test as unequal. In effect this algorithm assumes that both strings are already in normalised form.
If you need more than this, you will need to use a string comparison that can take collations into account.

Others have answered the other questions, I will take on this one:
what should one strive for in creating a
compare function for those languages?
One way to do it is that, you will need to create a program that can "read" the characters; that is, able to map hanzi/kanji characters to their "sound" (pinyin/hiragana reading). At the simplest level, this means a database that maps hanzi/kanji to sounds. Of course this is more difficult than it sounds (pun not intended), since a lot of characters can have different pronunciations in different contexts, and Chinese have many different dialects to consider.
Another way, is to order by stroke order. This means there would need to be a database that maps hanzi/kanji to their strokes. Another problem: Chinese and Japanese writes in different stroke orders. However, aside from Japanese and Chinese difference, using stroke ordering is much more consistent within a single text, since hanzi/kanji characters are almost always written using the same stroke order irrespective of what they meant or how they are read. A similar idea is to sort by radicals instead of plain stroke orders.
The third way, is sorting by Unicode code points. This is simple, and always gives undisputably consistent ordering; however, the problem is that the sort order is meaningless for human.
The last way is to rethink about the need for absolute ordering, and just use some heuristic to sort by relevance to the user's need. For example, in a shopping cart software, you can sort depending on user's buying habits or by price. This kinda avoids the problem, but most of the time it works (except if you're compiling a dictionary).
As you notice, the first two methods require creating a huge database of one-to-many mapping, but they still doesn't always give a useful result. The third method also require a huge database, but many programming languages already have this database built into the language. The last way is a bit of heuristic, probably most useful, however they are doomed to never give consistent ordering (much worse than the first two method).

Yes, the characters get compared. They are usually compared based on their Unicode code points, though, which are quite different between hiragana and kanji -- making the sort potentially useless in Japanese. (Kanji borrowed from Chinese, but the order they'd appear in Chinese doesn't correspond to the order of the hiragana that'd represent the same meaning). There are collations that could render some of the characters "equal" for comparison purposes, but i don't know if there's one that'll consider a kanji to be equivalent to the hiragana that'd comprise its pronunciation -- especially since a character can have a number of different pronunciations.
In Chinese or Korean, or other languages that don't have 3 different alphabets (one of which is quite irregular), it'd probably be less of an issue.

Those are sorted by codepoint value, ascending. This is certainly meaningless for human readers. It's not impossible to devise a sensible sorting scheme for Japanese, but sorting Chinese characters is hard (partly because we don't necessarily know whether we're looking at Japanese or Chinese), and lot of programmers punt to this solution.

The normal string comparison functions in many programming languages are designed to ensure that strings can be sorted into a unique order, to allow algorithms like binary search and duplicate-detection to work correctly. To sort data in a fashion meaningful to a human reader, one must know what the data represents. For example, in a list of English movie titles, "El Mariachi" would typically sort under "E", but in a list of Spanish movie titles it would sort under "M". The application will need information beyond that contained in the strings themselves to know how the strings should be sorted.

The answers to Q1 (can you sort) and Q3 (is sort meaningful) are both "yes" for Chinese (from a mainland perspective). For Q2 (how to sort):
All Chinese characters have definite pronunciation (some are polyphonic) as defined in pinyin, and it's far more common (as in virtually all Chinese dictionaries) to sort by pinyin, where there is no ambiguity. Characters with the same pronunciation are then sorted by stroke order.
The polyphonic characters pose extra challenge for sorting, as their pinyin usually depends on the word they are in (I heard Japanese characters could be even more hairy). For example, the character 阿 is pronounced a(1) in 阿姨 (tone in parenthesis), and e(1) in 阿胶. So if you need to sort words or sentences, you cannot simply look at one character at a time from each item.

Recall that in JavaScript, you can pass into sort() a function in which you can implement sort yourself, in order to achieve a sort that matters to humans:
myarray.sort(function(a,b){
//return 0, 1, or -1 based on the comparison of the two strings
});

Develop Reference

JavaScript is the programming language of the Web.