Dealing with illegal characters (apostrophes) in a TXT file with Node.js

Dealing with illegal characters (apostrophes) in a TXT file with Node.js - javascript

I am relying on .txt files being sent externally in Node.js that sometimes have what i would class as "illegal" characters such as apostrophes and commas resulting in copying and pasting from webpages and programs such as Microsoft Word
How can I get Node.js or use Javascript to replace these incorrect formats such as apostrophes with correctly formatted apostrophes or strip out any illegal characters full stop?
Here is an example from a web page and shown in PasteBin:
Resilience is what happens when we’re able to move forward even when things don’t fit together the way we expect.
And tolerances are an engineer’s measurement of how well the parts meet spec. (The word ‘precision’ comes to mind). A 2018 Lexus is better than 1968 Camaro because every single part in the car fits together dramatically better. The tolerances are more narrow now.
One way to ensure that things work out the way you hope is to spend the time and money to ensure that every part, every form, every worker meets spec. Tighten your spec, increase precision and you’ll discover that systems become more reliable.
The other alternative is to embrace the fact that nothing is ever exactly on spec, and to build resilient systems.
You’ll probably find that while precision feels like the way forward, resilience, the ability to thrive when things go wrong, is a much safer bet.
The trap? Hoping for one, the other or both but not doing the work to make it likely. What will you do when it doesn’t work?
Neither resilience nor tolerances get better on their own.
https://pastebin.com/uJ7GAKk4
Copied from the following URL and pasted into Notepad and saved
https://seths.blog/storyoftheweek/

You could use a RegExp to remove the unwanted characters
// text is the pasted text
var filtered = text.replace(/[',]/gm, '');

Related

JS lexing---multi line string

I am making a JS lexer as part of my study. In JS, single line stings start from " or ' and ends with the same character except if that character is preceded by a backslash.
In my current code, I loop through every character and append them to existing tokens based on flags like "string" or "regex". so it feels natural to implement multi line string with " or ' because it seems that it does not affect any other part of my lexer
Is there any practical reason why new line is not allowed as contents of strings?

Many languages, but not all, prohibit unescaped newlines in string literals. So JavaScript is certainly not unique here.
But the motivation really has little to do with the ease, difficulty or efficiency of lexical analysis. In fact, for lexical analysis the simplest syntax is to allow any character rather than having to include special-case checks. [Note 1]
There are other considerations, though; notably, the importance of a program to be readable and easy to debug. Long strings put an extra load on someone reading the code, because they may not be aware that a section of program text is actually part of a string literal. (There's a similar problem with multiline comments, which is why it's usually considered good style to mark every line in a long comment in some way, for example with a vertical column of stars at the left-hand margin. No such solution exists for string literals, though.)
Also, unterminated multiline strings can be annoying to correct. If strings are cannot span lines, the error will be detected on the line containing the problem. But multiline strings might continue until the beginning of the next string, then triggering a syntax error when the contents of the next string are accidentally parsed as program text. Or worse, resulting in a completely incorrect parse of what was supposed to be program text, followed by another incorrect string literal starting where the second literal ends, and continuing from there.
That also makes it hard for developer tools, such as editors and syntax highlighters, to deal with program text as it is being typed.
In the end, you may or may not find these arguments compelling, and a language designer might have other aesthetic preferences as well. I can't really speak for the original designers of the JavaScript language, and neither of us can take a voyage in time to argue with them and maybe change their decision.
For better or worse, languages are designed according to particular subjective judgements, and if the language is successful these judgements become permanent features. They are things you have to accept if you are using a language and they're not usually worth obsessing about. You get used to them, or you find a different language to program in, with its own syntax quirks.
When you design your own language, you will need to resolve a large number of syntactic questions, and you will undoubtedly run into cases where the answer is not clearcut because there is no objectively correct unique solution. Whatever you do, someone will want to argue with you. Perhaps you can refer them to this answer.
Notes:
There is actually a historic reason for not allowing multiline string literals, which is much clearer but has been more or less irrelevant for several decades.
Once Upon A Time, common filesystems considered text files to be linear arrays of fixed-length lines (often 80 character lines, matching a Hollerith card). One advantage of such a filesystem is that it could instantly navigate to a particular line number in a file, since all lines were the same length. But in any case, for systems where programs were entered on punched cards, the fixed length lines were just part of the environment.
To make all lines the same length, lines needed to be filled out with space characters. This would obviously make multiline string literals awkward, and that's why C never allowed multiline string literals, instead relying on a syntactic feature where consecutive string literals are automatically concatenated into a single literal.
In the end, fixed-line-length filesystems proved to be unpopular, and I don't think you're likley to run into one these days. But a careful reading of the C and Posix standards shows that such filesystems must still be usable by conforming implementations, with the consequence that a fully portable program must be prepared to deal with line length limits on output and trailing whitespace on input.

There is also such syntax
const string =
'line1\
line2\
line3'

Spaces in equal signs

I'm just wondering is there a difference in performance using removing spaces before and after equal signs. Like this two code snippets.
first
int i = 0;
second
int i=0;
I'm using the first one, but my friend who is learning html/javascript told me that my coding is inefficient. Is it true in html/javascript? And is it a huge bump in the performance? Will it also be same in c++/c# and other programming languages? And about the indent, he said 3 spaces is better that tab. But I already used to code like this. So I just want to know if he is correct.

Your friend is a bit misguided.
The extra spaces in the code will make a small difference in the size of the JS file which could make a small difference in the download speed, though I'd be surprised if it was noticeable or meaningful.
The extra spaces are unlikely to make a meaningful difference in the time to parse the file.
Once the file is parsed, the extra spaces will not make any difference in execution speed since they are not part of the parsed code.
If you really want to optimize download or parse speed, the way to do that is to write your code in the most readable fashion possible for best maintainability and then use a minimizer for the deployed code and this is a standard practice by many web sites. This will give you the best of both worlds - maintainable, readable code and minimum deployed size.
A minimizer will remove all unnecessary spacing, shorten the names of variables, remove comments, collapse lines, etc... all designed to make the deployed code as small as possible without changing the run-time meaning of the code at all.
C++ is a compiled language. As such, only the compiler that the developer uses sees any extra spaces (same with comments). Those spaces are gone once the code has been compiled into native code which is what the end-user gets and runs. So, issues about spaces between elements in a line are simply not applicable at all for C++.
Javascript is an interpreted language. That means the source code is downloaded to the browser and the browser then parses the code at runtime into some opcode form that the interpreter can run. The spaces in Javascript will be part of the downloaded code (if you don't use a minimizer to remove them), but once the code is parsed, those extra spaces are not part of the run-time performance of the code. Thus, the spaces could have a small influence on the download time and perhaps an even smaller influence on the parse time (though I'm guessing unlikely to be measurable or meaningful). As I said above, the way to optimize this for Javascript is to use spaces to enhance readability in the source code and then run a minimizer over the code to generate a deployed version of the code to minimize the deployed size of the file. This preserves maximum readability and minimizes download size.

There is little (javascript) to no (c#, c++, Java) difference in performance. In the compiled languages in particular, the source code compiles to the exact same machine code.
Using spaces instead of tabs can be a good idea, but not because of performance. Rather, if you aren't careful, use of tabs can result in "tab rot", where there are tabs in some places and spaces in others, and the indentation of the source code depends on your tab settings, making it hard to read.

tunable diff algorithm

I'm interested in finding a more-sophisticated-than-typical algorithm for finding differences between strings, that can be "tuned" via some parameters, to balance between such things as "maximize count of identical characters" vs. "maximize the length of spans" vs. "try to keep whole words intact".
Ultimately, I want to be able to make the results as human readable as possible. For instance, if a long sentence has been replaced with an entirely new sentence, where the only things it has in common with the original are the words "the" "and" and "a" in that order, I might want it treated as if the whole sentence is changed, rather than just that 4 particular spans are changed --- just like how a reasonable person would see it.
Does such a thing exist? Although I'm working in javascript/node.js, an algorithm in any language would be helpful.
I'm actually ok with something that uses Monte Carlo methods or the like, if its results are better. Computation time is not an issue (within reason), nor is determinism.
Note: although this is beyond the scope of what I'm asking, I'll throw one more thing out there just in case: It would also be great if it could recognize changes that are out of order....for instance if someone changes the order of two paragraphs while leaving them otherwise identical, it would be awesome if it recognized it as a simple move, rather than as one subtraction and and one unrelated addition.

I've had good luck with diff_match_patch. There are some good options for tuning it for readability.

Try http://prettydiff.com/ Its code is already formatted for compatibility with CommonJS, which is the framework Node uses.

Syntax / Logical checker In Javascript?

I'm building a solution for a client which allows them to create very basic code,
now i've done some basic syntax validation but I'm stuck at variable verification.
I know JSLint does this using Javascript and i was wondering if anyone knew of a good way to do this.
So for example say the user wrote the code
moose = "barry"
base = 0
if(moose == "barry"){base += 100}
Then i'm trying to find a way to clarify that the "if" expression is in the correct syntax, if the variable moose has been initialized etc etc
but I want to do this without scanning character by character,
the code is a mini language built just for this application so is very very basic and doesn't need to manage memory or anything like that.
I had thought about splitting first by Carriage Return and then by Space but there is nothing to say the user won't write something like moose="barry" or if(moose=="barry")
and there is nothing to say the user won't keep the result of a condition inline.
Obviously compilers and interpreters do this on a much more extensive scale but i'm not sure if they do do it character by character and if they do how have they optimized?
(Other option is I could send it back to PHP to process which would then releave the browser of responsibility)
Any suggestions?
Thanks
The use case is limited, the syntax will never be extended in this case, the language is a simple scripted language to enable the client to create a unique cost based on their users input the end result will be processed by PHP regardless to ensure the calculation can't be adjusted by the end user and to ensure there is some consistency.
So for example, say there is a base cost of £1.00
and there is a field on the form called "Additional Cost", the language will allow them manipulate the base cost relative to the "additional cost" field.
So
base = 1;
if(additional > 100 && additional < 150){base += 50}
elseif(additional == 150){base *= 150}
else{base += additional;}
This is a basic example of how the language would be used.
Thank you for all your answers,
I've investigated a parser and creating one would be far more complex than is required
having run several tests with 1000's of lines of code and found that character by character it only takes a few seconds to process even on a single core P4 with 512mb of memory (which is far less than the customer uses)
I've decided to build a PHP based syntax checker which will check the information and convert the variables etc into valid PHP code whilst it's checking it (so that it's ready to be called later without recompilation) using this instead of javascript this seems more appropriate and will allow for more complex code to arise without hindering the validation process
It's only taken an hour and I have code which is able to check the validity of an if statement and isn't confused by nested if's, spaces or odd expressions, there is very little left to be checked whereas a parser and full blown scripting language would have taken a lot longer
You've all given me a lot to think about and i've rated relevant answers thank you

If you really want to do this — and by that I mean if you really want your software to work properly and predictably, without a bunch of weird "don't do this" special cases — you're going to have to write a real parser for your language. Once you have that, you can transform any program in your language into a data structure. With that data structure you'll be able to conduct all sorts of analyses of the code, including procedures that at least used to be called use-definition and definition-use chain analysis.
If you concoct a "programming language" that enables some scripting in an application, then no matter how trivial you think it is, somebody will eventually write a shockingly large program with it.
I don't know of any readily-available parser generators that generate JavaScript parsers. Recursive descent parsers are not too hard to write, but they can get ugly to maintain and they make it a little difficult to extend the syntax (esp. if you're not very experienced crafting the original version).

You might want to look at JS/CC which is a parser generator that generates a parser for a grammer, in Javascript. You will need to figure out how to describe your language using a BNF and EBNF. Also, JS/CC has its own syntax (which is somewhat close to actual BNF/EBNF) for specifying the grammar. Given the grammer, JS/CC will generate a parser for that grammar.
Your other option, as Pointy said, is to write your own lexer and recursive-descent parser from scratch. Once you have a BNF/EBNF, it's not that hard. I recently wrote a parser from an EBNF in Javascript (the grammar was pretty simple so it wasn't that hard to write one YMMV).
To address your comments about it being "client specific". I will also add my own experience here. If you're providing a scripting language and a scripting environment, there is no better route than an actual parser.
Handling special cases through a bunch of if-elses is going to be horribly painful and a maintenance nightmare. When I was a freshman in college, I tried to write my own language. This was before I knew anything about recursive-descent parsers, or just parsers in general. I figured out by myself that code can be broken down into tokens. From there, I wrote an extremely unwieldy parser using a bunch of if-elses, and also splitting the tokens by spaces and other characters (exactly what you described). The end result was terrible.
Once I read about recursive-descent parsers, I wrote a grammar for my language and easily created a parser in a 10th of the time it took me to write my original parser. Seriously, if you want to save yourself a lot of pain, write an actual parser. If you go down your current route, you're going to be fixing issues forever. You're going to have to handle cases where people put the space in the wrong place, or perhaps they have one too many (or one too little) spaces. The only other alternative is to provide an extremely rigid structure (i.e, you must have exactly x number of spaces following this statement) which is liable to make your scripting environment extremely unattractive. An actual parser will automatically fix all these problems.

Javascript has a function 'eval'.
var code = 'alert(1);';
eval(code);
It will show alert. You can use 'eval' to execute basic code.

Network-efficient difference between two strings in Javascript

I have a web application where a client side editor is editing a really really large text which is known on the server side.
The client can make any kind of modifications to this text.
What is the most network-efficient way to transmit the result difference in a way that the server understands? Also, since this will happen on client side (Javascript), I would also like it to be 'fast' (or at least not noticeably slow)
Some scenarios:
User modifies ONE character
User modifies several sentences in random positions
User erases everything and results in a blank text.
I cannot use diff-like syntax since it's not network efficent, it checks lines, where examples 1 and 3 will produce horrible differences (especially the last one, where the result will be more than the old itself).
Anyone has experience in this matter? User operates on a really large set of data - around 3-5MB of text, and uploading the whole "new" content is a big no-no.
To be clear, I'm looking for a "protocol" of transfer, string comparison is not the issue.

I'm not very familiar with this topic but I can point you to an open source (Apache License 2.0) project which may be very useful.
It is a Diff, Match and Patch library written in several languages, including JavaScript, from a Google engineer and it is used in several online collaborative editing services.
Here are a list of resources:
The Diff, Match and Patch project
The MobWrite project (Editor implementation based on the above project)
"Differential Synchronization" (A Google Tech Talk by the engineer)

A simple approach, assuming that you know the copy on the server isn't going to change, would just be to send a list of edits (deletions and additions), with the deletions represented as a start and end index, and the additions represented as a start index and the text to insert.
If you have more than a simple diff algorithm to work with (I'm not sure exactly what you mean by "string comparison is not the issue"), you could also detect moved or copied chunks of text, and send those as the start and end index of the moved or copied piece of text, as well as the destination to insert it.
Note that you'll need to make sure to keep track of whether your indices refer to the original document, or the document as edited so far. An easy approach to avoid this problem is to always perform the edits from the end of the document towards the beginning; then earlier edits won't affect the offsets specified by later edits.
For an example of an approach like this, see the ed format that diff -e outputs. This is basically input that could be fed into the ed line-oriented text editor. If you want the absolute smallest diffs to send across you may want to do character based indexing rather than line based indexing, but the same basic approach could work.

Any edits the user's performing can be efficiently broken down into: delete from X for length Y; insert at X text "whatever". X and Y are offsets in characters from the start of the text; Y is a number of characters; "whatever" is any string of characters. You say you need no help computing the diff, but an example is here, except it's richer in its output than you need, but does identify "removals and insertions", so, just change the output part.
The exact format in which you send the data to the server can be tuned, but I don't think there's much mileage in doing that -- pending measurement, I'd start by sending the commands as D for delete or I for insert, the numbers in decimal, the inserted string in quoted form. Once you have some statistics on actual transfers being performed, you can see how much overhead is in the numbers (decimal vs binary) and quotes, but I suspect that may not be all that meaningful (if it proves to be, there are all sort of things you can try, such as giving offsets from the latest point of insertion or deletion, rather than always from the start, to make things faster).
You can sample what the user is doing every few seconds, and just send the incremental changes over those last few seconds (if any) -- this way, each packet you're sending will be small, and if the net connection or the user's computer/browser crash, the user won't have lost much work.

You could just send changes every 500ms, so, whatever changes were made in the last 500ms would be sent, but you only send data when there was a change.
In this you could then send the position of the changed word(s) and just send the entire word, but I would have the position be from the front of the text.
It won't be several sentences worth, but there may be several words involved, but, if you send them in order of change then the result should be consistent.

Because there are so many ways to do edits--even within short periods of time like 500ms--including dragging and dropping, or cutting and pasting, large sections of text around within the document or from outside it--I don't know if there's going to be something that will cover all scenarios really well. This is certainly a non-answer to your question at face value, but I would consider carefully the trouble of developing and maintaining something like this compared to changing the interface to restrict the text size and breaking existing texts into smaller pieces.
Maybe that's not possible in your situation, but if it is, I would guess it would be much less trouble in the end to dodge the issue in this way and just send full documents after an edit.

Develop Reference

JavaScript is the programming language of the Web.