Find what has been changed and upload only changes

I'm just looking for ideas/suggestions here; I'm not asking for a full-on solution (although if you have one, I'd be happy to look at it).
I'm trying to find a way to only upload changes to text. It's most likely going to be used as a cloud-based application running on jQuery and HTML, with a PHP server running the back-end.
For example, if I have text like
asdfghjklasdfghjkl
And I change it to
asdfghjklXasdfghjkl
I don't want to have to upload the whole thing (the text can get pretty big).
For example, something like 8,X sent to the server could signify:
add an X to the 8th position
Or D8,3 could signify:
go to position 8 and delete the previous 3 terms
However, if a single request is corrupted en route to the server, the whole document could be corrupted since the positions would be changed. A simple hash could detect corruption, but then how would one go about recovering from the corruption? The client will have all of the data, but the data is possibly very large, and re-uploading all of it is unlikely to be feasible.
So thanks for reading through this. Here is a short summary of what needs suggestions:
Change/Modification Detection
Method to communicate the changes
Recovery from corruption
Anything else that needs improvement

There is already an accepted form for transmitting this kind of "differences" information. It's called Unified Diff.
The google-diff-match-patch library provides implementations in Java, JavaScript, C++, C#, Lua and Python.
You should be able to just keep the "original text" and the "modified text" in variables on the client, then generate the diff in JavaScript (via diff-match-patch), send it to the server along with a hash, and reconstruct the modified text on the server (using either diff-match-patch or the Unix "patch" program).
You might also want to consider including a "version" (or a modified date) when you send the original text to the client in the first place. Then include the same version (or date) in the "diff request" that the client sends up to the server. Verify the version on the server prior to applying the diff, so as to be sure that the server's copy of the text has not diverged from the client's copy while the modification was being made. (Of course, in order for this to work, you'll need to update the version number on the server every time the master copy is updated.)
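As a rough sketch of the version-plus-hash idea above, here is a self-contained example. It does not use diff-match-patch; instead it describes the edit as a single splice found by trimming the common prefix and suffix, which is a much cruder technique. All function and field names (checksum, makeDelta, applyDelta, baseVersion, resultChecksum) are made up for illustration, and the FNV-1a hash is only an assumed stand-in for whatever hash you prefer.

```javascript
// 32-bit FNV-1a hash, used here only as a cheap integrity check (assumption:
// any hash that client and server both compute the same way would do).
function checksum(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Client side: describe the edit as one splice by trimming the common
// prefix and suffix of the two texts.
function makeDelta(original, modified, version) {
  let start = 0;
  const maxStart = Math.min(original.length, modified.length);
  while (start < maxStart && original[start] === modified[start]) start++;

  let endOrig = original.length;
  let endMod = modified.length;
  while (endOrig > start && endMod > start &&
         original[endOrig - 1] === modified[endMod - 1]) {
    endOrig--;
    endMod--;
  }
  return {
    baseVersion: version,             // version the client started from
    position: start,                  // where the splice begins
    deleteCount: endOrig - start,     // characters removed from the original
    insert: modified.slice(start, endMod),
    resultChecksum: checksum(modified),
  };
}

// Server side: refuse stale or corrupted deltas instead of applying them.
function applyDelta(doc, delta) {
  if (delta.baseVersion !== doc.version) {
    return { ok: false, reason: "version-mismatch" };
  }
  const text = doc.text.slice(0, delta.position) + delta.insert +
    doc.text.slice(delta.position + delta.deleteCount);
  if (checksum(text) !== delta.resultChecksum) {
    return { ok: false, reason: "checksum-mismatch" };
  }
  return { ok: true, doc: { text, version: doc.version + 1 } };
}

const delta = makeDelta("asdfghjklasdfghjkl", "asdfghjklXasdfghjkl", 1);
const result = applyDelta({ text: "asdfghjklasdfghjkl", version: 1 }, delta);
```

Because a rejected delta leaves the server copy untouched, the client can simply recompute the delta against the last version the server confirmed and send it again, rather than re-uploading the whole text.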

You have a really interesting approach. But if the text files are really so large that uploading them every time would take too long, why do you have to send the whole thing to the client? Does the client really have to receive the whole 5 MB text file? Wouldn't it be possible to send only what the client needs?
Anyway, to your question:
The first thing that comes to my mind when hearing "large text files" and modification detection is diff. For the algorithm, read here. This could be an approach to committing the changes, and it specifies a format for them. You'd just have to rebuild diff (or a part of it) in JavaScript. This will not be easy, but I think it is possible. If the algorithm doesn't help you, the definition of the diff file format might.
On the corruption issue: you don't have to fear that your data gets corrupted on the way, because TCP, on which HTTP is based, makes sure that everything arrives uncorrupted. What you should fear is a connection reset. Maybe you can do something like a handshake? When the client sends an update to the server, the server applies the modifications and keeps one old version of the file. To ensure that the client has received the server's confirmation that the modification went fine (that's where the connection reset can happen), the client sends another Ajax request back to the server. If this request doesn't reach the server within some defined time, the file gets reset on the server side.
Another thing: I don't know whether JavaScript copes well with such gigantic files/data...

This sounds like a problem that versioning systems (CVS, SVN, Git, Bazaar) already solve very well.
They're all reasonably easy to set up on a server, and you can communicate with them through PHP.
After the setup, you'd get for free: versioning, log, rollback, handling of concurrent changes, proper diff syntax, tagging, branches...
You wouldn't get the 'send just the updates' functionality that you asked for. I'm not sure how important that is to you. Pure texts are really very cheap to send as far as bandwidth is concerned.
Personally, I would probably make a compromise similar to what Wikis do. Break down the whole text into smaller semantically coherent chunks (chapters, or even paragraphs), determine on the client side just which chunks have been edited (without going down to the character level), and send those.
The server could then answer with a diff, generated by your versioning system, which is something they do very efficiently. If you want to allow concurrent changes, you might run into cases where editors have to do manual merges, anyway.
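The chunk-level comparison described above could look roughly like this. It is only a sketch under assumptions: chunks are paragraphs separated by blank lines, chunks are compared by position, and the names splitIntoChunks, changedChunks and applyChunks are invented here. Comparing by position means that inserting a paragraph in the middle marks every later chunk as changed, which a real implementation would probably want to avoid.

```javascript
// Split text into paragraph chunks on blank lines (an assumed chunking rule).
function splitIntoChunks(text) {
  return text.split(/\n\s*\n/);
}

// Client side: collect the chunks that differ from the last saved copy.
function changedChunks(savedText, editedText) {
  const saved = splitIntoChunks(savedText);
  const edited = splitIntoChunks(editedText);
  const changes = [];
  for (let i = 0; i < edited.length; i++) {
    if (edited[i] !== saved[i]) changes.push({ index: i, text: edited[i] });
  }
  // chunkCount lets the server drop chunks that were deleted from the end.
  return { chunkCount: edited.length, changes };
}

// Server side: patch the stored copy with the received chunks.
function applyChunks(savedText, update) {
  const chunks = splitIntoChunks(savedText).slice(0, update.chunkCount);
  for (const change of update.changes) chunks[change.index] = change.text;
  return chunks.join("\n\n");
}

const savedCopy = "Intro\n\nBody one\n\nEnd";
const editedCopy = "Intro\n\nBody two\n\nEnd";
const update = changedChunks(savedCopy, editedCopy);
const patchedCopy = applyChunks(savedCopy, update);
```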
Another general hint might be to look at what Google did with Wave. I have to remain general here, because I haven't really studied it in detail myself, but I seem to remember that there have been a few articles about how they've solved the real-time concurrent editing problem, which seems to be exactly what you'd like to do.
In summary, I believe the problem you're planning to tackle is far from trivial, there are tools that address many of the associated problems already, and I personally would compromise and reformulate the approach in favor of much less workload.

Related

How can I have a page showing reservations update when a customer adds a reservation from another computer (using Rails)?

I would like to have a page where a restaurant can log in and see all of their current reservations/take-out orders, and I want this page to automatically update when someone (from another computer) makes a reservation or places an order. The idea is that the restaurant would leave this page open at all times to show their current status. What is the best way to do this? Can it be done without refreshing the page?
I wasn't even sure how to refer to a setup like this, so I wasn't really able to find much using Google. Is there a word for this type of setup?
I am using rails, and I am considering using AngularJS for the front end. Any suggestions?

There are two approaches to solving this.
The first, oldest, simplest is that your webpage contains some javascript that will poll the server at regular intervals (e.g. every 10-30 seconds), to check if something has changed and then add the changed data (e.g. reload a partial).
The second approach is a bit cleaner, and it allows the server to push the changed data to the connected clients, only when it is changed.
There are a few available approaches/libraries for this:
use websockets
use pusher
use juggernaut (its author has deprecated it in favor of HTML5 SSE, server-sent events)
The advantage of polling is that it is easy and works in every browser. On the other hand, you have to write more code yourself, and you put some load on your server even when the data has not changed (although the load is minimal).
The push technologies are newer, work very cleanly, and need less code. But some work only in newer browsers (most of the time not really an issue), and some require extra support or setup on the server side.
On that note: pusher is really easy to get started with, and if your load is limited, it is free.
There are still a lot of others, but this should get you started in the right direction.
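To make the polling approach a bit more concrete, here is a minimal sketch. It assumes each reservation has a unique id, and the names findNewReservations, createPoller, fetchReservations and render are hypothetical; the actual request to the Rails back end and the page update are left to the two callbacks.

```javascript
// Pure helper: given the IDs already shown and the latest list from the
// server, return only the reservations that still need to be added.
function findNewReservations(shownIds, latest) {
  return latest.filter((reservation) => !shownIds.has(reservation.id));
}

// Builds one polling step. `fetchReservations` (returns a promise of the
// current list) and `render` (adds rows to the page) are hypothetical
// callbacks supplied by the page.
function createPoller(fetchReservations, render) {
  const shownIds = new Set();
  return async function poll() {
    const latest = await fetchReservations();
    const fresh = findNewReservations(shownIds, latest);
    fresh.forEach((reservation) => shownIds.add(reservation.id));
    if (fresh.length > 0) render(fresh);
    return fresh.length;
  };
}

// In the page, something like this would run the step every 15 seconds:
// setInterval(createPoller(fetchFromServer, addRows), 15000);
```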

Node Packages vs Browser ones

For example, packages like highlight.js work in Node just like in the browser. What is considered best practice/faster/ideal?
In this case, highlight.js beautifies a <code> tag with color schemes. Example: In a blog where you use it, there are 2 cases:
Fetch post, show post to user and let the browser/client version beautify the code, or
Fetch post, pass the contents to the highlight node function, and show the entire results to the user.
My concerns:
Free up server stress. Show website earlier, since it doesn't need to parse any data.
Avoid browser incompatibility (not a big deal tbh).
Save some static requests if not using CDN. Maybe faster?
I don't know what else I'm missing or what should be considered. What do you think?
PS: Every day more packages are browser/node compatible, but I think this is the best example I can provide.

The answer to that question can vary, but I would prefer to do it on the client side. Here are some pros and cons of the client-side route:
PRO: The one you mentioned, server load reduced. Remember, you're paying for your server and your client is paying for the connection (sometimes figuratively, as in wait time). If you process server-side, you pay more; if you process client-side, the client pays more. I would let the client pay!
CON: On the other hand, the syntax highlighting will load faster if you process server-side, because you can process once then cache for all subsequent clients.
CON: Browser incompatibility, like you said.
PRO: Semantics. You're augmenting the raw data with highlighting, rather than having the raw data strung up between <span>s. Think about non-JS machines trying to process your page.
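For comparison, the "process once then cache" point on the server-side route could be sketched like this. The highlight function below is only a stand-in and is not the highlight.js API; createHighlightCache and its methods are invented names, and the cache is a plain in-memory Map.

```javascript
// Stand-in for a real highlighter; it only wraps the text (hypothetical).
function highlight(code) {
  return '<span class="hl">' + code + "</span>";
}

// Cache highlighted output per post so the work is done once per post,
// not once per visitor.
function createHighlightCache(highlightFn) {
  const cache = new Map();
  let misses = 0;
  return {
    render(postId, code) {
      if (!cache.has(postId)) {
        misses++;
        cache.set(postId, highlightFn(code));
      }
      return cache.get(postId);
    },
    // Call this when a post is edited so the next request re-highlights it.
    invalidate(postId) {
      cache.delete(postId);
    },
    get misses() {
      return misses;
    },
  };
}

const highlightCache = createHighlightCache(highlight);
const firstRender = highlightCache.render(42, "var x = 1;");
const secondRender = highlightCache.render(42, "var x = 1;");
```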

Why is there such a disconnect between server side models, validations, etc. and client side?

I've been bouncing back and forth lately on different client-side JavaScript libraries/frameworks. I like Backbone. Not a fan of ExtJs. Etc.
Anyway, they all seem to have one giant problem in that I have to define validation logic in both the server side (Rails 3) and client side. Plus, I have to do the same with my model definitions (AR objects and JS objects). Then I have to define the business rules in both places too.
Seems like I'm always developing two concurrent applications.
I know this is a subjective question, but for us small one-man teams who can't afford dedicated JS guys and dedicated Ruby guys, what are my solutions?
I'm racking my brain and maybe I'm missing something but I can't find a single solution to this problem.
I thought about writing a Ruby gem that would generate local JS objects. So at least my business objects would be the same. But this sounds scary. Especially since I may not want all attributes client-side.
What are your thoughts on this problem? Do I just have to live with it?

I think it's something you just have to live with. If you think about the nature of the problem, and WHY we do both client and server data validation, you can come to the conclusion that there isn't currently a way around it without degrading the user experience or putting your application at risk.
Think about it like you are sending a shipment of goods across the country on a train. At the source location, somebody checks the logs to make sure that everything in the order has been included on the train, and that none of the goods are damaged. At the destination, another person checks that they've received everything they ordered in the shipment, and that nothing is damaged.
What happens if you skip one of those validation steps? You run the risk of sending an incomplete shipment without your "server" side validation. Without validating the incoming shipment on the other end, if someone were to hijack the train and swap out a bunch of counterfeit goods, you wouldn't find out about it until the goods had been sold and the cops were at your door.
The time (and expense) it takes the train to get from one location to the other creates an incentive to validate both incoming and outgoing goods, simply because an error on either end requires another train to be sent.
Admittedly, this metaphor is kind of a stretch, but hopefully you get the picture. We need validation on both ends.

This is because those are two "closed domains" which don't overlap: code on the server, and code on the client. You can't run PHP/Ruby/Python/OtherServerLanguage on the client, and you can't run JavaScript on the server. Oh wait! You can!
I see three kinds of solutions:
build a tool that generates rules for one of the two domains, for example: parse your Ruby code and generate the related JavaScript (models, validation rules, etc.),
use a tool to convert a server-side language to JavaScript; there are TONS out there: https://github.com/jashkenas/coffee-script/wiki/List-of-languages-that-compile-to-JS
use the same language for both domains, with something like node.js http://nodejs.org/ which brings JavaScript to the server. That way, you can write your code once and run it server-side and client-side, so your code base can be reused :) This pattern only asks you to decouple all your code into small independent modules.
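To illustrate the third option, here is a small sketch of a validation module written once and loadable in both environments. The rule set (userRules, with name and age fields) and the validate function are invented for this example; the export guard at the bottom is one common way to make a plain script usable from Node as well as from a script tag.

```javascript
// Hypothetical rules for a "user" model. Each rule returns null when the
// value is acceptable, or an error message otherwise.
const userRules = {
  name: (value) =>
    typeof value === "string" && value.trim().length > 0
      ? null
      : "name is required",
  age: (value) =>
    Number.isInteger(value) && value >= 18
      ? null
      : "age must be an integer of at least 18",
};

// Runs every rule against the record and collects messages per field.
function validate(record, rules) {
  const errors = {};
  for (const [field, rule] of Object.entries(rules)) {
    const message = rule(record[field]);
    if (message) errors[field] = message;
  }
  return errors;
}

// Export for Node; in a browser the same file can be loaded with a
// <script> tag, where `module` does not exist.
if (typeof module !== "undefined" && module.exports) {
  module.exports = { userRules, validate };
}
```

The client would call validate before submitting to give quick feedback, and the server would call the very same function again on the received data, since it cannot trust that the client-side check ever ran.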

Security and JavaScript files containing a site's logic

Now that JavaScript libraries like jQuery are more popular than ever, .js files are starting to contain more and more of a site's logic. How and where it pulls data/information from, how that info is processed, etc. This isn't necessarily a bad thing, but I'm wondering to what extent this might be a security concern.
Of course the real processing of data still happens in the backend using PHP or some other language, and it is key that you make sure that nothing unwanted happens at that point. But just by looking at the .js of a site (that relies heavily on e.g. jQuery), it'll tell a person maybe more than you, as a developer, would like. Especially since every browser nowadays comes with a fairly extensive web developer environment or add-on. Even for a novice manipulating the DOM isn't that big of a deal anymore. And once you figure out what code there is, and how you might be able to influence it by editing the DOM, the 'fun' starts.
So my main concerns are:
I don't want everyone to be able to look at a .js file and see exactly (or rather: for a large part) how my site, web app or CMS works — what is there, what it does, how it does it, etc.
I'm worried that by 'unveiling' this information, people who are a lot smarter than I am figure out a way to manipulate the DOM in order to influence JavaScript functions they now know the site uses, possibly bypassing backend checks that I implemented (and thus wrongly assuming they were good enough).
I already use different .js files for different parts of e.g. a web app. But there's always stuff that has to be globally available, and sometimes this contains more than I'd like to be public. And since it's all "out there", who's to say they can't find those other files anyway.
I sometimes see a huge chunk of JavaScript without line breaks and all that. Like the compact jQuery files. I'm sure there are applications or tricks to convert your normal .js file to one long string. But if it can do that, isn't it just as easy to turn it back to something more readable (making it pointless except for saving space)?
Lastly I was thinking about whether it was possible to detect if a request for a .js file comes from the site itself (by including the script in the HTML), instead of a direct download. Maybe by blocking the latter using e.g. Apache's ModRewrite, it's possible to use a .js file in the HTML, but when someone tries to access it, it's blocked.
What are your thoughts about this? Am I overreacting? Should I split my JS as much as possible or just spend more time triple checking (backend) scripts and including more checks to prevent harm-doing? Or are there some best-practices to limit the exposure of JavaScripts and all the info they contain?

Nothing in your JavaScript should be a security risk, if you've set things up right. Any AJAX endpoint that someone finds in a JavaScript file should check the user's permissions and fail if they don't have the right ones.
Having someone view your JavaScript is only a security risk if you're doing something broken like having calls to something like /ajax/secret_endpoint_that_requires_no_authentication.php, in which case your issue isn't insecure JavaScript, it's insecure code.
"I sometimes see a huge chunk of JavaScript without line breaks and all that. Like the compact jQuery files. I'm sure there are applications or tricks to convert your normal .js file to one long string. But if it can do that, isn't it just as easy to turn it back to something more readable (making it pointless except for saving space)?"
This is generally minification (to reduce bandwidth usage), not obfuscation. It is easily reversible. There are obfuscation techniques that'll make all variable and function names something useless like "aa", "bb", etc., but they're reversible with enough effort.
"Lastly I was thinking about whether it was possible to detect if a request for a .js file comes from the site itself (by including the script in the HTML), instead of a direct download. Maybe by blocking the latter using e.g. Apache's ModRewrite, it's possible to use a .js file in the HTML, but when someone tries to access it, it's blocked."
It's possible to do this, but it's easily worked around by any half-competent attacker. Bottom line: nothing you send a non-privileged user's browser should ever be sensitive data.

Of course you should spend more time checking back-end scripts. You have to approach the security problem as if the attacker is one of the key developers on your site, somebody who knows exactly how everything works. Every single URL in your site that does something to your database has to be protected to make sure that every parameter is within allowed constraints: a user can only change their own data, can only make changes within legal ranges, can only change things in a state that allows changes, etc etc etc. None of that has anything at all to do with what your Javascript looks like or whether or not anyone can read it, and jQuery has nothing at all to do with the problem (unless you've done it all wrong).
Remember: an HTTP request to your site can come from anywhere and be initiated by any piece of software in the universe. You have no control over that, and nothing you do to place restrictions on what clients can load what pages will have any effect on that. Don't bother with "REFERER" checks because the values can be faked. Don't rely on data scrubbing routines in your Javascript because those can be bypassed.
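As a sketch of what "every parameter is within allowed constraints" might mean in practice, here is a hypothetical update handler. Everything in it is invented for illustration: the handler name, the field names (ownerId, locked, displayName), the 50-character limit, and the use of a Map as a stand-in for the database. The point is only that each check runs on the server, whatever the client-side JavaScript did or did not do.

```javascript
// Hypothetical server-side handler. `sessionUserId` comes from the server's
// own session state, never from the request parameters.
function handleUpdateProfile(sessionUserId, params, profiles) {
  const profile = profiles.get(params.profileId);
  if (!profile) return { status: 404 };

  // A user can only change their own data.
  if (profile.ownerId !== sessionUserId) return { status: 403 };

  // Changes are only allowed in a state that permits them.
  if (profile.locked) return { status: 409 };

  // Values must be within legal ranges (the limits here are assumptions).
  const name = params.displayName;
  if (typeof name !== "string" || name.length < 1 || name.length > 50) {
    return { status: 422 };
  }

  profile.displayName = name;
  return { status: 200 };
}
```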

Well, you're right to be thinking about this stuff. It's a non-trivial and much misunderstood area of web application development.
In my opinion, the answer is that yes it can create more security issues, simply because (as you point out) the vectors for attack are increased. Fundamentally not much changes from a traditional (non JS) web application and the same best practises and approaches will serve you very well, e.g. watching out for SQL injection, buffer overflows, response splitting, etc. You just have more places where you need to watch out for them.
In terms of the scripts themselves, the issues around cross-domain security are probably the most prevalent. Research and learn how to avoid XSS attacks in particular, and also CSRF attacks.
JavaScript obfuscation is not typically carried out for security reasons, and you're right that it can be fairly easily reverse engineered. People do it, partially to protect intellectual property, but mainly to make the code download weight smaller.
I'd recommend Christopher Wells's book 'Securing Ajax Applications', published by O'Reilly.

There is free software that does JavaScript obfuscation, although there is no security through obscurity. Obfuscation does not prevent all attacks against your system. It does make it more difficult, but not impossible, for other people to rip off your JavaScript and use it.
There is also the issue of client-side trust. By having a lot of logic on the client side, the client is given the power to choose what it wants to execute. For instance, if you are escaping quote marks in JavaScript to protect against SQL injection, a hacker is going to write exploit code that builds his own HTTP request, bypassing the escaping routines altogether.
TamperData and FireBug are commonly used by hackers to gain a deeper understanding of a Web Application.
JavaScript code alone CAN have vulnerabilities in it. A good example is DOM Based XSS. Although I admit this is not a very common type of XSS.
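DOM-based XSS typically happens when a script takes untrusted text (for example from the URL) and writes it into the page as HTML. A minimal sketch of one defence is to escape the text first; the escapeHtml helper below is written for this example and is not part of jQuery or any other library. In a browser, assigning the value to an element's textContent instead of building HTML avoids the problem as well.

```javascript
// Replace the characters that are significant in HTML with entities.
// The ampersand must be handled first so existing entities are not
// double-processed in the wrong order.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Example of attacker-controlled input that would run script if it were
// inserted into the page unescaped.
const untrusted = "<img src=x onerror=alert(1)>";
const safeMarkup = escapeHtml(untrusted);
```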

Here's a book by Billy Hoffman about Ajax security:
http://www.amazon.com/Ajax-Security-Billy-Hoffman/dp/0321491939/ref=sr_1_1?ie=UTF8&s=books&qid=1266538410&sr=1-1

serverside processing vs client side processing + ajax?

looking for some general advice and/or thoughts...
i'm creating what i think to be more of a web application than a web page, because i intend it to be like a gmail app where you would leave the page open all day long while getting updates "pushed" to the page (for the interested i'm using the comet programming technique). i've never created a web page before that was so rich in ajax and javascript (i am now a huge fan of jquery). because of this, time and time again when i'm implementing a new feature that requires a dynamic change in the UI that the server needs to know about, i am faced with the same question:
1) should i do all the processing on the client in javascript and post back as little as possible via ajax
or
2) should i post a request to the server via ajax, have the server do all the processing and then send back the new html. then on the ajax response i do a simple assignment with the new HTML
i have been inclined to always follow #1. this web app i imagine may get pretty chatty with all the ajax requests. my thought is minimize as much as possible the size of the requests and responses, and rely on the continuously improving javascript engines to do as much of the processing and UI updates as possible. i've discovered with jquery i can do so much on the client side that i wouldn't have been able to do very easily before. my javascript code is actually much bigger and more complex than my serverside code. there are also simple calculations i need to perform and i've pushed that on the client side, too.
i guess the main question i have is, should we ALWAYS strive for client side processing over server side processing whenever possible? i've always felt the less the server has to handle the better for scalability/performance. let the power of the client's processor do all the hard work (if possible).
thoughts?

There are several considerations when deciding if new HTML fragments created by an ajax request should be constructed on the server or client side. Some things to consider:
Performance. The work your server has to do is what you should be concerned with. By doing more of the processing on the client side, you reduce the amount of work the server does, and speed things up. If the server can send a small bit of JSON instead of giant HTML fragment, for example, it'd be much more efficient to let the client do it. In situations where it's a small amount of data being sent either way, the difference is probably negligible.
Readability. The disadvantage to generating markup in your JavaScript is that it's much harder to read and maintain the code. Embedding HTML in quoted strings is nasty to look at in a text editor with syntax coloring set to JavaScript and makes for more difficult editing.
Separation of data, presentation, and behavior. Along the lines of readability, having HTML fragments in your JavaScript doesn't make much sense for code organization. HTML templates should handle the markup and JavaScript should be left alone to handle the behavior of your application. The contents of an HTML fragment being inserted into a page is not relevant to your JavaScript code, just the fact that it's being inserted, where, and when.
I tend to lean more toward returning HTML fragments from the server when dealing with ajax responses, for the readability and code organization reasons I mention above. Of course, it all depends on how your application works, how processing intensive the ajax responses are, and how much traffic the app is getting. If the server is having to do significant work in generating these responses and is causing a bottleneck, then it may be more important to push the work to the client and forego other considerations.
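For a feel of the trade-off, here is a tiny sketch of the client-side variant: the server sends JSON and one render function on the client owns the markup. The payload shape (id, unread) and the renderCounters name are assumptions made up for this example, and only numeric fields are rendered so that no text escaping is needed here.

```javascript
// Turns the JSON payload into list items. Keeping all markup in one small
// function limits how much HTML ends up scattered through the script.
function renderCounters(folders) {
  return folders
    .map((folder) =>
      `<li data-id="${Number(folder.id)}">${Number(folder.unread)} unread</li>`)
    .join("");
}

// What the server would send in the JSON variant...
const jsonPayload = '[{"id":7,"unread":3},{"id":9,"unread":0}]';
// ...and the HTML the client builds from it, which is what the server
// would have to send in the HTML-fragment variant.
const htmlFragment = renderCounters(JSON.parse(jsonPayload));
```

Even in this toy case the JSON is shorter than the markup it produces, but for payloads this small the difference is unlikely to matter, which matches the point above that readability may reasonably win.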

I'm currently working on a pretty computationally-heavy application right now and I'm rendering almost all of it on the client-side. I don't know exactly what your application is going to be doing (more details would be great), but I'd say your application could probably do the same. Just make sure all of your security- and database-related code lies on the server-side, because not doing so will open security holes in your application. Here are some general guidelines that I follow:
Don't ever rely on the user having a super-fast browser or computer. Some people are using Internet Explorer 7 on old machines, and if it's too slow for them, you're going to lose a lot of potential customers. Test on as many different browsers and machines as possible.
Any time you have some code that could potentially slow down or freeze the browser momentarily, show a feedback mechanism (in most cases a simple "Loading" message will do) to tell the user that something is indeed going on, and the browser didn't just randomly freeze.
Try to load as much as you can during initialization and cache everything. In my application, I'm doing something similar to Gmail: show a loading bar, load up everything that the application will ever need, and then give the user a smooth experience from there on out. Yes, they're going to have to potentially wait a couple seconds for it to load, but after that there should be no problems.
Minimize DOM manipulation. Raw number-crunching JavaScript performance might be "fast enough", but access to the DOM is still slow. Avoid creating and destroying elements; instead simply hide them if you don't need them at the moment.

I recently ran into the same problem and decided to go with browser-side processing. Everything worked great in FF, IE8, and IE8 in IE7 mode, but then our client, using Internet Explorer 7, ran into problems: the application would freeze up and a script timeout box would appear. I had put too much work into the solution to throw it away, so I ended up spending an hour or so optimizing the script and adding setTimeout wherever possible.
My suggestions?
If possible, keep non-critical calculations client side.
To keep data transfers low, use JSON and let the client side sort out the HTML.
Test your script using the lowest common denominator.
If needed use the profiling feature in FireBug. Corollary: use the uncompressed (development) version of jQuery.
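The "adding setTimeout wherever possible" trick mentioned above usually means breaking a long loop into slices so the browser can repaint and handle input in between. This is only a sketch of that idea, not the code from my project; processInChunks and its parameters are names chosen for the example, and the scheduler is injectable so the function can also be driven synchronously.

```javascript
// Processes `items` in slices of `chunkSize`, yielding to the browser
// between slices. `schedule` defaults to setTimeout with a zero delay.
function processInChunks(items, chunkSize, handleItem, onDone,
                         schedule = (fn) => setTimeout(fn, 0)) {
  let index = 0;
  function runChunk() {
    const end = Math.min(index + chunkSize, items.length);
    for (; index < end; index++) handleItem(items[index]);
    if (index < items.length) {
      schedule(runChunk); // let the browser breathe, then continue
    } else {
      onDone();
    }
  }
  runChunk();
}
```

A smaller chunkSize keeps the page more responsive at the price of a longer total run time, so the value is worth tuning on the slowest browser you have to support.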

I agree with you. Push as much as possible to users, but not too much. If your app slows down or, even worse, crashes their browser, you lose.
My advice is to actually test how your application behaves when left running all day. Check that there are no memory leaks. Check that it doesn't end up creating an Ajax request every half second after the application has been in use for a while (timers in JS can be a pain sometimes).
Apart from that, never rely on JavaScript alone for user input validation. Always duplicate it on the server.
Edit
Use jQuery live binding. It will save you a lot of time when rebinding generated content and will make your architecture clearer. Sadly, when I was developing with jQuery it wasn't available yet; we used other tools with the same effect.
In the past I also had a problem when generating one part of a page via Ajax depended on another part being generated first. Generating the first part and then the second makes your page slower, as expected. Plan for this up front. Develop pages so that they already have all their content when opened.
Also (this applies to simple pages too), keep the number of files referenced from one server low. Join JavaScript and CSS libraries into one file on the server side. Keep images on a separate host, or better, on several separate hosts (creating just a third-level domain will do too). This is only worth it in production, though; it will make the development process more difficult.

Of course it depends on the data, but a majority of the time if you can push it client side, do. Make the client do more of the processing and use less bandwidth. (Again this depends on the data, you can get into cases that you have to send more data across to do it client side).

Some stuff like security checks should always be done on the server. If you have a computation that takes a lot of data and produces less data, also put it on the server.
Incidentally, did you know you could run Javascript on the server side, rendering templates and hitting databases? Check out the CommonJS ecosystem.

There could also be cross-browser support issues. If you're using a cross-browser, client-side library (eg JQuery) and it can handle all the processing you need then you can let the library take care of it. Generating cross-browser HTML server-side can be harder (tends to be more manual), depending on the complexity of the markup.

This is possible, but with a heavy initial page load and heavy use of caching. Take Gmail as an example:
On initial page load, it downloads most of the JS files it needs to run, and most of them are cached.
Don't overuse images and graphics.
Load all the data needed for the initial view, along with the data the user will predictably need next. In Gmail and the latest Yahoo Mail, the inbox is not populated with just a single mail conversation body; the first few full email messages are loaded in advance at page load. The secret of high responsiveness comes at a cost (Gmail offers to load the light version if the bandwidth is low; I bet most of us have experienced that).
Follow the KISS principle, which means keep your design simple.
And never try to render the whole page using JavaScript; you cannot assume that all your end users have high-spec systems or high-bandwidth connections.
It's smart to split the workload between your server and client.

If you think you might want to create an API for your application in the future (communicating with iPhone or Android apps, letting other sites integrate with yours), you would have to duplicate a bunch of code for all those devices if you go with a bare-bones server implementation of your application.
