Saving Application State in Node.js - javascript

How can I save the application state for a Node.js application that consists mostly of HTTP requests?
I have a script in Node.js that works with a RESTful API to import a large number (10,000+) of products into an e-commerce application. The API has a limit on the number of requests that can be made, and we are starting to brush up against that limit. On a previous run the script exited with an Error: connect ETIMEDOUT, probably due to exceeding API limits. I would like to be able to retry the connection 5 times and, if that fails, resume after an hour when the limit has been restored.
It would also be beneficial to save progress throughout in case of a crash (power goes down, network fails, etc.) and to be able to resume the script from the point where it left off.
I know that Node.js operates as a giant event queue: all HTTP requests and their callbacks get put into that queue (together with some other events). This makes it a prime target for saving the state of the current execution. Another nice-to-have (not strictly necessary for this project) would be the ability to distribute the work among several machines on different networks to increase throughput.
So is there an existing way to do this? A framework, perhaps? Or do I need to implement it myself? In that case, any useful resources on how this can be done would be appreciated.

I'm not sure what you mean when you say
I know that Node.js operates as a giant event queue: all HTTP requests and their callbacks get put into that queue (together with some other events). This makes it a prime target for saving the state of the current execution
Please feel free to comment or expound on this if you find it relevant to the answer.
That said, if you're simply looking for a persistence mechanism for this particular task, I might recommend Redis, for a few reasons:
It allows atomic operations on many data types; for example, if you had an entry in Redis called num_requests_made that represented the number of requests made, you could increment this number easily in Redis using INCR num_requests_made, and it's guaranteed to be atomic, making it easier to scale to multiple workers.
It has several data types that could prove useful for your needs; for example, a simple string could represent the number of API requests made during a certain period of time (as in the previous bullet point); you might store details on failed API requests that need to be resubmitted in a list; etc.
It provides pub/sub mechanisms which would allow you to communicate easily between multiple instances of the program.
If this sounds interesting or useful and you're not already familiar with Redis, I highly recommend trying out the interactive tutorial, which introduces you to a few data types and commands for them. Another good piece of reading material is A fifteen minute introduction to Redis data types.
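To make the first bullet point concrete, here is a minimal sketch (assuming the node-redis v4+ client; importProduct(), the key names, and the specific limits are placeholders, not anything from your code) of combining an atomic request counter with the retry/resume behaviour you describe:

const { createClient } = require('redis');

const HOURLY_LIMIT = 1000;   // assumed API limit per hour
const MAX_ATTEMPTS = 5;      // retry 5 times, as described in the question
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function importWithLimit(products) {
  const redis = createClient();
  await redis.connect();

  for (const product of products) {
    // Atomic counter shared by any number of worker processes.
    const made = await redis.incr('num_requests_made');
    if (made > HOURLY_LIMIT) {
      await sleep(60 * 60 * 1000);              // wait out the rate-limit window
      await redis.set('num_requests_made', 0);  // then start a fresh count
    }

    for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        await importProduct(product);                     // placeholder for one API call
        await redis.set('last_imported_id', product.id);  // progress marker for crash recovery
        break;
      } catch (err) {
        if (attempt === MAX_ATTEMPTS) throw err;          // give up after 5 tries
        await sleep(5000);                                // brief pause before retrying
      }
    }
  }

  await redis.quit();
}

On restart, the script could read last_imported_id and skip everything up to that product, which gives you resume-after-crash behaviour without having to persist the whole execution state.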

Related

Duplicate websocket subscription in Azure webapp

I have a Node.js app running in Azure as a Web App. On startup it connects to an external service using a websocket subscription. Specifically, I'm using the reconnecting-websockets NPM package to wrap the connection and handle disconnects.
The problem I am having is that because there are 2 instances of the app running on Azure (horizontal scaling for failover), I end up with two subscriptions at any one time.
Is there an obvious way to solve this problem?
For extra context, this is a problem for 2 reasons:
I pay for each message received and am over quota
When messages are received I process them and do database updates; these are also being duplicated.
You basically want to have an App Service with potentially multiple instances, but you don't want your application to run in parallel. At least you don't want to have two subscriptions. Ideally you don't want to touch your application code.
An easy way to implement this would be to wrap your application into a continuous WebJob, and set its scale property to singleton.
Here is one tutorial on how to set up a Node.js WebJob: https://morshemesh.medium.com/continuous-deployment-of-web-jobs-with-node-js-2308f95e63b1
You can then use a settings.job file to control that your webjob only runs on a single instance at any one time. Or you can use the Azure Portal to set the value when you manually deploy the Webjob.
{
  "is_singleton": true
}
https://github.com/projectkudu/kudu/wiki/WebJobs-API#set-a-continuous-job-as-singleton
PS: Don't forget to enable Always On. It is also mentioned in the docs. But you probably already need that for your current deployment.
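For orientation (a rough sketch, not part of the tutorial above): a continuous WebJob is usually just a folder deployed under App_Data/jobs/continuous/<job-name>/ that contains your entry script (run.js is picked up automatically) plus the settings.job file shown above. A minimal run.js might look like the following; plain 'ws' and the feed URL are placeholders to keep the sketch self-contained, and your actual reconnecting-websocket wrapper would go here instead.

// run.js - minimal sketch of the WebJob entry point
const WebSocket = require('ws');

const ws = new WebSocket('wss://example.com/feed');   // placeholder URL

ws.on('message', (data) => {
  // process the message and update the database here
  console.log('received', data.toString());
});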
If you don't want your subscription to be duplicated then it stands to reason that you only want one process subscribing to the external websocket connection.
Since you mentioned that received messages trigger database updates, it makes sense for this to be an isolated backend process, given that you made it clear you have multiple instances running for the frontend server (whether or not there is a separate backend).
Of course, if you want more redundancy, you could put a load balancer in front and distribute the messages to any number of instances behind it, perhaps with a persistent queueing system if you feel it's needed.
If you want these messages to be propagated to the client (not clear from the question), this will be a bit more annoying. If it's a simple one-way channel, then you could consider using SSE (Server-Sent Events), which is a rather simple protocol. If it's bidirectional, then I would probably consider running a STOMP server with an intermediary broker (like RabbitMQ) and connecting directly from the client (i.e. the browser, not the server generating the frontend) to the service.
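If you do go the SSE route, here is a minimal sketch of the relay idea (plain Express; the route and broadcast() are made-up names, and broadcast() is meant to be called from the single process that holds the external websocket subscription):

const express = require('express');
const app = express();
const clients = new Set();

// Browsers connect here with new EventSource('/events').
app.get('/events', (req, res) => {
  res.set({
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  res.flushHeaders();
  clients.add(res);
  req.on('close', () => clients.delete(res));
});

// Call this from the websocket subscription's message handler.
function broadcast(message) {
  for (const res of clients) {
    res.write(`data: ${JSON.stringify(message)}\n\n`);
  }
}

app.listen(3000);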
Not sure if you're well versed in Java, but here is an app I made when we had to prepare some internal demos that you could use as a reference in case you're interested: https://github.com/kimgysen/zwoop-backend/tree/develop/stomp-api/src/main/java/be/zwoop
For all intents and purposes, I'm not sure whether all this is worth the hassle for you; it sounds like you're on a tight budget and looking for simple solutions without too much complexity. Have you considered giving up on load balancing the website? Is the load really that high? I don't have enough background knowledge on your project to judge. But proper caching and scaling vertically may be sufficient at the start.
Personally I would start simple and gradually increase complexity when needed.
I'm just throwing ideas at you, hopefully it is helpful in any way to have a few considerations.
Btw, I don't understand why other answers on this question were all deleted (?).

Optimal number of requests from client

Let's say we have a page. While rendering it, we need to make about 15 requests to an API to get some data.
How will this number of requests affect performance for desktop/mobile versions? Do I need to make any changes to reduce the number of requests? It would be great if you could send me a link with clarification on this topic.
Optimization in this case really depends on the result of the API calls: what you are getting in response. Is it the same static data each time, the same data with slight changes, or data that changes in real time?
There are many optimization techniques, such as choosing between sync and async, caching, batching, and payload reduction. There could be many more, but those are the few I know. You can find a lot about these with a single Google query. It is up to you to decide which to use and where.
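As one concrete example of the caching technique from that list, here is a minimal sketch of a small time-to-live cache around fetch (the one-minute TTL is arbitrary):

// Repeated calls for the same URL within CACHE_MS reuse the previous
// response instead of hitting the API again.
const CACHE_MS = 60 * 1000;
const cache = new Map();

async function cachedGet(url) {
  const hit = cache.get(url);
  if (hit && Date.now() - hit.time < CACHE_MS) {
    return hit.data;
  }
  const data = await fetch(url).then((res) => res.json());
  cache.set(url, { time: Date.now(), data });
  return data;
}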
Various browsers have various limits for maximum connections per host name; you can find the exact numbers at http://www.browserscope.org/?category=network
Here is an interesting article about connection limitations from web performance expert Steve Souders: http://www.stevesouders.com/blog/2008/03/20/roundup-on-parallel-connections/
12 requests to one domain/service is not much. The latest versions of browsers support around 6 simultaneous HTTP 1.x connections per domain. That means your first 6 service calls (to a particular domain) need to complete before the next HTTP connection to that domain is initiated. (With HTTP/2, this limitation is not there, though.) So if your application is not intended to be high-performing, you are usually fine.
On the other hand, if every millisecond counts, then it's better to have an edge service / GraphQL (my preference) that aggregates all the services and sends the result to the browser.
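A rough sketch of that aggregation idea, using a plain Express endpoint instead of GraphQL (the upstream service URLs are placeholders, and a global fetch as in Node 18+ is assumed):

const express = require('express');
const app = express();

// The browser makes one request; the server fans out to the individual
// services in parallel and returns a single combined payload.
app.get('/api/page-data', async (req, res) => {
  try {
    const [user, orders, stats] = await Promise.all([
      fetch('http://users-service/me').then((r) => r.json()),
      fetch('http://orders-service/recent').then((r) => r.json()),
      fetch('http://stats-service/summary').then((r) => r.json()),
    ]);
    res.json({ user, orders, stats });
  } catch (err) {
    res.status(502).json({ error: 'upstream request failed' });
  }
});

app.listen(3000);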

Where in my stack can streams solve problems, replace parts and/or make it better?

If I take a look at the stream library landscape, I see a lot of nice stuff (like mapping/reducing streams), but I'm not sure how to use them effectively.
Say I already have an Express app that serves static files and has some JSON REST handlers connecting to a MongoDB database server. I have a client-heavy app that can display information in widgets and charts (think Highcharts), with the user filtering, drilling down into information, etc. I would like to move to real-time updating of the interface, and this is the perfect little excuse to introduce Node.js into the project, I think. However, the data isn't really real-time, so pushing new data to a lot of clients isn't what I'm trying to achieve (yet). I just want a fast experience.
I want to use Browserify, which gives me access to the Node.js streams API in the browser (and more), and given the enormity of the data sets, processing is done server-side (by a backend API over JSONP).
I understand that most of the connections at some point are already expressed as streams, but I'm not sure where I could use streams elsewhere effectively to solve a problem;
Right now, when sliders/inputs are changed, spinning loaders appear in affected components until the new JSON has arrived and is parsed and ready to be shot into the chart/widget. With a Node.js server in between, can streaming data, instead of request/responding with chunks of JSONP-ified number data, speed up the interactivity of the application?
Say that I have some time series data. Can a stream be reused so that when I say I want to see only a subset of the data (by time), I can have the stream re-send its data, filtering out the records I don't care about?
Would streaming data to a (high)chart be a better user experience than using a for loop and an array?
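On the time-series point, a filtering step is one place a stream fits naturally. Here is a minimal sketch with an object-mode Transform (the timestamp field and the from/to bounds are assumptions about the data shape):

const { Transform } = require('stream');

// Passes through only records whose timestamp falls inside [from, to];
// pipe it between the data source and the response or serializer.
function timeRangeFilter(from, to) {
  return new Transform({
    objectMode: true,
    transform(record, _encoding, callback) {
      if (record.timestamp >= from && record.timestamp <= to) {
        this.push(record);
      }
      callback();
    },
  });
}

// Usage sketch: source.pipe(timeRangeFilter(start, end)).pipe(destination);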

Should Node.js be used for intensive processing?

Let's say I'm building a 3-tier web site, with Mongo DB on the back end and some really lightweight javascript in the browser (let's say just validation on forms, maybe a couple of fancy controls which fire off some AJAX requests).
I need to choose a technology for the 'middle' tier (we could segment this into sub-tiers but that detail isn't the focus here, just the overall technology choice), where I want to crunch some raw data coming out of the DB, and render this into some HTML which I push to the browser. A fairly typical thin-client web architecture.
My safe choice would be to just implement this middle tier in Java, using some libraries like Jongo to talk to the Mongo DB and maybe Jackson to marshal/unmarshal JSON to talk to my fancy controls when they make AJAX requests. And some Java templating framework for rendering my HTML on the server.
However, I'm really intrigued by the idea of throwing all this out the window and using Node.js for this middle tier, for the following reasons:
I like javascript (the good parts), and let's say for this application's business logic it would be more expressive than Java.
It's javascript everywhere. No need to switch between languages, and indeed the OO and functional paradigms, when working anywhere on the stack. There's no translation plumbing between the tiers, JSON is supported natively everywhere.
I can reuse validation logic on the client and server.
If in the future I decide to do the HTML rendering client-side in the browser, I can reuse the existing templates with something like Backbone with a pretty minimal refactoring / retesting effort.
If you're at this point and like Node, all the above will seem obvious. So I should choose Node right?
BUT... this is where it falls down for me: as we all know Node is based around a single-threaded async I/O web server model. This is great for my scalability and performance in terms of servicing requests for data, but what about my business logic? What about my template rendering? Won't this stuff cause a huge bottleneck for all requests on the single thread?
Two obvious solutions come to mind, but neither of them sits right:
Keep the 'blocking' business logic in there and just use a cluster of Node instances and a load balancer, to service requests in true parallel. Ok great, so why isn't Node just multi-threaded in the first place? Or was this always the idea, to Keep It Simple Stupid and avoid the possibility of multi-threaded complexity in the base case, making the programmer do the extra setup work on top of this if multi-core processing power is desired?
Keep a single node instance, and keep it non-blocking by just calling out to some Java implementation of my business logic running on some other, multi-threaded, app server. Ok, this option completely nullifies every pro I listed of using Node (in fact it adds complexity over just using Java), other than the possible gains in performance and scalability for CRUD requests to the DB.
Which leads me finally to the point of my question - am I missing some huge important piece of the Node puzzle, have I just got my facts completely wrong, or is Node just unsuitable for crunching business logic on the server? Put another way, is Node just useful for sitting over a database and servicing many CRUD requests in a more performant and scalable way than some other implementation which blocks on I/O? And you have to do all your business logic in some tier below, or even client-side, to maintain any reasonable levels of performance and scalability?
Considering all the buzz over Node, I'd rather hoped it brought more to the table than this. I'd love to be convinced otherwise!
On any given system you have N CPUs available (1-64, or whatever N happens to be). In any CPU-intensive application, you're going to be stuck with a throughput of N CPUs. There's no magical way to fix that by adding more than N threads/processes/whatever. Either your code has to be more efficient, or you need more CPUs. More threads won't help.
One of the little-appreciated facts about multiple-CPU performance is that if you need to run N+1 CPU-intensive operations at the same time, your throughput per CPU goes down quite a bit. A CPU-intensive process tends to hang on to that CPU for a very long time before giving it up, starving the other tasks pretty badly. In the majority of cases, it is blocking I/O and the concomitant task-switching that makes modern OS multitasking work as well as it does. If more of our every-day common tasks were CPU-bound, we would discover we needed a lot more CPUs in our machines than we do currently.
The nice thing that Node.js brings to the server party efficiency-wise is a thorough use of each thread. Ideally, you end up with less task switching. This isn't a huge win, but having N threads handling N*C connections asynchronously is going to have a performance advantage over N*C blocking threads running on the same number of CPUs. But the bottom line on CPUs remains the same: if you have more than N worth of actual CPU work to be done, you're going to feel some pain.
The last time I looked at the Node.js API there was a way to launch a server with one listener plus one worker thread per CPU. If you can do that, I would be inclined to go with Node.js provided a few caveats are met:
The Javascript-everywhere approach buys you some simplicity. For something complicated, I would be concerned about the asynchronous programming style making things harder rather than easier.
The template-processing and other CPU-intensive tasks aren't appreciably slower in Node.js than your other language/platform choices.
The database drivers are reliable.
There is one downside that I can see:
If a thread crashes, you lose all of the connections being serviced by that thread.
Finally, try to remember that programmer time is generally more expensive than servers or bandwidth.
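For reference, the "one listener plus one worker per CPU" arrangement mentioned above is roughly what Node's cluster module gives you; a minimal sketch:

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  // Fork one worker per CPU; they all share the same listening socket.
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
  // Replace a crashed worker so only its in-flight connections are lost.
  cluster.on('exit', () => cluster.fork());
} else {
  http.createServer((req, res) => {
    // CPU-heavy business logic or template rendering here blocks only this worker.
    res.end(`handled by worker ${process.pid}\n`);
  }).listen(3000);
}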

Node.js web sockets server: Is my idea for data management stable/scalable?

I'm developing an HTML5 browser multi-player RPG with Node.js running in the backend and a WebSocket plug-in for client data transfer. The problem I'm facing is accessing and updating user data; as you can imagine, this process will be taking place many times a second, even with few users connected.
I've done some searching and found only 2 plug-ins for Node.js that enable MySQL capabilities, but they are both in early development, and I've figured that querying the database for every little action the user makes is not efficient.
My idea is to get Node.js to access the database using PHP when a user connects and retrieve all the information related to that user. The information collected will then be stored in a JavaScript object in Node.js. This will happen for all users playing. Updates will then be applied to the object. When a user logs off, the data stored in the object will be written back to the database and deleted from the object.
A few things to note are that I will separate different types of data into different objects so that more commonly accessed data isn't mixed together with data that would slow down lookups. Theoretically if this project gained a lot of users I would introduce a cap to how many users can log onto a single server at a time for obvious reasons.
I would like to know if this is a good idea. Would having large objects considerably slow down the node.js server? If you happen to have any ideas on another possible solutions for my situation I welcome them.
Thanks
As far as your strategy goes, keeping the data in intermediate objects (fetched via PHP) adds a very high level of complexity to your application.
Just the communication between Node.js and PHP seems complex, and there is no guarantee this will be any faster than just putting things right in MySQL. Putting any unneeded barrier between you and your data is going to make things more difficult to manage.
It seems like you need a more rapid data solution. You could consider using an asynchronous data store like MongoDB or Redis that will read and write quickly (Redis writes in memory, which should be incredibly fast).
These are both commonly used with Node.js precisely because they can handle the real-time data load.
Actually, Redis is what you're really asking for: it stores things in memory and then persists them to disk periodically. You can't get much faster than that, but you will need enough RAM. If RAM looks like an issue, go with MongoDB, which is still really fast.
The disadvantage is that you will need to relearn your ideas about data persistence, and that is hard. I'm in the process of doing that myself!
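To make that concrete, here is a minimal sketch with the node-redis v4 client (the key and field names are made up): user state can live in a Redis hash that is read on connect and written through on every update, with Redis handling the periodic persistence to disk.

const { createClient } = require('redis');

async function demo() {
  const redis = createClient();
  await redis.connect();

  // Load the player's state when they connect.
  const state = await redis.hGetAll('player:42');
  console.log('loaded', state);

  // Write updates as they happen; the write lands in memory on the Redis
  // side, and Redis persists to disk periodically in the background.
  await redis.hSet('player:42', { x: 100, y: 250, hp: 80 });

  await redis.quit();
}

demo();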
I have an application doing almost what you describe. I chose to do it that way since the MySQL drivers for Node were unstable/undocumented at the time of development.
I have 200 connected users, requesting data 3-5 times each second, and I fetch entire tables through PHP pages (every 200-800 ms) returning JSON from Apache, with approximately 1000 rows, and put the contents in arrays. I loop through the arrays and find the relevant data on request. It works, and it's fast, putting no significant load on CPU and memory.
All data insertion/updating, which is limited, goes through PHP/MySQL.
Advantages:
1. It's a simple solution, with known stable services.
2. Only 1 client connecting to Apache/PHP/MySQL every 200-800 ms.
3. All Node clients get the benefit of non-blocking I/O.
4. Runs on 2 small "pc-style" servers and handles about 8000 req/second (Apache Bench).
Disadvantages:
1. Many, but it gets the job done.
I found that my Node script could stop 1-2 times a week, maybe due to some connection problems (unsolved), but combined with Upstart and Monit it restarts and alerts with no problems.
