unique jobs with kue for node.js

I would like jobs.create to fail if an identical job is already in the system. Is there any way to accomplish this?
I need to run the same job every 24 hours, but some jobs could take even more than 24 hours, so I need to be sure that the job isn't already in the system (active, queued or failed) before adding it.
UPDATED:
OK, I'm going to simplify the problem to be able to explain it here.
Let's say I have an analytics service and I have to send a report to my users once a day. Completing these reports sometimes (just a few cases, but it is a possibility) takes several hours, even more than a day.
I need a way to know which jobs are currently running, to avoid duplicated jobs. I couldn't find anything in the kue API to know which jobs are currently running. I also need some kind of event that fires when more jobs are needed, so I can call my getMoreJobs producer.
Maybe my approach is wrong; if so, please let me know a better way to solve my problem.
This is my simplified code:
var kue = require('kue'),
    cluster = require('cluster'),
    numCPUs = require('os').cpus().length;

numCPUs = CONFIG.sync.workers || numCPUs;

var jobs = kue.createQueue();

if (cluster.isMaster) {
    console.log('Starting master pid:' + process.pid);

    jobs.on('job complete', function(id) {
        kue.Job.get(id, function(err, job) {
            if (err || !job) return;
            job.remove(function(err) {
                if (err) throw err;
                console.log('removed completed job #%d', job.id);
            });
        });
    });

    function getMoreJobs() {
        console.log('looking for more jobs...');
        getOutdateReports(function(err, reports) {
            if (err) return setTimeout(getMoreJobs, 5 * 60 * 60 * 1000);
            reports.forEach(function(report) {
                jobs.create('reports', {
                    id: report.id,
                    title: report.name,
                    params: report.params
                }).attempts(5).save();
            });
            setTimeout(getMoreJobs, 60 * 60 * 1000);
        });
    }

    // Create the jobs
    getMoreJobs();

    console.log('Starting ', numCPUs, ' workers');
    for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
    }

    cluster.on('death', function(worker) {
        console.log('worker pid:' + worker.pid + ' died!'.bold.red);
    });
} else {
    // Process the jobs
    console.log('Starting worker pid:' + process.pid);
    jobs.process('reports', 20, function(job, done) {
        // completing my work here
        veryHardWorkGeneratingReports(function(err) {
            if (err) return done(err);
            return done();
        });
    });
}

The answer to one of your questions is that Kue puts the jobs that it pops off of the Redis queue into the "active" state, and you'll never see them again unless you go looking for them.
The answer to the other question is that your distributed work queue is the consumer, not the producer, of tasks. Mingling them like you have is okay, but it's a muddy paradigm. What I've done with Kue is to make a wrapper for Kue's JSON API, so that a job can be put into the queue from anywhere in the system. Since you seem to have a need to shovel jobs in, I suggest writing a separate producer application that does nothing but get external jobs and stick them into your Kue work queue. It can monitor the work queue for when jobs are running low and load a batch in, or, what I would do, is make it shovel jobs in as fast as it can and spool up multiple instances of your consumer application to process the load more quickly.
To reiterate: your separation of concerns isn't very good here. You should have a producer of tasks that's completely separate from your task consumer app. This gives you more flexibility, ease of scaling (just fire up another consumer on another machine and you're scaled!) and overall ease of code management. You should also, if possible, give whoever is handing you these tasks that you "go looking for" access to your Kue server's JSON API instead of going out and finding them yourself. The job producer can schedule its own tasks with Kue.
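For illustration only, a rough sketch of that split might look like the following. It uses Kue's built-in JSON API (the POST /job endpoint described in the Kue README); the port, the file names and the enqueueReport helper are assumptions for the example, and veryHardWorkGeneratingReports is the placeholder from the question:
// consumer.js - owns the queue, processes jobs, and exposes Kue's JSON API
var kue = require('kue');
var jobs = kue.createQueue();

kue.app.listen(3000); // JSON API: POST /job, GET /jobs/..., etc.

jobs.process('reports', 20, function(job, done) {
    veryHardWorkGeneratingReports(function(err) {
        done(err);
    });
});

// producer.js - a separate process that only creates jobs, here over plain HTTP
var http = require('http');

function enqueueReport(report, callback) {
    var body = JSON.stringify({
        type: 'reports',
        data: { id: report.id, title: report.name, params: report.params },
        options: { attempts: 5 }
    });
    var req = http.request({
        host: 'localhost',
        port: 3000,
        path: '/job',
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'Content-Length': Buffer.byteLength(body)
        }
    }, function(res) {
        res.resume(); // drain the response body
        callback(res.statusCode === 200 ? null : new Error('enqueue failed: ' + res.statusCode));
    });
    req.on('error', callback);
    req.end(body);
}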

Look at https://github.com/LearnBoost/kue.
In the json.js script, check rows 64-112. There you'll find methods that return an object containing jobs, optionally filtered by type, state or id range (jobRange(), jobStateRange(), jobTypeRange()).
Scrolling down the main page to the JSON API section, you'll find examples of the returned objects.
How to call and use those methods, you know much better than I do.
jobs.create() will fail if you pass an unknown keyword. I would create a function that checks the current job inside the forEach loop and returns a keyword, then call this function instead of a literal keyword in the jobs.create() parameters.
The information obtained through those methods in json.js may help you create that "moreJobToDo" event too.
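As a rough illustration of that idea, a duplicate check could be built on kue.Job.rangeByType() (the programmatic counterpart of those json.js routes). The 0..10000 range, the helper names and the duplicate test on job.data.id are assumptions; jobs is the queue from the question:
// Collect the report ids that already have a 'reports' job in the given state.
function idsInState(state, callback) {
    kue.Job.rangeByType('reports', state, 0, 10000, 'asc', function(err, existing) {
        if (err) return callback(err);
        callback(null, existing.map(function(job) { return job.data.id; }));
    });
}

// Only enqueue reports that are not already active or queued (inactive).
function createIfNotRunning(report) {
    idsInState('active', function(err, activeIds) {
        if (err) return console.error(err);
        idsInState('inactive', function(err, queuedIds) {
            if (err) return console.error(err);
            if (activeIds.indexOf(report.id) !== -1 || queuedIds.indexOf(report.id) !== -1) {
                return console.log('report %s already enqueued, skipping', report.id);
            }
            jobs.create('reports', { id: report.id, title: report.name, params: report.params })
                .attempts(5)
                .save();
        });
    });
}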

Related

Delay function execution (API call) to execute n times in time period

I'm trying to write back-end functionality that handles requests to a particular API, but this API has some restrictive quotas, especially for requests/sec. I want to create an API abstraction layer that is able to delay function execution if there are too many requests per second, so it works like this:
A new request arrives (to put it simply, a library method is invoked)
Check whether this request can be executed right now, according to the given limit (requests/s)
If it can't be executed, delay its execution till the next available moment
If at that time a new request arrives, delay its execution further or put it on some execution queue
I don't have any constraints in terms of waiting queue length. Requests are function calls with node.js callbacks as the last param for responding with data.
I thought of adding a delay to each request, equal to the smallest possible slot between requests (expressed as minimal milliseconds/request), but that can be a bit inefficient (always delaying functions before sending the response).
Do you know any library or simple solution that could provide me with such functionality?
Save the last request's timestamp.
Whenever a new request comes in, check whether the minimum interval has elapsed since then; if not, put the function in a queue and then schedule a job (unless one was already scheduled):
setTimeout(
    processItemFromQueue,
    lastTime + minInterval - Date.now()  // lastTime is the previous request's ms timestamp
)
processItemFromQueue takes a job from the front of the queue (shift) then reschedules itself unless the queue is empty.
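Put together, a minimal sketch of that scheme, assuming the limit is expressed as a minimum interval in milliseconds between requests (the names and the 200 ms value are made up for the example):
var minInterval = 200;   // e.g. 5 requests per second
var queue = [];          // queued calls: { fn, args }
var lastTime = 0;        // ms timestamp of the last executed request
var scheduled = false;

function throttle(fn /*, ...args */) {
    queue.push({ fn: fn, args: Array.prototype.slice.call(arguments, 1) });
    scheduleNext();
}

function scheduleNext() {
    if (scheduled || queue.length === 0) return;
    scheduled = true;
    var wait = Math.max(0, lastTime + minInterval - Date.now());
    setTimeout(processItemFromQueue, wait);
}

function processItemFromQueue() {
    scheduled = false;
    var item = queue.shift();
    lastTime = Date.now();
    item.fn.apply(null, item.args);
    scheduleNext(); // reschedules itself unless the queue is empty
}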
The definitive answer to this problem (and the best one) came from the API documentation itself. We have been using it for a couple of months and it has perfectly solved my problem.
In such cases, instead of writing some complicated queue code, the best way is to leverage JS's ability to handle asynchronous code and either write a simple backoff yourself or use one of the many great libraries that do so.
So, if you stumble upon any API limits (e.g. quota, 5xx, etc.), you should use backoff to run the query again, but with increasing delay (more about backoff can be found here: https://en.wikipedia.org/wiki/Exponential_backoff). If, after a given number of attempts, it still fails, gracefully return an error about the unavailability of the API.
Example use below (taken from https://www.npmjs.com/package/backoff):
var backoff = require('backoff');
// 'get' is any function with a (url, callback) signature, e.g. require('request').get

var call = backoff.call(get, 'https://someaddress', function(err, res) {
    console.log('Num retries: ' + call.getNumRetries());
    if (err) {
        // Put your error handling code here.
        // Called ONLY IF backoff fails to help
        console.log('Error: ' + err.message);
    } else {
        // Put your success code here
        console.log('Status: ' + res.statusCode);
    }
});

/*
 * When to retry. Here - 503 error code returned from the API
 */
call.retryIf(function(err) { return err.status == 503; });

/*
 * This lib offers two strategies - Exponential and Fibonacci.
 * I'd suggest using the first one in most of the cases
 */
call.setStrategy(new backoff.ExponentialStrategy());

/*
 * Info how many times backoff should try to post request
 * before failing permanently
 */
call.failAfter(10);

// Triggers backoff to execute given function
call.start();
There are many backoff libraries for Node.js, offering Promise-style, callback-style or even event-style backoff handling (the example above being the second of these). They're really easy to use if you understand the backoff algorithm itself. And since the backoff parameters can be stored in config, they can be adjusted to achieve better results if backoff is failing too often.
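For instance, continuing the example above, the strategy and retry limit could be driven by plain config values (the config object and its values below are hypothetical; initialDelay and maxDelay are options accepted by ExponentialStrategy in the backoff package):
// Hypothetical config, e.g. loaded from a JSON file or environment variables.
var config = {
    retries: 10,
    initialDelay: 100,   // ms before the first retry
    maxDelay: 10000      // cap on the delay between retries
};

call.setStrategy(new backoff.ExponentialStrategy({
    initialDelay: config.initialDelay,
    maxDelay: config.maxDelay
}));
call.failAfter(config.retries);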

Nodejs Mongoose - Serve clients a single query result

I'm looking to implement a solution where I can query the Mongoose database at a regular interval and then store the results to serve to my clients.
I'm assuming this will reduce my response time when my users pull the collection.
I attempted to implement this plan by creating an empty global object and then writing a function that queries the db and stores the results in the global object mentioned previously. At the end of the function I set a 60-second timeout and then run the function again. I call this function the first time the server controller gets called when the app is first run.
I then set my clients up so that when they request the collection, it first looks to see if the global object exists, and if so returns that as the response. I figured this would cut my 7-10 second queries down to < 1 sec.
In my novice thinking I assumed that, Node.js being 'single-threaded', something like this could work quite well - but it just seemed to eat up all my RAM and cause fatal errors.
Am I on the right track with my thinking or is it better to query the db every time people pull the collection?
Here is the code in question:
var allLeads = {};

var getAllLeads = function() {
    allLeads = {};
    console.log('Getting All Leads...');
    Lead.find().sort('-lastCalled').exec(function(err, leads) {
        if (err) {
            console.log('Error getting leads');
        } else {
            allLeads = leads;
        }
    });
    setTimeout(function() {
        getAllLeads();
    }, 60000);
};

getAllLeads();
Thanks in advance for your assistance.

Strange issue with socket.on method

I am facing a strange issue with calling socket.on methods from the JavaScript client. Consider the code below:
for (var i = 0; i < 2; i++) {
    var socket = io.connect('http://localhost:5000/');
    socket.emit('getLoad');
    socket.on('cpuUsage', function(data) {
        document.write(data);
    });
}
Here I am basically handling a cpuUsage event which is emitted by the socket server, but for each iteration I am getting the same value. This is the output:
0.03549148310035006
0.03549148310035006
0.03549148310035006
0.03549148310035006
Edit: Server-side code; basically I am using the node-usage library to calculate CPU usage:
socket.on('getLoad', function (data) {
    usage.lookup(pid, function(err, result) {
        cpuUsage = result.cpu;
        memUsage = result.memory;
        console.log("Cpu Usage1: " + cpuUsage);
        console.log("Cpu Usage2: " + memUsage);
        /*socket.emit('cpuUsage',result.cpu);
        socket.emit('memUsage',result.memory);*/
        socket.emit('cpuUsage', cpuUsage);
        socket.emit('memUsage', memUsage);
    });
});
Whereas on the server side, I am getting different values for each emit and socket.on. I find it very strange that this is happening. I tried setting data = null after each socket.on call, but it still prints the same value. I don't know what phrase to search for, so I posted here. Can anyone please guide me?
Please note: I am basically a Java developer and have less experience on the JavaScript side.
You are making the assumption that when you use .emit(), a subsequent .on() will wait for a reply, but that's not how socket.io works.
Your code basically does this:
it emits two getLoad messages directly after each other (which is probably why the returning value is the same);
it installs two handlers for a returning cpuUsage message being sent by the server;
This also means that each time you run your loop, you're installing more and more handlers for the same message.
Now I'm not sure what exactly it is you want. If you want to periodically request the CPU load, use setInterval or setTimeout. If you want to send a message to the server and 'wait' for a response, you may want to use acknowledgement functions (not very well documented, but see this blog post and the sketch after the polling example below).
But you should assume that for each type of message, you should only call socket.on('MESSAGETYPE', handler) once during the runtime of your code.
EDIT: here's an example client-side setup for a periodic poll of the data:
var socket = io.connect(...);

socket.on('connect', function() {
    // Handle the server response:
    socket.on('cpuUsage', function(data) {
        document.write(data);
    });

    // Start an interval to query the server for the load every 30 seconds:
    setInterval(function() {
        socket.emit('getLoad');
    }, 30 * 1000); // milliseconds
});
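And, for the acknowledgement-function route mentioned above, a rough sketch (the payload shape is an assumption; usage.lookup and pid are taken from the question's server code):
// Server: reply through socket.io's acknowledgement callback instead of separate emits.
socket.on('getLoad', function(ack) {
    usage.lookup(pid, function(err, result) {
        if (err) return ack({ error: err.message });
        ack({ cpu: result.cpu, memory: result.memory });
    });
});

// Client: the function passed as the last argument to emit() receives that reply.
socket.emit('getLoad', function(load) {
    document.write(load.cpu);
});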
Use this line instead:
var socket = io.connect('iptoserver', {'force new connection': true});
Replace iptoserver with the actual IP of the server, of course - in this case, localhost.
Edit: That is, if you want to create multiple clients.
Otherwise you have to place the initialization of the socket variable before the for loop.
I suspected the call returns the average CPU usage since process startup, which seems to be the case here. Checking the node-usage documentation page (average-cpu-usage-vs-current-cpu-usage section), I found:
By default CPU Percentage provided is an average from the starting time of the process. It does not correctly reflect the current CPU usage. (This is also a problem with the Linux ps utility.)
But if you call usage.lookup() continuously for a given pid, you can turn on the keepHistory flag and you'll get the CPU usage since the last time you tracked the usage. This reflects the current CPU usage.
The docs also give an example of how to use it:
var pid = process.pid;
var options = { keepHistory: true }
usage.lookup(pid, options, function(err, result) {
});

Node.js Kue how to restart failed jobs

I am using kue for delayed jobs in my node.js application.
I am having trouble figuring out how I can restart a job using the kue API without having to manually move the id of a job from the list of failed jobs to the list of inactive jobs using Redis commands.
Is this possible using kue?
I don't want to set a fixed number of retry attempts - I just want to retry specific jobs.
Suggestions for a well maintained alternative to kue are also welcome.
I don't know if this works, but you could try resetting the state of the job to inactive and saving the job again:
job.on('failed', function() {
    job.state('inactive').save();
});
Edit: setting state to inactive will correctly re-enqueue the task.
This can also be done using queue level events.
queue.on('job failed', function(id, result) {
    kue.Job.get(id, function(err, job) {
        if (!err && shouldRetry(job))
            job.state('inactive').save();
    });
});
Thus you don't need to do this for every job that you wish to retry; instead you can filter it in the queue-level event.
See Failure Attempts in the official docs:
By default jobs only have one attempt; that is, when they fail, they are marked as a failure and remain that way until you intervene. However, Kue allows you to specify this, which is important for jobs such as transferring an email, which upon failure may usually retry without issue. To do this invoke the .attempts() method with a number.
queue.create('email', {
    title: 'welcome email for tj'
  , to: 'tj@learnboost.com'
  , template: 'welcome-email'
}).priority('high').attempts(5).save();
reference: failure attempts

Database Backed Work Queue

My situation ...
I have a set of workers that are scheduled to run periodically, each at different intervals, and would like to find a good implementation to manage their execution.
Example: Let's say I have a worker that goes to the store and buys me milk once a week. I would like to store this job and its configuration in a MySQL table. But it seems like a really bad idea to poll the table (every second?) to see which jobs are ready to be put into the execution pipeline.
All of my workers are written in JavaScript, so I'm using node.js for execution and beanstalkd as a pipeline.
If new jobs (i.e. scheduling a worker to run at a given time) are being created asynchronously and I need to store the job result and configuration persistently, how do I avoid polling a table?
Thanks!
I agree that it seems inelegant, but given the way that computers work something *somewhere* is going to have to do polling of some kind in order to figure out which jobs to execute when. So, let's go over some of your options:
1. Poll the database table. This isn't a bad idea at all - it's probably the simplest option if you're storing the jobs in MySQL anyway. A rate of one query per second is nothing - give it a try and you'll notice that your system doesn't even feel it.
Some ideas to help you scale this to possibly hundreds of queries per second, or just keep system resource requirements down:
- Create a second table, 'job_pending', where you put the jobs that need to be executed within the next X seconds/minutes/hours.
- Run queries on your big table of all jobs only once in a longer while, then populate the small table which you query every shorter while.
- Remove jobs that were executed from the small table in order to keep it small.
- Use an index on your 'execute_time' (or whatever you call it) column.
2. If you have to scale even further, keep the main jobs table in the database, and use the second, smaller table I suggested, just put that table in RAM: either as a memory table in the DB engine, or in a queue of some kind in your program. Query the queue at extremely short intervals if you have to - it'll take some extreme use cases to cause any performance issues here.
The main issue with this option is that you'll have to keep track of jobs that were in memory but didn't execute, e.g. due to a system crash - more coding for you...
3. Create a thread for each of a bunch of jobs (say, all jobs that need to execute in the next minute), and call thread.sleep(millis_until_execution_time) (or whatever, I'm not that familiar with node.js).
This option has the same problem as no. 2 - you have to keep track of job execution for crash recovery. It's also the most wasteful imo - every sleeping job thread still takes up system resources.
There may be additional options of course - I hope that others answer with more ideas.
Just realize that polling the DB every second isn't a bad idea at all. It's the most straightforward way imo (remember KISS), and at this rate you shouldn't have performance issues so avoid premature optimizations.
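A rough sketch of that polling loop, assuming a MySQL table named job_pending with an indexed execute_time column, the mysql npm module, and a hypothetical enqueueInBeanstalkd() helper that hands each job to the worker pipeline:
var mysql = require('mysql');
var connection = mysql.createConnection({
    host: 'localhost',
    database: 'scheduler'   // user/password omitted for brevity
});

function pollPendingJobs() {
    // One cheap, indexed query per second against the small job_pending table.
    connection.query(
        'SELECT id, task, configuration FROM job_pending WHERE execute_time <= NOW()',
        function(err, rows) {
            if (err) return console.error(err);
            rows.forEach(function(row) {
                enqueueInBeanstalkd(row); // hypothetical: push the job into the beanstalkd pipeline
                // Remove executed jobs so the pending table stays small.
                connection.query('DELETE FROM job_pending WHERE id = ?', [row.id]);
            });
        }
    );
}

setInterval(pollPendingJobs, 1000);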
Why not have a Job object in node.js that's saved to the database?
// Rough schema for a persisted job
var Job = {
    id: Number,            // unique job id
    task: String,          // what to run
    configuration: Object, // JSON configuration for the task
    dueDate: Date,         // when the job should run
    finished: Boolean      // has it completed?
};
I would suggest you only store the id in RAM and leave all the other Job data in the database. When your timeout function finally runs it only needs to know the .id to get the other data.
var job = createJob(...); // create from async data somewhere.
job.save(); // save the job.

var id = job.id; // only store the id in RAM

// ask the job to be run in the future.
setTimeout(function() {
    // load the job when you want to run it
    db.load(id, function(job) {
        // run it.
        run(job);
        // mark as finished
        job.finished = true;
        // save your finished = true state
        job.save();
    });
}, job.dueDate - Date.now());

// remove job from RAM now.
job = null;
If the server ever crashes, all you have to do is query all jobs that have finished=false, load them into RAM and start the setTimeouts again.
If anything goes wrong, you should be able to restart cleanly like this:
db.find("job", { finished: false }, function(jobs) {
each(jobs, function(job) {
var id = job.id;
setTimeout(Date.now - job.dueDate, function() {
// load the job when you want to run it
db.load(id, function(job) {
// run it.
run(job);
// mark as finished
job.finished = true;
// save your finished = true state
job.save();
});
});
job = null;
});
});
