The perfect open-source task scheduler

We have ever more regular tasks that need to be run on our servers in a reliable and resilient way, but resilience is harder to achieve than reliability.

I can schedule scripts on our Unix systems using cron, and on our Windows systems via Task Scheduler, but neither of these is resilient – if the machine running those cron tasks crashes, my tasks won’t get executed.
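
As a reminder of how little cron gives you, a typical crontab entry looks something like this (the script path is a hypothetical example):

    # run a report at the top of every hour, on this machine only
    0 * * * *  /usr/local/bin/hourly-report.sh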

The failed machine may come back online reasonably fast – and were I running an enhanced cron such as “anacron” it might catch up on the jobs it missed – but a task scheduled to run every hour probably wouldn’t get done on time, and performing a catch-up might make matters worse. What I need is some kind of failover so that if one machine fails, another will run its tasks on time.

We’ve implemented such a system for a client on a pair of Linux machines, using heartbeat and the Distributed Replicated Block Device (DRBD). This client runs two sets of services out of these two machines, one set running on each machine under normal conditions.

Should one machine get taken out of service, its services will fail over onto the other. We use heartbeat to enable each machine to monitor the other, so that in case of failure it can take over its services. Part of this service switch involves a simple cluster file system implemented using the DRBD.

Under normal operation each machine reads and writes its own copy of each service’s files, and DRBD copies those changes to the other machine, so that if a failover occurs the failed-over services will have access to up-to-date copies of their files (normally each machine sees only the files for its own services and can’t see files on the other machine).
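
As a rough sketch of the shape of this setup (the node names, addresses, mount points and service names below are hypothetical, and the details vary between heartbeat and DRBD versions), a heartbeat v1 haresources file gives each node its preferred services, with the drbddisk and Filesystem resources bringing up that node’s DRBD volume on failover:

    # /etc/ha.d/haresources - each line names a preferred node and its resources
    alpha IPaddr::192.168.0.10/24 drbddisk::svc_a Filesystem::/dev/drbd0::/srv/svc_a::ext3 apache
    beta  IPaddr::192.168.0.11/24 drbddisk::svc_b Filesystem::/dev/drbd1::/srv/svc_b::ext3 reporting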

This combination of heartbeat and DRBD works well for services such as web serving (the primary application), but there was also a need to have various regular tasks fail over from one machine to another. We implemented this by putting the same crontab onto each machine, so that each knows what tasks it might have to run in the event of a failover and has their scripts in its file system.

Then we prefix each task with a script that checks whether that service’s files are present: since each machine normally sees only the files for its own services, the other machine’s task files will only be visible if a failover is in progress. This rather complex scheme gives us a form of resilient cron, but it isn’t without its problems.
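
A minimal sketch of that wrapper, with hypothetical paths and service names, might look like this; the same crontab line sits on both machines:

    # identical crontab entry on both machines
    15 * * * *  /usr/local/bin/run-if-visible.sh svc_a /srv/svc_a/bin/rotate-logs.sh

    #!/bin/sh
    # run-if-visible.sh: run a task only if its service's files are visible here,
    # i.e. this node currently owns that service (or has taken it over in a failover)
    service="$1"; shift
    if [ -d "/srv/$service" ]; then
        exec "$@"
    fi
    # otherwise do nothing - the other machine will run it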

First, the system is complicated. Second, it isn’t human-proof: when a new task is required you must remember to add it to both machines. Third, there are all sorts of boundary conditions relating to what happens to tasks that are running when a failure occurs, or worse still when a failure occurs and the services “fail back” in the middle of a scheduled task.

Finally, it isn’t scalable – it can fail services over from one node to another, but for a large server farm of identical machines with potentially multiple failures it wouldn’t work.

This last point is where you see the real problem with a cron-based solution. Cron and Task Scheduler were only designed to run regular housekeeping tasks on a single machine, but when you’re providing a service from a collection of machines then every single machine is a single point of failure.
Also, cron isn’t good at keeping records – how can you easily tell whether a task ran successfully last time, or the time before? Cron can send emails to people when there’s a problem, but when you need them most you can bet they’ll end up in unattended accounts.
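
The usual workarounds are a MAILTO address and redirecting each job’s output to a log file, which falls a long way short of a proper job history (the address and paths below are hypothetical):

    MAILTO=ops@example.com
    # keep some evidence of what happened, since cron won't record whether the job succeeded
    0 * * * *  /usr/local/bin/hourly-report.sh >> /var/log/hourly-report.log 2>&1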

So if cron and Task Scheduler aren’t ideal, what exactly do we need from a task-scheduling solution? We need something centralised, with a single management interface in which we can see all the regular tasks to be run, yet distributed, so that we never rely entirely on any one machine working. For example, if a task needs to be run every hour, we want to set that up through a single management interface, but to say that it can run from any of the machines.

Obviously, we’ll need access to that single management interface from more than one machine too. We’d also like the management interface to provide job histories, so that we know if and why any task has failed. Finally, it must be secure and reliable, and must work with our Unix and Windows servers. Believe it or not, there are quite a few open-source solutions that fit this bill: not all of them satisfy our requirements, but many of them are constantly being improved.

Why not outsource your task management?

My company has outsourced its email system to Google Apps, most of its DNS lookups to OpenDNS, and the phone system to a VoIP company, so why shouldn’t we outsource our regular task management too?

There are several companies that offer “web-based cron services” that operate from external servers that access pages on your own web servers at the required times. Hopefully, these will have solved the problem of resilience in their own systems, and have their own centralised management interfaces.
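
In essence, such a service does little more than fetch an agreed URL on your web server at each scheduled time, something equivalent to the following (the URL is a hypothetical example) run from their infrastructure rather than yours:

    # what the external service effectively does at each scheduled time
    curl -fsS https://www.example.com/tasks/hourly-report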

Such services are ideal for people who use web hosting and thus don’t have access to facilities such as cron. We can add further resilience to the service at our end by having the pages they access served via load balancers, and hence split over our whole server farm.

We considered using one of these services, but there are two reasons why we didn’t. First, some of the tasks that we need to run can take quite a long time to complete, and it may be that a web connection can’t be held open long enough, so we wouldn’t be sure that the remote service actually logged the success or failure correctly.

Second, there’s a security concern in that we’re exposing our internal business processes to external control. While that wouldn’t be a big problem for us, it would be for, say, financial businesses, and any competent security advisor would counsel against it.

Ortro: a web-based cron replacement

Having rejected an external web-based solution, we wondered if there’s an open-source system we could employ. We came across Ortro, a web-based system where tasks are set up via a web interface with all the configuration held in a database.

This solves a number of problems, because we can store that configuration database on our replicated database server and run its web interface from multiple machines.

To implement the underlying scheduling process, Ortro uses cron to kick itself into action. Since it relies on cron running on one machine, it has to have some means of making tasks run on the other machines, which it accomplishes either by accessing web pages or by running remote shells via SSH.
There are also neat facilities for setting up simple workflows, as in “run this process and if it works then run this process, or else run that one”.
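
The commands below are not Ortro’s own; they’re just a generic sketch of the pattern it relies on, with hypothetical host names and paths: a single cron entry on the master machine fires a dispatcher, which reaches the worker machines over SSH:

    # on the master machine only
    */5 * * * *  /usr/local/bin/dispatch-due-jobs.sh

    #!/bin/sh
    # dispatch-due-jobs.sh: run each due task on its target machine via SSH
    # (Ortro takes its list of due tasks from its database; here it's hard-coded)
    ssh web1 '/srv/tasks/refresh-cache.sh' || echo "refresh-cache failed on web1"
    ssh db1 '/srv/tasks/nightly-backup.sh' || echo "backup failed on db1"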

Everything in Ortro works well. We set up SSH on our internal load balancers and used it to schedule work in a cluster and “skip over” any failed machine. However, there were two things we didn’t like. We couldn’t easily get a history of what a job had done – for example, if it ran overnight did it fail at 1am but run okay at 2am, 3am and so on?

The biggest problem, though, was getting away from a single point of failure: Ortro relies on having one single master process run via cron on a single machine, and there was no simple way to keep two master machines operating in an active+standby mode.

Schedulers written in Java

When we found Ortro, we also came across a large collection of Java-based systems that we’d initially rejected simply because they were written in Java. We don’t have anything against the language per se: it’s more that Java is something other people use, not us. To be fair, we do use the Eclipse editor, which is written in Java, but we haven’t used it much on our servers.

We also use the Dell PowerEdge Server Administrator package, which employs Java at various points, but we don’t use it to deliver services. However, once we’d decided that Ortro wasn’t right for us, we decided to put our prejudices aside and give Java a go.

There are lots of distributed resilient schedulers written in Java, some truly open-source, but also a few “demoware” versions of commercial products. There are also several products aimed at large grid systems that we didn’t look at. The two Java-based systems we tried were a simple one called OddJob and a more complex one called Job Scheduler.

OddJob is easy to set up: download the ZIP file and you’re up and running. Unlike Ortro, OddJob relies on having a process that runs continually, and it ticked many of our boxes: we could control and monitor jobs on our network via a GUI; we could schedule jobs conditionally on what happened to other jobs; and it all seemed to work.

The system still has to have a master process with its own database, which isn’t currently an ordinary SQL database. As such, while we liked the product it didn’t quite tick all the boxes.

Job Scheduler is maintained by the German company SOS, and this package does literally everything we want of it.

It comes with an easy-to-use installer, it will work with any reasonably grown-up SQL database – in our case MySQL, but it’s known to work with PostgreSQL and SQL Server too – and it can be set up with an active scheduler node and a standby node.

Jobs may either be scheduled to run on the machine where the scheduler is located or on remote machines, which is accomplished either through remote connections or by having the job scheduler on that machine communicate with the master processes. The system supports conditional processing, workflows, logging and other features that would put many commercial solutions to shame. We spent a lot of time setting up Job Scheduler and playing with it – but ultimately, we gave up on it too.

The reason we gave up wasn’t really the software’s fault, it was ours. We use cron, and other people use equivalent software on Windows, because it’s so easy: we don’t need to set up scheduled jobs every day, but when we do we want it to be easy.

Unfortunately, we found Job Scheduler too difficult. It exposes everything it can do at every step, meaning that simple operations weren’t actually that simple. Also, although the system is managed via a web interface, we found that interface hard to access from different machines or from different networks.

None of these problems were insurmountable, but I think ultimately our “eyes were bigger than our bellies”: we didn’t really need all of its abilities on top of the features we did want.

What next: GNUbatch

So far this was going badly: we’d rejected every package we’d looked at. We didn’t like some solutions because they lacked resilience, and we didn’t like the more resilient solution because it was too complicated.

We then discovered a package called GNUbatch. It’s a mature product in the job-scheduling arena, but relatively new to the open-source community: it began life as a commercial package called Xi-Batch, has been actively developed since 1990, and was made open source in 2009.

Before you all start rushing for your package manager, be aware that you probably won’t find GNUbatch in repositories, so you’ll have to download the source code and compile it yourself.

Also be aware that this is good old-fashioned software: GNUbatch doesn’t use a relational database to store its information; it isn’t written in Java; and if you want you can control it all from a command line, although it does have GUI, Windows and web interfaces as well.

So what can GNUbatch do? It lets you set up jobs to run on one machine or many machines, and each job can have a start time and be set to repeat at various intervals – with a little bit of thought you can also restrict it to working days, avoiding weekends and holidays, and have it understand more sophisticated scheduling options such as “the last day of the working month”.

Jobs can be chained together, with optional paths based on errors, and these chains can be created to span multiple machines. For example, you could have a reporting task that runs on a collection of machines, which then causes a single job to run on another machine to aggregate the results into a report.
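
The syntax below isn’t GNUbatch’s (its jobs are created through its own commands and interfaces); it’s a plain-shell sketch of the chain just described, with hypothetical host and script names, to show the shape of the workflow:

    #!/bin/sh
    # fan out: run the per-machine reporting task on each node, stopping on error
    for host in web1 web2 web3; do
        ssh "$host" '/srv/tasks/collect-stats.sh' || exit 1
    done
    # only if they all succeeded, aggregate the results on the reporting machine
    ssh reports '/srv/tasks/build-daily-report.sh'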

All results of jobs are logged as text files and these logs can be manipulated later. All in all, we found that GNUbatch did exactly what we wanted, and while it did have a steepish learning curve, that was mitigated by the fact that it has a substantial collection of readable manuals.

So we eventually found an answer to our job-scheduling problem, after struggling with the classic open-source problem of having too many packages to choose from. If you have a similar need to ours, look at Job Scheduler and GNUbatch.
