The perfect open-source task scheduler
We have an ever-growing number of regular tasks that need to run on our servers both reliably and resiliently, and resilience is harder to achieve than reliability.
I can schedule scripts on our Unix systems using cron, and on our Windows systems via Task Scheduler, but neither of these is resilient – if the machine running those cron tasks crashes, my tasks won’t get executed.
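For instance, a standard crontab entry ties a script to a schedule on one machine only – the commands and paths here are invented for illustration:

```
# m  h  dom mon dow  command
# Run a (hypothetical) report script at the top of every hour.
0  *  *   *   *    /usr/local/bin/hourly-report.sh
# Rotate logs at 03:30 each night.
30 3  *   *   *    /usr/local/bin/rotate-logs.sh
```

If this machine is down at the top of the hour, nothing else knows the report was due.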
The failed machine may come back online reasonably fast – and were I running an enhanced cron such as “anacron” it might catch up on the jobs it missed – but a task scheduled to run every hour probably wouldn’t get done on time, and performing a catch-up might make matters worse. What I need is some kind of failover so that if one machine fails, another will run its tasks on time.
We’ve implemented such a system for a client on a pair of Linux machines, by using heartbeat and the Distributed Replicated Block Device (DRBD). This client runs two sets of services out of these two machines, one set running on each machine under normal conditions.
Should one machine get taken out of service, its services will fail over onto the other. We use heartbeat to enable each machine to monitor the other, so that in case of failure it can take over its services. Part of this service switch involves a simple cluster file system implemented using the DRBD.
Under normal operation each machine reads and writes its own copy of each service’s files, and DRBD copies those changes to the other machine, so that if a failover occurs the failed-over services will have access to up-to-date copies of their files. (Normally each machine sees only the files for its own services and can’t see files on the other machine.)
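As a sketch of how such replication is set up – the resource name, device paths, hostnames and addresses below are invented for illustration, not taken from the client’s configuration – a DRBD resource definition looks roughly like this:

```
resource web {
    protocol  C;             # synchronous: a write completes on both nodes
    device    /dev/drbd0;    # the replicated block device the service mounts
    disk      /dev/sda7;     # backing partition on each node
    meta-disk internal;
    on node1 { address 10.0.0.1:7788; }
    on node2 { address 10.0.0.2:7788; }
}
```

The node that currently owns the service mounts /dev/drbd0; the other node merely receives the replicated writes.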
This combination of heartbeat and DRBD works well for services such as web serving (the primary application), but there was also a need to have various regular tasks fail over from one machine to another. We implemented this by putting the same crontab onto each machine, so that each knows what tasks it might have to run in the event of failover and has their scripts in its file system.
Then we prefix each task with a script that checks whether that service’s files are present: since each machine normally sees only the files for its own services, the other machine’s task files will only be visible if a failover is in progress. This rather complex scheme enables us to implement a form of resilient cron, but it isn’t without its problems.
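The prefix can be a very small shell script. This is a minimal sketch, assuming each service’s DRBD-backed file system appears at a known mount point; `run_if_active` and the paths are hypothetical names, not the client’s actual script:

```shell
#!/bin/sh
# run_if_active DIR COMMAND [ARGS...]
# Runs COMMAND only if DIR (the service's file system) is visible,
# i.e. this node currently owns the service. Otherwise it does
# nothing, trusting the node that does own the service to run it.
run_if_active() {
    service_dir=$1
    shift
    if [ -d "$service_dir" ]; then
        "$@"
    fi
}
```

Each crontab entry then becomes something like `0 * * * * run_if_active /srv/web /srv/web/bin/report.sh`, and the same crontab can be installed verbatim on both machines.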
First, the system is complicated. Second, it isn’t human-proof: when a new task is required you must remember to add it to both machines. Third, there are all sorts of boundary conditions relating to what happens to tasks that are running when a failure occurs, or worse still when a failure occurs and the services “fail back” in the middle of a scheduled task.
Finally, it isn’t scalable – it can fail services over from one node to another, but for a large server farm of identical machines with potentially multiple failures it wouldn’t work.
This last point is where you see the real problem with a cron-based solution. Cron and Task Scheduler were only designed to run regular housekeeping tasks on a single machine, but when you’re providing a service from a collection of machines then every single machine is a single point of failure.