Locking from Cron
Periodic jobs often must not run more than one instance at a time. Unfortunately, simple solutions often fail to account for common edge cases. For example, assume a need to synchronize files each hour with rsync. On Unix, a cron job is perhaps the quickest solution:
    # run just past the top of the hour
    # as many other things run then
    7 * * * * rsync -e ssh -az --delete /source desthost:/dest
However, this solution has a major edge case that can bring down the system. Worse, simplistic attempts to fix this fault can result in rsync not running.
Resource Usage Spiral
rsync, run directly from cron, will run until the file transfer has completed, or a bug causes rsync to hang, or some fault causes rsync to terminate unexpectedly: bad memory, a system reboot, a cowboy admin getting frisky with kill -9, and so forth. The most worrying case is when rsync operates normally, but takes too long to complete. Cron will then launch another rsync at the next scheduled time, and if the earlier instances never exit, they pile up until the system eventually fails.
The rsync --timeout=999 option is useful, but not sufficient. It sets an I/O timeout, so an rsync that stalls with no data moving will eventually exit. However, it does not limit the total run time, and it will not help when crond launches the next instance while rsync is still transferring files. A wrapper script around rsync is necessary to prevent multiple instances from running.
Naïve Locking Schemes
    < Lovecraft> thrig: its a matter of making a script that looks for a lock file. if exist <lockfile>, don't start. Ifnot exist <lockfile> make one and start.
    < thrig> Lovecraft: and what else?
    < Lovecraft> Thats it
File locking requires slightly more thought than looking for a lock file and not running if it exists. Lovecraft’s incomplete solution will cause problems in several cases: the system crashes; the system restarts normally but the script does not handle the shutdown signal properly; the script is terminated by kill -9 (even if it traps other signals properly); a hardware fault causes the script to exit; or a system configuration issue prevents the lock file from being created. Worst, if the script is poorly written and poorly monitored, nobody may know that rsync is not running until some need reveals the lack of recent files on the destination server, which could be weeks or months after things went awry.
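    #!/bin/sh
    # Skeleton locking with signal handling.
    # Much more sanity checking required!

    PID_FILE="/var/lock/foopid"

    cleanup () {
        rm -f -- "$PID_FILE"
    }

    # Race condition probably not a concern,
    # due to the infrequency of rsync runs.
    [ -e "$PID_FILE" ] && exit
    touch "$PID_FILE" || exit

    # Install the traps only after the lock is taken. If they were set
    # earlier, exiting because another instance holds the lock would
    # remove that instance's lock file.
    trap cleanup 0
    trap 'exit 1' 1 2 13 15

    # The exit trap removes the lock file when rsync finishes.
    rsync --timeout=999 -e ssh -az \
      /sourcedir/ desthost:/destdir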
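The stale-lock cases listed above (crash, kill -9, and so on) can be partly detected if the lock records the PID of its owner. The following is a minimal sketch, not a hardened implementation: the LOCK_DIR path and the function names are hypothetical, the check assumes the lock owner runs as the same user (kill -0 fails on another user's process), and a recycled PID or two instances racing to clear a stale lock can still fool it.

```shell
#!/bin/sh
# Sketch: a lock directory that records its owner's PID, so that a lock
# left behind by a dead process can be detected and cleared.
# LOCK_DIR is a hypothetical path; adjust for the local system.
LOCK_DIR="${LOCK_DIR:-/tmp/rsync-example.lock}"

# acquire_lock: return 0 if this process now holds the lock, 1 otherwise.
acquire_lock () {
    # mkdir either creates the directory or fails; two processes
    # cannot both succeed.
    if mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "$$" > "$LOCK_DIR/pid"
        return 0
    fi
    # The lock exists. Is the recorded owner still alive?
    owner=$(cat "$LOCK_DIR/pid" 2>/dev/null)
    if [ -n "$owner" ] && kill -0 "$owner" 2>/dev/null; then
        return 1    # owner is running; do not start
    fi
    # Stale lock: remove it and try once more.
    rm -rf -- "$LOCK_DIR"
    mkdir "$LOCK_DIR" 2>/dev/null || return 1
    echo "$$" > "$LOCK_DIR/pid"
}

# release_lock: remove the lock; call from an exit trap.
release_lock () {
    rm -rf -- "$LOCK_DIR"
}
```

A wrapper might call `acquire_lock || exit 0`, then `trap release_lock 0`, then run rsync. The lock only ever suggests that the process is running; the PID check narrows the gap but does not close it.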
Lock files introduce a new problem: the lock file suggests, but by no means proves, that the associated process is actually running. Depending on the implementation, rsync may be running with no lock file present (a permissions problem coupled with an “ignore errors creating lock file” implementation), or rsync may not be running while a lock file exists, for the various reasons outlined above. Consider locking against the process name, not a file on disk.
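One way to check the process name is the pgrep utility, where it is available. The sketch below assumes pgrep is installed (it ships with procps on Linux and with the BSDs) and that no unrelated rsync runs on the same host; already_running is a hypothetical helper name. A small window remains between the check and the launch.

```shell
#!/bin/sh
# already_running NAME: succeed if some process is named exactly NAME.
# pgrep never reports itself; -x requires an exact match on the name.
already_running () {
    pgrep -x "$1" >/dev/null 2>&1
}

# In a wrapper run from cron, the check would guard the transfer:
#   already_running rsync && exit 0
#   rsync --timeout=999 -e ssh -az /sourcedir/ desthost:/destdir
```

Matching on the bare name is blunt: any rsync on the system, including one started by another user, counts as running. pgrep's -u and -f options can narrow the match by user or by full command line.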
Ideas for Improvements
Software besides crond, such as CFEngine, provides locking functionality. If possible, use these solutions, as they are likely better tested than an in-house shell script. CFEngine or similar configuration management software can also ensure the lock directory exists and has the correct permissions, if a lock file scheme is used.
With Perl, one solution is to lock the script itself via the special __DATA__ filehandle, which will avoid the various problems of an external lock file. I generally prefer Perl over the shell, as the shell lacks the equivalent of perl -c, makes writing unit tests difficult, and has a number of scary edge cases that can delete entire disks.
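For comparison, the idea of locking the script itself can be approximated in the shell on systems that have the flock(1) utility from util-linux. That is an assumption, as flock(1) is not available on every Unix, and lock_file_or_fail is a hypothetical helper name. This sketch is a shell analogue and not the Perl __DATA__ technique itself.

```shell
#!/bin/sh
# lock_file_or_fail FILE: take an exclusive, non-blocking lock on FILE
# through file descriptor 9. The kernel releases the lock once every
# process holding that descriptor has exited, so a crash or kill -9
# cannot leave a stale lock behind.
lock_file_or_fail () {
    exec 9<"$1" || return 1
    flock -n 9
}

# A wrapper would lock its own script file before doing any work:
#   lock_file_or_fail "$0" || exit 0
#   rsync --timeout=999 -e ssh -az /sourcedir/ desthost:/destdir
```

Because rsync inherits descriptor 9, the lock stays held for as long as the transfer runs, even if the wrapper itself is killed first.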
Another implementation wraps the rsync inside a loop. This prevents multiple rsync instances from running, but pushes the locking and monitoring to the wrapper script instead. Note that the loop starts each run an hour after the previous one completes, not once every hour.
    while sleep 3600; do
        rsync ...
    done
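If a fixed schedule matters, the loop can sleep until the next slot instead of for a fixed interval. The following sketch computes that delay. seconds_until_slot is a hypothetical helper, and the usage shown in the comments assumes a date command that supports +%s, which is common but not universal.

```shell
#!/bin/sh
# seconds_until_slot NOW INTERVAL OFFSET: print the number of seconds from
# epoch time NOW until the next time T with T % INTERVAL == OFFSET.
# The result is always between 1 and INTERVAL.
seconds_until_slot () {
    echo $(( $2 - ( ($1 - $3) % $2 + $2 ) % $2 ))
}

# Run at seven minutes past each hour; a run that overshoots its slot
# simply waits for the next one:
#   while sleep "$(seconds_until_slot "$(date +%s)" 3600 420)"; do
#       rsync ...
#   done
```

As with the plain sleep loop, only one rsync runs at a time, and something still has to start and supervise the loop itself.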
Monitoring whether rsync actually did anything is another can of worms. This monitoring should not be done by e-mail: frequent messages become cron spam that is filtered and ignored. Monitoring should also not report transitory errors, where the target is temporarily unavailable, as investigating false alarms wastes time.
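One approach that tolerates transitory failures is to record the time of the last successful run and raise an alarm only when that time grows too old. The sketch below assumes a monitoring system that polls a check command and acts on its exit status. STAMP_FILE and both function names are hypothetical, and date +%s is assumed to be supported.

```shell
#!/bin/sh
# STAMP_FILE is a hypothetical path holding the epoch time of the last
# successful transfer.
STAMP_FILE="${STAMP_FILE:-/tmp/rsync-example.last-success}"

# record_success: call from the wrapper only after rsync exits 0.
record_success () {
    date +%s > "$STAMP_FILE"
}

# sync_is_stale MAX_AGE [NOW]: succeed if no success has been recorded
# within the last MAX_AGE seconds. A missing or empty stamp counts as stale.
sync_is_stale () {
    now=${2:-$(date +%s)}
    last=$(cat "$STAMP_FILE" 2>/dev/null) || last=0
    [ $(( now - ${last:-0} )) -gt "$1" ]
}

# Wrapper:  rsync ... && record_success
# Check:    sync_is_stale 10800 && exit 2   # nothing succeeded for 3 hours
```

With an hourly job and a three-hour threshold, one or two failed runs pass silently, while a job that has quietly stopped is reported within hours, not weeks.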

