Watchdog User's Manual

February 2006

The Cooperative Computing Tools are Copyright (C) 2003-2004 Douglas Thain and Copyright (C) 2005- The University of Notre Dame. All rights reserved. This software is distributed under the GNU General Public License. See the file COPYING for details.


Keeping a collection of processes running in large distributed system presents many practical challenges. Machines reboot, programs crash, filesystems go up and down, and software must be upgraded. Ensuring that a collection of services is always running and up-to-date can require a large amount of manual activity in the face of these challenges.

watchdog is a tool for keeping server processes running continuously. The idea is simple: watchdog is responsible for starting a server. If the server should crash or exit, watchdog restarts it. If the program on disk is upgraded, watchdog will cleanly stop and restart the server to take advantage of the new version. To avoid storms of coordinated activity in a large cluster, these actions are taken with an exponential backoff and a random factor.

watchdog is recommended for running the chirp and catalog servers found elsewhere in this package.


To run a server under the eye of watchdog, simply place watchdog in front of the server name. That is, if you normally run:
/path/to/chirp_server -r /my/data -p 10101
Then run this instead:
watchdog /path/to/chirp_server -r /my/data -p 10101
For most situations, this is all that is needed. You may fine tune the behavior of watchdog in the following ways:

Logging. Watchdog keeps a log of all actions that it performs on the watched process. Use the -d all option to see them, and the -o file option to direct them to a log file.

Upgrading. To upgrade servers running on a large cluster, simply install the new binary in the filesystem. By default, each watchdog will check for an upgraded binary once per hour and restart if necessary. Checks are randomly distributed around the hour so that the network and/or filesystem will not be overloaded. (To force a restart, send a SIGHUP to the watchdog.) Use the -c option to change the upgrade check interval.

Timing. Watchdog has several internal timers to ensure that the system is not overloaded by cyclic errors. These can be adjusted by various options (in parentheses.) A minimum time of ten seconds (-m) must pass between a server stop and restart, regardless of the cause of the restart. If the server exits within the first minute (-s) of execution, it is considered to have failed. For each such failure, the minimum restart time is doubled, up to a maximum of one hour (-M). If the program must be stopped, it is first sent an advisory SIGTERM signal. If it does not exit voluntarily within one minute (-S), then it is killed outright with a SIGKILL signal.