The Cooperative Computing Tools are Copyright (C) 2003-2004 Douglas Thain and Copyright (C) 2005- The University of Notre Dame. All rights reserved. This software is distributed under the GNU General Public License. See the file COPYING for details.
watchdog is a tool for keeping server processes running continuously. The idea is simple: watchdog is responsible for starting a server. If the server should crash or exit, watchdog restarts it. If the program on disk is upgraded, watchdog will cleanly stop and restart the server to take advantage of the new version. To avoid storms of coordinated activity in a large cluster, these actions are taken with an exponential backoff and a random factor.
watchdog is recommended for running the chirp and catalog servers found elsewhere in this package.
/path/to/chirp_server -r /my/data -p 10101Then run this instead:
watchdog /path/to/chirp_server -r /my/data -p 10101For most situations, this is all that is needed. You may fine tune the behavior of watchdog in the following ways:
Logging. Watchdog keeps a log of all actions that it performs on the watched process. Use the -d all option to see them, and the -o file option to direct them to a log file.
Upgrading. To upgrade servers running on a large cluster, simply install the new binary in the filesystem. By default, each watchdog will check for an upgraded binary once per hour and restart if necessary. Checks are randomly distributed around the hour so that the network and/or filesystem will not be overloaded. (To force a restart, send a SIGHUP to the watchdog.) Use the -c option to change the upgrade check interval.
Timing. Watchdog has several internal timers to ensure that the system is not overloaded by cyclic errors. These can be adjusted by various options (in parentheses.) A minimum time of ten seconds (-m) must pass between a server stop and restart, regardless of the cause of the restart. If the server exits within the first minute (-s) of execution, it is considered to have failed. For each such failure, the minimum restart time is doubled, up to a maximum of one hour (-M). If the program must be stopped, it is first sent an advisory SIGTERM signal. If it does not exit voluntarily within one minute (-S), then it is killed outright with a SIGKILL signal.