Prerequisites:
To complete Tutorial A, you will need access to a working cluster running the HTCondor batch system. HTCondor is used widely around the world, so you may already have a working installation at your local university. If not, you can access an HTCondor cluster via the XSEDE computing infrastructure or even set up a personal HTCondor cluster via Google Cloud.
ssh USERNAME@condorfe.crc.nd.edu
If you are using a Windows machine, download and install PuTTY and use that to connect to the host condorfe.crc.nd.edu.
# If you are using tcsh:
setenv PATH /afs/crc.nd.edu/user/c/condor/software/bin:$PATH
# Or if you are using bash:
export PATH=/afs/crc.nd.edu/user/c/condor/software/bin:$PATH
mkdir /tmp/YOURNAME
cd /tmp/YOURNAME
condor_status
Name               OpSys   Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@wang044.crc. LINUX   X86_64 Claimed   Busy     0.000  23933 0+00:35:04
slot1@wang046.crc. LINUX   X86_64 Unclaimed Idle     0.010  23933 0+00:35:04
slot1@cclmac00.cse OSX     INTEL  Owner     Idle     1.900  2048  0+02:20:04
slot1@cclmac02.cse OSX     X86_64 Claimed   Busy     1.670  8192  0+06:45:29
slot1@em-mjpthf0.N WINDOWS X86_64 Unclaimed Idle     0.000  2039  5+01:45:53
slot2@em-mjpthf0.N WINDOWS X86_64 Unclaimed Idle     0.110  2039  5+01:45:54
...
               Total Owner Claimed Unclaimed Matched Preempting Backfill
   INTEL/LINUX     5     0       0         5       0          0        0
     INTEL/OSX     1     0       0         1       0          0        0
  X86_64/LINUX   805    43     107       655       0          0        0
    X86_64/OSX     1     0       0         1       0          0        0
X86_64/WINDOWS     8     0       0         8       0          0        0
         Total   820    43     107       670       0          0        0
Note that an HTCondor pool can contain machines with many different operating systems and architectures, and with varying numbers of CPUs and amounts of memory and disk. By default, your job will only run on a machine compatible with the one you submitted from.
Also, many machines are divided into "slots" which are simply places to run independent jobs. If, for example, a job that only needs one core and one GB of memory lands on a machine with 16 cores and 32GB of memory, HTCondor will break off a slot large enough to hold the job, leaving an idle slot of 15 cores and 31GB of memory that another job (or multiple jobs) could use.
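If you want to list only the machines of one platform, condor_status accepts a -constraint expression over machine attributes. The following is a minimal sketch, assuming the attribute names OpSys and Arch correspond to the column headings in the listing above. The helper machine_constraint is a hypothetical name introduced here for illustration, and the query is skipped when condor_status is not installed.

```shell
#!/bin/sh
# Build a constraint expression matching one operating system and
# architecture. machine_constraint is a hypothetical helper, not an
# HTCondor command.
machine_constraint() {
    printf 'OpSys == "%s" && Arch == "%s"' "$1" "$2"
}

# Only query the pool if the HTCondor tools are on the PATH.
if command -v condor_status >/dev/null 2>&1; then
    condor_status -constraint "$(machine_constraint LINUX X86_64)" \
        || echo "condor_status query failed" >&2
fi
```

Substitute the OpSys and Arch values of your own submit machine to see roughly which machines your jobs could land on by default.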
To see what other users are doing on the system, try these commands:
condor_status -submitters
condor_userprio
Note that, in a busy HTCondor system, you can expect to see a large number of users submitting a large number of jobs, perhaps keeping all of the machines busy. Each user has a changing "user priority" that increases (gets worse) as machines are used. All other things being equal, Condor gives more machines to users with lower (better) user priorities. So, just go ahead and submit the work you want to do, and HTCondor will give you some fraction of the available machines, even if they are already in use.
Ordinarily, you would use HTCondor to run long-running programs, such as a simulation or data analysis code. For testing purposes, let's create a little script that displays the current time and host, and then sleeps for five seconds. Use your text editor to create a file called test.sh and put the following commands in it:
#!/bin/sh
date
uname -n
sleep 5
Then, make the program executable with the chmod command, and test to see that it works as expected:
chmod 755 test.sh
./test.sh
To submit a batch job to Condor, you must create a submission file and then run the condor_submit command. Using your favorite text editor, create a new file called test.submit and enter the following text into it:
universe = vanilla
executable = test.sh
arguments = hello
output = test.output
error = test.error
should_transfer_files = yes
when_to_transfer_output = on_exit
log = test.logfile
queue
Now, to submit the job to Condor, execute:
condor_submit test.submit
Submitting job(s)...
1 job(s) submitted to cluster 2.
Once the job is submitted, you can use condor_q to look at the status of the jobs in your queue. If you run condor_q quickly enough, you will see your job idle:
condor_q
-- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
 ID    OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
 2.0   dthain  8/26 17:21   0+00:00:00 I  0   0.0  ./test.sh
And then running:
condor_q
-- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
 ID    OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
 2.0   dthain  8/26 17:21   0+00:00:00 R  0   0.0  ./test.sh
And then gone when it is complete:
condor_q
-- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
 ID    OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
If you decide to cancel a job, use condor_rm and the job id:
condor_rm 2.0
Job 2.0 marked for removal.
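If you script your submissions, it can help to capture the cluster number that condor_submit prints, since that number is what you pass to condor_q or condor_rm afterwards. Below is a minimal sketch, assuming the output contains a line of the form "N job(s) submitted to cluster M." as in the example above. The helper parse_cluster_id is a hypothetical name and not part of HTCondor; the submission itself is skipped if condor_submit or test.submit is missing.

```shell
#!/bin/sh
# Read condor_submit output on stdin and print only the cluster number.
# Assumes the "submitted to cluster M." wording shown in this tutorial.
parse_cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Submit only when the tools and the submit file are actually present.
if command -v condor_submit >/dev/null 2>&1 && [ -f test.submit ]; then
    cluster=$(condor_submit test.submit | parse_cluster_id)
    echo "submitted cluster $cluster"
    # Show just this cluster; "condor_rm $cluster" would cancel all of it.
    condor_q "$cluster"
fi
```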
Because you will certainly want to run many jobs at once via Condor, you can easily modify your submit file to run a program with tens or hundreds of variations. Modify test.submit, changing the final queue command to queue ten jobs at once, and add the $(PROCESS) macro to modify the parameters so that each job has a distinct output file.
universe = vanilla
executable = test.sh
arguments = $(PROCESS)
output = test.output.$(PROCESS)
error = test.error.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = test.logfile
queue 10
Now, when you run condor_submit, you should see something like this:
condor_submit test.submit
Submitting job(s)..........
10 job(s) submitted to cluster 9.
Note in this case that "cluster" means "a bunch of jobs", where each job is named 9.0, 9.1, 9.2, and so forth. In this next example, condor_q shows that cluster 9 is halfway complete, with job 9.5 currently running.
condor_q
-- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
 ID    OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
 9.5   dthain  8/26 17:46   0+00:00:01 R  0   0.0  ./test.sh 5
 9.6   dthain  8/26 17:46   0+00:00:00 I  0   0.0  ./test.sh 6
 9.7   dthain  8/26 17:46   0+00:00:00 I  0   0.0  ./test.sh 7
 9.8   dthain  8/26 17:46   0+00:00:00 I  0   0.0  ./test.sh 8
 9.9   dthain  8/26 17:46   0+00:00:00 I  0   0.0  ./test.sh 9
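Since $(PROCESS) takes the values 0 through 9 for "queue 10", each job should leave behind its own file, from test.output.0 to test.output.9. Once the queue is empty, a short loop can confirm that nothing is missing. This is only a sketch: missing_outputs is a hypothetical helper name, and it assumes you run it in the directory where the output files were written.

```shell
#!/bin/sh
# Print the name of every expected output file that does not exist.
# $1 is the number of jobs queued, so process numbers run from 0 to $1 - 1.
missing_outputs() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        [ -f "test.output.$i" ] || echo "test.output.$i"
        i=$((i + 1))
    done
}

# For the ten-job cluster above; prints nothing if all files are present.
missing_outputs 10
```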
Once you have successfully run ten short jobs, try going a little bigger. Modify test.sh so that it runs for thirty seconds, modify test.submit to submit 100 instances of that job, and then submit them with condor_submit.
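One possible way to make both changes from the command line is sketched below. It overwrites test.sh and test.submit in the current directory, so treat it as an illustration rather than the only approach; editing the two files by hand works just as well. Note the quoted 'EOF' delimiter, which keeps the shell from trying to expand $(PROCESS) itself.

```shell
#!/bin/sh
# Rewrite the test script so that each job sleeps for thirty seconds.
cat > test.sh <<'EOF'
#!/bin/sh
date
uname -n
sleep 30
EOF
chmod 755 test.sh

# Rewrite the submit file to queue 100 jobs with distinct output files.
cat > test.submit <<'EOF'
universe = vanilla
executable = test.sh
arguments = $(PROCESS)
output = test.output.$(PROCESS)
error = test.error.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = test.logfile
queue 100
EOF

# Submit only if the HTCondor tools are available on this machine.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit test.submit
fi
```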
As you run a larger number of jobs, it becomes somewhat more challenging to keep track of everything and understand what's happening. Use condor_q several times to see which jobs are running. Try using condor_q -run to see exactly what machines they run on. Watch the file test.logfile, which shows each execution event:
tail -f test.logfile
Of course, reading everything in that logfile would be quite tedious to do manually. Instead, we have developed some tools which make it easier to see the big picture of what happened. When all (or most) of your jobs are complete, take the logfile and upload it to the Condor Log Analyzer website. It will give you a timeline of the number of jobs running over time, something like this:
The graph above shows a common pattern, in which the number of jobs running jumps up quite quickly, stays (relatively) stable for some time, and then decays gradually until the final job is complete. We will discuss this phenomenon in more detail in Lecture 3 of the DISC course.
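For a quick local summary before uploading anything, you can count the events in test.logfile yourself. The sketch below assumes the usual user log layout, in which each event starts with a three-digit code followed by the job id in parentheses, with 000 for submission, 001 for execution, and 005 for termination. The helper count_events is a hypothetical name; check a few lines of your own logfile to confirm that the format matches before relying on the counts.

```shell
#!/bin/sh
# Count submit, execute, and terminate events in an HTCondor user log.
# Assumes event lines look like "005 (009.003.000) 08/26 17:46:10 ...".
count_events() {
    awk '
        /^000 \(/ { submitted++ }
        /^001 \(/ { executing++ }
        /^005 \(/ { terminated++ }
        END {
            printf "submitted %d\nexecuting %d\nterminated %d\n",
                   submitted, executing, terminated
        }
    ' "$1"
}

# Summarize the tutorial logfile if it exists in the current directory.
if [ -f test.logfile ]; then
    count_events test.logfile
fi
```

When every job has finished, the three counts should normally agree; a lower "terminated" count suggests that some jobs are still idle or running.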