DISC Tutorial A: HTCondor

Prerequisites:

  • Familiarity with basic Unix/Linux commands.
  • Ability to use a text editor to create and modify text files.
  • Completed Lecture 2 in the DISC online course.
  • Access to a working cluster running the HTCondor batch system. HTCondor is used widely around the world, so you may already have a working installation at your local university. If not, you can access an HTCondor cluster via the XSEDE computing infrastructure, or even set up a personal HTCondor cluster via Google Cloud.

    Setup for HTCondor at Notre Dame

    1. If using the wireless network, make sure that you are using the eduroam network and not the ND-Guest network.
    2. Connect to the CRC Condor front end node. From Linux or Mac, open up a terminal and use ssh:
      ssh USERNAME@condorfe.crc.nd.edu
      
      If you are using a Windows machine, download and install PuTTY and use that to connect to the host condorfe.crc.nd.edu.
    3. Set up your environment to include the Condor tools:
      # If you are using tcsh:
      setenv PATH /afs/crc.nd.edu/user/c/condor/software/bin:$PATH
      # Or if you are using bash:
      export PATH=/afs/crc.nd.edu/user/c/condor/software/bin:$PATH
      
    4. Make a temporary directory with your name and move there. (A peculiarity of our local setup is that Condor cannot write to the AFS filesystem, so you must use /tmp instead.)
      mkdir /tmp/YOURNAME
      cd /tmp/YOURNAME
      
    5. View the ND HTCondor status web pages and the HTCondor Matrix.

    View System Status

    To see the machines available in the HTCondor pool, use the condor_status tool:
    condor_status
    
    Name               OpSys      Arch   State     Activity  LoadAv Mem   ActvtyTime
    slot1@wang044.crc. LINUX      X86_64 Claimed   Busy      0.000 23933  0+00:35:04
    slot1@wang046.crc. LINUX      X86_64 Unclaimed Idle      0.010 23933  0+00:35:04
    slot1@cclmac00.cse OSX        INTEL  Owner     Idle      1.900 2048  0+02:20:04
    slot1@cclmac02.cse OSX        X86_64 Claimed   Busy      1.670 8192  0+06:45:29
    slot1@em-mjpthf0.N WINDOWS    X86_64 Unclaimed Idle      0.000 2039  5+01:45:53
    slot2@em-mjpthf0.N WINDOWS    X86_64 Unclaimed Idle      0.110 2039  5+01:45:54
    ...
                        Total Owner Claimed Unclaimed Matched Preempting Backfill
    
             INTEL/LINUX     5     0       0         5       0          0        0
               INTEL/OSX     1     0       0         1       0          0        0
            X86_64/LINUX   805    43     107       655       0          0        0
              X86_64/OSX     1     0       0         1       0          0        0
          X86_64/WINDOWS     8     0       0         8       0          0        0
    
                   Total   820    43     107       670       0          0        0
    

    Note that an HTCondor pool can contain machines with many different operating systems and architectures, and with varying numbers of CPUs and amounts of memory and disk. By default, your job will only run on a machine compatible with the one you submitted from.

    Also, many machines are divided into "slots" which are simply places to run independent jobs. If, for example, a job that only needs one core and one GB of memory lands on a machine with 16 cores and 32GB of memory, HTCondor will break off a slot large enough to hold the job, leaving an idle slot of 15 cores and 31GB of memory that another job (or multiple jobs) could use.
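As a rough sketch of how these two ideas show up in practice, a submit file (the kind created later in this tutorial) can state how much of a machine a job needs and can override the default platform matching. The values below are illustrative assumptions, not settings this tutorial requires:

```
# Hypothetical additions to a submit file; the values are examples only.
# Ask for one core and about one gigabyte of memory (the number is in megabytes).
request_cpus = 1
request_memory = 1024
# Spell out the platform to match instead of relying on the default.
requirements = (OpSys == "LINUX") && (Arch == "X86_64")
```

Comments are kept on their own lines here because a trailing comment on a submit command would be read as part of its value.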

    To see what other users are doing on the system, try these commands:

    condor_status -submitters
    condor_userprio
    

    Note that, in a busy HTCondor system, you can expect to see a large number of users submitting a large number of jobs, perhaps keeping all of the machines busy. Each user has a changing "user priority" that increases (gets worse) as machines are used. All other things being equal, Condor gives more machines to users with lower (better) user priorities. So, just go ahead and submit the work you want to do, and HTCondor will give you some fraction of the available machines, even if they are already in use.

    Use what you learned so far to answer these questions about your particular HTCondor pool:
    1. How many machines are available in your HTCondor pool?
    2. How many different operating systems and architectures are available?
    3. What user is currently running the greatest number of jobs?
    4. Which user has the highest (worst) priority, and which has the lowest (best) priority?
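One possible way to approach the first two questions from the command line is sketched below. It assumes that your HTCondor version supports the -autoformat option of condor_status; the helper name count_platforms is made up for this example. Note that it counts slots rather than physical machines, since condor_status reports one entry per slot.

```shell
#!/bin/sh
# Tally "OpSys Arch" pairs read from standard input, most common first.
# Output lines look like "COUNT ARCH/OPSYS".
count_platforms() {
    awk 'NF >= 2 { n[$2 "/" $1]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Only query the pool when the HTCondor tools are actually on the PATH.
if command -v condor_status >/dev/null 2>&1; then
    condor_status -autoformat OpSys Arch | count_platforms
fi
```

To count distinct machines instead of slots, a similar pipeline over the Machine attribute (for example, condor_status -autoformat Machine | sort -u | wc -l) should work, again assuming -autoformat is available.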

    Create a Sample Job

    Ordinarily, you would use HTCondor to run long-running programs, such as a simulation or data analysis code. For testing purposes, let's create a little script that displays the current time and host, and then sleeps for five seconds. Use your text editor to create a file called test.sh and put the following commands in it:

    #!/bin/sh
    date
    uname -n
    sleep 5
    

    Then, make the program executable with the chmod command, and test to see that it works as expected:

    chmod 755 test.sh
    ./test.sh
    

    Submit One Job

    To submit a batch job to Condor, you must create a submission file and then run the condor_submit command. Using your favorite text editor, create a new file called test.submit and enter the following text into it:

    universe = vanilla
    executable = test.sh
    arguments = hello
    output = test.output
    error = test.error
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    log = test.logfile
    queue 
    
    Now, to submit the job to Condor, execute:
    condor_submit test.submit
    Submitting job(s)...
    1 job(s) submitted to cluster 2.
    
    Once the job is submitted, you can use condor_q to look at the status of the jobs in your queue. If you run condor_q quickly enough, you will see your job idle:
    condor_q
    -- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
       2.0   dthain          8/26 17:21   0+00:00:00 I  0   0.0  ./test.sh
    
    And then running:
    condor_q
    -- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
       2.0   dthain          8/26 17:21   0+00:00:00 R  0   0.0  ./test.sh
    
    And then gone when it is complete:
    condor_q
    -- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
    
    If you decide to cancel a job, use condor_rm and the job id:
    condor_rm 2.0
    Job 2.0 marked for removal.
    
    Now use what you know to answer these questions:
    1. Look at the output of your job test.output. On what machine did the job actually run?
    2. Look at the log file test.logfile. What was the turnaround time of the job, from submission to completion?
    3. Use the condor_history command to see what jobs have previously run.
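If you would rather compute the turnaround time than read it off by eye, the following is a minimal sketch. It assumes the usual user-log layout, in which each event starts with a three-digit code (000 for submission, 005 for termination), followed by the job id, the date, and the time of day as the fourth field. It also assumes the whole run falls within one calendar day. Check these assumptions against your own test.logfile; the function name turnaround_seconds is made up for this example.

```shell
#!/bin/sh
# Print the seconds between the first "000" (submitted) event and the
# last "005" (terminated) event in an HTCondor user log.
# Assumes HH:MM:SS is the fourth field of each event header line and
# that submission and completion happen on the same day.
turnaround_seconds() {
    awk '
        function secs(t,    p) {
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        $1 == "000" && !started { start = secs($4); started = 1 }
        $1 == "005"             { end_t = secs($4); finished = 1 }
        END { if (started && finished) print end_t - start }
    ' "$1"
}

# Run against the tutorial log file if it exists in this directory.
if [ -f test.logfile ]; then
    turnaround_seconds test.logfile
fi
```

Because it uses the first submission and the last termination, the same function also gives the overall time for a log that records many jobs.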

    Submitting Many Jobs

    Because you will certainly want to run many jobs at once via Condor, you can easily modify your submit file to run a program with tens or hundreds of variations. Modify test.submit, changing the final queue command to queue ten jobs at once, and add the $(PROCESS) macro to the arguments and file names so that each job has a distinct output file.

    universe = vanilla
    executable = test.sh
    arguments = $(PROCESS)
    output = test.output.$(PROCESS)
    error = test.error.$(PROCESS)
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    log = test.logfile
    queue 10
    
    Now, when you run condor_submit, you should see something like this:
    condor_submit test.submit
    Submitting job(s)..........
    10 job(s) submitted to cluster 9.
    
    Note in this case that "cluster" means "a bunch of jobs", where each job is named 9.0, 9.1, 9.2, and so forth. In this next example, condor_q shows that cluster 9 is halfway complete, with job 9.5 currently running.
    condor_q
    
    -- Submitter: hedwig.cse.nd.edu : <129.74.154.241:33593> : hedwig.cse.nd.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
       9.5   dthain          8/26 17:46   0+00:00:01 R  0   0.0  ./test.sh 5 
       9.6   dthain          8/26 17:46   0+00:00:00 I  0   0.0  ./test.sh 6
       9.7   dthain          8/26 17:46   0+00:00:00 I  0   0.0  ./test.sh 7
       9.8   dthain          8/26 17:46   0+00:00:00 I  0   0.0  ./test.sh 8
       9.9   dthain          8/26 17:46   0+00:00:00 I  0   0.0  ./test.sh 9
    
    Now use what you know to answer these questions:
    1. After the jobs complete, examine your working directory. How many output and error files were created? Are they different?
    2. How long did it take all the jobs to complete, from the first submission to the last completion?
    3. Did your jobs all run on the same machine, or different machines?
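The first and third questions can also be checked with a short script. This sketch assumes the file naming from the submit file above (test.output.0 through test.output.9) and that each output file holds the date on its first line and the host name on its second, as test.sh prints them. The function names are made up for this example.

```shell
#!/bin/sh
# List the expected per-job files PREFIX.0 .. PREFIX.(COUNT-1) that are
# missing or empty in the current directory.
missing_outputs() {
    prefix=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        [ -s "$prefix.$i" ] || echo "$prefix.$i"
        i=$((i + 1))
    done
}

# Count how many jobs ran on each host, taking the host name from the
# second line of every PREFIX.* file. Output lines look like "COUNT HOST".
hosts_used() {
    for f in "$1".*; do
        [ -f "$f" ] && sed -n '2p' "$f"
    done | sort | uniq -c | awk '{ print $1, $2 }'
}

missing_outputs test.output 10
hosts_used test.output
```

If missing_outputs prints nothing, all ten output files were written. The -s test treats an empty file as missing, which suits the output files here but not the error files, since those are normally empty for a successful job.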

    Scaling Up

    Once you have successfully run ten short jobs, try going a little bigger. Modify test.sh so that it runs for thirty seconds, modify test.submit to submit 100 instances of that job, and then submit them with condor_submit.

    As you run a larger number of jobs, it becomes somewhat more challenging to keep track of everything and understand what's happening. Use condor_q several times to see which jobs are running. Try using condor_q -run to see exactly what machines they run on. Watch the file test.logfile, which records each execution event:

    tail -f test.logfile
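For a quick tally while the jobs are still in flight, the log can be summarized by event type. This is only a sketch: it assumes that event header lines begin with a three-digit code followed by the job id in parentheses, and that codes 000, 001, and 005 mark submission, start of execution, and termination. The function name summarize_log is made up for this example.

```shell
#!/bin/sh
# Count submit (000), execute (001), and terminate (005) events in an
# HTCondor user log and print them on one line.
# Assumes header lines look like "NNN (cluster.proc.subproc) ...".
summarize_log() {
    awk '
        $1 ~ /^[0-9][0-9][0-9]$/ && $2 ~ /^\(/ { n[$1 ""]++ }
        END {
            printf "submitted=%d started=%d terminated=%d\n",
                   n["000"], n["001"], n["005"]
        }
    ' "$1"
}

# Run against the tutorial log file if it exists in this directory.
if [ -f test.logfile ]; then
    summarize_log test.logfile
fi
```

The difference between started and terminated gives only a rough idea of how many jobs are running, since a job that is evicted and restarted produces more than one start event.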

    Of course, reading everything in that logfile manually would be quite tedious. Instead, we have developed some tools that make it easier to see the big picture of what happened. When all (or most) of your jobs are complete, upload the logfile to the Condor Log Analyzer website. It will give you a timeline of the number of jobs running over time.

    The timeline typically shows a common pattern, in which the number of jobs running jumps up quite quickly, stays (relatively) stable for some time, and then decays gradually until the final job is complete. We will discuss this phenomenon in more detail in Lecture 3 of the DISC course.

    Homework

    For More Information

    HTCondor is a powerful system with many more options and capabilities. Check out the HTCondor manual and web page to learn more.