Project 4B: Paracheck

Note: Project 4 has been split into two smaller "puzzles", one due before spring break, the other due after. Together, they will have the weight of one "project".

Project Goals

In this project, you will:
  1. gain more experience with concurrent data structures
  2. reason about problems in concurrency and synchronization
  3. integrate techniques of file I/O, synchronization, and parallel performance

Objective

Create a program that computes the checksum of a large number of files concurrently. The program should be invoked as follows:
./paracheck <dir> <nthreads>
where dir is the name of a directory containing some files to be checksummed, and nthreads is the number of threads that should perform the checksum. The output of the program should give the name of each file and its MD5 checksum. For example, paracheck /usr/bin 1 should display, in part:
...
def411abb41ab7874a73c32ba770b188  ac
68e7732c09889c1ecae5f819327d2c3a  aclocal
68e7732c09889c1ecae5f819327d2c3a  aclocal-1.11
82afa70b9112735b41f79945eae4016e  aconnect
4550fedc17f7f1cca623cf774e1b6749  acpi_listen
a2afa7e5799fe3fde6ea65a8944ea17d  acroread
...
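The output format above (32 hex digits, two spaces, file name) matches what md5sum prints, which makes it easy to compare the two. As a minimal sketch of the formatting step only, assuming the checksum library hands back a 16-byte digest, something like the following would work. The names DIGEST_LEN, digest_to_hex, and print_result are hypothetical and are not part of any library.

```c
#include <stdio.h>

#define DIGEST_LEN 16   /* an MD5 digest is 16 bytes; the macro name is hypothetical */

/* Convert a raw digest into a NUL-terminated lowercase hex string.
 * The caller supplies a buffer of at least 2 * DIGEST_LEN + 1 bytes. */
void digest_to_hex(const unsigned char digest[DIGEST_LEN],
                   char hex[2 * DIGEST_LEN + 1])
{
    for (int i = 0; i < DIGEST_LEN; i++) {
        /* Each byte becomes exactly two hex digits; snprintf adds the NUL. */
        snprintf(hex + 2 * i, 3, "%02x", digest[i]);
    }
}

/* Print one result line in the same layout as md5sum: hash, two spaces, name. */
void print_result(const unsigned char digest[DIGEST_LEN], const char *name)
{
    char hex[2 * DIGEST_LEN + 1];
    digest_to_hex(digest, hex);
    printf("%s  %s\n", hex, name);
}
```

How the digest itself is computed depends on the library you choose, so that part is deliberately left out here.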
The overall architecture of this program should use a monitor (mutex + condition variable + data structures) that is accessed by a main thread and N worker threads, where N is the nthreads argument. The main thread should generate the list of file names and add each one to the monitor. Each worker thread should take a name out of the monitor, read the file and compute its checksum, and then put the result back into the monitor. Finally, the main thread should display the results as they become available.
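To make the monitor idea concrete, here is a minimal sketch of one half of it: a bounded queue of file names protected by a mutex and a single condition variable. It is one possible shape under stated assumptions, not the required design. The names name_queue, nq_init, nq_put, nq_close, nq_take, and QUEUE_CAP are all hypothetical, the results path back to the main thread is omitted, and the return values of the pthread calls are not checked here even though your program should check them.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 64    /* hypothetical capacity, chosen arbitrarily */

struct name_queue {
    pthread_mutex_t lock;              /* protects every field below */
    pthread_cond_t  changed;           /* broadcast on any state change */
    const char     *names[QUEUE_CAP];  /* circular buffer of borrowed pointers */
    size_t          head, count;
    bool            closed;            /* true once no more names will arrive */
};

void nq_init(struct name_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->changed, NULL);
    q->head = q->count = 0;
    q->closed = false;
}

/* Called by the main thread: blocks while the queue is full. */
void nq_put(struct name_queue *q, const char *name)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)      /* loop guards against spurious wakeups */
        pthread_cond_wait(&q->changed, &q->lock);
    q->names[(q->head + q->count) % QUEUE_CAP] = name;
    q->count++;
    pthread_cond_broadcast(&q->changed);
    pthread_mutex_unlock(&q->lock);
}

/* Called by the main thread after the last name has been added. */
void nq_close(struct name_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = true;
    pthread_cond_broadcast(&q->changed);   /* wake workers so they can exit */
    pthread_mutex_unlock(&q->lock);
}

/* Called by workers: returns the next name, or NULL when the queue is
 * both empty and closed. */
const char *nq_take(struct name_queue *q)
{
    const char *name = NULL;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->changed, &q->lock);
    if (q->count > 0) {
        name = q->names[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_broadcast(&q->changed);   /* a slot is now free */
    }
    pthread_mutex_unlock(&q->lock);
    return name;
}
```

A single condition variable with broadcast is the simplest thing that is correct, because every waiter rechecks its own condition in a loop. Separate "not full" and "not empty" condition variables would wake fewer threads, at the cost of a little more bookkeeping.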

Hint: There are a variety of C libraries available on the student machines that can be used to compute MD5 checksums: openssl/md5.h and mhash.h are good choices, and there may be others available. You are welcome and encouraged to read man pages, search coding sites, etc., in order to figure out how to use these libraries. However, the outcome of that process should be understanding, not a cut-and-paste of mystery code. You must write, and be able to explain in detail, the behavior of every line of code in your program.

Once your output is correct (you can verify it with the md5sum command), conduct a small performance study: measure the time needed to checksum a large directory (such as /usr/bin) with a varying number of threads. Can you explain the observed performance?
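For the measurements, the shell's time command is enough, but timing inside the program is also an option. The sketch below assumes a POSIX system that provides clock_gettime with CLOCK_MONOTONIC. The helper name elapsed_seconds is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under strict -std modes */
#include <time.h>

/* Difference between two timestamps, in seconds.
 * The nanosecond term may be negative; the arithmetic still comes out right. */
double elapsed_seconds(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* Typical use around the section being measured:
 *
 *     struct timespec t0, t1;
 *     clock_gettime(CLOCK_MONOTONIC, &t0);
 *     ... checksum all files ...
 *     clock_gettime(CLOCK_MONOTONIC, &t1);
 *     fprintf(stderr, "%.3f s\n", elapsed_seconds(&t0, &t1));
 */
```

CLOCK_MONOTONIC measures wall-clock time and is not affected by changes to the system clock. Wall-clock time is the quantity of interest here, since CPU time summed across threads would hide any speedup. Keep in mind that a second run over the same directory may be faster simply because the files are already cached in memory.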

Handing In

This project is due on Thursday, March 8th at 11:59pm. Late assignments are not accepted.

Turn in all of your C source code, a Makefile, and a README to your dropbox. Please review the general instructions for assignments. The README should be plain text and should contain your table of results and a brief explanation of the observed performance.

Your grade will be based on the following considerations:

  • Correctness - Is the checksum of every file computed correctly?
  • Safety - Do multiple threads interact safely without crashes or races?
  • Performance - Do multiple threads result in an appropriate degree of speedup?
  • Error Handling - Are all errors from the OS handled with a sensible message or action?
  • Programming Style - Is code clearly structured, named, indented, and commented?