Course Project for Distributed Systems

The final project in this course will be open ended. You will propose, carry out, and report upon a project singly or in groups of two students. The project should be about twice the size or difficulty of one of the assigned class projects. Your project must involve each of the following three elements:
  • Build. Your project must involve building a system of some kind. You are encouraged to make use of existing systems and software packages, particularly those used in class. However, your project must involve coding of some kind, in whatever language is suited for the task.
  • Evaluate. You must evaluate what you have built for both functional correctness and quantitative performance. For correctness, you must develop a testing procedure that shows that your system operates correctly under all expected conditions. For performance, you must define an appropriate metric -- latency, bandwidth, throughput -- and measure it across a range of configurations.
  • Communicate. You must present your work cogently by writing a paper and making an oral presentation. The paper should describe the motivation, architecture, technical details, evaluation methods, and quantitative results of your project. The oral presentation, given during the last week of class, should summarize the most important aspects of the paper and include a demo of the system.

More information on the requirements of each element will be forthcoming.
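
As an illustration of the Evaluate requirement, the following is a minimal sketch of a measurement harness, not a required design. The function names `measure` and `speedup` are hypothetical, and the sketch assumes that the operation under test can be wrapped in a Python callable and that a one-worker baseline run is available.

```python
import statistics
import time


def measure(operation, count):
    """Call operation() count times; report throughput and latency."""
    latencies = []
    start = time.perf_counter()
    for _ in range(count):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "count": count,
        "throughput": count / elapsed,               # operations per second
        "mean_latency": statistics.mean(latencies),  # seconds per operation
        "max_latency": max(latencies),
    }


def speedup(runtimes):
    """Map {workers: seconds} to {workers: (speedup, efficiency)}.

    Speedup is measured relative to the one-worker run, so the
    dictionary must contain an entry for 1 worker.
    """
    base = runtimes[1]
    return {n: (base / t, base / t / n) for n, t in sorted(runtimes.items())}
```

Calling `measure` once per configuration (for example, per file size or per client count) yields one row of a results table, and `speedup` turns a set of measured runtimes into the points of a speedup graph.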

    If your project will make use of Amazon's EC2 service, you can make use of an academic grant from Amazon. Each member of the class can receive a $100 credit in Amazon Web Services, to be used to run virtual machines and other services for the class. You will have to register with Amazon, enter your own credit card, and then enter a special credit code. Talk to Prof. Thain to receive your credit code.

    Project Ideas

    The following are rough ideas for possible projects. You are strongly encouraged to make use of software and systems employed in the required projects. Part of your job will be to flesh out the details of the project before you begin work. You may do a project that is not on this list, but discuss it with Prof. Thain first.

  • Convert a Real Application to Work Queue. Take a real application that you use in your research, classes, or for fun, convert it to parallel form, and get it to run on as many processors as possible. Be sure to carefully measure the performance across a varying number of processors to produce a good speedup graph.

  • On Demand Allocation in Work Queue. As written, the Work Queue simply accepts whatever workers are started by external means. Modify the system to start and stop workers as needed, based on the number of jobs waiting in the queue. (You may need to add some hysteresis to prevent oscillations.) For a given application, carefully quantify the tradeoff in performance vs resources consumed using pre-allocation vs on-demand allocation.

  • Distributed Storage Comparison. Do a head-to-head performance comparison of three storage systems - Chirp, Hadoop, and Amazon S3. Scale up from one client at a time, to multiple clients going at once. (Use Condor to get multiple clients going at once.) Compare the throughput of large files, as well as the latency of small operations.

  • Elastic Condor Pool. When multiple users are active, the Condor pool can fill up, resulting in significant delays. Build a system that monitors how busy the Condor pool is. When load is sufficiently high, allocate new machines from EC2, and start Condor running to augment the pool. Measure the performance, cost, and responsiveness of the system under load.

  • Virtual Machines on Condor. In many settings, it is necessary to precisely reproduce an execution environment: a particular program may only be compatible with a certain operating system version, or a particular application may require root privileges. To make it easy to reproduce a given environment, design a virtual machine facility that allows the user to simply issue something like "runvm redhat4 emacs", causing a virtual machine of the given type to be created, submitted to Condor to allocate a CPU, and then connected to the submitting user's display via VNC. (We suggest using the QEMU virtual machine, which can be installed in your home directory.) Measure the performance of several applications in the virtual machines, and figure out how to manage lots of VM images at the same time.

  • Hadoop on Condor (or Condor on Hadoop?). Hadoop has a high data processing capacity, but does not have a very rich scheduling policy. Come up with a method of using Condor to run jobs on Hadoop. (Or Hadoop jobs on Condor if you like.) Find a way to express the same workload in both methods, and then compare the performance.

  • Peer to Peer Preservation. Design a peer-to-peer preservation system. Suppose that a user wishes to preserve a document forever. If the user delivers it to any one node in the system, then that node should take active steps to communicate with other known peers to make further copies of the document. Be careful to ensure that the system can recover from the loss of multiple nodes, but does not cause continuous unnecessary network traffic. You may borrow ideas from other systems such as Gnutella, Freenet, or Chord, provided that you implement the system yourself. Start by reviewing section 4.5.2 in the textbook.

  • Chained Message Queueing System. Build a chained message queueing system like that described in section 4.3. It should consist of a message forwarding process that reads files from a directory and then delivers them to an identical process running on another machine, which stores them in a directory where they can be read by another forwarding process. Make sure that your system can handle expected failure modes such as server crashes, network outages, and full disks. Evaluate the throughput of your system on a large amount of data.

  • Scaling up the Work Queue. Currently, the Work Queue is limited to about 1000 clients per master, usually limited by the number of TCP connections that can be active at any one time. Modify the Work Queue to make it more scalable. You could change the master-worker protocol so that the worker disconnects while working on a work unit. Or, you could make the system hierarchical, with masters talking to sub-masters, and then to workers. There are many ways to proceed, but be sure to preserve the fault-tolerance and other aspects of the system.

  • Visualize Condor. Design a rich visualization tool for Condor, perhaps similar to the Chirp visualization. To be useful, it should display all sorts of information, such as the resources on each node, the jobs running, the owners of the jobs, and group or re-arrange the display based on the relationship between schedds and startds.
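
The hysteresis mentioned in the On Demand Allocation idea can be sketched as follows. This is a minimal illustration under stated assumptions, not part of Work Queue: the function name `next_worker_count`, the thresholds `low` and `high`, and the doubling and halving policy are all hypothetical choices.

```python
def next_worker_count(waiting_jobs, workers, low=1.0, high=4.0,
                      min_workers=1, max_workers=64):
    """Return the target number of workers for the next interval.

    The pool grows only when waiting jobs per worker exceed `high`
    and shrinks only when they fall below `low`. Between the two
    thresholds the pool is left alone; that dead band is the
    hysteresis that prevents oscillation.
    """
    load = waiting_jobs / max(workers, 1)
    if load > high:
        target = workers * 2      # assumed policy: double under heavy load
    elif load < low:
        target = workers // 2     # assumed policy: halve when mostly idle
    else:
        target = workers          # inside the dead band: no change
    return max(min_workers, min(max_workers, target))
```

With the default thresholds, 4 workers and 100 waiting jobs grow to 8 workers, 10 waiting jobs leave the pool at 4, and 2 waiting jobs shrink it to 2. Widening the gap between `low` and `high` trades responsiveness for stability, which is one of the tradeoffs such a project would quantify.
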

    Milestones

  • Tuesday, February 23rd - Turn in a project proposal of one to two pages that describes the project that you intend to do, what languages and resources will be necessary to carry it out, and how you intend to evaluate the work. The instructor will follow up with you to make sure that the project is of appropriate size and difficulty.

  • Week of 29 March - Meet with the instructor to give a demo of what you have so far. The project does not have to be complete, but you must be able to show the main components working, and have a plan for finishing by the end of the semester.

  • Tuesday, 27 April - Give a fifteen minute in-class presentation on your project. (Time may vary once the number of projects is known.) The talk should include an overview of the goal or problem, a detailed description of the structure of your system, an example of how your system operates, and an evaluation of the correctness and performance of your system. Your talk should be accompanied by 5-10 carefully designed and edited PowerPoint slides.

  • Wednesday, May 5th, noon - Turn in your code and the final paper. The code should be structured such that the instructor can build and execute it independently. The paper should give an overview of the goal or the problem, a detailed description of the structure of your system, including a good diagram where appropriate, and an evaluation of the correctness and performance of the system. There is no specific length requirement; the paper should be long enough to explain all of the necessary details. That said, anything less than five pages is probably too short; anything longer than fifteen pages is probably too long. Deposit your code and your paper in PDF or DOC format in your dropbox directory.