CSE 66771 - Foundations of Distributed Systems

Prof. Douglas Thain
Email: dthain at nd dot edu
Office: 382 Fitzpatrick
Summer 2014 Session MWF 9:30AM DeBartolo 334

Overview

This course explores the foundations of distributed systems through a series of classic papers selected from the research literature. Topics include time, synchronization, consensus, consistency, fault tolerance, and security. This course serves as a foundation for advanced graduate study in fields such as mobile computing, cloud computing, networking, and large-scale system design.

Time and Location

Because the course is held during the compressed summer session (Monday, June 16th - Friday, July 25th), we will meet 9-10AM MTWRF in a conference room, location TBA. Each class session will involve discussion of the assigned readings for the day. Active participation by all students is required. Students will be expected to read a

Grading

  • 25% Discussion
  • 25% Paper Summaries
  • 25% Midterm Exam
  • 25% Final Exam

Topics and Readings

    Time

    1. Leslie Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, Communications of the ACM 21:7, 1978. cite
    2. K. Mani Chandy and Leslie Lamport, Distributed Snapshots: Determining Global States of Distributed Systems, ACM Transactions on Computer Systems, 3:1, 1985. cite
    3. D. Jefferson, B. Beckman, F. Wieland, L. Blume, M. Diloreto, Distributed Simulation and the Time Warp Operating System, ACM Symposium on Operating Systems Principles. cite
    4. D. L. Mills, Internet Time Synchronization: The Network Time Protocol, IEEE Transactions on Communications, 39:10. cite
    5. Barbara Liskov, Practical Uses of Synchronized Clocks, Distributed Computing 6:4, 1993. cite

    Consensus

    1. Ricart and Agrawala, An Optimal Algorithm for Mutual Exclusion in Computer Networks, Operating Systems Review, 1981 pdf
    2. H. Garcia-Molina, Elections in a Distributed Computing System, IEEE Transactions on Computers 31:1, 1982. cite
    3. K. Mani Chandy, Jayadev Misra, Laura Haas, Distributed Deadlock Detection, ACM Transactions on Computer Systems 1:2, 1983. cite
    4. Leslie Lamport, Robert Shostak, Marshall Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems 4:3, 1982. cite
    5. Miguel Castro and Barbara Liskov, Practical Byzantine Fault Tolerance and proactive recovery, ACM Transactions on Computer Systems 20:4, 2002. cite
    6. Kenneth Birman, Andre Schiper, and Pat Stephenson, Lightweight Causal and Atomic Group Multicast, ACM Transactions on Computer Systems, 9:3, 1991. cite

    Robustness and Correctness

    1. Butler Lampson and Howard Sturgis, Crash Recovery in a Distributed Data Storage System, Technical Report, Xerox PARC, 1979. pdf
    2. Richard Schlichting and Fred Schneider, Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems, ACM Transactions on Computer Systems 1:3, 1983. cite
    3. P. J. Leu, Concurrent Robust Checkpointing and Recovery in Distributed Systems, International Conference on Data Engineering, 1988. cite
    4. Gerard J. Holzmann, "The Model Checker SPIN", IEEE Transactions on Software Engineering, 23:5, 1997. cite pdf

    Consistency

    1. Susan Davidson, Hector Garcia-Molina, Dale Skeen, Consistency in a Partitioned Network: A Survey, ACM Computing Surveys, 17:3, 1985. cite
    2. C. Gray and D. Cheriton, Leases: An Efficient Fault Tolerant Mechanism for Distributed File Cache Consistency, ACM Symposium on Operating Systems Principles, 1989 cite
    3. Ladin, Liskov, Shrira, and Ghemawat, Providing High Availability with Lazy Replication, ACM TOCS 10:4, 1992. cite
    4. Petersen, Spreitzer, Terry, Theimer, and Demers, Flexible Update Propagation for Weakly Consistent Replication, ACM Symposium on Operating Systems Principles, 1997. cite
    5. Yasushi Saito and Marc Shapiro, Optimistic Replication, ACM Computing Surveys, 37:1, 2005. cite
    6. van Renesse and Schneider, Chain Replication for Supporting High Throughput and Availability, USENIX Symposium on Operating Systems Design and Implementation, 2004. cite
    7. Werner Vogels, Eventually Consistent, Communications of the ACM 52:1, 2009. cite

    Trust

    1. Saltzer and Schroeder, The Protection of Information in Computer Systems, Communications of the ACM 17:7, July 1974. cite
    2. Needham and Schroeder, Using Encryption for Authentication in Large Networks of Computers, Communications of the ACM, 1978. cite
    3. Steiner, Neuman, and Schiller, Kerberos: An authentication service for open network systems, USENIX Winter Conference, 1988. pdf
    4. Satoshi Nakamoto, Bitcoin: A peer-to-peer electronic cash system, pdf
    5. Petros Maniatis, Mema Roussopoulos, T. J. Giuli, David S. H. Rosenthal, and Mary Baker, The LOCKSS peer-to-peer digital preservation system, ACM Transactions on Computer Systems, 23:1, 2005 cite

    Advice

    1. Waldo, Wyant, Wollrath, and Kendall, A Note on Distributed Computing, Sun Microsystems Technical Report, November 1994. pdf
    2. Saltzer, Reed, and Clark, End-to-End Arguments in System Design, ACM Transactions on Computer Systems, 1984. cite
    3. Eric A. Brewer, Lessons from Giant-Scale Services, IEEE Internet Computing. Vol. 5, No. 4. pp. 46-55. July/August 2001. cite pdf