Discussion Questions

Make sure that you understand each paper in broad strokes by addressing the following questions:

Problem. What is the fundamental problem the paper is trying to solve?

Solution. What is the basic approach to solving the problem, temporarily leaving out the ugly details?

Complications. The solution probably isn't trivial, so briefly summarize what complications must be addressed.

Evaluation. Was the solution evaluated via a working system, via a simulation, or via a mathematical model? What exactly was measured, and what was it compared against?

Conclusions. At the end of the day, what was proven? Was the idea a complete success, or does it only work in particular situations?

Then, for each group of 3-4 papers, consider the following discussion questions in your reading groups:

Storage Systems (RAID / DiffRAID)

Describe each of the RAID levels 0-6. For each one, be able to sketch how a block of data (and its parity) is divided among multiple disks.
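
As a starting point for the parity sketches, here is a minimal illustration of single-parity striping in the style of RAID 4/5. It is only a sketch under assumptions: the function names (xor_blocks, make_stripe, reconstruct) and the tiny 4-byte blocks are invented for illustration, and parity placement and rotation across disks are ignored.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block (hypothetical layout)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def reconstruct(stripe, lost_index):
    """Rebuild the block at lost_index by XORing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

# Three data "disks" plus one parity "disk"; then disk 1 is lost and rebuilt.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
rebuilt = reconstruct(stripe, 1)
```

Because XOR is its own inverse, the same routine rebuilds either a lost data block or the lost parity block. RAID 6 needs a second, independent code and is not covered by this sketch.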

What are the three types of failure contemplated by the paper, and how do they arise?

Does each level of RAID address each kind of failure to the same degree?

Think about recovery after a failure. For a given kind of failure, what steps are needed to continue operating? What steps are needed for complete recovery?

The example disk detailed in the paper is quite old and slow by today's standards. Given that modern disks are much faster and larger in many dimensions, is RAID still relevant?

What are the essential properties of solid state disks, and how do they change the design and implementation of RAID systems?

Local File Systems (FFS, LFS, Rethink, ZFS)

For each file system, be able to sketch the on-disk data structures, such as inodes, data blocks, and so forth. Invent a few application behaviors (such as create a file, write 128MB, delete the file) and sketch out the state of the disk after each operation, to be sure that you understand.

What is the essential performance property of a magnetic disk, and how does each filesystem attempt to address that problem through organization or optimization?

Consider the possibility that the power may fail while a filesystem is in the middle of writing its internal data structures. How does each filesystem deal with this problem? What is the worst possible time for a power failure to occur, and how much data will be lost or corrupted?

The rename system call is discussed at some length in the FFS paper. What are the atomicity guarantees of rename? Sketch out how rename should be implemented in FFS, LFS, and ZFS.

Suppose that a user wants to modify files A and B atomically. (Meaning: In the event of a power failure, either both A and B have changed, or neither has changed.) Is this possible with FFS, LFS, or ZFS? Why?

Define external synchrony, and explain how you could implement it in the context of FFS, LFS, and ZFS.

Distributed File Systems (NFS, AFS, Ceph)

For each distributed file system, be able to sketch the relationship between clients and server(s) in the system. Explain what kinds of messages (open, close, read, write, get, put, lookup, etc.) flow between which processes in the system.

For each system, explain the consistency model: when are changes written to stable storage, and when do they become visible to other clients?

For each system, explain what effect the consistency model has on the performance or scalability of the system.

For each system, consider what happens when the network, the client, or the server fails. Can the client continue to operate? Will data be lost? At what point will a failure be exposed to the client?

Carefully consider the title of the AFS paper. Does AFS offer better performance than NFS? Better scalability? What overall conclusions can we draw about scalability and performance in general?

Although separated by many years, both AFS and Ceph are designed in response to basically the same technical problem, outlined at the beginning of each paper. What is that problem, and are they using the same approach to solve it?

Memory Management

Discuss the differences between the concept of a segmented memory versus a flat memory model. Explain any possible advantages and disadvantages of each.

Sketch out all of the data structures necessary to implement segmentation and paging in MULTICS. Which parts are particular to one process, and which are shared between processes?

Read and discuss the implementation of MULTICS rings in greater detail. Be able to explain the read, write, and execute brackets, and the gate extension. Write out exactly what logic is necessary to determine if and how a procedure in ring A can call a procedure in ring B.

Explain the working set model for program behavior in descriptive terms. State it again in formal terms.
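
To check a formal statement against a concrete trace, one common discrete reading of the working set is W(t, tau): the set of distinct pages referenced in the last tau references up to and including time t. The sketch below assumes that reading; the function name working_set, the 1-indexed time convention, and the sample trace are illustrative choices, not taken from the paper.

```python
def working_set(refs, t, tau):
    """Pages referenced at virtual times t-tau+1 .. t.

    refs[i] is the page referenced at time i+1 (times are 1-indexed,
    an assumption made for this sketch).
    """
    start = max(0, t - tau)
    return set(refs[start:t])

# Hypothetical reference string: page numbers in order of reference.
trace = [1, 2, 1, 3, 4, 4, 4]
recent = working_set(trace, 7, 3)   # window covers times 5..7
wider = working_set(trace, 5, 4)    # window covers times 2..5
```

Varying tau on the same trace shows how the working set size grows with the window and shrinks during phases of strong locality.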

Explain why it is impractical to implement the pure LRU or pure WS page replacement algorithms.

State the CLOCK and WS-CLOCK algorithms. Under what conditions do they make good decisions? Under what conditions do they make bad decisions?
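
For reference while stating CLOCK, here is a minimal sketch of the basic second-chance sweep. It rests on several assumptions: the class and method names are invented, a newly loaded page has its reference bit set (variants differ on this), and dirty bits and the WS-CLOCK time stamps are left out entirely.

```python
class Clock:
    """Basic CLOCK (second-chance) replacement over a fixed set of frames."""

    def __init__(self, nframes):
        self.frames = [None] * nframes   # page held by each frame
        self.ref = [False] * nframes     # per-frame reference bit
        self.hand = 0                    # next frame the hand examines

    def access(self, page):
        """Touch a page. Return True on a hit, False on a page fault."""
        if page in self.frames:
            self.ref[self.frames.index(page)] = True
            return True
        # Fault: advance the hand, clearing reference bits, until a frame
        # whose bit is already clear is found. Empty frames start clear.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.frames)
        self.frames[self.hand] = page
        self.ref[self.hand] = True       # assumption: set on load
        self.hand = (self.hand + 1) % len(self.frames)
        return False

clock = Clock(3)
hits = [clock.access(p) for p in [1, 2, 3, 4, 2, 5]]
```

Tracing this by hand on a short reference string is a quick way to find cases where the sweep evicts a page that is about to be reused.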

Superpages: Explain the fundamental problem motivating the paper. Why not just make all pages bigger? Explain the conditions under which pages are promoted or demoted in the modified kernel.

Mach: Explain the general concept of a microkernel, and evaluate the possible benefits and drawbacks of such a design. Explain the relationship between tasks, ports, messages, and memory objects. Discuss the 'duality' presented in the paper, and how it can be used to implement portable parallel applications.

Is it possible to implement segments on hardware that only provides a flat memory model?

Is it possible to implement protection rings on hardware that only provides user and supervisor modes?

Compare and contrast Mach and MULTICS. Although they are separated by several decades and have different objectives, they have several important capabilities in common.

Concurrency

Explain the basic semantics of Hoare monitors. In what way are they different from Mesa monitors? Why do the authors claim that Mesa monitors are easier to use?

A Mesa "process" is rather different from what we call a process today. How does it compare to a Unix thread or a Unix process? What about the API?

What is the problem with nested monitors in Mesa, and how should it be solved?

How do exceptions interact with monitors?

Explain the difference between traditional kernel-level threads and traditional user-level threads.

Sketch and explain how scheduler activations work. What is the necessary API between the kernel level and the user level? How are critical sections dealt with?

Explain the difference between a thread-based program and an event-based program. What problems can each model encounter? How would you go about deciding which model to use?

Sketch and explain the SEDA model for constructing a scalable internet service. Read and evaluate Figure 12 very carefully. How would you explain the difference in performance between Apache, Flash, and Haboob? If you were running a large public internet service, which would you pick?

All operating systems have a way of forcibly terminating a process (e.g. kill -9 $pid) but few are able to forcibly kill a thread. (This didn't stop early versions of Java, which had Thread.stop(), but then deprecated it after realizing it was just a bad idea.) Discuss why killing a process is safe (and common), but killing a thread is not.

Virtualization

State the formal model of a third generation computer S=(E,M,P,R). Explain exactly what happens during a trap. State the key property necessary to ensure a virtualizable system.

Sketch out a model system consisting of a conventional operating system running two applications. Trace exactly what happens when an application requests a system call, or when a timer interrupt triggers a context switch.

Repeat the previous question, assuming that the entire system is now contained within a virtual machine.

What is paravirtualization and why is it necessary? How are page tables paravirtualized in Xen?

Explain how an operating system can be used to provide the equivalent services of hardware in order to build a type II virtual machine.

Three major improvements were made to UMLinux. For each, explain the problem observed and the nature of the solution.

Sketch the main components of the VMware "hosted" virtual machine. What is a world switch?

Explain why I/O is highly inefficient in the basic configuration. Discuss the three methods used to improve the I/O performance.

A common theme across all of the virtualization papers is the proper design of interfaces between software components. What general conclusions can you draw about how to design an interface that is easily virtualized?

Clouds, Multicore, and Beyond

Describe the Map-Reduce abstraction in mathematical terms, independent of the implementation.
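
One way to test a mathematical description is against a single-process model with no distribution at all. The sketch below assumes the usual shape of the abstraction, where map takes (k1, v1) to a list of (k2, v2) pairs and reduce takes (k2, list of v2) to a result; the helper names map_reduce, word_mapper, and count_reducer, and the sample inputs, are hypothetical.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Apply mapper to every (key, value) input, group the emitted pairs
    by intermediate key, then apply reducer to each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    return {k2: reducer(k2, values) for k2, values in groups.items()}

def word_mapper(_name, text):
    """Emit (word, 1) for every word in the document text."""
    return [(word, 1) for word in text.split()]

def count_reducer(_word, counts):
    """Sum the counts emitted for one word."""
    return sum(counts)

documents = [("a.txt", "the cat"), ("b.txt", "the dog")]
counts = map_reduce(documents, word_mapper, count_reducer)
```

The grouping step in the middle is the part that becomes the shuffle in a distributed implementation, which is worth keeping in mind for the efficiency question that follows.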

Sketch how Google has implemented Map-Reduce, indicating the location of each process and how it communicates with its peers.

Which stage of Map-Reduce is naturally efficient and why? Which is naturally inefficient? Would this affect how you would design a Map-Reduce program?

Explain what stragglers are, and how they are handled.

Describe the All-Pairs and Wavefront abstractions in mathematical terms, then explain how each is implemented in practice.

Wavefront deals with stragglers in a different way than Map-Reduce. Why?

Explain the properties of a modern multicore system that complicate operating system design.

What is the limitation of a conventional operating system when run on a large multicore computer?

Explain the concept of shares in Corey, and how they address the multicore problem.

Sketch out the data structures related to filesystems and I/O in a conventional operating system. Suggest how these could be modified to use Corey shares. Under what conditions could independent access be achieved?

Explain the term "factored" in fos -- exactly what is being broken up, and why?

Compare and contrast the microkernel aspects of fos with Mach.

Compare and contrast the load-scaling aspects of fos with SEDA.