Project Ideas
The following are rough project ideas for your consideration.
A significant portion of your job will be to crystallize the purpose,
methods, and scope of your specific project.
Students may undertake projects not listed here, but
should consult with the instructor before submitting a proposal.
Storage
Large Scale File Distribution.
Scientific users running codes on large distributed systems
often need to distribute a large dataset to every single
node on which they wish to run a job. To address this problem,
develop and evaluate an algorithm for rapidly and reliably distributing
large files to all nodes of the CCL storage pool.
The mechanism for transferring files -- and directing transfers
between other machines -- is already in place.
The challenge is to decide how to schedule the transfers,
and how to deal with failures and performance variations.
Evaluate the algorithm on the 200-node CCL system.
(Careful: Make sure you also have a plan to clean up whatever
you transfer, so as not to wedge the entire system!)
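As a rough illustration of the kind of scheduling policy you might start from, here is a minimal sketch of a greedy spread strategy in C: every node that has received the file becomes a source for the nodes still waiting, and a failed transfer simply leaves the target pending for a later retry. The transfer_file() routine below is a stub standing in for the transfer mechanism that already exists; all names are illustrative.

    /* Greedy spread scheduler (sketch). transfer_file() is a stub standing in
       for the existing CCL transfer mechanism. */

    #include <stdio.h>
    #include <string.h>

    static int transfer_file(const char *source, const char *target)
    {
        printf("transfer %s -> %s\n", source, target);
        return 0;   /* 0 = success; a real transfer can fail or be slow */
    }

    static void distribute(const char *nodes[], int n, const char *origin)
    {
        int has_file[200] = {0};        /* enough slots for the 200-node pool */
        for (int i = 0; i < n; i++)
            if (strcmp(nodes[i], origin) == 0) has_file[i] = 1;

        int remaining = n - 1;
        while (remaining > 0) {
            int src = 0;
            for (int dst = 0; dst < n; dst++) {
                if (has_file[dst]) continue;
                while (!has_file[src]) src = (src + 1) % n;   /* next available source */
                if (transfer_file(nodes[src], nodes[dst]) == 0) {
                    has_file[dst] = 1;   /* dst can now serve others in this same pass */
                    remaining--;
                }
                src = (src + 1) % n;     /* spread the load over all current sources */
                /* on failure, dst stays pending; a real scheduler should bound retries */
            }
        }
    }

    int main(void)
    {
        const char *nodes[] = { "ccl00", "ccl01", "ccl02", "ccl03", "ccl04" };
        distribute(nodes, 5, "ccl00");
        return 0;
    }

A real scheduler would run the transfers concurrently, bound retries, and account for slow or failed nodes; deciding how is the heart of the project.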
A File System for Lots of Little Files.
Many scientific users employ filesystems in an unusual way:
they create directories containing thousands or millions of little files,
each containing perhaps 10-20 bytes of data. File systems are unusually
bad at storing such data -- each file occupies a minimum of one 4KB disk block.
However, users continue to work this way, because the data is easy to
manipulate with standard commands.
Begin by demonstrating a filesystem workload that could be dramatically
improved. For example, compare the performance of creating one million
ten-byte files versus one million ten-byte records in a single file.
Then, design, implement, and evaluate a filesystem tailored to datasets of many small files.
Use the FUSE Toolkit to easily
build and deploy the system in user space. Compare the performance
of this filesystem to a conventional filesystem on a wide variety of workloads.
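To make the FUSE suggestion concrete, here is a minimal read-only sketch, assuming the FUSE 2.x C API, that presents many tiny files each backed by a fixed-size record. For brevity the records live in memory; the real project would pack them into one large container file on disk. All names and sizes here are illustrative.

    /* tinyfs.c: many tiny read-only files, each backed by a fixed-size record.
       Build (FUSE 2.x assumed), roughly:
       gcc -D_FILE_OFFSET_BITS=64 tinyfs.c -lfuse -o tinyfs */

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/stat.h>

    #define NRECORDS 1000
    #define RECSIZE  20

    static char records[NRECORDS][RECSIZE];      /* record i backs the file "/i" */

    static int recno(const char *path)           /* "/17" -> 17, or -1 if invalid */
    {
        char *end;
        long n = strtol(path + 1, &end, 10);
        if (path[1] == '\0' || *end != '\0' || n < 0 || n >= NRECORDS) return -1;
        return (int)n;
    }

    static int tiny_getattr(const char *path, struct stat *st)
    {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0; }
        if (recno(path) < 0) return -ENOENT;
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = RECSIZE;
        return 0;
    }

    static int tiny_read(const char *path, char *buf, size_t size, off_t off,
                         struct fuse_file_info *fi)
    {
        (void) fi;
        int r = recno(path);
        if (r < 0) return -ENOENT;
        if (off >= RECSIZE) return 0;
        if (off + size > RECSIZE) size = RECSIZE - off;
        memcpy(buf, records[r] + off, size);
        return size;
    }

    static struct fuse_operations tiny_ops = {
        .getattr = tiny_getattr,
        .read    = tiny_read,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &tiny_ops, NULL);
    }

Once mounted (./tinyfs mountpoint), a command like cat mountpoint/42 reads record 42 directly. A readdir handler, write support, and a real on-disk container format are the actual work of the project.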
Porting Dependencies Automatically.
Modern applications are quite difficult to move from computer to computer.
Few programs are built as a single, self-contained executable.
Instead, they depend on all sorts of files, libraries, and other components
that must be "installed" wherever the program runs, perhaps requiring
special privileges to install. This makes distributing programs to new machines a real headache.
Construct a system for solving this problem automatically.
Using the logging facilities of Parrot, record absolutely everything needed by an application as it runs.
Use this log to construct a self-contained package of files
that can be moved to another machine.
Again using Parrot, redirect file accesses to the package,
thus making it appear as if everything needed by the application is available locally.
Compare this approach to other data access schemes, such
as demand-paging using a distributed filesystem.
Note: This project doesn't require much code to be written,
but will require a fair amount of careful analysis and performance measurement.
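As one illustration of the packaging step, the sketch below assumes a hypothetical log format of one absolute pathname per line and copies each named file under a package root, preserving directory structure. Parrot's actual log format, and details such as symbolic links and permissions, are left to the project.

    /* mkpackage.c: turn an access log into a self-contained package (sketch).
       The log format is assumed: one absolute pathname per line. */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void copy_into_package(const char *path, const char *pkgroot)
    {
        char cmd[4096];
        /* cp --parents keeps the directory structure under pkgroot (which must
           exist); a real tool would copy the bytes itself and also record
           symlinks, permissions, and missing files */
        snprintf(cmd, sizeof(cmd), "cp --parents '%s' '%s'", path, pkgroot);
        system(cmd);
    }

    int main(int argc, char *argv[])
    {
        if (argc != 3) { fprintf(stderr, "use: %s logfile pkgroot\n", argv[0]); return 1; }
        FILE *log = fopen(argv[1], "r");
        if (!log) { perror(argv[1]); return 1; }
        char line[4096];
        while (fgets(line, sizeof(line), log)) {
            line[strcspn(line, "\n")] = '\0';          /* strip trailing newline */
            if (line[0] == '/') copy_into_package(line, argv[2]);
        }
        fclose(log);
        return 0;
    }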
Memory
Gigantic Distributed Data Structures.
Many scientific users would like to create programs that employ gigantic data structures.
For example, biometric researchers at Notre Dame would like to manipulate arrays
and matrices of tens of thousands of entries, with each element potentially on the order
of a megabyte. If implemented as an ordinary C data structure in virtual memory,
the structure might not fit on the available disk, or even in the available address space.
However, we can create such data structures by tying together the disk and the memory
available in multiple machines. Construct a library for gigantic arrays and matrices
that uses the CCL storage pool as the backing storage. Make sure that it is possible
to resize, reconfigure, and migrate data structures from place to place. Evaluate
the performance of your data structures on a wide variety of synthetic workloads --
row-major access, column-major access, random access -- and compare to the performance
on a large memory machine. Describe the strengths and the limitations of this approach
to data structures.
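A minimal sketch of what such a library's interface might look like is below. It backs each element with its own file in a directory, which stands in here for the CCL storage pool; the names and layout are hypothetical, and a serious implementation would block elements together, cache recently used ones, and stripe them across machines.

    /* bigarray.c: out-of-core array sketch; one file per element. */

    #include <stdio.h>
    #include <stdlib.h>

    struct big_array {
        const char *dir;      /* backing directory (stand-in for the storage pool) */
        size_t elem_size;     /* bytes per element */
        size_t n;             /* number of elements */
    };

    static void elem_path(struct big_array *a, size_t i, char *buf, size_t len)
    {
        snprintf(buf, len, "%s/%zu", a->dir, i);
    }

    int big_array_put(struct big_array *a, size_t i, const void *elem)
    {
        char path[1024];
        elem_path(a, i, path, sizeof(path));
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        int ok = fwrite(elem, a->elem_size, 1, f) == 1;
        fclose(f);
        return ok ? 0 : -1;
    }

    int big_array_get(struct big_array *a, size_t i, void *elem)
    {
        char path[1024];
        elem_path(a, i, path, sizeof(path));
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        int ok = fread(elem, a->elem_size, 1, f) == 1;
        fclose(f);
        return ok ? 0 : -1;
    }

    int main(void)
    {
        struct big_array a = { "/tmp/bigarray", sizeof(double), 1000000 };
        double x = 3.14, y = 0;            /* requires /tmp/bigarray to exist */
        big_array_put(&a, 12345, &x);
        big_array_get(&a, 12345, &y);
        printf("element 12345 = %f\n", y);
        return 0;
    }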
Multiscale Memory Modeling.
Many advanced computing systems (such as the Cascade project at Notre Dame)
have observed that access to memory is the new system bottleneck, and propose
clever new techniques for organizing CPUs and memories together.
However, the model of how programs access memory has not been updated
significantly since the work on WS that we will read in class.
A good model of memory access is needed in order to design advanced systems.
Construct tools to observe the memory behavior of modern complex programs at runtime.
Measure at two levels: (1) memory allocations using malloc/free and (2) virtual memory
accesses by raw address. To measure (1), build a malloc from source that logs
all mallocs and frees to a file. To measure (2), modify the Bochs virtual machine
to log all memory accesses to a file. Then, run a selection of modern applications
to capture their behavior. Characterize the behavior of each application.
What is the relationship between the layers?
How can we take advantage of these observed patterns?
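One low-effort way to capture the level-(1) data, as an alternative to rebuilding malloc itself from source, is an LD_PRELOAD interposer; a minimal sketch follows, assuming a glibc-style dynamic linker. It forwards to the real allocator found via dlsym(RTLD_NEXT, ...) and logs with write() so the logger never re-enters malloc.

    /* mlog.c: log every malloc and free (sketch).
       Build: gcc -shared -fPIC -o mlog.so mlog.c -ldl
       Run:   LD_PRELOAD=./mlog.so ./application 2>malloc.log */

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <stddef.h>

    static void *(*real_malloc)(size_t) = NULL;
    static void (*real_free)(void *)   = NULL;

    static void log_line(const char *op, void *p, size_t size)
    {
        char buf[128];
        int n = snprintf(buf, sizeof(buf), "%s %p %zu\n", op, p, size);
        write(2, buf, n);       /* stderr; redirect to a file when running */
    }

    void *malloc(size_t size)
    {
        /* a production tool would guard against re-entry during dlsym */
        if (!real_malloc) real_malloc = (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");
        void *p = real_malloc(size);
        log_line("malloc", p, size);
        return p;
    }

    void free(void *p)
    {
        if (!real_free) real_free = (void (*)(void *)) dlsym(RTLD_NEXT, "free");
        log_line("free", p, 0);
        real_free(p);
    }

Interposing calloc, realloc, and the C++ operators, and correlating these records with the raw address trace from Bochs, is where the project begins.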
Are You Linking What I'm Linking?
Dynamic linking allows multiple running programs to share a single copy
of common routines in memory, thus reducing the overall memory usage of a system.
However, dynamic linking also has some drawbacks, including complexity of
installation and management, loss of locality in the filesystem and memory,
loss of runtime performance, and perhaps others that you can think of.
Perform a comparison of static and dynamic linking in the real
world, and discuss the tradeoffs in detail. This comparison can be
performed at several levels:
Construct small synthetic programs that are linked both ways, and compare microbenchmarks such as startup time, function call overhead, and CPU performance. (Dynamic linking consumes an extra register, which might affect performance.) A starting point for this level is sketched after this list.
Build larger standard programs in both versions and compare them. How many system calls, I/O operations, and other kernel activities are required by each version?
Use tools like nm and ar to examine the contents of executables and libraries on a standard Linux system. What is the degree of sharing across the filesystem?
By rebuilding from source packages, construct an entire system distribution both statically and dynamically. How does this affect performance and resource consumption?
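For the microbenchmark level mentioned in the first item above, one possible test program is sketched below; build it twice, once dynamically and once with -static, and compare the results. The -fno-builtin flag keeps gcc from inlining sqrt(), so the calls really go through the library (and through the PLT in the dynamic case).

    /* bench.c: per-call cost of a library routine.
       Dynamic: gcc -fno-builtin bench.c -lm -o bench-dyn
       Static:  gcc -static -fno-builtin bench.c -lm -o bench-static */

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <sys/time.h>

    int main(int argc, char *argv[])
    {
        long iters = argc > 1 ? atol(argv[1]) : 10000000L;
        struct timeval t0, t1;
        double sum = 0.0;

        gettimeofday(&t0, NULL);
        for (long i = 1; i <= iters; i++)
            sum += sqrt((double)i);               /* call into libm */
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%ld calls in %.3f s (sum=%g)\n", iters, secs, sum);
        return 0;
    }

Running each binary with zero iterations (time ./bench-dyn 0) isolates startup and loader cost; running with many iterations isolates per-call cost.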
Virtual Machines
Virtual Machine Cluster for Utility Computing.
Many administrators are interested in the utility computing concept,
which proposes that users with large computing needs should simply
pay to use remote CPUs only when they are needed.
The trick is that everyone requires a different computing environment:
one user wants RedHat 7.2, another wants Debian 6.3, and another wants
OpenBSD with his favorite libraries installed. No service provider
wants to spend the day installing and re-installing machines for
each customer. Instead, one may use virtual machines on an existing
cluster to create the needed computing environment on the fly.
Build a simple utility computing cluster that allows a user to request
a configuration type by name. The system would pick an unused machine
in the cluster, establish a virtual machine, install the software,
and then inform the user of the machine and port to log in to.
When the user is done, a simple message to the system should release the virtual machine.
The problem is, installing software for each virtual machine is very expensive!
How can we maximize the performance of such a system?
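The sketch below shows only the allocation bookkeeping such a manager needs: find an idle host, boot the requested configuration, report where to log in, and release the host afterward. The host names, image naming scheme, and the vm_boot helper are all hypothetical; how to avoid reinstalling software on every request is the real question.

    /* vmpool.c: allocation bookkeeping for a small VM cluster (sketch). */

    #include <stdio.h>
    #include <string.h>

    #define NHOSTS 4

    struct host { const char *name; int busy; };

    static struct host hosts[NHOSTS] = {
        {"cclvm01", 0}, {"cclvm02", 0}, {"cclvm03", 0}, {"cclvm04", 0},
    };

    /* returns the host running the requested configuration, or NULL if none free */
    const char *vm_request(const char *config)
    {
        for (int i = 0; i < NHOSTS; i++) {
            if (hosts[i].busy) continue;
            char cmd[256];
            /* hypothetical helper that boots a prebuilt image for this config */
            snprintf(cmd, sizeof(cmd), "ssh %s vm_boot %s.img", hosts[i].name, config);
            if (system(cmd) == 0) { hosts[i].busy = 1; return hosts[i].name; }
        }
        return NULL;
    }

    void vm_release(const char *name)
    {
        for (int i = 0; i < NHOSTS; i++)
            if (strcmp(hosts[i].name, name) == 0) hosts[i].busy = 0;
    }

    int main(void)
    {
        const char *h = vm_request("rh72");
        if (h) { printf("log in to %s\n", h); vm_release(h); }
        else   printf("no host available\n");
        return 0;
    }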
Security
Detailed Kernel Logging for Auditing.
Despite the best efforts of administrators to prevent
unwanted access, wily hackers have always found ways to
defeat security mechanisms.
An audit log can be used to reconstruct what
happened to a system after a security incident.
However, standard Unix auditing (man acct)
does not provide much information.
Augment a Linux kernel to log a wide variety of information at run time:
programs run, files read or written, network connections initiated or accepted.
Evaluate the performance overhead of your auditing system, and perhaps adjust
the detail up or down as needed. Using a virtual machine, deploy the audited
system, and establish some interesting network services. Recruit your classmates
to log in and perform some benevolent or malicious acts. Develop a tool to
analyze the auditing log and reconstruct what happened.
Report on the performance, effectiveness, and limitations of your design.
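The analysis tool might begin as simply as the sketch below, which assumes a hypothetical one-line-per-event record format of "<uid> <action> <target>" (for example, "1007 exec /bin/sh") and prints, in order, the events attributable to one user.

    /* auditgrep.c: reconstruct one user's activity from an audit log (sketch).
       The record format is assumed; the real format is yours to design. */

    #include <stdio.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        if (argc != 3) { fprintf(stderr, "use: %s logfile uid\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "r");
        if (!f) { perror(argv[1]); return 1; }

        char line[1024];
        while (fgets(line, sizeof(line), f)) {
            char uid[64], action[64], target[512];
            if (sscanf(line, "%63s %63s %511s", uid, action, target) != 3) continue;
            if (strcmp(uid, argv[2]) == 0)
                printf("%s %s\n", action, target);   /* this user's events, in order */
        }
        fclose(f);
        return 0;
    }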
Sandboxing by File System Logging.
The sandboxing technologies discussed in class prevent unwanted access
by simply denying actions that are not permitted. However, sometimes
it is not clear if an action is desirable without looking at all of
a program's actions together. So, construct a file system that writes
all changes to a log instead of writing to the target file system.
Use the FUSE Toolkit
as a starting point for your filesystem.
This log file can then be examined to see the overall effect of a program.
If acceptable, the log can be played to modify the filesystem.
This system could even be used to generate patches: a system change
can be run and logged on one machine, then carried to another to play
the same log. Evaluate the performance overhead of such logging,
and describe how it affects programs with varying I/O patterns.
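A sketch of the playback half is below, assuming a hypothetical log format in which each intercepted write is recorded as a header line "<offset> <length> <path>" followed by <length> raw bytes of data. Playing the log applies each write to the real filesystem; the logging filesystem itself, file creation and deletion records, and conflict handling are the project's real work.

    /* playlog.c: apply a write log to the real filesystem (sketch). */

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) { fprintf(stderr, "use: %s writelog\n", argv[0]); return 1; }
        FILE *log = fopen(argv[1], "rb");
        if (!log) { perror(argv[1]); return 1; }

        long offset, length;
        char path[1024];
        while (fscanf(log, "%ld %ld %1023s", &offset, &length, path) == 3) {
            fgetc(log);                              /* skip newline before the data */
            if (length <= 0) break;
            char *data = malloc(length);
            if (fread(data, 1, length, log) != (size_t)length) { free(data); break; }

            FILE *out = fopen(path, "r+b");          /* apply the logged write for real */
            if (!out) out = fopen(path, "w+b");
            if (out) {
                fseek(out, offset, SEEK_SET);
                fwrite(data, 1, length, out);
                fclose(out);
            }
            free(data);
        }
        fclose(log);
        return 0;
    }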
Processes and Synchronization
Adaptive Parallelism.
High-level languages and systems
allow a user to trivially harness many independent processors and run
many independent tasks in parallel. Unfortunately, such systems also
make it easy for a user to overload the system.
For example, with a simple script, one may easily send or retrieve
a file from one hundred machines in parallel. Transferring one
file at a time will likely not achieve maximum utilization of the network.
At the other extreme, transferring all one hundred at once will likely
result in collisions on the network and also achieve poor performance.
The optimal parallelism lies somewhere in between --
perhaps five or ten transfers at once -- and depends on the exact
resources in use and the load on the system.
Design and implement an algorithm for finding the optimal amount of
parallelism in a system with unknown resource constraints.
Evaluate this algorithm in a variety of settings with different
kinds and amounts of resources.
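One simple starting point is a hill-climbing controller that widens the transfer window while measured throughput keeps improving and stops at the knee of the curve; a sketch follows. The run_batch() helper is a stand-in here (a simulated throughput curve) for actually performing that many transfers at once and measuring the aggregate result.

    /* adapt.c: hill-climbing search for the transfer window (sketch). */

    #include <stdio.h>

    static double run_batch(int width)
    {
        /* simulated MB/s: improves up to about 8 concurrent transfers, then degrades;
           the real version would perform the transfers and measure throughput */
        return width <= 8 ? width * 12.0 : 96.0 - (width - 8) * 5.0;
    }

    static int find_parallelism(int max_width)
    {
        int width = 1;
        double best = run_batch(width);

        while (width < max_width) {
            double t = run_batch(width + 1);
            if (t <= best * 1.05) break;   /* no meaningful gain: stop at the knee */
            width++;
            best = t;
        }
        return width;
    }

    int main(void)
    {
        printf("selected parallelism: %d\n", find_parallelism(100));
        return 0;
    }

A real controller must also react when conditions change mid-run, for example by periodically probing a wider or narrower window.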