Operating Systems Depth Exam

	UNIVERSITY OF WISCONSIN-MADISON Computer Sciences Department
Operating Systems Depth Exam		Spring 2002

Instructions: There are seven questions on this exam; answer any six of the following seven questions. This exam has three (3) pages.

Question 1. Global Snapshots:

Chandy and Lamport's global snapshot algorithm is based on detecting stable properties.

A. What is a stable property and why does the algorithm contain this restriction?

B. What guarantees are given on the termination time of the C&L's snapshot algorithm?

Question 2. Failure Recovery:

A. What does the term "idempotent" mean, in the context of an operation done by a computer?
("idempotent" is used in a similar fashion in mathematics)

B. Give an example of its use in recovery from a failure or crash. Why is it desirable that failure recovery is idempotent?

Question 3: Consistency (The Hobgoblin Of Small Minds)

AFS provides a well-defined cache consistency model. In this question, we explore AFS and discuss ramifications of its design and implementation.

A. Central to AFS consistency is the notion of a callback. Describe what a callback is and why it's important to AFS, when one is established, and two circumstances when a client callback might be "broken" (i.e., the client is notified by the server that the callback is no longer valid).

B. The AFS consistency model is often described as "last writer wins." Describe what this term means, and why it is appropriate to use in describing AFS. In particular, give an example scenario where this issue arises.

C. The interface between AFS clients and servers (in the original AFS system) is a "whole file" interface, and presents a stark contrast to a block-based protocol such as NFS. Discuss why the AFS designers chose this whole-file approach, and then describe its strengths and weaknesses.

Question 4: Mistake In Identity

This question is about mistakes made in the construction of systems. For two of the three of the systems below, you are to name and discuss a single design flaw. Be as detailed as possible. What is the poor result that arises due to the flaw? What would you change in order to circumvent the problem, or is it a fundamental problem? Do later systems address and correct the problem?

Remember, choose only TWO of the three!

A. The Mach Microkernel Operating System.

B. The Berkeley Fast File System.

C. The Pilot Operating System.

Question 5: Networking

Consider the original exponentially weighted, moving averages (EWMA) method for calculating retransmission timeout values with the Karn/Partridge algorithm. Suppose a TCP connection is measuring RTTs of 1.0s and it suddenly jumps to 5.0s. Assume alpha = .875

A. Show what the successive RTT and RTO values would be until RTO is greater than 5.0s.

B. Assuming that we start counting at 1 for packets transferred when the RTT jumps, which packets will be retransmitted due to timeouts?

C. What would the largest RTO value if we continued to transfer packets?

Question 6: Threads

In this question, you will explore different types of support for multi-threaded applications.

A. You have a multi-threaded application that you would like to port to a new operating system. One common option is to use user-level threads; the other option is to use kernel threads. What are the advantages and disadvantages of user-level versus kernel threads? Would knowing any of the characteristics of your multi-threaded application help in deciding between the two?

B. It turns out that you don't have to choose between either type of thread package, because this OS has support for scheduler activations. Unfortunately, there isn't yet a traditional thread library implemented on top of scheduler activations -- this is your job! First, what events must your thread library communicate to the kernel? For each possible event, give a short example of how the event could occur. Second, what events must the kernel communicate to the thread library? For each event, briefly describe how your thread library should react.

Question 7: Disconnected operation

The Coda system provides support for disconnected operation, in which clients are able to continue accessing data during temporary disconnection from (or failure of) the file servers.

A. There are two basic families of approaches for controlling access to replicas in a distributed system that can suffer from network partitions: optimisitic and pessimistic. Briefly describe how both optimistic and pessimistic families of replica control work and the advantages or disadvantages of each approach.

B. The Coda system uses optimistic replica control when it allows clients to access files when disconnected. What data is likely to be in the client's cache when it is first disconnected? Which of this data is guaranteed to be in the cache and why can this differ from that likely to be in the cache?

C. While disconnected, what types of errors can occur on a Coda client that don't occur during normal operation? Why do these errors occur? When reconnecting (i.e., the process of reintegration), what types of errors can occur and why?