active_catalogs.ps.Z "Distributed Active Catalogs and Meta-Data Caching in Descriptive Name Services" by Joann J. Ordille and Barton P. Miller Best paper at the 13th International IEEE Conference on Distributed Computing Systems, Pittsburgh, May 1993, pp. 120-129. Also Technical Report 1118, November 1992. joann@research.att.com, bart@cs.wisc.edu Abstract: Today's global internetworks challenge the ability of name ser- vices and other information services to locate data quickly. We introduce a distributed active catalog and meta-data caching for optimizing queries in this environment. Our active catalog constrains the search space for a query by returning a list of data repositories where the answer to the query is likely to be found. Components of the catalog are distributed indices that isolate queries to parts of the network, and smart algorithms for limiting the search space by using semantic, syntactic, or structural constraints. Meta-data cach- ing improves performance by keeping frequently used characterizations of the search space close to the user, and eliminating active catalog communication and processing costs. When searching for query responses, our techniques contact only the small percentage of the data repositories with actual responses, resulting in search times of a few seconds. Even a request to locate a person with a popular name in the global name space can often be answered in seconds. We imple- mented a distributed active catalog and meta-data caching in a proto- type descriptive name service called Nomenclator. We present perfor- mance results for Nomenclator in a search space of 1000 data reposi- tories. array_distrib.ps.Z Optimizing Array Distributions in Data-Parallel Programs" Krishna Kunchithapadam and Barton P. Miller 7th Annual Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY. August 1994. krishna@cs.wisc.edu, bart@cs.wisc.edu Abstract: Data parallel programs are sensitive to the distribution of data across processor nodes. We formulate the reduction of inter-node communication as an optimization on a colored graph. We present a technique that records the run time inter-node communication caused by the movement of array data between nodes during execution and builds the colored graph, and provide a simple algorithm that optimizes the coloring of this graph to describe new data distributions that would result in less inter-node communication. From the distribution information, we write compiler pragmas to be used in the application program. Using these techniques, we traced the execution of a real data-parallel application (written in CM Fortran) and collected the array access information. We computed new distributions that should provide an overall reduction in program execution time. However, compiler optimizations and poor interfaces between the compiler and runtime systems counteracted any potential benefit from the new data layouts. In this context, we provide a set of recommendations for compiler writers that we think are needed to both write efficient programs and to build the next generation of tools for parallel systems. The techniques that we have developed form the basis for future work in monitoring array access patterns and generate on-the-fly redistributions of arrays. breakpoint.ps.Z "Breakpoints and Halting in Distributed Programs", Barton P. 
Miller and Jong-Deok Choi 8th Int'l Conf on Distributed Computing Systems (1988) bart@cs.wisc.edu, jdchoi@watson.ibm.com Abstract: Interactive debugging requires that the programmer be able to halt a program at interesting points in its execution. This paper presents an algorithm for halting a distributed program in a consistent state, and presents a definition of distributed breakpoints with an algorithm for implementing the detection of these breakpoints. The Halting Algorithm extends Chandy and Lamport's algorithm for recording global state and solves the problem of processes that are not fully connected or frequently communicating. The definition of distributed breakpoints is based on those events that can be detected in a distributed system. Events that can be partially ordered are detectable and form the basis for the breakpoint predicates, and from the breakpoint definition comes the description of an algorithm that can be used in a distributed debugger to detect these breakpoints. characterization.ps.Z "What are Race Conditions? Some Issues and Formalizations" Robert H.B. Netzer and Barton P. Miller ACM Letters on Programming Languages and Systems, March 1992. (Also available as Univ. of Wisconsin-Madison Comp Sci Dept TR#1014) netzer@cs.wisc.edu, bart@cs.wisc.edu Abstract: In shared-memory parallel programs that use explicit synchronization, race conditions result when accesses to shared memory are not properly synchronized. Race conditions are often considered to be manifestations of bugs since their presence can cause the program to behave unexpectedly. Unfortunately, there has been little agreement in the literature as to precisely what constitutes a race condition. Two different notions have been implicitly considered: one pertaining to programs intended to be deterministic (which we call general races) and the other to nondeterministic programs containing critical sections (which we call data races). However, the differences between general races and data races have not yet been recognized. This paper examines these differences by characterizing races using a formal model and exploring their properties. We show that two variations of each type of race exist: feasible general races and data races capture the intuitive notions desired for debugging and apparent races capture less accurate notions implicitly assumed by most dynamic race detection methods. We also show that locating feasible races is an NP-hard problem, implying that only the apparent races, which are approximations to feasible races, can be detected in practice. The complexity of dynamically locating apparent races depends on the type of synchronization used by the program. Apparent races can be exhaustively located efficiently only for weak types of synchronization that are incapable of implementing mutual exclusion. This result has important implications since we argue that debugging general races requires exhaustive race detection and is inherently harder than debugging data races (which requires only partial race detection). Programs containing data races can therefore be efficiently debugged by locating certain easily identifiable races. In contrast, programs containing general races require more complex debugging techniques. cohrs_thesis.ps.Z "A Specification Language for Multi-Domain Network and Distributed Systems Management" David L. Cohrs August 1991. Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #10xx (Ph.D. 
Thesis) cohrs@Legato.COM

complexity.ps.Z
"On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions" Robert H.B. Netzer and Barton P. Miller. Univ. of Wisconsin-Madison Comp Sci Dept Technical Report #908, January 1990. A condensed version of this paper appears in Proc. of the 1990 Int'l Conference on Parallel Processing, St Charles, IL, August 1990. netzer@cs.wisc.edu, bart@cs.wisc.edu
Abstract: This paper presents results on the complexity of computing event orderings for shared-memory parallel program executions. Given a program execution, we formally define the problem of computing orderings that the execution \fImust have\fP exhibited or \fIcould have\fP exhibited, and prove that computing such orderings is an intractable problem. We present a formal model of a shared-memory parallel program execution on a sequentially consistent processor, and discuss event orderings in terms of this model. Programs are considered that use fork/join and either counting semaphores or event style synchronization. We define a \fIfeasible program execution\fP to be an execution of the program that performs the same events as an observed execution, but which may exhibit different orderings among those events. Any program execution exhibiting the same data dependences among the shared data as the observed execution is feasible. We define several relations that capture the orderings present in all (or some) of these feasible program executions. The \fIhappened-before\fP, \fIconcurrent-with\fP, and \fIordered-with\fP relations are defined to show events that execute in a certain order, that execute concurrently, or that execute in either order but not concurrently. Each of these ordering relations is defined in two ways. In the \fImust-have\fP sense they show the orderings that are guaranteed to be present in all feasible program executions, and in the \fIcould-have\fP sense they show the orderings that could potentially occur in at least one feasible program execution due to timing variations. We prove that computing any of the must-have ordering relations is a co-NP-hard problem and that computing any of the could-have ordering relations is an NP-hard problem.

costmodel.ps.Z
"An Adaptive Cost Model for Parallel Program Instrumentation" Jeffrey K. Hollingsworth and Barton P. Miller. Euro-Par '96, Lyon, France, August 1996. hollings@cs.umd.edu, bart@cs.wisc.edu
Abstract: Software based instrumentation is frequently used to measure the performance of parallel and distributed programs. However, using software instrumentation can introduce serious perturbation of the program being measured. In this paper we present a new data collection cost system that provides programmers with feedback about the impact data collection is having on their application. In addition, we introduce a technique that permits programmers to define the perturbation their application can tolerate, and we then regulate the amount of instrumentation to ensure that threshold is not exceeded. We also describe an implementation of the cost model and present results from using it to measure the instrumentation overhead for several real applications.
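
The costmodel.ps.Z entry above describes regulating instrumentation so that its predicted perturbation stays under a user-chosen threshold. The fragment below is only a rough C sketch of that idea, not the paper's cost system: the per-event cost constant, the 5% limit, and the re-check interval are invented for illustration.

    /* Illustrative sketch only: throttle instrumentation against a
     * user-supplied overhead threshold.  Constants are hypothetical. */
    #include <stdio.h>
    #include <sys/time.h>

    #define EVENT_COST_SEC 0.000002   /* assumed cost of one instrumentation event */
    #define OVERHEAD_LIMIT 0.05       /* tolerated overhead: 5% of elapsed time */

    static double now_sec(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void) {
        double start = now_sec();
        long events = 0;
        int enabled = 1;

        for (long i = 0; i < 10 * 1000 * 1000; i++) {
            if (enabled)
                events++;                      /* stands in for a real instrumentation probe */

            if ((i & 0xFFFFF) == 0) {          /* periodically re-check the overhead budget */
                double run  = now_sec() - start;
                double cost = events * EVENT_COST_SEC;   /* predicted perturbation so far */
                enabled = run > 0 && cost / run < OVERHEAD_LIMIT;
            }
        }

        double run = now_sec() - start;
        printf("events counted: %ld, predicted overhead: %.1f%%\n",
               events, 100.0 * events * EVENT_COST_SEC / run);
        return 0;
    }
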
db_challenge.ps.Z
"Database Challenges in Global Information Systems" Joann J. Ordille and Barton P. Miller. ACM SIGMOD International Conference on the Management of Data, Washington, DC, May 1993, pp. 403-407. joann@research.att.com, bart@cs.wisc.edu

detecting.ps.Z
"Detecting Data Races in Parallel Program Executions" Robert H.B. Netzer and Barton P. Miller. Advances in Languages and Compilers for Parallel Processing, Research Monographs in Parallel and Distributed Computing, MIT Press, 1991. netzer@cs.wisc.edu, bart@cs.wisc.edu
Abstract: Several methods currently exist for detecting data races in an execution of a shared-memory parallel program. Although these methods address an important aspect of parallel program debugging, they do not precisely define the notion of a data race. As a result, it is not possible to precisely state which data races are detected, nor is the meaning of the reported data races always clear. Furthermore, these methods can sometimes generate false data race reports. They can determine whether a data race was exhibited during an execution, but when more than one data race is reported, only limited indication is given as to which ones are real. This paper addresses these two issues. We first present a model for reasoning about data races, and then present a two-phase approach to data race detection that attempts to validate the accuracy of each detected data race. Our model of data races distinguishes among those data races that actually occurred during an execution (\fIactual\fP data races), those that could have occurred because of timing variations (\fIfeasible\fP data races), and those that appeared to have occurred (\fIapparent\fP data races). The first phase of our two-phase approach to data race detection is similar to previous methods and detects a set of data race candidates (the apparent data races). We prove that this set always contains all actual data races, although it may contain other data races, both feasible and infeasible. Unlike previous methods, we then employ a second phase which validates the apparent data races by attempting to determine which ones are feasible. This second phase requires no more information than previous methods collect, and involves making a conservative estimate of the data dependences among the shared data to determine how these dependences may have constrained alternate orderings potentially exhibited by the execution. Each apparent data race can then be characterized as either being feasible, or as belonging to a set of apparent data races where at least one is feasible.

dyninst.ps.Z
"Dynamic Program Instrumentation for Scalable Performance Tools" Jeffrey K. Hollingsworth, Barton P. Miller, and Jon Cargille. Scalable High Performance Computing Conference, Knoxville, May 1994. hollings@cs.umd.edu, bart@cs.wisc.edu, jon@cs.wisc.edu
Abstract: In this paper, we present a new technique called dynamic instrumentation that provides efficient, scalable, yet detailed data collection for large-scale parallel applications. Our approach is unique because it defers inserting any instrumentation until the application is in execution. We can insert or change instrumentation at any time during execution. Instrumentation is inserted by modifying the application's binary image. This permits us to insert only the instrumentation that is necessary for the current analysis being performed or visualization presented. As a result, our technique collects several orders of magnitude less data than traditional data collection approaches. We have implemented a prototype of our dynamic instrumentation on the Thinking Machines CM-5, and present results for several real applications. In addition, we include recommendations to operating system designers, compiler writers, and computer architects about the features necessary to permit efficient monitoring of large-scale parallel systems.
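
The dyninst.ps.Z entry above (and several later entries) rely on inserting counters and timers into a running program and sampling them periodically instead of tracing every event. As a loose illustration only, the C fragment below imitates those counter/timer primitives with ordinary hand-inserted calls; the function names and the instrumented routine are invented, and nothing here performs actual binary modification.

    /* Sketch only: counter/timer primitives of the kind dynamic
     * instrumentation would splice into a running program, here simply
     * called by hand.  Periodic sampling replaces per-event tracing. */
    #include <stdio.h>
    #include <sys/time.h>

    static long   call_count;      /* how many times the "event" occurred    */
    static double call_time;       /* accumulated seconds spent in the event */
    static double t_started;

    static double now_sec(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    static void timer_start(void) { t_started = now_sec(); }
    static void timer_stop(void)  { call_time += now_sec() - t_started; call_count++; }

    static void do_send(void) {    /* stands in for an instrumented routine */
        volatile long spin;
        for (spin = 0; spin < 100000; spin++) ;
    }

    int main(void) {
        for (int i = 0; i < 50; i++) {
            timer_start();         /* code a tool would insert at routine entry */
            do_send();
            timer_stop();          /* ...and at routine exit */

            if (i % 10 == 9)       /* periodic sample, e.g. shipped to the tool */
                printf("sample: %ld calls, %.6f s total\n", call_count, call_time);
        }
        return 0;
    }
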
edcu.ps.Z "The Integration of Application and System Based Metrics in a Parallel Program Performance Tool" Jeffrey K. Hollingsworth, R. Bruce Irvin and Barton P. Miller ACM Symp on Prin. and Practice of Parallel Processing, May 1991. hollings@cs.umd.edu, rbi@informix.com, bart@cs.wisc.edu Abstract: The IPS-2 parallel program measurement tools provide performance data from application programs, the operating system, hardware, network, and other sources. Previous versions of IPS-2 allowed programmers to collect information about an application based only on what could be collected by software instrumentation inserted into the program (and system call libraries). We have developed an open interface, called the ``external time histogram'', providing a graceful way to include external data from many sources. The user can tell IPS-2 of new sources of performance data through an extensible metric description language. The data from these external sources is automatically collected when the application program is run. IPS-2 provides a library to simplify constructing the external data collectors. The new version of IPS-2 can measure shared-memory and message-passing parallel programs running on a heterogeneous collection of machines. Data from C or Fortran programs, and data from simulations can be processed by the same tool. As a result of including the new external performance data, IPS-2 now can report on a whole new set of performance problems. We describe the results of using IPS-2 on two real applications: a shared-memory database join utility, and a multi-processor interconnection network simulator. Even though these applications previously went through careful tuning, we were able to precisely identify performance problems and extract additional performance improvements of about 30%. experience.ps.Z "Experience with Techniques for Refining Data Race Detection" Robert H.B. Netzer and Barton P. Miller May 1992. netzer@cs.wisc.edu, bart@cs.wisc.edu Abstract: Dynamic data race detection is a critical part of debugging shared-memory parallel programs. The races that can be detected must be \fIrefined\fP to filter out false alarms and pinpoint only those that are direct manifestations of bugs. Most race detection methods can report false alarms because of imprecise run-time information and because some races are caused by others. To overcome this problem, race refinement uses whatever run-time information is available to speculate on which of the detected races should be reported. In this paper we report on experimental tests of two refinement techniques previously developed by us. Our goal was to determine whether good refinement is possible, and how much run-time information is required. We analyzed two sets of programs, one set written by others (which they had tested and believed to be race-free but which in fact had subtle races) and another set written by us (in which we introduced more complex races). We performed race detection and refinement on executions of these programs, and recorded both the global event ordering and an approximate ordering recorded without a global clock. We found that in all the programs written by others, accurate refinement was possible even without the global ordering. In the other programs, accurate refinement was also possible but required the global ordering. These results suggest that our techniques refine races accurately, and lead a programmer directly to race-causing bugs. 
They also suggest that race detection methods should be able to record and use either global or approximate event orderings, depending on the severity of the races being debugged.

fuzz.ps.Z
"An Empirical Study of the Reliability of UNIX Utilities" Barton P. Miller, Lars Fredriksen, and Bryan So. Appears in Comm. of the ACM, December 1990. bart@cs.wisc.edu
Operating system facilities, such as the kernel and utility programs, are assumed to be reliable. Recent experience has led us to question the robustness of our operating system utility programs under unusual input streams. Specifically, we were interested in a basic form of testing: whether the utilities would crash or hang (infinite loop) in response to unusual input. We constructed a set of tools to test this form of reliability. Almost 90 utilities were tested on each of seven versions of the UNIX operating system. As a result of our tests, we were able to crash 25-32% of the utilities that we tested, including commonly used utilities such as editors, shells, and document formatters. These were programs that either terminated abnormally, generating a core dump, or programs that had infinite loops. We then examined each utility program that crashed and identified the cause. These results are then categorized by the cause of the crash. We describe the cause of the crashes, the programming practices that led to the errors, and make some suggestions on how to avoid these problems in the future.

fuzz-revisited.ps.Z
"Fuzz Revisited: A Re-examination of the Reliability of UNIX Utilities and Services" Barton P. Miller, David Koski, Cjin Pheow Lee, Vivekananda Maganty, Ravi Murthy, Ajitkumar Natarajan, and Jeff Steidl. Computer Sciences Technical Report 1268, April 1995. bart@cs.wisc.edu Also appears (in German translation) in iX, September 1995.
We have tested the reliability of a large collection of basic UNIX utility programs, X-Window applications and servers, and network services. We used a simple testing method of subjecting these programs to a random input stream. Our testing methods and tools are largely automatic and simple to use. We tested programs on nine versions of the UNIX operating system, including seven commercial systems and the freely-available GNU utilities and Linux. We report which programs failed on which systems, and identify and categorize the causes of these failures. The result of our testing is that we can crash (with core dump) or hang (infinite loop) over 40% (in the worst case) of the basic programs and over 25% of the X-Window applications. We were not able to crash any of the network services that we tested nor any of the X-Window servers. This study parallels our 1990 study (that tested only the basic UNIX utilities); all systems that we compared between 1990 and 1995 noticeably improved in reliability, but still had significant rates of failure. The reliability of the basic utilities from GNU and Linux was noticeably better than that of the commercial systems. We also tested how utility programs checked their return codes from the memory allocation library routines by simulating the unavailability of virtual memory. We could crash almost half of the programs that we tested in this way.
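
The two fuzz entries above test utilities by feeding them random input and watching for crashes and hangs. The C sketch below shows that style of test in miniature, under assumptions of our own: the target program (/usr/bin/sort), the 10-second alarm, and the 100,000-byte input are arbitrary example choices, not the tools or parameters used in the studies.

    /* Simplified sketch of fuzz-style testing: feed random bytes to a
     * utility and see whether it exits normally, dies on a signal, or
     * hangs.  The target program and limits are arbitrary examples. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        const char *target = "/usr/bin/sort";    /* example victim utility */
        int fd[2];
        if (pipe(fd) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                          /* child: run the utility on random stdin */
            dup2(fd[0], STDIN_FILENO);
            close(fd[0]);
            close(fd[1]);
            alarm(10);                           /* crude hang detector: SIGALRM after 10s */
            execl(target, target, (char *)NULL);
            _exit(127);
        }

        close(fd[0]);
        signal(SIGPIPE, SIG_IGN);                /* survive an early crash of the child */
        srandom((unsigned)getpid());
        for (int i = 0; i < 100000; i++) {       /* the random input stream */
            char c = (char)(random() & 0xff);
            if (write(fd[1], &c, 1) != 1)
                break;
        }
        close(fd[1]);

        int status = 0;
        waitpid(pid, &status, 0);
        if (WIFSIGNALED(status))
            printf("%s killed by signal %d%s\n", target, WTERMSIG(status),
                   WTERMSIG(status) == SIGALRM ? " (hang timed out)" : "");
        else
            printf("%s exited with status %d\n", target, WEXITSTATUS(status));
        return 0;
    }

A real harness would iterate this over many utilities and many random seeds and record which inputs caused failures, as the studies above describe.
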
hollingsworth_thesis.ps.Z
"Finding Bottlenecks in Large-scale Parallel Programs" Jeffrey K. Hollingsworth. August 1994. Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #tba (Ph.D. Thesis) hollings@cs.umd.edu
Abstract: This thesis addresses the problem of trying to locate the source of performance bottlenecks in large-scale parallel and distributed applications. Performance monitoring creates a dilemma: identifying a bottleneck necessitates collecting detailed information, yet collecting all this data can introduce serious data collection bottlenecks. At the same time, users are being inundated with volumes of complex graphs and tables that require a performance expert to interpret. I have developed a new approach that addresses both these problems by combining dynamic on-the-fly selection of what performance data to collect with decision support to assist users with the selection and presentation of performance data. The approach is called the W3 Search Model (pronounced W-cubed). To make it possible to implement the W3 Search Model, I have developed a new monitoring technique for parallel programs called Dynamic Instrumentation. The premise of my work is that not only is it possible to do on-line performance debugging, but for large scale parallelism it is mandatory. The W3 Search Model closes the loop between data collection and analysis. Searching for a performance problem is an iterative process of refining the answers to three questions: why is the application performing poorly, where is the bottleneck, and when does the problem occur. To answer the why question, tests are conducted to identify the type of bottleneck (e.g., synchronization, I/O, computation). Answering the where question isolates a performance bottleneck to a specific resource used by the program (e.g., a disk system, a synchronization variable, or a procedure). Answering the when question tries to isolate a bottleneck to a specific phase of the program's execution. Dynamic Instrumentation differs from traditional data collection because it defers selecting what data to collect until the program is running. This permits insertion and alteration of the instrumentation during program execution. It also features a new type of data collection that combines the low data volume of sampling with the accuracy of tracing. Instrumentation to precisely count and time events is inserted by dynamically modifying the binary program. These counters and timers are then periodically sampled to provide intermediate values to the W3 Search Model. Based on this intermediate data, changes are made in the instrumentation to collect more information to further isolate the bottleneck. I have built a prototype implementation of the W3 Search Model and of Dynamic Instrumentation. The prototype runs on Thinking Machine's CM-5 and a network of workstations running PVM. In one study, the tools identified the bottlenecks in several real programs using two to three orders of magnitude less data than traditional techniques. In another study, Dynamic Instrumentation was used to monitor long running programs and introduced less than 10% perturbation. While the W3 Search Model and Dynamic Instrumentation complement each other, they are also useful individually. The W3 Search Model can be applied to existing post-mortem performance tools or even to simulated machines and environments. Dynamic Instrumentation has been used to collect performance data for other uses including visualization. The W3 Search Model and Dynamic Instrumentation are being incorporated into the Paradyn Performance Tools.

improving.ps.Z
"Improving the Accuracy of Data Race Detection" Robert H.B. Netzer and Barton P. Miller In Proc.
of the Third ACM Symposium on Principles and Practice of Parallel Programming, April 1991. netzer@cs.wisc.edu, bart@cs.wisc.edu Abstract: For shared-memory parallel programs that use explicit synchronization, \fIdata race\fP detection is an important part of debugging. A data race exists when concurrently executing sections of code access common shared variables. In programs intended to be data race free, they are sources of nondeterminism usually considered bugs. Previous methods for detecting data races in executions of parallel programs can determine when races occurred, but can report many data races that are artifacts of others and not direct manifestations of program bugs. Artifacts exist because some races can cause others and can also make false races appear real. Such artifacts can overwhelm the programmer with information irrelevant for debugging. This paper presents results showing how to identify non-artifact data races by \fIvalidation\fP and \fIordering\fP. Data race validation attempts to determine which races involve events that either did execute concurrently or could have (called \fIfeasible data races\fP). We show how each detected race can either be guaranteed feasible, or when insufficient information is available, sets of races can be identified within which at least one is guaranteed feasible. Data race ordering attempts to identify races that did not occur only as a result of others. Data races can be partitioned so that it is known whether a race in one partition may have affected a race in another. The \fIfirst\fP partitions are guaranteed to contain at least one feasible data race that is not an artifact of any kind. By combining validation and ordering, the programmer can be directed to those data races that should be investigated first for debugging. install_guide.ps.Z "IPS Installation Guide" Jeff Hollingsworth, Bruce Irvin, Mike McMenemy, and Barton P. Miller Version 5.0. hollings@cs.umd.edu, rbi@informix.com, mcmenemy@sequoia.com, bart@cs.wisc.edu ips.tar This file contains information on obtaining a license and distribution of the IPS-2 parallel programming tools. hollings@cs.umd.edu, rbi@informix.com, bart@cs.wisc.edu isca.ps.Z "Detecting Data Races on Weak Memory Systems" Sarita V. Adve, Mark D. Hill, Barton P. Miller, Robert H.B. Netzer In Proc. of the 18th Intl Symposium on Computer Architecture, May 1991. sarita@cs.wisc.edu, markhill@cs.wisc.edu, netzer@cs.wisc.edu, bart@cs.wisc.edu Abstract: For shared-memory systems, the most commonly assumed programmer's model of memory is sequential consistency. The weaker models of weak ordering, release consistency with sequentially consistent synchronization operations, data-race-free-0, and data-race-free-1 provide higher performance by guaranteeing sequential consistency to only a restricted class of programs - mainly programs that do not exhibit data races. To allow programmers to use the intuition and algorithms already developed for sequentially consistent systems, it is important to determine when a program written for a weak system exhibits no data races. In this paper, we investigate the extension of dynamic data race detection techniques developed for sequentially consistent systems to weak systems. A potential problem is that in the presence of a data race, weak systems fail to guarantee sequential consistency and therefore dynamic techniques may not give meaningful results. 
However, we reason that in practice a weak system will preserve sequential consistency at least until the ``first'' data races since it cannot predict if a data race will occur. We formalize this condition and show that it allows data races to be dynamically detected. Further, since this condition is already obeyed by all proposed implementations of weak systems, the full performance of weak systems can be exploited.

javaperf.ps.Z
"Performance Measurement of Interpreted Programs" Tia Newhall and Barton P. Miller. December 11, 1997.
In an interpreted execution there is an interdependence between the interpreter program's execution and the interpreted application's execution; the implementation of the interpreter determines how the interpreted application is executed, and the interpreted application triggers certain activities in the interpreter program. We present a representational model for describing performance data from an interpreted execution that explicitly represents the interaction between the interpreter and the interpreted application in terms of both the interpreter developer's and the interpreted application developer's view of the execution. We present results of a prototype implementation of a performance tool based on our representational model for measuring the performance of interpreted Java applications and applets. Our prototype uses dynamic instrumentation to measure Java programs starting with unmodified Java .class files and an unmodified Java virtual machine binary. As a demonstration, we use performance data from our tool to tune a Java application program, and as a result, we improved its performance by more than a factor of 3.

kerninst.ps kerninst.ps.Z kerninst.ps.gz kerninst.pdf
"Fine-Grained Dynamic Instrumentation of Commodity Operating System Kernels" Ariel Tamches and Barton P. Miller. tamches@cs.wisc.edu, bart@cs.wisc.edu Third Symp. on Operating Systems Design and Implementation (OSDI), New Orleans, February 1999.
We have developed a technology, fine-grained dynamic instrumentation of commodity kernels, which can splice (insert) dynamically generated code before almost any machine code instruction of a completely unmodified commodity operating system kernel, as it runs. This technology is well-suited to performance profiling, debugging, code coverage, security auditing, runtime code optimizations, and kernel extensions. We have designed and implemented a tool called KernInst that performs dynamic instrumentation on a stock production Solaris 2.5.1 kernel running on an UltraSPARC. We have implemented a kernel performance profiling tool on top of KernInst, and show how it uses dynamic instrumentation to quickly identify bottlenecks in a Web proxy server. We found that 10-20% of the proxy server's time was due to the Unix file system semantics of synchronizing file changes when updating meta-data. We also found that a further 20-30% of the proxy server's time was spent translating file names when opening or creating files.

kerninst-blackberry.ps kerninst-blackberry.ps.Z kerninst-blackberry.ps.gz kerninst-blackberry.pdf
"Using Dynamic Kernel Instrumentation for Kernel and Application Tuning" Ariel Tamches and Barton P. Miller. tamches@cs.wisc.edu, bart@cs.wisc.edu Submitted for publication, November 1998.
Abstract: We have designed a new technology, fine-grained dynamic instrumentation of commodity operating system kernels, which can insert runtime-generated code at almost any machine code instruction of an unmodified operating system kernel.
This technology is ideally suited for kernel performance profiling, debugging, code coverage, runtime optimization, and extensibility. We have written a tool called KernInst that implements dynamic instrumentation on a stock production Solaris 2.5.1 kernel running on an UltraSparc CPU. We have written a kernel performance profiler on top of KernInst. Measuring kernel performance has a two-way benefit; it can suggest optimizations to both the kernel and to applications that spend much of their time in kernel code. In this paper, we present our experiences using KernInst to identify kernel bottlenecks when running a web proxy server. By profiling kernel routines, we were able to understand performance bottlenecks inherent in the proxy's disk cache organization. We used this understanding to make two changes (one to the kernel and one to the application) that cumulatively reduce the percentage of elapsed time that the proxy spends opening disk cache files for writing from 40% to 13%.

labyrinth.ps.Z
"Lost in a Labyrinth of Workstations" Joann J. Ordille and Barton P. Miller. Third Workshop on Workstation Operating Systems, Key Biscayne, April 1992, pp. 52-55. joann@research.att.com, bart@cs.wisc.edu

mapping.ps.Z
"Mechanisms for Mapping High-Level Parallel Performance Data" R. Bruce Irvin and Barton P. Miller. ICPP Workshop on Challenges for Parallel Processing, Chicago, August 1996. rbi@informix.com, bart@cs.wisc.edu
Abstract: A primary problem in the performance measurement of high-level parallel programming languages is to map low-level events to high-level programming constructs. We discuss several aspects of this problem and present three methods with which performance tools can map performance data and provide accurate performance information to programmers. In particular, we discuss static mapping, dynamic mapping, and a new technique that uses a data structure called the set of active sentences. Because each of these methods requires cooperation between compilers and performance tools, we describe the nature and amount of cooperation required. The three mapping methods are orthogonal; we describe how they should be combined in a complete tool. Although we concentrate on mapping upward through layers of abstraction, our techniques are independent of mapping direction.
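
The mapping.ps.Z entry above is about charging low-level measurements to the high-level constructs that caused them. Purely as an illustration of the static-mapping flavor of that idea, the C fragment below sums invented low-level timing samples upward through a fixed routine-to-construct table; all routine names, constructs, and times are made up.

    /* Toy illustration of static upward mapping: charge time measured in
     * low-level runtime routines to the high-level source constructs that
     * generated them.  Routines, constructs, and times are invented. */
    #include <stdio.h>
    #include <string.h>

    struct sample  { const char *routine;   double seconds; };
    struct mapping { const char *routine;   const char *construct; };

    static const struct mapping map[] = {
        { "lib_shift_left", "A = CSHIFT(B, 1)  (array assignment)" },
        { "lib_copy_block", "A = CSHIFT(B, 1)  (array assignment)" },
        { "lib_reduce_sum", "S = SUM(A)        (reduction)"        },
    };

    static const struct sample samples[] = {   /* e.g. from an instrumented runtime library */
        { "lib_shift_left", 1.20 },
        { "lib_reduce_sum", 0.45 },
        { "lib_copy_block", 0.80 },
        { "lib_shift_left", 0.95 },
    };

    static const char *construct_of(const char *routine) {
        for (size_t m = 0; m < sizeof(map) / sizeof(map[0]); m++)
            if (strcmp(map[m].routine, routine) == 0)
                return map[m].construct;
        return "(unmapped low-level activity)";
    }

    int main(void) {
        const char *names[16];
        double      totals[16];
        int         n = 0;

        /* aggregate each low-level sample under its high-level construct */
        for (size_t s = 0; s < sizeof(samples) / sizeof(samples[0]); s++) {
            const char *c = construct_of(samples[s].routine);
            int k;
            for (k = 0; k < n && strcmp(names[k], c) != 0; k++) ;
            if (k == n) { names[n] = c; totals[n] = 0.0; n++; }
            totals[k] += samples[s].seconds;
        }

        for (int k = 0; k < n; k++)
            printf("%-40s %5.2f s\n", names[k], totals[k]);
        return 0;
    }
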
mdl.ps mdl.ps.Z mdl.ps.gz mdl.ps.pdf
"MDL: A Language and Compiler for Dynamic Program Instrumentation" Jeffrey K. Hollingsworth, Barton P. Miller, Marcelo J. R. Goncalves, Oscar Naim, Zhichen Xu and Ling Zheng. 1997 Int'l Conf. on Parallel Architectures and Compilation Techniques, November 1997, San Francisco, California. hollings@cs.umd.edu, {bart,mjrg,naim,zhichen,lzheng}@cs.wisc.edu
Abstract: We use a form of dynamic code generation, called dynamic instrumentation, to collect data about the execution of an application program. Dynamic instrumentation allows us to instrument running programs to collect performance and other types of information. The instrumentation code is generated incrementally and can be inserted and removed at any time. We base our instrumentation on a machine independent language and compiling system that currently runs on the SPARC, PA-RISC, Power2, Alpha, and x86 architectures. Specifications of what data to collect are written in this specialized language, called the Metric Description Language, that is part of the Paradyn Parallel Performance Tools. This language allows platform-independent descriptions of how to collect performance data. It also provides a concise way to specify how to constrain performance data to particular resources such as modules, procedures, nodes, files, or message channels (or combinations of these resources). MDL is less general than a procedural programming language (so compiling is faster and simpler), but more general than using specialization. Our dynamic instrumentation environment includes a compiler and code generation, structural analysis of the binary, and an instrumentation manager that allows code to be inserted and removed from the running program. We are able to operate on optimized code on all of our currently supported platforms.

metrics.ps.Z
"Parallel Program Performance Metrics: A Comparison and Validation" Jeffrey K. Hollingsworth and Barton P. Miller. Supercomputing '92, November 1992. hollings@cs.umd.edu, bart@cs.wisc.edu
Abstract: There are many metrics designed to assist in the performance debugging of large-scale parallel applications. We describe a new technique, called True Zeroing, that permits direct quantitative comparison of the guidance supplied by these metrics on real applications. We apply this technique to three programs that include both numeric and symbolic applications. We compare three existing metrics: gprof, Critical Path, and Quartz/NPT, and several new variations. Critical Path provided the best overall guidance, but it was not infallible. We also include a set of recommendations to tool builders based on the experience gained during our case study.

msgtracing.ps.Z
"Optimal Tracing and Replay for Debugging Message-Passing Parallel Programs" Robert H.B. Netzer and Barton P. Miller. netzer@cs.wisc.edu, bart@cs.wisc.edu Supercomputing '92, November 1992.
Abstract: A common debugging strategy involves re-executing a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of variations in message latencies and process scheduling, different runs on the given input may produce different results. This non-repeatability is a serious debugging problem, since an execution cannot always be reproduced to track down bugs. This paper presents a technique for tracing and replaying message-passing programs for debugging. Our technique is \fIoptimal\fP in the common case and has good performance in the worst case. By making run-time tracing decisions, we trace only a fraction of the total number of messages, gaining two orders of magnitude reduction over traditional techniques which trace every message. Experiments indicate that only 1% of the messages often need be traced. These traces are sufficient to provide replay, allowing an execution to be reproduced any number of times for debugging. Our work is novel in that we use run-time decisions to detect and trace only those messages that introduce nondeterminacy. With our strategy, large reductions in trace size allow long-running programs to be replayed that were previously unmanageable. In addition, the reduced tracing requirements alleviate tracing bottlenecks, allowing executions to be debugged with substantially lower execution-time overhead.

multiapp.ps.Z
"Multi-Application Support in a Parallel Program Performance Tool" R. Bruce Irvin and Barton P. Miller. IEEE Parallel & Distributed Technology, 2(1), pp. 40-50, Spring 1994. rbi@informix.com, bart@cs.wisc.edu
Abstract: Program performance measurement tools have proven to be useful for tuning single, isolated, parallel and distributed applications.
However, large-scale parallel machines and heterogeneous networks often do not allow for such isolated execution, much less isolated measurement. Performance measurement tools should allow users to study workload scheduling policies, resource competition among application programs, client/server interactions in distributed systems, and comparisons of application programs running on multiple hardware platforms. To enable and encourage such studies, we have extended the IPS-2 parallel program measurement tools to support the analysis of multiple applications (and multiple runs of the same application) in a single measurement session. This multi-application support allows the user to study each application as a logically separate entity, study groupings of the applications based on their physical location, or study the entire collection of applications. We used the new multi-application support in three case studies. In these studies we examined (1) the effects of clock precision on the quality of performance data, (2) the effects of gang scheduling on competing parallel applications, and (3) the performance interaction of client processes and server in a database system. The multi-application support allowed quick comparison of different versions of a program, with a concrete visual and numeric comparison. Also, it directly showed performance dependences between parallel applications. netzer_thesis.ps.Z "Race Condition Detection for Debugging Shared-Memory Parallel Programs" Robert H.B. Netzer Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #1039 (Ph.D. Thesis) netzer@cs.wisc.edu Abstract: This thesis addresses theoretical and practical aspects of the dynamic detecting and debugging of race conditions in shared-memory parallel programs. To reason about race conditions, we present a formal model that characterizes actual, observed, and potential behaviors of the program. The actual behavior precisely represents the program execution, the observed behavior represents partial information that can be reasonably recorded, and the potential behavior represents alternate executions possibly allowed by nondeterministic timing variations. These behaviors are used to characterize different types of race conditions, \fIgeneral races\fP and \fIdata races\fP, which pertain to different classes of parallel programs and require different detection techniques. General races apply to programs intended to be deterministic; data races apply to nondeterministic programs containing critical sections. We prove that, for executions using synchronization powerful enough to implement two-process mutual exclusion, locating every general race or data race is an NP-hard problem. However, for data races, we show that detecting only a subset of all races is sufficient for debugging. We also prove that, for weaker types of synchronization, races can be efficiently located. We present post-mortem algorithms for detecting race conditions as accurately as possible, given the constraint of limited run-time information. We characterize those races that are direct manifestations of bugs and not artifacts caused by other races, imprecise run-time traces (causing false races to appear real), or unintentional synchronization (caused by shared-memory references). Our techniques analyze the observed behavior to conservatively locate races that either did occur or had the potential of occurring, and that were unaffected by any other race in the execution. 
Finally, we describe a prototype data race detector that we used to analyze a sample collection of parallel programs. Experiments indicate that our techniques effectively pinpoint non-artifact races, directing the programmer to parts of the execution containing direct manifestations of bugs. In all programs analyzed, our techniques reduced hundreds to thousands of races down to four or fewer that required investigation.

nv.ps.Z
"A Performance Tool for High-Level Parallel Programming Languages" R. Bruce Irvin and Barton P. Miller. Programming Environments for Massively Parallel Distributed Systems, K.M. Decker, R.M. Rehmann editors, Birkhauser Verlag, pp. 299-314, 1994. rbi@informix.com, bart@cs.wisc.edu
Abstract: Users of high-level parallel programming languages require accurate performance information that is relevant to their source code. Furthermore, when their programs cause performance problems at the lowest levels of their hardware and software systems, programmers need to be able to peel back layers of abstraction to examine low-level problems while maintaining references to the high-level source code that ultimately caused the problem. In this paper, we present NV, a model for the explanation of performance information for programs built on multiple levels of abstraction. In NV, a level of abstraction includes a collection of nouns (code and data objects), verbs (activities), and performance information measured for the nouns and verbs. Performance information is mapped from level to level to maintain the relationships between low-level activities and high-level code, even when such relationships are implicit. We have used the NV model to build ParaMap, a performance tool for the CM Fortran language that has, in practice, guided us to substantial improvements in real CM Fortran applications. We describe the design and implementation of our tool and show how its simple tabular and graphical performance displays helped us to find performance problems in two applications. In each case, we found that performance information at all levels was most useful when related to parallel CM Fortran arrays, and that we could subsequently reduce each application's execution time by more than half.

nv2.ps.Z
"Mapping Performance Data for High-Level and Data Views of Parallel Program Performance" Int'l Conference on Supercomputing, Philadelphia, May 1996. R. Bruce Irvin and Barton P. Miller. rbi@informix.com, bart@cs.wisc.edu
Abstract: Programs written in high-level parallel languages need profiling tools that provide performance data in terms of the semantics of the high-level language. But high-level performance data can be incomplete when the cause of a performance problem cannot be explained in terms of the semantics of the language. We also need the ability to view the performance of the underlying mechanisms used by the language and correlate the underlying activity to the language source code. The key technique for providing these performance views is the ability to map low-level performance data up to the language abstractions. We identify the various kinds of mapping information that needs to be gathered to support multiple views of performance data and describe how we can mine mapping information from the compiler and run-time environment. We also describe how we use this information to produce performance data at the higher levels, and how we present this data in terms of both the code and parallel data structures.
We have developed an implementation of these mapping techniques for the data parallel CM Fortran language running on the TMC CM-5. We have augmented the Paradyn Parallel Performance Tools with these mapping and high-level language facilities and used them to study several real data parallel Fortran (CM Fortran) applications. Our mapping and high-level language techniques allowed us to quickly understand these applications and modify them to obtain significant performance improvements.

ordille_thesis.ps.Z
"Descriptive Name Services for Large Internets" Joann J. Ordille. November 1993. Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #1208 (Ph.D. Thesis) joann@research.att.com
This thesis addresses the challenge of locating people, resources, and other objects in the global Internet. As the Internet grows beyond a million hosts in tens of thousands of organizations, it is increasingly difficult to locate any particular object. Hierarchical name services are frustrating, because users must guess the unique names for objects or navigate the name space to find information. Descriptive (i.e. relational) name services offer the promise of simple resource location through a non-procedural query language. Users locate resources by describing resource attributes. This thesis makes the promise of descriptive name services real by providing fast query processing in large internets. The key to speed in descriptive query processing is constraining the search space using two new techniques, called an active catalog and meta-data caching. The active catalog constrains the search space for a query by returning a list of data repositories where the answer to the query is likely to be found. Components of the catalog are distributed indices that isolate queries to parts of the network, and smart algorithms for limiting the search space by using semantic, syntactic, or structural constraints. Meta-data caching improves performance by keeping frequently used characterizations of the search space close to the user, thus reducing active catalog communication and processing costs. When searching for query responses, these techniques improve query performance by contacting only the data repositories likely to have actual responses, resulting in acceptable search times. Even a request to locate a person with a popular name in the global name space can often be answered in seconds. The new techniques are integrated with existing data caching techniques through a single mechanism, called a referral. Referrals describe the conditions for using active catalog components, or re-using meta-data or data cache entries. Referrals form a DAG, called a referral graph, that directs query processing. The referral graph provides a uniform framework for identifying the specific system resources (e.g. cache entries or catalog components) that optimize processing for a query. It identifies search spaces that can be combined through intersection or union. It combines indices specific to organizations into a composite index for the global name space. It eliminates components with overlapping functions from query processing. The referral graph also provides a natural source of advice to users. Users who present expensive queries can be asked for specific attribute values to constrain the search for responses and lower processing cost. Referral graphs, an active catalog, meta-data and data caching are combined in the prototype descriptive name service called Nomenclator.
In one measurement study, Nomenclator consistently improved the performance of descriptive queries in X.500. Another measurement study shows how Nomenclator uses a small investment of network bandwidth and server resources to improve the response time for a wide range of query sizes. An analytical modeling study shows that Nomenclator can amortize this investment over many queries to provide an overall reduction in system load and, as a consequence, better scaling and response time. Nomenclator provides a uniform interface to various name services, such as X.500 and the Ninth Edition Unix Name Service from Bell Labs. Nomenclator is simple to extend to include new name services.

overview.ps overview.ps.Z overview.ps.gz overview.ps.pdf
"The Paradyn Parallel Performance Measurement Tools" Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. IEEE Computer 28, 11, (November 1995). {bart,markc,jon,karavan,newhall}@cs.wisc.edu hollings@cs.umd.edu rbi@informix.com The Paradyn project home page is http://www.cs.wisc.edu/~paradyn.
** Note: this paper contains several color postscript pages. It should print acceptably on b/w printers.
Paradyn is a performance measurement tool for parallel and distributed programs. Paradyn uses several novel technologies so that it scales to long running programs (hours or days) and large (thousand node) systems, and automates much of the search for performance bottlenecks. It can provide precise performance data down to the procedure and statement level. Paradyn is based on a dynamic notion of performance instrumentation and measurement. Unmodified executable files are placed into execution and then performance instrumentation is inserted into the application program and modified during execution. The instrumentation is controlled by the Performance Consultant module, which automatically directs the placement of instrumentation. The Performance Consultant has a well-defined notion of performance bottlenecks and program structure, so that it can associate bottlenecks with specific causes and specific parts of a program. Paradyn controls its instrumentation overhead by monitoring the cost of its data collection, limiting its instrumentation to a (user controllable) threshold. The instrumentation in Paradyn can easily be configured to accept new operating system, hardware, and application specific performance data. It also provides an open interface for performance visualization, and a simple programming library to allow these visualizations to interface to Paradyn. Paradyn can gather and present performance data in terms of high-level parallel languages (such as data parallel Fortran) and can measure programs on massively parallel computers, workstation clusters, and heterogeneous combinations of these systems.

paging.ps.Z
"Paging Tradeoffs in Distributed-Shared-Memory Multiprocessors" Douglas C. Burger, Rahmat S. Hyder, Barton P. Miller, and David A. Wood. Supercomputing '94, Washington DC, November 1994. dburger@cs.wisc.edu, hyder@cs.wisc.edu, bart@cs.wisc.edu, david@cs.wisc.edu
Massively parallel processors have begun using commodity operating systems that support demand-paged virtual memory. To evaluate the utility of virtual memory, we measured the behavior of seven shared-memory parallel application programs on a simulated distributed-shared-memory machine.
Our results (i) confirm the importance of gang CPU scheduling, (ii) show that a page-faulting processor should spin rather than invoke a parallel context switch, (iii) show that our parallel programs frequently touch most of their data, and (iv) indicate that memory, not just CPUs, must be ``gang scheduled''. Overall, our experiments demonstrate that demand paging has limited value on current parallel machines because of the applications' synchronization and memory reference patterns and the machines' high page-fault and parallel-context-switch overheads.

paradyn_and_devise.ps.Z
"Integrated Visualization of Parallel Program Performance Data" Karen L. Karavanic, Jussi Myllymaki, Miron Livny, and Barton P. Miller. Environments and Tools for Parallel Scientific Computing, SIAM Press, 1996. {karavan,jussi,miron,bart}@cs.wisc.edu
Performance tuning a parallel application involves integrating performance data from many components of the system, including the message passing library, performance monitoring tool, resource manager, operating system, and the application itself. The current practice of visualizing these data streams using a separate, customized tool for each source is inconvenient from a usability perspective, and there is no easy way to visualize the data in an integrated fashion. We demonstrate a solution to this problem using Devise, a generic visualization tool which is designed to allow an arbitrary number of different but related data streams to be integrated and explored visually in a flexible manner. We display data emanating from a variety of sources side by side in three case studies. First we interface the Paradyn Parallel Performance Tool and Devise, using two simple data export modules and Paradyn's simple visualization interface. We show several Devise/Paradyn visualizations which are useful for performance tuning parallel codes, and which incorporate data from Unix utilities and application output. Next we describe the visualization of trace data from a parallel application running in a Condor cluster of workstations. Finally we demonstrate the utility of Devise visualizations in a study of Condor cluster activity.

paradynPVM.ps.Z
"The Paradyn Parallel Performance Tools and PVM" Barton P. Miller, Jeffrey K. Hollingsworth, and Mark D. Callaghan. in Environments and Tools for Parallel Scientific Computing, SIAM Press, J. Dongarra and B. Tourancheau, eds., 1994. bart@cs.wisc.edu hollings@cs.umd.edu markc@cs.wisc.edu
Paradyn is a performance tool for large-scale parallel applications. By using dynamic instrumentation and automating the search for bottlenecks, it can measure long running applications on production-sized data sets. Paradyn has recently been ported to measure native PVM applications. Programmers run their unmodified PVM application programs with Paradyn. Paradyn automatically inserts and modifies instrumentation during the execution of the application, systematically searching for the causes of performance problems. In most cases, Paradyn can isolate major causes of performance problems, and the part of the program that is responsible for the problem. Paradyn currently runs on the Thinking Machines CM-5, Sun workstations, and PVM (currently only on Suns). It can measure heterogeneous programs across any of these platforms. This paper presents an overview of Paradyn, describes the new facility in PVM that supports Paradyn, and reports experience with PVM applications.
pointers.ps.Z
"The Frequency of Dynamic Pointer References in ``C'' Programs" Barton P. Miller. SIGPLAN Notices 23, 6 (June 1988). bart@cs.wisc.edu
Abstract: A collection of ``C'' programs was measured for the number of dynamic references to pointers. The number of dynamic references to pointers is presented with respect to the total number of instructions a program executes, giving the percentage of pointer references executed in a ``C'' program. The measurements were done on a VAX 11/780 running the Berkeley UNIX operating system. The measured programs were selected by examining the most commonly run programs on the Computer Sciences Department UNIX machines. The measurement process was performed in two steps: (1) the dynamic counting of pointer references, and (2) the counting of the total number of instructions executed by the program.

porting_guide.ps.Z
"Guide to Porting IPS" Bruce Irvin and Bart Miller. Version 5.0. rbi@informix.com, bart@cs.wisc.edu

postwait.ps.Z
"Efficient Race Condition Detection for Shared-Memory Programs with Post/Wait Synchronization" Robert H.B. Netzer and Sanjoy Ghosh. January 1992. 1992 Int'l Conference on Parallel Processing, St Charles, IL, August 1992. netzer@cs.wisc.edu, ghosh@csrd.uiuc.edu
Abstract: Shared-memory parallel programs are often designed to be deterministic, both in their final results and intermediate states. Determinacy is desired because it simplifies debugging and because some applications naturally have deterministic implementations. However, debugging such programs requires a mechanism for locating \fIrace conditions\fP or violations of the intended determinacy when they occur. This paper presents the first precise, efficient algorithm for dynamically detecting race conditions in programs that use non-trivial synchronization. We address Post/Wait synchronization, the most powerful type of synchronization for which efficient race detection is possible. Our algorithm computes the order in which synchronization operations in the execution are guaranteed to have occurred. Using this information race conditions can be detected either post-mortem or on-the-fly. This work answers a previously open question by considering non-trivial synchronization and presenting a precise, efficient algorithm. In contrast, previous work has addressed either simpler types of synchronization, approximations to race detection, or a different (and easier to detect) type of race.

ppd_sigplan.ps.Z
"A Mechanism for Efficient Debugging of Parallel Programs" Barton P. Miller and Jong-Deok Choi. SIGPLAN Conf. on Programming Language Design and Implementation, June 1988, Atlanta, GA, pp. 135-144. Also appears in SIGPLAN Notices 23(7), July 1988. bart@cs.wisc.edu, jdchoi@watson.ibm.com
Abstract: This paper addresses the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors (SMMP). We describe the use of \fIflowback analysis\fP to provide information on causal relationships between events in a program's execution without re-executing the program for debugging. We introduce a mechanism called \fIincremental tracing\fP that, by using semantic analyses of the debugged program, makes the flowback analysis practical with only a small amount of trace generated during execution. We extend flowback analysis to apply to parallel programs and describe a method to detect race conditions in the interactions of the co-operating processes.
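
Many of the race-detection entries in this index (detecting, improving, postwait, ppd_sigplan) rest on computing a happened-before ordering of events and flagging conflicting accesses that are unordered. The C fragment below is a generic vector-clock sketch of that test, not the algorithms from these papers; the two-process trace and the clock values are invented.

    /* Generic vector-clock sketch of the happened-before test underlying
     * dynamic race detection: two accesses to a shared variable race if
     * they conflict and neither clock precedes the other.  Values invented. */
    #include <stdio.h>

    #define NPROC 2

    struct access { int proc; int is_write; int clock[NPROC]; };

    /* returns 1 if clock a is componentwise <= clock b */
    static int leq(const int a[NPROC], const int b[NPROC]) {
        for (int i = 0; i < NPROC; i++)
            if (a[i] > b[i]) return 0;
        return 1;
    }

    int main(void) {
        /* P0 writes the variable at clock (1,0); P1 reads it at (0,1):
         * the two accesses are causally unordered. */
        struct access x = { 0, 1, { 1, 0 } };
        struct access y = { 1, 0, { 0, 1 } };

        int conflict = x.is_write || y.is_write;          /* at least one write  */
        int ordered  = leq(x.clock, y.clock) || leq(y.clock, x.clock);

        if (conflict && !ordered)
            printf("apparent data race: P%d and P%d access the variable concurrently\n",
                   x.proc, y.proc);
        else
            printf("accesses are ordered or non-conflicting\n");
        return 0;
    }

As the detecting and improving entries stress, a detector built this way reports apparent races, which then need validation and ordering to separate the feasible, non-artifact races from the rest.
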
on Programming Languages and Systems (Oct. 91). bart@cs.wisc.edu, jdchoi@watson.ibm.com, netzer@cs.wisc.edu Abstract: Flowback analysis is a powerful technique for debugging programs. It allows the programmer to examine dynamic dependences in a program's execution history without having to re-execute the program. The goal is to present to the programmer a graphical view of the dynamic program dependences. We are building a system, called PPD, that performs flowback analysis while keeping the execution time overhead low. We also extend the semantics of flowback analysis to parallel programs. This paper describes details of the graphs and algorithms needed to implement efficient flowback analysis for parallel programs. Execution time overhead is kept low by recording only a small amount of trace during a program's execution. We use semantic analysis and a technique called "incremental tracing" to keep the time and space overhead low. As part of the semantic analysis, PPD uses a static program dependence graph structure that reduces the amount of work done at compile time and takes advantage of the dynamic information produced during execution time. Parallel programs have been accommodated in two ways. First, the flowback dependences can span process boundaries; i.e., the most recent modification to a variable might be traced to a different process than the one that contains the current reference. The static and dynamic program dependence graphs of the individual processes are tied together with synchronization and data dependence information to form complete graphs that represent the entire program. Second, our algorithms will detect potential data race conditions in accesses to shared variables. The programmer can be directed to the cause of the race condition. PPD is currently being implemented for the C programming language on a Sequent Symmetry shared-memory multiprocessor.

ppopp97.ps.Z "Shared-Memory Performance Profiling" Zhichen Xu, James R. Larus, and Barton P. Miller 1997 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Las Vegas, June 1997. {zhichen,larus,bart}@cs.wisc.edu Abstract: We describe a new approach to finding performance bottlenecks in shared-memory parallel programs and have implemented this approach in the Paradyn Parallel Performance Tools running on the Blizzard fine-grain distributed shared memory system. The main characteristics of our approach are detecting and monitoring data sharing patterns, presenting performance data in a data-centric way, and automating the search for performance bottlenecks. Monitoring data sharing patterns helps detect when a program accesses data in a way that indicates a performance bottleneck. The data-structure-centric view associates performance results with both data structures and memory blocks, complementing Paradyn's usual control-centric view of module/procedure/statement entities. Automated search makes performance tuning a less painful experience. With these mechanisms, we improved the performance of a new, untuned shared-memory application by more than a factor of four.

rbi_thesis.ps.Z "Performance Measurement Tools for High-Level Parallel Programming Languages" R. Bruce Irvin October 1995. Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #tba (Ph.D. Thesis) rbi@informix.com Abstract: Users of high-level parallel programming languages require accurate performance information that is relevant to their source code.
Furthermore, when their programs experience performance problems at the lowest levels of their hardware and software systems, programmers need to be able to peel back layers of abstraction to examine low-level problems while maintaining references to the high-level source code that ultimately caused the problem. This dissertation addresses the problems associated with providing useful performance data to users of high-level parallel programming languages. In particular, it describes techniques for providing source-level performance data to programmers, for mapping performance data among multiple layers of abstraction, and for providing data-oriented views of performance. We present NV, a model for the explanation of performance information for high-level parallel language programs. In NV, a level of abstraction includes a collection of nouns (code and data objects), verbs (activities), and performance information measured for the nouns and verbs. Performance information is mapped from level to level to maintain relationships between low-level activities and high-level code, even when such relationships are implicit. The NV model has helped us to implement support for performance measurement of high-level parallel language applications in two performance measurement tools (ParaMap and Paradyn). We describe the design and implementation of these tools and show how they provide performance information for CM Fortran programmers. Finally, we present results of measurement studies in which we have used ParaMap and Paradyn to improve the performance of a variety of real CM Fortran applications running on CM-5 parallel computers. In each case, we found that overall performance trends could be observed at the source code level and that both data views and code views of performance were useful. We found that some performance problems could not be explained at the source code level; in these cases, we used the performance tools to examine lower levels of abstraction to find the problems. We found that low-level information was most useful when related to source-level code structures and (especially) data structures. Ultimately, we made relatively small changes to the applications' source code to achieve substantial performance improvements.

ref_guide.ps.Z "IPS System Programmers Reference Guide" Jeffrey K. Hollingsworth, R. Bruce Irvin and Barton P. Miller Version 5.0. hollings@cs.umd.edu, rbi@informix.com, bart@cs.wisc.edu

slack.ps.Z "Slack: A New Performance Metric for Parallel Programs" Jeffrey K. Hollingsworth and Barton P. Miller hollings@cs.umd.edu, bart@cs.wisc.edu Appears as a CS dept. technical report. Critical Path Profiling is a technique that provides guidance to help programmers improve the running time of their programs. However, Critical Path Profiling provides only an upper-bound estimate of the improvement possible in a parallel program execution. In this paper, we present a new metric, called Slack, to complement Critical Path and provide additional information to parallel programmers about the potential impact of making improvements along the critical path.

sigcomm91.ps.Z "Nomenclator Descriptive Query Optimization for Large X.500 Environments" Joann J. Ordille and Barton P. Miller ACM SIGCOMM Symposium on Communications Architectures and Protocols, Zurich, September 1991, pp. 185-196. joann@research.att.com, bart@cs.wisc.edu Abstract: Nomenclator is an architecture for providing efficient descriptive (attribute-based) naming in a large internet environment.
As a test of the basic design, we have built a Nomenclator prototype that uses X.500 as its underlying data repository. X.500 SEARCH queries that previously took several minutes can, in many cases, be answered in a matter of seconds. Our system improves descriptive query performance by trimming branches of the X.500 directory tree from the search. These tree-trimming techniques are part of an active catalog that constrains the search space as needed during query processing. The active catalog provides information about the data distribution (meta-data) to constrain query processing on demand. Nomenclator caches both data (responses to queries) and meta-data (data distribution information, tree-trimming techniques, data access techniques) to speed future queries. Nomenclator relieves users of the need to understand the structure of the name space to locate objects quickly in a large, structured name environment. Nomenclator is a meta-level service that will eventually incorporate other name services in addition to X.500. Its techniques for improving performance should be generally applicable to other naming systems.

steering.ps.Z "Integrating a Debugger and Performance Tool for Steering" Krishna Kunchithapadam and Barton P. Miller Proceedings of the Workshop on Debugging and Performance Tuning for Parallel Computing Systems, Cape Cod, Massachusetts, USA, October 3-5, 1994. krishna@cs.wisc.edu, bart@cs.wisc.edu Abstract: Steering is a performance optimization idiom applicable to many problem domains. It allows control and performance tuning to take place during program execution. Steering emphasizes the optimization and control of the performance of a program using mechanisms that are external to the program. Performance measurement tools and symbolic debuggers already independently provide some of the mechanisms needed to implement a steering tool. In this paper, we describe a configuration that integrates a performance tool, Paradyn, and a debugger to build a steering environment. The steering configuration allows fast prototyping of steering policies and provides support for both interactive and automated steering.

swb.ps.Z "Reliable and Secure UNIX Connection Service", D. Draheim, B.P. Miller, and S. Snyder, 6th Symp. on Reliability in Distributed Software and Database Systems, Williamsburg, VA (March 1987). bart@cs.wisc.edu Distributed programs require a method for processes residing on different machines to identify each other and establish communication. One method is to provide a special connection service to perform this task. A good connection service should be easy to use. It should allow arbitrary processes to connect to each other, as well as help client processes connect to server processes. It should provide location transparency; that is, the programmer should not have to know the network address of a process to connect to it. The connection service should be reliable. It should provide a way for a process to establish the identity of the user associated with the process to which it has connected, and to communicate securely with that process. We have implemented a connection service for Berkeley UNIX that is reliable, available, secure, and easy to use. The connection service achieves ease of use through a simple interface based on the library routine meet(). Meet() allows one process to connect to another by specifying arbitrary names for itself and the other process.
The connection service imposes no naming conventions of its own, so it can be used with most name spaces and naming services. The service is location-transparent. It also provides a routine for posting services. Reliable and available service is provided by replicating connection servers. Each server knows about all pending connection requests. The connection service provides continuous service as long as at least one server is running. Connections can be authenticated by an authentication server that works in cooperation with the connection server. Secure communication is achieved via the RSA public-key encryption algorithm. The connection server was put into regular use in June 1986. Our limited experience indicates that it satisfies an important need of UNIX users.

threads.ps threads.ps.Z threads.ps.gz threads.pdf "Dynamic Instrumentation of Threaded Applications" Zhichen Xu, Barton P. Miller and Oscar Naim. Submitted for publication. zhichen@cs.wisc.edu, bart@cs.wisc.edu, onaim@us.oracle.com The use of threads is becoming commonplace in both sequential and parallel programs. This paper describes our design and initial experience with non-trace-based performance instrumentation techniques for threaded programs. Our goal is to provide detailed performance data while maintaining control of instrumentation costs. We have extended Paradyn's dynamic instrumentation (which can instrument programs without recompiling or relinking) to handle threaded programs. Controlling instrumentation costs means using efficient instrumentation code and avoiding locks in the instrumentation. Our design is based on low-contention data structures. To associate performance data with individual threads, we have all threads share the same instrumentation code and give each thread its own private copy of performance counters or timers (an illustrative sketch of this idea appears after the visualize.ps.Z entry below). The asynchrony in a threaded program poses a major challenge to dynamic instrumentation. To implement time-based metrics on a per-thread basis, we need to instrument thread context switches, which can cause instrumentation code to interleave. Interleaved instrumentation can not only corrupt performance data, but can also cause a scenario we call self-deadlock, in which instrumentation code deadlocks a thread. We introduce thread-conscious locks to avoid self-deadlock, and per-thread virtual CPU timers to reduce the chance of interleaved instrumentation accessing the same performance counter or timer and to reduce the number of expensive timer calls at thread context switches. Our initial implementation is on SPARC Solaris 2.x, including multiprocessor Sun UltraSPARC Enterprise machines. We tested our tool on large multithreaded applications, including the Java Virtual Machine (JVM). We show how our new techniques helped us to speed up a Java native method by 22% and to improve the overall execution time of the JVM interpreting the AppletViewer, driven by a game applet, by 7%.

usersguide.ps.Z "IPS-2 Users Guide" Jeffrey K. Hollingsworth, R. Bruce Irvin and Barton P. Miller Version 5.0. hollings@cs.umd.edu, rbi@informix.com, bart@cs.wisc.edu

visualize.ps.Z "What to Draw? When to Draw? An Essay on Parallel Program Visualization" Barton P. Miller Journal of Parallel and Distributed Computing, vol 18, no 2 (June 1993). bart@cs.wisc.edu In this essay, I share opinions and experiences that I have formed about visualization. These opinions were formed while developing tools for understanding the execution of parallel programs. I try to codify some of the implicit rules that have guided my research efforts.
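To make the per-thread counter idea from the threads entry above concrete, here is a minimal C sketch. It is illustrative only, not Paradyn's actual instrumentation code, and all names in it are hypothetical; it simply shows how every thread can execute the same instrumentation code while updating a private, thread-local counter, so no lock is needed.

    /* Per-thread counters shared by common instrumentation code.
     * Illustrative sketch only; compile with: cc -pthread counters.c */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* Each thread gets its own copy of the counter (thread-local storage
     * via the common __thread extension), so no lock is required. */
    static __thread unsigned long func_calls = 0;

    /* Shared instrumentation code: all threads run the same instructions
     * but each updates only its own counter. */
    static void instrumented_function(void)
    {
        func_calls++;
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            instrumented_function();
        printf("thread %lu made %lu instrumented calls\n",
               (unsigned long)pthread_self(), func_calls);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Because each counter is private to its thread, the shared instrumentation code never contends for it; per-thread timers can be laid out the same way.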
wrapping.ps.Z "Binary Wrapping: A Technique for Instrumenting Object Code" Jon Cargille and Barton P. Miller SIGPLAN Notice, vol 27, no 6 (June 1992). jon@cs.wisc.edu, bart@cs.wisc.edu Abstract: We present a technique, called "binary wrapping", that allows object code routines to be instrumented. Binary wrapping allows tracing to be placed around (and sometimes within) proprietary code, when source code access is difficult or impossible. This technique is based on wrapping user-written code around the object code routine. No modifications are needed to the programs that call the object code routine. Binary wrapping has proven itself invaluable in instrumenting proprietary libraries, and may well be useful in similar circumstances. w3search.ps.Z "Dynamic Control of Performance Monitoring on Large Scale Parallel Systems" Jeffrey K. Hollingsworth and Barton P. Miller International Conference on Supercomputing, Tokyo, July 19-23, 1993. hollings@cs.umd.edu, bart@cs.wisc.edu Abstract: Performance monitoring of large scale parallel computers creates a dilemma: we need to collect detailed information to find performance bottlenecks, yet collecting all this data can introduce serious data collection bottlenecks. At the same time, users are being inundated with volumes of complex graphs and tables that require a performance expert to interpret. We present a new approach called the W cubed Search Model, that addresses both these problems by combining dynamic on-the-fly selection of what performance data to collect with decision support to assist users with the selection and presentation of performance data. Our goal is to provide users with answers to their performance questions and at the same time dramatically reduce the volume of performance data we need to collect. We present a case study describing how a prototype implementation of our technique was able to identify the bottlenecks in three real programs. In addition, we were able to reduce the amount of performance data collected by a factor ranging from 13 to 700 compared to traditional sampling and trace based instrumentation techniques. xman.ps.Z "Experiment Management Support for Performance Tuning" Karen L. Karavanic and Barton P. Miller SC97, San Jose, California, Nov 1997. karavan@cs.wisc.edu, bart@cs.wisc.edu Abstract: The development of a high-performance parallel system or application is an evolutionary process -- both the code and the environment go through many changes during a program's lifetime -- and at each change, a key question for developers is: how and how much did the performance change? No existing performance tool provides the necessary functionality to answer this question. This paper reports on the design and preliminary implementation of a tool which views each execution as a scientific experiment and provides the functionality to answer questions about a program's performance which span more than a single execution or environment. We report results of using our tool with an actual performance tuning study and with a scientific application run in changing environments. Our goal is to use historic program performance data to develop techniques for parallel program performance diagnosis.
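The sketch referenced from the wrapping.ps.Z entry above follows. It is a loose modern analogue of binary wrapping, not the paper's original mechanism: it uses the GNU linker's --wrap option, and the routine and file names (compute, vendor.c) are hypothetical. Callers of the object-code routine are redirected, without modification, to user-written code that brackets the original routine with tracing.

    /* vendor.c -- stands in for proprietary object code whose source is
     * unavailable; in practice only vendor.o would exist. */
    int compute(int x)
    {
        return x * x;
    }

    /* wrap.c -- user-written wrapper placed around the object-code routine.
     * Build:  cc -c vendor.c
     *         cc -Wl,--wrap=compute wrap.c vendor.o -o demo
     * With --wrap, references to compute() from other object files resolve
     * to __wrap_compute(), and __real_compute() resolves to the original. */
    #include <stdio.h>

    extern int compute(int x);           /* normal call site prototype */
    extern int __real_compute(int x);    /* the unmodified routine */

    int __wrap_compute(int x)
    {
        printf("compute(%d) called\n", x);        /* entry trace */
        int result = __real_compute(x);           /* run the wrapped routine */
        printf("compute returned %d\n", result);  /* exit trace */
        return result;
    }

    int main(void)
    {
        return compute(7) == 49 ? 0 : 1;  /* this call goes through the wrapper */
    }

No caller is edited or recompiled to enable the tracing; only the link step changes, which is the essence of the wrapping idea.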