active_catalogs.ps.Z "Distributed Active Catalogs and Meta-Data Caching in Descriptive Name Services" by Joann J. Ordille and Barton P. Miller Best paper at the 13th International IEEE Conference on Distributed Computing Systems, Pittsburgh, May 1993, pp. 120-129. Also Technical Report 1118, November 1992. joann@research.att.com, bart@cs.wisc.edu Abstract: Today's global internetworks challenge the ability of name ser- vices and other information services to locate data quickly. We introduce a distributed active catalog and meta-data caching for optimizing queries in this environment. Our active catalog constrains the search space for a query by returning a list of data repositories where the answer to the query is likely to be found. Components of the catalog are distributed indices that isolate queries to parts of the network, and smart algorithms for limiting the search space by using semantic, syntactic, or structural constraints. Meta-data cach- ing improves performance by keeping frequently used characterizations of the search space close to the user, and eliminating active catalog communication and processing costs. When searching for query responses, our techniques contact only the small percentage of the data repositories with actual responses, resulting in search times of a few seconds. Even a request to locate a person with a popular name in the global name space can often be answered in seconds. We imple- mented a distributed active catalog and meta-data caching in a proto- type descriptive name service called Nomenclator. We present perfor- mance results for Nomenclator in a search space of 1000 data reposi- tories. array_distrib.ps.Z Optimizing Array Distributions in Data-Parallel Programs" Krishna Kunchithapadam and Barton P. Miller 7th Annual Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY. August 1994. krishna@cs.wisc.edu, bart@cs.wisc.edu Abstract: Data parallel programs are sensitive to the distribution of data across processor nodes. We formulate the reduction of inter-node communication as an optimization on a colored graph. We present a technique that records the run time inter-node communication caused by the movement of array data between nodes during execution and builds the colored graph, and provide a simple algorithm that optimizes the coloring of this graph to describe new data distributions that would result in less inter-node communication. From the distribution information, we write compiler pragmas to be used in the application program. Using these techniques, we traced the execution of a real data-parallel application (written in CM Fortran) and collected the array access information. We computed new distributions that should provide an overall reduction in program execution time. However, compiler optimizations and poor interfaces between the compiler and runtime systems counteracted any potential benefit from the new data layouts. In this context, we provide a set of recommendations for compiler writers that we think are needed to both write efficient programs and to build the next generation of tools for parallel systems. The techniques that we have developed form the basis for future work in monitoring array access patterns and generate on-the-fly redistributions of arrays. breakpoint.ps.Z "Breakpoints and Halting in Distributed Programs", Barton P. 
Miller and Jong-Deok Choi 8th Int'l Conf on Distributed Computing Systems (1988) bart@cs.wisc.edu, jdchoi@watson.ibm.com Abstract: Interactive debugging requires that the programmer be able to halt a program at interesting points in its execution. This paper presents an algorithm for halting a distributed program in a consistent state, and presents a definition of distributed breakpoints with an algorithm for implementing the detection of these breakpoints. The Halting Algorithm extends Chandy and Lamport's algorithm for recording global state and solves the problem of processes that are not fully connected or frequently communicating. The definition of distributed breakpoints is based on those events that can be detected in a distributed system. Events that can be partially ordered are detectable and form the basis for the breakpoint predicates, and from the breakpoint definition comes the description of an algorithm that can be used in a distributed debugger to detect these breakpoints. characterization.ps.Z "What are Race Conditions? Some Issues and Formalizations" Robert H.B. Netzer and Barton P. Miller ACM Letters on Programming Languages and Systems, March 1992. (Also available as Univ. of Wisconsin-Madison Comp Sci Dept TR#1014) netzer@cs.wisc.edu, bart@cs.wisc.edu Abstract: In shared-memory parallel programs that use explicit synchronization, race conditions result when accesses to shared memory are not properly synchronized. Race conditions are often considered to be manifestations of bugs since their presence can cause the program to behave unexpectedly. Unfortunately, there has been little agreement in the literature as to precisely what constitutes a race condition. Two different notions have been implicitly considered: one pertaining to programs intended to be deterministic (which we call general races) and the other to nondeterministic programs containing critical sections (which we call data races). However, the differences between general races and data races have not yet been recognized. This paper examines these differences by characterizing races using a formal model and exploring their properties. We show that two variations of each type of race exist: feasible general races and data races capture the intuitive notions desired for debugging and apparent races capture less accurate notions implicitly assumed by most dynamic race detection methods. We also show that locating feasible races is an NP-hard problem, implying that only the apparent races, which are approximations to feasible races, can be detected in practice. The complexity of dynamically locating apparent races depends on the type of synchronization used by the program. Apparent races can be exhaustively located efficiently only for weak types of synchronization that are incapable of implementing mutual exclusion. This result has important implications since we argue that debugging general races requires exhaustive race detection and is inherently harder than debugging data races (which requires only partial race detection). Programs containing data races can therefore be efficiently debugged by locating certain easily identifiable races. In contrast, programs containing general races require more complex debugging techniques. cohrs_thesis.ps.Z "A Specification Language for Multi-Domain Network and Distributed Systems Management" David L. Cohrs August 1991. Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #10xx (Ph.D. 
Thesis) cohrs@Legato.COM

complexity.ps.Z
"On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions" Robert H.B. Netzer and Barton P. Miller. Univ. of Wisconsin-Madison Comp Sci Dept Technical Report #908, January 1990. A condensed version of this paper appears in Proc. of the 1990 Int'l Conference on Parallel Processing, St Charles, IL, August 1990. netzer@cs.wisc.edu, bart@cs.wisc.edu
Abstract: This paper presents results on the complexity of computing event orderings for shared-memory parallel program executions. Given a program execution, we formally define the problem of computing orderings that the execution \fImust have\fP exhibited or \fIcould have\fP exhibited, and prove that computing such orderings is an intractable problem. We present a formal model of a shared-memory parallel program execution on a sequentially consistent processor, and discuss event orderings in terms of this model. Programs are considered that use fork/join and either counting semaphores or event style synchronization. We define a \fIfeasible program execution\fP to be an execution of the program that performs the same events as an observed execution, but which may exhibit different orderings among those events. Any program execution exhibiting the same data dependences among the shared data as the observed execution is feasible. We define several relations that capture the orderings present in all (or some) of these feasible program executions. The \fIhappened-before\fP, \fIconcurrent-with\fP, and \fIordered-with\fP relations are defined to show events that execute in a certain order, that execute concurrently, or that execute in either order but not concurrently. Each of these ordering relations is defined in two ways. In the \fImust-have\fP sense they show the orderings that are guaranteed to be present in all feasible program executions, and in the \fIcould-have\fP sense they show the orderings that could potentially occur in at least one feasible program execution due to timing variations. We prove that computing any of the must-have ordering relations is a co-NP-hard problem and that computing any of the could-have ordering relations is an NP-hard problem.

costmodel.ps.Z
"An Adaptive Cost Model for Parallel Program Instrumentation" Jeffrey K. Hollingsworth and Barton P. Miller. Euro-Par '96, Lyon, France, August 1996. hollings@cs.umd.edu, bart@cs.wisc.edu
Abstract: Software based instrumentation is frequently used to measure the performance of parallel and distributed programs. However, using software instrumentation can introduce serious perturbation of the program being measured. In this paper we present a new data collection cost system that provides programmers with feedback about the impact data collection is having on their application. In addition, we introduce a technique that permits programmers to define the perturbation their application can tolerate, and we then regulate the amount of instrumentation to ensure that threshold is not exceeded. We also describe an implementation of the cost model and present results from using it to measure the instrumentation overhead for several real applications.
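
The costmodel.ps.Z entry above describes regulating instrumentation so that its predicted perturbation stays under a user-chosen threshold. The fragment below is only a rough C sketch of that idea, not the paper's cost system: the per-event cost constant, the 5% limit, and the re-check interval are invented for illustration.

    /* Illustrative sketch only: throttle instrumentation against a
     * user-supplied overhead threshold.  Constants are hypothetical. */
    #include <stdio.h>
    #include <sys/time.h>

    #define EVENT_COST_SEC 0.000002   /* assumed cost of one instrumentation event */
    #define OVERHEAD_LIMIT 0.05       /* tolerated overhead: 5% of elapsed time */

    static double now_sec(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void) {
        double start = now_sec();
        long events = 0;
        int enabled = 1;

        for (long i = 0; i < 10 * 1000 * 1000; i++) {
            if (enabled)
                events++;                      /* stands in for a real instrumentation probe */

            if ((i & 0xFFFFF) == 0) {          /* periodically re-check the overhead budget */
                double run  = now_sec() - start;
                double cost = events * EVENT_COST_SEC;   /* predicted perturbation so far */
                enabled = run > 0 && cost / run < OVERHEAD_LIMIT;
            }
        }

        double run = now_sec() - start;
        printf("events counted: %ld, predicted overhead: %.1f%%\n",
               events, 100.0 * events * EVENT_COST_SEC / run);
        return 0;
    }
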
db_challenge.ps.Z
"Database Challenges in Global Information Systems" Joann J. Ordille and Barton P. Miller. ACM SIGMOD International Conference on the Management of Data, Washington, DC, May 1993, pp. 403-407. joann@research.att.com, bart@cs.wisc.edu

detecting.ps.Z
"Detecting Data Races in Parallel Program Executions" Robert H.B. Netzer and Barton P. Miller. Advances in Languages and Compilers for Parallel Processing, Research Monographs in Parallel and Distributed Computing, MIT Press, 1991. netzer@cs.wisc.edu, bart@cs.wisc.edu
Abstract: Several methods currently exist for detecting data races in an execution of a shared-memory parallel program. Although these methods address an important aspect of parallel program debugging, they do not precisely define the notion of a data race. As a result, it is not possible to precisely state which data races are detected, nor is the meaning of the reported data races always clear. Furthermore, these methods can sometimes generate false data race reports. They can determine whether a data race was exhibited during an execution, but when more than one data race is reported, only limited indication is given as to which ones are real. This paper addresses these two issues. We first present a model for reasoning about data races, and then present a two-phase approach to data race detection that attempts to validate the accuracy of each detected data race. Our model of data races distinguishes among those data races that actually occurred during an execution (\fIactual\fP data races), those that could have occurred because of timing variations (\fIfeasible\fP data races), and those that appeared to have occurred (\fIapparent\fP data races). The first phase of our two-phase approach to data race detection is similar to previous methods and detects a set of data race candidates (the apparent data races). We prove that this set always contains all actual data races, although it may contain other data races, both feasible and infeasible. Unlike previous methods, we then employ a second phase which validates the apparent data races by attempting to determine which ones are feasible. This second phase requires no more information than previous methods collect, and involves making a conservative estimate of the data dependences among the shared data to determine how these dependences may have constrained alternate orderings potentially exhibited by the execution. Each apparent data race can then be characterized as either being feasible, or as belonging to a set of apparent data races where at least one is feasible.

dyninst.ps.Z
"Dynamic Program Instrumentation for Scalable Performance Tools" Jeffrey K. Hollingsworth, Barton P. Miller, and Jon Cargille. Scalable High Performance Computing Conference, Knoxville, May 1994. hollings@cs.umd.edu, bart@cs.wisc.edu, jon@cs.wisc.edu
Abstract: In this paper, we present a new technique called dynamic instrumentation that provides efficient, scalable, yet detailed data collection for large-scale parallel applications. Our approach is unique because it defers inserting any instrumentation until the application is in execution. We can insert or change instrumentation at any time during execution. Instrumentation is inserted by modifying the application's binary image. This permits us to insert only the instrumentation that is necessary for the current analysis being performed or visualization presented. As a result, our technique collects several orders of magnitude less data than traditional data collection approaches. We have implemented a prototype of our dynamic instrumentation on the Thinking Machines CM-5, and present results for several real applications. In addition, we include recommendations to operating system designers, compiler writers, and computer architects about the features necessary to permit efficient monitoring of large-scale parallel systems.
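
The dyninst.ps.Z entry above (and several later entries) rely on inserting counters and timers into a running program and sampling them periodically instead of tracing every event. As a loose illustration only, the C fragment below imitates those counter/timer primitives with ordinary hand-inserted calls; the function names and the instrumented routine are invented, and nothing here performs actual binary modification.

    /* Sketch only: counter/timer primitives of the kind dynamic
     * instrumentation would splice into a running program, here simply
     * called by hand.  Periodic sampling replaces per-event tracing. */
    #include <stdio.h>
    #include <sys/time.h>

    static long   call_count;      /* how many times the "event" occurred    */
    static double call_time;       /* accumulated seconds spent in the event */
    static double t_started;

    static double now_sec(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    static void timer_start(void) { t_started = now_sec(); }
    static void timer_stop(void)  { call_time += now_sec() - t_started; call_count++; }

    static void do_send(void) {    /* stands in for an instrumented routine */
        volatile long spin;
        for (spin = 0; spin < 100000; spin++) ;
    }

    int main(void) {
        for (int i = 0; i < 50; i++) {
            timer_start();         /* code a tool would insert at routine entry */
            do_send();
            timer_stop();          /* ...and at routine exit */

            if (i % 10 == 9)       /* periodic sample, e.g. shipped to the tool */
                printf("sample: %ld calls, %.6f s total\n", call_count, call_time);
        }
        return 0;
    }
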
edcu.ps.Z "The Integration of Application and System Based Metrics in a Parallel Program Performance Tool" Jeffrey K. Hollingsworth, R. Bruce Irvin and Barton P. Miller ACM Symp on Prin. and Practice of Parallel Processing, May 1991. hollings@cs.umd.edu, rbi@informix.com, bart@cs.wisc.edu Abstract: The IPS-2 parallel program measurement tools provide performance data from application programs, the operating system, hardware, network, and other sources. Previous versions of IPS-2 allowed programmers to collect information about an application based only on what could be collected by software instrumentation inserted into the program (and system call libraries). We have developed an open interface, called the ``external time histogram'', providing a graceful way to include external data from many sources. The user can tell IPS-2 of new sources of performance data through an extensible metric description language. The data from these external sources is automatically collected when the application program is run. IPS-2 provides a library to simplify constructing the external data collectors. The new version of IPS-2 can measure shared-memory and message-passing parallel programs running on a heterogeneous collection of machines. Data from C or Fortran programs, and data from simulations can be processed by the same tool. As a result of including the new external performance data, IPS-2 now can report on a whole new set of performance problems. We describe the results of using IPS-2 on two real applications: a shared-memory database join utility, and a multi-processor interconnection network simulator. Even though these applications previously went through careful tuning, we were able to precisely identify performance problems and extract additional performance improvements of about 30%. experience.ps.Z "Experience with Techniques for Refining Data Race Detection" Robert H.B. Netzer and Barton P. Miller May 1992. netzer@cs.wisc.edu, bart@cs.wisc.edu Abstract: Dynamic data race detection is a critical part of debugging shared-memory parallel programs. The races that can be detected must be \fIrefined\fP to filter out false alarms and pinpoint only those that are direct manifestations of bugs. Most race detection methods can report false alarms because of imprecise run-time information and because some races are caused by others. To overcome this problem, race refinement uses whatever run-time information is available to speculate on which of the detected races should be reported. In this paper we report on experimental tests of two refinement techniques previously developed by us. Our goal was to determine whether good refinement is possible, and how much run-time information is required. We analyzed two sets of programs, one set written by others (which they had tested and believed to be race-free but which in fact had subtle races) and another set written by us (in which we introduced more complex races). We performed race detection and refinement on executions of these programs, and recorded both the global event ordering and an approximate ordering recorded without a global clock. We found that in all the programs written by others, accurate refinement was possible even without the global ordering. In the other programs, accurate refinement was also possible but required the global ordering. These results suggest that our techniques refine races accurately, and lead a programmer directly to race-causing bugs. 
They also suggest that race detection methods should be able to record and use either global or approximate event orderings, depending on the severity of the races being debugged.

fuzz.ps.Z
"An Empirical Study of the Reliability of UNIX Utilities" Barton P. Miller, Lars Fredriksen, and Bryan So. Appears in Comm. of the ACM, December 1990. bart@cs.wisc.edu
Operating system facilities, such as the kernel and utility programs, are assumed to be reliable. Recent experience has led us to question the robustness of our operating system utility programs under unusual input streams. Specifically, we were interested in a basic form of testing: whether the utilities would crash or hang (infinite loop) in response to unusual input. We constructed a set of tools to test this form of reliability. Almost 90 utilities were tested on each of seven versions of the UNIX operating system. As a result of our tests, we were able to crash 25-32% of the utilities that we tested, including commonly used utilities such as editors, shells, and document formatters. These were programs that either terminated abnormally, generating a core dump, or programs that had infinite loops. We then examined each utility program that crashed and identified the cause. These results are then categorized by the cause of the crash. We describe the cause of the crashes, the programming practices that led to the errors, and make some suggestions on how to avoid these problems in the future.

fuzz-revisited.ps.Z
"Fuzz Revisited: A Re-examination of the Reliability of UNIX Utilities and Services" Barton P. Miller, David Koski, Cjin Pheow Lee, Vivekananda Maganty, Ravi Murthy, Ajitkumar Natarajan, and Jeff Steidl. Computer Sciences Technical Report 1268, April 1995. bart@cs.wisc.edu Also appears (in German translation) in iX, September 1995.
We have tested the reliability of a large collection of basic UNIX utility programs, X-Window applications and servers, and network services. We used a simple testing method of subjecting these programs to a random input stream. Our testing methods and tools are largely automatic and simple to use. We tested programs on nine versions of the UNIX operating system, including seven commercial systems and the freely-available GNU utilities and Linux. We report which programs failed on which systems, and identify and categorize the causes of these failures. The result of our testing is that we can crash (with core dump) or hang (infinite loop) over 40% (in the worst case) of the basic programs and over 25% of the X-Window applications. We were not able to crash any of the network services that we tested nor any of the X-Window servers. This study parallels our 1990 study (that tested only the basic UNIX utilities); all systems that we compared between 1990 and 1995 noticeably improved in reliability, but still had significant rates of failure. The reliability of the basic utilities from GNU and Linux was noticeably better than that of the commercial systems. We also tested how utility programs checked their return codes from the memory allocation library routines by simulating the unavailability of virtual memory. We could crash almost half of the programs that we tested in this way.
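
The two fuzz entries above test utilities by feeding them random input and watching for crashes and hangs. The C sketch below shows that style of test in miniature, under assumptions of our own: the target program (/usr/bin/sort), the 10-second alarm, and the 100,000-byte input are arbitrary example choices, not the tools or parameters used in the studies.

    /* Simplified sketch of fuzz-style testing: feed random bytes to a
     * utility and see whether it exits normally, dies on a signal, or
     * hangs.  The target program and limits are arbitrary examples. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        const char *target = "/usr/bin/sort";    /* example victim utility */
        int fd[2];
        if (pipe(fd) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                          /* child: run the utility on random stdin */
            dup2(fd[0], STDIN_FILENO);
            close(fd[0]);
            close(fd[1]);
            alarm(10);                           /* crude hang detector: SIGALRM after 10s */
            execl(target, target, (char *)NULL);
            _exit(127);
        }

        close(fd[0]);
        signal(SIGPIPE, SIG_IGN);                /* survive an early crash of the child */
        srandom((unsigned)getpid());
        for (int i = 0; i < 100000; i++) {       /* the random input stream */
            char c = (char)(random() & 0xff);
            if (write(fd[1], &c, 1) != 1)
                break;
        }
        close(fd[1]);

        int status = 0;
        waitpid(pid, &status, 0);
        if (WIFSIGNALED(status))
            printf("%s killed by signal %d%s\n", target, WTERMSIG(status),
                   WTERMSIG(status) == SIGALRM ? " (hang timed out)" : "");
        else
            printf("%s exited with status %d\n", target, WEXITSTATUS(status));
        return 0;
    }

A real harness would iterate this over many utilities and many random seeds and record which inputs caused failures, as the studies above describe.
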
hollingsworth_thesis.ps.Z
"Finding Bottlenecks in Large-scale Parallel Programs" Jeffrey K. Hollingsworth. August 1994. Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #tba (Ph.D. Thesis) hollings@cs.umd.edu
Abstract: This thesis addresses the problem of trying to locate the source of performance bottlenecks in large-scale parallel and distributed applications. Performance monitoring creates a dilemma: identifying a bottleneck necessitates collecting detailed information, yet collecting all this data can introduce serious data collection bottlenecks. At the same time, users are being inundated with volumes of complex graphs and tables that require a performance expert to interpret. I have developed a new approach that addresses both these problems by combining dynamic on-the-fly selection of what performance data to collect with decision support to assist users with the selection and presentation of performance data. The approach is called the W3 Search Model (pronounced W-cubed). To make it possible to implement the W3 Search Model, I have developed a new monitoring technique for parallel programs called Dynamic Instrumentation. The premise of my work is that not only is it possible to do on-line performance debugging, but for large scale parallelism it is mandatory. The W3 Search Model closes the loop between data collection and analysis. Searching for a performance problem is an iterative process of refining the answers to three questions: why is the application performing poorly, where is the bottleneck, and when does the problem occur. To answer the why question, tests are conducted to identify the type of bottleneck (e.g., synchronization, I/O, computation). Answering the where question isolates a performance bottleneck to a specific resource used by the program (e.g., a disk system, a synchronization variable, or a procedure). Answering the when question tries to isolate a bottleneck to a specific phase of the program's execution. Dynamic Instrumentation differs from traditional data collection because it defers selecting what data to collect until the program is running. This permits insertion and alteration of the instrumentation during program execution. It also features a new type of data collection that combines the low data volume of sampling with the accuracy of tracing. Instrumentation to precisely count and time events is inserted by dynamically modifying the binary program. These counters and timers are then periodically sampled to provide intermediate values to the W3 Search Model. Based on this intermediate data, changes are made in the instrumentation to collect more information to further isolate the bottleneck. I have built a prototype implementation of the W3 Search Model and of Dynamic Instrumentation. The prototype runs on Thinking Machine's CM-5 and a network of workstations running PVM. In one study, the tools identified the bottlenecks in several real programs using two to three orders of magnitude less data than traditional techniques. In another study, Dynamic Instrumentation was used to monitor long running programs and introduced less than 10% perturbation. While the W3 Search Model and Dynamic Instrumentation complement each other, they are also useful individually. The W3 Search Model can be applied to existing post-mortem performance tools or even to simulated machines and environments. Dynamic Instrumentation has been used to collect performance data for other uses including visualization. The W3 Search Model and Dynamic Instrumentation are being incorporated into the Paradyn Performance Tools.

improving.ps.Z
"Improving the Accuracy of Data Race Detection" Robert H.B. Netzer and Barton P. Miller In Proc.
of the Third ACM Symposium on Principles and Practice of Parallel Programming, April 1991. netzer@cs.wisc.edu, bart@cs.wisc.edu Abstract: For shared-memory parallel programs that use explicit synchronization, \fIdata race\fP detection is an important part of debugging. A data race exists when concurrently executing sections of code access common shared variables. In programs intended to be data race free, they are sources of nondeterminism usually considered bugs. Previous methods for detecting data races in executions of parallel programs can determine when races occurred, but can report many data races that are artifacts of others and not direct manifestations of program bugs. Artifacts exist because some races can cause others and can also make false races appear real. Such artifacts can overwhelm the programmer with information irrelevant for debugging. This paper presents results showing how to identify non-artifact data races by \fIvalidation\fP and \fIordering\fP. Data race validation attempts to determine which races involve events that either did execute concurrently or could have (called \fIfeasible data races\fP). We show how each detected race can either be guaranteed feasible, or when insufficient information is available, sets of races can be identified within which at least one is guaranteed feasible. Data race ordering attempts to identify races that did not occur only as a result of others. Data races can be partitioned so that it is known whether a race in one partition may have affected a race in another. The \fIfirst\fP partitions are guaranteed to contain at least one feasible data race that is not an artifact of any kind. By combining validation and ordering, the programmer can be directed to those data races that should be investigated first for debugging. install_guide.ps.Z "IPS Installation Guide" Jeff Hollingsworth, Bruce Irvin, Mike McMenemy, and Barton P. Miller Version 5.0. hollings@cs.umd.edu, rbi@informix.com, mcmenemy@sequoia.com, bart@cs.wisc.edu ips.tar This file contains information on obtaining a license and distribution of the IPS-2 parallel programming tools. hollings@cs.umd.edu, rbi@informix.com, bart@cs.wisc.edu isca.ps.Z "Detecting Data Races on Weak Memory Systems" Sarita V. Adve, Mark D. Hill, Barton P. Miller, Robert H.B. Netzer In Proc. of the 18th Intl Symposium on Computer Architecture, May 1991. sarita@cs.wisc.edu, markhill@cs.wisc.edu, netzer@cs.wisc.edu, bart@cs.wisc.edu Abstract: For shared-memory systems, the most commonly assumed programmer's model of memory is sequential consistency. The weaker models of weak ordering, release consistency with sequentially consistent synchronization operations, data-race-free-0, and data-race-free-1 provide higher performance by guaranteeing sequential consistency to only a restricted class of programs - mainly programs that do not exhibit data races. To allow programmers to use the intuition and algorithms already developed for sequentially consistent systems, it is important to determine when a program written for a weak system exhibits no data races. In this paper, we investigate the extension of dynamic data race detection techniques developed for sequentially consistent systems to weak systems. A potential problem is that in the presence of a data race, weak systems fail to guarantee sequential consistency and therefore dynamic techniques may not give meaningful results. 
However, we reason that in practice a weak system will preserve sequential consistency at least until the ``first'' data races since it cannot predict if a data race will occur. We formalize this condition and show that it allows data races to be dynamically detected. Further, since this condition is already obeyed by all proposed implementations of weak systems, the full performance of weak systems can be exploited.

javaperf.ps.Z
"Performance Measurement of Interpreted Programs" Tia Newhall and Barton P. Miller. December 11, 1997.
In an interpreted execution there is an interdependence between the interpreter program's execution and the interpreted application's execution; the implementation of the interpreter determines how the interpreted application is executed, and the interpreted application triggers certain activities in the interpreter program. We present a representational model for describing performance data from an interpreted execution that explicitly represents the interaction between the interpreter and the interpreted application in terms of both the interpreter developer's and the interpreted application developer's view of the execution. We present results of a prototype implementation of a performance tool based on our representational model for measuring the performance of interpreted Java applications and applets. Our prototype uses dynamic instrumentation to measure Java programs starting with unmodified Java .class files and an unmodified Java virtual machine binary. As a demonstration, we use performance data from our tool to tune a Java application program, and as a result, we improved its performance by more than a factor of 3.

kerninst.ps kerninst.ps.Z kerninst.ps.gz kerninst.pdf
"Fine-Grained Dynamic Instrumentation of Commodity Operating System Kernels" Ariel Tamches and Barton P. Miller. tamches@cs.wisc.edu, bart@cs.wisc.edu Third Symp. on Operating Systems Design and Implementation (OSDI), New Orleans, February 1999.
We have developed a technology, fine-grained dynamic instrumentation of commodity kernels, which can splice (insert) dynamically generated code before almost any machine code instruction of a completely unmodified commodity operating system kernel, as it runs. This technology is well-suited to performance profiling, debugging, code coverage, security auditing, runtime code optimizations, and kernel extensions. We have designed and implemented a tool called KernInst that performs dynamic instrumentation on a stock production Solaris 2.5.1 kernel running on an UltraSPARC. We have implemented a kernel performance profiling tool on top of KernInst, and show how it uses dynamic instrumentation to quickly identify bottlenecks in a Web proxy server. We found that 10-20% of the proxy server's time was due to the Unix file system semantics of synchronizing file changes when updating meta-data. We also found that a further 20-30% of the proxy server's time was spent translating file names when opening or creating files.

kerninst-blackberry.ps kerninst-blackberry.ps.Z kerninst-blackberry.ps.gz kerninst-blackberry.pdf
"Using Dynamic Kernel Instrumentation for Kernel and Application Tuning" Ariel Tamches and Barton P. Miller. tamches@cs.wisc.edu, bart@cs.wisc.edu Submitted for publication, November 1998.
Abstract: We have designed a new technology, fine-grained dynamic instrumentation of commodity operating system kernels, which can insert runtime-generated code at almost any machine code instruction of an unmodified operating system kernel.
This technology is ideally suited for kernel performance profiling, debugging, code coverage, runtime optimization, and extensibility. We have written a tool called KernInst that implements dynamic instrumentation on a stock production Solaris 2.5.1 kernel running on an UltraSparc CPU. We have written a kernel performance profiler on top of KernInst. Measuring kernel performance has a two-way benefit; it can suggest optimizations to both the kernel and to applications that spend much of their time in kernel code. In this paper, we present our experiences using KernInst to identify kernel bottlenecks when running a web proxy server. By profiling kernel routines, we were able to understand performance bottlenecks inherent in the proxy's disk cache organization. We used this understanding to make two changes (one to the kernel and one to the application) that cumulatively reduce the percentage of elapsed time that the proxy spends opening disk cache files for writing from 40% to 13%.

labyrinth.ps.Z
"Lost in a Labyrinth of Workstations" Joann J. Ordille and Barton P. Miller. Third Workshop on Workstation Operating Systems, Key Biscayne, April 1992, pp. 52-55. joann@research.att.com, bart@cs.wisc.edu

mapping.ps.Z
"Mechanisms for Mapping High-Level Parallel Performance Data" R. Bruce Irvin and Barton P. Miller. ICPP Workshop on Challenges for Parallel Processing, Chicago, August 1996. rbi@informix.com, bart@cs.wisc.edu
Abstract: A primary problem in the performance measurement of high-level parallel programming languages is to map low-level events to high-level programming constructs. We discuss several aspects of this problem and present three methods with which performance tools can map performance data and provide accurate performance information to programmers. In particular, we discuss static mapping, dynamic mapping, and a new technique that uses a data structure called the set of active sentences. Because each of these methods requires cooperation between compilers and performance tools, we describe the nature and amount of cooperation required. The three mapping methods are orthogonal; we describe how they should be combined in a complete tool. Although we concentrate on mapping upward through layers of abstraction, our techniques are independent of mapping direction.
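
The mapping.ps.Z entry above is about charging low-level measurements to the high-level constructs that caused them. Purely as an illustration of the static-mapping flavor of that idea, the C fragment below sums invented low-level timing samples upward through a fixed routine-to-construct table; all routine names, constructs, and times are made up.

    /* Toy illustration of static upward mapping: charge time measured in
     * low-level runtime routines to the high-level source constructs that
     * generated them.  Routines, constructs, and times are invented. */
    #include <stdio.h>
    #include <string.h>

    struct sample  { const char *routine;   double seconds; };
    struct mapping { const char *routine;   const char *construct; };

    static const struct mapping map[] = {
        { "lib_shift_left", "A = CSHIFT(B, 1)  (array assignment)" },
        { "lib_copy_block", "A = CSHIFT(B, 1)  (array assignment)" },
        { "lib_reduce_sum", "S = SUM(A)        (reduction)"        },
    };

    static const struct sample samples[] = {   /* e.g. from an instrumented runtime library */
        { "lib_shift_left", 1.20 },
        { "lib_reduce_sum", 0.45 },
        { "lib_copy_block", 0.80 },
        { "lib_shift_left", 0.95 },
    };

    static const char *construct_of(const char *routine) {
        for (size_t m = 0; m < sizeof(map) / sizeof(map[0]); m++)
            if (strcmp(map[m].routine, routine) == 0)
                return map[m].construct;
        return "(unmapped low-level activity)";
    }

    int main(void) {
        const char *names[16];
        double      totals[16];
        int         n = 0;

        /* aggregate each low-level sample under its high-level construct */
        for (size_t s = 0; s < sizeof(samples) / sizeof(samples[0]); s++) {
            const char *c = construct_of(samples[s].routine);
            int k;
            for (k = 0; k < n && strcmp(names[k], c) != 0; k++) ;
            if (k == n) { names[n] = c; totals[n] = 0.0; n++; }
            totals[k] += samples[s].seconds;
        }

        for (int k = 0; k < n; k++)
            printf("%-40s %5.2f s\n", names[k], totals[k]);
        return 0;
    }
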
mdl.ps mdl.ps.Z mdl.ps.gz mdl.ps.pdf
"MDL: A Language and Compiler for Dynamic Program Instrumentation" Jeffrey K. Hollingsworth, Barton P. Miller, Marcelo J. R. Goncalves, Oscar Naim, Zhichen Xu and Ling Zheng. 1997 Int'l Conf. on Parallel Architectures and Compilation Techniques, November 1997, San Francisco, California. hollings@cs.umd.edu, {bart,mjrg,naim,zhichen,lzheng}@cs.wisc.edu
Abstract: We use a form of dynamic code generation, called dynamic instrumentation, to collect data about the execution of an application program. Dynamic instrumentation allows us to instrument running programs to collect performance and other types of information. The instrumentation code is generated incrementally and can be inserted and removed at any time. We base our instrumentation on a machine independent language and compiling system that currently runs on the SPARC, PA-RISC, Power2, Alpha, and x86 architectures. Specifications of what data to collect are written in this specialized language, called the Metric Description Language, that is part of the Paradyn Parallel Performance Tools. This language allows platform-independent descriptions of how to collect performance data. It also provides a concise way to specify how to constrain performance data to particular resources such as modules, procedures, nodes, files, or message channels (or combinations of these resources). MDL is less general than a procedural programming language (so compiling is faster and simpler), but more general than using specialization. Our dynamic instrumentation environment includes a compiler and code generation, structural analysis of the binary, and an instrumentation manager that allows code to be inserted and removed from the running program. We are able to operate on optimized code on all of our currently supported platforms.

metrics.ps.Z
"Parallel Program Performance Metrics: A Comparison and Validation" Jeffrey K. Hollingsworth and Barton P. Miller. Supercomputing '92, November 1992. hollings@cs.umd.edu, bart@cs.wisc.edu
Abstract: There are many metrics designed to assist in the performance debugging of large-scale parallel applications. We describe a new technique, called True Zeroing, that permits direct quantitative comparison of the guidance supplied by these metrics on real applications. We apply this technique to three programs that include both numeric and symbolic applications. We compare three existing metrics: gprof, Critical Path, and Quartz/NPT, and several new variations. Critical Path provided the best overall guidance, but it was not infallible. We also include a set of recommendations to tool builders based on the experience gained during our case study.

msgtracing.ps.Z
"Optimal Tracing and Replay for Debugging Message-Passing Parallel Programs" Robert H.B. Netzer and Barton P. Miller. netzer@cs.wisc.edu, bart@cs.wisc.edu Supercomputing '92, November 1992.
Abstract: A common debugging strategy involves re-executing a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of variations in message latencies and process scheduling, different runs on the given input may produce different results. This non-repeatability is a serious debugging problem, since an execution cannot always be reproduced to track down bugs. This paper presents a technique for tracing and replaying message-passing programs for debugging. Our technique is \fIoptimal\fP in the common case and has good performance in the worst case. By making run-time tracing decisions, we trace only a fraction of the total number of messages, gaining two orders of magnitude reduction over traditional techniques which trace every message. Experiments indicate that only 1% of the messages often need be traced. These traces are sufficient to provide replay, allowing an execution to be reproduced any number of times for debugging. Our work is novel in that we use run-time decisions to detect and trace only those messages that introduce nondeterminacy. With our strategy, large reductions in trace size allow long-running programs to be replayed that were previously unmanageable. In addition, the reduced tracing requirements alleviate tracing bottlenecks, allowing executions to be debugged with substantially lower execution-time overhead.

multiapp.ps.Z
"Multi-Application Support in a Parallel Program Performance Tool" R. Bruce Irvin and Barton P. Miller. IEEE Parallel & Distributed Technology, 2(1), pp. 40-50, Spring 1994. rbi@informix.com, bart@cs.wisc.edu
Abstract: Program performance measurement tools have proven to be useful for tuning single, isolated, parallel and distributed applications.
However, large-scale parallel machines and heterogeneous networks often do not allow for such isolated execution, much less isolated measurement. Performance measurement tools should allow users to study workload scheduling policies, resource competition among application programs, client/server interactions in distributed systems, and comparisons of application programs running on multiple hardware platforms. To enable and encourage such studies, we have extended the IPS-2 parallel program measurement tools to support the analysis of multiple applications (and multiple runs of the same application) in a single measurement session. This multi-application support allows the user to study each application as a logically separate entity, study groupings of the applications based on their physical location, or study the entire collection of applications. We used the new multi-application support in three case studies. In these studies we examined (1) the effects of clock precision on the quality of performance data, (2) the effects of gang scheduling on competing parallel applications, and (3) the performance interaction of client processes and server in a database system. The multi-application support allowed quick comparison of different versions of a program, with a concrete visual and numeric comparison. Also, it directly showed performance dependences between parallel applications. netzer_thesis.ps.Z "Race Condition Detection for Debugging Shared-Memory Parallel Programs" Robert H.B. Netzer Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #1039 (Ph.D. Thesis) netzer@cs.wisc.edu Abstract: This thesis addresses theoretical and practical aspects of the dynamic detecting and debugging of race conditions in shared-memory parallel programs. To reason about race conditions, we present a formal model that characterizes actual, observed, and potential behaviors of the program. The actual behavior precisely represents the program execution, the observed behavior represents partial information that can be reasonably recorded, and the potential behavior represents alternate executions possibly allowed by nondeterministic timing variations. These behaviors are used to characterize different types of race conditions, \fIgeneral races\fP and \fIdata races\fP, which pertain to different classes of parallel programs and require different detection techniques. General races apply to programs intended to be deterministic; data races apply to nondeterministic programs containing critical sections. We prove that, for executions using synchronization powerful enough to implement two-process mutual exclusion, locating every general race or data race is an NP-hard problem. However, for data races, we show that detecting only a subset of all races is sufficient for debugging. We also prove that, for weaker types of synchronization, races can be efficiently located. We present post-mortem algorithms for detecting race conditions as accurately as possible, given the constraint of limited run-time information. We characterize those races that are direct manifestations of bugs and not artifacts caused by other races, imprecise run-time traces (causing false races to appear real), or unintentional synchronization (caused by shared-memory references). Our techniques analyze the observed behavior to conservatively locate races that either did occur or had the potential of occurring, and that were unaffected by any other race in the execution. 
Finally, we describe a prototype data race detector that we used to analyze a sample collection of parallel programs. Experiments indicate that our techniques effectively pinpoint non-artifact races, directing the programmer to parts of the execution containing direct manifestations of bugs. In all programs analyzed, our techniques reduced hundreds to thousands of races down to four or fewer that required investigation.

nv.ps.Z
"A Performance Tool for High-Level Parallel Programming Languages" R. Bruce Irvin and Barton P. Miller. Programming Environments for Massively Parallel Distributed Systems, K.M. Decker, R.M. Rehmann editors, Birkhauser Verlag, pp. 299-314, 1994. rbi@informix.com, bart@cs.wisc.edu
Abstract: Users of high-level parallel programming languages require accurate performance information that is relevant to their source code. Furthermore, when their programs cause performance problems at the lowest levels of their hardware and software systems, programmers need to be able to peel back layers of abstraction to examine low-level problems while maintaining references to the high-level source code that ultimately caused the problem. In this paper, we present NV, a model for the explanation of performance information for programs built on multiple levels of abstraction. In NV, a level of abstraction includes a collection of nouns (code and data objects), verbs (activities), and performance information measured for the nouns and verbs. Performance information is mapped from level to level to maintain the relationships between low-level activities and high-level code, even when such relationships are implicit. We have used the NV model to build ParaMap, a performance tool for the CM Fortran language that has, in practice, guided us to substantial improvements in real CM Fortran applications. We describe the design and implementation of our tool and show how its simple tabular and graphical performance displays helped us to find performance problems in two applications. In each case, we found that performance information at all levels was most useful when related to parallel CM Fortran arrays, and that we could subsequently reduce each application's execution time by more than half.

nv2.ps.Z
"Mapping Performance Data for High-Level and Data Views of Parallel Program Performance" Int'l Conference on Supercomputing, Philadelphia, May 1996. R. Bruce Irvin and Barton P. Miller. rbi@informix.com, bart@cs.wisc.edu
Abstract: Programs written in high-level parallel languages need profiling tools that provide performance data in terms of the semantics of the high-level language. But high-level performance data can be incomplete when the cause of a performance problem cannot be explained in terms of the semantics of the language. We also need the ability to view the performance of the underlying mechanisms used by the language and correlate the underlying activity to the language source code. The key technique for providing these performance views is the ability to map low-level performance data up to the language abstractions. We identify the various kinds of mapping information that needs to be gathered to support multiple views of performance data and describe how we can mine mapping information from the compiler and run-time environment. We also describe how we use this information to produce performance data at the higher levels, and how we present this data in terms of both the code and parallel data structures.
We have developed an implementation of these mapping techniques for the data parallel CM Fortran language running on the TMC CM-5. We have augmented the Paradyn Parallel Performance Tools with these mapping and high-level language facilities and used them to study several real data parallel Fortran (CM Fortran) applications. Our mapping and high-level language techniques allowed us to quickly understand these applications and modify them to obtain significant performance improvements.

ordille_thesis.ps.Z
"Descriptive Name Services for Large Internets" Joann J. Ordille. November 1993. Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #1208 (Ph.D. Thesis) joann@research.att.com
This thesis addresses the challenge of locating people, resources, and other objects in the global Internet. As the Internet grows beyond a million hosts in tens of thousands of organizations, it is increasingly difficult to locate any particular object. Hierarchical name services are frustrating, because users must guess the unique names for objects or navigate the name space to find information. Descriptive (i.e. relational) name services offer the promise of simple resource location through a non-procedural query language. Users locate resources by describing resource attributes. This thesis makes the promise of descriptive name services real by providing fast query processing in large internets. The key to speed in descriptive query processing is constraining the search space using two new techniques, called an active catalog and meta-data caching. The active catalog constrains the search space for a query by returning a list of data repositories where the answer to the query is likely to be found. Components of the catalog are distributed indices that isolate queries to parts of the network, and smart algorithms for limiting the search space by using semantic, syntactic, or structural constraints. Meta-data caching improves performance by keeping frequently used characterizations of the search space close to the user, thus reducing active catalog communication and processing costs. When searching for query responses, these techniques improve query performance by contacting only the data repositories likely to have actual responses, resulting in acceptable search times. Even a request to locate a person with a popular name in the global name space can often be answered in seconds. The new techniques are integrated with existing data caching techniques through a single mechanism, called a referral. Referrals describe the conditions for using active catalog components, or re-using meta-data or data cache entries. Referrals form a DAG, called a referral graph, that directs query processing. The referral graph provides a uniform framework for identifying the specific system resources (e.g. cache entries or catalog components) that optimize processing for a query. It identifies search spaces that can be combined through intersection or union. It combines indices specific to organizations into a composite index for the global name space. It eliminates components with overlapping functions from query processing. The referral graph also provides a natural source of advice to users. Users who present expensive queries can be asked for specific attribute values to constrain the search for responses and lower processing cost. Referral graphs, an active catalog, meta-data and data caching are combined in the prototype descriptive name service called Nomenclator.
In one measurement study, Nomenclator consistently improved the performance of descriptive queries in X.500. Another measurement study shows how Nomenclator uses a small investment of network bandwidth and server resources to improve the response time for a wide range of query sizes. An analytical modeling study shows that Nomenclator can amortize this investment over many queries to provide an overall reduction in system load and, as a consequence, better scaling and response time. Nomenclator provides a uniform interface to various name services, such as X.500 and the Ninth Edition Unix Name Service from Bell Labs. Nomenclator is simple to extend to include new name services.

overview.ps overview.ps.Z overview.ps.gz overview.ps.pdf
"The Paradyn Parallel Performance Measurement Tools" Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. IEEE Computer 28, 11, (November 1995). {bart,markc,jon,karavan,newhall}@cs.wisc.edu hollings@cs.umd.edu rbi@informix.com The Paradyn project home page is http://www.cs.wisc.edu/~paradyn.
** Note: this paper contains several color postscript pages. It should print acceptably on b/w printers.
Paradyn is a performance measurement tool for parallel and distributed programs. Paradyn uses several novel technologies so that it scales to long running programs (hours or days) and large (thousand node) systems, and automates much of the search for performance bottlenecks. It can provide precise performance data down to the procedure and statement level. Paradyn is based on a dynamic notion of performance instrumentation and measurement. Unmodified executable files are placed into execution and then performance instrumentation is inserted into the application program and modified during execution. The instrumentation is controlled by the Performance Consultant module, which automatically directs the placement of instrumentation. The Performance Consultant has a well-defined notion of performance bottlenecks and program structure, so that it can associate bottlenecks with specific causes and specific parts of a program. Paradyn controls its instrumentation overhead by monitoring the cost of its data collection, limiting its instrumentation to a (user controllable) threshold. The instrumentation in Paradyn can easily be configured to accept new operating system, hardware, and application specific performance data. It also provides an open interface for performance visualization, and a simple programming library to allow these visualizations to interface to Paradyn. Paradyn can gather and present performance data in terms of high-level parallel languages (such as data parallel Fortran) and can measure programs on massively parallel computers, workstation clusters, and heterogeneous combinations of these systems.

paging.ps.Z
"Paging Tradeoffs in Distributed-Shared-Memory Multiprocessors" Douglas C. Burger, Rahmat S. Hyder, Barton P. Miller, and David A. Wood. Supercomputing '94, Washington DC, November 1994. dburger@cs.wisc.edu, hyder@cs.wisc.edu, bart@cs.wisc.edu, david@cs.wisc.edu
Massively parallel processors have begun using commodity operating systems that support demand-paged virtual memory. To evaluate the utility of virtual memory, we measured the behavior of seven shared-memory parallel application programs on a simulated distributed-shared-memory machine.
Our results (i) confirm the importance of gang CPU scheduling, (ii) show that a page-faulting processor should spin rather than invoke a parallel context switch, (iii) show that our parallel programs frequently touch most of their data, and (iv) indicate that memory, not just CPUs, must be ``gang scheduled''. Overall, our experiments demonstrate that demand paging has limited value on current parallel machines because of the applications' synchronization and memory reference patterns and the machines' high page-fault and parallel-context-switch overheads.

paradyn_and_devise.ps.Z
"Integrated Visualization of Parallel Program Performance Data" Karen L. Karavanic, Jussi Myllymaki, Miron Livny, and Barton P. Miller. Environments and Tools for Parallel Scientific Computing, SIAM Press, 1996. {karavan,jussi,miron,bart}@cs.wisc.edu
Performance tuning a parallel application involves integrating performance data from many components of the system, including the message passing library, performance monitoring tool, resource manager, operating system, and the application itself. The current practice of visualizing these data streams using a separate, customized tool for each source is inconvenient from a usability perspective, and there is no easy way to visualize the data in an integrated fashion. We demonstrate a solution to this problem using Devise, a generic visualization tool which is designed to allow an arbitrary number of different but related data streams to be integrated and explored visually in a flexible manner. We display data emanating from a variety of sources side by side in three case studies. First we interface the Paradyn Parallel Performance Tool and Devise, using two simple data export modules and Paradyn's simple visualization interface. We show several Devise/Paradyn visualizations which are useful for performance tuning parallel codes, and which incorporate data from Unix utilities and application output. Next we describe the visualization of trace data from a parallel application running in a Condor cluster of workstations. Finally we demonstrate the utility of Devise visualizations in a study of Condor cluster activity.

paradynPVM.ps.Z
"The Paradyn Parallel Performance Tools and PVM" Barton P. Miller, Jeffrey K. Hollingsworth, and Mark D. Callaghan. in Environments and Tools for Parallel Scientific Computing, SIAM Press, J. Dongarra and B. Tourancheau, eds., 1994. bart@cs.wisc.edu hollings@cs.umd.edu markc@cs.wisc.edu
Paradyn is a performance tool for large-scale parallel applications. By using dynamic instrumentation and automating the search for bottlenecks, it can measure long running applications on production-sized data sets. Paradyn has recently been ported to measure native PVM applications. Programmers run their unmodified PVM application programs with Paradyn. Paradyn automatically inserts and modifies instrumentation during the execution of the application, systematically searching for the causes of performance problems. In most cases, Paradyn can isolate major causes of performance problems, and the part of the program that is responsible for the problem. Paradyn currently runs on the Thinking Machines CM-5, Sun workstations, and PVM (currently only on Suns). It can measure heterogeneous programs across any of these platforms. This paper presents an overview of Paradyn, describes the new facility in PVM that supports Paradyn, and reports experience with PVM applications.
pointers.ps.Z
"The Frequency of Dynamic Pointer References in ``C'' Programs" Barton P. Miller. SIGPLAN Notices 23, 6 (June 1988). bart@cs.wisc.edu
Abstract: A collection of ``C'' programs was measured for the number of dynamic references to pointers. The number of dynamic references to pointers is presented with respect to the total number of instructions a program executes, giving the percentage of pointer references executed in a ``C'' program. The measurements were done on a VAX 11/780 running the Berkeley UNIX operating system. The measured programs were selected by examining the most commonly run programs on the Computer Sciences Department UNIX machines. The measurement process was performed in two steps: (1) the dynamic counting of pointer references, and (2) the counting of the total number of instructions executed by the program.

porting_guide.ps.Z
"Guide to Porting IPS" Bruce Irvin and Bart Miller. Version 5.0. rbi@informix.com, bart@cs.wisc.edu

postwait.ps.Z
"Efficient Race Condition Detection for Shared-Memory Programs with Post/Wait Synchronization" Robert H.B. Netzer and Sanjoy Ghosh. January 1992. 1992 Int'l Conference on Parallel Processing, St Charles, IL, August 1992. netzer@cs.wisc.edu, ghosh@csrd.uiuc.edu
Abstract: Shared-memory parallel programs are often designed to be deterministic, both in their final results and intermediate states. Determinacy is desired because it simplifies debugging and because some applications naturally have deterministic implementations. However, debugging such programs requires a mechanism for locating \fIrace conditions\fP or violations of the intended determinacy when they occur. This paper presents the first precise, efficient algorithm for dynamically detecting race conditions in programs that use non-trivial synchronization. We address Post/Wait synchronization, the most powerful type of synchronization for which efficient race detection is possible. Our algorithm computes the order in which synchronization operations in the execution are guaranteed to have occurred. Using this information race conditions can be detected either post-mortem or on-the-fly. This work answers a previously open question by considering non-trivial synchronization and presenting a precise, efficient algorithm. In contrast, previous work has addressed either simpler types of synchronization, approximations to race detection, or a different (and easier to detect) type of race.

ppd_sigplan.ps.Z
"A Mechanism for Efficient Debugging of Parallel Programs" Barton P. Miller and Jong-Deok Choi. SIGPLAN Conf. on Programming Language Design and Implementation, June 1988, Atlanta, GA, pp. 135-144. Also appears in SIGPLAN Notices 23(7), July 1988. bart@cs.wisc.edu, jdchoi@watson.ibm.com
Abstract: This paper addresses the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors (SMMP). We describe the use of \fIflowback analysis\fP to provide information on causal relationships between events in a program's execution without re-executing the program for debugging. We introduce a mechanism called \fIincremental tracing\fP that, by using semantic analyses of the debugged program, makes the flowback analysis practical with only a small amount of trace generated during execution. We extend flowback analysis to apply to parallel programs and describe a method to detect race conditions in the interactions of the co-operating processes.
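
Many of the race-detection entries in this index (detecting, improving, postwait, ppd_sigplan) rest on computing a happened-before ordering of events and flagging conflicting accesses that are unordered. The C fragment below is a generic vector-clock sketch of that test, not the algorithms from these papers; the two-process trace and the clock values are invented.

    /* Generic vector-clock sketch of the happened-before test underlying
     * dynamic race detection: two accesses to a shared variable race if
     * they conflict and neither clock precedes the other.  Values invented. */
    #include <stdio.h>

    #define NPROC 2

    struct access { int proc; int is_write; int clock[NPROC]; };

    /* returns 1 if clock a is componentwise <= clock b */
    static int leq(const int a[NPROC], const int b[NPROC]) {
        for (int i = 0; i < NPROC; i++)
            if (a[i] > b[i]) return 0;
        return 1;
    }

    int main(void) {
        /* P0 writes the variable at clock (1,0); P1 reads it at (0,1):
         * the two accesses are causally unordered. */
        struct access x = { 0, 1, { 1, 0 } };
        struct access y = { 1, 0, { 0, 1 } };

        int conflict = x.is_write || y.is_write;          /* at least one write  */
        int ordered  = leq(x.clock, y.clock) || leq(y.clock, x.clock);

        if (conflict && !ordered)
            printf("apparent data race: P%d and P%d access the variable concurrently\n",
                   x.proc, y.proc);
        else
            printf("accesses are ordered or non-conflicting\n");
        return 0;
    }

As the detecting and improving entries stress, a detector built this way reports apparent races, which then need validation and ordering to separate the feasible, non-artifact races from the rest.
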
on Programming Languages and Systems (Oct. 91). bart@cs.wisc.edu, jdchoi@watson.ibm.com, netzer@cs.wisc.edu Abstract: Flowback analysis is a powerful technique for debugging programs. It allows the programmer to examine dynamic dependences in a program's execution history without having to re-execute the program. The goal is to present to the programmer a graphical view of the dynamic program dependences. We are building a system, called PPD, that performs flowback analysis while keeping the execution time overhead low. We also extend the semantics of flowback analysis to parallel programs. This paper describes details of the graphs and algorithms needed to implement efficient flowback analysis for parallel programs. Execution time overhead is kept low by recording only a small amount of trace during a program's execution. We use semantic analysis and a technique called "incremental tracing" to keep the time and space overhead low. As part of the semantic analysis, PPD uses a static program dependence graph structure that reduces the amount of work done at compile time and takes advantage of the dynamic information produced during execution time. Parallel programs have been accommodated in two ways. First, the flowback dependences can span process boundaries; i.e., the most recent modification to a variable might be traced to a different process than the one that contains the current reference. The static and dynamic program dependence graphs of the individual processes are tied together with synchronization and data dependence information to form complete graphs that represent the entire program. Second, our algorithms will detect potential data race conditions in accesses to shared variables. The programmer can be directed to the cause of the race condition. PPD is currently being implemented for the C programming language on a Sequent Symmetry shared-memory multiprocessor.

ppopp97.ps.Z "Shared-Memory Performance Profiling" Zhichen Xu, James R. Larus, and Barton P. Miller 1997 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Las Vegas, June 1997. {zhichen,larus,bart}@cs.wisc.edu Abstract: We describe a new approach to finding performance bottlenecks in shared-memory parallel programs and have implemented this approach in the Paradyn Parallel Performance Tools running on the Blizzard fine-grain distributed shared memory system. The main characteristics of our approach are detecting and monitoring data sharing patterns, presenting performance data in a data-centric way, and automating the search for performance bottlenecks. Monitoring data sharing patterns helps detect when a program accesses data in a way that indicates a performance bottleneck. The data-structure-centric view associates performance results with both data structures and memory blocks, complementing Paradyn's usual control-centric view of module/procedure/statement entities. Automated search makes performance tuning a less painful experience. With these mechanisms, we improved the performance of a new, untuned shared-memory application by more than a factor of four.

rbi_thesis.ps.Z "Performance Measurement Tools for High-Level Parallel Programming Languages" R. Bruce Irvin October 1995. Univ. of Wisconsin-Madison Comp Sci Dept Tech Rep #tba (Ph.D. Thesis) rbi@informix.com Abstract: Users of high-level parallel programming languages require accurate performance information that is relevant to their source code.
Furthermore, when their programs experience performance problems at the lowest levels of their hardware and software systems, programmers need to be able to peel back layers of abstraction to examine low-level problems while maintaining references to the high-level source code that ultimately caused the problem. This dissertation addresses the problems associated with providing useful performance data to users of high-level parallel programming languages. In particular, it describes techniques for providing source-level performance data to programmers, for mapping performance data among multiple layers of abstraction, and for providing data-oriented views of performance. We present NV, a model for the explanation of performance information for high-level parallel language programs. In NV, a level of abstraction includes a collection of nouns (code and data objects), verbs (activities), and performance information measured for the nouns and verbs. Performance information is mapped from level to level to maintain relationships between low-level activities and high-level code, even when such relationships are implicit. The NV model has helped us to implement support for performance measurement of high-level parallel language applications in two performance measurement tools (ParaMap and Paradyn). We describe the design and implementation of these tools and show how they provide performance information for CM Fortran programmers. Finally, we present results of measurement studies in which we have used ParaMap and Paradyn to improve the performance of a variety of real CM Fortran applications running on CM-5 parallel computers. In each case, we found that overall performance trends could be observed at the source code level and that both data views and code views of performance were useful. We found that some performance problems could not be explained at the source code level; in these cases, we used the performance tools to examine lower levels of abstraction to find the problems. We found that low-level information was most useful when related to source-level code structures and (especially) data structures. Ultimately, we made relatively small changes to the applications' source code to achieve substantial performance improvements.

ref_guide.ps.Z "IPS System Programmers Reference Guide" Jeffrey K. Hollingsworth, R. Bruce Irvin and Barton P. Miller Version 5.0. hollings@cs.umd.edu, rbi@informix.com, bart@cs.wisc.edu

slack.ps.Z "Slack: A New Performance Metric for Parallel Programs" Jeffrey K. Hollingsworth and Barton P. Miller hollings@cs.umd.edu, bart@cs.wisc.edu Appears as a CS dept. technical report. Critical Path Profiling is a technique that provides guidance to help programmers improve the running time of their programs. However, Critical Path Profiling provides only an upper-bound estimate of the improvement possible in a parallel program execution. In this paper, we present a new metric, called Slack, to complement Critical Path and provide additional information to parallel programmers about the potential impact of making improvements along the critical path.

sigcomm91.ps.Z "Nomenclator Descriptive Query Optimization for Large X.500 Environments" Joann J. Ordille and Barton P. Miller ACM SIGCOMM Symposium on Communications Architectures and Protocols, Zurich, September 1991, pp. 185-196. joann@research.att.com, bart@cs.wisc.edu Abstract: Nomenclator is an architecture for providing efficient descriptive (attribute-based) naming in a large internet environment.
As a test of the basic design, we have built a Nomenclator prototype that uses X.500 as its underlying data repository. X.500 SEARCH queries that previously took several minutes can, in many cases, be answered in a matter of seconds. Our system improves descriptive query performance by trimming branches of the X.500 directory tree from the search. These tree-trimming techniques are part of an active catalog that constrains the search space as needed during query processing. The active catalog provides information about the data distribution (meta-data) to constrain query processing on demand. Nomenclator caches both data (responses to queries) and meta-data (data distribution information, tree-trimming techniques, data access techniques) to speed future queries. Nomenclator relieves users of the need to understand the structure of the name space to locate objects quickly in a large, structured name environment. Nomenclator is a meta-level service that will eventually incorporate other name services in addition to X.500. Its techniques for improving performance should be generally applicable to other naming systems.

steering.ps.Z "Integrating a Debugger and Performance Tool for Steering" Krishna Kunchithapadam and Barton P. Miller Proceedings of the Workshop on Debugging and Performance Tuning for Parallel Computing Systems, Cape Cod, Massachusetts, USA, October 3-5, 1994. krishna@cs.wisc.edu, bart@cs.wisc.edu Abstract: Steering is a performance optimization idiom applicable to many problem domains. It allows control and performance tuning to take place during program execution. Steering emphasizes the optimization and control of the performance of a program using mechanisms that are external to the program. Performance measurement tools and symbolic debuggers already independently provide some of the mechanisms needed to implement a steering tool. In this paper, we describe a configuration that integrates a performance tool, Paradyn, and a debugger to build a steering environment. The steering configuration allows fast prototyping of steering policies and provides support for both interactive and automated steering.

swb.ps.Z "Reliable and Secure UNIX Connection Service", D. Draheim, B.P. Miller, and S. Snyder, 6th Symp. on Reliability in Distributed Software and Database Systems, Williamsburg, VA (March 1987). bart@cs.wisc.edu Distributed programs require a method for processes residing on different machines to identify each other and establish communication. One method is to provide a special connection service to perform this task. A good connection service should be easy to use. It should allow arbitrary processes to connect to each other, as well as help client processes connect to server processes. It should provide location transparency; that is, the programmer should not have to know the network address of a process to connect to it. The connection service should be reliable. It should provide a way for a process to establish the identity of the user associated with the process to which it has connected, and to communicate securely with that process. We have implemented a connection service for Berkeley UNIX that is reliable, available, secure, and easy to use. The connection service achieves ease of use through a simple interface based on the library routine meet(). Meet() allows one process to connect to another by specifying arbitrary names for itself and the other process.
The connection service imposes no naming conventions of its own, so it can be used with most name spaces and naming services. The service is location-transparent. It also provides a routine for posting services. Reliable and available service is provided by replicating connection servers. Each server knows about all pending connection requests. The connection service provides continuous service as long as at least one server is running. Connections can be authenticated by an authentication server that works in cooperation with the connection server. Secure communication is achieved via the RSA public-key encryption algorithm. The connection server was put into regular use in June 1986. Our limited experience indicates that it satisfies an important need of UNIX users.

threads.ps threads.ps.Z threads.ps.gz threads.pdf "Dynamic Instrumentation of Threaded Applications" Zhichen Xu, Barton P. Miller and Oscar Naim. Submitted for publication. zhichen@cs.wisc.edu, bart@cs.wisc.edu, onaim@us.oracle.com The use of threads is becoming commonplace in both sequential and parallel programs. This paper describes our design and initial experience with non-trace-based performance instrumentation techniques for threaded programs. Our goal is to provide detailed performance data while maintaining control of instrumentation costs. We have extended Paradyn's dynamic instrumentation (which can instrument programs without recompiling or relinking) to handle threaded programs. Controlling instrumentation costs means using efficient instrumentation code and avoiding locks in the instrumentation. Our design is based on low-contention data structures. To associate performance data with individual threads, we have all threads share the same instrumentation code and give each thread its own private copy of performance counters or timers (an illustrative sketch of this idea appears after the visualize.ps.Z entry below). The asynchrony in a threaded program poses a major challenge to dynamic instrumentation. To implement time-based metrics on a per-thread basis, we need to instrument thread context switches, which can cause instrumentation code to interleave. Interleaved instrumentation can not only corrupt performance data, but can also cause a scenario we call self-deadlock, in which instrumentation code deadlocks a thread. We introduce thread-conscious locks to avoid self-deadlock, and per-thread virtual CPU timers to reduce the chance of interleaved instrumentation accessing the same performance counter or timer and to reduce the number of expensive timer calls at thread context switches. Our initial implementation is on SPARC Solaris 2.x, including multiprocessor Sun UltraSPARC Enterprise machines. We tested our tool on large multithreaded applications, including the Java Virtual Machine (JVM). We show how our new techniques helped us to speed up a Java native method by 22% and to improve the overall execution time of the JVM interpreting the AppletViewer, driven by a game applet, by 7%.

usersguide.ps.Z "IPS-2 Users Guide" Jeffrey K. Hollingsworth, R. Bruce Irvin and Barton P. Miller Version 5.0. hollings@cs.umd.edu, rbi@informix.com, bart@cs.wisc.edu

visualize.ps.Z "What to Draw? When to Draw? An Essay on Parallel Program Visualization" Barton P. Miller Journal of Parallel and Distributed Computing, vol 18, no 2 (June 1993). bart@cs.wisc.edu In this essay, I share opinions and experiences that I have formed about visualization. These opinions were formed while developing tools for understanding the execution of parallel programs. I try to codify some of the implicit rules that have guided my research efforts.
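To make the per-thread counter idea from the threads entry above concrete, here is a minimal C sketch. It is illustrative only, not Paradyn's actual instrumentation code, and all names in it are hypothetical; it simply shows how every thread can execute the same instrumentation code while updating a private, thread-local counter, so no lock is needed.

    /* Per-thread counters shared by common instrumentation code.
     * Illustrative sketch only; compile with: cc -pthread counters.c */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* Each thread gets its own copy of the counter (thread-local storage
     * via the common __thread extension), so no lock is required. */
    static __thread unsigned long func_calls = 0;

    /* Shared instrumentation code: all threads run the same instructions
     * but each updates only its own counter. */
    static void instrumented_function(void)
    {
        func_calls++;
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            instrumented_function();
        printf("thread %lu made %lu instrumented calls\n",
               (unsigned long)pthread_self(), func_calls);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Because each counter is private to its thread, the shared instrumentation code never contends for it; per-thread timers can be laid out the same way.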
wrapping.ps.Z "Binary Wrapping: A Technique for Instrumenting Object Code" Jon Cargille and Barton P. Miller SIGPLAN Notice, vol 27, no 6 (June 1992). jon@cs.wisc.edu, bart@cs.wisc.edu Abstract: We present a technique, called "binary wrapping", that allows object code routines to be instrumented. Binary wrapping allows tracing to be placed around (and sometimes within) proprietary code, when source code access is difficult or impossible. This technique is based on wrapping user-written code around the object code routine. No modifications are needed to the programs that call the object code routine. Binary wrapping has proven itself invaluable in instrumenting proprietary libraries, and may well be useful in similar circumstances. w3search.ps.Z "Dynamic Control of Performance Monitoring on Large Scale Parallel Systems" Jeffrey K. Hollingsworth and Barton P. Miller International Conference on Supercomputing, Tokyo, July 19-23, 1993. hollings@cs.umd.edu, bart@cs.wisc.edu Abstract: Performance monitoring of large scale parallel computers creates a dilemma: we need to collect detailed information to find performance bottlenecks, yet collecting all this data can introduce serious data collection bottlenecks. At the same time, users are being inundated with volumes of complex graphs and tables that require a performance expert to interpret. We present a new approach called the W cubed Search Model, that addresses both these problems by combining dynamic on-the-fly selection of what performance data to collect with decision support to assist users with the selection and presentation of performance data. Our goal is to provide users with answers to their performance questions and at the same time dramatically reduce the volume of performance data we need to collect. We present a case study describing how a prototype implementation of our technique was able to identify the bottlenecks in three real programs. In addition, we were able to reduce the amount of performance data collected by a factor ranging from 13 to 700 compared to traditional sampling and trace based instrumentation techniques. xman.ps.Z "Experiment Management Support for Performance Tuning" Karen L. Karavanic and Barton P. Miller SC97, San Jose, California, Nov 1997. karavan@cs.wisc.edu, bart@cs.wisc.edu Abstract: The development of a high-performance parallel system or application is an evolutionary process -- both the code and the environment go through many changes during a program's lifetime -- and at each change, a key question for developers is: how and how much did the performance change? No existing performance tool provides the necessary functionality to answer this question. This paper reports on the design and preliminary implementation of a tool which views each execution as a scientific experiment and provides the functionality to answer questions about a program's performance which span more than a single execution or environment. We report results of using our tool with an actual performance tuning study and with a scientific application run in changing environments. Our goal is to use historic program performance data to develop techniques for parallel program performance diagnosis.
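The sketch referenced from the wrapping.ps.Z entry above follows. It is a loose modern analogue of binary wrapping, not the paper's original mechanism: it uses the GNU linker's --wrap option, and the routine and file names (compute, vendor.c) are hypothetical. Callers of the object-code routine are redirected, without modification, to user-written code that brackets the original routine with tracing.

    /* vendor.c -- stands in for proprietary object code whose source is
     * unavailable; in practice only vendor.o would exist. */
    int compute(int x)
    {
        return x * x;
    }

    /* wrap.c -- user-written wrapper placed around the object-code routine.
     * Build:  cc -c vendor.c
     *         cc -Wl,--wrap=compute wrap.c vendor.o -o demo
     * With --wrap, references to compute() from other object files resolve
     * to __wrap_compute(), and __real_compute() resolves to the original. */
    #include <stdio.h>

    extern int compute(int x);           /* normal call site prototype */
    extern int __real_compute(int x);    /* the unmodified routine */

    int __wrap_compute(int x)
    {
        printf("compute(%d) called\n", x);        /* entry trace */
        int result = __real_compute(x);           /* run the wrapped routine */
        printf("compute returned %d\n", result);  /* exit trace */
        return result;
    }

    int main(void)
    {
        return compute(7) == 49 ? 0 : 1;  /* this call goes through the wrapper */
    }

No caller is edited or recompiled to enable the tracing; only the link step changes, which is the essence of the wrapping idea.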