UNDERGRADUATE COMPUTATIONAL SCIENCE AWARD
Name of proposer(s): Michael C. Ferris and O. L. Mangasarian
(ferris@cs.wisc.edu and olvi@cs.wisc.edu)
Title of proposer(s): Professor(s)
Department: Computer Sciences Department
School: University of Wisconsin
Mailing Address: 1210 West Dayton Street, Madison, WI 53706
Phone Number: (608) 262-4281
Fax Number: (608) 262-9777
Project Title: Breast Cancer Diagnosis via Linear Programming
Intended Audience: Undergraduate (Senior/Junior)
Brief Project Description (300 words or less)
--------------------------------------------------------------
Linear programming is used for many applications in business and
operations research. In this project, we show the students how
linear programming can be used for breast cancer diagnosis, based on
experience that we have gained in building a diagnostic system
currently in use at University of Wisconsin Hospitals.
The project will use the Wisconsin Breast Cancer Database made
publicly available by Dr. William H. Wolberg of the University of
Wisconsin Hospitals. The database consists of samples that arrive
periodically as Dr. Wolberg reports his clinical cases and therefore
reflects this chronological grouping of the data. Each data point
represents a confirmed benign or malignant fine needle aspirate (FNA).
Examples of benign and malignant FNA's are given in Figure 1 below.
The database has one line per sample, each with 11 attributes
determined by the oncologist from the FNA, separated by spaces, as
follows:
Field Attribute
1 Sample code number
2 Class: 2 for benign, 4 for malignant
3 Clump Thickness
4 Uniformity of Cell Size
5 Uniformity of Cell Shape
6 Marginal Adhesion
7 Single Epithelial Cell Size
8 Bare Nuclei
9 Bland Chromatin
10 Normal Nucleoli
11 Mitoses
For the purposes of the project, the data has been split into a
"training set" and a "testing set". For both of these, the student
should use attributes 3 to 11 (each of which take on values between 0
and 10) to form a 9-dimensional vector representing each case as a
point in 9-dimensional real space.
In generating the diagnostic program, the student should begin with
the training set consisting of two disjoint point sets (benign and
malignant) in R^9 and generate a discriminant function that
distinguishes between the two sets. This function is subsequently
used on the testing set to determine whether new aspirates are benign
or malignant.
(Include files: malignant.ps, benign.ps
Figure 1: Nuclei of cells of malignant and benign fine needle
aspirates taken from patients breasts
)
Description of how the project uses computational science methods
and techniques in an interdisciplinary context (200 words or less)
------------------------------------------------------------------
The project uses a single linear program to determine the discriminant
function. The size of the database requires a reasonably
sophisticated linear programming code that takes into account key
numerical ideas from linear algebra and mathematical programming;
these ideas are crucial for this application of computational science
to medical diagnosis.
The discriminant function $f(x) = w^T x - \gamma$ has the property
f(x) > 0 if x is malignant, f(x) < 0 if x is benign. This determines
a plane $w^T x = \gamma$ that separates malignant and benign points in
R^9.
$w$ and $\gamma$ are found by minimizing the sum of the distances to
the dividing plane whenever a point is on the incorrect side of the
plane. Using standard techniques, this can be expressed as the
following linear program:
minimize 1/m e^T y + 1/k e^T z
subject to Mw - e\gamma + y >= e,
-Bw + e\gamma + z >= e ,
y >= 0, z >= 0
Here $e$ is an appropriately dimensioned vector of 1's, m and k are
the number of malignant and benign points which are stored as
matrices M and B. The solution determines the plane, and hence the
discriminant function.
( Further details of the approach can be found in:
K.P. Bennett and O.L. Mangasarian (1992).
Robust linear programming discrimination of two linearly inseparable sets.
Optimization Methods and Software 1:23--34. )
Description of how the project will be implemented and
evaluated in the classroom (200 words or less)
------------------------------------------------------
To implement this project, three things are required.
1. The data of the project is contained in two files "training.dat"
and "testing.dat" and these ASCII files can be obtained by anonymous ftp.
2. A linear programming code (publicly available or an implementation
of a version developed in the course).
3. Simple plotting routines.
Currently, MATLAB, a state-of-the-art numerical package, is used for
all of the above. The data can be loaded easily, a satisfactory
linear programming code is easily implemented and the built-in
plotting routines are more than adequate.
The students are evaluated on the following assignments:
1. Formulate the problem as a linear program using the training set
and solve. Construct the discriminant function from this solution.
Test the discriminant function on the testing set by determining the
number of misclassified points on this set.
2. Suppose that the oncologist wants to use only 2 of the 9 attributes
in his diagnosis. Determine which pair of attributes is most
effective in determining a correct diagnosis. Use only the
training set (with these 2 features) to determine the discriminant
function and use the testing set to determine the number of
misclassifed points. Plot the testing points and the separating
plane.
(A postscipt version of the project that is currently being used is
available. This gives pictures of the cells, and a more complete
mathematical description of the problem).