UNDERGRADUATE COMPUTATIONAL SCIENCE AWARD Name of proposer(s): Michael C. Ferris and O. L. Mangasarian (ferris@cs.wisc.edu and olvi@cs.wisc.edu) Title of proposer(s): Professor(s) Department: Computer Sciences Department School: University of Wisconsin Mailing Address: 1210 West Dayton Street, Madison, WI 53706 Phone Number: (608) 262-4281 Fax Number: (608) 262-9777 Project Title: Breast Cancer Diagnosis via Linear Programming Intended Audience: Undergraduate (Senior/Junior) Brief Project Description (300 words or less) -------------------------------------------------------------- Linear programming is used for many applications in business and operations research. In this project, we show the students how linear programming can be used for breast cancer diagnosis, based on experience that we have gained in building a diagnostic system currently in use at University of Wisconsin Hospitals. The project will use the Wisconsin Breast Cancer Database made publicly available by Dr. William H. Wolberg of the University of Wisconsin Hospitals. The database consists of samples that arrive periodically as Dr. Wolberg reports his clinical cases and therefore reflects this chronological grouping of the data. Each data point represents a confirmed benign or malignant fine needle aspirate (FNA). Examples of benign and malignant FNA's are given in Figure 1 below. The database has one line per sample, each with 11 attributes determined by the oncologist from the FNA, separated by spaces, as follows: Field Attribute 1 Sample code number 2 Class: 2 for benign, 4 for malignant 3 Clump Thickness 4 Uniformity of Cell Size 5 Uniformity of Cell Shape 6 Marginal Adhesion 7 Single Epithelial Cell Size 8 Bare Nuclei 9 Bland Chromatin 10 Normal Nucleoli 11 Mitoses For the purposes of the project, the data has been split into a "training set" and a "testing set". For both of these, the student should use attributes 3 to 11 (each of which take on values between 0 and 10) to form a 9-dimensional vector representing each case as a point in 9-dimensional real space. In generating the diagnostic program, the student should begin with the training set consisting of two disjoint point sets (benign and malignant) in R^9 and generate a discriminant function that distinguishes between the two sets. This function is subsequently used on the testing set to determine whether new aspirates are benign or malignant. (Include files: malignant.ps, benign.ps Figure 1: Nuclei of cells of malignant and benign fine needle aspirates taken from patients breasts ) Description of how the project uses computational science methods and techniques in an interdisciplinary context (200 words or less) ------------------------------------------------------------------ The project uses a single linear program to determine the discriminant function. The size of the database requires a reasonably sophisticated linear programming code that takes into account key numerical ideas from linear algebra and mathematical programming; these ideas are crucial for this application of computational science to medical diagnosis. The discriminant function $f(x) = w^T x - \gamma$ has the property f(x) > 0 if x is malignant, f(x) < 0 if x is benign. This determines a plane $w^T x = \gamma$ that separates malignant and benign points in R^9. $w$ and $\gamma$ are found by minimizing the sum of the distances to the dividing plane whenever a point is on the incorrect side of the plane. Using standard techniques, this can be expressed as the following linear program: minimize 1/m e^T y + 1/k e^T z subject to Mw - e\gamma + y >= e, -Bw + e\gamma + z >= e , y >= 0, z >= 0 Here $e$ is an appropriately dimensioned vector of 1's, m and k are the number of malignant and benign points which are stored as matrices M and B. The solution determines the plane, and hence the discriminant function. ( Further details of the approach can be found in: K.P. Bennett and O.L. Mangasarian (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software 1:23--34. ) Description of how the project will be implemented and evaluated in the classroom (200 words or less) ------------------------------------------------------ To implement this project, three things are required. 1. The data of the project is contained in two files "training.dat" and "testing.dat" and these ASCII files can be obtained by anonymous ftp. 2. A linear programming code (publicly available or an implementation of a version developed in the course). 3. Simple plotting routines. Currently, MATLAB, a state-of-the-art numerical package, is used for all of the above. The data can be loaded easily, a satisfactory linear programming code is easily implemented and the built-in plotting routines are more than adequate. The students are evaluated on the following assignments: 1. Formulate the problem as a linear program using the training set and solve. Construct the discriminant function from this solution. Test the discriminant function on the testing set by determining the number of misclassified points on this set. 2. Suppose that the oncologist wants to use only 2 of the 9 attributes in his diagnosis. Determine which pair of attributes is most effective in determining a correct diagnosis. Use only the training set (with these 2 features) to determine the discriminant function and use the testing set to determine the number of misclassifed points. Plot the testing points and the separating plane. (A postscipt version of the project that is currently being used is available. This gives pictures of the cells, and a more complete mathematical description of the problem).