%A W. Nick Street
%T Cancer Diagnosis and Prognosis via Linear-Programming-Based Machine Learning
%D August 1994
%R 94-14
%I COMPUTER SCIENCES DEPARTMENT, UNIVERSITY OF WISCONSIN
%C MADISON, WI
%X The purpose of this research is twofold. The first purpose is the
development of machine learning methods based on linear programming.
These advances come in the form of both novel learning algorithms and
generalization-improvement approaches that are applicable to a wide
range of learning algorithms. The second aim is the application of
these algorithms, along with ideas from the fields of statistics and
image processing, to problems arising in the diagnosis and treatment
of breast cancer. The first chapter provides background into the
relevant areas of research, most importantly machine learning and
breast cancer diagnosis and prognosis, and describes previous work
that has united the two fields.
The first major contribution of this research is an automated
cytological analysis system that we call Xcyt. This system
includes a partially automatic image segmentation facility for
isolating cell nuclei from a digital image. {\bf Xcyt} allows
computation of various size, shape and texture features of these
nuclei. A linear separator distinguishes benign cases from malignant
cases based on 569 training examples. Xcyt also provides an
estimate of the probability of malignancy for each case. The
Xcyt system is currently in use at the University of Wisconsin
Hospitals.
We also address the problem of cancer prognosis, that is, determining
when a cancer is likely to recur. A new learning method, the
recurrence surface approximation (RSA), is introduced for predicting
the time of recurrence. This procedure uses linear programming to
predict recurrence from right-censored input data. Various extensions
and variations of RSA are also explored, including a separation-based
procedure that we call implicit RSA (IRSA).
The topic of improving the generalization of a learning system is
first addressed in the context of RSA, where we introduce a procedure
for choosing the most relevant input features for use the the
predictive model. The final two chapters also describe
generalization-enhancement methods. The first is mock generalization,
which incorporates a tuning set into the optimization process. The
second is banded approximation, which explicitly defines a tolerance
band around the predictive surface in order to avoid overfitting the
training data.