%\renewcommand{\textfraction}{0.1}
%\renewcommand{\dbltopfraction}{0.5}
\renewcommand{\figfont}{}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\tableofcontents

\newpage
\setcounter{page}{1}
\pagestyle{plain}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}
\label{sec:intro}

The long-term objective of our research is the integration of visual
data acquired by many separate, uncalibrated video cameras moving
independently through a large-scale, occluded environment.  As a step
towards integrating the data, we propose developing algorithms for
automatically constructing three-dimensional representations of the
scene from the video streams.  Such three-dimensional reconstructions
are useful by themselves, and should also assist in the global
alignment of the many disparate video streams.  Furthermore, we will
study related problems of data integration.  It should be emphasized
that, while much of this paper describes a broad vision of an important
general domain of research, namely integrating data from multiple,
independently moving cameras, our proposal is to investigate only a few
specific problems of general scientific interest that arise within this
domain, as described later.

As an example of the problem domain we propose to investigate, consider
the following scenario:  Several cameras are in motion through an urban
environment; some are carried by people walking through the
environment, while others are mounted on vehicles driving through it or
flying above it.  Sometimes a camera is carried into a building and
moved around its halls and up and down its stairs.  The camera may even
be pointed out a window on occasion to capture additional views of the
exterior scene.  Other cameras are simply moved around the outsides of
the various buildings or through the streets.

We would like to be able to integrate all the data acquired by these
many moving cameras, and in particular we would like to automatically
construct a three-dimensional model of the environment directly from
the various video streams.  Such a reconstruction would include both
the insides and outsides of buildings as well as the streets and
significant landmarks.  We would also like to integrate data acquired
by the moving cameras with data acquired by stationary cameras mounted
in the scene.

%\FIGUREWIDE{fig:many.cameras}{general.concept.eps}{1.5in}{many cameras, urban environment}

%\FIGUREWIDE{fig:old.wire.model}{old.wire.model/old.wire.model.eps}{1.5in}
%{previous work}

\begin{figure*}[t!]
 \begin{tabular}{cc}
%  \hspace{.1in}
  \begin{minipage}{3.5in}
   \centerline{ \epsfxsize=1.5in \epsffile{\FIGDIR/general.concept.eps} }
   \caption{\label{fig:many.cameras} \figfont Many independent cameras moving through an urban
     environment.  Cameras A and C view the outsides of various buildings, while camera B
     moves inside a building to model the interior.  Camera D starts at a different location
     from the rest.  How can the information from all these cameras be integrated and shared?}
  \end{minipage} &
  \begin{minipage}{3in}
%  \begin{minipage}{3in}
   \centerline{ \epsfxsize=1.5in \epsffile{\FIGDIR/old.wire.model/old.wire.model.eps} }
   \caption{\label{fig:old.wire.model} \figfont Earlier work showing one frame from an affine monocular
   sequence of a building and a wire model created from the sequence.}
  \end{minipage}
 \end{tabular}
\end{figure*}

What form would the reconstruction take?  We propose a coarse-to-fine
strategy for the reconstruction process.  At first only the most
important and reliable scene features, namely long lines and strong
contours \CITE{aloimonos:phd, spetsakis:iuw90}, would be
reconstructed.  The resulting coarse scene model, consisting mainly of
important scene lines, would be reminiscent of the wire-frame models
\CITE{fvd:book90} used in traditional computer graphics.  For this
reason, we will use the term \emph{wire model} to refer to these
coarse-level scene reconstructions.  Wire models, therefore, will
consist of important scene features and their positions in space and
are not intended to explicitly represent surfaces.

Following the coarse phase of model building, more detailed
reconstructions that incorporate all available scene texture
information [refs] could be created using the wire model as a
guide.  However, the wire model has many uses in its own right,
and, for many tasks, is superior to more detailed reconstructions [refs
to why reduced models are good].

Of course, scene reconstruction is only one possible way to use the
data acquired under the multiple-moving-camera framework.  In
\secref{sec:additional.research} we discuss several other areas of
research arising from this framework, each of which is complementary to
the goal of scene reconstruction.

The ``wire model'' representation introduced above arises naturally from the
problem domain.  In particular:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{itemize}

\item{Urban environments contain many significant straight lines.}

\item{Strong lines are easy to detect and track reliably (see the
sketch following this list).}

\item{By concentrating only on the longest, most distinct lines in the
scene, a more robust algorithm can be created that ignores distracting,
unreliable scene details \CITE{spet:ijcv90} and that also safely
ignores small dynamic objects in the scene, such as people walking.}

\item{Lower-resolution cameras can reliably track strong lines, whereas
high-resolution cameras are necessary to reliably track point-like
scene details.}

\item{Only part of a long line needs to be visible in any one view.  If
it can be determined reliably that the same line is visible in two
different views (though not necessarily the same \emph{part} of the
line), then this information can be used for many standard
reconstruction tasks like solving for the fundamental matrix or finding
the relative orientation between two views.}

\end{itemize}
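To make the detectability claim concrete, the following minimal sketch
(in Python, assuming the OpenCV library; the file name and all
thresholds are illustrative placeholders, not part of our system)
extracts the longest, strongest line segments from a single frame:

\begin{verbatim}
import cv2
import numpy as np

# Hypothetical input: one grayscale frame of a video stream.
frame = cv2.imread("frame0001.png", cv2.IMREAD_GRAYSCALE)

# Edge map from which strong lines are extracted.
edges = cv2.Canny(frame, 50, 150)

# Probabilistic Hough transform; minLineLength discards all but
# long, reliable segments.
segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                           threshold=80, minLineLength=100,
                           maxLineGap=5)

def seg_length(seg):
    x1, y1, x2, y2 = seg[0]
    return np.hypot(x2 - x1, y2 - y1)

# Longest (most reliable) segments first.
segments = ([] if segments is None
            else sorted(segments, key=seg_length, reverse=True))
\end{verbatim}

\noindent Keeping only the longest segments is exactly the
concentration on the most distinct features advocated above.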
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\REM{
[[In addition to using straight line features, we plan to investigate the
use of other strong contours in the views.  For instance, a curved
letter painted on a wall in the scene might be easy to detect and
track, and though not a straight line, it might still be used directly
in determining relative orientation.  [Car window example..]]]

[[Note that a major problem associated with the use of straight lines,
and one which must also be investigated, is the potential confusion
caused by occluding contours arising from curved objects.  As a camera
moves around a curved object, the outer boundary of the object
registers as a strong line and will be tracked as such.  However, the
line tracked from view to view in the video stream will not be a
\emph{corresponding} line.  That is, it does not represent the same
line in space and cannot immediately be used in determining the
relationship between two views.  Failure to recognize an occluding
contour can lead to errors in calculating view relationships.]]
}

%%\subsection{Benefits of the wire Reconstruction}

Consider the benefits of even a simple wire reconstruction of the
environment.  First and most importantly, a wire model provides the
approximate three-dimensional geometry of the scene.  Additionally, if
a camera is at an unknown location in the scene, the features currently
viewed through the camera can be compared to the known wire model of
the scene to try to identify the camera's viewing position and
orientation \REM{[figure?]}.  For instance, this might be used to
determine the position and orientation of an overhead view of the scene
(\eg\ taken from an airplane) using only a model of the scene
constructed from cameras on the ground.

Using the wire model to determine the position and orientation of an
arbitrary view is one way to solve the problem of transferring
information between separate cameras.  It can also help in recognizing
when a camera has returned to a previous part of the scene, or has
entered a part of the scene that another camera has viewed and mapped
earlier \REM{ [refs on 'looping' problem??]}.  These capabilities in
turn help solve global alignment problems that are endemic to
large-scale scene reconstruction.

If the position and orientation of a camera are known or can be
determined in the scene model, then all the model features can be
projected into that camera view.  For example, suppose a camera is
being moved through a building and that part of the building's interior
has been modeled already.  As the camera continues to move through the
building, it continues to enlarge the current model.  While doing so,
the camera's position and orientation relative to the growing model are
constantly known (\secref{sec:general.approach}).  Hence any part of
the building model can be projected into the current camera view.  The
camera may be in a hallway where only the immediate walls are visible,
but through reprojection, the rest of the building can be viewed as if
the walls were transparent \REM{(\figref{})}.

If the model is a wire model, then the reprojection will appear as
lines superimposed on the current view \REM{(\figref{})}.  If the model
is more complete (for instance, a voxelization of the scene
\CITE{fvd:book90,seitz:cvpr97}), then the effect would be as if the
occlusions in the scene were not there at all \REM{(\figref{})}.  Note
that to perform these kinds of reprojections, the model only needs to
be a projective model of the original scene \CITE{faugeras:eccv92}.
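As a concrete illustration, reprojecting a wire model amounts to
applying the camera's $3 \times 4$ projection matrix to the endpoints
of each model line.  The sketch below (in Python with NumPy; the
projection matrix and the segment endpoints are placeholder values
standing in for quantities recovered by the modeling process) projects
one wire-model segment into the current view:

\begin{verbatim}
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)   # homogeneous image point
    return x[:2] / x[2]         # pixel coordinates

# Placeholder camera at the origin and a hypothetical wire segment.
P = np.hstack([np.eye(3), np.zeros((3, 1))])
A = np.array([0.0, 0.0, 5.0])
B = np.array([2.0, 0.0, 5.0])

# The projected endpoints define the segment to superimpose
# on the current view.
a, b = project(P, A), project(P, B)
\end{verbatim}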

The ability to reproject the scene model into a particular view has
many potential \emph{augmented reality} [refs] applications when
combined with special goggles \CITE{spitzer:iswc} or a
retina-projecting system \CITE{pryor:hfes98}.  For example, a nearby
stairway could be projected onto the current view, helping to guide a
person to that stairway.  Note that in the case of a wire model, a
stairway will appear as a distinctive collection of line segments
[refs??].  Outer edges of a building could be projected into the
current view, giving a person a rough guide as to their position in the
building (\eg\ how much farther they could expect to travel in a
particular direction before reaching an outer wall).  A target room in
the building could be projected into the current view, helping to guide
the viewer towards that room [refs??].  Directions, perhaps in the form
of arrows, could be projected onto the current view to help guide
someone through the maze of internal passageways in an unfamiliar
building to reach a particular location [refs??].

The applications just described exist independently of the ability to
produce models of the scene.  If a model already exists, parts of the
model can be projected into the current view as long as the camera's
position can be determined from the visible features [refs].  This
capability represents a separate potential line of research.

Wire models have additional benefits stemming from their minimal
content.  First, new views of a wire model are easy to compute and
quick to render [refs].  Second, wire models require relatively little
data storage [refs].  These two features make it more likely that a
person could carry not only the camera for acquiring views, but also
the small computer required to process the views and construct the
scene model locally [refs??].  In particular, view data would not need
to be transmitted to a more powerful computer at a separate location
for processing [refs??].

As the image data is collected and the scene model is constructed,
special ``high-information-content'' keyframes [refs??] could be saved
(\secref{sec:goal.keyframe}), along with the viewing location and
orientation of each keyframe.  This information could be used later to
construct more photo-realistic models of the scene, for instance
through methods like space carving [refs] and similar algorithms
[refs].

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Overview of the General Problem Domain}
%%\section{Overview of the Roaming Cameras Problem Domain}
\label{sec:overview.problem}

In the preceding section, we outlined a vision of a next-generation
machine vision project.  We have suggested a ``coarse-to-fine''
methodology that concentrates first on only the most significant, most
reliable, most identifiable features of the scene before reconstructing
the scene in greater detail.  In this section, we characterize the
multiple-moving-cameras problem domain more fully and then outline why
existing research is insufficient to tackle many of the problems that
arise within this framework.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{The Problem Domain}
\label{sec:problem.domain}

The key distinguishing features of the problem domain are:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{itemize}

\item{The scene is very large.}

\item{The scene has many occluding surfaces.}

\item{Several video cameras are moving independently through the
scene.}

\item{The cameras have unknown position and orientation.}

\item{Each camera can only capture a small part of the scene at any
instant.}

\item{The cameras can undergo significant translations in all three
dimensions.  They are also free to rotate in any direction.}

\item{The video acquisition process may take a substantial amount of
time\REM{ [[, meaning that lighting changes and dynamic elements of the scene
must be handled]] }.}

\item{The scene is not Lambertian.  For instance, windows reflect
light very differently in different directions.}

\REM{

\item{Errors will propagate as the cameras are moved, introducing
significant global alignment problems in reconstructing the scene.}

\item{The same scene features will come in and out of view repeatedly
over time, and must not be mistaken for new features each time they are
viewed again.}

\item{The cameras must be modeled as perspective cameras; simpler
camera models such as orthographic and para-perspective are
insufficient for the small, confined spaces inside of buildings and for
unrestricted movement of the cameras in exterior sections of the
scene.}

\item{Information from several cameras must be correctly combined.}

\item{The scene is not static; however, moving objects in the scene
will be relatively small compared to most of the static objects.}

\item{The scene is not Lambertian, and scene lighting will change over
time (\eg\ as the sun crosses the sky).}

}

\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The characteristics listed above lead us to the following
observations.  First, because the scene is large and heavily occluded
and the cameras are moving over long distances, even small errors in
local reconstruction will propagate into large errors in the overall
scene reconstruction (\figref{fig:floor.plan}).  Consequently, a
significant global alignment problem arises.

If it were possible to accurately and easily determine the location and
orientation of each camera by some external means, such as by a global
positioning system or by locally triangulating position, then much of
the reconstruction problem could be solved using existing techniques
such as space carving [??].  However, this makes each camera reliant on
an external system and external communications channels for determining
its position.  The approach we are advocating could be entirely local
to the camera operator.

As a video camera is moved around the scene, the same scene features
will come in and out of view repeatedly.  This phenomenon is related to
the size of the scene, the degree of occlusion, and the size of the
camera's field-of-view.  Therefore, another significant problem for
this domain is identifying when a new scene feature has come into view
and recognizing old scene features.  The recognition problem can be
somewhat alleviated by the use of wide-field-of-view cameras, such as
OmniCam \CITE{nayar:cvpr97}, but it remains an inherent difficulty of
this domain due to occlusions and large scene size.

Sometimes the cameras will be in narrow spaces, like hallways and
stairwells, and sometimes cameras will be very close to objects in the
scene simply because we want to allow the camera operators complete
freedom of movement.  This means that the cameras must be modeled as
perspective cameras; simpler camera models such as orthographic and
para-perspective are insufficient.

While one camera might be used to capture all the necessary views of
the static scene elements, in practice it is more likely that many
cameras will be used simultaneously because of the presumed scene
size.  The process of constructing the overall scene model also
produces information that could be beneficial to a group of camera
operators working together.  For instance, each camera operator would
know the location of the other camera operators relative to the growing
scene model at any instant, which would help them find each other and
also avoid redundant work.  Consequently, it is important to develop
techniques and algorithms for efficiently integrating the data from all
the video streams.

Because the scene is large, the modeling process will take a
substantial amount of time.  Therefore, it cannot be assumed that scene
lighting will remain constant; for example, the sun will move across
the sky, changing the positions of shadows and the apparent color of
scene objects.  Dynamic elements of the scene, such as people and
vehicles, will move and will disrupt the modeling process.  The use of
high-frequency line features helps alleviate problems caused by
changing lighting conditions, and also helps in ignoring small,
unstable features such as people walking through the scene.

Finally, the sheer amount of data generated by the cameras must be
handled efficiently, leading in part to the reduced ``wire'' model
concept and to the use of ``keyframes'' discussed in
\secref{sec:goal.keyframe}.

\REM{
[as a camera is moved around a complicated environment, such as the
inside of a building, it will probably be moved into the same area more
than once, perhaps even capture the exact same view more than once; a
crucial research problem is to reliably determine when features have
been seen before and to perform global alignment on the current scene
model to reflect the fact that the camera has returned to an earlier
position...]

We will assume that the position and orientation of the cameras is
unknown {\it a priori}.  If it were possible to accurately, reliably,
and easily determine the position and orientation of each camera at
each instant in time, the reconstruction problem could be solved by
various existing techniques [space carving, etc.].  Instead, we propose
investigating solutions that use only simple video streams as input.

Widely-separated correspondence problem...hard to solve!

In order for a video-stream-based algorithm to work, there must be
sufficient information available in the video streams...

solving the problem requires being able to transfer and share
information between several different cameras

Global alignment will be a major problem early in the model-building
process.  However, as the model becomes more complete and more correct
from earlier attempts at global alignment, future global alignment
becomes less and less necessary as the remaining small unexplored parts
of the scene become filled in.
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Related Work}
\label{sec:related.work}

So far in this proposal we have outlined the multiple-moving-camera
framework and have given special attention to the goal of scene
reconstruction under this framework.  As camera technology has become
less expensive, there have been many attempts to exploit
multiple-camera frameworks (\eg\ \CITE{seitz:cvpr97, yach:isrr86,
zitn:tr96, faug:icpr96, kang:cvpr96, okut:pami93, tsao:cvpr97}).  The
``Forest of Sensors'' \CITE{grimson:iuw97, grimson:cvpr98, stein:iuw98}
is a general research framework involving many stationary cameras,
mounted in place for potentially long periods of time.  Key goals of
that project are the analysis of patterns of activity and the
determination of relative camera relationships.  Since the cameras are
stationary, this work is only loosely related to our proposal.

More relevant is research on scene reconstruction from video sequences
\CITE{shapiro:book95, toma:phd, quan:cvpr96, morris:iccv98,
kanade:ptrsl98, bb20139, mclauchlan:iccv95, Fitzgibbon98a,
Fitzgibbon98, Fitzgibbon98b, Armstrong96a, bear:eccv96, faug:tr95,
moez:cga96, rob:ima97, seal:cvgip95, VanGool98}.  However, almost all work in this
area has assumed that much if not all of the scene is visible in each
frame of the video sequence.  This highlights the key difference
between past work and our current proposal:  We assume that only a
small part of the scene is visible in any one frame of the sequence, as
is discussed in greater detail in the following section (see
\figref{fig:move.closer}).  Furthermore, most work on modeling from
video has assumed simplified camera models such as orthographic
\CITE{toma:ijcv92}, para-perspective \CITE{poelman:eccv94}, or affine
\CITE{shap:ijcv95}.  These models cannot be applied to the problem
domain under consideration here, as was discussed in
\secref{sec:problem.domain}.

It should be emphasized that little, if any, of the work on modeling
from video has focused on the integration of multiple moving video
sequences and the problems related to such integration, as we are
proposing.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{General Approach to Scene Reconstruction}
\label{sec:general.approach}

%\FIGUREWIDE{fig:old.wire.model}{old.wire.model/old.wire.model.eps}{1.5in}
%{previous work}

Our current approach to scene reconstruction has grown out of our past
experience with this subject.  Previously we have researched many
aspects of scene reconstruction and view synthesis, including surface
reconstruction through active vision \CITE{kutu:cvpr94b, yu:aiu96,
kutulakos:cbvw94}, recovery of voxel scene models from multiple views
\CITE{seitz:iuw97b} including real-time voxel modeling
\CITE{prock:iuw98}, projective scene reconstruction and virtual view
creation for static \CITE{seitz:rvs95, seitz:sigg96} and dynamic scenes
\CITE{manning:iuw98}, handling noisy data in structure-from-motion
\CITE{bestor:phd}, and complete, consistent scene recovery from small
numbers of corresponding points \CITE{seitz:iccv95}.  In this last
work, we created wire scene models from point and line features
identified in frames of a video sequence, requiring that only four
points be tracked through the entire sequence
(\figref{fig:old.wire.model}).  The technique was efficient, immune to
outliers and feature drift, and guaranteed the completeness of the
recovered scene.  However, it used an affine camera model and assumed
the views were occlusion-free.

Scene reconstruction depends on solving the correspondence problem
between various views in the video stream.  From correspondences alone,
a projective reconstruction of the scene can be created
\CITE{faugeras:eccv92, hartley:cvpr94, hart:iuw94a}, which is
sufficient for the augmented reality applications outlined earlier.
When the camera's internal calibration \CITE{tsai:ra87} is known, which
is a reasonable assumption for this problem domain, a Euclidean
reconstruction is possible.
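For reference, the connection between the projective and Euclidean
cases is the standard one:  if $F$ is the fundamental matrix relating
two views and $K$ and $K'$ are the views' internal calibration
matrices, then the essential matrix

$$
E \;=\; K'^{\top} F K
$$

\noindent encodes the Euclidean relative orientation (rotation and
direction of translation) between the two views.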

Several moderately successful methods have been published for solving
the dense correspondence problem between pairs or triples of
closely-spaced views [Stein, etc.].  This suggests that a scene model
could be created from the video stream by first creating a series of
partial models from closely-spaced frames in the stream and then
fitting the models together.  However, even the best
dense-correspondence methods contain significant errors, especially in
homogeneous regions with little texture and shading, which would make
it difficult to piece together the successive partial models.  In any
event, small errors will still accumulate over the large, heavily
occluded scenes we propose investigating, and methods for global
alignment still need to be developed.

In the rest of this section, we outline a different, hybrid approach to
scene reconstruction.  First notice that urban scenes typically contain
significant regions of undistinguished texture, such as uniformly
painted walls.  Furthermore, with low-resolution video equipment, much
scene detail is lost.  For instance, the surface of a road might be
highly textured, but will appear as a uniform dark gray in typical
video sequences.  Consequently, the most reliable source of information
in a video sequence is the high-frequency components: the strong lines
\CITE{hart:iuw94b} and contours, such as occur where two walls meet or
where the sidewalk touches the road.  The most reliable high-frequency
features are the longest lines in the scene, such as the edge of a
building, or the edge of a doorway in an interior view.  By solving the
correspondence problem for just these reliable, easy-to-track features,
a wire model of the extended scene can be created.  Of course,
significant global alignment steps will still need to be performed,
since errors will accumulate even for these reliable features.

Some of the error inherent in the reconstruction process can be
minimized by increasing the baseline distance between pairs of
reference views.  If only still images are available, then this
involves solving the correspondence problem for widely-separated views,
one of the most-studied and most difficult problems in computer
vision.  The wide-baseline correspondence problem becomes easier when a
video sequence connecting the views is available, as in the problem
domain under consideration.  The problem becomes one of tracking
features through each frame of the video stream and thereby
establishing the correspondence between the widely-separated views.

Taking the above comments into consideration, the following basic
algorithm for extended scene reconstruction arises:

\smallskip \noindent \rule{6.75in}{.5pt} \smallskip

{\it

\noindent Starting at the first frame:

\noindent (Step 1) Identify reliable features to track.

\noindent (Step 2) Track these features into the next frame.  In the
case of line features, only monitor whether the same line is in view,
not whether the exact previous segment is still in view.

\noindent (Step 3) When earlier features disappear from view, attempt
to estimate what their current position would be in the current view
had the viewing field been wide enough or had the obstructions been
transparent. \REM{ For reference views, use frames that are separated by a
wide-baseline to help stabilize the computation. }

\noindent (Step 4) Compare predicted feature locations to features that
are currently in view.  Adjust the scene model to correct any errors.

\noindent (Step 5) Repeat from step 1.

}

\smallskip \noindent \rule{6.75in}{.5pt} \smallskip
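The control flow of this algorithm is summarized by the following
skeleton (Python; every routine here is a stub standing in for a
component discussed in this proposal, so this is a structural sketch
rather than a committed design):

\begin{verbatim}
# Stubs standing in for the components discussed in the text.
def detect_reliable_features(frame):    return {}       # Step 1
def track_into_frame(tracked, frame):   return tracked  # Step 2
def transfer(feature_id, model, frame): return None     # Step 3
def compare(predictions, tracked):      return {}       # Step 4
def adjust_model(model, errors):        pass            # Step 4

def reconstruct(stream):
    model = {}    # growing wire model: feature id -> 3D estimate
    tracked = {}  # feature id -> current image measurement
    for frame in stream:            # Step 5: repeat per frame
        tracked.update(detect_reliable_features(frame))
        tracked = track_into_frame(tracked, frame)
        lost = [f for f in model if f not in tracked]
        predictions = {f: transfer(f, model, frame) for f in lost}
        errors = compare(predictions, tracked)
        adjust_model(model, errors)
    return model
\end{verbatim}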

\REM{
\FIGUREWIDE{fig:transfer}{feature.transfer.final.eps}{4in} {Feature
transfer.  Three frames from a sequence panning over objects on a
table.  Some feature information missing in the third frame is visible
in the first frame.  By tracking features common to all three frames,
the missing information can be transfered (\eg\ by the trifocal
tensor).}
}

\begin{figure*}[t!]
 \begin{tabular}{cc}
%  \hspace{.1in}
  \begin{minipage}{4.5in}
   \centerline{ \epsfxsize=4in \epsffile{\FIGDIR/feature.transfer.final.eps} }
  \end{minipage} &
  \begin{minipage}{2in}
   \caption{\label{fig:transfer} \figfont Feature
transfer.  Three frames from a sequence panning over objects on a
table.  Some feature information missing in the third frame is visible
in the first frame.  By tracking features common to all three frames,
the missing information can be transferred (\eg\ by the trifocal
tensor).}
  \end{minipage}
 \end{tabular}
\end{figure*}

Step 3 is known in photogrammetry as \emph{transfer}
\CITE{barrett:book92}; \figref{fig:transfer} illustrates the
concept.  The key idea of the algorithm is to always keep track of
every feature that has been viewed so far.  Many results are available
on how many conjugate features (and what types of features) must be
known between two views in order to transfer (\ie\ accurately predict)
the locations of the remaining features.  Of course, all of these
theoretical results are subject to noise problems in real applications
\CITE{spetsakis:iuw90, shapiro:book95}.
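As one standard example of such a result, write the trifocal tensor of
three views as $T_i^{\,jk}$.  Given a point $x$ in the first view and
any line $l'$ passing through the corresponding point in the second
view, the corresponding point $x''$ in the third view is determined
directly by the tensor contraction

$$
x''^{\,k} \;=\; x^{i}\, l'_{j}\, T_i^{\,jk} ,
$$

\noindent so a feature matched in two views can be transferred into a
third view without any explicit three-dimensional reconstruction.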

A key benefit of tracking line features is that only part of each line
needs to be visible in each view \CITE{chiba:tr97-tbc, spet:ijcv90}.
This is especially relevant to the problem domain considered in this
proposal.  In particular, we are assuming the camera's field-of-view is
small to begin with and is further restricted by occlusions in the
scene.  That is, only a very small part of the scene will be visible at
any one time.  Consequently, the ability to rely on long line features,
pieces of which might appear over many views of the video stream, is of
paramount importance.

Once a sufficient number of line and point correspondences have been
tracked between three views, the \emph{trifocal tensor}
\CITE{hart:iuw94b} for the views can be determined.  Hartley
\CITE{hart:iuw94b} has demonstrated that the trifocal tensor for three
uncalibrated views can be found via a linear algorithm if the following
inequality is met:

\vspace{-5mm}

$$
\ell \;+\; 2p \;\ge\; 13 ,
$$

\vspace{-1mm}

\noindent where $\ell$ is the number of known corresponding lines
between the views and $p$ is the number of known corresponding points.

\noindent In the problem domain under consideration, we can assume that
the internal calibration of each camera is known and that the focal
length does not change.  Consequently, even fewer line and point
correspondences between the three views are necessary for transferring
the remaining scene features.  For instance, in such cases the
\emph{relative orientation} between two cameras can be found from five
point correspondences \CITE{horn:relative}.  This means that, since the
relative camera matrices can be determined, the trifocal tensor between
three views can also be found from five point correspondences (for
instance, see \CITE{avid:cvpr97} for how to calculate the trifocal
tensor directly from the relative camera matrices).
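As an illustration of the calibrated case, five-point relative
orientation is implemented in modern libraries.  The sketch below
assumes OpenCV; the point arrays are synthetic stand-ins for
correspondences produced by the tracking described above, and the
calibration matrix is a placeholder:

\begin{verbatim}
import cv2
import numpy as np

# Stand-in correspondences (N x 2) and a placeholder calibration.
pts1 = (np.random.rand(20, 2) * 640).astype(np.float32)
pts2 = pts1 + 2.0
K = np.array([[700., 0., 320.],
              [0., 700., 240.],
              [0., 0., 1.]])

# Five-point algorithm (with RANSAC) for the essential matrix...
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                               prob=0.999, threshold=1.0)

# ...then decompose E into the relative orientation (R, t).
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
\end{verbatim}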

\REM{
Note that nonlinear methods for determining the trifocal tensor from
fewer point and line correspondences may exist.  A nonlinear algorithm
\CITE{liu:icpr86} for reconstructing lines in space from 6 line
correspondences between three views has been demonstrated, as opposed
to the 13 line correspondences needed by linear methods
\CITE{spet:ijcv90, hart:iuw94b}.
}

We now discuss the second part of the hybrid approach.  There is no
need to ignore texture information if that information can assist in
the feature transfer process.  In particular, we will investigate the
use of tracking textured planar surface patches \CITE{gleicher:cvpr97}
to help estimate camera motion, because planar surfaces will be common
in urban environments, especially in indoor views.  Furthermore, in the
problem domain under consideration there will be occasions when simply
not enough extended, high-frequency features can be tracked between
frames; \figref{fig:move.closer} illustrates the problem.  In such
situations, only brightness gradients and local surface texture will be
available for estimating the camera motion, and ``direct methods'', as
pioneered in \CITE{horn:ijcv88}, must be employed.  Recently, Stein
\CITE{g.stein:cvpr97a, g.stein:phd} has shown that surface shading by
itself can be used to find the trifocal tensor between three
closely-spaced views.  His method first assumes constant surface
brightness and a small motion model \CITE{longuett:prsl80} for the
camera.  He then uses the constraints that arise from trying to
correspond virtually every pixel in the three views, with the idea that
such massive redundancy will serve to stabilize the computation.
Stein's approach was designed for a three-camera stereo rig and cannot
be directly applied to our problem domain.

Following the ``direct methods'' ideology, we will investigate how to
exploit brightness gradient information to derive a rough estimate
of camera motion in areas devoid of other features.  This will in turn
help to bridge the gap between widely-separated views in cases where
significant line features are unavailable.  We are also interested in
combining conjugate line and point information with surface brightness
and texture information to directly calculate the trifocal tensor.  For
example, if two conjugate lines and one conjugate point are known
between three views, can the brightness gradient be used directly to
make up for the lack of macro-feature information?
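The starting point for such direct methods is the standard
brightness-constancy constraint.  If the image brightness
$I(x, y, t)$ of a scene point is assumed constant over time, then the
image motion $(u, v)$ at each pixel satisfies

$$
I_x u \;+\; I_y v \;+\; I_t \;=\; 0 ,
$$

\noindent where $I_x$, $I_y$, and $I_t$ are the spatial and temporal
derivatives of the image brightness.  Each pixel thus contributes one
constraint relating the unknown motion to measurable image gradients,
and it is the massive redundancy of these per-pixel constraints that
direct methods exploit.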

The success of the feature transfer process will occasionally be
checked when old features that have gone out of view for a while
reappear in later views, perhaps seen from new vantage points.  The
predicted locations can then be compared to the current true locations
of the features, and global adjustments can be made to correct for any
errors \CITE{bajura:cga95}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Related Research Goals}
\label{sec:additional.research}

We now discuss some additional research goals that arise naturally from
the multiple-moving-camera framework.  These objectives are
complementary to, and in some cases integral to, the wire modeling
process outlined above.  We feel that each of the following topics
could be researched independently, and that each could lead to
important advances in machine vision.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Recognizing Previously Acquired Views and Regions}
\label{sec:goal.looping}

Suppose a video camera is moved around a room, sometimes viewing new
features of the room, often seeing old features again.  As the camera
pans, previously-seen features will continually reenter the view, and
current features will continually exit the view.  Any algorithm that
attempts to use these features, such as a scene-modeling algorithm,
must be able to identify which features are new and which have been
viewed before.

\REM{ [[Our general approach to this problem was outlined in the
previous section.  We propose continually estimating the location of
every reliable feature detected so far, ]] [Our approach to this
problem is to transfer all previously viewed scene features in to each
successive new view...Such an interframe transference of information
requires that a sufficient number of reliable scene features be tracked
between successive frames...] }

\REM{
Our approach to this problem, as outlined in the previous section, is
to transfer all previously viewed scene features into each successive
new view.  Of course, such an inter-frame transfer of information
requires that a sufficient number of reliable scene features are tracked
between successive frames.
}

\REM{ [[explicitly mention that we will work on this particular problem
with respect to rooms, and expect it to be solvable????]] }

We will first study this problem for a single room.  The camera will be
allowed to pan and translate freely around the room, and prominent
features will be tracked and transferred from frame to frame even when
they are out of direct view.  Techniques will be developed to
accomplish the transfer as described in \secref{sec:general.approach},
either by using the visible macro-features to find the trifocal tensor
directly, or by using hybrid indirect methods relying on surface
texture and brightness when the macro-features alone are insufficient.
When old features return to view, they will be compared to their
predicted locations to test the success of the transfer process, and
adjustments will be made automatically as needed.  We will also study
this problem for a camera moving around a stationary car.  Cars
typically have prominent contours to track but no prominent
straight-line features.  They also have highly reflective surfaces and
transparent regions, making this test subject particularly interesting
and challenging.

\begin{figure*}[t!]
 \begin{tabular}{cc}
%  \hspace{.1in}
  \begin{minipage}{2.5in}
   \centerline{ \epsfxsize=1.7in \epsffile{\FIGDIR/move.closer.eps} }
   \caption{\label{fig:move.closer} \figfont Camera in a hallway starts at position A and
   follows a path to B and C.  Most of the large features can not be seen from
   position B because it is too near the wall.  From A and C, the camera can view all the macro-features
   in circle 1.  From B, it can only view those features in circle 2.}
  \end{minipage} &
  \begin{minipage}{4in}
   \centerline{ \epsfxsize=2.5in \epsffile{\FIGDIR/one.camera.modeling/all.eps} }
   \caption{\label{fig:floor.plan} \figfont Camera starts in room A.  After panning around
   the room, the camera is moved out into the hallway, into various other rooms, and then returned
   to room A via a different path, thus completing a loop.  Since small local errors have
accumulated in the growing scene model, a global alignment step is necessary to accurately reflect the loop.}
  \end{minipage}
 \end{tabular}
\end{figure*}

The problem of recognizing when features have been viewed before
becomes more complicated if the camera is allowed to move over a large
area of the scene, especially when occlusions hide most of the
previously-viewed features.  For instance, imagine that a camera is
first moved around a room and is then subsequently moved out into a
hallway and around the inside of the building, finally returning to the
original room via a new path (\figref{fig:floor.plan}).  Any algorithm
that uses the video information must be able to identify that a loop
has occurred and that the camera is now viewing features that have
already been identified.

%\FIGUREWIDE{fig:floor.plan}{one.camera.modeling/all.eps}{3in}{floor
%plan}

If an algorithm is constructing a scene model and the model is very
accurate, then the algorithm already knows the position of the camera
relative to the current scene model and can easily recognize that a
loop has occurred.  However, it is more likely that small errors in the
model construction have been propagating as the camera was moved
throughout the building, so that the current model is not accurate
enough to reliably identify the loop.

For these occasions, when the model is not yet sufficiently accurate,
alternative methods must be developed for identifying loops in the
camera's journey.  Our initial approach to this problem will be to
store qualitative representations of unusual and important keyframes
as the camera is moved through the scene, following the ratio-template
technique of \CITE{sinha:iovs94, lipson:cvpr97}.  This technique
incorporates spatial relationships with color and texture information
and is more reliable than pure color-texture signature techniques
(\eg\ \CITE{smith:icip96}).  One advantage of qualitative techniques in
general is their potential for great speed, which will be important
when dealing with large-scale scenes.
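As a minimal illustration of the qualitative flavor of such techniques
(a toy sketch only: the uniform grid is our simplification, whereas
the cited ratio-template work uses purpose-designed region pairs), a
view can be summarized by the ordinal brightness relationships among
coarse image regions, and two views compared by Hamming distance:

\begin{verbatim}
import numpy as np

def signature(gray, grid=4):
    """Ordinal signature: pairwise comparisons of cell means."""
    h, w = gray.shape
    cells = [gray[i * h // grid:(i + 1) * h // grid,
                  j * w // grid:(j + 1) * w // grid].mean()
             for i in range(grid) for j in range(grid)]
    return np.array([a < b for a in cells for b in cells])

def distance(sig_a, sig_b):
    """Number of brightness relationships that disagree."""
    return int(np.sum(sig_a != sig_b))
\end{verbatim}

\noindent Because only the \emph{ordering} of region brightnesses is
stored, the signature is largely insensitive to overall lighting
changes, which is precisely the robustness sought here.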

\REM{
The algorithm might, for instance, store qualitative
representations of unusual and important keyframes as the camera is
moved through the scene, and it might refer back to these keyframes to
identify loops.  One advantage of qualitative methods, such as
color-and-texture signature matching \CITE{smith:icip96} and
ratio-template matching \CITE{sinha:iovs94, lipson:cvpr97}, is their
potential for great speed.
}

Whenever a loop is identified and the current scene model does not
predict that a loop should have occurred, a global alignment step
must be performed on the scene model to bring it up to date.  Over
time, as more and more global alignment steps have been successfully
carried out and much of the scene has been accurately modeled, there
will be less and less need for new global alignment steps as the
remaining few unexplored parts of the scene are added to the model.
Global alignment has been studied extensively with respect to
mosaicing.  In particular, Sawhney \etal\ \CITE{bb26368} describe a
general approach that first requires identifying loops in a mosaic
image sequence before globally aligning the images.  We will study
extending this approach to the global model alignment problem discussed
above.

There is another situation for which the ``loop identification''
problem cannot be solved by simply referring back to the current scene
model.  Imagine that the hypothetical camera is moved around the inside
of the building as before, adding to the scene model as it goes, when
it encounters a region without sufficient features to continue the
modeling process.  For example, perhaps the camera is moved through an
unlit hallway for a while.  When suitable scene features are once again
available, a new scene model must be started.  Eventually, when the
camera returns to the original room again, the old scene model and the
new scene model must be joined together.  The fact that the camera has
returned to its starting place must be determined by means other than
the model.

Similarly, when several cameras are moving through a large, complicated
environment, cameras will sometimes enter regions of the scene that
other cameras have already viewed and modeled.  Since each camera is
constructing its own growing model of the scene, whenever such
crossings occur, a larger scene model can be created by combining the
smaller, camera-centered models.  A camera might be able to compare its
own scene model to every other camera's scene model to identify such
``path crossings.''  However, these model-based comparisons may not be
very efficient and may not be reliable by themselves.  Hence the
alternative, non-model-based methods discussed above may help to
identify these kinds of path crossings, too.

Note that a significant global alignment step will probably need to be
carried out in order to merge the scene models created by separate
cameras.  Again, as time goes by and more of the complete scene is
uncovered and successfully modeled, fewer mergers will need to be
performed, and they will be performed more easily and reliably.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Combining Information from Several Moving Video Streams}
\label{sec:goal.multi.cam}

In theory, one video camera could be used to capture all the necessary
views of a static scene.  In practice, the simultaneous use of many
separate cameras effectively parallelizes the data acquisition
process.

Suppose, for instance, that a group of people enters a large extended
scene and each carries a camera-computer system for the purpose of
mapping the scene.  If the scene is a building, some of the people may
walk around the outside of the building intending to acquire an
exterior model.  Of these, some may proceed clockwise around the
building and some counter-clockwise; some may aim their cameras at
lower regions of the building, others at the top of the building.  The
people who enter the building may proceed along different paths through
the interior.  Each person, then, acquires different information about
the scene.  This information could be continually shared among the
group, so that a rapidly-growing building model gets projected into
each separate camera view.

For the people moving inside the building, if the modeling process is
successful, each will have an idea of where other members of the team
have been before and where they are currently.  This could help them
avoid redundantly covering known territory and also help them find each
other.

In order for the separate cameras to share information, there must be
some time when scene features are successfully transferred
\emph{between cameras} rather than just between frames in a single
video stream.  An easy way to accomplish feature transference between
two separate cameras is to first have the two cameras capture the same
view (perhaps at different times), and then to recognize that the
features thus captured are exactly the same.  We refer to the process
of transferring scene features between cameras as \emph{registering}
the cameras relative to each other.

The problem of registering two cameras easily and reliably is an
important area of research.  For instance, do the two cameras have to
capture the exact same view in order for the registration to be
successful?  How far off can the two views be?  Can the computer guide
the camera operator to the best viewing position for the transference
to occur (\eg\ as in the active vision paradigm)?  We will investigate
an approach based on the technique of \emph{snakes}
\CITE{kass:ijcv88}.  In particular, the prominent features captured in
one view by one of the cameras will become a multi-component snake.  As
the second camera approaches the viewing position of the first camera,
this multi-component snake will jump to the corresponding scene
features in the second camera as soon as the jump can be performed
reliably.  Corresponding scene features will thus be identified between
the two cameras.
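A toy version of the contour-snapping step (assuming the scikit-image
library; the frame and the initialization polyline are synthetic
placeholders for the second camera's view and for the features
predicted from the first camera) snaps an initial contour onto nearby
strong edges:

\begin{verbatim}
import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

# Placeholder frame and an initial open contour (row, col),
# standing in for features transferred from the first camera.
frame = np.random.rand(480, 640)
init = np.stack([np.linspace(100, 300, 50),
                 np.linspace(150, 400, 50)], axis=1)

# The snake relaxes onto nearby edges; fixed endpoints suit an
# open line feature rather than a closed region boundary.
snake = active_contour(gaussian(frame, 3), init,
                       alpha=0.015, beta=10.0, gamma=0.001,
                       boundary_condition='fixed')
\end{verbatim}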

Once two cameras are registered, they can travel independently of each
other through the rest of the scene and still share their information,
as long as each camera's inter-frame feature transference process is
never interrupted (\eg\ by a dearth of reliable scene features).  As
the cameras travel farther from each other, small errors will
accumulate leading to global alignment issues as discussed in the
previous section.  An important problem to solve is recognizing when
the two cameras are viewing the same part of the scene again (\ie\ when
their paths cross), so that global alignment can be carried out.

If two cameras start at different locations in a scene and therefore
never have the opportunity to register themselves, then each camera may
map out a large section of the scene before one camera finally crosses
the other's path.  Consequently, two separate problems emerge:  First,
it must be possible to identify when one camera has crossed the
other's path without any {\it a priori} registration.  Second, when
path crossings are identified, the two, perhaps large, scene models
must be successfully merged during the subsequent global alignment
step.  This is a longer-term research goal.

When several cameras are employed simultaneously, non-static events in
the scene can be viewed from several different directions
concurrently.  The information thus gained might also be used to
register the cameras relative to each other; similar work has already
been done with regard to stationary sensors \CITE{stein:iuw98}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Identifying Keyframes That Contain the Maximum Useful Information}
\label{sec:goal.keyframe}

The approach advocated so far involves first constructing a wire scene
model in which scene texture information is discarded.  The advantages
of this representation were discussed in \secref{sec:intro} and include
reducing the storage space, computing power, and bandwidth needed to
process and communicate the acquired data.

We have identified two tasks that might benefit from long-term storage
of at least some scene texture information.  First is the task of
recognizing loops and crossings in the various camera paths.  Second is
the construction of finer-level scene models.

It may not be possible to recognize loops and path crossings from the
nascent scene model itself.  Alternative methods need to be developed
for recognizing when a camera is viewing a part of the scene that has
already been viewed \REM{, either by the same camera in the case of a
loop, or by another camera in the case of a path crossing}.  Many views
of the scene will be so devoid of information as to make them
indistinguishable from similar views (\eg\ a view of a blank wall).
Other views will be rich with information and perhaps highly
distinctive, making them ideal markers for recognizing path crossings.
We refer to these special views as \emph{keyframes}.

As each camera moves through the scene, it could automatically acquire
a series of keyframes to help identify path crossings.  The problem of
automatically determining which views to store as keyframes is a rich
machine-vision problem; it could be as simple as using histograms to
store views with unique color and texture combinations, or as
complicated as using higher level processing to more completely
``understand'' a view.  Our initial approach to this problem will be
based on the ratio-template concept pioneered in \CITE{sinha:iovs94},
used in conjunction with statistical tests to identify unusual views.
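At the simple end of that spectrum, a minimal sketch (assuming OpenCV;
the histogram resolution and the distance threshold are illustrative
choices, not tuned values) would store a frame as a keyframe only when
its color signature is sufficiently far from every keyframe stored so
far:

\begin{verbatim}
import cv2

def color_signature(frame):
    """Coarse HSV color histogram used as a view signature."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [16, 16],
                        [0, 180, 0, 256])
    return cv2.normalize(hist, hist)

def maybe_add_keyframe(frame, keyframes, threshold=0.5):
    """Keep the frame only if it is unlike all stored keyframes."""
    sig = color_signature(frame)
    if all(cv2.compareHist(sig, k, cv2.HISTCMP_BHATTACHARYYA)
           > threshold for k in keyframes):
        keyframes.append(sig)
\end{verbatim}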

Making use of the stored keyframes in an efficient manner is a
separate research problem, related in part to the task of retrieving
images from a large database solely from image characteristics
\CITE{smith:icip96, lipson:cvpr97}.  It is doubtful that two cameras
will ever capture the \emph{exact} same view, so the use of
keyframes must be flexible enough to identify near-matches or
otherwise similar views.  For example, with an ideal keyframe
technique, the algorithm would be able to recognize, with high
probability, the view of a room from one of its entrances using only a
keyframe captured from another, distant entrance.  \REM{[In what form
the keyframes are stored is intimately related to how they will be
used and retrieved.]}

\REM{
A keyframe identification algorithm might also be used to generate
directions from one locality to another based on important visual
cues.  This is akin to the act of giving someone directions based on
important landmarks.
}

It is interesting to note that, if the keyframe methods described above
could be developed separately from model building, then they could be
used to create a ``connected-graph'' representation of the scene and the
various camera paths.  Nodes of the graph would correspond to
keyframes, and edges would represent paths taken by the cameras
between keyframes.  \REM{[[Some psychological research indicates that
this is how people represent a scene...???]]}
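Such a representation is straightforward to maintain incrementally; a
minimal sketch (plain Python, with the stored keyframe signatures as
node payloads) might look like:

\begin{verbatim}
class SceneGraph:
    """Keyframes as nodes; edges record camera paths between them."""
    def __init__(self):
        self.nodes = {}     # keyframe id -> stored signature
        self.edges = set()  # (id_a, id_b): a camera went a -> b

    def add_keyframe(self, kf_id, signature, prev_id=None):
        self.nodes[kf_id] = signature
        if prev_id is not None:
            self.edges.add((prev_id, kf_id))
\end{verbatim}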

The second use of keyframes is to generate detailed surface
reconstructions and photorealistic virtual views.  Several techniques
are known for combining a collection of widely separated scene views
into a realistic texture-mapped model [refs].  These techniques usually
require knowledge of the viewing positions and orientations relative to
the scene.  Note that once an accurate wire model of the scene has been
constructed, the viewing positions and orientations can be determined
relative to the wire model.  This is the heart of the
``coarse-to-fine'' approach.

During the process of video acquisition, a great many views of the scene
will be captured, and it would be infeasible and unnecessary to store
all of them for later, finer scene reconstruction.  In the long term,
we will investigate algorithms for automatically storing those
keyframes that will best serve the task of fine-level scene
reconstruction.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Photorealistic Scene Reconstructions}
\label{sec:finer.model}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

