Dynamic View Interpolation without Affine Reconstruction
R. A. Manning and C. R. Dyer, in
Confluence of Computer Vision and Computer Graphics,
A. Leonardis, F. Solina and R. Bajcsy, eds., Kluwer, Dordrecht,
The Netherlands, 2000, 123-142.
Photorealistic Scene Reconstruction by Voxel Coloring
S. M. Seitz and C. R. Dyer,
Int. J. Computer Vision 35, No. 2, 1999, 151-173
Active 3D Surface Modeling using Perception-Based, Differential Geometric Primitives
L-Y. Yu, Ph.D. Dissertation,
Computer Sciences Department,
University of Wisconsin - Madison, August 1999.
In Computer Sciences Department Technical Report 1397, the authors introduced a linear algorithm for determining the affine calibration between two camera views of a dynamic scene. In this paper, we expand upon the algorithm and investigate its performance experimentally. The algorithm computes affine calibration directly from the fundamental matrices associated with various moving objects in the scene, as well as from the fundamental matrix for the static background if the cameras are at different locations. A minimum of two fundamental matrices are required, but any number of additional fundamental matrices can be incorporated into the linear system to improve computational stability. The technique is demonstrated on both real and synthetic data.
This chapter presents techniques for view interpolation between two reference views of a dynamic scene captured at different times. The interpolations produced portray one possible physically-valid version of what transpired in the scene during the time between when the two reference views were taken. We show how straight-line object motion, relative to a camera-centered coordinate system, can be achieved, and how the appearance of straight-line object motion relative to the background can be created. The special case of affine cameras is also discussed. The methods presented work with widely-separated, uncalibrated cameras and sparse point correspondences. The approach does not involve finding the camera-to-camera transformation and thus does not implicitly perform affine reconstruction of the scene. For circumstances in which the camera-to-camera transformation can be found, we introduce a vector-space of possible synthetic views that follows naturally from the given reference views. It is assumed that the motion of each object in the original scene consists of a series of rigid translations.
We introduce the problem of view interpolation for dynamic scenes. Our solution to this problem extends the concept of view morphing and retains the practical advantages of that method. We are specifically concerned with interpolating between two reference views captured at different times, so that there is a missing interval of time between when the views were taken. The synthetic interpolations produced by our algorithm portray one possible physically-valid version of what transpired in the scene during the missing time. It is assumed that each object in the original scene underwent a series of rigid translations. Dynamic view morphing can work with widely-spaced reference views, sparse point correspondences, and uncalibrated cameras. When the camera-to-camera transformation can be determined, the synthetic interpolation will portray scene objects moving along straight-line, constant-velocity trajectories in world space.
This report describes image-based visualization research in support of video surveillance and monitoring systems. Our primary goal is to develop methods so a user can interactively visualize a 3D environment from images captured by a set of widely-separated cameras. Results include view interpolation of dynamic scenes, coarse-to-fine voxel coloring for efficient scene reconstruction, and recovering scene structure and camera motion.
Techniques for constructing three-dimensional scene models from two-dimensional images are often slow and unsuitable for interactive, real-time applications. In this paper we explore three methods of enhancing the performance of the voxel coloring reconstruction method. The first approach uses texture mapping to leverage hardware acceleration. The second approach uses spatial coherence and a coarse-to-fine strategy to focus computation on the filled parts of scene space. Finally, the multi-resolution method is extended over time to enhance performance for dynamic scenes.
This thesis addresses the problem of synthesizing images of real scenes under three-dimensional transformations in viewpoint and appearance. Solving this problem enables interactive viewing of remote scenes on a computer, in which a user can move a virtual camera through the environment and virtually paint or sculpt objects in the scene. It is demonstrated that a variety of three-dimensional scene transformations can be rendered on a video display device by applying simple transformations to a set of basis images of the scene. The virtue of these transformations is that they operate directly on images and recover only the scene information that is required in order to accomplish the desired effect. Consequently, they are applicable in situations where accurate three-dimensional models are difficult or impossible to obtain.
A central topic is the problem of view synthesis, i.e., rendering images of a real scene from different camera viewpoints by processing a set of basis images. Towards this end, two algorithms are described that warp and resample pixels in a set of basis images to produce new images that are physically-valid, i.e., they correspond to what a real camera would see from the specified viewpoints. Techniques for synthesizing other types of transformations, e.g., non-rigid shape and color transformations, are also discussed. The techniques are found to perform well on a wide variety of real and synthetic images.
A basic question is uniqueness, i.e., for which views is the appearance of the scene uniquely determined from the information present in the basis views. An important contribution is a uniqueness result for the no-occlusion case, which proves that all views on the line segment between the two camera centers are uniquely determined from two uncalibrated views of a scene. Importantly, neither dense pixel correspondence nor camera information is needed. From this result, a view morphing algorithm is derived that produces high quality viewpoint and shape transformations from two uncalibrated images.
To treat the general case of many views, a novel voxel coloring framework is introduced that facilitates the analysis of ambiguities in correspondence and scene reconstruction. Using this framework, a new type of scene invariant, called color invariant, is derived, which provides intrinsic scene information useful for correspondence and view synthesis. Based on this result, an efficient voxel-based algorithm is introduced to compute reconstructions and dense correspondence from a set of basis views. This algorithm has several advantages, most notably its ability to easily handle occlusion and views that are arbitrarily far apart, and its usefulness for panoramic visualization of scenes. These factors also make the voxel coloring approach attractive as a means for obtaining high-quality three-dimensional reconstructions from photographs.
This report summarizes the research effort at the University of Wisconsin in support of the VSAM Program. Our primary goal is to develop technologies so a user can interactively visualize and virtually modify a 3D environment from a set of images. Current approaches are described for image-based scene rendering, scene manipulation, and appearance modeling.
A novel scene reconstruction technique is presented, different from previous approaches in its ability to cope with large changes in visibility and its modeling of intrinsic scene color and texture information. The method avoids image correspondence problems by working in a discretized scene space whose voxels are traversed in a fixed visibility ordering. This strategy takes full account of occlusions and allows the input cameras to be far apart and widely distributed about the environment. The algorithm identifies a special set of invariant voxels which together form a spatial and photometric reconstruction of the scene, fully consistent with the input images. The approach is evaluated with images from both inward- and outward-facing cameras.
This paper presents a new class of interactive image editing operations designed to maintain physical consistency between multiple images of a physical 3D object. The distinguishing feature of these operations is that edits to any one image propagate automatically to all other images as if the (unknown) 3D object had itself been modified. The approach is useful first as a power-assist that enables a user to quickly modify many images by editing just a few, and second as a means for constructing and editing image-based scene representations by manipulating a set of photographs. The approach works by extending operations like image painting, scissoring, and morphing so that they alter an object's plenoptic function in a physically-consistent way, thereby affecting object appearance from all viewpoints simultaneously. A key element in realizing these operations is a new volumetric decomposition technique for reconstructing an object's plenoptic function from an incomplete set of camera viewpoints.
Image morphing techniques can generate compelling 2D transitions between images. However, differences in object pose or viewpoint often cause unnatural distortions in image morphs that are difficult to correct manually. Using basic principles of projective geometry, this paper introduces a simple extension to image morphing that correctly handles 3D projective camera and scene transformations. The technique, called view morphing, works by prewarping two images prior to computing a morph and then postwarping the interpolated images. Because no knowledge of 3D shape is required, the technique may be applied to photographs and drawings, as well as rendered scenes. The ability to synthesize changes both in viewpoint and image structure affords a wide variety of interesting 3D effects via simple image transformations.
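The interpolation step at the heart of this technique can be checked numerically. The sketch below is a minimal illustration, not the paper's implementation. It assumes two pinhole cameras that are already in the prewarped (parallel) configuration, with the same focal length and image planes parallel to the baseline, and it verifies that linearly interpolating the two projections of a scene point gives the projection seen from the interpolated camera center. The function names `project` and `interpolate` and all sample values are hypothetical.

```python
import numpy as np

def project(point, center, focal=1.0):
    """Project a 3D point through a pinhole camera at `center` looking down +z.

    Assumption: the image plane is parallel to the x-y plane, so every camera
    placed along the x-y plane is in the parallel (prewarped) configuration.
    """
    rel = np.asarray(point, dtype=float) - np.asarray(center, dtype=float)
    return focal * rel[:2] / rel[2]

def interpolate(a, b, s):
    """Linear interpolation between a and b for s in [0, 1]."""
    return (1.0 - s) * np.asarray(a, dtype=float) + s * np.asarray(b, dtype=float)

# Two parallel cameras whose baseline lies along the x axis (sample values).
c0 = np.array([0.0, 0.0, 0.0])
c1 = np.array([2.0, 0.0, 0.0])
point = np.array([0.5, -0.3, 4.0])
s = 0.25

# Morph: interpolate the two image positions of the same scene point.
morphed = interpolate(project(point, c0), project(point, c1), s)

# Reference: project the point from the interpolated camera center.
direct = project(point, interpolate(c0, c1, s))

print(np.allclose(morphed, direct))
```

For views that are not parallel, the abstract's prewarp and postwarp steps are what bring the images into and out of this configuration; those warps are not shown here.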
Photographs and paintings are limited in the amount of information they can convey due to their inherent lack of motion and depth. Using image morphing methods, it is now possible to add 2D motion to photographs by moving and blending image pixels in creative ways. We have taken this concept a step further by adding the ability to convey three-dimensional motions, such as scene rotations and viewpoint changes, by manipulating one or more photographs of a scene. The effect transforms a photograph or painting into an interactive visualization of the underlying object or scene in which the world may be rotated in 3D. Several potential applications of this technology are discussed, in areas such as virtual reality, image databases, and special effects.
This paper analyzes the conditions under which a discrete set of images implicitly describes scene appearance for a continuous range of viewpoints. It is shown that two basis views of a static scene uniquely determine the set of all views on the line between their optical centers when a visibility constraint is satisfied. Additional basis views extend the range of predictable views to 2D or 3D regions of viewpoints. A simple scanline algorithm called view morphing is presented for generating these views from a set of basis images. The technique is applicable to both calibrated and uncalibrated images.
The question of which views may be inferred from a set of basis images is addressed. Under certain conditions, a discrete set of images implicitly describes scene appearance for a continuous range of viewpoints. In particular, it is demonstrated that two basis views of a static scene determine the set of all views on the line between their optical centers. Additional basis views further extend the range of predictable views to a two- or three-dimensional region of viewspace. These results are shown to apply under perspective projection subject to a generic visibility constraint called monotonicity. In addition, a simple scanline algorithm is presented for actually generating these views from a set of basis images. The technique, called view morphing, may be applied to both calibrated and uncalibrated images. At a minimum, two basis views and their fundamental matrix are needed. Experimental results are presented on real images. This work provides a theoretical foundation for image-based representations of 3D scenes by demonstrating that perspective view synthesis is a theoretically well-posed problem.
Image warping is a popular tool for smoothly transforming one image to another. "Morphing" techniques based on geometric image interpolation create compelling visual effects, but the validity of such transformations has not been established. In particular, does 2D interpolation of two views of the same scene produce a sequence of physically valid in-between views of that scene? In this paper, we describe a simple image rectification procedure which guarantees that interpolation does in fact produce valid views, under generic assumptions about visibility and the projection process. Towards this end, it is first shown that two basis views are sufficient to predict the appearance of the scene within a specific range of new viewpoints. Second, it is demonstrated that interpolation of the rectified basis images produces exactly this range of views. Finally, it is shown that generating this range of views is a theoretically well-posed problem, requiring neither knowledge of camera positions nor 3D scene reconstruction. A scanline algorithm for view interpolation is presented that requires only four user-provided feature correspondences to produce valid orthographic views. The quality of the resulting images is demonstrated with interpolations of real imagery.
Recovering three-dimensional information from images is a principal goal of computer vision. An approach called Structure From Motion (SFM) does so without imposing strict requirements on the observer or scene. In particular, SFM assumes camera motion is unknown and the scene is only required to be static. This thesis describes a new SFM technique called Projected Error Refinement that computes the positions of feature points (i.e., structure) and the locations of the camera or observer (i.e., motion) from a noisy image sequence. The technique addresses limitations of existing SFM techniques that make them unsuitable except in controlled environments; the approach presented in this thesis models perspective projection, allows unconstrained camera motion, deals with outliers and occlusion, and is scalable. This new technique is recursive and thus is suitable for video image streams because new images can be added at any time.
Projected Error Refinement views SFM as a geometric inverse projection problem, with the goal of determining the positions of the cameras and feature points such that the projectors defined by each image optimally intersect (projectors are the lines of projection specifying the direction of each feature point from the camera's optical center). This is expressed as a global optimization problem with the objective function minimizing the mean-squared angular projection error between the solution and the observed images. Occlusion is dealt with naturally in this approach because only visible feature points define projectors that are considered during optimization; occluded features are ignored. The technique models true perspective projection and is scalable to an arbitrary number of feature points and images.
Projected Error Refinement is non-linear and uses an efficient parallel iterative refinement algorithm that takes an initial estimate of the structure and motion parameters and alternately refines the cameras' poses and the positions of the feature points in parallel. The solution can be refined to an arbitrary precision, or refinement can be terminated early when processing time is limited. The solution converges rapidly towards the global minimum even when started from a poor initial estimate. Experimental results are given for both 2D and 3D perspective projection using real and synthetic image sequences.
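The objective described above, a mean-squared angular error between observed projectors and the rays implied by the current estimate, can be sketched as follows. This is an illustrative sketch only, not the thesis's implementation: the function names, the data layout, and the assumption that observed projector directions are already expressed in world coordinates are all hypothetical. Only visible features contribute, mirroring the way occluded features are ignored.

```python
import numpy as np

def angular_error(camera_center, observed_dir, point):
    """Angle in radians between an observed projector and the ray from the
    camera center to the estimated feature position."""
    ray = np.asarray(point, dtype=float) - np.asarray(camera_center, dtype=float)
    ray = ray / np.linalg.norm(ray)
    d = np.asarray(observed_dir, dtype=float)
    d = d / np.linalg.norm(d)
    # Clip guards against rounding slightly outside arccos's domain.
    return float(np.arccos(np.clip(np.dot(ray, d), -1.0, 1.0)))

def mean_squared_angular_error(centers, points, observations):
    """Average squared angular error over visible features.

    observations: list of (camera_index, point_index, direction) tuples,
    one per feature that is actually visible in that image (hypothetical layout).
    """
    errors = [angular_error(centers[c], d, points[p]) ** 2
              for c, p, d in observations]
    return float(np.mean(errors))

# Sample values: one camera at the origin, one feature straight ahead.
centers = [np.zeros(3)]
points = [np.array([0.0, 0.0, 5.0])]
exact = [(0, 0, np.array([0.0, 0.0, 1.0]))]
mixed = exact + [(0, 0, np.array([1.0, 0.0, 0.0]))]

print(mean_squared_angular_error(centers, points, exact))
print(mean_squared_angular_error(centers, points, mixed))
```

An alternating refinement in the spirit of the abstract would minimize this quantity over the camera poses with the points held fixed, then over the points with the cameras held fixed; the optimizer itself is not shown.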
The projected deformation of stationary contours and markings on object surfaces is analyzed in this paper. It is shown that given a marked point on a stationary contour, an active observer can move deterministically to the osculating plane for that point by observing and controlling the deformation of the projected contour. Reaching the osculating plane enables the observer to recover the object surface shape along the contour as well as the Frenet frame of the contour. Complete local surface recovery requires either two intersecting surface contours and the knowledge of one principal direction, or more than two intersecting contours. To reach the osculating plane, two strategies involving both pure translation and a combination of translation and rotation are analyzed. Once the Frenet frame for the marked point on the contour is recovered, the same information for all points on the contour can be recovered by staying on osculating planes while moving along the contour. It is also shown that occluding contours and stationary contours deform in a qualitatively different way and the problem of discriminating between these two types of contours can be resolved before the recovery of local surface shape.
We present an approach for recovering surface shape from the occluding contour using an active (i.e., moving) observer. It is based on a relation between the geometries of a surface in a scene and its occluding contour: If the viewing direction of the observer is along a principal direction for a surface point whose projection is on the contour, surface shape (i.e., curvature) at the surface point can be recovered from the contour. Unlike previous approaches for recovering shape from the occluding contour, we use an observer that purposefully changes viewpoint in order to achieve a well-defined geometric relationship with respect to a 3D shape prior to its recognition. We show that there is a simple and efficient viewing strategy that allows the observer to align the viewing direction with one of the two principal directions for a point on the surface. This strategy depends on only curvature measurements on the occluding contour and therefore demonstrates that recovering quantitative shape information from the contour does not require knowledge of the velocities or accelerations of the observer. Experimental results demonstrate that our method can be easily implemented and can provide reliable shape information from the occluding contour.
We present an approach for identifying the occluding contour and determining its sidedness using an active (i.e., moving) observer. It is based on the non-stationarity property of the visible rim: When the observer's viewpoint is changed, the visible rim is a collection of curves that "slide," rigidly or non-rigidly, over the surface. We show that the observer can deterministically choose three views on the tangent plane of selected surface points to distinguish such curves from stationary surface curves (i.e., surface markings). Our approach demonstrates that the occluding contour can be identified directly, i.e., without first computing surface shape (distance and curvature).
What viewpoint-control strategies are important for performing global visual exploration tasks such as searching for specific surface markings, building a global model of an arbitrary object, or recognizing an object? In this paper we consider the task of purposefully controlling the motion of an active, monocular observer in order to recover a global description of a smooth, arbitrarily-shaped object. We formulate global surface reconstruction as the task of controlling the motion of the observer so that the visible rim slides over the maximal, connected, reconstructible surface regions intersecting the visible rim at the initial viewpoint. We show that these regions are bounded by a subset of the visual event curves defined on the surface.
By studying the epipolar parameterization, we develop two basic strategies that allow reconstruction of a surface region around any point in a reconstructible surface region. These strategies control viewpoint to achieve and maintain a well-defined geometric relationship with the object's surface, rely only on information extracted directly from images (e.g., tangents to the occluding contour), and are simple enough to be performed in real time. We then show how global surface reconstruction can be provably achieved by (1) appropriately integrating these strategies to iteratively "grow" the reconstructed regions, and (2) obeying four simple rules.
We present an approach for recovering a global surface model of an object from the deformation of the occluding contour using an active (i.e., mobile) observer able to control its motion. In particular, we consider two problems: (1) How can the observer's viewpoint be controlled in order to generate a dense sequence of images that allows incremental reconstruction of an unknown surface, and (2) how can we construct a global surface model from the generated image sequence? Solving these two problems is crucial for automatically constructing models of objects whose surface is non-convex and self-occludes. We achieve the first goal by purposefully and qualitatively controlling the observer's instantaneous direction of motion in order to control the motion of the visible rim over the surface. We achieve the second goal by using a calibrated trinocular camera rig and a mechanism for controlling the relative position and orientation of the viewed surface with respect to the trinocular rig.
In this thesis we study how controlled movements of a camera can be used to infer properties of a curved object's three-dimensional shape. The unknown geometry of an environment's objects, the effects of self-occlusion, the depth ambiguities caused by the projection process, and the presence of noise in image measurements are a few of the complications that make object-dependent movements of the camera advantageous in certain shape recovery tasks. Such movements can simplify local shape computations such as curvature estimation, allow use of weaker camera calibration assumptions, and enable the extraction of global shape information for objects with complex surface geometry. The utility of object-dependent camera movements is studied in the context of three tasks, each involving the extraction of progressively richer information about an object's unknown shape: (1) detecting the occluding contour, (2) estimating surface curvature for points projecting to the contour, and (3) building a three-dimensional model for an object's entire surface. Our main result is the development of three distinct active vision strategies that solve these three tasks by controlling the motion of a camera.
Occluding contour detection and surface curvature estimation are achieved by exploiting the concept of a special viewpoint: For any image there exist special camera positions from which the object's view trivializes these tasks. We show that these positions can be deterministically reached, and that they enable shape recovery even when few or no markings and discontinuities exist on the object's surface, and when differential camera motion measurements cannot be accurately obtained.
A basic issue in building three-dimensional global object models is how to control the camera's motion so that previously-unreconstructed regions of the object become reconstructed. A fundamental difficulty is that the set of reconstructed points can change unpredictably (e.g., due to self-occlusions) when ad hoc motion strategies are used. We show how global model-building can be achieved for generic objects of arbitrary shape by controlling the camera's motion on automatically-selected surface tangent and normal planes so that the boundary of the already-reconstructed regions is guaranteed to "slide" over the object's entire surface.
Our work emphasizes the need for (1) controlling camera motion through efficient processing of the image stream, and (2) designing provably-correct strategies, i.e., strategies whose success can be accurately characterized in terms of the geometry of the viewed object. For each task, efficiency is achieved by extracting from each image only the information necessary to move the camera differentially, assuming a dense sequence of images, and using 2D rather than 3D information to control camera motion. Provable correctness is achieved by controlling camera motion based on the occluding contour's dynamic shape and maintaining specific task-dependent geometric constraints that relate the camera's motion to the differential geometry of the object.
We consider the following problem: How should an observer change viewpoint in order to generate a dense image sequence of an arbitrary smooth surface so that it can be incrementally reconstructed using the occluding contour and the epipolar parameterization? We present a collection of qualitative behaviors that, when integrated appropriately, purposefully control viewpoint based on the appearance of the surface in order to provably solve this problem.
We present a viewing strategy for exploring the surface of an unknown object (i.e., making all of its points visible) by purposefully controlling the motion of an active observer. It is based on a simple relation between (1) the instantaneous direction of motion of the observer, (2) the visibility of points projecting to the occluding contour, and (3) the surface normal at those points: If the dot product of the surface normal at such points and the observer's velocity is positive, the visibility of the points is guaranteed under an infinitesimal viewpoint change. We show that this leads to an object exploration strategy in which the observer purposefully controls its motion based on the occluding contour in order to impose structure on the set of surface points explored, make its representation simple and qualitative, and provably solve the exploration problem for smooth generic surfaces of arbitrary shape. Unlike previous approaches where exploration is cast as a discrete process (i.e., asking where to look next?) and where the successful exploration of arbitrary objects is not guaranteed, our approach demonstrates that dynamic viewpoint control through directed observer motion leads to a qualitative exploration strategy that is provably-correct, depends only on the dynamic appearance of the occluding contour, and does not require the recovery of detailed three-dimensional shape descriptions from every position of the observer.
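The relation stated above reduces to a sign test on a dot product, which the sketch below illustrates. It is a minimal illustration under assumed conventions, not the paper's exploration strategy: the function name `remains_visible`, the vector representation, and the sample values are hypothetical, and the test only reflects the infinitesimal-motion condition quoted in the abstract.

```python
import numpy as np

def remains_visible(surface_normal, observer_velocity):
    """Return True when the dot product of the surface normal at a point on
    the occluding contour and the observer's velocity is positive."""
    n = np.asarray(surface_normal, dtype=float)
    v = np.asarray(observer_velocity, dtype=float)
    return bool(np.dot(n, v) > 0.0)

# Hypothetical sample values: one surface normal and two candidate velocities.
normal = np.array([1.0, 0.0, 0.0])
toward = np.array([0.5, 0.2, 0.0])   # positive component along the normal
away = np.array([-1.0, 0.3, 0.0])    # negative component along the normal

print(remains_visible(normal, toward))
print(remains_visible(normal, away))
```

An observer could use such a test to reject candidate motion directions for which the condition fails at the contour points of interest.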
An approach is presented for exploring an unknown, arbitrary surface in three-dimensional (3D) space by a mobile robot. The main contributions are (1) an analysis of the capabilities a robot must possess and the trade-offs involved in the design of an exploration strategy, and (2) two provably-correct exploration strategies that exploit these trade-offs and use visual sensors (e.g., cameras and range sensors) to plan the robot's motion. No such analysis existed previously for the case of a robot moving freely in 3D space. The approach exploits the notion of the occlusion boundary, i.e., the points separating the visible from the occluded parts of an object. The occlusion boundary is a collection of curves that "slide" over the surface when the robot's position is continuously controlled, inducing the visibility of surface points over which they slide. The paths generated by our strategies force the occlusion boundary to slide over the entire surface. The strategies provide a basis for integrating motion planning and visual sensing under a common computational framework.
We present an approach for solving the path planning problem for a mobile robot operating in an unknown, three-dimensional environment containing obstacles of arbitrary shape. The main contributions of this paper are (1) an analysis of the type of sensing information that is necessary and sufficient for solving the path planning problem in such environments, and (2) the development of a framework for designing a provably-correct algorithm to solve this problem. Working from first principles, without any assumptions about the environment of the robot or its sensing capabilities, our analysis shows that the ability to explore the obstacle surfaces (i.e., to make all their points visible) is intrinsically linked with the ability to plan the motion of the robot. We argue that current approaches to the path planning problem with incomplete information simply do not extend to the general three-dimensional case, and that qualitatively different algorithms are needed.
This paper presents a general framework for image-based analysis of 3D repeating motions that addresses two limitations in the state of the art. First, the assumption that a motion be perfectly even from one cycle to the next is relaxed. Real repeating motions tend not to be perfectly even, i.e., the length of a cycle varies through time because of physically important changes in the scene. A generalization of period is defined for repeating motions that makes this temporal variation explicit. This representation, called the period trace, is compact and purely temporal, describing the evolution of an object or scene without reference to spatial quantities such as position or velocity. Second, the requirement that the observer be stationary is removed. Observer motion complicates image analysis because an object that undergoes a 3D repeating motion will generally not produce a repeating sequence of images. Using principles of affine invariance, we derive necessary and sufficient conditions for an image sequence to be the projection of a 3D repeating motion, accounting for changes in viewpoint and other camera parameters. Unlike previous work in visual invariance, however, our approach is applicable to objects and scenes whose motion is highly non-rigid. Experiments on real image sequences demonstrate how the approach may be used to detect several types of purely temporal motion features, relating to motion trends and irregularities. Applications to athletic and medical motion analysis are discussed.
A new technique is presented for computing 3D scene structure from point and line features in monocular image sequences. Unlike previous methods, the technique guarantees the completeness of the recovered scene, ensuring that every scene feature that is detected in each image is reconstructed. The approach relies on the presence of four or more reference features whose correspondences are known in all the images. Under an orthographic or affine camera model, the parallax of the reference features provides constraints that simplify the recovery of the rest of the visible scene. An efficient recursive algorithm is described that uses a unified framework for point and line features. The algorithm integrates the tasks of feature correspondence and structure recovery, ensuring that all reconstructible features are tracked. In addition, the algorithm is immune to outliers and feature-drift, two weaknesses of existing structure-from-motion techniques. Experimental results are presented for real images.
Real cyclic motions tend not to be perfectly even, i.e., the period varies slightly from one cycle to the next, because of physically important changes in the scene. A generalization of period is defined for cyclic motions that makes periodic variation explicit. This representation, called the period trace, is compact and purely temporal, describing the evolution of an object or scene without reference to spatial quantities such as position or velocity. By delimiting cycles and identifying correspondences across cycles, the period trace provides a means of temporally registering a cyclic motion. In addition, several purely temporal motion features are derived, relating to the nature and location of irregularities. Results are presented using real image sequences and applications to athletic and medical motion analysis are discussed.
Current approaches for detecting periodic motion assume a stationary camera and place limits on an object's motion. These approaches rely on the assumption that a periodic motion projects to a set of periodic image curves, an assumption that is invalid in general. Using affine-invariance, we derive necessary and sufficient conditions for an image sequence to be the projection of a periodic motion. No restrictions are placed on either the motion of the camera or the object. Our algorithm is shown to be provably-correct for noise-free data and is extended to be robust with respect to occlusions and noise. The extended algorithm is evaluated with real and synthetic image sequences.
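One standard consequence of the affine camera model can be sketched as follows: if two images show the same 3D point configuration, the four centered coordinate rows (x and y from each image) span at most a three-dimensional space. The code below scores a pair of images by a normalized Gram determinant, which is zero exactly when that rank condition holds. This is an assumed illustration of the kind of affine-invariant test involved, not necessarily the measure used in the paper; all names and the example shapes are hypothetical.

```python
def project(points, camera):
    """Project 3D points with a 2x3 affine camera matrix (translation omitted,
    since centering the image coordinates removes it anyway)."""
    return [
        tuple(sum(c * v for c, v in zip(row, p)) for row in camera)
        for p in points
    ]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return det


def affine_mismatch(view_a, view_b):
    """Scale-normalized Gram determinant of the four centered coordinate rows.
    Near zero when both views can be affine images of one 3D configuration."""
    rows = [
        [p[0] for p in view_a], [p[1] for p in view_a],
        [p[0] for p in view_b], [p[1] for p in view_b],
    ]
    rows = [[v - sum(r) / len(r) for v in r] for r in rows]
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]
    scale = (sum(gram[i][i] for i in range(4)) / 4.0) ** 4
    return abs(determinant(gram)) / scale if scale > 0 else 0.0


shape = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1)]
deformed = shape[:3] + [(0.5, 0, 1)] + shape[4:]  # one point has moved
cam_a = [(1, 0, 0), (0, 1, 0)]
cam_b = [(1, 0, 0.5), (0, 1, 0.3)]

same = affine_mismatch(project(shape, cam_a), project(shape, cam_b))
diff = affine_mismatch(project(shape, cam_a), project(deformed, cam_b))
print(same, diff)
```

With noisy data the score would be compared against a threshold rather than zero; how that threshold is chosen is outside the scope of this sketch.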
To date, the overwhelming use of motion in computational vision has been to recover the three-dimensional structure of the scene. We propose that there are other, more powerful, uses for motion. Toward this end, we define dynamic perceptual organization as an extension of the traditional (static) perceptual organization approach. Just as static perceptual organization groups coherent features in an image, dynamic perceptual organization groups coherent motions through an image sequence. Using dynamic perceptual organization, we propose a new paradigm for motion understanding and show why it can be done independently of the recovery of scene structure and scene motion. The paradigm starts with a spatiotemporal cube of image data and organizes the paths of points so that interactions between the paths and perceptual motions such as common, relative and cyclic are made explicit. The results of this can then be used for high-level motion recognition tasks.
We address the problem of qualitative shape recovery from moving surfaces. Our analysis is unique in that we consider specular interreflections and explore the effects of both motion parallax and changes in shading. To study this situation we define an image flow field called the reflection flow field, which describes the motion of reflection points and the motion of the surface. From a kinematic analysis, we show that the reflection flow is qualitatively different from the motion parallax because it is discontinuous at or near parabolic curves. We also show that when the gradient of the reflected image is strong, gradient-based flow measurement techniques approximate the reflection flow field and not the motion parallax. We conclude from these analyses that reliable qualitative shape information is generally available only at discontinuities in the image flow field.
Recovering a hierarchical motion description of a long image sequence is one way to recognize objects and their motions. Intermediate-level and high-level motion analysis, i.e., recognizing a coordinated sequence of events such as walking and throwing, has been formulated previously as a process that follows high-level object recognition. This thesis develops an alternative approach to intermediate-level and high-level motion analysis. It does not depend on complex object descriptions and can therefore be computed prior to object recognition. Toward this end, a new computational framework for low and intermediate-level processing of long sequences of images is presented.
Our new computational framework uses spatiotemporal (ST) surface flow and ST flow curves. As contours move, their projections into the image also move. Over time, these projections sweep out ST surfaces. Thus, these surfaces are direct representations of object motion. ST surface flow is defined as the natural extension of optical flow to ST surfaces. For every point on an ST surface, the instantaneous velocity of that point on the surface is recovered. It is observed that arc length of a rigid contour does not change if that contour is moved in the direction of motion on the ST surface. Motivated by this observation, a function measuring arc length change is defined. The direction of motion of a contour undergoing motion parallel to the image plane is shown to be perpendicular to the gradient of this function.
ST surface flow is then used to recover ST flow curves. ST flow curves are defined such that the tangent at a point on the curve equals the ST surface flow at that point. ST flow curves are then grouped so that each cluster represents a temporally-coherent structure, i.e., structures that result from an object or surface in the scene undergoing motion. Using these clusters of ST flow curves, separate moving objects in the scene can be hypothesized and occlusion and disocclusion between them can be identified.
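The definition of an ST flow curve (its tangent equals the flow at each point) can be illustrated by integrating a flow field through time. The sketch below assumes the flow is available as a function of image position and time and uses simple Euler steps; the function names and the constant-translation flow are hypothetical, and the recovery of the flow itself from ST surfaces is not shown.

```python
def trace_flow_curve(start, flow, steps, dt=1.0):
    """Euler-integrate a curve in (x, y, t) space whose tangent at each point
    is (vx, vy, 1), where (vx, vy) = flow(x, y, t)."""
    x, y, t = start
    curve = [(x, y, t)]
    for _ in range(steps):
        vx, vy = flow(x, y, t)
        x, y, t = x + vx * dt, y + vy * dt, t + dt
        curve.append((x, y, t))
    return curve


def constant_translation(x, y, t):
    """Hypothetical flow field: every point drifts right and slightly down."""
    return (1.0, 0.5)


curve = trace_flow_curve((0.0, 0.0, 0.0), constant_translation, steps=4)
print(curve[-1])
```

Curves traced from neighboring start points under the same motion stay parallel, which is the kind of coherence a grouping step could exploit.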
The problem of detecting cyclic motion, while recognized by the psychology community, has received very little attention in the computer vision community. In order to show the representational power of ST flow curves, cyclic motion is detected using ST flow curves without prior recovery of complex object descriptions.
Computational vision is about why a biological vision system functions as it does and how to emulate its performance on computers. The central topics of this thesis are how a differential geometry language can be used to describe the essential elements of visual perception in both 2D and 3D domains, and how the components of this geometric language can be computed in ways closely related to how the human visual system performs similar functions.
The thesis starts by showing that at the earliest stage of vision, biological systems implement a mechanism that is computationally equivalent to computing local geometric invariants at the two-dimensional curve level. The availability of this information establishes the foundation for computing components of a differential geometry language from sensory inputs. The mathematical framework of scale space that makes this computational approach possible, likewise, has its biological basis.
On the other hand, visual perception is a global phenomenon that occurs generally in a 3D space. To understand this process and design computational systems that have comparable performance to humans requires specification of how a 2D local computational mechanism can be used in this global 3D environment. This goal is achieved through two steps. First, a global surface representation formulation is extended from the 2D framework. It is shown how local geometric features that are sparse and perceptually meaningful can be naturally used to represent global 3D surfaces. Second, active motion by an observer is introduced as an additional dimension to the data set so that the observer becomes mobile and can react to observations or verify hypotheses actively. This also makes dynamical data such as optical flow available to the observer. These added abilities enable the observer to perform tasks such as surface recovery and 3D navigation. In addition, the modeling process of 3D objects is naturally constrained by the computational resources available to the observer so that the model is inherently incremental.
This thesis contributes in the following areas: (1) direct computation of 2D differential geometric invariants from images using methods comparable to the human vision system, (2) perception-based global representations of 2D and 3D objects using geometric invariants, (3) novel methods for optical flow computation and segmentation, and (4) active methods for global surface recovery and navigation using stationary contours, apparent contours, and textured surfaces.
Currently the aspect graph is computed from the theoretical standpoint of perfect resolution in object shape, the viewpoint and the projected image. This means that the aspect graph may include details that an observer could never see in practice. Introducing the notion of scale into the aspect graph framework provides a mechanism for selecting a level of detail that is "large enough" to merit explicit representation. This effectively allows control over the number of nodes retained in the aspect graph. This paper introduces the concept of the scale space aspect graph, defines three different interpretations of the scale dimension, and presents a detailed example for a simple class of objects, with scale defined in terms of the spatial extent of features in the image.
In this paper we present the geometry and the algorithms for organizing a viewer-centered representation of the occluding contour of polyhedra. The contour is computed from a polyhedral boundary model as it would appear under orthographic projection into the image plane from every viewpoint on the view sphere. Using this representation, we show how to derive constraints on regions in viewpoint space from the relationship between detected image features and our precomputed contour model. Such constraints are based on both qualitative (viewpoint extent) and quantitative (angle measurements and relative geometry) information that has been precomputed about how the contour appears in the image plane as a set of projected curves and T-junctions from self-occlusion. The results we show from an experimental system demonstrate that features of the occluding contour can be computed in a model-based framework, and their geometry constrains the viewpoints from which a model will project to a set of occluding contour features in an image.
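For intuition about which edges can appear on the occluding contour under orthographic projection, the sketch below uses the standard test that an edge lies on the contour generator when one adjacent face points toward the viewer and the other points away. It is limited to a cube, where no other part of the object can hide such an edge; the face labels and function names are assumptions for the example, and the paper's viewpoint-space organization is not reproduced.

```python
from itertools import combinations

# Outward face normals of an axis-aligned cube, keyed by a hypothetical label.
CUBE_FACES = {
    "+x": (1, 0, 0), "-x": (-1, 0, 0),
    "+y": (0, 1, 0), "-y": (0, -1, 0),
    "+z": (0, 0, 1), "-z": (0, 0, -1),
}


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cube_edges():
    """Each cube edge is shared by two faces whose normals are perpendicular."""
    return [
        (f, g)
        for f, g in combinations(CUBE_FACES, 2)
        if dot(CUBE_FACES[f], CUBE_FACES[g]) == 0
    ]


def rim_edges(to_viewer):
    """Edges whose two faces straddle the viewing direction: one faces the
    viewer (positive dot product) and the other faces away (negative)."""
    rim = []
    for f, g in cube_edges():
        facing_f = dot(CUBE_FACES[f], to_viewer)
        facing_g = dot(CUBE_FACES[g], to_viewer)
        if facing_f * facing_g < 0:
            rim.append((f, g))
    return rim


edges = cube_edges()
rim = rim_edges((1.0, 2.0, 3.0))
print(len(edges), len(rim))
```

From a general viewing direction the cube's outline is a hexagon, so six of the twelve edges are reported. Directions parallel to a face make one of the dot products zero, and the strict inequality then excludes the affected edges; these are the degenerate viewpoints at which the contour structure changes.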
In this paper we present the geometry and the algorithms for organizing and using a viewer-centered representation of the occluding contour of polyhedra. The representation is computed from a polyhedral model under orthographic projection for all viewing directions. Using this representation, we derive constraints on viewpoint correspondences between image features and model contours. Our results show that the occluding contour, computed in a model-based framework, can be used to strongly constrain the viewpoints where a 3D model matches the occluding contour features of the image.
A fundamental problem common to both computer graphics and model-based computer vision is how to efficiently model the appearance of a shape. Appearance is obtained procedurally by applying a projective transformation to a three-dimensional object-centered shape representation. This thesis presents a viewer-centered representation that is based on the visual event, a viewpoint where a specific change in the structure of the projected model occurs. We present and analyze the basis of this viewer-centered representation and the algorithms for its construction. Variations of this visual-event-based representation are applied to two specific problems: hidden line/surface display, and the solution for model pose given an image contour.
The problem of how to efficiently display a polyhedral scene over a path of viewpoints is cast as a problem of computing visual events along that path. A visual event is a viewpoint that causes a change in the structure of the image structure graph, a model's projected line drawing. The information stored with a visual event is sufficient to update a representation of the image structure graph. Thus the visible lines of a scene can be displayed as viewpoint changes by first precomputing and storing visual events, and then using those events at display time to interactively update the image structure graph. Display rates comparable to wire-frame display are achieved for large polyhedral models.
The rim appearance representation is a new, viewer-centered, exact representation of the occluding contour of polyhedra. We present an algorithm based on the geometry of polyhedral self-occlusion and on visual events for computing a representation of the exact appearance of occluding contour edges. The rim appearance representation, organized as a multi-level model of the occluding contour, is used to constrain the viewpoints of a three-dimensional model that can produce a set of detected occluding-contour features. Implementation results demonstrate that precomputed occluding-contour information efficiently and tightly constrains the pose of a model while consistently accounting for detected occluding-contour features.
This paper considers the problem of modeling and extracting arbitrary deformable contours from noisy images. We propose a global contour model based on a stable and regenerative shape matrix, which is invariant and unique under rigid motions. Combined with a Markov random field to model local deformations, this yields a prior distribution that exerts influence over a global model while allowing for deformations. We then cast the problem of extraction into posterior estimation and show its equivalence to energy minimization of a generalized active contour model. We discuss pertinent issues in shape training, energy minimization, line search strategies, minimax regularization and initialization by the generalized Hough transform. Finally, we present experimental results and compare the model's performance to rigid template matching.
Recently, we proposed the generalized active contour model (g-snake) to model and extract deformable contours from noisy images. This paper demonstrates the usefulness of g-snake in classifying among several candidate deformable contours. The g-snake is suitable for this task because its shape representation is unique, affine invariant and possesses metric properties. We derive the optimal classification test and show that this requires marginalization of the distribution. However, as the summation is peaked around the posterior estimate in most practical applications, only small regions need to be considered. Finally, we performed extensive experiments and report a significant improvement over template matching in handwritten numeral recognition.
This thesis presents an integrated approach to modeling, extracting, detecting and classifying deformable contours directly from noisy images. We begin by conducting a case study on regularization, formulation and initialization of the active contour models (snakes). Using the minimax principle, we derive a regularization criterion whereby the values can be automatically and implicitly determined along the contour. Furthermore, we formulate a set of energy functionals which yield snakes that contain the Hough transform as a special case. Subsequently, we consider the problem of modeling and extracting arbitrary deformable contours from noisy images. We combine a stable, invariant and unique contour model with a Markov random field to yield a prior distribution that exerts influence over an arbitrary global model while allowing for deformation. Under the Bayesian framework, contour extraction turns into posterior estimation, which is in turn equivalent to energy minimization in a generalized active contour model. Finally, we integrate these lower level visual tasks with pattern recognition processes of detection and classification. Based on the Neyman-Pearson lemma, we derive the optimal detection and classification tests. As the summation is peaked in most practical applications, only small regions need to be considered in marginalizing the distribution. The validity of our formulation has been confirmed by extensive and rigorous experimentation.
In snake formulation, large regularization enhances the robustness against noise and incomplete data, while small values increase the accuracy in capturing boundary variations. We present a local minimax criterion which automatically determines the optimal regularization at every location along the boundary with no added computational cost. We also modify existing energy formulations to repair deficiencies in internal energy and improve performance in external energy. This yields snakes that contain the Hough transform as a special case. We can therefore initialize the snake efficiently and reliably using the Hough transform.
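The trade-off between strong and weak regularization can be seen in a toy relaxation of a closed contour. In the sketch below each point is repeatedly moved to a weighted average of its neighbors' midpoint (smoothness) and its observed position (data). The single global weight `lam`, the function names, and the zigzag test contour are assumptions for illustration; the local minimax choice of the weight described in the abstract is not implemented here.

```python
import math


def relax_contour(data, lam, iterations=50):
    """Relax a closed contour: each point becomes
    lam * midpoint(neighbors) + (1 - lam) * observed position."""
    current = [tuple(p) for p in data]
    n = len(data)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            (px, py), (qx, qy) = current[i - 1], current[(i + 1) % n]
            mx, my = (px + qx) / 2.0, (py + qy) / 2.0
            dx, dy = data[i]
            updated.append((lam * mx + (1 - lam) * dx, lam * my + (1 - lam) * dy))
        current = updated
    return current


def roughness(contour):
    """Sum of squared second differences around the closed contour."""
    n = len(contour)
    total = 0.0
    for i in range(n):
        (px, py), (cx, cy), (qx, qy) = contour[i - 1], contour[i], contour[(i + 1) % n]
        total += (px - 2 * cx + qx) ** 2 + (py - 2 * cy + qy) ** 2
    return total


# Hypothetical observation: a unit circle with an alternating radial error.
noisy = [
    ((1 + 0.1 * (-1) ** i) * math.cos(2 * math.pi * i / 16),
     (1 + 0.1 * (-1) ** i) * math.sin(2 * math.pi * i / 16))
    for i in range(16)
]
rough_raw = roughness(noisy)
rough_low = roughness(relax_contour(noisy, lam=0.1))
rough_high = roughness(relax_contour(noisy, lam=0.9))
print(rough_raw, rough_low, rough_high)
```

A large weight removes most of the zigzag but also pulls the contour inward, away from the observed boundary, which is the loss of accuracy that motivates choosing the weight per location.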
In this thesis we develop a system that makes scientific computations visible and enables physical scientists to perform visual experiments with their computations. Our approach is unique in the way it integrates visualization with a scientific programming language. Data objects of any user-defined data type can be displayed, and can be displayed in any way that satisfies broad analytic conditions, without requiring graphics expertise from the user. Furthermore, the system is highly interactive.
In order to achieve generality in our architecture, we first analyze the nature of scientific data and displays, and the visualization mappings between them. Scientific data and displays are usually approximations to mathematical objects (i.e., variables, vectors and functions) and this provides a natural way to define a mathematical lattice structure on data models and display models. Lattice-structured models provide a basis for integrating certain forms of scientific metadata into the computational and display semantics of data, and also provide a rigorous interpretation of certain expressiveness conditions on the visualization mapping from data to displays. Visualization mappings satisfying these expressiveness conditions are lattice isomorphisms. Applied to the data types of a scientific programming language, this implies that visualization mappings from data aggregates to display aggregates can always be decomposed into mappings of data primitives to display primitives.
These results provide very flexible data and display models, and provide the basis for flexible and easy-to-use visualization of data objects occurring in scientific computations.
We describe techniques that enable Earth and space scientists to interactively visualize and experiment with their computations. Numerical simulations of the Earth's atmosphere and oceans generate large and complex data sets, which we visualize in a highly interactive virtual Earth environment. We use data compression and distributed computing to maximize the size of simulations that can be explored, and a user interface tuned to the needs of environmental modelers. For the broader class of computations used by scientists we have developed more general techniques, integrating visualization with an environment for developing and executing algorithms. The key is providing a flexible data model that lets users define data types appropriate for their algorithms, and also providing a display model that lets users visualize those data types without placing a substantial burden of graphics knowledge on them.
In order to develop a foundation for visualization, we develop lattice models for data objects and displays that focus on the fact that data objects are approximations to mathematical objects and real displays are approximations to ideal displays. These lattice models give us a way to quantize the information content of data and displays and to define conditions on the visualization mappings from data to displays. Mappings satisfy these conditions if and only if they are lattice isomorphisms. We show how to apply this result to scientific data and display models, and discuss how it might be applied to recursively defined data types appropriate for complex information processing.
We present a technique for defining graphical depictions for all the data types defined in an algorithm. The ability to display arbitrary combinations of an algorithm's data objects in a common frame of reference, coupled with interactive control of algorithm execution, provides a powerful way to understand algorithm behavior. Type definitions are constrained so that all primitive values occurring in data objects are assigned scalar types. A graphical display, including user interaction with the display, is modeled by a special data type. Mappings from the scalar types into the display model type provide a simple user interface for controlling how all data types are depicted, without the need for type-specific graphics logic.
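The idea that a single table of scalar mappings can drive the depiction of every data type can be sketched as follows. Data objects are modeled here as nested dictionaries and lists whose leaves are tagged with a scalar type name, and one recursive function decomposes any such object into display primitives. The representation, the names `depict` and `scalar_map`, and the display scalar labels are hypothetical and are not taken from the system described.

```python
def depict(data_object, scalar_map):
    """Recursively decompose a data object into (display_scalar, value) pairs.
    Leaves whose scalar type has no mapping are simply not displayed."""
    primitives = []
    if isinstance(data_object, dict):
        for scalar_type, value in data_object.items():
            if isinstance(value, (dict, list, tuple)):
                primitives.extend(depict(value, scalar_map))
            elif scalar_type in scalar_map:
                primitives.append((scalar_map[scalar_type], value))
    elif isinstance(data_object, (list, tuple)):
        for element in data_object:
            primitives.extend(depict(element, scalar_map))
    return primitives


# Hypothetical data object: a time step holding a list of samples.
observation = {
    "time": 3.0,
    "samples": [
        {"latitude": 10.0, "temperature": 280.0},
        {"latitude": 20.0, "temperature": 275.0},
    ],
}
scalar_map = {"time": "animation", "latitude": "y_axis", "temperature": "color"}

display = depict(observation, scalar_map)
print(display)
```

Changing how temperature is drawn only requires editing one entry in `scalar_map`; no code specific to the `samples` structure is needed, which mirrors the absence of type-specific graphics logic.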