Figure-Ground Organization in Natural Images



Introduction

Figure-ground organization is a step of perceptual organization that assigns each contour to one of its two abutting regions. Commonly thought to follow region segmentation, it is an essential step in forming our perception of surfaces, shapes and objects, as demonstrated by the pictures in Figure 1. These pictures are highly ambiguous and we may perceive either side as the figure and "see" its shape. We always perceive the ground side as shapeless and extended behind the figure, never seeing both shapes simultaneously.
Figure 1: classical illusions in figure/ground organization.

In computer vision, figure-ground organization has been a quiet front, and virtually no effort has been made to understand its roles and implications. Perhaps this is for a good reason: figure-ground organization is a difficult mid-level vision problem, and, in the context of complex natural scenes, we do not know whether figure-ground organization is even achievable bottom-up. This has led to the belief that figure-ground organization is merely the result of a top-down process, occurring only after we recognize objects and understand their layout in a scene.

Recently a large figure-ground dataset of natural images has been collected and labeled by human subjects. Such large-scale groundtruth data enable us to quantitatively study the figure-ground problem in natural images. In this work we develop a bottom-up figure-ground approach that combines local cues (e.g. convexity) and global consistency (e.g. T-junction analysis).

We show that there is rich figure-ground information available at mid-level, both locally and globally. Quantitatively our approach produces promising figure-ground labelings without recognizing objects or estimating depth. Hence we "prove" that bottom-up figure-ground organization is feasible. Such mid-level processing holds great potential for scene understanding and object recognition.

Local Figure-Ground Cues

The classical Gestalt theory on figure-ground lists a number of "principles" or cues, such as convexity, parallelism, size and symmetry. Many of these cues may be defined locally, without requiring a full segmentation. It is, however, not easy to translate these intuitive principles into mathematical definitions.

In this work, we take a learning approach to local figure-ground, by grouping local shapes into shapemes and collecting figure-ground statistics for these shape clusters.
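The clustering step can be sketched as follows. This is a minimal illustration, not the paper's implementation: a plain k-means loop over descriptor vectors, with random vectors standing in for the actual Geometric Blur descriptors, and all variable names (`kmeans`, `X`, `centers`) hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: cluster local shape descriptors into 'shapemes'."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers, labels

# Stand-in data: 500 random 64-d "descriptors"
# (the actual work clusters Geometric Blur descriptors of edge patches)
X = np.random.default_rng(1).normal(size=(500, 64))
centers, labels = kmeans(X, k=8)
print(centers.shape)  # (8, 64)
```

Each resulting cluster center plays the role of one shapeme; a novel edge patch is assigned to its nearest center.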

Figure 2: shapemes, or local shape clusters, learned from data using the Geometric Blur descriptor. A simple clustering of local shapes reveals interesting structures such as convexity, parallelism as well as straight lines, corners and line endings.

Figure 3: shapemes encode rich figure-ground information. Here are some shapemes and their figure-ground statistics (figure-on-left percentages, left to right: 93.8%, 89.6%, 66.5%, 49.8%, 11.7%, 5.0%): we align each shapeme such that the contour orientation at the center is vertical, and count the percentage of cases in which the figure side is to the left. As Gestalt theories predict, parallelism is a strong figure-ground cue: for shapeme 1, the figure lies between the two parallel lines, hence to the left of the center. Convexity is also a strong cue (shapeme 2), while a straight line gives no information (shapeme 4).
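Collecting such statistics amounts to a per-cluster average over the human labels. The sketch below uses synthetic stand-in data; the cluster assignments, labels, and the name `p_left` are all hypothetical.

```python
import numpy as np

# Hypothetical training data: for each local edge patch, its shapeme index
# and whether the human-marked figure side lies to the left (after aligning
# the center contour orientation to vertical).
rng = np.random.default_rng(0)
shapeme = rng.integers(0, 4, size=1000)     # cluster index per patch
figure_left = rng.random(1000) < 0.7        # stand-in human labels

# Fraction of patches whose figure side is to the left, per shapeme.
# This estimates the local cue P(figure on left | shapeme).
k = shapeme.max() + 1
p_left = np.array([figure_left[shapeme == j].mean() for j in range(k)])
print(np.round(p_left, 2))
```

A shapeme with `p_left` near 1 or 0 (like the parallel and convex clusters in Figure 3) is a strong local cue; one near 0.5 (a straight line) is uninformative.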

Global Figure-Ground Consistency

Our local model of figure-ground (probabilistically) assigns labels on each contour. For any valid labeling, when contours join at a junction, their figure-ground labels need to be consistent with one another, forming "T-junctions".

We use a conditional random field to enforce such consistencies at junctions. Specifically, we enumerate all possible junction labelings, and learn the weights of each junction type from data. Figure 4 shows examples of some "likely" junctions and "unlikely" junctions and the learned weights.

Figure 4: learning valid and invalid junctions (learned weights, left to right: 0.185, -0.611, -0.857, 0.428). Junction 1: continuation of a contour, "likely". Junction 2: reversal of figure-ground labeling, "unlikely". Junction 3: cyclic labeling at a 3-junction, "likely". Junction 4: classical T-junction, "likely".
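On a small contour graph, enforcing junction consistency can be illustrated by brute force: score every joint labeling by summing a learned weight per junction configuration, and keep the best. This is only a toy sketch of the idea; the weights and the way a junction "type" is read off from the labels are invented here, and the real model also includes the local-cue terms and performs proper CRF inference.

```python
from itertools import product

# Toy contour graph: three contours meeting at one junction.
# Each contour gets a binary figure-ground label (which side is figure).
# A junction configuration is read off as the tuple of incident labels;
# the weights below are invented to mimic the flavor of Figure 4,
# rewarding consistent configurations and penalizing reversals.
junction_weight = {
    (0, 0, 0): -0.6,   # a label reversal somewhere: "unlikely"
    (1, 1, 1): -0.6,
    (0, 1, 0): 0.4,    # a consistent T-junction-like configuration: "likely"
}

def score(labels, junctions):
    # Sum the learned weights over all junctions under this labeling
    s = 0.0
    for j in junctions:
        t = tuple(labels[c] for c in j)
        s += junction_weight.get(t, 0.0)
    return s

junctions = [(0, 1, 2)]             # one junction joining contours 0, 1, 2
best = max(product([0, 1], repeat=3),
           key=lambda lab: score(lab, junctions))
print(best)  # (0, 1, 0)
```

Enumeration is exponential in the number of contours, which is why the actual approach uses conditional random field inference rather than brute force.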

Quantitative performance evaluation

We evaluate the performance of our approach using human-marked figure-ground labels. We consider three labelings: (1) local cues only; (2) local cues, averaged over contour segments; (3) local cues plus global consistency.

We observe that local cues perform quite well, even though the natural scenes in the dataset are fairly complex. If we have a "perfect" segmentation (such as one marked by human subjects), we may perform global inference on a "perfect" junction graph. This global consistency inference greatly improves figure-ground accuracy.

Labeling accuracy with groundtruth segmentation

  Chance | Local cues | Local averaged on contours | Local + Global consistency | Human
  50%    | 64.8%      | 72.0%                      | 78.3%                      | 88%
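The accuracy figures above are simply the fraction of labeled contour points whose predicted figure side agrees with the human-marked side. A minimal sketch of that metric, on made-up stand-in labels:

```python
import numpy as np

def fg_accuracy(pred, truth):
    """Fraction of contour points whose figure side matches the human label."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return (pred == truth).mean()

# Stand-in labelings over 10 contour points (1 = figure on left)
truth = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
pred  = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])
print(fg_accuracy(pred, truth))  # 0.8
```

Chance is 50% because each contour point has two possible figure sides; the human number (88%) reflects agreement between different human subjects on the same images.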

On the other hand, we may apply our approach when there is "no" segmentation. In this case we compute edges bottom-up, and form junction structures by tracing edges. Junction structures are much noisier, hence the benefit of global inference is smaller. Nevertheless, we still observe a significant increase in accuracy.

Labeling accuracy with bottom-up boundary detection

  Chance | Local cues | Local averaged on contours | Local + Global consistency | Human
  50%    | 64.9%      | 66.5%                      | 68.9%                      | 88%

Sample Results with Groundtruth Segmentation

What we can do with figure-ground labeling given a "perfect" segmentation. Column 1: images. Column 2: groundtruth figure-ground labels, white being the figure side and black the ground side. Column 3: results from local cues, red indicating correct labelings and blue incorrect. Column 4: results from local+global inference.

Sample Results with Bottom-up Boundary Detection

What we can do with figure-ground organization given "no" segmentation. In this case, junction structures are derived from bottom-up boundary detection (Column 2). Local figure-ground cues perform as effectively as before (Column 3). Global inference over T-junctions is much harder without perfect junctions, but we still see a significant improvement (Column 4).

References

  1. Xiaofeng Ren, Charless Fowlkes and Jitendra Malik. Figure/Ground Assignment in Natural Images. In ECCV '06, volume 2, pages 614-627, Graz, 2006.

  2. Xiaofeng Ren, Charless Fowlkes and Jitendra Malik. Familiar Configuration Enables Figure/Ground Assignment in Natural Scenes. In VSS '05, Sarasota, FL, 2005.