EPFL CVLab

A Semi-Naive Bayesian Classifier
for Fast Patch Classification

We show that it is possible to train a Semi-Naive Bayesian Classifier to recognize keypoints without the need for intensive preprocessing stages like fine-scale selection, dominant orientation and affine parameter estimation. We also demonstrate that this approach is superior to previous methods that use weighted averaging of Randomized-Tree based classifier ensembles.

Fast Patch Recognition

Can we match keypoints using simple binary features without intensive preprocessing (fine scale selection, orientation estimation, affine correction)?

Given a deformed patch, can we guess the original?

Assuming that we have the possibility to train a classifier to classify each keypoint class.

A keypoint class constitutes all possible views of the patch around itself.

Semi-Naive Formulation

We model the class conditional probabilities of a large number of binary features which are estimated by a training phase. At run-time these probabilities are used to select the best match for a given image patch.

$ f_i$ : Binary feature (We use sign of intensity difference of two pixels).
$ N_f$ : Total number of features in the model.
$ C_k$ : Class representing all views of an image patch around a keypoint.

Given $f_1, f_2, \ldots, f_{N_f} \rightarrow$ Select class $k$ such that
$\displaystyle k = \underset{k}{\text{argmax}}\text{ } P(C_k \vert f_1, f_2, \ld
                  = \underset{k}{\text{argmax}}\text{ } P(f_1, f_2, \ldots, f_{N_f} \vert C_k),$ assuming uniform prior over classes.

However it is not practical to model the joint distribution of all features. We compromise by grouping features into sets of small size and assume independence between these sets.

$F_j$ : A Fern defined to be set of $S$ binary features $\{ f_l, \dots$, $f_{l + S} \}$.
$M$ : Number of ferns, $N_f = S \times M$.

Implementing Ferns

For each patch a fern node outputs a binary number, and a fern with S nodes outputs a number between 0 and 2S-1. When we have multiple patches of the same class we can model the output of a fern with a multinomial distribution.

Legend for following figures

The distributions are initialized with a uniform Dirichlet prior, assuming one prior example for each class with each possible fern output. Then the training data starts to update the distributions.

Uniform prior and an initial update

Each patch in the training set updates the distribution for the corresponding class.

Update another class

At the end of the training we have distributions over possible fern outputs for each class.

Training completed

To recognize a new patch the outputs selects rows of distributions for each fern and these are then combined assuming independence between distributions.

Testing a new patch

Comparison with Randomized Trees

Comparison of classification rates:

For two different images we have compared the correct classification rate of both the Randomized Trees and Ferns classifying 500 keypoints on randomly generated test samples. The tests are repeated 10 times and error bars show the range of results (Click on thumbnails to see larger graphs).

Classification rate comparison of Ferns and Trees on image0 Classification rate comparison of Ferns and Trees on image1

We also compare the performance of both methods as the number of keypoints are increased. We train the methods up to 2100 keypoints selected from three images. The first graph shows the case with 20 trees/ferns and the second for 30.

Classification rate comparison of Ferns and Trees on image0 Classification rate comparison of Ferns and Trees on image1

Comparison of test structures:

The tests for both methods are selected completely at random. Hence we can consider a tree which contains the same test at each level of the tree as shown below. This structure is equivalent to a single Fern. Therefore a Fern is simply a constrained tree.

A tree, a constrained tree, and a Fern

Comparison of model combination:

The Randomized Trees model the posterior class probabilities and averages the results from different trees. For ferns the class conditional densities are modelled and assumed to be independent between ferns.

Comments on performance difference:

In both cases the test structures generate codes for each patch using randomly selected binary features and the code probabilities are modeled. Therefore we expect the main difference between the methods to be the probabilistic models. We verified this through experiments and observed that the tree structures with independence assumption has the same performance as Ferns. Hence it is possible to use Independent Trees but the Ferns are simpler and have a more intuitive probabilistic interpretation.

Results

We did experiments on a 1074 frame sequence comparing the number of inliers for Fern based matching and Orientation Estimation + SIFT description. A frontal view of a planar target is used to train Ferns by generating random deformations.

Mouse Pad Sequence Mouse Pad Sequence Mouse Pad Sequence Mouse Pad Sequence

The results show that both methods return comparable matches, and when perspective distortion is large Ferns perform better.

Inlier Comparison

The training phase assumes local planarity. To test the robustsness of the algorithm when this is violated, we performed tests on tracking a specific face image as shown above.

Face Test 0 Face Test 1 Face Test 2 Face Test 3

Conclusion and Discussion

Although Ferns require a training phase, they take only 13.5 x 10-3 milliseconds to classify one keypoint while at the same time handle 360 degrees rotation and affine deformations without the need for preprocessing stages.

It is possible to see the training phase as a marginalization over pose parameters. While this results in smoother distributions than estimating and conditioning on a specific pose, it does not need to estimate pose at run-time and will work better when there is no canonical pose that can be accurately estimated.

It is an open question whether we can do something in between, estimate and condition when it is easy/accurate or use a marginal distribution when necessary.

One limitation of the method is the at least linear dependence of the computation on the number of classes albeit with a very small constant in front. See the paper for a few heuristic methods to decrease the total time (they still have the same complexity for the first few Ferns). It might be possible to extend the method to work on clusters of keypoints.

Another limitation is the memory requirements since the probability tables take memory linear in the number of classes as well. If possible parametric distributions with less degrees of freedom can be used but it depends on the type of features and their distribution.

Related Publications

M. Özuysal, M. Calonder, V. Lepetit, P. Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 32, Nr. 3, pp. 448 - 461, March 2010
M. Özuysal, P. Fua, V. Lepetit
Conference on Computer Vision and Pattern Recognition, Minneapolis, 2007

Valid XHTML 1.0 Strict Valid CSS