We show that it is possible to train a Semi-Naive Bayesian Classifier to recognize keypoints without the need for intensive preprocessing stages like fine-scale selection, dominant orientation and affine parameter estimation. We also demonstrate that this approach is superior to previous methods that use weighted averaging of Randomized-Tree based classifier ensembles.
Can we match keypoints using simple binary features without intensive preprocessing (fine scale selection, orientation estimation, affine correction)?
Assuming that we have the possibility to train a classifier to classify each keypoint class.
We model the class conditional probabilities of a large number of binary features which are estimated by a training phase. At run-time these probabilities are used to select the best match for a given image patch.
| : | Binary feature (We use sign of intensity difference of two pixels). | |
| : | Total number of features in the model. | |
| : | Class representing all views of an image patch around a keypoint. |
Given
Select class
such that
assuming uniform prior over classes.
However it is not practical to model the joint distribution of all features. We compromise by grouping features into sets of small size and assume independence between these sets.
| : | A Fern defined to be set of
|
|
| : | Number of ferns, |
For each patch a fern node outputs a binary number, and a fern with S nodes outputs a number between 0 and 2S-1. When we have multiple patches of the same class we can model the output of a fern with a multinomial distribution.
The distributions are initialized with a uniform Dirichlet prior, assuming one prior example for each class with each possible fern output. Then the training data starts to update the distributions.
Each patch in the training set updates the distribution for the corresponding class.
At the end of the training we have distributions over possible fern outputs for each class.
To recognize a new patch the outputs selects rows of distributions for each fern and these are then combined assuming independence between distributions.
For two different images we have compared the correct classification rate of both the Randomized Trees and Ferns classifying 500 keypoints on randomly generated test samples. The tests are repeated 10 times and error bars show the range of results (Click on thumbnails to see larger graphs).
We also compare the performance of both methods as the number of keypoints are increased. We train the methods up to 2100 keypoints selected from three images. The first graph shows the case with 20 trees/ferns and the second for 30.
The tests for both methods are selected completely at random. Hence we can consider a tree which contains the same test at each level of the tree as shown below. This structure is equivalent to a single Fern. Therefore a Fern is simply a constrained tree.
The Randomized Trees model the posterior class probabilities and averages the results from different trees. For ferns the class conditional densities are modelled and assumed to be independent between ferns.
In both cases the test structures generate codes for each patch using randomly selected binary features and the code probabilities are modeled. Therefore we expect the main difference between the methods to be the probabilistic models. We verified this through experiments and observed that the tree structures with independence assumption has the same performance as Ferns. Hence it is possible to use Independent Trees but the Ferns are simpler and have a more intuitive probabilistic interpretation.
We did experiments on a 1074 frame sequence comparing the number of inliers for Fern based matching and Orientation Estimation + SIFT description. A frontal view of a planar target is used to train Ferns by generating random deformations.
The results show that both methods return comparable matches, and when perspective distortion is large Ferns perform better.
The training phase assumes local planarity. To test the robustsness of the algorithm when this is violated, we performed tests on tracking a specific face image as shown above.
Although Ferns require a training phase, they take only 13.5 x 10-3 milliseconds to classify one keypoint while at the same time handle 360 degrees rotation and affine deformations without the need for preprocessing stages.
It is possible to see the training phase as a marginalization over pose parameters. While this results in smoother distributions than estimating and conditioning on a specific pose, it does not need to estimate pose at run-time and will work better when there is no canonical pose that can be accurately estimated.
It is an open question whether we can do something in between, estimate and condition when it is easy/accurate or use a marginal distribution when necessary.
One limitation of the method is the at least linear dependence of the computation on the number of classes albeit with a very small constant in front. See the paper for a few heuristic methods to decrease the total time (they still have the same complexity for the first few Ferns). It might be possible to extend the method to work on clusters of keypoints.
Another limitation is the memory requirements since the probability tables take memory linear in the number of classes as well. If possible parametric distributions with less degrees of freedom can be used but it depends on the type of features and their distribution.