To give recursive tracking systems automatic initialisation and recovery capabilities, we are developing methods to learn the appearance of a specific object from video sequences. During an off-line training phase, we build a classifier based on Randomized Trees to recognise small patches centered on the keypoints detected in the images.
The training video is processed without user intervention. The user only provides a rough 3D model of the object and the camera pose for the initial frame of the training video. The resulting object detector runs in real time and handles partial or complete occlusions of the object while remaining robust to lighting changes.
The classifier is trained in a supervised manner, so every keypoint detected in the training images has to be labelled. To solve this problem we use an incremental training strategy that we call "Feature Harvesting". An initial classifier is built from the first frame alone, together with synthetic examples obtained by applying small affine deformations to it. This classifier then labels the keypoints in the next frame and detects the object. The new frame in turn provides more examples with which to update the classifier, and the procedure is repeated until there are no more frames to process.
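The synthetic examples mentioned above can be illustrated with a minimal sketch. It assumes a square grey-level patch stored as a list of rows, a deformation of the form A = R(theta) R(-phi) diag(s1, s2) R(phi), and nearest-neighbour resampling. The function names and the parameter ranges (`max_angle`, `max_scale`) are hypothetical choices for illustration, not values taken from our system.

```python
import math
import random


def random_affine(rng, max_angle=0.3, max_scale=0.2):
    """Sample a small 2x2 deformation A = R(theta) R(-phi) diag(s1, s2) R(phi).

    The ranges are illustrative assumptions, not tuned values.
    """
    theta = rng.uniform(-max_angle, max_angle)
    phi = rng.uniform(-math.pi, math.pi)
    s1 = 1.0 + rng.uniform(-max_scale, max_scale)
    s2 = 1.0 + rng.uniform(-max_scale, max_scale)

    def rot(angle):
        c, s = math.cos(angle), math.sin(angle)
        return ((c, -s), (s, c))

    def mul(m, n):
        return tuple(
            tuple(sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2))
            for i in range(2)
        )

    return mul(mul(rot(theta), rot(-phi)), mul(((s1, 0.0), (0.0, s2)), rot(phi)))


def warp_patch(patch, a):
    """Warp a square patch about its centre using nearest-neighbour sampling."""
    n = len(patch)
    centre = (n - 1) / 2.0
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    inv = ((a[1][1] / det, -a[0][1] / det), (-a[1][0] / det, a[0][0] / det))
    out = []
    for y in range(n):
        row = []
        for x in range(n):
            # Map each output pixel back into the source patch.
            dx, dy = x - centre, y - centre
            sx = inv[0][0] * dx + inv[0][1] * dy + centre
            sy = inv[1][0] * dx + inv[1][1] * dy + centre
            # Clamp to the patch border instead of reading outside it.
            ix = min(max(int(round(sx)), 0), n - 1)
            iy = min(max(int(round(sy)), 0), n - 1)
            row.append(patch[iy][ix])
        out.append(row)
    return out


def synthesize(patch, count, seed=0):
    """Return `count` randomly deformed copies of `patch`."""
    rng = random.Random(seed)
    return [warp_patch(patch, random_affine(rng)) for _ in range(count)]
```

Each deformed copy keeps the label of the keypoint it was generated from, so a single frame can supply many training examples per keypoint.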
Randomized Trees are particularly well suited to this approach, since they can be updated very easily and keypoints can be added or removed during training. This makes it possible to discard unstable keypoints, so that only keypoints that are reliably tracked are incorporated into the final classifier.
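The following sketch shows why such updates are cheap. It assumes a simplified variant in which every node test compares the intensities of two randomly chosen pixels and the tree structure is fixed in advance, so that training only changes the per-leaf counts. The class names `RandomizedTree` and `Forest`, the flattened patch representation, and the default sizes are assumptions made for illustration rather than a description of our implementation.

```python
import random


class RandomizedTree:
    """Fixed-depth tree whose node tests are drawn at random, not learned."""

    def __init__(self, depth, patch_len, rng):
        self.depth = depth
        self.first_leaf = 2 ** depth - 1  # number of internal nodes
        # Each internal node compares the intensities of two distinct pixels.
        self.tests = [
            tuple(rng.sample(range(patch_len), 2)) for _ in range(self.first_leaf)
        ]
        # Each leaf counts how many patches of each keypoint reached it.
        self.leaves = [dict() for _ in range(2 ** depth)]

    def _leaf(self, patch):
        node = 0
        for _ in range(self.depth):
            a, b = self.tests[node]
            node = 2 * node + 1 + (1 if patch[a] < patch[b] else 0)
        return self.leaves[node - self.first_leaf]

    def update(self, patch, keypoint_id):
        """Add one training example: a single counter is incremented."""
        leaf = self._leaf(patch)
        leaf[keypoint_id] = leaf.get(keypoint_id, 0) + 1

    def remove(self, keypoint_id):
        """Forget a keypoint by deleting its counts from every leaf."""
        for leaf in self.leaves:
            leaf.pop(keypoint_id, None)

    def posterior(self, patch):
        leaf = self._leaf(patch)
        total = sum(leaf.values())
        return {k: v / total for k, v in leaf.items()} if total else {}


class Forest:
    """Several trees whose leaf posteriors are summed before the decision."""

    def __init__(self, n_trees=10, depth=4, patch_len=16, seed=0):
        rng = random.Random(seed)
        self.trees = [RandomizedTree(depth, patch_len, rng) for _ in range(n_trees)]

    def update(self, patch, keypoint_id):
        for tree in self.trees:
            tree.update(patch, keypoint_id)

    def remove(self, keypoint_id):
        for tree in self.trees:
            tree.remove(keypoint_id)

    def classify(self, patch):
        """Return the most likely keypoint id, or None if no leaf has counts."""
        votes = {}
        for tree in self.trees:
            for keypoint_id, p in tree.posterior(patch).items():
                votes[keypoint_id] = votes.get(keypoint_id, 0.0) + p
        return max(votes, key=votes.get) if votes else None
```

In this sketch, adding an example touches one leaf per tree and removing a keypoint never requires rebuilding the trees, which is the property the incremental training relies on.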
We have used a simple ellipsoid model to train classifiers for different objects, and then tested each classifier on a new video sequence showing the object from viewpoints that are present in the training video. In principle, simple 3D models can be used for training as long as they do not deviate too far from the real surface in textured areas. The roughness of the model starts to become a problem only when the change in viewpoint is sufficiently wide. The test results are shown below (click on the images to download the videos).
We have also performed tests on different kinds of objects to evaluate the quality of training. The face tests include more intense lighting changes in both the training and the test sequences. The next suite of tests uses a partially textured transparent glass, for which we used a cylindrical model that covers the glass from top to bottom. The training video shows the glass moving against a complex background. Since the background keypoints are not stable, the harvesting procedure filters them out and keeps the ones on the texture of the glass.
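The filtering described above can be sketched as simple bookkeeping over the training frames. The sketch assumes that, for each frame, we know which keypoints were recognised consistently with the detected object pose. The class `Harvester` and the thresholds `min_frames` and `min_ratio` are hypothetical and serve only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class TrackRecord:
    seen: int = 0    # frames in which the keypoint agreed with the object pose
    frames: int = 0  # frames processed since the keypoint was first added


class Harvester:
    """Keeps per-keypoint statistics and drops keypoints that prove unstable."""

    def __init__(self, min_frames=5, min_ratio=0.6):
        # Both thresholds are illustrative assumptions.
        self.min_frames = min_frames
        self.min_ratio = min_ratio
        self.records = {}

    def observe_frame(self, inlier_ids):
        """Record one frame; unknown ids are registered as new keypoints."""
        inliers = set(inlier_ids)
        for keypoint_id in inliers:
            self.records.setdefault(keypoint_id, TrackRecord())
        for keypoint_id, record in self.records.items():
            record.frames += 1
            if keypoint_id in inliers:
                record.seen += 1

    def prune(self):
        """Remove and return keypoints seen too rarely since they were added."""
        unstable = [
            keypoint_id
            for keypoint_id, record in self.records.items()
            if record.frames >= self.min_frames
            and record.seen / record.frames < self.min_ratio
        ]
        for keypoint_id in unstable:
            del self.records[keypoint_id]
        return unstable

    def stable_ids(self):
        return sorted(self.records)
```

A keypoint on the moving glass is matched in most frames and survives, whereas a background keypoint that agrees with the object pose only occasionally falls below the ratio and is dropped.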
| Training | Test |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |