In this video, I will present several examples of multiple object tracking algorithms.
I will consider the tracking-by-detection approach,
which is the most practically useful and actively researched.
The tracking is divided into two steps.
The first step is to apply an object detector to each video frame or to keyframes.
The second step is to associate these detections into tracks.
The typical problem of
multiple object tracking is the limited performance of the object detector,
which means missed detections and false positives.
In theory, a good tracker should handle both of these flaws.
It should fill gaps in detections by propagating information from neighboring frames.
And it should filter false positive detections
based on the information from other frames.
However, as the multiple object tracking challenges have shown,
in practice this is not quite the case.
For example, in the MOT Challenge dataset from 2016,
18 percent of tracks are not covered by detections at all,
and 37 percent of tracks are covered only by low-confidence detections.
Therefore, if a particular ground-truth track has
no high-quality detections, the tracker cannot resolve this problem at all.
So trackers usually only reduce false positives,
and they raise false negatives by removing low-confidence detections.
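The trade-off described above can be illustrated with a minimal sketch. The detection format (a dict with `box`, `score`, and an illustrative `is_person` ground-truth flag), the function name `filter_detections`, and the threshold value are all assumptions made for this example.

```python
def filter_detections(detections, min_score=0.5):
    """Keep only detections whose confidence is at least min_score."""
    return [d for d in detections if d["score"] >= min_score]


# Hypothetical detector output for one frame.
detections = [
    {"box": (10, 20, 50, 120), "score": 0.92, "is_person": True},
    {"box": (200, 30, 240, 130), "score": 0.35, "is_person": True},   # real but weak
    {"box": (300, 40, 330, 90), "score": 0.20, "is_person": False},   # false positive
]

kept = filter_detections(detections)
# The false positive is gone, but so is the weakly detected real person.
```

Thresholding removes the false positive, but the real pedestrian with a low score is dropped as well, which is exactly how false negatives go up.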
So many researchers state that a good detector is the key to good tracking.
It's easy to illustrate this fact using MOT Challenge results.
Using public detections from the deformable part model detector,
the best tracking algorithm
reaches less than 50 percent accuracy on MOT Challenge 2016.
Using strong detections from methods like
Faster R-CNN, fine-tuned for pedestrian detection,
even very simple approaches easily exceed 60 percent accuracy.
Tracking-by-detection methods can be divided into two classes:
online tracking, where only the current and previous frames are
available, and offline or batch tracking,
where all frames of the video sequence are available,
so tracking in the current frame can rely on information from both previous and subsequent frames.
It's much easier to resolve complex situations in the latter case, but in this video
I will consider only online methods.
The main part of tracking is the association of detections between frames.
Based on the number of frames considered at each particular step,
we can divide methods into two-frame and multi-frame methods.
In two-frame methods,
we associate detections in the current frame with the available tracks from the previous frame.
In multi-frame methods, we consider the whole video or
a temporal window, and we can change
detection-to-track associations for any frame in the window.
I will now consider only two-frame algorithms
because they are simpler and very suitable for online tracking.
To associate detections and tracks,
we need to estimate the similarity, or affinity, between them.
We can use several features to compute the affinity score:
visual similarity, motion similarity, and
interaction between objects and between objects and the scene.
The simplest form of affinity is the overlap
between a detection in the current frame and one in the previous frame.
If the video frame rate is high, object detection is good, and
objects move slowly between frames, then true detections of the same object have the highest overlap.
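Overlap between two boxes is usually measured as intersection over union. The sketch below assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; that box format and the function name `iou` are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give a value of 1, disjoint boxes give 0, and a slowly moving object in consecutive frames gives a value close to 1.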
The validity of this simple approach has been
demonstrated in the recent "High-Speed Tracking-by-Detection
Without Using Image Information" paper from AVSS 2017.
The idea of the algorithm is very simple.
For each track, we select the detection with the highest intersection over union (IoU).
If this IoU is larger than the threshold, we
add it to the track and remove it from the list of unassociated detections.
If the best overlap is lower than the threshold, then we finish the track.
If the track is too short or has no high-confidence detections, we
consider the track a false positive and remove it from
the final list of tracks. That is all.
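The steps above can be sketched in a few lines. This is a simplified illustration, not the paper's reference implementation: the detection format, the parameter names `sigma_iou`, `sigma_h`, and `t_min`, their default values, and the choice to start a new track from every unassociated detection are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_iou(frames, sigma_iou=0.5, sigma_h=0.7, t_min=2):
    """frames: list of per-frame detection lists; each detection is a dict
    with a "box" and a "score". Returns the list of kept tracks."""
    active, finished = [], []
    for detections in frames:
        unassigned = list(detections)
        still_active = []
        for track in active:
            last_box = track[-1]["box"]
            # Detection with the highest overlap with the track's last box.
            best = max(unassigned, key=lambda d: iou(last_box, d["box"]), default=None)
            if best is not None and iou(last_box, best["box"]) > sigma_iou:
                track.append(best)
                unassigned.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # overlap too low: finish the track
        # Assumption: every leftover detection starts a new track.
        still_active.extend([d] for d in unassigned)
        active = still_active
    finished.extend(active)
    # Drop tracks that are too short or never had a high-confidence detection.
    return [t for t in finished
            if len(t) >= t_min and max(d["score"] for d in t) > sigma_h]


frames = [
    [{"box": (0, 0, 10, 10), "score": 0.9}],
    [{"box": (1, 0, 11, 10), "score": 0.8}, {"box": (50, 50, 60, 60), "score": 0.3}],
    [{"box": (2, 0, 12, 10), "score": 0.9}],
]
tracks = track_iou(frames)
```

In this toy sequence the slowly moving object forms one three-frame track, while the isolated low-confidence detection is discarded as a false positive.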
Despite the simplicity of this approach,
it reaches state-of-the-art results on the DETRAC dataset,
outperforming much more elaborate methods.
The results on MOT Challenge are also good, given a good detector.
But if the detector fails on a frame and misses the object,
then tracking of this object stops,
and false negative and ID switch errors are produced.
The problem can be alleviated if we use
the predicted object position instead of a missed detection in this frame.
For example, we can use a Kalman filter to predict
the object's position based on its positions in previous frames.
Then we can associate detections in the current frame with these predictions from previous frames.
If the object is not detected in the current frame,
we don't finish the track immediately but
continue to track these predictions for several frames.
We hope that eventually the object will be detected.
And if the predictions are good enough,
we will successfully associate
this detection with the predicted position and resume tracking.
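To make the prediction step concrete, here is a minimal sketch of a Kalman filter with a constant-velocity model for a single coordinate (for example, the x position of a box center). A real tracker would filter the whole box state; the class name, the noise parameters, and the initial covariance are assumptions chosen for illustration.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over the state (position, velocity) of one coordinate."""

    def __init__(self, position, process_var=0.01, measurement_var=1.0):
        self.p, self.v = float(position), 0.0
        # Large initial uncertainty: the velocity is unknown at first.
        self.P = [[100.0, 0.0], [0.0, 100.0]]
        self.q, self.r = process_var, measurement_var

    def predict(self, dt=1.0):
        """Propagate the state one step ahead and return the predicted position."""
        self.p += dt * self.v
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I.
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the state with a measured position z."""
        (a, b), (c, d) = self.P
        s = a + self.r                  # innovation variance
        k0, k1 = a / s, c / s           # Kalman gain for position and velocity
        y = z - self.p                  # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


# An object moving one unit per frame; the detector misses frames 6 and 7.
kf = ConstantVelocityKalman1D(position=0.0)
measurements = [1.0, 2.0, 3.0, 4.0, 5.0, None, None, 8.0]
predictions = []
for z in measurements:
    predictions.append(kf.predict())
    if z is not None:
        kf.update(z)    # detection available: correct the state
    # otherwise the track simply coasts on its prediction
```

During the two missed frames the filter keeps extrapolating the position, so when the object is detected again the prediction is close enough to be associated with it.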
Because predictions are less precise than detections,
we should replace the simple greedy association method
with the optimal Hungarian algorithm for the assignment problem,
as was done in the "Simple Online and
Realtime Tracking" paper with high-quality detections.
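The difference between greedy and optimal association can be seen on a tiny example. In the sketch below, the optimal assignment is found by exhaustive search over permutations, which is a stand-in used only to keep the example dependency-free: the Hungarian algorithm finds the same minimum-cost assignment in polynomial time (in practice one could use, for instance, `scipy.optimize.linear_sum_assignment`). The cost values and function names are made up for illustration.

```python
from itertools import permutations


def greedy_assignment(cost):
    """Repeatedly take the cheapest remaining (track, detection) pair."""
    pairs = sorted((cost[i][j], i, j)
                   for i in range(len(cost)) for j in range(len(cost[0])))
    used_rows, used_cols, result = set(), set(), []
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            result.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(result)


def optimal_assignment(cost):
    """Minimum total cost one-to-one assignment for a small square matrix.
    Exhaustive search; the Hungarian algorithm gives the same answer faster."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return [(i, best[i]) for i in range(n)]


def total_cost(cost, pairs):
    return sum(cost[i][j] for i, j in pairs)


# Rows are tracks, columns are detections; cost could be 1 - IoU.
cost = [[0.10, 0.20],
        [0.15, 0.90]]

greedy = greedy_assignment(cost)     # [(0, 0), (1, 1)]
optimal = optimal_assignment(cost)   # [(0, 1), (1, 0)]
```

Greedy matching grabs the single cheapest pair first and is then forced into a very poor second match, while the optimal assignment accepts a slightly worse first pair to get a much lower total cost.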
From the results of this algorithm on MOT Challenge 2016,
we see that motion predictions significantly reduce fragmentation and ID switches.
As a result, however, the number of false positives also increases.
Up till now, we haven't considered object appearance and have relied only on object motion.
In this case, we can easily merge the tracks of
two different people if they move close to each other.
To add visual similarity to the affinity cost,
we can use image retrieval methods, which
essentially estimate visual similarity between images.
Person re-identification is a variant of
the image retrieval problem that considers only images of pedestrians obtained during tracking.
The problem was originally stated in the context of
tracking but later has been stated as a separate problem.
As in other image retrieval tasks,
deep learning methods are currently the top performers on this problem.
So we can use a convolutional neural network to extract
distinctive appearance features and measure similarity based on these neural network features.
Adding appearance similarity to the affinity cost of the simple SORT algorithm
almost halves the number of ID switches and improves other metrics as well.
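One common way to compare appearance features is cosine similarity between feature vectors. The sketch below assumes the features have already been extracted by some network and are plain lists of floats. Mixing the two costs with a single `weight` parameter is an assumption made for illustration, not the exact formulation of any particular paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def combined_cost(iou_value, track_feature, detection_feature, weight=0.5):
    """Blend a motion cost (1 - IoU) with an appearance cost (1 - cosine)."""
    motion_cost = 1.0 - iou_value
    appearance_cost = 1.0 - cosine_similarity(track_feature, detection_feature)
    return weight * motion_cost + (1.0 - weight) * appearance_cost


# Two detections overlap the track equally well, but only one looks the same.
track_feature = [1.0, 0.0]
same_person = combined_cost(0.6, track_feature, [1.0, 0.0])
other_person = combined_cost(0.6, track_feature, [0.0, 1.0])
```

When two people walk close together, overlap alone cannot separate them, but the appearance term makes the correct detection clearly cheaper to associate.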
Previously, a combination of different features
has been used to compute the affinity score.
But recently it has been proposed to use a recurrent neural network
to learn the affinity score function specifically for each track.
In the "Tracking the Untrackable" paper, LSTMs are used to combine appearance,
motion, and interaction information.
The model is composed of three RNNs:
appearance, motion, and interaction.
All three models are combined through another RNN.
This RNN is used to compute the similarity score between tracks and detections.
The models are pre-trained and then fine-tuned for each track in the test video.
Experiments show that all components, appearance,
motion, and interaction, are complementary.
So for the best performance,
they should be used jointly.
As of August 2017,
this approach is second best among all online tracking methods on MOT Challenge
2016, and it trails the first by only 0.2 in accuracy.
In conclusion, I want to say that the detector and the affinity score
function are the two main components of online multiple object tracking methods.
For both of these components,
deep learning now demonstrates the best results.
If we use a real-time detector and affinity estimator,
then a simple association method can be used in
an online tracking framework to achieve
both real-time performance and good tracking accuracy.