
Introduction to Trackers

Object tracking is one of the most important and common problems in computer vision, with applications in areas such as traffic monitoring, robotics, and autonomous vehicle tracking. It has been a challenging problem from the beginning. Common difficulties include abrupt object motion, changing appearance patterns of both the object and the scene, nonrigid object structures, object-to-object and object-to-scene occlusions, and camera motion. The goal of this article is to review some of the popular and state-of-the-art tracking methods and to discuss their methodology in enough depth to understand their approaches. Tracking is usually performed in the context of higher-level applications that require the location and/or shape of the object in every frame. Typically, application-specific assumptions are made to constrain the tracking problem.

Object tracking is a very active research area in computer vision. Multiple Object Tracking (MOT), also called Multi-Target Tracking (MTT), aims to analyze videos to identify and track objects belonging to one or more categories, such as pedestrians, cars, animals, and inanimate objects, without any prior knowledge about the appearance and number of targets. Unlike object detection algorithms, whose output is a collection of rectangular bounding boxes identified by their coordinates, height, and width, MOT algorithms also associate a target ID with each box (known as a detection) to distinguish among intra-class objects. When evaluating performance on this task, the object of interest is identified solely by an axis-aligned bounding box (BBOX) in the first frame. The tracking algorithm should then preserve the assigned object’s identity in future frames. The MOT task plays an important role in computer vision: from video surveillance to autonomous cars, from action recognition to crowd behavior analysis, many of these problems would benefit from a high-quality tracking algorithm.

Tracking poses serious challenges, and a tracking system is expected to meet certain requirements, so each tracking algorithm should be designed to address them as far as possible. Some of these requirements are listed below.

Robustness: Robustness means that the tracking system can track the target even in complicated conditions such as background clutters, occlusion, and illumination variation.

Adaptability: In addition to changes in the environment, the target itself is subject to change, such as complex and sudden movement. To handle this, the tracking system must be able to detect and track the current apparent characteristics of the target.

Real-time processing of information: A system that deals with image sequences should have high processing speeds, so there is a need to implement a high-performance algorithm with low latency.

Object tracking methods fall into different categories. Below are the techniques and approaches used to develop tracking algorithms. Figure 1 illustrates a comprehensive classification of object tracking methods.

Feature-based tracking is one of the simplest ways of tracking objects. To track objects, features such as color, texture, and optical flow are extracted first. These extracted features must be unique so that objects are easily distinguishable in the feature space. Once the features are extracted, the next step is to find the most similar object in the next frame, using those features and some similarity criterion.

One of the problems with these methods lies in the extraction step: unique, exact, and reliable features of the object must be extracted so that the target can be distinguished from other objects. Here are some of the features that are used for object tracking.

Color: Color features describe the appearance of the object. They can be used in different ways; one of the most common is a color histogram.

Texture: Texture is a repeated pattern of information or arrangement of the structure with regular intervals. Texture features are not obtained directly. These features are generated by image preprocessing techniques.

Optical Flow: It is the apparent motion of brightness patterns in the image. Apparent motion can also be caused by lighting changes without any actual motion. The optical flow algorithm calculates the displacement of brightness patterns from one frame to another. Algorithms that calculate displacement for all image pixels are known as dense optical flow algorithms, while sparse algorithms estimate displacement for a selected number of pixels in an image.
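As an illustration of the color feature and the similarity criterion described above, here is a minimal sketch of histogram-based matching. The function names, the bin count, and the use of histogram intersection as the similarity measure are assumptions made for this example; other measures would work as well.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram; pixels is a list of (r, g, b) in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 / bins
    for r, g, b in pixels:
        # Quantize each channel into one of `bins` buckets.
        ri = min(int(r / step), bins - 1)
        gi = min(int(g / step), bins - 1)
        bi = min(int(b / step), bins - 1)
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def best_match(target_hist, candidate_patches):
    """Index of the candidate patch most similar to the target histogram."""
    scores = [histogram_intersection(target_hist, color_histogram(p))
              for p in candidate_patches]
    return max(range(len(scores)), key=scores.__getitem__)
```

In a tracker, `candidate_patches` would be the pixel lists of image regions in the next frame, and the best-scoring region would be taken as the new target location.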

Figure 1: Classification of object tracking methods

Segmenting foreground objects from a video frame is fundamental and the most critical step in visual tracking. Foreground segmentation is done to separate foreground objects from the background scene. Normally, the foreground objects are the objects that are moving in a scene. To track these objects, they are to be separated from the background scene. In the following, some of the object tracking methods based on the segmentation are examined.

Bottom-Up Based Method: In this type of tracking, there must be two separate tasks, first the foreground segmentation and then the object tracking. The foreground segmentation uses a low-level segmentation to extract regions in all frames, and then some features are extracted from the foreground regions and tracked according to those features.

Joint Based Method: In the bottom-up method, foreground segmentation and tracking are two separate tasks. One problem with this design is that segmentation errors propagate forward and cause tracking errors. To solve this problem, researchers merged foreground segmentation and tracking into a single method, which improved tracking performance.

Estimation methods formulate tracking as an estimation problem in which an object is represented by a state vector. The state vector describes the dynamic behavior of a system, such as the position and velocity of an object. The general framework for this dynamic state estimation problem is taken from Bayesian methods. Bayesian filters allow the target’s position in the coordinate system to be updated continuously, based on the latest sensor data. The algorithm is recursive and consists of two steps: prediction and updating.

The prediction step estimates the new position of the target in the next step using the state model, while the updating step uses the current observation to update the target position using the observation model. The prediction and updating steps are performed on each frame of the video sequence. Here are some examples of this method.

Kalman Filter: To use the Kalman filter in object tracking, a dynamic model of the target movement should be designed. The Kalman filter estimates the state of a linear system under the assumption that the errors are Gaussian. The Kalman filter has important features that tracking can take advantage of, including:

The Kalman filter consists of two stages of prediction and updating. In the prediction step, previous models are used to predict the current position. The update step uses current measurements to correct the position.
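The two stages can be sketched for a single coordinate with a constant-velocity model. This is a minimal illustration, not a production filter: the class name, the initial covariance values, and the noise variances are assumptions chosen for the example, and the 2x2 matrix algebra is written out by hand to keep the code dependency-free.

```python
class ConstantVelocityKalman1D:
    """Tracks one coordinate with state [position, velocity]."""

    def __init__(self, position=0.0, process_var=1e-3, meas_var=1.0):
        self.p = position
        self.v = 0.0
        # State covariance; velocity starts very uncertain (assumed values).
        self.P = [[10.0, 0.0], [0.0, 1000.0]]
        self.q = process_var  # process noise variance (assumed)
        self.r = meas_var     # measurement noise variance (assumed)

    def predict(self, dt=1.0):
        """Prediction: propagate the state with the motion model F = [[1, dt], [0, 1]]."""
        self.p += dt * self.v
        (a, b), (c, d) = self.P
        # P = F P F^T + Q, expanded for the 2x2 case.
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Update: correct the state with a position measurement z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r           # innovation variance
        k0, k1 = a / s, c / s    # Kalman gain
        y = z - self.p           # innovation
        self.p += k0 * y
        self.v += k1 * y
        # P = (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]
        return self.p
```

Calling `predict()` without a following `update()` models a frame in which no measurement is available: the position is extrapolated and its variance grows.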

Particle Filter: Most tracking problems are non-linear, so the particle filter has been considered for solving them. The particle filter is a recursive Monte Carlo method that is often used with non-Gaussian noise and measurement models. Its main idea is to represent the state distribution by a set of particles.
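A minimal one-dimensional sketch of this idea is shown below. The random-walk motion model, the Gaussian likelihood, and all parameter values are assumptions made for illustration; the point is only the predict, weight, and resample cycle over a set of particles.

```python
import math
import random


def particle_filter_step(particles, measurement, rng,
                         motion_std=0.2, meas_std=1.0):
    """One predict / weight / resample cycle over a list of 1D particles."""
    # Predict: move every particle with a random-walk motion model (assumed).
    predicted = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: likelihood of the measurement given each particle (assumed Gaussian).
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2)
               for p in predicted]
    if sum(weights) == 0.0:
        weights = [1.0] * len(predicted)  # fall back to uniform weights
    # Resample: draw particles in proportion to their weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Point estimate: the mean of the particle set."""
    return sum(particles) / len(particles)
```

Because the likelihood function is arbitrary, the same loop works for non-Gaussian measurement models by swapping the weighting line.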

In learning-based methods, the features and appearance of different targets, and their behavior in subsequent frames, are learned during training. At test time, the model uses what it has learned to detect the object and track it in the following frames. Learning-based methods are often divided into three types: generative, discriminative, and reinforcement learning.

Discriminative Methods: Discriminative trackers usually treat tracking as a classification problem that discriminates the target from the background. Discriminative learning is divided into two categories: shallow learning and deep learning. Shallow learning methods use pre-defined features and cannot extract features themselves. Shallow learning builds its model with few layers, whereas deep learning uses many layers. Another difference is that shallow learning requires important, discriminative features to be designed by experts, while deep learning extracts these features itself.

Generative Methods: The generative appearance models mainly concentrate on how to accurately fit the data from the object class. However, it is very difficult to verify the correctness of the specified model in practice. By introducing online-update mechanisms, they incrementally learn visual representation for the foreground object region information while ignoring the influence of the background. Traditional online learning methods are adopted to track an object by searching for regions most similar to the target model. The online learning strategy is embedded in the tracking framework to update the appearance model of the target adaptively in response to appearance variations.

Recently, more and more of these algorithms have started exploiting the representational power of deep learning (DL). The strength of Deep Neural Networks (DNN) resides in their ability to learn rich representations and to extract complex and abstract features from their input. Convolutional neural networks (CNN) currently constitute the state of the art in spatial pattern extraction and are employed in tasks such as image classification or object detection, while recurrent neural networks (RNN) like the Long Short-Term Memory (LSTM) are used to process sequential data, like audio signals, time series, and text. Since DL methods have reached top performance in many of those tasks, they are now used in most of the top-performing MOT algorithms, where they help solve some of the subtasks into which the problem is divided.

In this section we discuss some of the most popular and state-of-the-art trackers. These trackers were selected based on factors such as accuracy, ease of understanding, and frame rate (high FPS). Below is the list of trackers with a brief explanation of each:

SORT [1] first appeared in 2016 and was ranked the best open-source multiple object tracker on the MOT benchmark. It is still very popular, and many current state-of-the-art trackers build on its fundamental concepts. It became popular because it was not only accurate but also very fast, which made it suitable for many real-time applications.

There are four main components in the methodology described in the paper:

Detection: Detections from any deep learning detector are fed in to initialize the tracker’s components.

Estimate Model: Here they describe the object model, i.e., the representation and the motion model used to propagate a target’s identity into the next frame. They approximate the inter-frame displacements of each object with a linear constant velocity model which is independent of other objects and camera motion. The state of each target is shown in the equation below:

$$\mathbf{x} = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^{T}$$

where u and v represent the horizontal and vertical pixel location of the center of the target, while s and r represent the scale (area) and the aspect ratio of the target’s bounding box respectively. When a detection is associated with a target, the detected bounding box is used to update the target state, and the velocity components are solved optimally via a Kalman filter framework. If no detection is associated with the target, its state is simply predicted without correction using the linear velocity model.
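To make the u, v, s, r representation concrete, here is a small sketch of the conversion between a corner-format bounding box and these four quantities. The function names and the (x1, y1, x2, y2) corner convention are assumptions for this example.

```python
import math


def bbox_to_measurement(x1, y1, x2, y2):
    """Convert corner coordinates to (u, v, s, r): center, area, aspect ratio."""
    w = x2 - x1
    h = y2 - y1
    u = x1 + w / 2.0
    v = y1 + h / 2.0
    s = w * h          # scale (area)
    r = w / h          # aspect ratio (width over height)
    return u, v, s, r


def measurement_to_bbox(u, v, s, r):
    """Invert the conversion: recover (x1, y1, x2, y2) from (u, v, s, r)."""
    w = math.sqrt(s * r)   # s * r = (w * h) * (w / h) = w^2
    h = s / w
    return u - w / 2.0, v - h / 2.0, u + w / 2.0, v + h / 2.0
```

The round trip is lossless for any box with positive width and height, which is what lets the filter work in (u, v, s, r) space and still report ordinary boxes.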

Data Association: In assigning detections to existing targets, each target’s bounding box geometry is estimated by predicting its new location in the current frame. The assignment cost matrix is then computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets. The assignment is solved optimally using the Hungarian algorithm.
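The association step can be sketched as follows. To keep the example dependency-free, the optimal assignment is found by brute force over permutations, which is a stand-in for the Hungarian algorithm and only practical for a handful of boxes; in practice a solver such as SciPy's `linear_sum_assignment` would typically be used. The function names and the `iou_min` default are assumptions for illustration.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def associate(predicted, detections, iou_min=0.3):
    """Return (matches, unmatched_tracks, unmatched_detections) as index lists."""
    n, m = len(predicted), len(detections)
    if n == 0 or m == 0:
        return [], list(range(n)), list(range(m))
    # Cost is the IOU distance, 1 - IOU.
    cost = [[1.0 - iou(p, d) for d in detections] for p in predicted]
    best_pairs, best_cost = [], float("inf")
    if n <= m:
        candidates = (list(zip(range(n), perm)) for perm in permutations(range(m), n))
    else:
        candidates = (list(zip(perm, range(m))) for perm in permutations(range(n), m))
    for pairs in candidates:
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_cost:
            best_pairs, best_cost = pairs, total
    # Reject assignments whose overlap is below the minimum IOU.
    matches = sorted((i, j) for i, j in best_pairs if 1.0 - cost[i][j] >= iou_min)
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    return (matches,
            [i for i in range(n) if i not in matched_t],
            [j for j in range(m) if j not in matched_d])
```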

Creation and Deletion of Track identities: When objects enter and leave the image, unique identities need to be created or destroyed accordingly. For creating trackers, they consider any detection with an overlap less than IOUmin to signify the existence of an untracked object. The tracker is initialized using the geometry of the bounding box with the velocity set to zero. Since the velocity is unobserved at this point the covariance of the velocity component is initialized with large values, reflecting this uncertainty. Tracks are terminated if they are not detected for TLost frames. This prevents an unbounded growth in the number of trackers and localization errors caused by predictions over long durations without corrections from the detector.
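The creation and deletion rules above can be sketched with a small bookkeeping class. The class and attribute names are hypothetical, the Kalman covariance initialization is omitted, and `t_lost` corresponds to TLost in the text; this is an illustration of the lifecycle only.

```python
class Track:
    def __init__(self, track_id, box):
        self.track_id = track_id
        self.box = box
        self.velocity = (0.0, 0.0)      # velocity is initialized to zero
        self.frames_since_update = 0


class TrackManager:
    def __init__(self, t_lost=1):
        self.t_lost = t_lost
        self.tracks = []
        self._next_id = 0

    def step(self, matches, unmatched_detections):
        """matches: list of (track, box); unmatched_detections: list of boxes."""
        matched_ids = set()
        for track, box in matches:
            track.box = box
            track.frames_since_update = 0
            matched_ids.add(track.track_id)
        # Tracks without a detection in this frame age by one frame.
        for track in self.tracks:
            if track.track_id not in matched_ids:
                track.frames_since_update += 1
        # Terminate tracks that have gone undetected for t_lost frames.
        self.tracks = [t for t in self.tracks
                       if t.frames_since_update < self.t_lost]
        # Each unmatched detection starts a new track with a fresh identity.
        for box in unmatched_detections:
            self.tracks.append(Track(self._next_id, box))
            self._next_id += 1
```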

Table 1: Performance of SORT on MOT 15

The problem with SORT is that, although it performs well in terms of precision and accuracy, it returns a relatively high number of identity switches. This is because the association metric employed in SORT is only accurate when state estimation uncertainty is low. As a result, it struggles to track through occlusions, which typically appear in frontal-view camera scenes. Some of these issues are addressed by DeepSORT [2].

Figure 2: DeepSORT Algorithm flow

Figure 2 gives a general idea of the algorithm, which has two branches, described below:

Appearance Branch: Given the detections in each frame, a deep appearance descriptor is applied to extract their appearance features. It uses a feature bank mechanism to store the features of the last 100 frames for each tracklet. As new detections arrive, the smallest cosine distance between the feature bank of each tracklet and the feature of the new detection is computed. This distance is used as the matching cost during the association procedure.
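A minimal sketch of the feature bank and its matching cost is given below. The class name and the plain-list feature vectors are assumptions for this example; a real descriptor would be a high-dimensional embedding produced by a network.

```python
import math
from collections import deque


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


class FeatureBank:
    """Keeps the most recent appearance features of one tracklet."""

    def __init__(self, max_size=100):
        # deque with maxlen drops the oldest feature automatically.
        self.features = deque(maxlen=max_size)

    def add(self, feature):
        self.features.append(list(feature))

    def distance(self, feature):
        """Smallest cosine distance between the bank and a new detection feature."""
        return min((cosine_distance(f, feature) for f in self.features),
                   default=1.0)
```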

Motion branch: The Kalman filter algorithm accounts for predicting the positions of tracklets in the current frame. Then, Mahalanobis distance is used to measure the spatio-temporal dissimilarity between tracklets and detections. DeepSORT takes this motion distance as a gate to filter out unlikely associations.
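The gating idea can be sketched as below. For simplicity the example assumes a diagonal covariance, so no matrix inverse is needed; a full implementation would use the projected Kalman covariance matrix. The threshold 9.4877 is the 0.95 quantile of the chi-square distribution with four degrees of freedom, matching a four-dimensional measurement; the function names are assumptions.

```python
CHI2_95_4DOF = 9.4877  # 0.95 quantile of chi-square with 4 degrees of freedom


def squared_mahalanobis(measurement, mean, variances):
    """Squared Mahalanobis distance under a diagonal covariance (assumed)."""
    return sum((z - m) ** 2 / var
               for z, m, var in zip(measurement, mean, variances))


def passes_gate(measurement, mean, variances, threshold=CHI2_95_4DOF):
    """True if the detection is a plausible continuation of the tracklet."""
    return squared_mahalanobis(measurement, mean, variances) <= threshold
```

Associations that fail the gate are excluded before the appearance cost is considered, which removes pairings that are spatially implausible regardless of how similar they look.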

The matching cascade algorithm is proposed to solve the association task as a series of subproblems instead of a global assignment problem.

Table 2: Tracking results on the MOT16 challenge

ByteTrack [4] proposes a simple, effective, and generic data association method called BYTE. Unlike previous methods, which keep only the high-score detection boxes, it keeps almost every detection box and separates them into high-score and low-score ones.
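One way to picture the high-score and low-score split is as a two-stage association, sketched below. This is an illustration under assumptions, not the authors' implementation: matching is done greedily by IOU rather than with an optimal assignment, tracks are represented only by their predicted boxes, and the threshold values and function names are chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_match(track_ids, det_ids, tracks, dets, iou_min):
    """Greedy IOU matching (a simplification of optimal assignment)."""
    candidates = sorted(((iou(tracks[t], dets[d][0]), t, d)
                         for t in track_ids for d in det_ids), reverse=True)
    matches, used_t, used_d = [], set(), set()
    for overlap, t, d in candidates:
        if overlap < iou_min:
            break
        if t in used_t or d in used_d:
            continue
        matches.append((t, d))
        used_t.add(t)
        used_d.add(d)
    return (matches,
            [t for t in track_ids if t not in used_t],
            [d for d in det_ids if d not in used_d])


def byte_associate(tracks, dets, high=0.6, low=0.1, iou_min=0.3):
    """tracks: list of boxes; dets: list of (box, score).

    Returns (matches, lost_track_ids, new_track_det_ids).
    """
    high_ids = [i for i, (_, s) in enumerate(dets) if s >= high]
    low_ids = [i for i, (_, s) in enumerate(dets) if low <= s < high]
    # Stage 1: match all tracks against high-score detections.
    m1, left_tracks, new_ids = greedy_match(
        list(range(len(tracks))), high_ids, tracks, dets, iou_min)
    # Stage 2: match the remaining tracks against low-score detections.
    m2, lost_tracks, _ = greedy_match(left_tracks, low_ids, tracks, dets, iou_min)
    # Only unmatched high-score detections may start new tracks.
    return m1 + m2, lost_tracks, new_ids
```

The second stage is what lets a partially occluded object, whose detection score has dropped, keep its identity instead of being discarded.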

Below are the simple steps that define the ByteTrack algorithm:

Table 3: Tracking results on the MOT17 challenge

Before moving on to the next algorithm, let’s briefly discuss the limitations of SORT that lay the foundation for the newly proposed method, OC-SORT [3].

Sensitive to State Noise: SORT is sensitive to noise from KF’s states. Even if the estimated position has a shift of only a few pixels, it causes significant variance to the estimated speed. In general, the variance of speed estimation can be of the same magnitude as the speed itself or even bigger.

In most cases, this does not have a large impact, because the shift is only a few pixels from the ground truth at the next time step and the supervision provided by the observation corrects the estimates from the KF motion model. However, this sensitivity introduces significant problems in practice because the error accumulates across multiple time steps when no observation is available for the KF update.

Temporal Error Magnification: Errors accumulate along the trajectory. Consider a track that is occluded between t and t+T, where the noise of the speed estimate follows a normal distribution given as:

Up to step t+T, the noise of the estimated positions follows the distribution below:

This shows that, without supervision from observations, estimates based on the KF’s linear motion assumption accumulate error at square order with respect to time.
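A short worked sketch shows where the square-order growth comes from. The symbols below are assumptions chosen for illustration: each position estimate along the horizontal axis is taken to carry independent Gaussian noise with variance \sigma_u^2, and the speed is estimated as the difference of two consecutive positions.

```latex
% Speed noise: difference of two independent position errors
\delta_{\dot{u}_t} \sim \mathcal{N}(0,\, 2\sigma_u^2)

% Position extrapolated over T steps with no observation
u_{t+T} = u_t + T\,\dot{u}_t

% Scaling a random variable by T scales its variance by T^2
\delta_{u_{t+T}} \sim \mathcal{N}(0,\, 2T^2\sigma_u^2)
```

The same argument applies to the vertical coordinate v. The variance of the position error therefore grows with T^2, so an occlusion that lasts twice as long gives four times the variance.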

To address the limitations above, the authors bring the momentum of the moving object into the association stage and develop a pipeline with less noise and more robustness to occlusion and non-linear motion.

Three important steps were added to address these problems systematically. Figure 3 also illustrates these steps.

Observation centric online smoothing (OOS): Once a track is associated with an observation again after a period of being untracked, they perform online smoothing over the parameters back to the period of being lost through a virtual trajectory of observations. This fixes the error accumulated during that time interval.

Observation centric Momentum (OCM): The linear motion model assumes a consistent velocity direction. However, this assumption often does not hold because of the non-linear motion of objects and state noise. Over a reasonably short time the motion can be approximated as linear, but the noise still prevents the tracker from exploiting the consistency of the velocity direction. They propose a way to reduce the noise and add the velocity consistency (momentum) term to the cost matrix.
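A velocity-consistency term of this kind can be sketched as the angle between the direction a track has been moving and the direction it would take if it were linked to a candidate detection. This is a hedged illustration, not the authors' exact formulation: the function names are hypothetical, and directions are computed from observed box centers only.

```python
import math


def direction(p_from, p_to):
    """Angle in radians of the vector from p_from to p_to."""
    return math.atan2(p_to[1] - p_from[1], p_to[0] - p_from[0])


def angle_difference(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def momentum_cost(prev_obs, last_obs, det_center):
    """Direction inconsistency between the track's motion and a candidate link.

    prev_obs, last_obs: centers of two past observations of the track.
    det_center: center of a candidate detection in the current frame.
    """
    track_dir = direction(prev_obs, last_obs)
    candidate_dir = direction(last_obs, det_center)
    return angle_difference(track_dir, candidate_dir)
```

In an association step, this value would be scaled by a weight and added to the IOU-based cost, so that detections lying along the track's direction of travel are preferred.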

Observation Centric Recovery (OCR): This step deals with re-identifying an object that has no trajectory prior. The position of such an object can be thought of as following a Gaussian distribution whose mean is its last observed position and whose variance grows with the time it has been lost.

Figure 3. Observation-centric Online Smoothing reduces the error accumulation when a track is broken. The target is occluded between the second and the third time step and the tracker finds it back at the third step. Yellow boxes are observations by the detector. White stars are the estimated centers without OOS. Yellow stars are the estimated centers fixed by OOS. The gray star on the fourth step is the estimated center without OOS and fails to match to observations.

Table 5: Performance on MOT 20 challenge

StrongSORT [5] mainly improves the DeepSORT algorithm. The proposed improvements lie in its two branches (refer to Fig 2 and Fig 4 for comparison).

In the appearance branch, a stronger appearance feature extractor, BoT, replaces the original simple CNN and generates much more discriminative features. The feature bank is replaced with a feature updating strategy based on an exponential moving average (EMA).
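An EMA feature update keeps a single appearance vector per tracklet and blends each newly matched detection feature into it. The sketch below assumes plain Python lists as feature vectors and a momentum value of 0.9; both the function name and the default are assumptions for illustration.

```python
def ema_update(track_feature, detection_feature, alpha=0.9):
    """Blend a new detection feature into the tracklet's appearance state.

    alpha is the momentum: a higher alpha changes the stored feature more slowly.
    """
    return [alpha * e + (1.0 - alpha) * f
            for e, f in zip(track_feature, detection_feature)]
```

Compared with a feature bank, this stores one vector per tracklet instead of up to 100, and the matching cost becomes a single distance computation per pair.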

For the motion branch, they adopt ECC for camera motion compensation. The vanilla Kalman filter is vulnerable to low-quality detections and ignores information on the scale of the detection noise. To solve this problem, they use the NSA Kalman algorithm, which adaptively calculates the noise covariance.
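The adaptive noise covariance can be sketched as scaling the measurement noise by one minus the detection confidence, so that confident detections are trusted more in the Kalman update. The function name and the nested-list matrix representation are assumptions for this example.

```python
def nsa_measurement_noise(base_noise, confidence):
    """Scale the measurement noise covariance by (1 - detection confidence).

    base_noise: the preset measurement noise covariance as a nested list.
    confidence: detection confidence score in [0, 1].
    """
    scale = 1.0 - confidence
    return [[scale * value for value in row] for row in base_noise]
```

A detection with confidence 0.9 therefore contributes with a tenth of the preset noise, while a low-confidence detection keeps almost the full noise and pulls the state estimate less.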

Instead of employing only the appearance feature distance during matching, they solve the assignment problem with both appearance and motion information.

Table 6: Performance on MOT 20 challenge

In this article, we first presented a comprehensive classification of object tracking algorithms, in which tracking algorithms are divided into feature-based, segmentation-based, estimation-based, and learning-based categories. We then focused on learning-based and estimation-based approaches. Estimation-based methods have shown quite promising results in terms of accuracy and speed, and they have been used extensively in many real-time applications. Learning-based tracking algorithms, especially those based on deep learning, have received much attention recently. Deep learning has achieved higher accuracy and made a lot of progress in many fields, especially computer vision. However, the computational complexity of deep learning networks is sometimes a bottleneck for real-time applications, even when their tracking accuracy is good. Deep learning does not work best in all cases and should not always be used; by knowing the advantages and disadvantages of all methods, one can find out which method works best for a given problem.

1. Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016, September). Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP) (pp. 3464–3468). IEEE.

2. Wojke, N., Bewley, A., & Paulus, D. (2017, September). Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP) (pp. 3645–3649). IEEE.

3. Cao, J., Weng, X., Khirodkar, R., Pang, J., & Kitani, K. (2022). Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. arXiv preprint arXiv:2203.14360.

4. Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., … & Wang, X. (2021). ByteTrack: Multi-Object Tracking by Associating Every Detection Box. arXiv preprint arXiv:2110.06864.

5. Du, Y., Song, Y., Yang, B., & Zhao, Y. (2022). StrongSORT: Make DeepSORT great again. arXiv preprint arXiv:2202.13514.

6. Nasseri, M. H., Moradi, H., Hosseini, R., & Babaee, M. (2021). Simple online and real-time tracking with occlusion handling. arXiv preprint arXiv:2103.04147.

7. Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., & Leibe, B. (2021). HOTA: A higher order metric for evaluating multi-object tracking. International journal of computer vision, 129(2), 548–578.

8. Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., & Yuan, J. (2021). Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12352–12361).

9. Zhang, L., & Yu, N. (2022). Online Multi-Object Tracking with Unsupervised Re-Identification Learning and Occlusion Estimation. Neurocomputing.
