Motion Estimation Algorithm Performance

Projected Interframe Compression Performance of Martingale's Motion Estimation Algorithm

Abstract

Martingale's advanced proprietary motion estimation algorithm (MME) will provide compression performance superior to the mean absolute difference (MAD) minimization scheme used with MPEG. This is because the MME algorithm

Computes correct translational motion vectors under identifiable scene conditions where MAD has an increased failure probability, and
Uses a longer frame history (without the attendant frame buffers) than MAD's two-frame algorithm, which provides increased gain and smaller RMS error in estimation of the correct translational motion vector.

MME can achieve an inter-frame data compression rate of 7.5-to-1, or better, compared with MPEG's motion estimation factor of 5-to-1, depending on the nature of the video and the inter-frame compressibility of macroblocks that MAD presently fails to exploit.

Our MME algorithm requires only 56 arithmetic operations per pixel per frame, making it an order of magnitude faster than the macroblock search method (MAD) used for motion estimation in most MPEG implementations. In advanced applications, the MME algorithm can also be paired with lossless coding to implement a symmetric video codec. This provides the vastly improved compression available with pixel-level motion compensation without transmitting the motion vectors! (This option is not compatible with MPEG or Px64.)

Introduction

Prior to empirical testing, any comparison of compression performance between motion estimation algorithms is subjective and highly dependent on the qualifying assumptions about the data and the optimization objectives. There is no "theoretically best" compression algorithm for all possible image sequences¹. However, there are certain principles of signal and video processing that permit us to make some qualitative assessments. In order to apply these principles, we need to begin with some basic assumptions to establish the framework.

First, we assume that whenever there exists translational motion at a macroblock scale in the interval between successive I or P frames, any two algorithms that produce an unbiased estimate of the correct motion vector with comparable RMS error will achieve comparable compression performance because the resulting error blocks will be substantially the same. In order for one algorithm to claim improved compression performance, it must be shown that it produces more motion vectors per frame, with a smaller RMS estimation error.

Second, based on empirical studies performed in our target tracking experiments, it is reasonable to assume that our MME algorithm typically converges within three frames to motion estimates at the pixel level whose (weighted) average over each macroblock is an unbiased estimate of the gross translational motion. The resulting RMS error is comparable with that of the MAD method when illumination is constant and motion is purely translational. We assume that the macroblock compression factor achievable under those circumstances is the same, whether MAD or MME supplies the motion vectors, and that it is R₁ to 1.

Finally, we note that the MAD algorithm employs a threshold level for the minimum of the mean absolute difference over the search space, such that when that minimum MAD exceeds the threshold, the algorithm presumes that no meaningful motion vector can be computed for the macroblock. We will assume that on the average 100P percent of the macroblocks in each frame are intra-frame coded without motion estimation, and that of these an average MPEG compression ratio of R₂ to 1 could have been obtained by using a good estimate of the mean translation component of the macroblock.

Basis for Improved Compression Performance

Our basis for the claim of improved compression performance lies in the fact that our MME algorithm is a pixel-level motion estimator that is relatively insensitive to

time-varying illumination of the scene, or parts thereof,
rotational motion,
scaling of the scene (zooming or radial motion),
motion of constituent elements of arbitrarily-defined blocks.

The MAD algorithm and all other existing motion estimation algorithms are heavily dependent on constant illumination. Furthermore, all block translation algorithms, such as those specified in MPEG and Px64, degrade rapidly when there is constituent motion within the block, and when the motion has significant rotational or scaling components. But in typical video, only a small percentage of the motion is "purely translational at constant illumination".

To illustrate, consider the case in which there is simple translational motion under conditions of variable average illumination of the scene. In this case, even when the MAD algorithm tests the correct motion vector, it will obtain a large value for the mean absolute difference metric. Therefore, there is an increased probability that it will find a better correlation at another location (especially if the illumination is both spatially and temporally nonuniform) or that it will reject the motion hypothesis altogether by exceeding the threshold level. The MME, on the other hand, is insensitive to changes in average illumination. It will produce the correct translation vector, and the resulting error frame will contain mostly a DC component due to the illumination difference, which is highly compressible using either DCT+ Quantization+ Coding or Wavelet+ Quantization+ Coding.

Consider next the case where there are scaling, rotational, and/or constituent motions present. When the MAD method tests the correct translation vector during its search, the MAD metric will again be large in proportion to the magnitude of the scale factor, the rotational angle or the constituent motions. Unlike the variable illumination error, this growth arises due to the misalignment of constituent objects, or of spatial frequency components that are overlaid out of phase by the candidate translation vector. Consequently, the probability of finding a better correlation with a false motion vector, or of having to fall back to intra-frame coding, also increases if the MAD metric is used. But with our MME algorithm, the error frame obtained using the correct translational component will contain mostly high spatial frequency, low amplitude components which are more highly compressible on the average than the corresponding I-frame.

Also, it is reasonable to assert that when motion persists through more than two I or P frame intervals, then the MME algorithm will provide better RMS estimation errors than MAD, even when the motion is purely translational. This is because the MME algorithm, being based on martingale processes and dynamic programming principles, accumulates information about coherent motion from all the past history of the input video back to the last change of context. Thus it is able to achieve greater motion estimation gain over any algorithm that only employs only two frames in the computation of the motion estimate.

Our MME algorithm can be used to eliminate B frames from MPEG sequences if desired, but the use of B frames greatly increases the compression ratio, and, as previously noted, it provides the convergence time needed by the MME algorithm after abrupt changes of context. That is, frames destined to be B-coded will still be used by the MME algorithm to sharpen the accuracy of its motion vectors at the next P frame, even though the motion vectors associated with the B frames will be discarded.

Quantitative Estimates

We know that the average frame data compression ratio for the MAD motion estimation algorithm in MPEG is empirically about 5-to-1 ². Let P be the average fraction of macroblocks per frame that are NOT compressed with motion estimation in MPEG, for whatever reason. If R₁ is the average compression ratio of each macroblock for which MAD provides a motion vector, and R₂ is the average compression ratio for the remaining macroblocks using the motion vector that only MME can supply, then the net frame data reduction ( 1/5 ) is a weighted average of the data reductions available for the two classes of macroblocks:

1/5 = (1-P)*(1/R₁) + P*(1/R₂)

Using MPEG, the 5-to-1 compression is obtained with no inter-frame compression in the fraction P of the macroblocks for which MAD provides no motion vector; therefore, the value of R₂ for MPEG+MAD is 1. That means, for example, that if P = 0.1, then R₁ must be 9; i.e., the average data compression attributable to motion estimation on the remaining 90% of the macroblocks in each frame must be 9-to-1 in order to achieve the empirically observed 5-to-1 average for the whole frame.

Using MME in place of MAD, the (1-P) fraction of the macroblocks for which MAD would have found a motion vector are assumed to be compressed by the same amount (R₁) when MME supplies the motion vector. But the remaining fraction will now be compressed at a rate R₂ that is different. Table I shows the average total frame compression rates that are expected to result from using MME in place of MAD for various values of P and R₂ (the left and top margins, respectively, of Table I). For example, if P=0.1, then R₁=9 (by the intermediate calculation as described above). But if R₂=3.0 using MME, then the net frame compression is 7.5 (cell with solid border in the table).

TABLE I: Estimated frame compression factors after replacing MAD by MME

The shaded column in Table I shows that if MME were to achieve NO compression on the corresponding percentage of blocks that were missed by MAD, then, under the prevailing assumptions, the net compression factor for the whole frame (due to motion estimation) would be the same as with MPEG+MAD, i.e., 5-to-1.

Additional Compression Gain with Proprietary Architecture

Finally, if our MME algorithm is coupled with a lossless coding method, then it can be used in a video coding architecture similar to that of the perceptually weighted residual-excited linear predictive (RELP) vocoder ³. Abandoning the lossy quantization step costs bandwidth -- roughly a factor of 3 in conjunction with the DCT. However, this will be offset by the improved compression ratio made possible by pixel-level motion estimation, which is more robust under the effects of rotation, scaling, and constituent motions. And, because of the lossless coding, the bandwidth once required for transmission of the motion vectors will also be reclaimed. This is because, in the vocoder paradigm, a replica of the receiver is imbedded in the transmitter so that the "prediction coefficients" need not be transmitted.

It is difficult to estimate the ultimate quality of the resulting compression, because, as with speech compression, much depends on the quality of the predictive model. In that regard, however, we remark that our MME algorithm is the basis for a speech processing and recognition algorithm presently under development, and when that project is complete, it is likely that we will be able to integrate both video compression and audio compression into the same architecture, thus achieving more efficient use of hardware.

Conclusion

Martingale Research Corporation's motion estimation algorithm provides at least an order of magnitude improvement in computational burden for the production of video motion vectors at pixel-level resolution. Furthermore, it can provide motion estimation data compression rates with MPEG of 7.5-to-1 or better, depending on the compressibility of macroblocks not presently compressible with the MAD motion estimator. Even greater compression rates may be achieved by exploiting pixel-level motion estimation in a RELP vocoder-type of video coded architecture.

References

Gailly, J-l., "Introduction to MPEG", Part 2 of 3 parts of the Data Compression FAQ available by ftp on rtfm.mit.edu in directory /pub /usenet /news .answers /compression-faq.
Doyle, Bob, "How Codecs Work", Newmedia, March 1994, 52-55.
Owens, F.J., Signal Processing of Speech, McGraw-Hill, 1993, 6.6.2.
Patel, K., Smith, B., and Rowe, L., "Performance of a Software MPEG Video Decoder", Computer Sciences Division-EECS, U.C.-Berkeley, Technical Report (downloaded from anonymous ftp from cs.berkeley.edu).
Rowe, L., Patel, K., Smith, B., and Liu, K., "MPEG Video in Software: Representation, Transmission, and Playback", Computer Sciences Division-EECS, U.C.-Berkeley, Technical Report (downloadable by anonymous ftp from cs.berkeley.edu); also appears in High Speed Networking and Multimedia Computing, IS&T/SPIE Symp. on Elec. Imaging Sci. & Tech., San Jose, CA, February 1994.