Visual Object Tracking by Fusion of Audio Imaging in Template Matching Framework

Pages: 40-49



Satbir Singh 1,*, Arun Khosla 1, Rajiv Kapoor 1

1. Dr. B. R. Ambedkar National Institute of Technology, Jalandhar, India

* Corresponding author.


Received: 4 Apr. 2019 / Revised: 16 May 2019 / Accepted: 24 Jun. 2019 / Published: 8 Aug. 2019

Index Terms

Audio Imaging, Template Matching, Object Tracking, Information Fusion


Audio imaging can play a fundamental role in computer vision, particularly in automated surveillance, by boosting the accuracy of current systems based on standard optical cameras. We present a method for object tracking that fuses a visual image with an audio image within a template-matching framework. First, an improved template-matching tracker is presented that handles the chaotic movements arising in the template-matching algorithm. Then, a fusion scheme is presented that uses deviations in the pattern of correlation scores obtained across each frame in each imaging domain. The method is compared with several state-of-the-art trackers that estimate the track using only visible imagery. The results show that, with the assistance of audio imaging, the proposed method significantly improves object tracking under severely challenging vision conditions such as occlusion, object shape deformation, clutter, and camouflage.
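As a rough illustration of the idea described above, the following sketch computes a normalized cross-correlation score map in each imaging domain and fuses the two maps. It is not the authors' algorithm: the abstract does not specify the fusion rule, so the weighting used here (how far the best score stands above the mean score, in standard deviations) is an assumption, as are all function names. The sketch also assumes that the visual and audio images are co-registered and that both templates have the same size.

```python
import math

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equal-size 2D lists."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mp, mt = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - mp) * (b - mt) for a, b in zip(p, t))
    den = math.sqrt(sum((a - mp) ** 2 for a in p) * sum((b - mt) ** 2 for b in t))
    # A flat patch or template carries no structure; score it as 0.
    return num / den if den > 0 else 0.0

def score_map(image, template):
    """Slide the template over the image; return {(row, col): NCC score}."""
    th, tw = len(template), len(template[0])
    scores = {}
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            scores[(r, c)] = ncc(patch, template)
    return scores

def peak_deviation(scores):
    """Assumed confidence measure: (best score - mean score) / std of scores."""
    vals = list(scores.values())
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return (max(vals) - mean) / std if std > 0 else 0.0

def fuse(visual_scores, audio_scores):
    """Weight each domain by its peak deviation; return the best position."""
    wv, wa = peak_deviation(visual_scores), peak_deviation(audio_scores)
    total = wv + wa
    if total == 0:
        # Neither domain is informative; fall back to equal weights.
        wv, wa, total = 1.0, 1.0, 2.0
    fused = {pos: (wv * visual_scores[pos] + wa * audio_scores[pos]) / total
             for pos in visual_scores}
    return max(fused, key=fused.get)
```

Under this assumed rule, a domain whose score map is flat (for example, a camouflaged object in the visual image) receives little or no weight, so the estimate follows the other domain.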

Cite This Paper

Satbir Singh, Arun Khosla, Rajiv Kapoor, "Visual Object Tracking by Fusion of Audio Imaging in Template Matching Framework", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol. 11, No. 8, pp. 40-49, 2019. DOI: 10.5815/ijigsp.2019.08.04

