R. Brunelli and M. Omologo
A major challenge in the development of multimodal systems is the synergistic integration of multiple data streams from possibly heterogeneous sensors. As an example, audio-visual person tracking aims at inferring the location of targets from both microphone data and camera images. Using both modalities, a tracker may achieve significantly better performance than one relying on either alone, since each modality can compensate for weaknesses of the other: an occlusion may completely invalidate the visual information about the location of an audible target, whereas that location can still be inferred from the acoustic data. This tutorial consists of two parts: the first covers acoustic technologies for speaker localization and tracking, while the second addresses video-based person tracking and the fusion of audio and video processing.