Video Content Fusion

In ICoSOLE, we have real-time requirements for calibration, as the cameras may change their relative positions and orientations at any time. Full recalibration at each time instant is too costly, so methods to update the current calibration will be developed. These methods need to avoid drift and will use constraints based on detection results from the scene and assumptions about the continuity of object motions. Due to the properties of user-generated video, such as motion blur or coding artefacts, feature-point-based methods must be complemented by detection-based approaches that are more robust against these content impairments.
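As a rough illustration of such an update step (not the project's actual method), the sketch below re-estimates the relative pose between two cameras from feature matches and falls back to the previous estimate when the update violates a simple motion-continuity constraint. It assumes OpenCV and a known intrinsic matrix K; the function name and the rotation-step threshold are illustrative.

```python
# Minimal sketch of an incremental calibration update between two cameras,
# assuming OpenCV and a shared intrinsic matrix K. Names and thresholds
# are illustrative, not part of any ICoSOLE codebase.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def update_relative_pose(frame_a, frame_b, K, prev_R, prev_t,
                         max_rotation_step_deg=5.0):
    """Re-estimate the relative pose (R, t) between two camera views,
    keeping the previous estimate when the update is implausible."""
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return prev_R, prev_t  # too few features (e.g. motion blur)

    matches = matcher.match(des_a, des_b)
    if len(matches) < 20:
        return prev_R, prev_t

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])

    # RANSAC suppresses outlier matches caused by coding artefacts and blur.
    E, inliers = cv2.findEssentialMat(pts_a, pts_b, K,
                                      method=cv2.RANSAC, threshold=1.0)
    if E is None:
        return prev_R, prev_t
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=inliers)

    # Continuity constraint: reject updates with an implausibly large
    # rotation jump relative to the previous estimate (limits drift).
    angle = np.degrees(np.arccos(
        np.clip((np.trace(prev_R.T @ R) - 1.0) / 2.0, -1.0, 1.0)))
    if angle > max_rotation_step_deg:
        return prev_R, prev_t
    return R, t
```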

In scenes with a higher density of sources, we aim to place them in a common space. Photosynth is a well-known example of such an approach for still images (also dealing with issues of user-generated content). We intend to provide a similar approach for video content, enabling the user to navigate the space of related video sources.
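One possible way to realise such navigation, sketched below under simplifying assumptions, is to link video sources whose keyframes share sufficient visual overlap into a graph the user can traverse. The keyframe sampling step, matching distance and overlap threshold are illustrative choices, not project parameters.

```python
# Minimal sketch: build a navigable graph of related video sources by
# measuring visual overlap between pooled keyframe descriptors.
import itertools

import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=500)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def keyframe_descriptors(video_path, step=50):
    """Pool ORB descriptors from every `step`-th frame of a video."""
    cap = cv2.VideoCapture(video_path)
    pooled = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            _, des = orb.detectAndCompute(frame, None)
            if des is not None:
                pooled.append(des)
        idx += 1
    cap.release()
    return np.vstack(pooled) if pooled else None

def overlap_score(des_a, des_b):
    """Fraction of descriptors in A with a close match in B."""
    matches = matcher.match(des_a, des_b)
    good = [m for m in matches if m.distance < 40]
    return len(good) / max(len(des_a), 1)

def build_source_graph(video_paths, min_overlap=0.05):
    """Adjacency list linking sources that view the same scene region."""
    des = {p: keyframe_descriptors(p) for p in video_paths}
    graph = {p: [] for p in video_paths}
    for a, b in itertools.combinations(video_paths, 2):
        if des[a] is None or des[b] is None:
            continue
        if overlap_score(des[a], des[b]) >= min_overlap:
            graph[a].append(b)
            graph[b].append(a)
    return graph
```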

Mobile vision-based location recognition has so far been presented mainly within snapshot-based services. A video-stream-based localisation approach requires further development using highly innovative methodologies to become capable of matching informative visual descriptors in near real-time, taking the possibly varying quality of the video into account. ICoSOLE will advance the state of the art in geo-indexed landmark recognition by applying fast cue detectors to live video streams, and from this efficiently generating metadata on accurate user position, orientation and image content.
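The sketch below illustrates the general idea under simplifying assumptions: a pre-built database mapping landmark identifiers to local descriptors and GPS coordinates, with recognition run on a subsampled live stream to stay near real-time. The database layout, ratio-test threshold and frame-skip interval are all illustrative.

```python
# Minimal sketch of geo-indexed landmark recognition on a live video stream,
# assuming a pre-built database {landmark_id: {"des": ndarray, "lat": float,
# "lon": float}}. Thresholds and names are illustrative.
import cv2

orb = cv2.ORB_create(nfeatures=800)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)  # crossCheck off for knnMatch

def recognise_frame(frame, landmark_db, min_good=25):
    """Return geo metadata for the best-matching landmark, if confident."""
    _, des = orb.detectAndCompute(frame, None)
    if des is None:
        return None
    best_id, best_count = None, 0
    for lid, entry in landmark_db.items():
        pairs = matcher.knnMatch(des, entry["des"], k=2)
        # Lowe ratio test filters ambiguous matches from low-quality video.
        good = [p[0] for p in pairs
                if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
        if len(good) > best_count:
            best_id, best_count = lid, len(good)
    if best_count < min_good:
        return None  # no confident recognition for this frame
    hit = landmark_db[best_id]
    return {"landmark": best_id, "lat": hit["lat"], "lon": hit["lon"],
            "matches": best_count}

def process_stream(stream_url, landmark_db, every_n=10):
    """Run recognition on every n-th frame to stay near real-time."""
    cap = cv2.VideoCapture(stream_url)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            meta = recognise_frame(frame, landmark_db)
            if meta is not None:
                print(meta)
        idx += 1
    cap.release()
```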

A key challenge in ICoSOLE related to this research topic is to achieve robust matching among user-generated content items and between user-generated and professional content items, while keeping the approach scalable. In many approaches, scalability is achieved by quantising features (e.g., into histograms) so that they can be indexed more efficiently. Results for the TRECVID instance search task show that such approaches achieve significantly lower performance than those matching a richer feature representation. As matching complete descriptors is not feasible, we will investigate the use of recently proposed methods, such as Fisher coding of feature vectors, or compact descriptors for visual search (CDVS). For the latter, standardisation of still-image descriptors is ongoing in MPEG, but an extension to video is still missing and will be considered in ICoSOLE.
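As a concrete illustration of Fisher coding, the following sketch encodes an item's local descriptors into a single fixed-length vector using the gradient with respect to the means of a diagonal-covariance GMM (variance terms omitted for brevity), followed by the power and L2 normalisation commonly used with Fisher vectors. It assumes scikit-learn; all names are illustrative.

```python
# Minimal Fisher-vector sketch for compact, scalable descriptor matching.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_vocabulary(descriptors, n_components=64):
    """Fit a diagonal-covariance GMM on pooled local descriptors."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    gmm.fit(descriptors)
    return gmm

def fisher_vector(descriptors, gmm):
    """Encode an item's local descriptors as one fixed-length vector."""
    X = np.atleast_2d(descriptors)
    N, D = X.shape
    gamma = gmm.predict_proba(X)        # (N, K) soft assignments
    mu = gmm.means_                     # (K, D)
    sigma = np.sqrt(gmm.covariances_)   # (K, D)
    w = gmm.weights_                    # (K,)
    # Gradient with respect to the component means.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    g_mu = (gamma[:, :, None] * diff).sum(axis=0)
    g_mu /= (N * np.sqrt(w)[:, None])
    fv = g_mu.ravel()
    # Power and L2 normalisation, as commonly used with Fisher vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```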

The research leading to the presented results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 610370.