Public Deliverables – ICoSOLE – Immersive Coverage of Spatially Outspread Live Events

D5.3 Broadcast Playout Engine and Audio Playout Libraries (July 2016)

DOI: http://dx.doi.org/10.7800/304icosoled53

This document describes the development of the ICoSOLE playout engine for linear, directed live content. It reports on the design and architecture, and on the key decision to use the open-source multimedia framework GStreamer as the foundation. It provides details on synchronisation and GPU acceleration of video processing, and puts the engine into the context of the first field trial and the overall ICoSOLE live editing system. A brief description of further development of the engine and its use beyond ICoSOLE follows. The final section describes the C++ and browser-based libraries for audio rendering in the context of the BBC IP studio project.

D3.3.2 UGC Capture Tools (July 2016)

DOI: http://dx.doi.org/10.7800/304icosoled332

“ICoSOLE will develop an approach enabling an immersive experience of live events, which are spatially spread out. The approach is scalable in terms of content capture. In order to make the approach applicable for events where not all venues of interest can be covered with professional equipment, user generated content (UGC) captured with mobile devices (e.g., built-in cameras of smart phones, action cameras) will be supported. The content will be streamed from the mobile devices to the production system adopting network-adaptive (trans-)coding and streaming techniques.”

The above is an excerpt from the ICoSOLE project’s Description of Work (DoW) that highlights the significance of user-generated content. The figure below shows the conceptual diagram of the ICoSOLE system architecture, with the UGC capture part framed on the left.

In the context of ICoSOLE, UGC capture covers the generation of audio, video and metadata on consumer/prosumer devices (smartphones and tablets). To meet the quality constraints of the project, the captured essence will be analysed on the device, based on metadata generated by the built-in sensors of the device.

The capture tools will also handle communication with production management, as well as the transport of captured essence (audio, video and metadata) to a central storage and processing unit on site. Mobile devices such as smartphones are subject to large bandwidth fluctuations within mobile networks. This is especially challenging for live streaming, as the multimedia data must be delivered in time. If the requirement of real-time upstreaming cannot be met (e.g. where the uplink channel does not have enough capacity to deliver even a low-quality representation), the captured essence should be stored on the device for later transmission to the system.
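The fallback described above can be illustrated with a minimal sketch. It is not taken from the deliverable: the bitrate ladder, the safety margin and the function names are assumptions chosen for illustration only. The idea is to pick the highest representation that fits the measured uplink and to store the segment locally when none fits.

```python
from typing import Optional, Sequence

# Hypothetical bitrate ladder in kbit/s (not from the deliverable).
LADDER = (400.0, 1200.0, 3000.0)


def choose_representation(uplink_kbps: float,
                          bitrates_kbps: Sequence[float],
                          safety: float = 0.8) -> Optional[float]:
    """Return the highest bitrate that fits the uplink, or None if none fits.

    `safety` is an assumed headroom factor so that short bandwidth dips
    do not immediately stall the live upload.
    """
    budget = uplink_kbps * safety
    fitting = [b for b in bitrates_kbps if b <= budget]
    return max(fitting) if fitting else None


def handle_segment(uplink_kbps: float,
                   ladder: Sequence[float] = LADDER) -> str:
    """Decide whether a captured segment is streamed live or stored."""
    chosen = choose_representation(uplink_kbps, ladder)
    if chosen is None:
        # Even the lowest-quality representation does not fit:
        # keep the essence on the device for later transmission.
        return "store"
    return f"stream@{chosen:g}kbps"
```

A real client would measure the uplink continuously and re-evaluate the decision per segment; the sketch only shows the decision rule itself.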

D4.2 Algorithms for Content Analysis and for Automatic Content Pre-Selection (June 2016)

DOI: http://dx.doi.org/10.7800/304icosoled42

This document reports on the final version of the content analysis and selection algorithms. In particular, it describes the work on visual and audio quality analysis, visual matching, and automatic and manual tools for content selection. In addition to the client-side (i.e., mobile device based) content analysis tools described in D4.1, we have implemented a set of visual and audio quality algorithms running on the server side. These algorithms are capable of performing more in-depth analysis, as they are not limited by the resource constraints of the mobile device. For audio content, algorithms for audio classification have also been implemented. For video content, a novel compact description framework has been developed. It extends approaches for compact visual descriptors to video content, making use of the temporal redundancy in the content. This enables efficient representation of content and segment-based matching of descriptors, thus reducing the computational complexity of visual similarity matching. This document also describes the work on content selection, which consists of two stages. Pre-selection is performed automatically, by filtering content based on quality properties and ranking incoming content based on content metadata (either captured, provided with the content or extracted using the algorithms described in this document). The second stage is a web-based application for content selection, which also serves as the platform for communicating with the users contributing content.
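The automatic pre-selection stage (filter on quality, then rank on metadata) can be sketched as follows. This is an illustrative assumption, not the algorithm from D4.2: the `Clip` fields, the score ranges and the threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Clip:
    clip_id: str
    quality: float    # hypothetical quality score in [0, 1]
    relevance: float  # hypothetical metadata-based score in [0, 1]


def preselect(clips: Iterable[Clip], min_quality: float = 0.5) -> List[Clip]:
    """Drop clips below the quality threshold, then rank by relevance."""
    kept = [c for c in clips if c.quality >= min_quality]
    return sorted(kept, key=lambda c: c.relevance, reverse=True)
```

The ranked list would then be handed to the second, manual stage, where an editor makes the final choice.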

D3.2 Professional Audio and Video Capture (December 2015)
DOI: http://dx.doi.org/10.7800/304icosoled32

This document presents the progress made in WP3T2 on research and development of professional audio and video capture systems up to month 24 of the ICoSOLE project. The document consists of two parts, describing the video capture system developed by iMinds and the audio capture system developed by the BBC respectively. The iMinds video capture system is an ultra-high resolution virtual camera system that allows events to be captured cost-effectively from multiple angles. Framing can be deferred to a post-production stage, which opens up novel creative possibilities. The approach enables a single operator to capture complex events. It naturally allows for remote live production, is suited for integration in an IP broadcast environment, and blends seamlessly with virtual reality. The video capture system was (and still is) designed in the context of numerous field trials, following a “design by doing” approach. The BBC is investigating techniques for capturing professional quality audio that can be used in immersive and interactive user experiences. This involves using conventional microphone techniques in the production of immersive spatial audio, but also extending the techniques to incorporate multi-microphone arrays for 3D spatial audio. This document presents research on available techniques and solutions for immersive and interactive audio capture, as well as experience with the techniques during field trials. Practical issues of set-up, capture/transport and calibration are also discussed. A summary of the findings when using these microphone techniques in the production of immersive and interactive experiences is also given.

D6.3 First Demonstration and Field Trial (November 2015)
DOI: http://dx.doi.org/10.7800/304ICOSOLED63

This deliverable reports on the course of the field trial at the Dranouter Festival, which took place from August 7th to 9th. All partners of the ICoSOLE consortium were present and tested their first prototypes. Some prototypes were already integrated to some level, which enabled us to align our software from a very early stage. During the festival, many recordings were stored as well. These recordings will be used as a test pool for future prototypes. Furthermore, they can be very valuable for showcasing current prototypes at conferences and other events. Participating in a real event also posed many hurdles. Therefore, we started planning every part of the field trial very early on. We also had a lot of support from the festival organisation, which made this test a real success.

D4.4 Tools for Content Compositing and User Interface for Live Playout (November 2015)

DOI: http://dx.doi.org/10.7800/304ICOSOLED44

This document presents the progress made so far in WP4T3 on professional content compositing in the ICoSOLE project. On the one hand, an algorithm for making virtual camera transitions between physical standpoints has been developed. On the other hand, a user interface for live playout has been created, enabling a director to produce linear content for live broadcast.

D2.4.1 Evaluations v1 (November 2015)

DOI: http://dx.doi.org/10.7800/304ICOSOLED241

This deliverable contains the results and analysis of several user studies and technical evaluations that took place during the initial project period. The main evaluations were organised around the field trial at the Dranouter music festival in August 2015, in which the project partners participated. Several user studies took place directly during the festival. Two other user studies led to the design, implementation and validation of a Smart TV user application for live events. Finally, the live mixing workflow leading to a live stream was evaluated. In terms of technical evaluations, details are provided on statistics from the user tests, as well as technical evaluations of the content pre-caching approach and the multi-depth layered video. Overall, the general conclusion of both the user tests and the technical tests is quite positive. Although they show that work remains to improve the results, they also show that the concepts pursued by ICoSOLE are valid. Accordingly, the ICoSOLE consortium will continue in its current direction, fine-tuning prototypes and use cases by incorporating the results of these evaluations.

D7.2.2 Report on Dissemination and Standardisation Activities Y2 (October 2015)

DOI: http://dx.doi.org/10.7800/304ICOSOLEd722

This document is a report on dissemination and standardisation activities carried out by the project partners during the second year of the project. Following the dissemination and standardisation activities of the first year, the effort in the second project year focused more on the presentation of technology advancements as well as on activities carried out jointly between partners or within the consortium. The work performed in the second year is evidenced by eight public deliverables, 18 papers and articles released in international publications, seven presentations, posters and demos at international symposiums or conferences and, last but not least, the first field trial, carried out on the occasion of the Dranouter folk festival in Belgium. Deliverables and presentations (if permitted) are available from the project web-site at http://icosole.eu/. Furthermore, the envisaged plans for dissemination and standardisation in the third project year are described in a separate chapter.

D3.1.2 Format Agnostic Scene Representation (October 2015)

DOI: http://dx.doi.org/10.7800/304ICOSOLEd312

This deliverable defines the scope of the scene representation and its relation to the overall system architecture. In the DoW, the task format-agnostic scene representation is defined as follows: The goal of this task is to define a common site-global framework into which all captured video, audio and metadata are to be registered 3D-spatially as well as in time (through consistent time stamping and synchronisation). This scene representation focuses on the sensors involved in capturing data, their parameters, the content and metadata streams they produce, as well as processing components that modify and combine these streams. The entities in the scene representation are consistent with those defined in the overall system architecture. The definition of the scene representation adds specialisation of some of these entities. It defines attributes of the involved devices/objects, as well as metadata carried in metadata streams of these devices. The semantic scene creation in WP4 defines the scene used for production, consisting of the video and audio streams and objects in the scene. In order to create this scene, and a spatiotemporal alignment of the data, information about the capture devices and their metadata is needed. The production-side scene has a similar scope to MPEG BIFS, while the capture scene representation is more focused on the capture devices and their metadata. This document is structured as follows: Section 3 lists requirements gathered for the scene representation. Section 4 describes the approach of the generic scene representation model. Section 5 describes actual model structures and values. Section 6 and Section 7 describe the properties of processing devices and entity nodes respectively. The structure is based on the previous deliverable D3.1.1, which defined the initial parameter settings, with a focus on modifications and extensions that were deemed necessary and relevant due to the knowledge gained in the course of the project execution.
The annex specifies the metadata formats, i.e., the representation of the attributes specified in the document using JSON.

D5.2 First Version of Playout Clients (June 2015)

DOI: http://dx.doi.org/10.7800/304ICOSOLEd52f

This report presents early work on playout clients for the ICoSOLE project. Several different applications have been developed to demonstrate technical advancements and features of the ICoSOLE user experience. The clients are primarily “web-based”, meaning that they run in a web browser, most often implemented in HTML and JavaScript. A set of state-of-the-art clients has been implemented for adaptive streaming of media using MPEG DASH to various devices. Plug-in free JavaScript implementations are supported on many platforms, with a Flash client available as a fallback solution for older platforms. A novel method has been used to reduce the start-up delay when accessing MPEG DASH streams. Web clients for immersive and interactive content viewing have been created, including 360˚ or omnidirectional video and 3D binaural audio, using WebGL and the Web Audio API. A technical prototype for browser-based 360˚ video playout augments the video with graphical overlays that are projected into the scene geometry. The audio prototype provides a flexible configuration of audio sources in 3D space, as well as an emphasis parameter for reducing the volume of audio sources out of view. In addition, clients for content navigation have been created. The Venue Explorer presents an interactive view of the event (image or video) that the user can pan and zoom around. An interactive sound mix of sub-event audio is generated and graphical overlays of sub-event information are presented, to assist the user in spatiotemporal exploration of the event. The Wall of Moments presents an integrated view of professional and user-generated content from an event allowing a viewer to see the professional coverage in the context of their own experience or that of their social network. Media is synchronised and can be presented simultaneously with picture-in-picture view. 
Future work will see these applications converge to fewer, more fully featured offerings that share a common data source and backbone of delivery technologies.
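The emphasis parameter of the audio prototype can be illustrated with a minimal sketch. The mapping below is an assumption for illustration, not the prototype's actual behaviour: sources inside an assumed field of view keep full gain, and sources outside it are attenuated to `1 - emphasis`. The function names and the default field of view are hypothetical.

```python
def angular_offset_deg(view_azimuth_deg: float,
                       source_azimuth_deg: float) -> float:
    """Smallest absolute angle between view direction and source, in degrees."""
    diff = (source_azimuth_deg - view_azimuth_deg + 180.0) % 360.0 - 180.0
    return abs(diff)


def source_gain(view_azimuth_deg: float,
                source_azimuth_deg: float,
                emphasis: float,
                half_fov_deg: float = 45.0) -> float:
    """Linear gain for one source; `emphasis` in [0, 1].

    emphasis = 0 leaves out-of-view sources untouched,
    emphasis = 1 silences them completely.
    """
    if not 0.0 <= emphasis <= 1.0:
        raise ValueError("emphasis must be in [0, 1]")
    offset = angular_offset_deg(view_azimuth_deg, source_azimuth_deg)
    if offset <= half_fov_deg:
        return 1.0
    return 1.0 - emphasis
```

In a browser client, a gain of this kind could be applied per source through a gain node of the Web Audio API; a smooth roll-off at the edge of the view would avoid audible jumps and is left out here for brevity.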

D3.3.1 First Version of UGC Capture Tools (May 2015)

DOI: http://dx.doi.org/10.7800/304ICOSOLED331

In this document we describe the current status of, and the outlook for, the UGC capture tools developed within the project. To meet the requirements stated in the DoW, an Android-based application will be implemented, with a focus on high-quality content generation (ensured by on-device analysis of captured essence). Low latency, especially for live scenarios, and an easy-to-use, intuitive user interface are also of high interest. We split the application design into three parts: (a) input/capture, (b) quality analysis and (c) upload/streaming. Different approaches for these parts have been developed and evaluated by partners since the start of this task. In terms of audio/video capture, we evaluated two options: either using Android’s MediaRecorder API, which is easy to use, or the MediaCodec API provided by newer Android devices, which offers greater flexibility when it comes to content analysis. We chose to pursue the second approach, bearing in mind that real-time quality analysis will be performed on the captured essence. For quality analysis, we make use of the available sensors of the mobile device, such as the accelerometer, to detect fast and shaky movements of the mobile device and, thus, unstable image sequences. Furthermore, algorithms for sharpness, noise and over-/underexposure detection have been implemented. If the recorded content does not reach the desired quality levels, it is discarded. In order to encourage users to optimise the quality of their content, they receive immediate feedback on the quality of their recordings. For uploading/streaming of captured essence, we chose to evaluate session-based streaming libraries for RTMP and RTSP in terms of their practicability for the project. Due to incompatibilities and a lack of support for our use case, we developed another approach, which is session-free and based on the HTTP protocol, with great similarities to MPEG-DASH, and which we will pursue further.
Based on the findings of our evaluations, and user feedback from a first test shoot (the Marconi Moments in October 2014), we will improve the application and work on the integration of the different parts implemented by partners.
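As an illustration of accelerometer-based shake detection, the following sketch flags a window of samples as shaky when the spread of the acceleration magnitude exceeds a threshold. It is not the project's algorithm; the threshold value, the units (m/s²) and the function name are assumptions.

```python
import math
from statistics import pstdev
from typing import Iterable, Tuple

Sample = Tuple[float, float, float]  # (x, y, z) acceleration in m/s^2


def is_shaky(samples: Iterable[Sample], threshold: float = 1.5) -> bool:
    """Return True if the acceleration magnitude varies strongly in the window.

    `threshold` is a hypothetical value that would need tuning per device.
    """
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    if len(magnitudes) < 2:
        return False  # not enough data to judge stability
    return pstdev(magnitudes) > threshold
```

A device held still reports a nearly constant magnitude close to gravity, so the standard deviation stays small; rapid movement makes the magnitude fluctuate and pushes it above the threshold.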

D4.1 First Version of Algorithms for Content Analysis and Automatic Content Pre-Selection (April 2015)

DOI: http://dx.doi.org/10.7800/304ICOSOLE41

This document reports on the first version of the content analysis and selection algorithms. In particular, it describes the work on visual and audio quality analysis, visual matching and initial work on rule-based content selection. In order to give the user direct feedback about quality problems, we have developed no-reference, real-time capable quality algorithms in an Android capture application. The mobile device identifies the luminance level, over- or underexposure, blur and noise by inspecting the captured image content on the fly. These algorithms address quality parameters that may be influenced by the user (in contrast to transmission problems), thus providing immediate feedback to the user. The proposed algorithms have low computational complexity and are thus suited for real-time implementations on devices with limited processing capabilities. Nonetheless, objective evaluation has shown that their performance is comparable to more complex algorithms for many practical quality problems. On the server side, an initial blocking detection algorithm is presented. Since it does not rely on a fixed grid size, the algorithm can detect blocking independently of any spatial scaling of the video signal after encoding or any deviation in the incoming signal itself. The exact grid size and position are detected first, and image features tailored to the underlying image grid are then extracted. Finally, based on the image features, a binary decision map is obtained on the block level, indicating how severely the content is affected. For audio content analysis, a modular approach for use on both the mobile device and the server side has been selected. The early focus on system architecture aspects within this work package was useful (and necessary) for the desired inter-device and inter-processing aspects of audio quality and content analysis.
Different sources of quality degradation have been identified: the recording device, the sound sources, the user, and the transmission chain. A first algorithm addressing audio dynamics problems, specifically audio clipping, was developed. An exemplary 3D audio scene, based on recordings from different mobile devices, was created and published at the 41st German Annual Conference on Acoustics (DAGA). We also describe the work on visual matching between visual content streams. We first document an experiment using full matching of local SIFT descriptors on content from the Marconi Moments test shoot. In order to speed up feature matching, we then report on work using compact descriptors (such as VLAD), and combining them with full matching of local descriptors in order to balance speed and matching accuracy. The results show that a significant speedup without sacrificing accuracy can be achieved by prefiltering with compact features. Finally, this document reports on initial work on content selection, implementing rule-based selection strategies. The work reported so far can be grouped into three categories: quality-based selection on user mobile devices, quality-based selection on the server side, and ranking based on the depicted content in a stream.
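Audio clipping detection can be illustrated with a simple run-length heuristic. The sketch below is an assumption, not the algorithm from D4.1: it expects samples normalised to [-1, 1] and counts only runs of consecutive near-full-scale samples, since a single peak at full scale is not necessarily clipping. The `ceiling` and `min_run` defaults are hypothetical.

```python
from typing import Sequence


def clipped_ratio(samples: Sequence[float],
                  ceiling: float = 0.99,
                  min_run: int = 3) -> float:
    """Fraction of samples that belong to a clipped run.

    A run is `min_run` or more consecutive samples whose absolute
    value is at or above `ceiling`.
    """
    if not samples:
        return 0.0
    clipped = 0
    run = 0
    for s in samples:
        if abs(s) >= ceiling:
            run += 1
        else:
            if run >= min_run:
                clipped += run
            run = 0
    if run >= min_run:  # account for a run that reaches the end
        clipped += run
    return clipped / len(samples)
```

A capture application could compare this ratio with a small tolerance and warn the user to lower the input gain or move away from the sound source.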

D4.3 Format Definition for Integrated Content and Scene (November 2014)

DOI: http://dx.doi.org/10.7800/304ICOSOLED43

In the ICoSOLE project, data and metadata originating from various devices and other sources need to be stored and accessible. For good interoperability, it is crucial that standard formats are defined in a meaningful way. This document describes audio and video format specifications, as well as a semantic scene model that organizes this information in a logical way, and how the capture scene representation (D3.1.1) can be transformed into the semantic scene representation. Furthermore, (parts of) this semantic scene model will be used to exchange information between different applications. To achieve this, an example in JSON format, which specifies the serialization of this semantic model, has been created (see Annex).

D2.3 System Architecture and Interface Definitions (November 2014)

DOI: http://dx.doi.org/10.7800/304ICOSOLED23

ICoSOLE aims at developing a platform that enables users to experience live events which are spatially spread out, such as festivals (e.g. Gentse Feesten in Belgium, Glastonbury in the UK), parades, marathons or bike races, in an immersive way by combining high-quality spatial video and audio and user generated content. The project will develop a platform for a context-adapted hybrid broadcast-Internet service, providing efficient tools for capture, production and distribution of audiovisual content captured by a heterogeneous set of devices spread over the event site. The approach uses a variety of sensors, ranging from mobile consumer devices over professional broadcast capture equipment to panoramic and/or free-viewpoint video and spatial audio. Methods for streaming live high-quality audiovisual content from mobile capture devices to content acquisition, processing and editing services will be developed. In order to combine the heterogeneous capture sources, ICoSOLE has to define approaches for integration of content from professional and consumer capture devices, including mobile (and moving) sensors, based on metadata and content analysis. Methods for fusing visual and audio information into a format agnostic data representation have to be applied, which enable rendering video and audio for virtual viewer/listener positions. This document outlines the concept of the chosen system architecture and its associated interface definitions, which will comply with the aforementioned requirements. The approach used within ICoSOLE is based on the model-view-controller pattern, using XML for its related description of system functionalities. It basically consists of the definition of three different device types (stationary, moveable and processing), their spatio-temporal properties, and their related interconnections to each other, represented by means of a connection diagram.
Each of the functionalities is described starting from an overview and going into more detail in dedicated chapters. Additional sequence diagrams showing the signal flow and the interactions of the related components help the reader to better understand the inter-component communication of commands and transactions.

D7.2.1 Report on Dissemination and Standardisation Activities Y1 (October 2014)

DOI: http://dx.doi.org/10.7800/304ICOSOLED721f

This document is a report on dissemination and standardisation activities carried out by the project partners during the first year of the project. In the first year, the main effort was dedicated to the set-up of the project web-site at http://icosole.eu/ and an accompanying account on Twitter. Two public deliverables were delivered during Year 1. Seven papers and articles were released in international publications. Project partners gave presentations at four international symposiums or conferences. Deliverables and presentations (if possible) are available from the project web-site. Furthermore, the plans for dissemination and standardisation for the second year of the project are described.

D6.1 Initial Demonstrators (October 2014)

DOI: http://dx.doi.org/10.7800/304ICOSOLED61

This document describes the demonstrators of ICoSOLE developed in the first year of the project. This set of demonstrators serves to verify whether the chosen technologies and methods are a viable path to the final ICoSOLE system. At this point in time the developed tools are standalone components; no integration between them has been performed (nor was integration planned for this initial project phase). The structure of this document follows the workflow in the future ICoSOLE system, starting from content capture via processing to playout and rendering. Eleven demonstrators have been developed. Each of the remaining chapters is dedicated to one component and its role within the ICoSOLE system. Here is the list of demonstrators:

Venue Explorer
The Wall of Moments
Audio Set Tools
Quality Analysis
Visual Localisation and Alignment
Studying the temporal redundancy of local descriptors
Playout Server – Mixing & Monitoring Live Video with Graphic Overlays
Content Pre-Caching to Accelerate Media Stream Switching in MPEG-DASH-based Distribution Environments
Bitdash – Live & Catchup HD Streaming
Object-Based Binaural Rendering in a Web Browser
3D Audio Rendering of ICoSOLE Recordings

Based on these demonstrators the development of the ICoSOLE system will continue.

D2.2 Use Cases & Requirements (July 2014)

DOI: http://dx.doi.org/10.7800/304ICOSOLED22

Projects as large and as complex as ICoSOLE have to be rationalised and broken down into realisable units. The team started with potential scenarios for the project, the primary one being a music festival and the secondary one being a sports event. From these scenarios, a set of use cases was collected that covers every aspect of providing an immersive and interactive audio-visual experience of these large events to the consumer. The use cases were grouped into five categories: production preparation, UGC request & response platform, production applications, playback on TV and playback on mobile devices. This document gives full descriptions of 23 different use cases and 42 different service requirements, and of how they all interrelate. It will therefore provide a reference point for the initial technical work of the project. As the project evolves, some of these use cases and requirements are expected to change, and new ones may be needed. Nevertheless, they give a solid starting point from which the whole system can be developed.

D2.1 Usage Scenarios (December 2013)

DOI: http://dx.doi.org/10.7800/304ICOSOLED21

This document describes two extensive scenarios that demonstrate the production of spatially outspread live events as envisioned within the ICoSOLE project. The first scenario deals with a city festival (such as Gentse Feesten, Glastonbury, …). The second scenario shows the ICoSOLE vision within a cyclocross race. Both scenarios consist of three parts. The first part is the preparation of the production, where all kinds of technical, administrative and descriptive metadata are created in order to prepare an efficient production. The integration of user generated content is taken into account from the preparation onwards, with the help of a user generated content request and response platform. The second part is the production itself, where live editing on many different heterogeneous content streams is the main challenge. It also takes into account incoming content from the user generated content request and response platform. The output is one or more broadcast streams. Moreover, a postproduction stage allows more content to be edited. The third and final part of the scenarios consists of the consumption of the created content in an immersive way, both live and on-demand, and on both TV and mobile devices. Furthermore, the scenarios have been analysed, and a set of system components (user interfaces) and use cases have been identified. They are listed in sections 5 and 6 of this document respectively. Finally, for each of the general use cases, a list of potentially contributing ICoSOLE partners has been created. This list can be found in section 7.