10 research outputs found

    Spoken content metadata and MPEG-7

    The words spoken in an audio stream form an obvious descriptor essential to most audio-visual metadata standards. When derived using automatic speech recognition systems, the spoken content fits into neither the low-level (representative) nor the high-level (semantic) metadata category. This makes it difficult to create a representation that both supports interoperability between different extraction and application utilities and remains robust to the limitations of the extraction process. In this paper, we discuss the issues encountered in the design of the MPEG-7 spoken content descriptor and their applicability to other metadata standards.
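    As an illustration of why ASR-derived spoken content resists a single flat transcript, the sketch below keeps it as a lattice of competing word hypotheses with recogniser confidences, loosely in the spirit of the MPEG-7 SpokenContent description; the class and field names are illustrative, not the normative schema.

        # Illustrative only: a lattice of alternative recognition hypotheses,
        # loosely in the spirit of MPEG-7 SpokenContent (not the real schema).
        from dataclasses import dataclass, field

        @dataclass
        class Link:
            """One hypothesised word or phone between two lattice nodes."""
            start: int          # source node index
            end: int            # target node index
            label: str          # word or phone label
            probability: float  # recogniser confidence in this hypothesis

        @dataclass
        class Lattice:
            """Competing recognition hypotheses for one audio segment."""
            node_times: list[float] = field(default_factory=list)  # node -> seconds
            links: list[Link] = field(default_factory=list)

        # Two competing word hypotheses over the same stretch of audio:
        lat = Lattice(node_times=[0.0, 0.42])
        lat.links.append(Link(0, 1, "berlin", 0.7))
        lat.links.append(Link(0, 1, "burling", 0.3))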

    VAMP: semantic validation for MPEG-7 profile descriptions

    MPEG-7 can be used to create complex and comprehensive metadata descriptions of multimedia content. Since MPEG-7 is defined in terms of an XML schema, the semantics of its elements is not formally grounded. In addition, certain features can be described in multiple ways. MPEG-7 profiles are subsets of the standard that apply to specific application areas and that aim to reduce this syntactic variability, but they still lack formal semantics. We propose an approach for expressing the semantics explicitly.
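    A toy sketch of the underlying problem (not the paper's actual formalism): two syntactically different ways of stating the same fact are normalised into one canonical fact set, which a hypothetical profile rule can then check unambiguously. All element and rule names here are invented for illustration.

        # Toy illustration: syntactic variability vs. explicit semantics.
        def to_facts(description: dict) -> set[tuple]:
            """Normalise alternative spellings of 'this segment has a creator'."""
            facts = set()
            # Variant A: creator nested under a CreationInformation element.
            creator = description.get("CreationInformation", {}).get("Creator")
            # Variant B: creator attached directly to the segment.
            creator = creator or description.get("Creator")
            if creator:
                facts.add(("segment", "hasCreator", creator))
            return facts

        def satisfies_profile(facts: set[tuple]) -> bool:
            """Hypothetical profile rule: every description names a creator."""
            return any(p == "hasCreator" for _, p, _ in facts)

        variant_a = {"CreationInformation": {"Creator": "WDR"}}
        variant_b = {"Creator": "WDR"}
        assert to_facts(variant_a) == to_facts(variant_b)   # same meaning
        assert satisfies_profile(to_facts(variant_a))       # rule holds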

    Speaker segmentation and clustering

    This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering.
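    As a concrete example of one metric-based criterion such surveys cover, the sketch below implements Bayesian Information Criterion (BIC) change detection: a window of feature vectors (e.g. MFCCs) is modelled as one Gaussian versus two Gaussians split at a candidate frame, and a positive delta-BIC suggests a speaker change. The synthetic data and the penalty weight are illustrative.

        # BIC-based speaker change detection (a standard metric-based method).
        import numpy as np

        def delta_bic(X: np.ndarray, t: int, lam: float = 1.0) -> float:
            """Delta-BIC for splitting the N x d feature matrix X at frame t."""
            N, d = X.shape

            def logdet_cov(Z: np.ndarray) -> float:
                cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(d)  # regularised
                return np.linalg.slogdet(cov)[1]

            penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
            return (0.5 * N * logdet_cov(X)
                    - 0.5 * t * logdet_cov(X[:t])
                    - 0.5 * (N - t) * logdet_cov(X[t:])
                    - lam * penalty)

        # Synthetic demo: two "speakers" with different feature statistics.
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, (200, 13)), rng.normal(3, 2, (200, 13))])
        scores = [delta_bic(X, t) for t in range(50, 350)]
        print("estimated change at frame", 50 + int(np.argmax(scores)))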

    Accessing the spoken word

    Spoken-word audio collections cover many domains, including radio and television broadcasts, oral narratives, governmental proceedings, lectures, and telephone conversations. The collection, access, and preservation of such data are stimulated by political, economic, cultural, and educational needs. This paper outlines the major issues in the field, reviews the current state of technology, examines the rapidly changing policy issues relating to privacy and copyright, and presents issues relating to the collection and preservation of spoken audio content.

    AXMEDIS 2007 Conference Proceedings

    The AXMEDIS International Conference series, established in 2005, focuses on research, development, and applications in the cross-media domain, exploring innovative technologies to meet the challenges of the sector. AXMEDIS 2007 deals with all subjects and topics related to cross-media and digital-media content production, processing, management, standards, representation, sharing, interoperability, protection, and rights management. It addresses the latest developments and future trends of the technologies and their applications, their impact, and their exploitation within academic, business, and industrial communities.

    Searching Spontaneous Conversational Speech: Proceedings of the ACM SIGIR Workshop (SSCS 2008)

    Bayesian Approaches to Uncertainty in Speech Processing

    Concurrency in auditory displays for connected television

    Many television experiences depend on users being both willing and able to visually attend to screen-based information. Auditory displays offer an alternative method for presenting this information and could benefit all users. This thesis explores how this may be achieved through the design and evaluation of auditory displays involving varying degrees of concurrency for two television use cases: menu navigation and presenting related content alongside a television show. The first study, on the navigation of auditory menus, looked at onset asynchrony and word length in the presentation of spoken menus, and considered their effects on task duration, accuracy, and workload. Onset asynchrony and word length both had significant effects on task duration and accuracy, while workload was affected only by onset asynchrony. An optimum asynchrony was identified, which was the same for both long and short words, but performance was better with the shorter words, which no longer overlapped. The second experiment investigated how disruption, workload, and preference are affected when additional content is presented alongside a television programme. The content took the form of sound from different spatial locations or text on a smartphone, and the programme's soundtrack was either modified or left unaltered. Leaving the soundtrack unaltered or muting it negatively impacted user experience. Removing the speech from the television programme and presenting the secondary content as sound from a smartphone was the best auditory approach; it compared well with the textual presentation, causing less visual disruption and imposing a similar workload. Additionally, the thesis reviews the state of the art in television experiences and auditory displays, introduces the human auditory system, and highlights important factors in the concurrent presentation of speech. Conclusions about the utility of concurrency within auditory displays for television are drawn and areas for further work are identified.

    Phone-Based Spoken Document Retrieval in Conformance with the MPEG-7 Standard

    This paper presents a phone-based approach to spoken document retrieval, developed in the framework of the emerging MPEG-7 standard. The audio part of MPEG-7 includes a SpokenContent tool that provides a standardized description of the content of spoken documents. In the context of MPEG-7, we propose an indexing and retrieval method that uses phonetic information only, together with a vector space IR model. Experiments are conducted on a database of German spoken documents, with 10 city-name queries. Two phone-based retrieval approaches are presented and combined: the first combines phone N-grams of different lengths as indexing terms; the second expands the document representation by means of phone confusion probabilities.
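    A minimal sketch of the first approach described above, under the assumption that documents and queries are phone strings: phone N-grams of several lengths serve as indexing terms and documents are ranked with a TF-IDF vector space model. The phone transcriptions and the misrecognised query are invented for illustration, and the confusion-probability expansion step is omitted.

        # Phone N-gram indexing with a TF-IDF vector space model (sketch).
        import math
        from collections import Counter

        def ngrams(phones: list[str], n_values=(1, 2, 3)) -> Counter:
            """Count phone N-grams of each requested length as indexing terms."""
            terms = Counter()
            for n in n_values:
                for i in range(len(phones) - n + 1):
                    terms[tuple(phones[i:i + n])] += 1
            return terms

        def cosine_tfidf(q: Counter, d: Counter, idf: dict) -> float:
            """Cosine similarity between TF-IDF weighted term vectors."""
            qv = {t: c * idf.get(t, 0.0) for t, c in q.items()}
            dv = {t: c * idf.get(t, 0.0) for t, c in d.items()}
            dot = sum(w * dv.get(t, 0.0) for t, w in qv.items())
            qnorm = math.sqrt(sum(w * w for w in qv.values()))
            dnorm = math.sqrt(sum(w * w for w in dv.values()))
            return dot / (qnorm * dnorm) if qnorm and dnorm else 0.0

        # Hypothetical phone transcriptions and a misrecognised query.
        docs = {"doc1": "b e r l i: n".split(), "doc2": "m y: n c h e n".split()}
        indexed = {name: ngrams(p) for name, p in docs.items()}
        idf = {t: 1.0 + math.log(len(docs) / sum(t in d for d in indexed.values()))
               for terms in indexed.values() for t in terms}
        query = ngrams("b e a l i: n".split())  # "Berlin" with one phone error
        ranking = sorted(indexed, key=lambda n: -cosine_tfidf(query, indexed[n], idf))
        print(ranking)  # expect doc1 ("Berlin") first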