242,173 research outputs found

From Audio Paper to Next Generation Paper

Author: Frohlich David
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/10/2017

It has been 24 years since the publication of Wellner’s (1993) digital desk, which demonstrated the augmentation of paper documents with projected information. Since then there have been many related developments in computing, including the World Wide Web, e-book readers, the maturation of the augmented reality paradigm, embedded and printed electronics, and the internet of things. In this talk I draw on some of my own design explorations of augmenting paper with sound over the years, to illustrate the value of ‘audiopaper’ but also the way these explorations were rooted in the applications and technology of the day. I show that two key technologies have been important to the implementation of audiopaper over the years, and that the bigger opportunity is in connecting paper to the web. This culminates in a vision for two future generations of paper which communicate and interact with the digital devices around the

Crossref

University of Surrey

Surrey Research Insight

Sparks of Large Audio Models: A Survey and Outlook

Authors: Cuayáhuitl Heriberto; Latif Siddique; Ren Yi; Schuller Björn W.; Shamshad Fahad; Shoukat Moazzam; Togneri Roberto; Usama Muhammad; Wang Wenwu; Zhang Xulong
Publication date: 03/09/2023

This survey paper provides a comprehensive overview of recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and wide range of sources, from human voices to musical instruments and environmental sounds, poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, spanning from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, Foundational Audio Models such as SeamlessM4T have recently started showing abilities to act as universal translators, supporting multiple speech tasks for up to 100 languages without any reliance on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies regarding Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions in the realm of Large Audio Models, with the intent to spark further discussion and thereby foster innovation in the next generation of audio-processing systems. Furthermore, to cope with the rapid development in this area, we will regularly update the accompanying repository with relevant recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models. Comment: work in progress, Repo URL: https://github.com/EmulationAI/awesome-large-audio-model

arXiv.org e-Print Archive

Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

Authors: Beerends John G; Chung Joon Son; Cornu Thomas Le; Lan Yuxuan; Lee Daehyun; Ngiam Jiquan; Pachoud Samuel; Summerfield Quentin; Thiede Thilo; Zimmermann Marina
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 12/08/2018

Speechreading, or lipreading, is the technique of understanding and extracting phonetic features from a speaker's visual features, such as the movement of the lips, face, teeth and tongue. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aids for people with hearing impairments. However, most of the work in speechreading has been limited to text generation from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaking user may be available, these feeds have not yet been used to deal with the different poses. To this end, this paper presents the world's first multi-view speech reading and reconstruction system. This work encompasses the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligent speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speech reading and reconstruction system. The paper further shows the optimal placement of cameras that would lead to the maximum intelligibility of speech. Next, it lays out various innovative applications for the proposed system, focusing on its potential impact not just in the security arena but in many other multimedia analytics problems. Comment: 2018 ACM Multimedia Conference (MM '18), October 22--26, 2018, Seoul, Republic of Korea

arXiv.org e-Print Archive

Crossref

Real-time systems development with SDL and next generation validation tools

Author: Sinnott R.O.
Publication date: 01/01/2001

The language SDL has long been applied in the development of various kinds of systems. Real-time systems are one application area where SDL has been applied extensively. Whilst SDL allows certain modelling aspects of real-time systems to be represented, the language and its associated tool support have certain drawbacks for modelling and reasoning about such systems. In this paper we highlight the limitations of SDL and its associated tool support in this domain, and present language extensions and next generation real-time system tool support to help overcome them. The applicability of the extensions and tools is demonstrated through a case study based upon a multimedia binding object used to support a configuration of time-dependent information producers and consumers realising the so-called lip-synchronisation algorithm.

CiteSeerX

Enlighten

University of Melbourne Institutional Repository

What did the Romans ever do for us? ‘Next generation’ networks and hybrid learning resources

Authors: Richardson Paul; Thomas Elaine; Walker Steve
Publication date: 01/04/2012

Networked learning is fundamentally concerned with the use of information and communication technologies (ICT) to link people to people and resources, to support the process of learning. This paper explores some current and forthcoming changes in ICT and some potential implications of these developments for networked learning. Whilst we aim to avoid taking a technologically determinist stance, we explore the potential for future practice and how some educational and pedagogic practices are evolving to exploit and shape the digital environment. We argue that we can change both the ways in which connections between people (learners and other learners; learners and tutors) are made and the nature of the resources that learning communities (particularly distributed communities) can engage with. In doing this we draw on two strands of work. Firstly, we draw on ‘IBZL Education’, a UK Open University initiative to develop new scholarship in the context of STEM (Science, Technology, Engineering and Mathematics), through which educators are encouraged to think about technological change in the next five to ten years and ways in which we can intervene and shape these developments. We use problem-based learning as an example of a learning experience that can be difficult to implement in a networked learning environment. IBZL identified two broad strands of significant technological development. 'Superfast' broadband networks that are capable of supporting novel applications are being rolled out in the UK (and elsewhere). Also, boundaries between the real and virtual worlds are becoming blurred, as in the ‘internet of things’ where, for example, RFID tags enable information about the real world to be brought into the virtual one. We use the term ‘artefact’ to describe designed components, whether entirely digital, such as a computer forum, or material, such as a tablet PC.
Networked ‘hybrid’ technologies of virtual and material components may have great potential for use in education. Secondly, we illustrate how these changes may be beginning to happen in distance education using the example of TU100 My Digital Life, a new introductory Open University module. TU100 students use an electronics board in their own homes to work on a programming problem in collaboration with other students through a tutor-led tutorial in a web conferencing system. We also note some of the evident complexity that establishing such resources as part of wider infrastructures of networked learning would be likely to involve.

Open Research Online (The Open University)