
    Artificial Intelligence for Multimedia Signal Processing

    Artificial intelligence technologies are being actively applied to broadcasting and multimedia processing. A great deal of research has been conducted across fields such as content creation, transmission, and security, and over the past two to three years much of this work has aimed at improving the compression efficiency of image, video, speech, and other data in areas related to MPEG media processing technology. Technologies for media creation, processing, editing, and scenario generation are likewise important areas of research in multimedia processing and engineering. This book collects topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing, including the computer vision field, speech/sound/text processing, and content analysis/information mining.

    KG-CNN: Augmenting Convolutional Neural Networks with Knowledge Graphs for Multi-Class Image Classification

    Computer vision is slowly becoming more and more prevalent in daily life. Tesla, for example, has announced plans to scale up the manufacturing of its Robotaxis by 2024; this increase in self-driving vehicles is just one sign that the importance of computer vision is growing year by year. Vision can be easy to take for granted, as most humans grow up using vision as their primary way of absorbing environmental information. The way humans process and classify visual information differs significantly from how current computer vision systems process and organise it. The human brain can use its past knowledge and experience to draw conclusions that would be impossible for a limited artificial system: it has a lifetime of visually informed learning to fall back on when classifying an object, the function of an object, and the relationships between particular objects, and humans can recall this information whenever it is needed once it has been learned. Current convolutional neural networks (CNNs) focus on processing an image at the pixel level, so they do not have this historical and relational information about objects to fall back on, at least not explicitly. A CNN may connect certain features or objects during the classification process, but this happens inside a black box, and the end user has no way of determining these links.
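
    As a rough illustration of the kind of augmentation such a system implies, the sketch below (plain PyTorch, not the KG-CNN implementation) concatenates pooled CNN image features with a learned embedding of knowledge-graph entities assumed to have been linked to the image by an upstream step; the backbone choice, the entity-embedding table, and names such as related_entity_ids are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class KnowledgeAugmentedCNN(nn.Module):
        def __init__(self, num_classes, num_entities, entity_dim=64):
            super().__init__()
            backbone = models.resnet18(weights=None)
            feat_dim = backbone.fc.in_features      # 512 for ResNet-18
            backbone.fc = nn.Identity()             # keep pooled visual features only
            self.backbone = backbone
            # Embedding table standing in for knowledge-graph entity vectors.
            self.entity_emb = nn.Embedding(num_entities, entity_dim)
            self.classifier = nn.Linear(feat_dim + entity_dim, num_classes)

        def forward(self, images, related_entity_ids):
            # images: [B, 3, H, W]; related_entity_ids: [B, K] indices of linked entities
            visual = self.backbone(images)                        # [B, 512]
            graph = self.entity_emb(related_entity_ids).mean(1)   # pooled KG context
            return self.classifier(torch.cat([visual, graph], dim=1))

    model = KnowledgeAugmentedCNN(num_classes=10, num_entities=1000)
    logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 5)))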

    A novel EEG based linguistic BCI

    As long as a human being can think coherently, physical limitations, no matter how severe, should never become disabling. Thinking and cognition are performed and expressed through language, which is the most natural form of human communication. The use of covert speech tasks for BCIs has been successfully achieved for both invasive and non-invasive systems. In this work, by incorporating the most recent findings on the spatial, temporal, and spectral signatures of word production, a novel system is designed that is custom-built for linguistic tasks. Other than paying attention and waiting for the onset cue, this BCI requires no cognitive effort from the user: it operates on the automatic linguistic functions of the brain in the first 312 ms post onset, which are completely outside the user's control and immune to inconsistencies. With four classes, this online BCI achieves a classification accuracy of 82.5%. Each word produces a signature as unique as its phonetic structure, and the number of covert speech tasks used in this work is limited by computational power. We demonstrated that this BCI can successfully use wireless dry-electrode EEG systems, which are becoming as capable as traditional laboratory-grade systems. This frees the potential user from the confounds of the lab, facilitating real-world application. Considering that the number of words used in daily life does not exceed 2000, the vocabulary of this type of novel BCI may indeed reach that number in the future, with no need to change the current system design or experimental protocol. As a promising step towards non-invasive synthetic telepathy, this system has the potential not only to help those in desperate need but also to completely change the way we communicate with our computers, as covert speech is much easier than any form of manual communication and control.
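
    For a concrete feel of the processing such a BCI implies, the minimal sketch below (NumPy and scikit-learn, not the thesis pipeline) cuts each trial down to the first 312 ms after onset and runs a four-class covert-word classifier; the sampling rate, channel count, placeholder data, and the linear discriminant classifier are all assumptions made only for the example.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    fs = 500                          # assumed sampling rate (Hz)
    n_trials, n_channels = 200, 32    # assumed recording setup
    window = int(0.312 * fs)          # samples in the first 312 ms post onset

    # Placeholder data: trials x channels x samples, with 4 covert-word classes.
    rng = np.random.default_rng(0)
    epochs = rng.standard_normal((n_trials, n_channels, window))
    labels = rng.integers(0, 4, size=n_trials)

    X = epochs.reshape(n_trials, -1)          # flatten spatio-temporal features
    clf = LinearDiscriminantAnalysis()
    print(cross_val_score(clf, X, labels, cv=5).mean())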

    Low Resource Efficient Speech Retrieval

    Speech retrieval refers to the task of retrieving, from a speech collection, information that is useful or relevant to a user query. This thesis examines ways in which speech retrieval can be improved so as to require low resources - without the extensively annotated corpora on which automated processing systems are typically built - while achieving high computational efficiency. The work focuses on two speech retrieval technologies: spoken keyword retrieval and spoken document classification. Firstly, keyword retrieval - also referred to as keyword search (KWS) or spoken term detection - is defined as the task of retrieving, from speech collections, the occurrences of a keyword specified by the user in text form. We make advances in an open-vocabulary KWS platform using a context-dependent Point Process Model (PPM). We further develop a PPM-based lattice generation framework, which improves KWS performance and enables automatic speech recognition (ASR) decoding. Secondly, the massive volumes of speech data motivate the effort to organize and search speech collections through spoken document classification. When classifying real-world unstructured speech into predefined classes, recordings collected in the wild can be extremely long, of varying length, and contain multiple class-label shifts at variable locations in the audio. For this reason, each spoken document is often first split into sequential segments, and each segment is then classified independently. We present a general-purpose method for classifying spoken segments, using a cascade of language-independent acoustic modeling, foreign-language-to-English translation lexicons, and English-language classification. Next, instead of classifying each segment independently, we demonstrate that exploiting the contextual dependencies across sequential segments can provide large classification performance improvements. Lastly, we remove the need for any orthographic lexicon and instead exploit alternative unsupervised approaches to decoding speech in terms of automatically discovered word-like or phoneme-like units. We show that spoken segment representations based on such lexical or phonetic discovery can achieve classification performance competitive with those based on a domain-mismatched ASR or a universal phone set ASR.
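
    One simple way to picture the benefit of cross-segment context is to smooth per-segment class posteriors over neighbouring segments before taking the final decision; the moving-average stand-in below is only an illustration of the idea, not the context models developed in the thesis.

    import numpy as np

    def smooth_segment_posteriors(posteriors, context=1):
        """posteriors: [n_segments, n_classes] from an independent per-segment classifier."""
        n = posteriors.shape[0]
        smoothed = np.empty_like(posteriors)
        for i in range(n):
            lo, hi = max(0, i - context), min(n, i + context + 1)
            smoothed[i] = posteriors[lo:hi].mean(axis=0)  # average over the local window
        return smoothed

    # Toy example: 6 sequential segments, 3 topic classes.
    p = np.array([[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.3, 0.6, 0.1]])
    print(smooth_segment_posteriors(p).argmax(axis=1))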

    Semantic Segmentation of Ambiguous Images

    Medical images can be hard to interpret. Not only because recognizing structures and possible changes requires experience and years of training, but also because the depicted measurements are often inherently ambiguous. Fundamentally, this is a consequence of the fact that medical imaging modalities such as MRI or CT provide only indirect measurements of the underlying molecular identities. The semantic meaning of an image can therefore generally only be determined given a larger image context, which, however, often does not suffice to arrive at an unambiguous interpretation in the form of a single hypothesis. Similar scenarios exist in natural images, in which the contextual information needed to resolve ambiguities can be limited, for example due to occlusions or noise in the recording. Additionally, overlapping or vague class definitions can lead to ill-posed or diverse solution spaces. The presence of such ambiguities can also impair the training and performance of machine learning methods. Moreover, current models are largely incapable of providing complex, structured, and diverse predictions and are instead forced to restrict themselves to sub-optimal single solutions or indistinguishable mixtures. This can be particularly problematic when classification methods are scaled to pixel-wise predictions, as in semantic segmentation. Semantic segmentation is concerned with assigning a class category to every pixel in an image. This kind of detailed image understanding also plays an important role in the diagnosis and treatment of diseases such as cancer: tumours are frequently detected in MRI or CT images, and their precise localization and segmentation is of great importance for their assessment, the preparation of possible biopsies, and the planning of focal therapies. These clinical image-processing tasks, but also the visual perception of our environment in everyday tasks such as driving, are currently performed by humans. As part of the increasing integration of machine learning methods into our decision-making processes, it is important to model these tasks adequately. This includes uncertainty estimates for the model predictions, among them those uncertainties that can be attributed to the image ambiguities. This thesis proposes several ways of dealing with ambiguous image evidence. First, we examine the current clinical standard, which, in the case of prostate lesions, consists of subjectively rating MRI-visible lesions for their aggressiveness, a process that comes with high inter-rater variability. According to our studies, even simple machine learning methods and even simple quantitative MRI-based parameters can outperform an individual, subjective expert, suggesting a promising potential for quantifying the process. Furthermore, we put the currently most successful segmentation architecture to the test on a highly ambiguous dataset that was acquired and annotated during clinical routine. Our experiments show that the standard segmentation loss function can be sub-optimal in scenarios with strong annotation noise. As an alternative, we explore the possibility of learning a model of the loss function, with the goal of allowing the coexistence of plausible solutions during training. We observe increased performance when using this training method for otherwise unchanged neural network architectures and find further increased relative improvements in the limit of little data. A lack of data and annotations, high levels of image and annotation noise, and ambiguous image evidence are particularly common in medical imaging datasets. This part of the thesis therefore exposes some of the weaknesses that standard machine learning techniques can exhibit in light of these particularities. Current segmentation models, such as those employed above, are limited in that they can only produce a single prediction. This contrasts with the observation that a group of annotators, given ambiguous image data, typically produces a set of diverse but plausible annotations. To remedy the aforementioned model limitation and to enable an appropriately probabilistic treatment of the task, we develop two models that predict a distribution over plausible annotations rather than a single, deterministic annotation. The first of the two models combines an encoder-decoder model with variational inference and uses a global latent vector that encodes the space of possible annotations for a given image. We show that this model performs considerably better than the reference methods and exhibits well-calibrated uncertainties. The second model improves on this approach by using a more flexible, hierarchical formulation that allows the variability of the segmentations to be captured at different scales. This increases the granularity of the segmentation details the model can produce and allows independently varying image regions and scales to be modelled. Both of these novel generative segmentation models make it possible, where appropriate, to produce diverse and coherent image segmentations, in contrast to prior work, which is either deterministic, models uncertainty at the pixel level, or suffers from representing an inappropriately low diversity. In summary, this thesis is concerned with the application of machine learning to the interpretation of medical images: we show the possibility of improving on the clinical standard by quantitatively using image parameters that currently enter diagnoses only subjectively, we show the potential benefit of a new training procedure to mitigate the apparent vulnerability of the standard segmentation loss function to strong annotation noise, and we propose two new probabilistic segmentation models that can accurately learn the distribution over appropriate annotations. These contributions can be seen as steps towards a more quantitative, more principled, and more uncertainty-aware analysis of medical images, an important goal in view of the ongoing integration of learning-based systems into clinical workflows.
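
    The first of the two generative models described above can be pictured with the following PyTorch sketch: an encoder-decoder backbone, a prior net and a posterior net that parameterise Gaussians over a global latent vector, and an ELBO-style loss combining cross-entropy with a KL term. Layer sizes, the toy data, and the simple convolutional stacks are illustrative assumptions rather than the thesis architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalLatentSegNet(nn.Module):
        def __init__(self, in_ch=1, num_classes=2, latent_dim=6):
            super().__init__()
            # Segmentation backbone (stands in for a full U-Net).
            self.backbone = nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            )
            # Prior net: image -> Gaussian over the global latent vector.
            self.prior = nn.Sequential(
                nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2 * latent_dim),
            )
            # Posterior net: image + ground-truth segmentation -> Gaussian.
            self.posterior = nn.Sequential(
                nn.Conv2d(in_ch + num_classes, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2 * latent_dim),
            )
            # Combines backbone features with the broadcast latent sample.
            self.head = nn.Conv2d(32 + latent_dim, num_classes, 1)

        @staticmethod
        def _gaussian(params):
            mu, log_var = params.chunk(2, dim=1)
            return torch.distributions.Normal(mu, torch.exp(0.5 * log_var))

        def decode(self, feats, z):
            z_map = z[:, :, None, None].expand(-1, -1, feats.shape[2], feats.shape[3])
            return self.head(torch.cat([feats, z_map], dim=1))

        def forward(self, x, seg_onehot=None):
            feats = self.backbone(x)
            prior = self._gaussian(self.prior(x))
            if seg_onehot is None:               # inference: sample diverse hypotheses
                return self.decode(feats, prior.rsample())
            posterior = self._gaussian(self.posterior(torch.cat([x, seg_onehot], 1)))
            logits = self.decode(feats, posterior.rsample())
            kl = torch.distributions.kl_divergence(posterior, prior).sum(dim=1).mean()
            return logits, kl

    # Toy training step: cross-entropy reconstruction plus a weighted KL term.
    model = GlobalLatentSegNet()
    x = torch.randn(4, 1, 64, 64)
    y = torch.randint(0, 2, (4, 64, 64))
    logits, kl = model(x, F.one_hot(y, 2).permute(0, 3, 1, 2).float())
    loss = F.cross_entropy(logits, y) + 1e-2 * kl
    loss.backward()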

    Articulatory representations to address acoustic variability in speech

    The past decade has seen phenomenal improvement in the performance of Automatic Speech Recognition (ASR) systems. In spite of this vast improvement, the state-of-the-art still lags significantly behind human speech recognition. Even though certain systems claim super-human performance, this performance is often sub-par across domains and across datasets. This gap is predominantly due to the lack of robustness against speech variability. Even clean speech is extremely variable due to a large number of factors such as voice characteristics, speaking style, speaking rate, accents, casualness, emotions, and more. The goal of this thesis is to investigate the variability of speech from the perspective of speech production, put forth robust articulatory features to address this variability, and incorporate these features in state-of-the-art ASR systems in the best way possible. ASR systems model speech as a sequence of distinctive phone units like beads on a string. Although phonemes are distinctive units in the cognitive domain, their physical realizations are extremely varied due to coarticulation and lenition, which are commonly observed in conversational speech. Traditional approaches deal with this issue by performing di-, tri-, or quin-phone based acoustic modeling, but these are insufficient to model longer contextual dependencies. Articulatory phonology analyzes speech as a constellation of coordinated articulatory gestures performed by the articulators in the vocal tract (lips, tongue tip, tongue body, jaw, glottis, and velum). In this framework, acoustic variability is explained by the temporal overlap of gestures and their reduction in space. In order to analyze speech in terms of articulatory gestures, the gestures need to be estimated from the speech signal. The first part of the thesis focuses on a speaker-independent acoustic-to-articulatory inversion system that was developed to estimate vocal tract constriction variables (TVs) from speech. The mapping from acoustics to TVs was learned from the multi-speaker X-ray Microbeam (XRMB) articulatory dataset. Constriction regions from TV trajectories were defined as articulatory gestures using articulatory kinematics. The speech inversion system, combined with the TV kinematics based gesture annotation, provided a system to estimate articulatory gestures from speech. The second part of this thesis deals with the analysis of the articulatory trajectories under different types of variability such as multiple speakers, speaking rate, and accents. It was observed that speaker variation degraded the performance of the speech inversion system. A Vocal Tract Length Normalization (VTLN) based speaker normalization technique was therefore developed to address the speaker variability in the acoustic and articulatory domains. The performance of speech inversion systems was analyzed on an articulatory dataset containing speaking rate variations to assess whether the model was able to reliably predict the TVs in challenging coarticulatory scenarios. The performance of the speech inversion system was also analyzed in cross-accent and cross-language scenarios through experiments on a Dutch and British English articulatory dataset. These experiments provide a quantitative measure of the robustness of the speech inversion systems to different types of speech variability. The final part of this thesis deals with the incorporation of articulatory features in state-of-the-art medium-vocabulary ASR systems.
    A hybrid convolutional neural network (CNN) architecture was developed to fuse the acoustic and articulatory feature streams in an ASR system. ASR experiments were performed on the Wall Street Journal (WSJ) corpus, and several articulatory feature combinations were explored to determine the best combination. Cross-corpus evaluations were carried out to evaluate the WSJ-trained ASR system on TIMIT and on another dataset containing speaking rate variability. Results showed that combining articulatory features with acoustic features through the hybrid CNN improved the performance of the ASR system in both matched and mismatched evaluation conditions. The findings of this dissertation indicate that articulatory representations extracted from acoustics can be used to address the acoustic variability in speech caused by speakers, accents, and speaking rates, and can further be used to improve the performance of ASR systems.
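
    The feature-fusion idea can be sketched as a small two-branch network: one 1-D convolutional stack over the acoustic frames, one over the articulatory (tract variable) trajectories, with the two streams concatenated before per-frame classification. The PyTorch code below is an illustrative sketch under assumed feature dimensions, not the architecture evaluated in the dissertation.

    import torch
    import torch.nn as nn

    class HybridFusionCNN(nn.Module):
        def __init__(self, acoustic_dim=40, articulatory_dim=6, n_phones=48):
            super().__init__()
            # 1-D convolutions over time, one branch per feature stream.
            self.acoustic_branch = nn.Sequential(
                nn.Conv1d(acoustic_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            )
            self.articulatory_branch = nn.Sequential(
                nn.Conv1d(articulatory_dim, 32, kernel_size=5, padding=2), nn.ReLU(),
            )
            self.classifier = nn.Conv1d(64 + 32, n_phones, kernel_size=1)

        def forward(self, acoustic, articulatory):
            # Both inputs: [batch, feature_dim, time]; output: per-frame phone logits.
            fused = torch.cat([self.acoustic_branch(acoustic),
                               self.articulatory_branch(articulatory)], dim=1)
            return self.classifier(fused)

    model = HybridFusionCNN()
    logits = model(torch.randn(2, 40, 100), torch.randn(2, 6, 100))  # [2, 48, 100]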

    Mapping Acoustic and Semantic Dimensions of Auditory Perception

    Auditory categorisation is a function of sensory perception which allows humans to generalise across the many different sounds present in the environment and classify them into behaviourally relevant categories. These categories cover not only the variance of the acoustic properties of the signal but also a wide variety of sound sources. However, it is unclear to what extent the acoustic structure of sound is associated with, and conveys, different facets of semantic category information. Whether people use such information, and what drives their decisions when both acoustic and semantic information about a sound is available, also remains unknown. To answer these questions, we used existing methods broadly practised in linguistics, acoustics, and cognitive science, and bridged these domains by delineating their shared space. Firstly, we took a model-free exploratory approach to examine the underlying structure and inherent patterns in our dataset, running principal components, clustering, and multidimensional scaling analyses. At the same time, we mapped the topography of the sound labels’ semantic space based on corpus-based word embedding vectors. We then built an LDA model predicting class membership and compared the model-free approach and the model predictions with the actual taxonomy. Finally, by conducting a series of web-based behavioural experiments, we investigated whether the acoustic and semantic topographies relate to perceptual judgements. This analysis pipeline showed that natural sound categories can be successfully predicted from acoustic information alone and that the perception of natural sound categories has some acoustic grounding. Results from our studies help to clarify the role of physical sound characteristics and their meaning in the process of sound perception and give valuable insight into the mechanisms governing machine-based and human classifications.
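
    The analysis pipeline can be pictured with a compact scikit-learn sketch: a principal-components view of the acoustic feature space followed by a linear discriminant analysis model (one reading of "LDA" here) predicting category membership. The feature matrix below is a random placeholder standing in for acoustic descriptors of natural sounds, and the dimensions and category count are assumptions made only for the example.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 20))    # 300 sounds x 20 acoustic descriptors
    y = rng.integers(0, 5, size=300)      # 5 assumed sound categories

    # Model-free view: variance explained by the leading principal components.
    pca = PCA(n_components=5).fit(StandardScaler().fit_transform(X))
    print(pca.explained_variance_ratio_)

    # Supervised view: can category membership be predicted from acoustics alone?
    clf = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
    print(cross_val_score(clf, X, y, cv=5).mean())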

    Alzheimer’s Dementia Recognition Through Spontaneous Speech


    IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech

    IberSPEECH2020 is a two-day event bringing together the best researchers and practitioners in speech and language technologies for Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentations of projects, laboratory activities, recent PhD theses, discussion panels, a round table, and awards for the best theses and papers. The program of IberSPEECH2020 includes a total of 32 contributions, distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers of the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, the Czech Republic, Ukraine, and Slovenia). Furthermore, it has been confirmed that an extension of selected papers will be published as a special issue of the journal Applied Sciences, “IberSPEECH 2020: Speech and Language Technologies for Iberian Languages”, published by MDPI with full open access. In addition to the regular paper sessions, the IberSPEECH2020 scientific program features the ALBAYZIN evaluation challenge session. Red Española de Tecnologías del Habla; Universidad de Valladolid.

    Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

    On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.