91 research outputs found

    Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources

    In recent years, deep learning has fundamentally changed the landscape of a number of areas in artificial intelligence, including computer vision, natural language processing, robotics, and game theory. In particular, the striking success of deep learning in a wide variety of natural language processing (NLP) applications, including automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS), has resulted in major accuracy improvements, thus widening the applicability of these technologies in real-life settings. At this point, it is clear that ASR and MT technologies can be used to produce cost-effective, high-quality multilingual subtitles for video content of different kinds. This is particularly true for the transcription and translation of video lectures and other kinds of educational materials, in which the audio recording conditions are usually favorable for the ASR task and the speech is grammatically well formed. However, although state-of-the-art neural approaches to TTS have been shown to drastically improve the naturalness and quality of synthetic speech over conventional concatenative and parametric systems, it is still unclear whether this technology is already mature enough to improve accessibility and engagement in online learning, particularly in the context of higher education. Furthermore, advanced topics in TTS such as cross-lingual voice cloning, incremental TTS, and zero-shot speaker adaptation remain open challenges in the field. This thesis is about enhancing the performance and widening the applicability of modern neural TTS technologies in real-life settings, both in offline and streaming conditions, in the context of improving accessibility and engagement in online learning. Thus, particular emphasis is placed on speaker adaptation and cross-lingual voice cloning, as the input text corresponds to a translated utterance in this context.
    Pérez González De Martos, AM. (2022). Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/184019
    TESIS; Premios Extraordinarios de tesis doctorale

    Neural recommender models for sparse and skewed behavioral data

    Modern online platforms offer recommendations and personalized search and services to a large and diverse user base while still aiming to acquaint users with the broader community on the platform. Prior work backed by large volumes of user data has shown that user retention relies on catering to users' specific, eccentric tastes, in addition to providing them with popular services or content on the platform. Long-tailed distributions are a fundamental characteristic of human activity, owing to the bursty nature of human attention. As a result, we often observe skew in data facets that involve human interaction. While there are superficial similarities to Zipf's law in textual data and other domains, the challenges with user data extend further. Individual words may have skewed frequencies in the corpus, but the long-tail words by themselves do not significantly impact downstream text-mining tasks. In contrast, while sparse users (a majority on most online platforms) contribute little to the training data, they are equally crucial at inference time, perhaps more so since they are likely to churn. In this thesis, we study platforms and applications that elicit user participation in rich social settings incorporating user-generated content, user-user interaction, and other modalities of user participation and data generation. For instance, users on the Yelp review platform participate in a follower-followee network and also create and interact with review text (two modalities of user data). Similarly, community question-answer (CQA) platforms incorporate user interaction and collaboratively authored content over diverse domains and discussion threads. Since user participation is multimodal, we develop generalizable abstractions beyond any single data modality. 
Specifically, we aim to address the distributional mismatch that occurs with user data independent of dataset specifics: while a minority of the users generates most training samples, it is insufficient to learn only the preferences of this subset of users. As a result, the data's overall skew and individual users' sparsity are closely interlinked: sparse users with uncommon preferences are under-represented. Thus, we propose to treat these problems jointly with a skew-aware grouping mechanism that iteratively sharpens the identification of preference groups within the user population. As a result, we improve user characterization, content recommendation, and activity prediction (+6-22% AUC, +6-43% AUC, +12-25% RMSE over state-of-the-art baselines), primarily for users with sparse activity. The size of the item or content inventories compounds the skew problem. Recommendation models can achieve very high aggregate performance while recommending only a tiny proportion of the inventory (as little as 5%) to users. We propose a data-driven solution guided by the aggregate co-occurrence information across items in the dataset. We specifically note that different co-occurrences are not equally significant; for example, some co-occurring items are easily substituted while others are not. We develop a self-supervised learning framework where the aggregate co-occurrences guide the recommendation problem while providing room to learn these variations among the item associations. As a result, we improve coverage to ~100% (up from 5%) of the inventory and increase long-tail item recall by up to 25%. We also note that the skew and sparsity problems repeat across data modalities. For instance, social interactions and review content both exhibit aggregate skew, although individual users who actively generate reviews may not participate socially and vice versa. 
In such cases, it is necessary to differentially weight and merge the data sources for each user when performing inference tasks. We show that the problem is inherently adversarial since the user participation modalities compete to describe a user accurately. We develop a framework to unify these representations while algorithmically tackling mode collapse, a well-known pitfall of adversarial models. A more challenging but important instantiation of sparsity is the few-shot or cross-domain setting. We may have only a single interaction or a few interactions for users or items in the sparse domains or partitions. We show that contextualizing user-item interactions helps us infer behavioral invariants in the dense domain, allowing us to correlate sparse participants to their active counterparts (resulting in 3x faster training and ~19% recall gains in multi-domain settings). Finally, we consider the multi-task setting, where the platform incorporates multiple distinct recommendation and prediction tasks for each user. A single user representation is insufficient for users who exhibit different preferences along each dimension. At the same time, it is counter-productive to handle correlated prediction or inference tasks in isolation. We develop a multi-faceted representation approach grounded in residual learning with heterogeneous knowledge graph representations, which provides us with an expressive data representation for specialized domains and applications with multimodal user data. We achieve knowledge sharing by unifying task-independent and task-specific representations of each entity within a unified knowledge graph framework. In each chapter, we also discuss and demonstrate how the proposed frameworks directly incorporate a wide range of gradient-optimizable recommendation and behavior models, maximizing their applicability and pertinence to user-centered inference tasks and platforms.

    Automatic understanding of multimodal content for Web-based learning

    Web-based learning has become an integral part of everyday life for people of all ages and backgrounds. On the one hand, the advantages of this type of learning, such as availability, accessibility, flexibility, and cost, are apparent. On the other hand, the oversupply of content can leave learners struggling to find optimal resources efficiently. The interdisciplinary research field Search as Learning (SAL) is concerned with the analysis and improvement of Web-based learning processes, on both the learner side and the computer science side. So far, automatic approaches that assess and recommend learning resources in SAL focus on textual, resource, and behavioral features. However, these approaches commonly ignore multimodal aspects. This work addresses this research gap by proposing several approaches to the question of how multimodal retrieval methods can help support learning on the Web. First, we evaluate whether textual metadata of the TIB AV-Portal can be exploited and enriched with semantic word embeddings to generate video recommendations and, in addition, to drive a video summarization technique that improves exploratory search. Then we turn to the challenging task of knowledge gain prediction, which estimates the potential learning success given a specific learning resource. We used data from two user studies for our approaches. The first one observes the knowledge gain when learning with videos in a Massive Open Online Course (MOOC) setting, while the second one provides an informal Web-based learning setting where the subjects have unrestricted access to the Internet. We then extend the purely textual features to include visual, audio, and cross-modal features for a holistic representation of learning resources. By correlating these features with the achieved knowledge gain, we can estimate the impact of a particular learning resource on learning success. 
We further investigate the influence of multimodal data on the learning process by examining how the combination of visual and textual content generally conveys information. For this purpose, we draw on work from linguistics and visual communications, which have investigated the relationship between image and text by means of different metrics and categorizations for several decades. We concretize these metrics to make them usable for machine learning purposes. This process includes the derivation of semantic image-text classes from these metrics. We evaluate all proposals with comprehensive experiments and discuss their impacts and limitations at the end of the thesis.

    Emotion-aware cross-modal domain adaptation in video sequences


    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, joint spatio-temporal modelling is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernov

    Principles and Applications of Data Science

    Data science is an emerging multidisciplinary field that lies at the intersection of computer science, statistics, and mathematics, with diverse applications and close ties to data mining, deep learning, and big data. This Special Issue on “Principles and Applications of Data Science” focuses on the latest developments in the theories, techniques, and applications of data science. The topics include data cleansing, data mining, machine learning, deep learning, and applications in medicine and healthcare, as well as social media.



    Visualization as Assemblage: How Modesty, Ethics, and Attachment Inform a Critical Design Practice

    Visualization is a form of design practice that deploys representational processes of enormous rhetorical and analytical power. What is often left out of the picture is the network of processes that it assembles and the non-visual effects it produces. This study asks how visualization can operate as a critical design practice that attends to the representational and performative processes it arranges. In order to contextualize this form of arrangement in design, the study undertakes a review of Bruno Latour's interpretation of design as a form of modest restyling and arrangement. It also addresses this question through a productive alignment between Latour's development of actor-network theory and Deleuze and Guattari's assemblage theory, which makes it possible to describe both how things and processes mobilize knowledge and how human subjectivity emerges from human-nonhuman entanglements, respectively. The assemblage framework is applied to three case studies that offer distinct instances of critical visualization practices, each emphasizing a specific aspect. Liquid Traces (2014–present), from Forensic Architecture (a research project based at Goldsmiths, University of London), is a project that condemns NATO forces for criminal negligence that led to the deaths of 63 refugees fleeing Libya by boat in 2011, and also reveals the ways a surface may assemble components and highlight its own form of construction. Anti-Eviction Mapping Project (2013–present), from the San Francisco Tenants Union, advocates for housing justice by mobilizing maps, events, and site-specific installations, and illustrates how visualization is a process that exists beyond any one artifact. In The Air, Tonight (2013–present), from the Public Visualization Studio, is my own research-creation project that highlights the connection between housing and climate through an annual visualization event, and shows how design can operate through iteration, reworking, and connection to allied processes. 
What emerges from this study is an ethics of visualization that refocuses criticality on the potential of design to act modestly (Latour), to reveal its own construction, and to maintain the quality of attachments made.