Analyzing Autoencoder-Based Acoustic Word Embeddings
Recent studies have introduced methods for learning acoustic word embeddings
(AWEs)---fixed-size vector representations of words which encode their acoustic
features. Despite the widespread use of AWEs in speech processing research,
they have only been evaluated quantitatively in their ability to discriminate
between whole word tokens. To better understand the applications of AWEs in
various downstream tasks and in cognitive modeling, we need to analyze the
representation spaces of AWEs. Here we analyze basic properties of AWE spaces
learned by a sequence-to-sequence encoder-decoder model in six typologically
diverse languages. We first show that these AWEs preserve some information
about words' absolute duration and speaker. At the same time, the
representation space of these AWEs is organized such that the distance between
words' embeddings increases with those words' phonetic dissimilarity. Finally,
the AWEs exhibit a word onset bias, similar to patterns reported in various
studies on human speech processing and lexical access. We argue this is a
promising result and encourage further evaluation of AWEs as a potentially
useful tool in cognitive science, which could provide a link between speech
processing and lexical memory.
Comment: 6 pages, 7 figures; accepted to the BAICS workshop (ICLR 2020).
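As a concrete, purely illustrative sketch of the kind of sequence-to-sequence autoencoder described above, the model below compresses a variable-length sequence of acoustic frames into a fixed-size embedding and reconstructs the frames from it. The layer sizes, feature dimensionality, and names are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AWEAutoencoder(nn.Module):
    """Toy seq2seq autoencoder: variable-length acoustic frames (e.g. MFCCs)
    are compressed into a fixed-size acoustic word embedding (AWE), from
    which the decoder tries to reconstruct the original frame sequence."""
    def __init__(self, n_feats=13, hidden=256, emb_dim=128):
        super().__init__()
        self.encoder = nn.GRU(n_feats, hidden, batch_first=True)
        self.to_emb = nn.Linear(hidden, emb_dim)       # fixed-size AWE
        self.from_emb = nn.Linear(emb_dim, hidden)
        self.decoder = nn.GRU(n_feats, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_feats)

    def embed(self, frames):                           # frames: (B, T, n_feats)
        _, h = self.encoder(frames)                    # h: (1, B, hidden)
        return self.to_emb(h[-1])                      # (B, emb_dim)

    def forward(self, frames):
        emb = self.embed(frames)
        h0 = self.from_emb(emb).unsqueeze(0)           # init decoder from the AWE
        # teacher forcing: feed the input frames back in, shifted by one step
        shifted = torch.cat([torch.zeros_like(frames[:, :1]), frames[:, :-1]], dim=1)
        dec_out, _ = self.decoder(shifted, h0)
        return self.out(dec_out), emb

model = AWEAutoencoder()
frames = torch.randn(4, 50, 13)                       # a batch of 4 fake word tokens
recon, awes = model(frames)
loss = nn.functional.mse_loss(recon, frames)          # reconstruction objective
```

The analyses described in the abstract would then operate on the `awes` vectors, for example comparing cosine distances between embeddings with the phonetic dissimilarity of the corresponding word forms.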
Personalized Video Recommendation Using Rich Contents from Videos
Video recommendation has become an essential way of helping people explore
the massive volume of available videos and discover the ones that may be of
interest to them. Existing video recommender systems make recommendations
based on user-video interactions and a single specific content feature. When
that content feature is unavailable, their performance deteriorates severely.
Inspired by the fact that rich contents
(e.g., text, audio, motion, and so on) exist in videos, in this paper, we
explore how to use these rich contents to overcome the limitations caused by
the unavailability of the specific ones. Specifically, we propose a novel
general framework, the collaborative embedding regression (CER) model, that
incorporates an arbitrary single content feature with user-video interactions
to make effective video recommendations in both in-matrix and
out-of-matrix scenarios. Our extensive experiments on two real-world
large-scale datasets show that CER beats the existing recommender models with
any single content feature and is more time-efficient. In addition, we propose
a priority-based late fusion (PRI) method to gain the benefit brought by
integrating multiple content features. The corresponding experiment shows that
PRI brings a real performance improvement over the baseline and outperforms
the existing fusion methods.
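To make the idea behind CER concrete (the abstract does not give the exact objective, so this is a hedged toy formulation), the sketch below factorizes the user-video interaction matrix while regressing the video embeddings onto a single content feature; the regression then supplies embeddings for out-of-matrix (cold-start) videos. All dimensions, weights, and names are made-up toy values.

```python
import torch

# Toy sizes: 100 users, 80 videos, a 32-dim content feature, rank-16 factors.
n_users, n_videos, d_content, k = 100, 80, 32, 16
R = (torch.rand(n_users, n_videos) > 0.95).float()    # implicit feedback matrix
X = torch.randn(n_videos, d_content)                  # one content feature per video

U = torch.randn(n_users, k, requires_grad=True)       # user factors
V = torch.randn(n_videos, k, requires_grad=True)      # video factors
W = torch.randn(d_content, k, requires_grad=True)     # content-to-embedding regression

opt = torch.optim.Adam([U, V, W], lr=0.01)
for step in range(200):
    opt.zero_grad()
    fit = ((R - U @ V.T) ** 2).mean()                 # collaborative term
    tie = ((V - X @ W) ** 2).mean()                   # tie V to the content feature
    loss = fit + 0.1 * tie + 0.01 * (U ** 2).mean()   # arbitrary toy weights
    loss.backward()
    opt.step()

# Out-of-matrix (a new video with no interactions): embed via the regression.
x_new = torch.randn(1, d_content)
scores = U @ (x_new @ W).T                            # predicted affinity per user
```

A PRI-style late fusion would then combine the score lists produced by several such single-feature models according to some feature priority, though the abstract does not specify the exact fusion rule.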
Preservation of Patient Level Privacy: Federated Classification and Calibration Models
With the launch of the Precision Medicine Initiative in the United States by the National Institutes of Health, and the emergence of large volumes of electronic health records, there are many opportunities to improve clinical decision support systems. A large number of samples is needed to build predictive models with adequate discrimination and calibration. However, protecting patient privacy is also an important concern: patient data are typically kept in localized silos, and consolidating datasets from different healthcare systems is difficult. Federated learning allows the training of a global model by amassing intermediate calculations from local medical systems. The knowledge learned from the data can be transferred and aggregated to achieve better performance than that of individual local models, so federated learning may help build better models that provide more accurate predictions. There are two types of measures for assessing how well a model performs: discrimination and calibration. While most papers report discrimination measures, calibration is often neglected even though it is a critical evaluation metric. In this dissertation, I show a novel way to build classifiers and calibration models in a federated manner, and I show how to evaluate and improve model calibration in this setting. Federated modeling enables the accumulation of knowledge and information that would otherwise remain locked behind local medical systems.
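The dissertation's concrete algorithms are not reproduced in the abstract, so the following is a generic federated-averaging sketch in the same spirit: three hypothetical hospital silos jointly train a logistic-regression classifier by exchanging only model weights, never patient records, followed by a simple calibration-in-the-large check. All names and hyperparameters are assumptions.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """A few epochs of logistic-regression gradient descent on one site's data."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))              # predicted risk
        w -= lr * X.T @ (p - y) / len(y)              # gradient step
    return w

# Three hypothetical hospital silos; raw patient data never leaves a site.
rng = np.random.default_rng(0)
sites = [(rng.normal(size=(200, 5)), rng.integers(0, 2, 200)) for _ in range(3)]

w_global = np.zeros(5)
for _ in range(20):                                   # FedAvg-style rounds
    local_ws = [local_update(w_global.copy(), X, y) for X, y in sites]
    sizes = np.array([len(y) for _, y in sites], dtype=float)
    w_global = np.average(local_ws, axis=0, weights=sizes)  # aggregate weights only

# Calibration-in-the-large at one site: mean predicted risk vs. observed rate.
X0, y0 = sites[0]
pred = 1.0 / (1.0 + np.exp(-X0 @ w_global))
print(pred.mean(), y0.mean())
```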
Cultural influences on word meanings revealed through large-scale semantic alignment
If the structure of language vocabularies mirrors the structure of natural divisions that are universally perceived, then the meanings of words in different languages should closely align. By contrast, if shared word meanings are a product of shared culture, history and geography, they may differ between languages in substantial but predictable ways. Here, we analysed the semantic neighbourhoods of 1,010 meanings in 41 languages. The most-aligned words were from semantic domains with high internal structure (number, quantity and kinship). Words denoting natural kinds, common actions and artefacts aligned much less well. Languages that are more geographically proximate, more historically related and/or spoken by more-similar cultures had more aligned word meanings. These results provide evidence that the meanings of common words vary in ways that reflect the culture, history and geography of their users.
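One way to operationalize this kind of semantic alignment (an illustrative assumption about the method, not a reproduction of it) is to measure how much the nearest-neighbour structure of the same meanings overlaps across two languages' semantic spaces:

```python
import numpy as np

def alignment(emb_a, emb_b, k=10):
    """Neighbourhood overlap for row-aligned embeddings of the same meanings
    in two languages: the fraction of each meaning's k nearest neighbours
    (by cosine similarity) that are shared across the two spaces."""
    def knn(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = E @ E.T
        np.fill_diagonal(sims, -np.inf)               # a word is not its own neighbour
        return np.argsort(-sims, axis=1)[:, :k]
    na, nb = knn(emb_a), knn(emb_b)
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(na, nb)])

# Toy demo: 1,010 meanings, 50-dim vectors per language (random placeholders).
rng = np.random.default_rng(1)
a, b = rng.normal(size=(1010, 50)), rng.normal(size=(1010, 50))
print(alignment(a, b))                                # near 0 for unrelated spaces
```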
The relational processing limits of classic and contemporary neural network models of language processing
Whether neural networks can capture relational knowledge is a matter of long-standing controversy. Recently, some researchers have argued that (1) classic connectionist models can handle relational structure and (2) the success of deep learning approaches to natural language processing suggests that structured representations are unnecessary to model human language. We tested the Story Gestalt model, a classic connectionist model of text comprehension, and a Sequence-to-Sequence with Attention model, a modern deep learning architecture for natural language processing. Both models were trained to answer questions about stories based on abstract thematic roles. Two simulations varied the statistical structure of new stories while keeping their relational structure intact. The performance of each model fell below chance under at least one manipulation. We argue that both models fail our tests because they cannot perform dynamic binding. These results cast doubt on the suitability of traditional neural networks for explaining relational reasoning and language processing phenomena.
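A toy sketch of the manipulation described above (hypothetical stimuli, not the authors' materials): training stories bind each thematic role to its typical fillers, while test stories keep the same relational frame but swap the filler pools, so the surface statistics change while the relational structure stays intact.

```python
import random

# Typical fillers for each thematic role in the hypothetical training stories.
TYPICAL = {"agent": ["chef", "pilot"], "patient": ["meal", "plane"]}

def story(typical=True):
    if typical:
        agent = random.choice(TYPICAL["agent"])
        patient = random.choice(TYPICAL["patient"])
    else:
        # Swap the filler pools: statistics change, relations do not.
        agent = random.choice(TYPICAL["patient"])
        patient = random.choice(TYPICAL["agent"])
    text = f"the {agent} served the {patient}"
    return text, {"agent": agent, "patient": patient}  # QA target: who did what

train = [story(True) for _ in range(5)]                # statistically typical
test = [story(False) for _ in range(5)]                # relationally identical
```

A model that relies on co-occurrence statistics rather than dynamic role-filler binding would be expected to answer the training-style questions but fail on the swapped test stories.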
A Data-driven Approach to the Semantics of Iconicity in American Sign Language and English
A growing body of research shows that both signed and spoken languages display regular patterns of iconicity in their vocabularies. We compared iconicity in the lexicons of American Sign Language (ASL) and English by combining previously collected ratings of ASL signs (Caselli, Sevcikova Sehyr, Cohen-Goldberg, & Emmorey, 2017) and English words (Winter, Perlman, Perry, & Lupyan, 2017) with data-driven semantic vectors derived from English. Our analyses show that models of spoken language lexical semantics drawn from large text corpora can be useful for predicting the iconicity of signs as well as words. Compared to English, ASL has a greater number of regions of semantic space with concentrations of highly iconic vocabulary. There was an overall negative relationship between semantic density and the iconicity of both English words and ASL signs. This negative relationship disappeared for highly iconic signs, suggesting that iconic forms may be more easily discriminable in ASL than in English. Our findings contribute to an increasingly detailed picture of how iconicity is distributed across different languages.
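The semantic density measure mentioned above can be made concrete as the mean cosine similarity of each word to its nearest semantic neighbours; the sketch below computes it and correlates it with iconicity ratings. The random arrays stand in for real distributional vectors and the published ASL/English norms.

```python
import numpy as np

def semantic_density(vectors, k=20):
    """Mean cosine similarity of each word to its k nearest semantic
    neighbours: higher density = a more crowded region of semantic space."""
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ V.T
    np.fill_diagonal(sims, -np.inf)                   # exclude self-similarity
    top = -np.sort(-sims, axis=1)[:, :k]              # k highest similarities
    return top.mean(axis=1)

# Toy demo with random placeholders for vectors and iconicity ratings.
rng = np.random.default_rng(2)
vecs = rng.normal(size=(500, 300))
iconicity = rng.normal(size=500)
density = semantic_density(vecs)
print(np.corrcoef(density, iconicity)[0, 1])          # abstract reports a negative r
```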
Machine Translation with Image Context from Mandarin Chinese to English
Despite ongoing improvements in machine translation, machine translators still lack the capability to incorporate the context from which source text may have been derived: they translate text from a source language into a target language without observing any visual context. This work aims to produce a neural machine translation model that accepts both text and image context as a multimodal translator from Mandarin Chinese to English. The model was trained on a small multimodal dataset of 700 images and sentences and compared to a translator trained only on the text associated with those images. The model was also trained on a larger text-only corpus of 21,000 sentences, with and without the addition of the small multimodal dataset. Notable differences emerged between the text-only and multimodal translators when trained on the small 700-sentence-and-image dataset; however, no observable discrepancies were found between the translators trained on the larger text corpus. Further research with a larger multimodal dataset could help clarify the utility of multimodal machine translation.
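As an illustration of the kind of model described (the paper does not spell out its architecture in the abstract), the sketch below fuses a source-text encoder state with a pre-extracted image feature vector to initialize the target-language decoder. All dimensions, the fusion scheme, and the class name are assumptions.

```python
import torch
import torch.nn as nn

class MultimodalTranslator(nn.Module):
    """Toy multimodal NMT: encode source (Mandarin) token ids, project a
    pre-extracted image feature, fuse both to seed the English decoder."""
    def __init__(self, src_vocab=5000, tgt_vocab=5000, d=256, img_dim=2048):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.img_proj = nn.Linear(img_dim, d)
        self.fuse = nn.Linear(2 * d, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def forward(self, src_ids, img_feat, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))    # text context: (1, B, d)
        fused = torch.tanh(self.fuse(
            torch.cat([h[-1], self.img_proj(img_feat)], dim=-1)))
        dec, _ = self.decoder(self.tgt_emb(tgt_ids), fused.unsqueeze(0))
        return self.out(dec)                          # per-step target logits

model = MultimodalTranslator()
logits = model(torch.randint(0, 5000, (2, 12)),       # source sentences
               torch.randn(2, 2048),                  # e.g. CNN image features
               torch.randint(0, 5000, (2, 10)))       # shifted target inputs
```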