2,185 research outputs found
Semi-supervised transductive speaker identification
We present an application of transductive semi-supervised learning to the problem of speaker identification. Formulating this problem as one of transduction is the most natural choice in some scenarios, such as when annotating archived speech data. Experiments with the CHAINS corpus show that, using the basic MFCC-encoding of recorded utterances, a well known simple semi-supervised algorithm, label spread, can solve this problem well. With only a small number of labelled utterances, the semi-supervised algorithm drastically outperforms a state of the art supervised support vector machine algorithm. Although we restrict ourselves to the transductive setting in this paper, the results encourage future work on semi-supervised learning for inductive speaker identification
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied in the name of `model adaptation'.
Recent advance in deep learning shows that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the `transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research towards this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.Comment: 13 pages, APSIPA 201
Enabling Auditing and Intrusion Detection of Proprietary Controller Area Networks
The goal of this dissertation is to provide automated methods for security researchers to overcome ‘security through obscurity’ used by manufacturers of proprietary Industrial Control Systems (ICS). `White hat\u27 security analysts waste significant time reverse engineering these systems\u27 opaque network configurations instead of performing meaningful security auditing tasks. Automating the process of documenting proprietary protocol configurations is intended to improve independent security auditing of ICS networks. The major contributions of this dissertation are a novel approach for unsupervised lexical analysis of binary network data flows and analysis of the time series data extracted as a result. We demonstrate the utility of these methods using Controller Area Network (CAN) data sampled from passenger vehicles
Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection
The paper presents new annotated corpora for performing stance detection on Spanish Twitter data, most notably Health-related tweets. The objectives of this research are threefold: (1) to develop a manually annotated benchmark corpus for emotion recognition taking into account different variants of Spanish in social posts; (2) to evaluate the efficiency of semi-supervised models for extending such corpus with unlabelled posts; and (3) to describe such short text corpora via specialised topic modelling. A corpus of 2,801 tweets about COVID-19 vaccination was annotated by three native speakers to be in favour (904), against (674) or neither (1,223) with a 0.725 Fleiss’ kappa score. Results show that the self-training method with SVM base estimator can alleviate annotation work while ensuring high model performance. The self-training model outperformed the other approaches and produced a corpus of 11,204 tweets with a macro averaged f1 score of 0.94. The combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the contents of both corpora. Topic quality was measured in terms of the trustworthiness and the validation index.Agencia Estatal de Investigación | Ref. PID2020–113673RB-I00Xunta de Galicia | Ref. ED431C2018/55Fundação para a Ciência e a Tecnologia | Ref. UIDB/04469/2020Financiado para publicación en acceso aberto: Universidade de Vigo/CISU
A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams
Unlabelled data appear in many domains and are particularly relevant to
streaming applications, where even though data is abundant, labelled data is
rare. To address the learning problems associated with such data, one can
ignore the unlabelled data and focus only on the labelled data (supervised
learning); use the labelled data and attempt to leverage the unlabelled data
(semi-supervised learning); or assume some labels will be available on request
(active learning). The first approach is the simplest, yet the amount of
labelled data available will limit the predictive performance. The second
relies on finding and exploiting the underlying characteristics of the data
distribution. The third depends on an external agent to provide the required
labels in a timely fashion. This survey pays special attention to methods that
leverage unlabelled data in a semi-supervised setting. We also discuss the
delayed labelling issue, which impacts both fully supervised and
semi-supervised methods. We propose a unified problem setting, discuss the
learning guarantees and existing methods, explain the differences between
related problem settings. Finally, we review the current benchmarking practices
and propose adaptations to enhance them
- …