1,412 research outputs found
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Most speech and language technologies are trained with massive amounts of
speech and text information. However, most of the world languages do not have
such resources or stable orthography. Systems constructed under these almost
zero resource conditions are not only promising for speech technology but also
for computational language documentation. The goal of computational language
documentation is to help field linguists to (semi-)automatically analyze and
annotate audio recordings of endangered and unwritten languages. Example tasks
are automatic phoneme discovery or lexicon discovery from the speech signal.
This paper presents a speech corpus collected during a realistic language
documentation process. It is made up of 5k speech utterances in Mboshi (Bantu
C25) aligned to French text translations. Speech transcriptions are also made
available: they correspond to a non-standard graphemic form close to the
language phonology. We present how the data was collected, cleaned and
processed and we illustrate its use through a zero-resource task: spoken term
discovery. The dataset is made available to the community for reproducible
computational language documentation experiments and their evaluation.Comment: accepted to LREC 201
A radiative transfer framework for non-exponential media
We develop a new theory of volumetric light transport for media with non-exponential free-flight distributions. Recent insights from atmospheric sciences and neutron transport demonstrate that such distributions arise in the presence of correlated scatterers, which are naturally produced by processes such as cloud condensation and fractal-pattern formation. Our theory accommodates correlations by disentangling the concepts of the free-flight distribution and transmittance, which are equivalent when scatterers are statistically independent, but become distinct when correlations are present. Our theory results in a generalized path integral which allows us to handle non-exponential media using the full range of Monte Carlo rendering algorithms while enriching the range of achievable appearance. We propose parametric models for controlling the statistical correlations by leveraging work on stochastic processes, and we develop a method to combine such unresolved correlations (and the resulting non-exponential free-flight behavior) with explicitly modeled macroscopic heterogeneity. This provides a powerful authoring approach where artists can freely design the shape of the attenuation profile separately from the macroscopic heterogeneous density, while our theory provides a physically consistent interpretation in terms of a path space integral. We address important considerations for graphics including energy conservation, reciprocity, and bidirectional rendering algorithms, all in the presence of surfaces and correlated media
Supervised and unsupervised language modelling in Chest X-Ray radiological reports
Chest radiography (CXR) is the most commonly used imaging modality and deep neural network (DNN) algorithms have shown promise in effective triage of normal and abnormal radiograms. Typically, DNNs require large quantities of expertly labelled training exemplars, which in clinical contexts is a major bottleneck to effective modelling, as both considerable clinical skill and time is required to produce high-quality ground truths. In this work we evaluate thirteen supervised classifiers using two large free-text corpora and demonstrate that bi-directional long short-term memory (BiLSTM) networks with attention mechanism effectively identify Normal, Abnormal, and Unclear CXR reports in internal (n = 965 manually-labelled reports, f1-score = 0.94) and external (n = 465 manually-labelled reports, f1-score = 0.90) testing sets using a relatively small number of expert-labelled training observations (n = 3,856 annotated reports). Furthermore, we introduce a general unsupervised approach that accurately distinguishes Normal and Abnormal CXR reports in a large unlabelled corpus. We anticipate that the results presented in this work can be used to automatically extract standardized clinical information from free-text CXR radiological reports, facilitating the training of clinical decision support systems for CXR triage
Vulnerability Clustering and other Machine Learning Applications of Semantic Vulnerability Embeddings
Cyber-security vulnerabilities are usually published in form of short natural
language descriptions (e.g., in form of MITRE's CVE list) that over time are
further manually enriched with labels such as those defined by the Common
Vulnerability Scoring System (CVSS). In the Vulnerability AI (Analytics and
Intelligence) project, we investigated different types of semantic
vulnerability embeddings based on natural language processing (NLP) techniques
to obtain a concise representation of the vulnerability space. We also
evaluated their use as a foundation for machine learning applications that can
support cyber-security researchers and analysts in risk assessment and other
related activities. The particular applications we explored and briefly
summarize in this report are clustering, classification, and visualization, as
well as a new logic-based approach to evaluate theories about the vulnerability
space.Comment: 27 pages, 13 figure
A Survey on Text Classification Algorithms: From Text to Predictions
In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, of which the description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods in both their functioning and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language, as well as supplying instructions for the synthesis of two new multilabel datasets, which we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models
Bi-Objective Search with Bi-Directional A*
Bi-objective search is a well-known algorithmic problem, concerned with finding a set of optimal solutions in a two-dimensional domain. This problem has a wide variety of applications such as planning in transport systems or optimal control in energy systems. Recently, bi-objective A*-based search (BOA*) has shown state-of-the-art performance in large networks. This paper develops a bi-directional and parallel variant of BOA*, enriched with several speed-up heuristics. Our experimental results on 1,000 benchmark cases show that our bi-directional A* algorithm for bi-objective search (BOBA*) can optimally solve all of the benchmark cases within the time limit, outperforming the state of the art BOA*, bi-objective Dijkstra and bi-directional bi-objective Dijkstra by an average runtime improvement of a factor of five over all of the benchmark instances
- …