Interactive Machine Learning with Applications in Health Informatics
Recent years have witnessed unprecedented growth in health data, including millions of biomedical research publications, electronic health records, patient discussions on health forums and social media, fitness tracker trajectories, and genome sequences. Information retrieval and machine learning techniques are powerful tools for unlocking the knowledge in these data, yet they need to be guided by human experts. Unlike in other domains, labeling and analyzing health data requires highly specialized expertise, and the time of medical experts is extremely limited. How can we mine big health data with little expert effort? In this dissertation, I develop state-of-the-art interactive machine learning algorithms that bring together human intelligence and machine intelligence in health data mining tasks. By making efficient use of human experts' domain knowledge, we can achieve high-quality solutions with minimal manual effort.
I first introduce a high-recall information retrieval framework that helps human users efficiently harvest not just one but as many relevant documents as possible from a searchable corpus. This is a common need in professional search scenarios such as medical search and literature review. Then I develop two interactive machine learning algorithms that leverage human experts' domain knowledge to combat the curse of "cold start" in active learning, with applications in clinical natural language processing. A consistent empirical observation is that the overall learning process can be reliably accelerated by a knowledge-driven "warm start", followed by machine-initiated active learning. As a theoretical contribution, I propose a general framework for interactive machine learning. Under this framework, a unified optimization objective explains many existing algorithms used in practice, and inspires the design of new algorithms.
PhD
Computer Science & Engineering
University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/147518/1/raywang_1.pd
Breaking Sticks and Ambiguities with Adaptive Skip-gram
The recently proposed Skip-gram model is a powerful method for learning
high-dimensional word representations that capture rich semantic relationships
between words. However, Skip-gram, like most prior work on learning word
representations, does not take word ambiguity into account and maintains only
a single representation per word. Although a number of Skip-gram modifications
have been proposed to overcome this limitation and learn multi-prototype word
representations, they either require a known number of word meanings or learn
them using greedy heuristic approaches. In this paper we propose the Adaptive
Skip-gram model, a nonparametric Bayesian extension of Skip-gram that can
automatically learn the required number of representations for all words at
the desired semantic resolution. We derive an efficient online variational
learning algorithm for the model and empirically demonstrate its efficiency on
the word-sense induction task.
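As a rough illustration of how a nonparametric prior can let the number of senses per word grow with the data, the sketch below computes sense probabilities from a stick-breaking construction. It assumes, based on the title, that a stick-breaking (Dirichlet process) prior is involved; the function names and the concentration parameter `alpha` are illustrative choices, not taken from the paper.

```python
import random


def stick_breaking_weights(betas):
    """Convert stick-breaking fractions into mixture weights.

    weight_k = beta_k * prod_{r < k} (1 - beta_r)
    Returns the weights and the mass left on the unbroken part of the stick.
    """
    weights = []
    remaining = 1.0  # length of stick not yet assigned to any sense
    for beta in betas:
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    return weights, remaining


def sample_betas(alpha, truncation, seed=0):
    """Draw `truncation` fractions from Beta(1, alpha).

    `alpha` is an assumed concentration parameter: larger values spread
    the mass over more components.
    """
    rng = random.Random(seed)
    return [rng.betavariate(1.0, alpha) for _ in range(truncation)]


if __name__ == "__main__":
    weights, leftover = stick_breaking_weights([0.5, 0.5, 0.5])
    print(weights, leftover)  # [0.5, 0.25, 0.125] 0.125
```

With a small `alpha`, most of the mass tends to fall on the first one or two components, which corresponds to a word with few senses; a larger `alpha` tends to yield a finer split.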
EXPLOITING TAGGED AND UNTAGGED CORPORA FOR WORD SENSE DISAMBIGUATION
Ph.D
DOCTOR OF PHILOSOPH
Lexical simplification for the systematic support of cognitive accessibility guidelines
The Internet has come a long way in recent years, contributing to the proliferation of
large volumes of digitally available information. We can access this content through
user interfaces; however, it is not accessible to everyone. The users mainly affected are
people with disabilities, who are already a considerable number, but accessibility barriers
affect a wide range of user groups and contexts of use in accessing digital information.
Some of these barriers are caused by inaccessible language, when texts contain long
sentences, unusual words and complex linguistic structures. These accessibility barriers
directly affect people with cognitive disabilities.
To make textual content more accessible, there are initiatives such
as the Easy Reading guidelines, the Plain Language guidelines and some of the language-specific
Web Content Accessibility Guidelines (WCAG). These guidelines provide documentation,
but do not specify methods for systematically meeting the requirements implicit in
them. Methods from the Natural Language Processing (NLP) discipline can
support compliance with the language-related cognitive
accessibility guidelines.
The task of text simplification aims at reducing the linguistic complexity of a text from
a syntactic and lexical perspective, the latter being the main focus of this Thesis. In this
sense, one solution is to identify which words in a text are complex or uncommon
and, where such words exist, to provide a more common and simpler synonym, together
with a simple definition, all oriented to people with cognitive disabilities.
With this goal in mind, this Thesis presents the study, analysis, design and development
of an architecture, NLP methods, resources and tools for the lexical simplification of
texts for the Spanish language in a generic domain in the field of cognitive accessibility.
To achieve this, each of the steps present in the lexical simplification processes is studied,
together with methods for word sense disambiguation. As a contribution, different
types of word embeddings are explored and created, supported by traditional and dynamic
embedding methods, such as transfer learning methods. In addition, since most
NLP methods require data to operate, a resource in the framework of cognitive
accessibility is presented as a contribution.
Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid
Chair: José Antonio Macías Iglesias. Secretary: Israel González Carrasco. Member: Raquel Hervás Ballestero
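The lexical simplification steps described in the abstract above (identify complex or uncommon words, then offer a simpler synonym) can be sketched as follows. This is a toy illustration only: the frequency table, synonym dictionary and threshold are hypothetical, the example is in English for readability although the thesis targets Spanish, and the word sense disambiguation and embedding-based steps are omitted.

```python
import re

# Hypothetical toy resources; a real system would rely on corpus
# frequencies and a sense-aware synonym source.
WORD_FREQUENCY = {
    "the": 5000, "use": 900, "help": 800, "tools": 300,
    "reading": 250, "facilitate": 20, "utilize": 12,
}
SYNONYMS = {
    "utilize": ["use", "employ"],
    "facilitate": ["help", "ease"],
}
COMPLEX_BELOW = 50  # assumed threshold: rarer words count as complex


def is_complex(word):
    """A word is treated as complex when it is rare or unknown."""
    return WORD_FREQUENCY.get(word, 0) < COMPLEX_BELOW


def simplest_synonym(word):
    """Return the most frequent non-complex synonym, or None."""
    candidates = [s for s in SYNONYMS.get(word, []) if not is_complex(s)]
    if not candidates:
        return None
    return max(candidates, key=lambda s: WORD_FREQUENCY[s])


def simplify(text):
    """Replace each complex word that has a simpler synonym."""
    def replace(match):
        token = match.group(0)
        word = token.lower()
        if not is_complex(word):
            return token
        return simplest_synonym(word) or token

    return re.sub(r"[A-Za-z]+", replace, text)


if __name__ == "__main__":
    print(simplify("Tools facilitate reading"))  # Tools help reading
```

Words that are unknown to the frequency table and have no listed synonym are left unchanged, so the sketch never removes content it cannot simplify.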
Applying Lexical Substitution and Text Mining for Bioinspired Engineering Design
Nature has repeated its evolution processes and developed its own “engineering” principles over a long period of time. Bioinspired design starts from the belief that nature has the most effective and optimized problem-solving schemes, which can be applied to human problems directly or indirectly. In summary, bioinspired design is the study of the design process, adapting the structure, behavior, or organic mechanisms of biology to engineering problems.
In bioinspired design studies, researchers have sought ways of improving concept generation through texts. Generally, there are two problems in text-based bioinspired design. First, there is a great lexical gap between the two areas, biology and engineering. Thus, understanding the context of biological text is compromised, hindering analogical transfer between the two domains. Second, the amount of text is too great to be assimilated by engineers. This volume of information confuses engineers and slows down the design process.
The present work applies lexical substitution and text mining theories to process biological text effectively. To address the lexical gap, this research developed an algorithm that translates biological terminology into words or phrases that are understandable to engineers by adapting four lexical sources: WordNet, Wikipedia, the Integrated Taxonomic Information System (ITIS), and Wordnik. For the second problem, this research categorizes biological text based on morphological solutions by adapting the Latent Semantic Analysis (LSA) technique.
Two main contributions are made in this dissertation. First, this work is the first attempt to directly bridge the lexical gap between biology and engineering by translating biological terminology. The existing approach to bioinspired design study involves building a thesaurus or database that connects a few engineering keywords and their biological correspondences, which leaves most other biological terms untranslated; this research attempts to overcome that limitation. Second, this research improves natural language-based bioinspired concept generation. Specifically, the accessibility of biotexts for bioinspired design seems to be improved by enabling engineers to selectively acquire biological information for their problems.
Graph-based approaches to word sense induction
This thesis is a study of Word Sense Induction (WSI), the Natural Language Processing (NLP) task of automatically discovering word meanings from text. WSI is an open problem in NLP whose solution would be of considerable benefit to many other NLP tasks. It has, however, been studied by relatively few NLP researchers and often in set ways. Scope therefore exists to apply novel methods to the problem, methods that may improve upon those previously applied. This thesis applies a graph-theoretic approach to WSI. In this approach, word senses are identified by finding particular types of subgraphs in word co-occurrence graphs. A number of original methods for constructing, analysing, and partitioning graphs are introduced, with these methods then incorporated into graph-based WSI systems. These systems are then shown, in a variety of evaluation scenarios, to return results that are comparable to those of the current best performing WSI systems. The main contributions of the thesis are a novel parameter-free soft clustering algorithm that runs in time linear in the number of edges in the input graph, and novel generalisations of the clustering coefficient (a measure of vertex cohesion in graphs) to the weighted case. Further contributions of the thesis include: a review of graph-based WSI systems that have been proposed in the literature; analysis of the methodologies applied in these systems; analysis of the metrics used to evaluate WSI systems; and empirical evidence to verify the usefulness of each novel method introduced in the thesis for inducing word senses.
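For readers unfamiliar with the clustering coefficient mentioned in the abstract above, here is a minimal sketch of the standard unweighted local version on a small word co-occurrence graph. It does not reproduce the weighted generalisations proposed in the thesis; the graph and the function name are illustrative assumptions.

```python
from itertools import combinations


def local_clustering(adjacency, vertex):
    """Standard unweighted local clustering coefficient.

    The fraction of pairs of neighbours of `vertex` that are themselves
    connected. Vertices with fewer than two neighbours get 0.0.
    """
    neighbours = adjacency[vertex]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical co-occurrence graph around the ambiguous word "bank".
graph = {
    "bank": {"money", "loan", "river", "shore"},
    "money": {"bank", "loan"},
    "loan": {"bank", "money"},
    "river": {"bank", "shore"},
    "shore": {"bank", "river"},
}

if __name__ == "__main__":
    print(local_clustering(graph, "money"))  # 1.0
    print(round(local_clustering(graph, "bank"), 3))  # 0.333
```

In this toy graph the ambiguous word has a low coefficient because its neighbours fall into two groups that are not linked to each other, which is the kind of structure a graph-based WSI system looks for.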