56 research outputs found
Translation-based Ranking in Cross-Language Information Retrieval
Today's amount of user-generated, multilingual textual data generates the necessity for information processing
systems, where cross-linguality, i.e., the ability to work on more than one
language, is fully integrated into the underlying models. In the particular
context of Information Retrieval (IR), this amounts to ranking and retrieving relevant
documents from a large repository in language A, given a user's information
need expressed in a query in language B. This kind of application is commonly
termed a Cross-Language Information Retrieval (CLIR) system. Such
CLIR systems typically involve a translation component of varying complexity,
which is responsible for translating the user input into the document
language. Using query translations from modern, phrase-based Statistical
Machine Translation (SMT) systems, and subsequently retrieving monolingually
is thus a straightforward choice. However, the amount of work committed to
integrating such SMT models into CLIR, or even to jointly modeling translation and
retrieval, is rather small.
In this thesis, I focus on the shared aspect of ranking in translation-based
CLIR: both translation and retrieval models induce rankings over a set of
candidate structures through assignment of scores. The subject of this thesis
is to exploit this commonality in three different ranking tasks: (1) "Mate-ranking" refers to the
task of mining comparable data for SMT domain adaptation through translation-based
CLIR. "Cross-lingual mates" are direct or close translations of the query.
I will show that such a CLIR system is able to find
in-domain comparable data from noisy user-generated corpora and improves
in-domain translation performance of an SMT system. Conversely, the CLIR system
itself relies on a translation model that is tailored for retrieval. This
leads to the second direction of research, in which I develop two ways to
optimize an SMT model for retrieval, namely (2) by SMT parameter optimization
towards a retrieval objective ("translation ranking"), and (3) by presenting
a joint model of translation and retrieval for "document ranking". The latter
abandons the common architecture of modeling both components separately. The
former task refers to optimizing the SMT model to prefer
translation candidates that work well for retrieval. In the core task of "document ranking" for CLIR, I present a model that directly ranks documents using an SMT decoder. I report substantial improvements
over state-of-the-art translation-based CLIR baseline systems, indicating that
a joint model of translation and retrieval is a promising direction of
research in the field of CLIR.
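The baseline architecture described in this abstract (translate the query, then retrieve monolingually) can be illustrated with a minimal sketch. It is not the thesis's model: the `TRANSLATIONS` table, its probabilities, and the TF-IDF scoring are assumptions invented for illustration, standing in for an SMT system and a real retrieval model.

```python
import math
from collections import Counter

# Hypothetical toy translation table: query term (language B) mapped to
# weighted document-language (language A) translations. A real system would
# obtain these from an SMT model; the entries are invented for illustration.
TRANSLATIONS = {
    "haus": {"house": 0.8, "home": 0.2},
    "katze": {"cat": 1.0},
}


def translate_query(query):
    """Map each query term to weighted document-language terms."""
    weights = Counter()
    for term in query.lower().split():
        for target, prob in TRANSLATIONS.get(term, {}).items():
            weights[target] += prob
    return weights


def rank(query, docs):
    """Return document indices ordered by a translation-weighted TF-IDF score."""
    weights = translate_query(query)
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term, weight in weights.items():
            # Document frequency of the translated term in the collection.
            df = sum(1 for other in tokenized if term in other)
            if df:
                score += weight * tf[term] * math.log(1 + n_docs / df)
        scored.append((score, idx))
    # Highest score first; ties broken by original document order.
    return [idx for _, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Calling `rank("Haus", docs)` on a small list of English documents returns their indices with the documents mentioning "house" or "home" first. Keeping several weighted translations per query term, rather than a single best translation, is one simple way to let translation scores influence the document ranking.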
Three Essays on Enhancing Clinical Trial Subject Recruitment Using Natural Language Processing and Text Mining
Patient recruitment and enrollment are critical factors for a successful clinical trial; however, recruitment tends to be the most common problem in most clinical trials. The success of a clinical trial depends on efficiently recruiting suitable patients to conduct the trial. Every clinical trial has a protocol, which describes what will be done in the study and how it will be conducted. The protocol also ensures the safety of the trial subjects and the integrity of the data collected. The eligibility criteria section of clinical trial protocols is important because it specifies the necessary conditions that participants have to satisfy.
Since clinical trial eligibility criteria are usually written in free text form, they are not computer interpretable. To automate the analysis of the eligibility criteria, it is therefore necessary to transform those criteria into a computer-interpretable format. The unstructured format of eligibility criteria additionally creates search efficiency issues. Thus, searching and selecting appropriate clinical trials for a patient from a relatively large number of available trials is a complex task.
A few attempts have been made to automate the matching process between patients and clinical trials. However, those attempts have not fully integrated the entire matching process and have not exploited the state-of-the-art Natural Language Processing (NLP) techniques that may improve the matching performance. Given the importance of patient recruitment in clinical trial research, the objective of this research is to automate the matching process using NLP and text mining techniques and, thereby, improve the efficiency and effectiveness of the recruitment process.
This dissertation research, which comprises three essays, investigates the issues of clinical trial subject recruitment using state-of-the-art NLP and text mining techniques.
Essay 1: Building a Domain-Specific Lexicon for Clinical Trial Subject Eligibility Analysis
Essay 2: Clustering Clinical Trials Using Semantic-Based Feature Expansion
Essay 3: An Automatic Matching Process of Clinical Trial Subject Recruitment
In essay 1, I develop a domain-specific lexicon for n-gram Named Entity Recognition (NER) in the breast cancer domain. The domain-specific dictionary is used for selection and reduction of n-gram features in the clustering of essay 2. The dictionary was evaluated by comparing it with the Systematized Nomenclature of Medicine--Clinical Terms (SNOMED CT). The results showed that it adds a significant number of new terms, which is useful for effective natural language processing.
In essay 2, I explore the clustering of similar clinical trials using the domain-specific lexicon and term expansion using synonyms from the Unified Medical Language System (UMLS). I generate word n-gram features and modify the features with the domain-specific dictionary matching process. In order to resolve semantic ambiguity, a semantic-based feature expansion technique using UMLS is applied. A hierarchical agglomerative clustering algorithm is used to generate clinical trial clusters. The focus is on summarization of clinical trial information in order to enhance trial search efficiency.
Finally, in essay 3, I investigate an automatic matching process of clinical trial clusters and patient medical records. The patient records collected from a prior study were used to test the approach. The patient records were pre-processed by tokenization and lemmatization. The pre-processed patient information was then further enhanced by matching it against the custom breast cancer dictionary described in essay 1 and by semantic feature expansion using the UMLS Metathesaurus. I then matched each patient record with the clinical trial clusters to select the best-matched cluster(s), and then with the trials within those clusters. The matching results were evaluated by an internal expert as well as an external medical expert.
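The clustering step of essay 2 can be illustrated with a minimal sketch, assuming word n-gram features, a synonym normalisation step, and average-linkage agglomerative clustering over Jaccard similarity. The `SYNONYMS` map is a hypothetical stand-in for UMLS lookups, and the similarity measure, linkage, and threshold are assumptions; the abstract does not specify them.

```python
from itertools import combinations

# Hypothetical synonym map standing in for UMLS lookups (invented entries).
SYNONYMS = {"tumor": "neoplasm", "tumour": "neoplasm", "carcinoma": "cancer"}


def features(text, n_max=2):
    """Word n-gram features (n = 1..n_max) after synonym normalisation."""
    tokens = [SYNONYMS.get(tok, tok) for tok in text.lower().split()]
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats


def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.3):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the most similar pair of clusters and stops once the
    best similarity falls below `threshold`.
    """
    feats = [features(text) for text in texts]
    clusters = [[i] for i in range(len(texts))]

    def linkage(c1, c2):
        sims = [jaccard(feats[i], feats[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (a, b), best = max(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

With this sketch, two criteria that differ only in "tumor" versus "neoplasm" produce identical feature sets and are merged, while an unrelated criterion stays in its own cluster.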
CLARIN. The infrastructure for language resources
CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future.
The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).
CLARIN
The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.
An evaluation of the challenges of Multilingualism in Data Warehouse development
In this paper we discuss Business Intelligence and define what is meant by support for Multilingualism in a Business Intelligence reporting context. We identify support for Multilingualism as a challenging issue which has implications for data warehouse design and reporting performance. Data warehouses are a core component of most Business Intelligence systems, and the star schema is the approach most widely used to develop data warehouses and dimensional Data Marts. We discuss the way in which Multilingualism can be supported in the Star Schema and find that current approaches have serious limitations, including data redundancy as well as data manipulation, performance, and maintenance issues. We propose a new approach to enable the optimal application of multilingualism in Business Intelligence. The proposed approach was found to produce satisfactory results when used in a proof-of-concept environment. Future work will include testing the approach in an enterprise environment.