What Makes a Top-Performing Precision Medicine Search Engine? Tracing Main System Features in a Systematic Way
From 2017 to 2019 the Text REtrieval Conference (TREC) held a challenge task
on precision medicine using documents from medical publications (PubMed) and
clinical trials. Despite the many performance measurements carried out in these
evaluation campaigns, the scientific community remains uncertain about the
impact that individual system features and their weights have on overall system
performance. To close this explanatory gap, we first determined
optimal feature configurations using the Sequential Model-based Algorithm
Configuration (SMAC) program and applied its output to a BM25-based search
engine. We then ran an ablation study to systematically assess the individual
contributions of relevant system features: BM25 parameters, query type and
weighting schema, query expansion, stop word filtering, and keyword boosting.
For evaluation, we used the gold standard data from the three TREC-PM
installments to assess the effectiveness of the different features with the
commonly shared infNDCG metric.
Comment: Accepted for SIGIR 2020, 10 pages
Multilingual Semantic Relation Extraction in the Medical Domain (Hizkuntza Anitzeko Erlazio Semantikoen Erauzketa Medikuntzaren Domeinuan)
In this digital age the greatest amount of data is found in raw text format. Information
Extraction (IE), which makes this data usable, has become the basis of today's applications. As has
happened in most automatic language processing tasks, deep learning has established
the state of the art in IE as well. It is well known that these techniques require a large
amount of data to achieve good performance. There are a number of domains and
contexts with little annotated data that have difficulty making effective use of
advances in deep learning techniques. Making new annotations is generally expensive,
especially to reach the numbers needed for these new models. The main goal of this work
is to explore techniques to improve the performance of deep learning systems in a
cost-effective way for these domains and contexts. More specifically, we will investigate
transfer-learning and automatic data augmentation paradigms to achieve this goal.
Finally, these techniques will be applied and evaluated in the eHealth-KD 2020 shared
task in the low-resource medical domain, with the goal of improving the state of the art.
Achieving High Quality Knowledge Acquisition using Controlled Natural Language
Controlled Natural Languages (CNLs) are efficient languages for knowledge acquisition and reasoning. They are designed as a subset of natural languages with restricted grammar while being highly expressive. CNLs are designed to be automatically translated into logical representations, which can be fed into rule engines for query and reasoning. In this work, we build a knowledge acquisition machine, called KAM, that extends Attempto Controlled English (ACE) and achieves three goals. First, KAM can identify CNL sentences that correspond to the same logical representation but are expressed in various syntactical forms. Second, KAM provides a graphical user interface (GUI) that allows users to disambiguate the knowledge acquired from text and incorporates user feedback to improve knowledge acquisition quality. Third, KAM uses a paraconsistent logical framework to encode CNL sentences in order to achieve reasoning in the presence of inconsistent knowledge.
Name Variants for Improving Entity Discovery and Linking
Identifying all names that refer to a particular set of named entities is a challenging task, as quite often we need to consider many features that involve a lot of variation, such as abbreviations, aliases, hypocorisms, multilingualism or partial matches. Each entity type can also have specific rules for name variation: personal names can include titles, country and branch names are sometimes removed from organization names, while locations are often plagued by the issue of nested entities. The lack of a clear strategy for collecting, processing and computing name variants significantly lowers the recall of tasks such as Named Entity Linking and Knowledge Base Population, since name variants are frequently used in all kinds of textual content.
This paper proposes several strategies to address these issues. Recall can be improved by combining knowledge repositories and by computing additional variants based on algorithmic approaches. Heuristics and machine learning methods then analyze the generated name variants and mark ambiguous names to increase precision. An extensive evaluation demonstrates the effects of integrating these methods into a new Named Entity Linking framework and confirms that systematically considering name variants yields significant performance improvements.