Search CORE

6,633 research outputs found

Challenges and solutions for Latin named entity recognition

Author: Ajaka Petra
Brown Christopher
de Marneffe Marie-Catherine
Elsner Micha
Erdmann Alex
Janse Mark
Joseph Brian D.
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2016
Field of study

Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many down-stream applications, from machine translation to digital historiography, enabling Classicists, historians, and archaeologists for instance, to track the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality

Ghent University Academic Bibliography

PeptiCKDdb-peptide- and protein-centric database for the investigation of genesis and progression of chronic kidney disease

Author: Fernandes Marco
Filip Szymon
Husi Holger
Jankowski Joachim
Krochmal Magdalena
Mischak Harald
Pontillo Claudia
Vlahou Antonia
Zoidakis Jerome
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2016
Field of study

The peptiCKDdb is a publicly available database platform dedicated to support research in the field of chronic kidney disease (CKD) through identification of novel biomarkers and molecular features of this complex pathology. PeptiCKDdb collects peptidomics and proteomics datasets manually extracted from published studies related to CKD. Datasets from peptidomics or proteomics, human case/control studies on CKD and kidney or urine profiling were included. Data from 114 publications (studies of body fluids and kidney tissue: 26 peptidomics and 76 proteomics manuscripts on human CKD, and 12 focusing on healthy proteome profiling) are currently deposited and the content is quarterly updated. Extracted datasets include information about the experimental setup, clinical study design, discovery-validation sample sizes and list of differentially expressed proteins (P-value < 0.05). A dedicated interactive web interface, equipped with multiparametric search engine, data export and visualization tools, enables easy browsing of the data and comprehensive analysis. In conclusion, this repository might serve as a source of data for integrative analysis or a knowledgebase for scientists seeking confirmation of their findings and as such, is expected to facilitate the modeling of molecular mechanisms underlying CKD and identification of biologically relevant biomarkers.Database URL: www.peptickddb.com

Institutional Repository of the Freie Universität Berlin

Maastricht University Research Portal

Crossref

PubMed Central

Publikationsserver der RWTH Aachen University

Enlighten

EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets

Author: A Bruns
AM Azmi
BS Wasike
D Bodoff
D Elsweiler
Hind Almerekhi
J Benhardus
JL Fleiss
JR Landis
K Darwish
M Efron
M Rowe
M Sanderson
Maram Hasanain
Mucahid Kutlu
Reem Suwaileh
RL Brennan
Tamer Elsayed
W Magdy
Zhang Y
Publication venue
Publication date: 21/08/2017
Field of study

This article introduces a new language-independent approach for creating a large-scale high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevancy and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology on Arabic tweets resulted in EveTAR , the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets

arXiv.org e-Print Archive

Qatar University Institutional Repository

Crossref

People on Drugs: Credibility of User Statements in Health Communities

Author: Danescu-Niculescu-Mizil Cristian
Mukherjee Subhabrata
Weikum Gerhard
Publication venue
Publication date: 06/05/2017
Field of study

Online health communities are a valuable source of information for patients and physicians. However, such user-generated resources are often plagued by inaccuracies and misinformation. In this work we propose a method for automatically establishing the credibility of user-generated medical statements and the trustworthiness of their authors by exploiting linguistic cues and distant supervision from expert sources. To this end we introduce a probabilistic graphical model that jointly learns user trustworthiness, statement credibility, and language objectivity. We apply this methodology to the task of extracting rare or unknown side-effects of medical drugs --- this being one of the problems where large scale non-expert data has the potential to complement expert medical knowledge. We show that our method can reliably extract side-effects and filter out false statements, while identifying trustworthy users that are likely to contribute valuable medical information

arXiv.org e-Print Archive

MPG.PuRe

Development Of An Instrument To Measure Beliefs And Attitudes From Heart Valve Disease Patients.

Author: Colombo R.C.
Gallani M.C.
Padilha K.M.
Publication venue
Publication date: 26/11/2015
Field of study

The objective of this study was to verify content validity and reliability of "CAV-Instrument" -- an instrument to measure beliefs and attitudes of heart valve disease patients concerning their illness and treatment. The instrument was analyzed by three judges (using predetermined criteria) and submitted to the pretest (n = 17 subjects). The majority of the items were evaluated as adequate regarding their pertinence, clearness and significance regarding the analyzed questions. The pretest showed the necessity for small changes in some statements, which optimized instrument comprehension by the patients. The restructured instrument was applied to 46 patients to verify internal consistency. The whole instrument and most of its scales presented satisfactory internal consistency. It is concluded that the instrument has content validity and is internally consistent, ratifying the adequacy of its application to measure the strength of association among the researched variables.12345345

Repositorio da Producao Cientifica e Intelectual da Unicamp