220 research outputs found

    KOSHIK: A large-scale distributed computing framework for NLP

    In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.
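
    As a rough illustration of the layered annotation idea described above, the sketch below serializes a document record whose annotation layers sit alongside the untouched original text, using the Avro binary format via the fastavro package. The schema and field names are assumptions for illustration, not KOSHIK's actual annotation model.

```python
# Hypothetical sketch of a layered annotation record serialized with Avro.
# Schema and field names are assumptions, not KOSHIK's actual format.
from io import BytesIO
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "name": "Document",
    "type": "record",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "text", "type": "string"},           # original content, never modified
        {"name": "annotations", "type": {             # incrementally appended layers
            "type": "array",
            "items": {
                "name": "Annotation",
                "type": "record",
                "fields": [
                    {"name": "layer", "type": "string"},   # e.g. "token", "sentence", "ner"
                    {"name": "begin", "type": "int"},
                    {"name": "end", "type": "int"},
                    {"name": "label", "type": "string"},
                ],
            },
        }},
    ],
})

doc = {
    "id": "enwiki:12",
    "text": "KOSHIK processes Wikipedia.",
    "annotations": [
        {"layer": "token", "begin": 0, "end": 6, "label": "PROPN"},
    ],
}

buf = BytesIO()
writer(buf, schema, [doc])          # serialize to the Avro binary format
buf.seek(0)
print(next(reader(buf))["annotations"])
```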

    Services approach & overview general tools and resources

    The contents of this deliverable are split into three groups. Following an introduction, a concept and vision are sketched for how to establish the necessary natural language processing (NLP) services, including the integration of existing resources. To that end, an overview of the state of the art is given, incorporating technologies developed by the consortium partners and beyond, followed by the service approach and a practical example. Second, a concept and vision are elaborated for how to create interoperability for the envisioned learning tools, allowing for quick and painless integration into existing learning environment(s). Third, generic paradigms and guidelines for service integration are provided. The work on this publication has been sponsored by the LTfLL STREP, which is funded by the European Commission's 7th Framework Programme, contract 212578 [http://www.ltfll-project.org].

    Big data driven co-occurring evidence discovery in chronic obstructive pulmonary disease patients

    © 2017, The Author(s). Background: Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung disease that affects airflow to the lungs. Discovering the co-occurrence of COPD with other diseases, symptoms, and medications is invaluable to medical staff. Building co-occurrence indexes and finding causal relationships with COPD can be difficult because disease prevalence within a population often influences the results. A method that better separates occurrence within COPD patients from population prevalence would be desirable. Large hospital systems may have tens of millions of patient records spanning decades of collection, so a scalable big data approach is desirable. The presented method, Co-Occurring Evidence Discovery (COED), provides a methodology and framework to address these issues. Methods: Natural language processing methods are used to examine 64,371 deidentified clinical notes and discover associations between COPD and medical terms. Apache cTAKES is leveraged to annotate and structure the clinical notes, and several extensions to cTAKES have been written to parallelize the annotation of large sets of notes. A co-occurrence score is presented that can penalize scores based on term prevalence, along with a baseline method traditionally used for finding co-occurrence. These scoring systems are implemented using Apache Spark. Dictionaries of ground-truth terms for diseases, medications, and symptoms have been created using clinical domain knowledge. COED and the baseline methods are compared using precision, recall, and F1 score. Results: The highest-scoring diseases using COED are lung and respiratory diseases. In contrast, the baseline methods rank diseases with high population prevalence highest. Medications and symptoms evaluated with COED show similar results. When evaluated against the ground-truth dictionaries, the maximum improvements in recall for symptoms, diseases, and medications were 0.212, 0.130, and 0.174, and the maximum improvements in precision were 0.303, 0.333, and 0.180. Median increases in F1 score for symptoms, diseases, and medications were 38.1%, 23.0%, and 17.1%. A paired t-test showed the F1 score increases to be statistically significant (p < 0.01). Conclusion: Penalizing terms that are highly frequent in the corpus results in better precision and recall and gives a clearer picture of the diseases, symptoms, and medications co-occurring with COPD. Using a mathematical and computational approach rather than a purely expert-driven one, large dictionaries of COPD-related terms can be assembled in a short amount of time.
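
    The abstract does not state the COED scoring formula, so the sketch below only illustrates the general idea of penalizing corpus-wide prevalence, using a simple PMI-style ratio as a stand-in next to a raw co-occurrence baseline. The paper implements its scoring on Apache Spark; plain Python is used here for brevity, and the toy notes are invented.

```python
# Illustrative sketch only: a prevalence-penalized co-occurrence score.
# The actual COED formula is not given in the abstract; a PMI-style
# ratio stands in for it here.
import math
from collections import Counter

def cooccurrence_scores(notes, target="copd"):
    """notes: list of sets of normalized terms, one set per clinical note."""
    n = len(notes)
    term_freq = Counter(t for note in notes for t in note)            # corpus prevalence
    joint = Counter(t for note in notes if target in note for t in note)
    target_freq = term_freq[target]

    baseline, penalized = {}, {}
    for term, co in joint.items():
        if term == target:
            continue
        baseline[term] = co                                           # raw co-occurrence count
        # Penalize corpus-prevalent terms: log of observed co-occurrence
        # over what chance alone would predict.
        expected = target_freq * term_freq[term] / n
        penalized[term] = math.log(co / expected)
    return baseline, penalized

notes = [
    {"copd", "dyspnea", "albuterol"},
    {"copd", "dyspnea", "hypertension"},
    {"hypertension", "lisinopril"},
    {"hypertension", "dyspnea"},
]
base, coed_like = cooccurrence_scores(notes)
print(sorted(coed_like.items(), key=lambda kv: -kv[1]))
```

    On this toy corpus the penalized score ranks the COPD-specific term (albuterol) above the corpus-wide prevalent one (hypertension), even though their raw counts would suggest otherwise, which is the behaviour the abstract describes.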

    SODA: A Service Oriented Data Acquisition Framework


    TeXTracT: a Web-based Tool for Building NLP-enabled Applications

    Over the last few years, the software industry has shown an increasing interest in applications with Natural Language Processing (NLP) capabilities. Several cloud-based solutions have emerged with the purpose of simplifying and streamlining the integration of NLP techniques via Web services. These NLP techniques cover tasks such as language detection, entity recognition, sentiment analysis, and classification, among others. However, the services provided are not always as extensible and configurable as a developer may want, preventing their use in industry-grade developments and limiting their adoption in specialized domains (e.g., for analyzing technical documentation). In this context, we have developed a tool called TeXTracT that is designed to be composable, extensible, configurable, and accessible. In our tool, NLP techniques can be accessed independently and orchestrated in a pipeline via RESTful Web services. Moreover, the architecture supports the setup and deployment of NLP techniques on demand. The NLP infrastructure is built upon the UIMA framework, which defines communication protocols and uniform service interfaces for text analysis modules. TeXTracT has been evaluated in two case studies to assess its pros and cons. Sociedad Argentina de Informática e Investigación Operativa (SADIO).
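
    To make the pipeline idea concrete, the sketch below drives a chain of text analysis modules through hypothetical RESTful endpoints. The base URL, module names, and JSON fields are invented for illustration and do not reflect TeXTracT's actual API.

```python
# Hypothetical client-side orchestration of an NLP pipeline exposed as
# RESTful services. Endpoint URLs and JSON fields are assumptions, not
# TeXTracT's actual interface.
import requests

BASE_URL = "http://localhost:8080/textract"   # assumed deployment address

def run_pipeline(text, modules=("tokenizer", "pos-tagger", "ner")):
    """Send the text through each analysis module in order, feeding every
    module the annotations accumulated so far."""
    payload = {"text": text, "annotations": []}
    for module in modules:
        resp = requests.post(f"{BASE_URL}/{module}", json=payload, timeout=30)
        resp.raise_for_status()
        payload = resp.json()          # each service returns text + enriched annotations
    return payload["annotations"]

if __name__ == "__main__":
    print(run_pipeline("TeXTracT orchestrates UIMA analysis engines."))
```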

    Affective computing for smart operations: a survey and comparative analysis of the available tools, libraries and web services

    In this paper, we conduct an extensive survey of the sentiment analysis tools currently available on the market. Our aim is to optimize the human response in datacenter operations, using a combination of research tools that allow us to decrease human error in general operations when managing complex infrastructures. The use of sentiment analysis tools is a first step toward extending our capabilities for optimizing the human interface. Using different data collections from a variety of data sources, our research provides a very interesting outcome. In our final testing, we found that the three main commercial platforms (IBM Watson, Google Cloud, and Microsoft Azure), which are based on artificial neural network and deep learning techniques, achieve the same accuracy (89-90%) across the datasets tested. The other stand-alone applications or APIs, such as VADER or MeaningCloud, achieve a similar accuracy level on some of the datasets using a different approach, semantic networks such as ConceptNet, but the model can easily be optimized to above 90% accuracy just by adjusting some parameters of the semantic model. This paper points to future directions for optimizing datacenter operations management and decreasing human error in complex environments.
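
    For context, one of the stand-alone tools named above, VADER, can be exercised in a few lines through its Python package; the example sentences are invented and the scores shown are illustrative only, not the paper's evaluation data.

```python
# Minimal example of scoring text with VADER (vaderSentiment package).
# The input sentences are invented; scores are illustrative only.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for sentence in [
    "The migration finished ahead of schedule with zero downtime.",
    "The night shift missed another critical alert.",
]:
    scores = analyzer.polarity_scores(sentence)   # neg/neu/pos plus a compound score
    print(f"{scores['compound']:+.3f}  {sentence}")
```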

    Cross-Platform Text Mining and Natural Language Processing Interoperability - Proceedings of the LREC2016 conference

    No abstract available