The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows. The DBCLS BioHackathon Consortium*
Web services have become a key technology for bioinformatics, since life science databases are globally decentralized and the exponential increase in the amount of available data demands efficient systems without the need to transfer entire databases for every step of an analysis. However, various incompatibilities among database resources and analysis services make it difficult to connect and integrate them into interoperable workflows. To resolve this situation, we invited domain specialists from web service providers, client software developers, Open Bio* projects, the BioMoby project and researchers of emerging areas where a standard exchange data format is not well established, for an intensive collaboration entitled the BioHackathon 2008. The meeting was hosted by the Database Center for Life Science (DBCLS) and Computational Biology Research Center (CBRC) and was held in Tokyo from February 11th to 15th, 2008. In this report we highlight the work accomplished and the common issues that arose from this event, including the standardization of data exchange formats and services in the emerging fields of glycoinformatics, biological interaction networks, text mining, and phyloinformatics. In addition, common shared object development based on BioSQL, as well as technical challenges in large data management, asynchronous services, and security, are discussed. Consequently, we improved the interoperability of web services in several fields; however, further cooperation among major database centers and continued collaborative efforts between service providers and software developers are still necessary for effective advances in bioinformatics web service technologies.
Implementing a Portable Clinical NLP System with a Common Data Model - a Lisp Perspective
This paper presents a Lisp architecture for a portable NLP system, termed LAPNLP, for processing clinical notes. LAPNLP integrates multiple standard, customized and in-house developed NLP tools. Our system facilitates portability across different institutions and data systems by incorporating an enriched Common Data Model (CDM) to standardize necessary data elements. It utilizes UMLS to perform domain adaptation when integrating generic-domain NLP tools. It also features stand-off annotations that are specified by positional reference to the original document. We built an interval tree based search engine to efficiently query and retrieve the stand-off annotations by specifying positional requirements. We also developed a utility to convert an inline annotation format to stand-off annotations to enable the reuse of clinical text datasets with inline annotations. We experimented with our system on several NLP-facilitated tasks, including computational phenotyping for lymphoma patients and semantic relation extraction for clinical notes. These experiments showcased the broader applicability and utility of LAPNLP.
Comment: 6 pages, accepted by IEEE BIBM 2018 as a regular paper.
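The most concrete mechanism described above is the interval-tree index over stand-off annotations. LAPNLP itself is a Lisp system and its internals are not reproduced here; the Python sketch below only illustrates the underlying idea of indexing (start, end) annotation spans and answering positional overlap queries. The Annotation class, its fields, and the sample spans are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Annotation:
    start: int   # character offset into the original document (inclusive)
    end: int     # character offset (exclusive)
    label: str
    value: Any = None

class IntervalNode:
    """Centered interval tree: annotations overlapping `center` live at this node."""
    def __init__(self, annotations: List[Annotation]):
        starts = sorted(a.start for a in annotations)
        self.center = starts[len(starts) // 2]
        here, left, right = [], [], []
        for a in annotations:
            if a.end <= self.center:       # entirely left of the center point
                left.append(a)
            elif a.start > self.center:    # entirely right of the center point
                right.append(a)
            else:                          # spans the center point
                here.append(a)
        self.here = here
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query(self, start: int, end: int) -> List[Annotation]:
        """Return all annotations whose span overlaps [start, end)."""
        hits = [a for a in self.here if a.start < end and start < a.end]
        if self.left and start < self.center:
            hits += self.left.query(start, end)
        if self.right and end > self.center:
            hits += self.right.query(start, end)
        return hits

annotations = [
    Annotation(0, 7, "TOKEN", "Patient"),
    Annotation(22, 30, "DISEASE", "lymphoma"),
    Annotation(22, 47, "PHENOTYPE", "lymphoma, stage II"),
]
index = IntervalNode(annotations)
print(index.query(20, 35))   # the two annotations overlapping characters 20-35
```

A call such as index.query(20, 35) returns every stand-off annotation whose span overlaps characters 20-35 of the original document, which is the kind of positional requirement the abstract refers to.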
Impliance: A Next Generation Information Management Appliance
ably successful in building a large market and adapting to the changes of the last three decades, its impact on the broader market of information management is surprisingly limited. If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems?" In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements, namely: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data, unstructured as well as structured, in a uniform way; (c) achieving scale-out by exploiting simple, massively parallel processing; and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises.
Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA.
Query Optimization for On-Demand Information Extraction Tasks over Text Databases
Many modern applications involve analyzing large amounts of data that come from unstructured text documents. In its original format, this data contains information that, if extracted, can give more insight and help in the decision-making process. The ability to answer structured SQL queries over unstructured data allows for more complex data analysis. Querying unstructured data can be accomplished with the help of information extraction (IE) techniques. The traditional way is to use the Extract-Transform-Load (ETL) approach, which performs all possible extractions over the document corpus and stores the extracted relational results in a data warehouse; the extracted data is then queried. The ETL approach produces results that are out of date and causes an explosion in the number of possible relations and attributes to extract. Therefore, new approaches that perform extraction on the fly were developed; however, previous efforts relied on specialized extraction operators or particular IE algorithms, which limited the optimization opportunities for such queries.
In this work, we propose an online approach that integrates the engine of the database management system with IE systems using a new type of view called an extraction view. Queries on text documents are evaluated using these extraction views, which are populated at query time with newly extracted data. Our approach enables the optimizer to apply the full range of well-established optimization techniques. The optimizer selects the best execution plan using a defined cost model that considers a user-defined balance between the cost and quality of extraction, and we explain the trade-off between the two factors. The main contribution is the ability to run on-demand information extraction that takes the latest changes in the data into account while avoiding unnecessary extraction from irrelevant text documents.
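To make the extraction-view idea concrete, here is a minimal, self-contained sketch, not the authors' implementation: the view's backing table starts empty and is filled at query time, only for the documents the query actually needs, before the user's SQL runs against it. The schema, the regex standing in for an IE operator, and the function names (query_extraction_view, extract_person_org) are illustrative assumptions.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents(doc_id INTEGER PRIMARY KEY, source TEXT, body TEXT);
    -- Backing table for the extraction view; starts empty.
    CREATE TABLE person_org_view(doc_id INTEGER, person TEXT, org TEXT);
""")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(1, "news", "Alice Smith joined Acme Corp last year."),
     (2, "blog", "Nothing relevant here."),
     (3, "news", "Bob Jones left Globex Inc in March.")],
)

def extract_person_org(text):
    """Toy IE 'operator': a regex stands in for a real extraction pipeline."""
    m = re.search(r"([A-Z]\w+ [A-Z]\w+) (?:joined|left) ([A-Z]\w+ \w+)", text)
    return [(m.group(1), m.group(2))] if m else []

def query_extraction_view(sql, doc_filter="1=1"):
    """Populate the view for relevant, not-yet-extracted documents, then run the SQL."""
    docs = conn.execute(
        f"SELECT doc_id, body FROM documents WHERE {doc_filter} "
        "AND doc_id NOT IN (SELECT doc_id FROM person_org_view)")
    for doc_id, body in docs.fetchall():
        for person, org in extract_person_org(body):
            conn.execute("INSERT INTO person_org_view VALUES (?, ?, ?)",
                         (doc_id, person, org))
    return conn.execute(sql).fetchall()

# Only the 'news' documents are pushed through the extractor for this query.
print(query_extraction_view(
    "SELECT person, org FROM person_org_view ORDER BY person",
    doc_filter="source = 'news'"))
```

A cost- and quality-aware optimizer, as described above, would additionally decide which extractor configuration to run and which documents to skip; the sketch only shows the lazy, on-demand population itself.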
Scalable and Declarative Information Extraction in a Parallel Data Analytics System
Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system is still a major challenge. This work presents a query-based IE system for a parallel data analysis platform, which is configurable for specific application domains and scales to terabyte-sized text collections. First, configurable operators are defined for basic IE and Web analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) in dataflows. Subsequently, we survey the state of the art in optimizing non-relational dataflows and show that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable a comprehensive optimization of dataflows. SOFA is able to logically optimize arbitrary dataflows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated, and we systematically evaluate the scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets.
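The central idea above is that non-relational UDFs expose enough semantics (which fields they read and write, rough costs) for a logical optimizer to reorder them safely. The Python toy below is not SOFA's actual interface or the underlying parallel platform; it merely illustrates, with invented operator names and a single simplified rewrite rule, how declared read/write sets can justify pushing a cheap filter ahead of an expensive extraction UDF.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class Operator:
    name: str
    fn: Callable[[dict], List[dict]]       # record -> zero or more records
    reads: Set[str] = field(default_factory=set)
    writes: Set[str] = field(default_factory=set)
    cost: float = 1.0                       # rough per-record cost estimate

def push_filters_down(plan: List[Operator]) -> List[Operator]:
    """Move a cheap operator before its predecessor whenever it reads none of
    the fields that predecessor writes (a simplified reordering rule)."""
    plan = plan[:]
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, cur = plan[i - 1], plan[i]
            if cur.cost < prev.cost and not (cur.reads & prev.writes):
                plan[i - 1], plan[i] = cur, prev
                changed = True
    return plan

def run(plan: List[Operator], records: List[dict]) -> List[dict]:
    for op in plan:
        records = [out for rec in records for out in op.fn(rec)]
    return records

# Expensive "entity extraction" UDF followed by a cheap language filter:
extract = Operator("extract_entities",
                   lambda r: [dict(r, entities=r["text"].split())],
                   reads={"text"}, writes={"entities"}, cost=100.0)
lang_filter = Operator("keep_english",
                       lambda r: [r] if r["lang"] == "en" else [],
                       reads={"lang"}, writes=set(), cost=1.0)

plan = push_filters_down([extract, lang_filter])
print([op.name for op in plan])   # ['keep_english', 'extract_entities']
print(run(plan, [{"lang": "en", "text": "Bonn is in Germany"},
                 {"lang": "de", "text": "Bonn liegt in Deutschland"}]))
```

Running the sketch reorders the plan so that the expensive extractor only sees records that survive the filter, which is the kind of rewrite a UDF-aware logical optimizer enables.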
Discovery of protein-protein interactions using a combination of linguistic, statistical and graphical information
BACKGROUND: The rapid publication of important research in the biomedical literature makes it increasingly difficult for researchers to keep current with significant work in their area of interest. RESULTS: This paper reports a scalable method for the discovery of protein-protein interactions in Medline abstracts, using a combination of text analytics, statistical and graphical analysis, and a set of easily implemented rules. Applying these techniques to 12,300 abstracts, a precision of 0.61 and a recall of 0.97 were obtained (f = 0.74); when allowing for two-hop and three-hop relations discovered by graphical analysis, the precision was 0.74 (f = 0.83). CONCLUSION: This combination of linguistic and statistical approaches appears to provide the highest precision and recall thus far reported for detecting protein-protein relations using text analytic approaches.
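For reference, the reported f values are the balanced F-measure of precision P and recall R, and the first configuration's figures are consistent with it up to rounding:

```latex
F_1 = \frac{2PR}{P + R}, \qquad
\frac{2 \times 0.61 \times 0.97}{0.61 + 0.97} \approx 0.75
\quad (\text{reported as } f = 0.74, \text{ presumably computed from unrounded } P \text{ and } R).
```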
Processamento automático de texto de narrativas clínicas (Automatic text processing of clinical narratives)
The informatization of medical systems and the subsequent move by medical professionals towards the usage of Electronic Health Records (EHR) over the paper format allowed for safer and more efficient healthcare. Additionally, EHR can also be used as a data source for observational studies around the world. However, it is estimated that 70-80% of all clinical data is in the form of unstructured free text, and the data that is structured does not all follow the same standards, making it difficult to use in the mentioned observational studies.
This dissertation aims to tackle those two adversities by using natural language processing for the task of extracting concepts from free text and, afterwards, using a common data model to harmonize the data. The developed system employs an annotator, namely cTAKES, to extract the concepts from free text. The extracted concepts are then normalized using text preprocessing, word embeddings, MetaMap, and UMLS Metathesaurus lookup. Finally, the normalized concepts are converted to the OMOP Common Data Model and stored in a database.
In order to test the developed system, the i2b2 2010 data set was used. The different components of the system were tested and evaluated separately, with the concept extraction component achieving a precision, recall, and F-score of 77.12%, 70.29%, and 73.55%, respectively. The normalization component was evaluated by completing track 3 of the N2C2 2019 challenge, where it achieved 77.5% accuracy. Finally, in the OMOP CDM conversion step, it was observed that 7.92% of the concepts were lost during the process. In conclusion, even though the developed system still has margin for improvement, it proves to be a viable method of automatically processing clinical narratives.
Mestrado em Engenharia de Computadores e Telemática
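To illustrate the final stage of the pipeline described above, and one way concepts can be "lost" when no mapping exists, here is a minimal sketch of writing a normalized concept to an OMOP-CDM-style CONDITION_OCCURRENCE table. The column names follow common OMOP CDM v5 conventions, but the CUI-to-concept lookup, the concept ids, and the store_condition helper are invented for the example and are not the dissertation's actual conversion code.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE condition_occurrence(
        condition_occurrence_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        condition_concept_id INTEGER,
        condition_start_date TEXT,
        condition_source_value TEXT)
""")

# Toy lookup from UMLS CUIs to OMOP standard concept ids (values are illustrative).
CUI_TO_OMOP_CONCEPT = {"C0024299": 432571}   # a lymphoma-like example mapping

def store_condition(person_id, cui, source_text, start=date.today()):
    """Map a normalized CUI to an OMOP concept id and persist it, if a mapping exists."""
    concept_id = CUI_TO_OMOP_CONCEPT.get(cui)
    if concept_id is None:
        return False   # unmapped concepts are dropped here (cf. the 7.92% loss above)
    conn.execute(
        "INSERT INTO condition_occurrence(person_id, condition_concept_id,"
        " condition_start_date, condition_source_value) VALUES (?, ?, ?, ?)",
        (person_id, concept_id, start.isoformat(), source_text))
    return True

store_condition(1, "C0024299", "malignant lymphoma")
print(conn.execute("SELECT * FROM condition_occurrence").fetchall())
```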
Social media analytics: a survey of techniques, tools and platforms
This paper is written for (social science) researchers seeking to analyze the wealth of social media now available. It presents a comprehensive review of software tools for social networking media, wikis, Really Simple Syndication (RSS) feeds, blogs, newsgroups, chat and news feeds. For completeness, it also includes introductions to social media scraping, storage, data cleaning and sentiment analysis. Although principally a review, the paper also provides a methodology and a critique of social media tools. Analyzing social media, in particular Twitter feeds for sentiment analysis, has become a major research and business activity due to the availability of web-based application programming interfaces (APIs) provided by Twitter, Facebook and news services. This has led to an ‘explosion’ of data services, software tools for scraping and analysis, and social media analytics platforms. It is also a research area undergoing rapid change and evolution due to commercial pressures and the potential for using social media data for computational (social science) research. Using a simple taxonomy, this paper provides a review of leading software tools and how to use them to scrape, cleanse and analyze the spectrum of social media. In addition, it discusses the requirements for an experimental computational environment for social media research and presents, as an illustration, the system architecture of a social media (analytics) platform built by University College London. The principal contribution of this paper is to provide an overview (including code fragments) for scientists seeking to utilize social media scraping and analytics either in their research or business. The data retrieval techniques presented in this paper were valid at the time of writing (June 2014), but they are subject to change since social media data scraping APIs are rapidly changing.
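As a small taste of the "scrape, cleanse and analyze" workflow the survey covers, the sketch below cleans two made-up posts and scores them against a tiny sentiment lexicon. It is purely illustrative: the lexicon, the posts, and the cleanse/sentiment helpers are invented here, and a real pipeline would retrieve data through a platform API and use a proper sentiment model, as discussed in the paper.

```python
import re

# Minimal word lists standing in for a real sentiment lexicon.
POSITIVE = {"great", "love", "good", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def cleanse(post: str) -> str:
    """Remove URLs and @mentions, strip '#' signs, and lower-case the rest."""
    post = re.sub(r"https?://\S+|@\w+", "", post)
    return post.replace("#", "").lower()

def sentiment(post: str) -> int:
    """Crude lexicon score: +1 per positive word, -1 per negative word."""
    words = re.findall(r"[a-z']+", cleanse(post))
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

posts = ["Love the new #opendata portal! https://example.org",
         "@acme support is terrible, awful experience"]
for p in posts:
    print(sentiment(p), cleanse(p))
```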