101 research outputs found

    Building a protein name dictionary from full text: a machine learning term extraction approach

    BACKGROUND: The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature. RESULTS: We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. CONCLUSION: This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt
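
    A minimal sketch of the approach described above (collect high-frequency terms within each article, then classify each candidate with an SVM) could look like the following. The surface features, the frequency threshold, and the training data here are illustrative assumptions for this sketch, not the paper's actual implementation:

```python
# Sketch: per-article high-frequency term collection + SVM classification.
# Features and thresholds are placeholders, not the published method's.
import re
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

def candidate_terms(article_text, min_count=3):
    """Collect terms that recur within a single article."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9-]+", article_text)
    return [t for t, c in Counter(tokens).items() if c >= min_count]

def term_features(term):
    """Simple surface features of a candidate term."""
    return {
        "has_digit": any(ch.isdigit() for ch in term),
        "has_upper": any(ch.isupper() for ch in term),
        "has_hyphen": "-" in term,
        "length": len(term),
        "ends_in_ase": term.lower().endswith("ase"),
    }

def train(labeled_terms):
    """labeled_terms: iterable of (term, is_protein_name) pairs."""
    vec = DictVectorizer()
    X = vec.fit_transform(term_features(t) for t, _ in labeled_terms)
    y = [label for _, label in labeled_terms]
    clf = SVC(kernel="linear").fit(X, y)
    return vec, clf

def extract_protein_names(article_text, vec, clf):
    terms = candidate_terms(article_text)
    if not terms:
        return []
    preds = clf.predict(vec.transform(term_features(t) for t in terms))
    return [t for t, is_protein in zip(terms, preds) if is_protein]
```

    Because only terms that recur within an article are scored, the classifier sees a small candidate set per document, which is consistent with the computational efficiency the abstract claims.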

    Critical evaluation of the JDO API for the persistence and portability requirements of complex biological databases

    BACKGROUND: Complex biological database systems have become key computational tools used daily by scientists and researchers. Many of these systems must be capable of executing on multiple different hardware and software configurations and are also often made available to users via the Internet. We have used the Java Data Objects (JDO) persistence technology to develop the database layer of such a system, known as the SigPath information management system. SigPath is an example of a complex biological database that needs to store various types of information connected by many relationships. RESULTS: Using this system as an example, we perform a critical evaluation of current JDO technology and discuss the suitability of the JDO standard for achieving portability, scalability and performance. We show that JDO supports portability of the SigPath system from a relational database backend to an object database backend and achieves acceptable scalability. To answer the performance question, we have created the SigPath JDO application benchmark, which we distribute under the GNU General Public License. This benchmark can be used as an example of using JDO technology to create a complex biological database, and it makes it possible for vendors and users of the technology to evaluate the performance of other JDO implementations for similar applications. CONCLUSIONS: The SigPath JDO benchmark and our discussion of JDO technology in the context of biological databases will be useful to bioinformaticians who design new complex biological databases and aim to create systems that can be ported easily to a variety of database backends
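
    JDO is a Java API, so the sketch below does not show JDO itself; it only mirrors, in schematic Python, the structure of a persistence benchmark of this kind: the same object workload is timed against interchangeable storage backends. All names are hypothetical and none of this is the SigPath benchmark's code:

```python
# Schematic analogue of a persistence benchmark: time an identical
# workload (store, query, update) against pluggable backends.
import time

class InMemoryBackend:
    """Toy backend; a real benchmark would plug in JDO over a
    relational or object database here."""
    def __init__(self):
        self.objects = []
    def store(self, objects):
        self.objects.extend(objects)
    def query(self, predicate):
        return [o for o in self.objects if predicate(o)]
    def update(self, predicate, mutate):
        for o in self.objects:
            if predicate(o):
                mutate(o)

def run_benchmark(backend, objects, predicate, mutate):
    timings = {}
    t0 = time.perf_counter()
    backend.store(objects)
    timings["store"] = time.perf_counter() - t0
    t0 = time.perf_counter()
    hits = backend.query(predicate)
    timings["query"] = time.perf_counter() - t0
    t0 = time.perf_counter()
    backend.update(predicate, mutate)
    timings["update"] = time.perf_counter() - t0
    return timings, len(hits)

# Example workload: records with an attribute to filter on.
records = [{"id": i, "score": i % 10} for i in range(100_000)]
timings, n = run_benchmark(
    InMemoryBackend(), records,
    predicate=lambda o: o["score"] == 3,
    mutate=lambda o: o.update(score=0),
)
print(timings, n)
```

    Running the same harness against a relational and an object backend is the schematic equivalent of the portability and performance comparison the abstract describes.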

    MetaR: simple, high-level languages for data analysis with the R ecosystem

    ABSTRACT: Data analysis tools have become essential to the study of biology. Tools available today were constructed with layers of technology developed over decades. Here, we explain how some of the principles used to develop this technology are sub-optimal for the construction of data analysis tools for biologists. In contrast, we applied language workbench technology (LWT) to create a data analysis language, called MetaR, tailored both for biologists with no programming experience and for expert bioinformaticians and statisticians. A key novelty of this approach is its ability to blend user interface with scripting in such a way that beginners and experts alike can analyze data productively in the same analysis platform. While presenting MetaR, we explain how a judicious use of LWT eliminates problems that have historically contributed to data analysis bottlenecks. These results show that language design with LWT can be a compelling approach for developing intelligent data analysis tools.

    In this manuscript, we discuss several drawbacks of encoding programs as text. One question we were particularly interested in testing was whether we could create an analysis tool that would blur the boundary between programming/scripting languages and graphical user interfaces. Programming languages such as the R language (Ihaka and Gentleman [1996]) are frequently preferred for data analysis by experts. They have so far been the most flexible and powerful tools for data analysis, but require a steep learning curve. In contrast, beginners tend to prefer data analysis software with a graphical user interface, which is easier to learn but is eventually found to lack flexibility and extensibility. We reasoned that blending these two types of interfaces into one tool could provide a simpler way to analyze data. We found that LWT made it straightforward to develop a data analysis tool that blurs the distinction between graphical user interface and scripting. While the implementation was straightforward, the design of this novel type of analysis tool was an iterative process that benefited from frequent feedback from users of the tool. In this manuscript, we describe the goals of the language, explain how the tool can be used, and highlight the most innovative aspects of the language compared to other tools used for data analysis, such as the R language (Ihaka and Gentleman [1996]) or electronic notebooks.

    The initial focus of MetaR was on the analysis of RNA-Seq data and the creation of heatmaps, but the tool is general and can be readily extended to support a broad range of data analyses. For instance, we have used MetaR to analyze data in a study of the association between the allogenomics score and kidney graft function (Mesnard et al. [2015]). We chose to focus on the construction of heatmaps as a use case and illustration for this study because this activity is of interest to many biologists who obtain high-throughput data. Interestingly, we found that both beginners and experts can benefit from blending user interfaces and scripting. Beginners benefit because the MetaR user interface is much simpler to learn than the full R programming language. Expert users benefit because they can develop high-level language elements to simplify repetitive aspects of data analysis in ways that text-based programming languages cannot achieve.

    LANGUAGE WORKBENCH TECHNOLOGY PRIMER: Since many readers may not be familiar with LWT, this section briefly describes how this technology differs from traditional text-based technology. Text-based programming languages are implemented with compilers that internally convert the text representation of the source code into an abstract syntax tree (AST), a data structure used when analyzing and transforming programming languages into machine code. In the MPS LW, the AST is also a central data structure, but the parsing elements of traditional compilers are not needed: programs are edited directly as ASTs, and saving an AST to disk is done using serialization (loading is conversely done via deserialization into in-memory AST data structures). The choice of serialization rather than encoding with text has a profound consequence: because no parsing is required, languages can be composed and extended without the grammar ambiguities that affect text-based languages (Benson and Campagne [2015]). In this manuscript, we extensively use language composition to extend the R language and provide the ability to embed user interfaces into R programs.

    [Figure: Abstract Syntax Tree (AST)]
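
    The difference between text encoding and AST serialization can be made concrete with a toy example: the program tree is written to disk and read back without any parser being involved. The node layout below is invented for illustration; MPS uses its own storage format:

```python
# Sketch: persisting a program as a serialized AST instead of text.
# Loading requires only deserialization, never parsing.
import json

def save_ast(node, path):
    with open(path, "w") as f:
        json.dump(node, f)

def load_ast(path):
    with open(path) as f:
        return json.load(f)

# A tiny AST for the R expression `x <- 1 + 2`, as nested nodes.
ast = {
    "concept": "Assignment",
    "target": {"concept": "Variable", "name": "x"},
    "value": {
        "concept": "BinaryOp", "op": "+",
        "left": {"concept": "Number", "value": 1},
        "right": {"concept": "Number", "value": 2},
    },
}

save_ast(ast, "program.ast.json")
assert load_ast("program.ast.json") == ast  # round-trip, no parser
```

    Because each node names its concept explicitly, nodes from different languages can coexist in one tree, which is what makes the language composition mentioned above tractable.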

    Objective and automated protocols for the evaluation of biomedical search engines using No Title Evaluation protocols

    BACKGROUND: The evaluation of information retrieval techniques has traditionally relied on human judges to determine which documents are relevant to a query and which are not. This protocol is used in the Text REtrieval Conference (TREC), organized annually for the past 15 years, to support the unbiased evaluation of novel information retrieval approaches. The TREC Genomics Track has recently been introduced to measure the performance of information retrieval for biomedical applications. RESULTS: We describe two protocols for evaluating biomedical information retrieval techniques without human relevance judgments. We call these protocols No Title Evaluation (NT Evaluation). The first protocol measures performance for focused searches, where only one relevant document exists for each query. The second protocol measures performance for queries expected to have potentially many relevant documents per query (high-recall searches). Both protocols take advantage of the clear separation of titles and abstracts found in Medline. We compare the performance obtained with these evaluation protocols to results obtained by reusing the relevance judgments produced in the 2004 and 2005 TREC Genomics Tracks and observe significant correlations between the performance rankings generated by our approach and by TREC. Spearman's correlation coefficients in the range of 0.79–0.92 are observed when comparing method rankings based on bpref measured with NT Evaluation to rankings based on bpref measured with TREC relevance judgments. For comparison, coefficients in the range 0.86–0.94 are observed when evaluating the same set of methods with data from two independent TREC Genomics Track evaluations. We discuss the advantages of NT Evaluation over the TRels and the data fusion evaluation protocols introduced recently. CONCLUSION: Our results suggest that the NT Evaluation protocols described here could be used to optimize some search engine parameters before human evaluation. Further research is needed to determine if NT Evaluation or variants of these protocols can fully substitute for human evaluations
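
    The first protocol admits a compact sketch: withhold titles from the indexed collection, issue each title as a query, and score the engine by how highly it ranks the abstract that title belongs to. Here `search` stands in for the engine under evaluation, and mean reciprocal rank is used purely for simplicity (the paper reports bpref):

```python
# Sketch of the focused-search NT Evaluation protocol. `records` pairs
# each Medline title with its abstract; `search(query, docs)` is the
# engine under test and must return document indices, best first.
def nt_evaluate(records, search):
    abstracts = [abstract for _, abstract in records]
    reciprocal_ranks = []
    for doc_id, (title, _) in enumerate(records):
        ranking = search(title, abstracts)    # the title is the query
        rank = ranking.index(doc_id) + 1      # the matching abstract is
        reciprocal_ranks.append(1.0 / rank)   # the one relevant document
    return sum(reciprocal_ranks) / len(records)
```

    Comparing `nt_evaluate` scores across engines yields a ranking of methods that, per the abstract, correlates well with rankings derived from human relevance judgments.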

    Compression of Structured High-Throughput Sequencing Data

    Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size by more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated into a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays. Funding: National Center for Research Resources (U.S.) (Grant UL1 RR024996); Leukemia & Lymphoma Society of America (Translational Research Program Grant LLS 6304-11); National Institute of Mental Health (U.S.) (R01 MH086883)
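
    One ingredient of this kind of compression, grouping values of the same field together and delta-encoding ordered fields before a general-purpose compressor runs, can be demonstrated in a few lines. This toy comparison only illustrates why field-wise organization helps; the actual format adds schema evolution support, domain-specific codecs and the multi-tier organization discussed above:

```python
# Toy comparison: record-interleaved vs field-wise (column) storage of
# alignment-like records, both gzip-compressed.
import gzip
import json

records = [{"chrom": "chr1", "pos": 10_000 + 3 * i, "mapq": 60}
           for i in range(10_000)]

row_wise = json.dumps(records).encode()

# Field-wise: one stream per field; positions stored as deltas
# (first value absolute, the rest are successive differences).
positions = [r["pos"] for r in records]
deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
columns = {
    "chrom": [r["chrom"] for r in records],
    "pos_delta": deltas,
    "mapq": [r["mapq"] for r in records],
}
col_wise = json.dumps(columns).encode()

print("row-wise:  ", len(gzip.compress(row_wise)))
print("field-wise:", len(gzip.compress(col_wise)))  # markedly smaller
```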

    Beyond tissueInfo: functional prediction using tissue expression profile similarity searches

    We present and validate tissue expression profile similarity searches (TEPSS), a computational approach to identify transcripts that share similar tissue expression profiles with one or more transcripts in a group of interest. We evaluated TEPSS for its ability to discriminate between pairs of transcripts coding for interacting proteins and non-interacting pairs. We found that ordering protein–protein pairs by TEPSS score produces sets significantly enriched in reported pairs of interacting proteins [interacting versus non-interacting pairs, odds ratio (OR) = 157.57, 95% confidence interval (CI) 36.81–375.51 at 1% coverage, employing a large dataset of about 50,000 human protein interactions]. When used with multiple transcripts as input, we find that TEPSS can predict non-obvious members of the cytosolic ribosome. We used TEPSS to predict S-nitrosylation (SNO) protein targets from a set of brain proteins that undergo SNO upon exposure to physiological levels of S-nitrosoglutathione in vitro. While some of the top TEPSS predictions have been validated independently, several of the strongest SNO TEPSS predictions await experimental validation. Our data indicate that TEPSS is an effective and flexible approach to functional prediction. Since the approach does not use sequence similarity, we expect that TEPSS will be useful for various gene discovery applications. TEPSS programs and data are distributed at http://icb.med.cornell.edu/crt/tepss/index.xml
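
    In outline, a tissue expression profile similarity search ranks every transcript by how similar its expression-across-tissues vector is to a query profile. The sketch below uses Pearson correlation as the similarity score; the TEPSS score itself is defined in the paper:

```python
# Sketch: rank transcripts by the similarity of their tissue expression
# profile to a query profile (Pearson correlation used as the score).
import numpy as np

def profile_similarity_search(query_profile, profiles, transcript_ids):
    """profiles: transcripts x tissues matrix of expression levels."""
    q = (query_profile - query_profile.mean()) / query_profile.std()
    z = (profiles - profiles.mean(axis=1, keepdims=True)) \
        / profiles.std(axis=1, keepdims=True)
    scores = z @ q / len(q)              # Pearson r against the query
    order = np.argsort(-scores)          # most similar first
    return [(transcript_ids[i], float(scores[i])) for i in order]
```

    Transcripts near the top of such a ranking are the candidates a tool like TEPSS would report, e.g. putative interaction partners or, with multiple query transcripts, putative members of a complex such as the cytosolic ribosome.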

    DNA Methylation Signatures Identify Biologically Distinct Subtypes in Acute Myeloid Leukemia

    Abstract: We hypothesized that DNA methylation distributes into specific patterns in cancer cells, which reflect critical biological differences. We therefore examined the methylation profiles of 344 patients with acute myeloid leukemia (AML). Clustering of these patients by methylation data segregated patients into 16 groups. Five of these groups defined new AML subtypes that shared no other known feature. In addition, DNA methylation profiles segregated patients with CEBPA aberrations from other subtypes of leukemia, defined four epigenetically distinct forms of AML with NPM1 mutations, and showed that the established AML1-ETO, CBFb-MYH11, and PML-RARA leukemia entities are associated with specific methylation profiles. We report a 15-gene methylation classifier predictive of overall survival in an independent patient cohort (p < 0.001, adjusted for known covariates)
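
    The clustering step can be sketched with standard hierarchical clustering of patient methylation profiles; the data and cluster count below are toy values (the study clustered 344 patients and obtained 16 groups):

```python
# Sketch: hierarchical clustering of patients by methylation profile.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
beta = rng.random((40, 500))      # patients x CpG sites (toy beta values)

Z = linkage(beta, method="ward")  # cluster patients, not sites
labels = fcluster(Z, t=5, criterion="maxclust")  # cut tree into 5 groups
print(np.bincount(labels)[1:])    # number of patients in each cluster
```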

    Mining expressed sequence tags identifies cancer markers of clinical interest

    BACKGROUND: Gene expression data are a rich source of information about the transcriptional dysregulation of genes in cancer. Genes that display differential regulation in cancer are a subtype of cancer biomarkers. RESULTS: We present an approach to mine expressed sequence tags to discover cancer biomarkers. A false discovery rate analysis suggests that the approach generates fewer than 22% false discoveries when applied to combined human and mouse whole-genome screens. With this approach, we identify the 200 genes most consistently differentially expressed in cancer (called HM200) and proceed to characterize these genes. When used for prediction in a variety of cancer classification tasks (24 independent cancer microarray datasets, 59 classifications in total), we show that HM200 and the shorter gene list HM100 are very competitive cancer biomarker sets. Indeed, when compared to 13 published cancer marker gene lists, HM200 achieves the best or second-best classification performance in 79% of the classifications considered. CONCLUSION: These results indicate the existence of at least one general cancer marker set whose predictive value spans several tumor types and classification types. Our comparison with other marker gene lists shows that HM200 markers are mostly novel cancer markers. We also identify the previously published Pomeroy-400 list as another general cancer marker set. Strikingly, Pomeroy-400 has 27 genes in common with HM200. Our data suggest that a core set of genes is responsive to the deregulation of pathways involved in tumorigenesis in a variety of tumor types, and that these genes could serve as transcriptional cancer markers in applications of clinical interest. Finally, our study suggests new strategies to select and evaluate cancer biomarkers in microarray studies
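
    In outline, the mining step tests each gene for differential EST counts between cancer and normal libraries, then controls the false discovery rate across all genes. The sketch below uses Fisher's exact test with Benjamini-Hochberg control on simulated counts; the study's statistics and FDR analysis differ in detail:

```python
# Sketch: per-gene differential EST-count test + Benjamini-Hochberg FDR.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(1)
genes = [f"gene{i}" for i in range(200)]
cancer = rng.poisson(5, size=200)   # toy EST counts, cancer libraries
normal = rng.poisson(5, size=200)   # toy EST counts, normal libraries
cancer_total, normal_total = int(cancer.sum()), int(normal.sum())

pvals = []
for c, n in zip(cancer, normal):
    table = [[int(c), cancer_total - int(c)],
             [int(n), normal_total - int(n)]]
    _, p = fisher_exact(table)      # is this gene's share of tags skewed?
    pvals.append(p)

# Benjamini-Hochberg: keep the k best genes, where k is the largest
# rank with p_(k) <= alpha * k / m.
alpha, m = 0.05, len(pvals)
order = np.argsort(pvals)
sorted_p = np.asarray(pvals)[order]
passing = np.nonzero(sorted_p <= alpha * np.arange(1, m + 1) / m)[0]
k = passing[-1] + 1 if passing.size else 0
print([genes[i] for i in order[:k]])  # candidate marker genes
```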
