
    Making sense of EST sequences by CLOBBing them

    BACKGROUND: Expressed sequence tags (ESTs) are single-pass reads from randomly selected cDNA clones. They provide a highly cost-effective method to access and identify expressed genes. However, they are often prone to sequencing errors and typically define incomplete transcripts. To increase the amount of information obtainable from ESTs and reduce sequencing errors, it is necessary to cluster ESTs into groups sharing significant sequence similarity. RESULTS: As part of our ongoing EST programs investigating 'orphan' genomes, we have developed a clustering algorithm, CLOBB (Cluster on the basis of BLAST similarity), to identify and cluster ESTs. CLOBB may be used incrementally, preserving original cluster designations. It tracks cluster-specific events such as merging, identifies 'superclusters' of related clusters and avoids the expansion of chimeric clusters. Based on the Perl scripting language, CLOBB is highly portable, relying only on a local installation of NCBI's freely available BLAST executable, and can be usefully applied to >95% of current EST datasets. Analysis of the Danio rerio EST dataset demonstrates that CLOBB compares favourably with two less portable systems, UniGene and TIGR Gene Indices. CONCLUSIONS: CLOBB provides a highly portable EST clustering solution and can be freely downloaded from: http://www.nematodes.org/CLOB
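
    The abstract describes grouping ESTs by significant BLAST similarity, with incremental updates and explicit tracking of cluster merges. As an illustration of the general idea only (not CLOBB's actual Perl implementation), the Python sketch below clusters EST identifiers with a merge-tracking union-find over precomputed BLAST hit pairs; the input format and bitscore threshold are assumptions.

```python
# Illustrative sketch only -- not CLOBB's actual algorithm or code.
# Assumes BLAST has already been run all-vs-all and hits parsed into
# (query_id, subject_id, bitscore) tuples; the threshold is hypothetical.

class UnionFind:
    """Merge-tracking disjoint sets: ESTs sharing a strong BLAST hit
    end up in the same cluster, and merge events are recorded."""
    def __init__(self):
        self.parent = {}
        self.merges = []          # history of cluster merges

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.merges.append((ra, rb))
            self.parent[rb] = ra

def cluster_ests(blast_hits, min_bitscore=100.0):
    """Group EST ids into clusters based on pairwise BLAST similarity."""
    uf = UnionFind()
    for query, subject, bitscore in blast_hits:
        uf.find(query)            # register singletons
        uf.find(subject)
        if query != subject and bitscore >= min_bitscore:
            uf.union(query, subject)
    clusters = {}
    for est in uf.parent:
        clusters.setdefault(uf.find(est), []).append(est)
    return clusters

# Example: three ESTs, two linked by a strong hit.
hits = [("EST1", "EST2", 250.0), ("EST2", "EST3", 40.0)]
print(cluster_ests(hits))   # {'EST1': ['EST1', 'EST2'], 'EST3': ['EST3']}
```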

    Improving cancer subtype diagnosis and grading using clinical decision support system based on computer-aided tissue image analysis

    This research focuses on the development of a clinical decision support system (CDSS) based on a cellular and tissue image analysis and classification system that improves consistency and facilitates the clinical decision-making process. In a typical cancer examination, pathologists make a diagnosis by manually reading morphological features in patient biopsy images, in which cancer biomarkers are highlighted using different staining techniques. This process is subject to the pathologist's training and experience, especially when the same cancer has several subtypes (i.e. benign tumor subtype vs. malignant subtype) and the same cancer tissue biopsy contains heterogeneous morphologies in different locations. The variability in pathologists' manual readings may result in varying cancer diagnoses and treatments. This Ph.D. research aims to reduce the subjectivity and variation in traditional histopathological reading of patient tissue biopsy slides through Computer-Aided Diagnosis (CAD). Using CAD, quantitative molecular profiling of cancer biomarkers in stained biopsy images is obtained by extracting and analyzing texture and cellular-structure features. In addition, cancer subtype classification and semi-automatic grade scoring (i.e. clinical decision making) can be performed with improved consistency over a large number of cancer subtype images. CAD tools have their own limitations, however, and in certain cases clinicians prefer systems that are flexible and take their individuality into account by providing some control, rather than a fully automated system. Therefore, to introduce a CDSS into health care, we need to understand users' perspectives on and preferences for the new information technology. This forms the basis for this research, in which we aim to present the quantitative information acquired through image analysis, annotate the images and provide suitable visualizations that can facilitate decision making in a clinical setting.
    PhD. Committee Chair: Dr. May D. Wang; Committee Member: Dr. Andrew N. Young; Committee Member: Dr. Anthony J. Yezzi; Committee Member: Dr. Edward J. Coyle; Committee Member: Dr. Paul Benkese
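
    The abstract mentions extracting texture and cellular-structure features from stained biopsy images and classifying cancer subtypes. As a hedged illustration (not the thesis's actual pipeline), the sketch below computes gray-level co-occurrence matrix (GLCM) texture features with scikit-image and feeds them to a random-forest classifier; the specific feature set, tile size, and labels are all assumptions.

```python
# Illustrative sketch only -- not the thesis's actual CAD pipeline.
# GLCM texture features + random forest; features and labels are assumed.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier

def texture_features(gray_tile):
    """Extract GLCM texture descriptors from an 8-bit grayscale tile."""
    glcm = graycomatrix(gray_tile, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

# Hypothetical training data: tiles cut from annotated biopsy slides.
rng = np.random.default_rng(0)
tiles = rng.integers(0, 256, size=(20, 64, 64), dtype=np.uint8)
labels = rng.integers(0, 2, size=20)        # 0 = benign, 1 = malignant

X = np.array([texture_features(t) for t in tiles])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(X[:3]))                   # subtype calls for three tiles
```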

    CollabCoder: A GPT-Powered Workflow for Collaborative Qualitative Analysis

    The Collaborative Qualitative Analysis (CQA) process can be time-consuming and resource-intensive, requiring multiple discussions among team members to refine codes and ideas before reaching a consensus. To address these challenges, we introduce CollabCoder, a system leveraging Large Language Models (LLMs) to support three CQA stages: independent open coding, iterative discussions, and the development of a final codebook. In the independent open coding phase, CollabCoder provides AI-generated code suggestions on demand and allows users to record coding decision-making information (e.g. keywords and certainty) as support for the process. During the discussion phase, CollabCoder helps to build mutual understanding and productive discussion by sharing coding decision-making information with the team. It also helps to quickly identify agreements and disagreements through quantitative metrics, in order to build a final consensus. During the code grouping phase, CollabCoder employs a top-down approach to recommend primary code groups, reducing the cognitive burden of generating the final codebook. An evaluation involving 16 users confirmed the usability and effectiveness of CollabCoder and offered empirical insights into the roles of LLMs in CQA.
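
    The abstract says CollabCoder surfaces agreements and disagreements through quantitative metrics. A common choice for pairwise inter-coder agreement is Cohen's kappa; the sketch below (an assumption, not confirmed as CollabCoder's own metric) computes it with scikit-learn over two coders' code assignments and flags the units needing discussion.

```python
# Illustrative sketch: pairwise inter-coder agreement via Cohen's kappa.
# The metric choice is an assumption, not confirmed as CollabCoder's own.
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes assigned by two coders to the same 8 text units.
coder_a = ["trust", "trust", "cost", "privacy", "cost", "trust", "privacy", "cost"]
coder_b = ["trust", "cost",  "cost", "privacy", "cost", "trust", "trust",   "cost"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance

# Flag units where the coders disagree, for discussion in the CQA meeting.
disagreements = [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]
print("Units needing discussion:", disagreements)
```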

    Computational analysis of proteomes from parasitic nematodes

    Intelligent Data-Driven Reverse Engineering of Software Design Patterns

    Recognising implemented instances of Design Patterns (DPs) in software design discloses and recovers a wealth of information about the intention of the original designers and the rationale for their design decisions. Because it is often the case that the documentation available for software systems, if any, is poor and/or obsolete, recovering such information can be of great help and importance for maintenance tasks. However, since DPs are abstractly and vaguely defined, a set of software classes with exactly the same relationships as expected for a DP instance may actually be only accidentally similar. On the other hand, a set of classes with relationships that are, to an extent, different from those typically expected can still be a true DP instance. The deciding factor is mainly concerned with whether or not the set of classes is actually intended to solve the design problem addressed by the DP, thus making the intent a fundamental and defining characteristic of DPs. Discerning the intent of potential instances requires building complex models that cannot be built using only the descriptions of DPs in books and catalogues. Accordingly, a paradigm shift in DP recognition towards fully machine learning based approaches is required. The problem is that no accurate and sufficiently large DP datasets exist, and it is difficult to manually construct one. Moreover, there is a lack of research on the feature set that should be used in DP recognition. The main aim of this thesis is to enable the required paradigm shift by laying down an accurate, comprehensive and information-rich foundation of feature and data sets. In order to achieve this aim, a large set of features is developed to cover a wide range of design aspects, with particular focus on design intent. This set serves as a global feature set from which different subsets can be objectively selected for different DPs. A new and feasible approach to DP dataset construction is designed and used to construct training datasets. The feature and data sets are then used experimentally to build and train DP classifiers. The results demonstrate the accuracy and utility of the sets introduced, and show that fully machine learning based approaches are capable of providing appropriate and well-equipped solutions for the problem of DP recognition.
    Saudi Cultural Burea
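
    The abstract describes objectively selecting a feature subset per design pattern from a global feature set and then training per-pattern classifiers. A minimal sketch under assumed inputs (hypothetical feature vectors for candidate class sets; not the thesis's actual features or model) follows.

```python
# Minimal sketch: per-pattern binary classifier over candidate instances.
# Feature data and model choice are hypothetical, not the thesis's own.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import GradientBoostingClassifier

# Each row: structural/behavioural metrics extracted from one candidate
# set of classes (e.g. delegation count, inheritance depth, ...).
rng = np.random.default_rng(1)
X = rng.random((200, 30))               # 200 candidates, 30 global features
y = rng.integers(0, 2, 200)             # 1 = true instance of some DP

# Objectively select a feature subset for this DP, then train on it.
model = make_pipeline(
    SelectKBest(mutual_info_classif, k=10),
    GradientBoostingClassifier(random_state=1),
).fit(X, y)

print(model.predict(X[:5]))             # DP calls for five candidates
```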

    Computational Analysis of the Transcriptome Using Long-Read RNA Sequencing

    Reconstructing the transcriptome from RNA sequencing reads is a challenging problem, especially when no high-quality reference genome is available. Current transcriptome annotations have largely relied on the short read lengths intrinsic to the most widely used high-throughput cDNA sequencing technologies. For example, in the annotation of the Caenorhabditis elegans transcriptome, more than half of the transcript isoforms lack full-length support and instead rely on inference from short reads that do not span the full length of the isoform. Short-read sequencing technologies, though accurate, cannot reliably reconstruct full-length transcripts because of the highly complex nature of the transcriptome, with large gene families, widespread alternative splicing, and highly variable expression and coverage per transcript. We applied nanopore-based direct RNA sequencing to characterize the developmental polyadenylated transcriptome of C. elegans. Using this approach, we provide support for 23,865 splice isoforms across 14,611 genes, without the need for computational reconstruction of gene models. In addition, we have developed an open-source de novo transcriptome assembly method, CONDUIT, which uses single-molecule long-read RNA sequencing to generate scaffolded splice graphs independent of a reference genome. It then pseudomaps short-read RNA sequencing reads to isoforms extracted from the scaffolded splice graphs, polishes the graphs using both short- and long-read data, and outputs consensus isoforms extracted from them. We show that CONDUIT produces highly accurate consensus isoforms, completely independent of a reference genome, in several model systems and in a novel pathogenic yeast system.
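
    The abstract describes building splice graphs from long reads and extracting isoforms from them. As a toy illustration only (not CONDUIT's actual data structures or code), the sketch below represents a splice graph as a DAG of exons and enumerates candidate isoforms as source-to-sink paths.

```python
# Toy illustration -- not CONDUIT's actual splice-graph implementation.
# Exons are nodes; edges are observed splice junctions; isoforms are paths.

splice_graph = {                   # hypothetical gene with one skipped exon
    "exon1": ["exon2", "exon3"],   # exon2 is alternatively skipped
    "exon2": ["exon3"],
    "exon3": ["exon4"],
    "exon4": [],                   # terminal exon
}

def enumerate_isoforms(graph, node, path=None):
    """Yield every source-to-sink path (candidate isoform) in the DAG."""
    path = (path or []) + [node]
    if not graph[node]:            # reached a terminal exon
        yield path
        return
    for nxt in graph[node]:
        yield from enumerate_isoforms(graph, nxt, path)

for isoform in enumerate_isoforms(splice_graph, "exon1"):
    print("-".join(isoform))
# exon1-exon2-exon3-exon4   (exon-inclusion isoform)
# exon1-exon3-exon4         (exon-skipping isoform)
```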

    Computational Methods for the Analysis of Genomic Data and Biological Processes

    In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques, such as microarrays or high-throughput sequencing, have brought new opportunities and challenges to the fields of computational biology and bioinformatics. These sequencing techniques produce large amounts of data whose analysis and integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that analyse these data reliably and efficiently. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data and, in particular, the modeling of biological processes. Here we present eleven works selected for publication in this Special Issue for their interest, quality, and originality.

    Climate Change, Modelling and Conservation of the World’s Terrestrial Birds

    Global climate change is an important threat to biodiversity and is predicted to be a major driver of wildlife population extinctions throughout the current century. Across a wide range of taxa, a well-documented response to climate change has been changes in species distributions, often towards higher latitudes and altitudes. Species distribution models (SDMs) have been widely used to predict further range changes in future, but their use has often focused on discrete geographical areas. Moreover, SDMs have typically been correlative, ignoring biological traits. Here, I use SDMs to project future ranges for the world’s terrestrial birds under climate change. To improve the realism of projected range changes, I incorporate biological traits, including species’ age at first breeding and natal dispersal range. I use these projections to predict large-scale patterns in the responses of terrestrial birds to climate change, and to explore the implications of these models for avian conservation.
    There is little consensus on the most useful predictors for SDMs, so I begin by exploring how this varies geographically. With this knowledge, I develop SDMs for the world’s terrestrial birds and project future species ranges using three global climate models (CCSM4, GFDL-CM3, HadGEM2-ES) under a low (RCP2.6), a medium (RCP4.5) and a high (RCP8.5) representative concentration pathway. The projected ranges are used to identify the species most at risk from climate change and to highlight global hotspots where species are projected to experience the greatest range losses. I explore how the projected range changes affect global species communities, and I identify areas where species communities are projected to change or novel communities will emerge. I assess how projected changes will affect the ability of the global Important Bird and Biodiversity Areas (IBAs) network to confer protection on the world’s terrestrial bird species. Additionally, I highlight, based on projected range loss and suitable habitat and climate space beyond the dispersal range, species that will be unable to track climate change and that could be candidates for Assisted Colonization (AC). Finally, I explore the divergence between global species richness (SR) patterns and phylogenetic diversity (PD) for the world’s terrestrial birds, to assess whether measuring biodiversity and setting conservation targets based on SR can be expected to cover PD as well.
    Identifying the global consequences of projected range changes can inform future conservation efforts and research priorities. Changes in range extent and overlap were projected for the vast majority of the world’s terrestrial birds, with one-fifth projected to experience major range losses (>75% decline in range extent). This has far-reaching consequences for the IBA network, with an overall trend of species moving out of IBA coverage. Furthermore, 13% of the world’s terrestrial birds are projected to suffer severe range losses that, combined with an inability to follow suitable habitat and climate space, mean they could benefit from AC as a conservation tool. Overall, PD was found to be highly correlated with SR at a global scale; however, there are localized differences where PD is higher or lower than would be expected from SR alone. These differences suggest that considering PD could enhance conservation planning. The results demonstrate the major threat that climate change poses to the world’s terrestrial bird species across all areas of the globe, and highlight the importance of considering climate change impacts to enhance their protection.
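
    The abstract describes correlative SDMs fitted to climate predictors and projected under future scenarios. A minimal sketch on synthetic data (not the thesis's actual models, which also incorporate dispersal and life-history traits) using logistic regression over two hypothetical climate variables:

```python
# Minimal correlative-SDM sketch; data and predictors are synthetic.
# Not the thesis's models, which also include dispersal and trait data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical presence/absence records with two climate predictors:
# mean annual temperature (degC) and annual precipitation (mm).
n = 500
temp = rng.normal(15, 5, n)
precip = rng.normal(1000, 300, n)
# Synthetic truth: the species favours warm, wet conditions.
presence = ((temp > 15) & (precip > 1000)).astype(int)

X = np.column_stack([temp, precip])
sdm = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, presence)

# "Project" occupancy under a warming scenario: shift temperature +3 degC.
X_future = np.column_stack([temp + 3.0, precip])
now, future = sdm.predict(X).sum(), sdm.predict(X_future).sum()
print(f"occupied cells now: {now}, under +3 degC: {future}")
```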