7 research outputs found

    Experiences with workflows for automating data-intensive bioinformatics

    Get PDF
    High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and carry out data management and analysis tasks on large scale. Workflow systems can be useful to simplify construction of analysis pipelines that automate tasks, support reproducibility and provide measures for fault-tolerance. However, workflow systems can incur significant development and administration overhead so bioinformatics pipelines are often still built without them. We present the experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead. The organizations are working on similar problems, but we have addressed them with different strategies and solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our experiences we define a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution.Pubblicat

    A Semi-Automated Approach for Anatomical Ontology Mapping

    No full text
    This paper presents a study in the domain of semi-automated and fully-automated ontology mapping. A process for inferring additional cross-ontology links within the domain of anatomical ontologies is presented and evaluated on pairs from three model organisms. The results of experiments performed with various external knowledge sources and scoring schemes are discussed

    Application of Machine Learning Models in Error and Variant Detection in High-Variation Genomics Datasets

    No full text
    For metagenomics datasets, datasets of complex polyploid genomes, and other high-variation genomics datasets, there are difficulties with the analysis, error detection and variant calling, stemming from the challenges of discerning sequencing errors from biological variation. Confirming base candidates with high frequency of occurrence is no longer a reliable measure because of the natural variation and the presence of rare bases. The paper discusses an approach to the application of machine learning models to classify bases into erroneous and rare variations after preselecting potential error candidates with a weighted frequency measure, which aims to focus on unexpected variations by using the inter-sequence pairwise similarity. Different similarity measures are used to account for different types of datasets. Four machine learning models are implemented and tested

    In silico Prediction of C4-related Genes by Finding Duplications Causing Pattern Deviation and Comparative Analysis of Phylogenetic Trees

    Get PDF
    This study is focused on the development of a pattern-finding method for analyzing evolutionary trees to predict genes that may be involved in C4 photosynthesis. It relies on publicly available phylogenetic data which is processed with the authors’ own Python scripts and opensource software. The pattern recognition in the topology of the trees is an essential part of the process and the result is then validated by comparing the expression levels of the selected candidates. The same approach can be applied in studying the evolution of other important traits just by changing the type of pattern

    Machine Learning Models for Error Detection in Metagenomics and Polyploid Sequencing Data

    No full text
    Metagenomics studies, as well as genomics studies of polyploid species such as wheat, deal with the analysis of high variation data. Such data contain sequences from similar, but distinct genetic chains. This fact presents an obstacle to analysis and research. In particular, the detection of instrumentation errors during the digitalization of the sequences may be hindered, as they can be indistinguishable from the real biological variation inside the digital data. This can prevent the determination of the correct sequences, while at the same time make variant studies significantly more difficult. This paper details a collection of ML-based models used to distinguish a real variant from an erroneous one. The focus is on using this model directly, but experiments are also done in combination with other predictors that isolate a pool of error candidates

    Manageable Workflows for Processing Parallel Sequencing Data

    No full text
    ACM Computing Classification System (1998): D.2.11, D.1.3, D.3.1, J.3, C.2.4.Data analysis after parallel sequencing is a process that uses combinations of software tools that is often subject to experimentation and on-the-fly substitution, with the necessary file conversion. This article presents a developing system for creating and managing workflows aiding the tasks one encounters after parallel sequences, particularly in the area of metagenomics. The semantics, description language and software implementation aim to allow the creation of flexible, configurable workflows that are suitable for sharing and are easy to manipulate through software or by hand. The execution system design provides user-defined operations and interchangeability between an operation and a workflow. This allows significant extensibility, which can be further complemented with distributed computing and remote management interfaces
    corecore