7 research outputs found
Experiences with workflows for automating data-intensive bioinformatics
High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a
data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and carry out
data management and analysis tasks at large scale. Workflow systems can be useful to simplify construction of
analysis pipelines that automate tasks, support reproducibility and provide measures for fault tolerance. However,
workflow systems can incur significant development and administration overhead, so bioinformatics pipelines are
often still built without them. We present the experiences with workflows and workflow systems within the
bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead.
The organizations are working on similar problems, but we have addressed them with different strategies and
solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our
experiences we define a set of recommendations for future systems to enable efficient yet simple bioinformatics
workflow construction and execution.
A Semi-Automated Approach for Anatomical Ontology Mapping
This paper presents a study in the domain of semi-automated and fully-automated ontology mapping. A process for inferring additional cross-ontology links within the domain of anatomical ontologies is presented and evaluated on ontology pairs from three model organisms. The results of experiments performed with various external knowledge sources and scoring schemes are discussed.
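A scoring scheme of the kind mentioned above can be sketched as follows. The term labels, synonym sets, and weights here are illustrative assumptions, not the paper's actual scheme; the synonym lookup stands in for an external knowledge source.

```python
# Illustrative scoring of candidate cross-ontology term pairs.
# Labels, synonym sets, and weights are hypothetical examples.

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased label tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def score_pair(label_a, label_b, synonyms_a, synonyms_b,
               w_exact=1.0, w_syn=0.8, w_jaccard=0.5):
    """Return the strongest evidence score for linking two anatomical terms."""
    if label_a.lower() == label_b.lower():
        return w_exact
    # Synonym overlap drawn from an external knowledge source (e.g. a lexicon).
    syn_a = {s.lower() for s in synonyms_a}
    syn_b = {label_b.lower()} | {s.lower() for s in synonyms_b}
    if syn_a & syn_b:
        return w_syn
    return w_jaccard * token_jaccard(label_a, label_b)

print(score_pair("hind limb", "hindlimb", {"hindlimb"}, set()))  # synonym hit -> 0.8
```

Candidate links scoring above a cutoff would then be passed to a curator, which is what makes the approach semi-automated.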
Application of Machine Learning Models in Error and Variant Detection in High-Variation Genomics Datasets
In metagenomics datasets, datasets of complex polyploid genomes, and other high-variation genomics datasets, analysis, error detection, and variant calling are complicated by the difficulty of discerning sequencing errors from biological variation. Confirming base candidates by high frequency of occurrence is no longer a reliable measure because of the natural variation and the presence of rare bases. The paper discusses the application of machine learning models to classify bases as erroneous or rare variations after preselecting potential error candidates with a weighted frequency measure, which focuses on unexpected variations by using inter-sequence pairwise similarity. Different similarity measures are used to account for different types of datasets. Four machine learning models are implemented and tested.
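The preselection step described above can be sketched roughly as follows. The similarity weighting and the threshold are assumptions for illustration; the paper's actual measures, weights, and downstream ML features may differ.

```python
# A minimal sketch of weighted-frequency preselection of error candidates.
# Each read's vote for a base is weighted by its mean pairwise similarity
# to the other reads, so bases carried by dissimilar reads stand out.

def hamming_similarity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length reads."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def weighted_base_frequencies(reads, pos):
    """Weighted, normalized frequency of each base observed at `pos`."""
    freqs = {}
    for i, r in enumerate(reads):
        w = sum(hamming_similarity(r, o) for j, o in enumerate(reads) if j != i)
        w /= max(len(reads) - 1, 1)
        freqs[r[pos]] = freqs.get(r[pos], 0.0) + w
    total = sum(freqs.values())
    return {b: f / total for b, f in freqs.items()}

def error_candidates(reads, pos, threshold=0.25):
    """Bases whose weighted frequency falls below the threshold become
    candidates for ML classification as error vs. rare variant."""
    return [b for b, f in weighted_base_frequencies(reads, pos).items() if f < threshold]

reads = ["ACGT", "ACGT", "ACGA", "TCGT"]
print(error_candidates(reads, 0))  # -> ['T']
```

Only the preselected candidates would then be handed to the classifiers, which keeps the expensive ML step focused on unexpected variations.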
In silico Prediction of C4-related Genes by Finding Duplications Causing Pattern Deviation and Comparative Analysis of Phylogenetic Trees
This study is focused on the development of a pattern-finding method for analyzing evolutionary trees to predict genes that may be involved in C4 photosynthesis. It relies on publicly available phylogenetic data which is processed with the authors' own Python scripts and open-source software. The pattern recognition in the topology of the trees is an essential part of the process, and the result is then validated by comparing the expression levels of the selected candidates. The same approach can be applied in studying the evolution of other important traits simply by changing the type of pattern.
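One elementary topology pattern of the kind such a method might search for is a duplication node, where the same species appears in both child clades of a gene tree. The toy sketch below uses a nested-tuple tree encoding and made-up species names; it is not the authors' pipeline, only an illustration of recognizing a pattern in tree topology.

```python
# Toy sketch: flag internal nodes of a gene tree where the same species
# occurs in both child clades, indicating a gene duplication event.

def leaves(node):
    """Collect leaf labels (species names) from a nested-tuple tree."""
    if isinstance(node, str):
        return {node}
    left, right = node
    return leaves(left) | leaves(right)

def duplication_nodes(node, found=None):
    """Return internal nodes whose child clades share at least one species."""
    if found is None:
        found = []
    if isinstance(node, str):
        return found
    left, right = node
    if leaves(left) & leaves(right):
        found.append(node)
    duplication_nodes(left, found)
    duplication_nodes(right, found)
    return found

# "maize" appears on both sides of one clade -> one duplication event.
gene_tree = ((("maize", "sorghum"), "maize"), "rice")
print(len(duplication_nodes(gene_tree)))  # -> 1
```

Genes sitting under such flagged nodes would then be the candidates whose expression levels are compared in the validation step.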
Machine Learning Models for Error Detection in Metagenomics and Polyploid Sequencing Data
Metagenomics studies, as well as genomics studies of polyploid species such as wheat, deal with the analysis of high-variation data. Such data contain sequences from similar, but distinct, genetic chains. This fact presents an obstacle to analysis and research. In particular, the detection of instrumentation errors during the digitalization of the sequences may be hindered, as they can be indistinguishable from the real biological variation inside the digital data. This can prevent the determination of the correct sequences, while at the same time making variant studies significantly more difficult. This paper details a collection of ML-based models used to distinguish a real variant from an erroneous one. The focus is on using this model directly, but experiments are also done in combination with other predictors that isolate a pool of error candidates.
Manageable Workflows for Processing Parallel Sequencing Data
ACM Computing Classification System (1998): D.2.11, D.1.3, D.3.1, J.3, C.2.4.
Data analysis after parallel sequencing is a process that uses
combinations of software tools that are often subject to experimentation
and on-the-fly substitution, with the necessary file conversions. This article
presents a developing system for creating and managing workflows aiding
the tasks one encounters after parallel sequencing, particularly in the area of
metagenomics.
The semantics, description language and software implementation aim
to allow the creation of flexible, configurable workflows that are suitable for
sharing and are easy to manipulate through software or by hand. The execution
system design provides user-defined operations and interchangeability
between an operation and a workflow. This allows significant extensibility,
which can be further complemented with distributed computing and remote
management interfaces.
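The interchangeability between an operation and a workflow can be sketched as a composite pattern: both expose the same interface, so a workflow can be nested wherever an operation is expected. The class and step names below are illustrative assumptions, not the system's actual description language.

```python
# Minimal sketch of operation/workflow interchangeability (composite pattern).
# Names and steps are illustrative, not the actual system's API.

class Operation:
    """A user-defined step: wraps a function from input data to output data."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def run(self, data):
        return self.fn(data)

class Workflow:
    """A sequence of steps. Because it also exposes run(), a Workflow can be
    used anywhere an Operation can, enabling nesting, sharing and reuse."""
    def __init__(self, name, steps):
        self.name, self.steps = name, steps
    def run(self, data):
        for step in self.steps:
            data = step.run(data)
        return data

trim = Operation("trim", lambda reads: [r[:4] for r in reads])
dedup = Operation("dedup", lambda reads: sorted(set(reads)))
preprocess = Workflow("preprocess", [trim, dedup])

# The sub-workflow slots into a larger pipeline as if it were a single step.
pipeline = Workflow("pipeline", [preprocess, Operation("count", len)])
print(pipeline.run(["ACGTA", "ACGTC", "TTGCA"]))  # -> 2
```

With this shape, extensibility comes for free: a remotely executed or distributed step only needs to honor the same run() contract.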