73,820 research outputs found
BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments
Advances in sequencing techniques have led to exponential growth in
biological data, demanding the development of large-scale bioinformatics
experiments. Because these experiments are computation- and data-intensive,
they require high-performance computing (HPC) techniques and can benefit from
specialized technologies such as Scientific Workflow Management Systems (SWfMS)
and databases. In this work, we present BioWorkbench, a framework for managing
and analyzing bioinformatics experiments. This framework automatically collects
provenance data, including both performance data from workflow execution and
data from the scientific domain of the workflow application. Provenance data
can be analyzed through a web application that abstracts a set of queries to
the provenance database, simplifying access to provenance information. We
evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree
assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a
RASopathy analysis workflow. We analyze each workflow from both computational
and scientific domain perspectives, by using queries to a provenance and
annotation database. Some of these queries are available as a pre-built feature
of the BioWorkbench web application. Through the provenance data, we show that
the framework is scalable and achieves high-performance, reducing up to 98% of
the case studies execution time. We also show how the application of machine
learning techniques can enrich the analysis process
A perceptual hash function to store and retrieve large scale DNA sequences
This paper proposes a novel approach for storing and retrieving massive DNA
sequences.. The method is based on a perceptual hash function, commonly used to
determine the similarity between digital images, that we adapted for DNA
sequences. Perceptual hash function presented here is based on a Discrete
Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a fixed gray
level intensity pixel and the hash is calculated from its significant frequency
characteristics. This results to a drastic data reduction between the sequence
and the perceptual hash. Unlike cryptographic hash functions, perceptual hashes
are not affected by "avalanche effect" and thus can be compared. The similarity
distance between two hashes is estimated with the Hamming Distance, which is
used to retrieve DNA sequences. Experiments that we conducted show that our
approach is relevant for storing massive DNA sequences, and retrieving them
Automatic Translating Between Ancient Chinese and Contemporary Chinese with Limited Aligned Corpora
The Chinese language has evolved a lot during the long-term development.
Therefore, native speakers now have trouble in reading sentences written in
ancient Chinese. In this paper, we propose to build an end-to-end neural model
to automatically translate between ancient and contemporary Chinese. However,
the existing ancient-contemporary Chinese parallel corpora are not aligned at
the sentence level and sentence-aligned corpora are limited, which makes it
difficult to train the model. To build the sentence level parallel training
data for the model, we propose an unsupervised algorithm that constructs
sentence-aligned ancient-contemporary pairs by using the fact that the aligned
sentence pair shares many of the tokens. Based on the aligned corpus, we
propose an end-to-end neural model with copying mechanism and local attention
to translate between ancient and contemporary Chinese. Experiments show that
the proposed unsupervised algorithm achieves 99.4% F1 score for sentence
alignment, and the translation model achieves 26.95 BLEU from ancient to
contemporary, and 36.34 BLEU from contemporary to ancient.Comment: Acceptted by NLPCC 201
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing
- …