Compressing DNA sequence databases with coil
Background: Publicly available DNA sequence databases such as GenBank are large, and are
growing at an exponential rate. The sheer volume of data being dealt with presents serious storage
and data communications problems. Currently, sequence data is usually kept in large "flat files,"
which are then compressed using standard Lempel-Ziv (gzip) compression, an approach which
rarely achieves good compression ratios. While much research has been done on compressing
individual DNA sequences, surprisingly little has focused on the compression of entire databases
of such sequences. In this study we introduce the sequence database compression software coil.
Results: We have designed and implemented a portable software package, coil, for compressing
and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared
towards achieving high compression ratios at the expense of execution time and memory usage
during compression; the compression time represents a "one-off investment" whose cost is
quickly amortised if the resulting compressed file is transmitted many times. Decompression
requires little memory and is extremely fast. We demonstrate a 5% improvement in compression
ratio over state-of-the-art general-purpose compression tools for a large GenBank database file
containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental
additions to a sequence database.
Conclusion: coil presents a compelling alternative to conventional compression of flat files for the
storage and distribution of DNA sequence databases having a narrow distribution of sequence
lengths, such as EST data. Increasing compression levels for databases having a wide distribution of
sequence lengths is a direction for future work.
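The general idea of storing a sequence as edits against a similar one can be sketched as follows. This is only an illustration of edit-based encoding, not coil's edit-tree algorithm, which the abstract does not describe in detail; the function names `encode_edits` and `decode_edits` are hypothetical, and the alignment comes from Python's standard `difflib` rather than any method used by the authors.

```python
from difflib import SequenceMatcher


def encode_edits(reference, target):
    """Describe `target` as a list of edit operations against `reference`."""
    ops = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared stretch: store only the coordinates in the reference.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New bases: store them literally.
            ops.append(("lit", target[j1:j2]))
        # "delete" needs no operation: the reference span is simply not copied.
    return ops


def decode_edits(reference, ops):
    """Rebuild the target sequence from the reference and the edit list."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(reference[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


if __name__ == "__main__":
    ref = "ACGTACGTTAGC"
    tgt = "ACGTACCTTAGCA"
    edits = encode_edits(ref, tgt)
    print(edits)
    print(decode_edits(ref, edits) == tgt)
```

When two sequences are highly similar, as is common for EST data, most of the target is covered by a few "copy" operations, which is why an edit-based representation can be much smaller than the raw sequence.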
Computational Identification of Four Spliceosomal snRNAs from the Deep-Branching Eukaryote Giardia intestinalis
Funding: Marsden Fund New Zealand Allan Wilson Centre. The funders had no role in study design, data collection and analysis, decision to publish, or
preparation of the manuscript.
RNAs processing other RNAs is very general in eukaryotes, but it is not clear to what extent it is ancestral to eukaryotes. Here
we focus on pre-mRNA splicing, one of the most important RNA-processing mechanisms in eukaryotes. In most eukaryotes
splicing is predominantly catalysed by the major spliceosome complex, which consists of five uridine-rich small nuclear
RNAs (U-snRNAs) and over 200 proteins in humans. Three major spliceosomal introns have been found experimentally in
Giardia; one Giardia U-snRNA (U5) and a number of spliceosomal proteins have also been identified. However, because of
the low sequence similarity between the Giardia ncRNAs and those of other eukaryotes, the other U-snRNAs of Giardia had
not been found. Using two computational methods, candidates for Giardia U1, U2, U4 and U6 snRNAs were identified in this
study and shown by RT-PCR to be expressed. We found that identifying a U2 candidate helped identify U6 and U4 based on
interactions between them. Secondary structural modelling of the Giardia U-snRNA candidates revealed typical features of
eukaryotic U-snRNAs. We demonstrate a successful approach to combine computational and experimental methods to
identify expected ncRNAs in a highly divergent protist genome. Our findings reinforce the conclusion that spliceosomal
small-nuclear RNAs existed in the last common ancestor of eukaryotes.
Heuristic Algorithms for the Maximum Colorful Subtree Problem
In metabolomics, small molecules are structurally elucidated using tandem mass spectrometry (MS/MS); this computational task can be formulated as the Maximum Colorful Subtree problem, which is NP-hard. Unfortunately, data from a single metabolite requires us to solve hundreds or thousands of instances of this problem - and in a single Liquid Chromatography MS/MS run, hundreds or thousands of metabolites are measured.
Here, we comprehensively evaluate the performance of several heuristic algorithms for the problem. Unfortunately, as is often the case in bioinformatics, the structure of the (chemically) true solution is not known to us; therefore we can only evaluate against the optimal solution of an instance. Evaluating the quality of a heuristic based on scores can be misleading: Even a slightly suboptimal solution can be structurally very different from the optimal solution, but it is the structure of a solution and not its score that is relevant for the downstream analysis. To this end, we propose a different evaluation setup: Given a set of candidate instances of which exactly one is known to be correct, the heuristic in question solves each instance to the best of its ability, producing a score for each instance, which is then used to rank the instances. We then evaluate whether the correct instance is ranked highly by the heuristic.
We find that one particular heuristic consistently ranks the correct instance in a top position. We also find that the scores of the best heuristic solutions are very close to the optimal score; in contrast, the structure of the solutions can deviate significantly from the optimal structures. Integrating the heuristic allowed us to speed up computations in practice by a factor of 100.
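The rank-based evaluation described in this abstract can be sketched as below. This is a minimal illustration under stated assumptions: the names `rank_of_correct` and `top_k_rate`, the candidate labels, and the scores are all hypothetical, and higher scores are assumed to be better.

```python
def rank_of_correct(scores, correct):
    """Return the 1-based rank of the correct candidate.

    `scores` maps each candidate instance to the score the heuristic
    produced for it; candidates are ranked by descending score.
    """
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ordered.index(correct) + 1


def top_k_rate(ranks, k):
    """Fraction of test cases whose correct candidate is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


if __name__ == "__main__":
    # Made-up heuristic scores for three candidate instances.
    scores = {"candidate_a": 0.2, "candidate_b": 0.9, "candidate_c": 0.5}
    print(rank_of_correct(scores, "candidate_c"))
    print(top_k_rate([1, 2, 5, 1], k=1))
```

The point of this setup is that a heuristic is judged by where it places the known-correct instance, not by how close its score is to the optimum.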
The Challenge of Wide-Field Transit Surveys: The Case of GSC 01944-02289
Wide-field searches for transiting extra-solar giant planets face the
difficult challenge of separating true transit events from the numerous false
positives caused by isolated or blended eclipsing binary systems. We describe
here the investigation of GSC 01944-02289, a very promising candidate for a
transiting brown dwarf detected by the Transatlantic Exoplanet Survey (TrES)
network. The photometry and radial velocity observations suggested that the
candidate was an object of substellar mass in orbit around an F star. However,
careful analysis of the spectral line shapes revealed a pattern of variations
consistent with the presence of another star whose motion produced the
asymmetries observed in the spectral lines of the brightest star. Detailed
simulations of blend models composed of an eclipsing binary plus a third star
diluting the eclipses were compared with the observed light curve and used to
derive the properties of the three components. Our photometric and
spectroscopic observations are fully consistent with a blend model of a
hierarchical triple system composed of an eclipsing binary with G0V and M3V
components in orbit around a slightly evolved F5 dwarf. We believe that this
investigation will be helpful to other groups pursuing wide-field transit
searches as this type of false detection could be more common than true
transiting planets, and difficult to identify.
Comment: To appear in ApJ, v. 621, 2005 March 1
The Kepler Smear Campaign: Light curves for 102 Very Bright Stars
We present the first data release of the Kepler Smear Campaign, using
collateral 'smear' data obtained in the Kepler four-year mission to reconstruct
light curves of 102 stars too bright to have been otherwise targeted. We
describe the pipeline developed to extract and calibrate these light curves,
and show that we attain photometric precision comparable to stars analyzed by
the standard pipeline in the nominal Kepler mission. In this paper, aside from
publishing the light curves of these stars, we focus on 66 red giants for which
we detect solar-like oscillations, characterizing 33 of these in detail with
spectroscopic chemical abundances and asteroseismic masses as benchmark stars.
We also classify the whole sample, finding nearly all to be variable, with
classical pulsations and binary effects. All source code, light curves, TRES
spectra, and asteroseismic and stellar parameters are publicly available as a
Kepler legacy sample.
Comment: 35 pages, accepted ApJ
Data fusion for a multi-scale model of a wheat leaf surface: a unifying approach using a radial basis function partition of unity method
Realistic digital models of plant leaves are crucial to fluid dynamics
simulations of droplets for optimising agrochemical spray technologies. The
presence and nature of small features (on the order of 100)
such as ridges and hairs on the surface have been shown to significantly affect
the droplet evaporation, and thus the leaf's potential uptake of active
ingredients. We show that these microstructures can be captured by implicit
radial basis function partition of unity (RBFPU) surface reconstructions from
micro-CT scan datasets. However, scanning a whole leaf at
micron resolutions is infeasible due to both extremely large data storage
requirements and scanner time constraints. Instead, we micro-CT scan only a
small segment of a wheat leaf. We fit an RBFPU implicit
surface to this segment, and an explicit RBFPU surface to a lower resolution
laser scan of the whole leaf. Parameterising the leaf using a locally
orthogonal coordinate system, we then replicate the now resolved microstructure
many times across a larger, coarser, representation of the leaf surface that
captures important macroscale features, such as its size, shape, and
orientation. The edge of one segment of the microstructure model is blended
into its neighbour naturally by the partition of unity method. The result is
one implicit surface reconstruction that captures the wheat leaf's features at
both the micro- and macro-scales.
Comment: 23 pages, 11 figures
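The blending step of a partition of unity method can be illustrated in one dimension as follows. This is a sketch under assumptions, not the paper's implementation: the weight function `bump`, the helper `blend`, and the constant local functions are hypothetical, and in an actual RBFPU reconstruction each local function would be a radial basis function fit on its patch.

```python
def bump(x, centre, radius):
    """Compactly supported weight: positive inside the patch, zero outside."""
    r = abs(x - centre) / radius
    return (1.0 - r) ** 2 if r < 1.0 else 0.0


def blend(x, patches):
    """Evaluate the partition-of-unity blend of local approximants at x.

    `patches` is a list of (centre, radius, local_fn) tuples. The weights
    are normalised so that they sum to one wherever a patch covers x.
    """
    weights = [bump(x, centre, radius) for centre, radius, _ in patches]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x is not covered by any patch")
    values = [fn(x) for _, _, fn in patches]
    return sum(w * v for w, v in zip(weights, values)) / total


if __name__ == "__main__":
    # Two overlapping patches with different local models.
    patches = [
        (0.0, 1.0, lambda x: 1.0),
        (1.0, 1.0, lambda x: 3.0),
    ]
    for x in (0.0, 0.25, 0.5, 0.75):
        print(x, blend(x, patches))
```

Because the normalised weights fall smoothly to zero at each patch boundary, neighbouring local models merge without a visible seam, which is the property the abstract relies on when replicating the microstructure segment across the leaf.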
Asteroseismic properties of solar-type stars observed with the NASA K2 mission: results from Campaigns 1-3 and prospects for future observations
We present an asteroseismic analysis of 33 solar-type stars observed in short
cadence during Campaigns (C) 1-3 of the NASA K2 mission. We were able to
extract both average seismic parameters and individual mode frequencies for
stars with dominant frequencies up to ~3300 μHz, and we find that data for
some targets are good enough to allow for a measurement of the rotational
splitting. Modelling of the extracted parameters is performed with
grid-based methods, using average parameters and individual frequencies together
with spectroscopic parameters. For the target selection in C3, stars were
chosen as in C1 and C2 to cover a wide range in parameter space to better
understand the performance and noise characteristics. For C3 we still detected
oscillations in 73% of the observed stars that we proposed. Future K2 campaigns
hold great promise for the study of nearby clusters and the chemical evolution
and age-metallicity relation of nearby field stars in the solar neighbourhood.
We expect oscillations to be detected in ~388 short-cadence targets if the K2
mission continues until C18, which will greatly complement the ~500 detections
of solar-like oscillations made for short-cadence targets during the nominal
Kepler mission. For ~30-40 of these, including several members of the Hyades
open cluster, we furthermore expect that inference from interferometry should
be possible.
Comment: 17 pages, 15 figures, 4 tables; accepted for publication in PAS