
    Compressing DNA sequence databases with coil

    Background: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.

    Computational Identification of Four Spliceosomal snRNAs from the Deep-Branching Eukaryote Giardia intestinalis

    Funding: Marsden Fund New Zealand Allan Wilson Centre. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. RNAs processing other RNAs is very general in eukaryotes, but it is not clear to what extent it is ancestral to eukaryotes. Here we focus on pre-mRNA splicing, one of the most important RNA-processing mechanisms in eukaryotes. In most eukaryotes splicing is predominantly catalysed by the major spliceosome complex, which consists of five uridine-rich small nuclear RNAs (U-snRNAs) and over 200 proteins in humans. Three major spliceosomal introns have been found experimentally in Giardia; one Giardia U-snRNA (U5) and a number of spliceosomal proteins have also been identified. However, because of the low sequence similarity between the Giardia ncRNAs and those of other eukaryotes, the other U-snRNAs of Giardia had not been found. Using two computational methods, candidates for Giardia U1, U2, U4 and U6 snRNAs were identified in this study and shown by RT-PCR to be expressed. We found that identifying a U2 candidate helped identify U6 and U4 based on interactions between them. Secondary structural modelling of the Giardia U-snRNA candidates revealed typical features of eukaryotic U-snRNAs. We demonstrate a successful approach to combine computational and experimental methods to identify expected ncRNAs in a highly divergent protist genome. Our findings reinforce the conclusion that spliceosomal small-nuclear RNAs existed in the last common ancestor of eukaryotes.

    Heuristic Algorithms for the Maximum Colorful Subtree Problem

    In metabolomics, small molecules are structurally elucidated using tandem mass spectrometry (MS/MS); this computational task can be formulated as the Maximum Colorful Subtree problem, which is NP-hard. Unfortunately, data from a single metabolite requires us to solve hundreds or thousands of instances of this problem, and in a single Liquid Chromatography MS/MS run, hundreds or thousands of metabolites are measured. Here, we comprehensively evaluate the performance of several heuristic algorithms for the problem. Unfortunately, as is often the case in bioinformatics, the structure of the (chemically) true solution is not known to us; therefore we can only evaluate against the optimal solution of an instance. Evaluating the quality of a heuristic based on scores can be misleading: even a slightly suboptimal solution can be structurally very different from the optimal solution, but it is the structure of a solution and not its score that is relevant for the downstream analysis. To this end, we propose a different evaluation setup: given a set of candidate instances of which exactly one is known to be correct, the heuristic in question solves each instance to the best of its ability, producing a score for each instance, which is then used to rank the instances. We then evaluate whether the correct instance is ranked highly by the heuristic. We find that one particular heuristic consistently ranks the correct instance in a top position. We also find that the scores of the best heuristic solutions are very close to the optimal score; in contrast, the structure of the solutions can deviate significantly from the optimal structures. Integrating the heuristic allowed us to speed up computations in practice by a factor of 100.

    The Challenge of Wide-Field Transit Surveys: The Case of GSC 01944-02289

    Wide-field searches for transiting extra-solar giant planets face the difficult challenge of separating true transit events from the numerous false positives caused by isolated or blended eclipsing binary systems. We describe here the investigation of GSC 01944-02289, a very promising candidate for a transiting brown dwarf detected by the Transatlantic Exoplanet Survey (TrES) network. The photometry and radial velocity observations suggested that the candidate was an object of substellar mass in orbit around an F star. However, careful analysis of the spectral line shapes revealed a pattern of variations consistent with the presence of another star whose motion produced the asymmetries observed in the spectral lines of the brightest star. Detailed simulations of blend models composed of an eclipsing binary plus a third star diluting the eclipses were compared with the observed light curve and used to derive the properties of the three components. Our photometric and spectroscopic observations are fully consistent with a blend model of a hierarchical triple system composed of an eclipsing binary with G0V and M3V components in orbit around a slightly evolved F5 dwarf. We believe that this investigation will be helpful to other groups pursuing wide-field transit searches as this type of false detection could be more common than true transiting planets, and difficult to identify. Comment: To appear in ApJ, v. 621, 2005 March 1.

    The Kepler Smear Campaign: Light curves for 102 Very Bright Stars

    We present the first data release of the Kepler Smear Campaign, using collateral 'smear' data obtained in the Kepler four-year mission to reconstruct light curves of 102 stars too bright to have been otherwise targeted. We describe the pipeline developed to extract and calibrate these light curves, and show that we attain photometric precision comparable to stars analyzed by the standard pipeline in the nominal Kepler mission. In this paper, aside from publishing the light curves of these stars, we focus on 66 red giants for which we detect solar-like oscillations, characterizing 33 of these in detail with spectroscopic chemical abundances and asteroseismic masses as benchmark stars. We also classify the whole sample, finding nearly all to be variable, with classical pulsations and binary effects. All source code, light curves, TRES spectra, and asteroseismic and stellar parameters are publicly available as a Kepler legacy sample. Comment: 35 pages, accepted ApJ.

    Data fusion for a multi-scale model of a wheat leaf surface: a unifying approach using a radial basis function partition of unity method

    Realistic digital models of plant leaves are crucial to fluid dynamics simulations of droplets for optimising agrochemical spray technologies. The presence and nature of small features (on the order of 100 μm) such as ridges and hairs on the surface have been shown to significantly affect the droplet evaporation, and thus the leaf's potential uptake of active ingredients. We show that these microstructures can be captured by implicit radial basis function partition of unity (RBFPU) surface reconstructions from micro-CT scan datasets. However, scanning a whole leaf (20 cm²) at micron resolutions is infeasible due to both extremely large data storage requirements and scanner time constraints. Instead, we micro-CT scan only a small segment of a wheat leaf (4 mm²). We fit an RBFPU implicit surface to this segment, and an explicit RBFPU surface to a lower resolution laser scan of the whole leaf. Parameterising the leaf using a locally orthogonal coordinate system, we then replicate the now resolved microstructure many times across a larger, coarser representation of the leaf surface that captures important macroscale features, such as its size, shape, and orientation. The edge of one segment of the microstructure model is blended into its neighbour naturally by the partition of unity method. The result is one implicit surface reconstruction that captures the wheat leaf's features at both the micro- and macro-scales. Comment: 23 pages, 11 figures.

    Asteroseismic properties of solar-type stars observed with the NASA K2 mission: results from Campaigns 1-3 and prospects for future observations

    We present an asteroseismic analysis of 33 solar-type stars observed in short cadence during Campaigns (C) 1-3 of the NASA K2 mission. We were able to extract both average seismic parameters and individual mode frequencies for stars with dominant frequencies up to ~3300 μHz, and we find that data for some targets are good enough to allow for a measurement of the rotational splitting. Modelling of the extracted parameters is performed with grid-based methods, using average parameters and individual frequencies together with spectroscopic parameters. For the target selection in C3, stars were chosen as in C1 and C2 to cover a wide range in parameter space to better understand the performance and noise characteristics. For C3 we still detected oscillations in 73% of the observed stars that we proposed. Future K2 campaigns hold great promise for the study of nearby clusters and the chemical evolution and age-metallicity relation of nearby field stars in the solar neighbourhood. We expect oscillations to be detected in ~388 short-cadence targets if the K2 mission continues until C18, which will greatly complement the ~500 detections of solar-like oscillations made for short-cadence targets during the nominal Kepler mission. For ~30-40 of these, including several members of the Hyades open cluster, we furthermore expect that inference from interferometry should be possible. Comment: 17 pages, 15 figures, 4 tables; accepted for publication in PASP.