74 research outputs found
Revisiting detrended fluctuation analysis
Half a century ago Hurst introduced Rescaled Range (R/S) Analysis to study fluctuations in time series. Thousands of works have investigated or applied the original methodology and similar techniques, with Detrended Fluctuation Analysis becoming preferred due to its purported ability to mitigate nonstationarities. We show that Detrended Fluctuation Analysis introduces artifacts for nonlinear trends, in contrast to common expectation, and demonstrate that the empirically observed curvature it induces is a serious finite-size effect which will always be present. Explicit detrending followed by measurement of the diffusional spread of a signal's associated random walk is preferable, a surprising conclusion given that Detrended Fluctuation Analysis was crafted specifically to replace this approach. The implications are simple yet sweeping: there is no compelling reason to apply Detrended Fluctuation Analysis, as it 1) introduces uncontrolled bias; 2) is computationally more expensive than the unbiased estimator; and 3) cannot provide generic or useful protection against nonstationarities.
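For readers unfamiliar with the procedure this abstract critiques, the following is a minimal sketch of conventional first-order Detrended Fluctuation Analysis, not the authors' code. It assumes non-overlapping windows and a straight-line fit per window; the function names, window sizes, and test signal are illustrative choices only.

```python
import math
import random

def dfa_fluctuation(signal, window):
    """RMS fluctuation F(n) of the linearly detrended profile for one window size."""
    mean = sum(signal) / len(signal)
    # Profile: cumulative sum of the mean-subtracted signal (the associated random walk).
    profile, total = [], 0.0
    for x in signal:
        total += x - mean
        profile.append(total)
    n_windows = len(profile) // window
    t = list(range(window))
    t_mean = sum(t) / window
    t_var = sum((ti - t_mean) ** 2 for ti in t)
    sq_sum = 0.0
    for w in range(n_windows):
        seg = profile[w * window:(w + 1) * window]
        y_mean = sum(seg) / window
        slope = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, seg)) / t_var
        # Squared residuals after removing the least-squares line from this window.
        sq_sum += sum((yi - y_mean - slope * (ti - t_mean)) ** 2
                      for ti, yi in zip(t, seg))
    return math.sqrt(sq_sum / (n_windows * window))

def dfa_exponent(signal, windows):
    """Slope of log F(n) against log n, fitted by ordinary least squares."""
    xs = [math.log(n) for n in windows]
    ys = [math.log(dfa_fluctuation(signal, n)) for n in windows]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Illustrative input: uncorrelated Gaussian noise, for which the expected exponent is 0.5.
rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
alpha = dfa_exponent(white_noise, [16, 32, 64, 128, 256])
```

The per-window fit is the step the abstract contrasts with explicit detrending of the whole signal followed by a direct measurement of the random walk's spread.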
Screening non-coding RNAs in transcriptomes from neglected species using PORTRAIT: case study of the pathogenic fungus Paracoccidioides brasiliensis
Background: Transcriptome sequences provide a complement to structural genomic information and provide snapshots of an organism's transcriptional profile. Such sequences also represent an alternative method for characterizing neglected species that are not expected to undergo whole-genome sequencing. One difficulty for transcriptome sequencing of these organisms is the low quality of reads and incomplete coverage of transcripts, both of which compromise further bioinformatics analyses. Another complicating factor is the lack of known protein homologs, which frustrates searches against established protein databases. This lack of homologs may be caused by divergence from well-characterized and over-represented model organisms. Another explanation is that non-coding RNAs (ncRNAs) may be caught during sequencing. NcRNAs are RNA sequences that, unlike messenger RNAs, do not code for protein products and instead perform unique functions by folding into higher order structural conformations. ncRNA screening software specific to transcriptome sequences is available, but its analyses are optimized for transcriptomes that are well represented in protein databases, and it assumes that input ESTs are full-length and high quality.
Results: We propose an algorithm called PORTRAIT, which is suitable for ncRNA analysis of transcriptomes from poorly characterized species. Sequences are translated by software that is resistant to sequencing errors, and the predicted putative proteins, along with their source transcripts, are evaluated for coding potential by a support vector machine (SVM). Either of two SVM models may be employed: if a putative protein is found, a protein-dependent SVM model is used; if it is not found, a protein-independent SVM model is used instead. Only ab initio features are extracted, so that no homology information is needed. We illustrate the use of PORTRAIT by predicting ncRNAs from the transcriptome of the pathogenic fungus Paracoccidioides brasiliensis and five other related fungi.
Conclusion: PORTRAIT can be integrated into pipelines, and provides a low computational cost solution for ncRNA detection in transcriptome sequencing projects.
Magnetotransport in an aluminum thin film on a GaAs substrate grown by molecular beam epitaxy
Magnetotransport measurements are performed on an aluminum thin film grown on a GaAs substrate. A crossover from electron- to hole-dominant transport can be inferred from both the longitudinal resistivity and the Hall resistivity as the perpendicular magnetic field B increases. Localization effects are also seen at low B. By analyzing the zero-field resistivity as a function of temperature T, we show the importance of surface scattering in such a nanoscale film.
Cross-species inference of long non-coding RNAs greatly expands the ruminant transcriptome
Additional file 3. This file contains all supplementary tables relating to lncRNA identification via the conservation of synteny. Table S3. lncRNAs inferred in one species by the genomic alignment of a transcript assembled with the RNA-seq libraries from a related species. Table S12. Presence of intergenic lncRNAs both in sheep and cattle, in regions of conserved synteny. Table S13. Presence of intergenic lncRNAs both in sheep and goat, in regions of conserved synteny. Table S14. Presence of intergenic lncRNAs both in cattle and goat, in regions of conserved synteny. Table S15. Presence of intergenic lncRNAs both in sheep and humans, in regions of conserved synteny. Table S16. Presence of intergenic lncRNAs both in goat and humans, in regions of conserved synteny. Table S17. Presence of intergenic lncRNAs both in cattle and humans, in regions of conserved synteny. Table S18. High-confidence lncRNA pairs, those conserved across species both sequentially and positionally.
Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data
Background: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations for both low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models.
Results: The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists of concatenating the sequences in the data set into a single sequence.
Conclusions: Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We conclude with a discussion of our method and its potential improvements.
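To make the problem concrete, here is a minimal sketch of the straightforward baseline: embed a single word in a deterministic finite automaton, compute the exact count distribution in each sequence, then convolve across independent sequences. It is not any of the paper's three algorithms. It assumes an order-0 (i.i.d.) letter model as a simplification of a Markov source, counts overlapping occurrences, and uses hypothetical function names.

```python
def build_automaton(word, alphabet):
    """Transition table; a state is the length of the longest prefix of `word` matched."""
    m = len(word)
    delta = []
    for state in range(m + 1):
        row = {}
        for letter in alphabet:
            candidate = word[:state] + letter
            # Longest suffix of `candidate` that is also a prefix of `word`.
            k = min(m, len(candidate))
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            row[letter] = k
        delta.append(row)
    return delta

def count_distribution(word, length, probs):
    """Exact distribution of the occurrence count of `word` in one i.i.d. sequence."""
    alphabet = list(probs)
    delta = build_automaton(word, alphabet)
    m = len(word)
    max_count = max(length - m + 1, 0)
    # dist[state][count]: probability of being in `state` after `count` occurrences.
    dist = [[0.0] * (max_count + 1) for _ in range(m + 1)]
    dist[0][0] = 1.0
    for _ in range(length):
        new = [[0.0] * (max_count + 1) for _ in range(m + 1)]
        for state in range(m + 1):
            for count, p in enumerate(dist[state]):
                if p == 0.0:
                    continue
                for letter in alphabet:
                    nxt = delta[state][letter]
                    hit = 1 if nxt == m else 0  # reaching state m completes an occurrence
                    new[nxt][count + hit] += p * probs[letter]
        dist = new
    return [sum(dist[s][c] for s in range(m + 1)) for c in range(max_count + 1)]

def convolve(a, b):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def set_distribution(word, lengths, probs):
    """Count distribution over a set of independent sequences of the given lengths."""
    total = [1.0]
    for length in lengths:
        total = convolve(total, count_distribution(word, length, probs))
    return total

uniform = {"A": 0.5, "B": 0.5}
single = count_distribution("AA", 3, uniform)
pooled = set_distribution("A", [1, 2], uniform)
```

Treating the sequences separately is what avoids counting occurrences that straddle a boundary, which is the edge effect the abstract attributes to the concatenation approximation.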
Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats
Background: Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve the prediction accuracy. The discrete Fourier transform (DFT) has been extensively used to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats.
Results: We used a method based on the DFT, with a mapping of the symbolic sequence into a numerical one, to identify and study alphoid higher order repeats (HORs). For HORs the power spectrum shows an equidistant frequency pattern, with a characteristic two-level hierarchical organization as the signature of a HOR. Our case study was the 16mer HOR tandem in AC017075.8 from human chromosome 7. A very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on the fundamental frequency of the 16mer HOR. A pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for an nmer) and higher harmonics. In general, an nmer HOR pattern contains equidistant secondary periodicity peaks, with a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern as a signature for HOR detection is robust with respect to monomer insertions and deletions, random sequence insertions, etc. For a monomeric alphoid sequence only primary periodicity peaks are present. The 1/f^β noise and the periodicity-three pattern are missing from power spectra in alphoid regions, in accordance with expectations.
Conclusion: The DFT provides a robust detection method for higher order periodicity. The easily recognizable HOR power spectrum is characterized by a hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of the nmer HOR, i.e., the number n of monomers contained in the consensus HOR.
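As an illustration of the general technique, the sketch below computes a DFT power spectrum of a DNA string. The abstract does not state which symbolic-to-numerical mapping was used, so the binary indicator mapping here (one 0/1 sequence per nucleotide, spectra summed) is an assumption, as are the function names and the toy sequence.

```python
import cmath

def indicator_power_spectrum(sequence, alphabet="ACGT"):
    """Summed power spectrum of the per-nucleotide 0/1 indicator sequences.

    Returns power at frequency indices k = 0 .. N//2 via a direct O(N^2) DFT,
    which is adequate for short illustrative inputs.
    """
    n = len(sequence)
    power = [0.0] * (n // 2 + 1)
    for letter in alphabet:
        indicator = [1.0 if s == letter else 0.0 for s in sequence]
        for k in range(n // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t, x in enumerate(indicator))
            power[k] += abs(coeff) ** 2
    return power

def dominant_period(sequence):
    """Period (in bases) of the strongest non-zero-frequency peak."""
    power = indicator_power_spectrum(sequence)
    k_best = max(range(1, len(power)), key=lambda k: power[k])
    return len(sequence) / k_best

# Toy input: a perfect tandem repeat of a 4-base unit, 16 copies.
toy = "AACG" * 16
spectrum = indicator_power_spectrum(toy)
period = dominant_period(toy)
```

For a perfectly periodic input the power sits only at multiples of the fundamental frequency, which is the equidistant peak pattern the abstract describes; real alphoid arrays would need a fast transform and far longer inputs.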
Genome-Wide Identification of Transcription Start Sites, Promoters and Transcription Factor Binding Sites in E. coli
Despite almost 40 years of molecular genetics research in Escherichia coli, a major fraction of its transcription start sites (TSSs) is still unknown, which limits our understanding of the regulatory circuits that control gene expression in this model organism. RegulonDB (http://regulondb.ccg.unam.mx/) has until now been an entirely bioinformatic project aimed at integrating the genetic regulatory network of E. coli K12. In this work, we extended its aims by generating experimental data at a genome scale on TSSs, promoters and regulatory regions. We implemented a modified 5′ RACE protocol and an unbiased High Throughput Pyrosequencing Strategy (HTPS) that allowed us to map more than 1700 TSSs with high precision. From this collection, about 230 corresponded to previously reported TSSs, which helped us to benchmark both our methodologies and the accuracy of the previous mapping experiments. The other ca. 1500 TSSs mapped belong to about 1000 different genes, many of them with no assigned function. We identified promoter sequences and the types of σ factors that control the expression of about 80% of these genes. As expected, the housekeeping σ70 was the most common type of promoter, followed by σ38. The majority of the putative TSSs were located between 20 and 40 nucleotides from the translational start site. Putative regulatory binding sites for transcription factors were detected upstream of many TSSs. For a few transcripts, riboswitches and small RNAs were found. Several genes also had additional TSSs within the coding region. Unexpectedly, the HTPS experiments revealed extensive antisense transcription, probably for regulatory functions. The new information in RegulonDB, now with more than 2400 experimentally determined TSSs, strengthens the accuracy of promoter prediction, operon structure, and regulatory networks, and provides valuable new information that will facilitate a global understanding of the complex and intricate regulatory network that operates in E. coli.
Replication Fork Polarity Gradients Revealed by Megabase-Sized U-Shaped Replication Timing Domains in Human Cell Lines
In higher eukaryotes, replication program specification in different cell types remains to be fully understood. We show for seven human cell lines that about half of the genome is divided into domains that display a characteristic U-shaped replication timing profile, with early initiation zones at borders and late replication at centers. Significant overlap is observed between U-domains of different cell lines and also with germline replication domains exhibiting an N-shaped nucleotide compositional skew. From the demonstration that the average fork polarity is directly reflected by both the compositional skew and the derivative of the replication timing profile, we argue that the N-shape of this derivative in U-domains supports the existence of large-scale gradients of replication fork polarity in somatic and germline cells. Analysis of chromatin interaction (Hi-C) and chromatin marker data reveals that U-domains correspond to high-order chromatin structural units. We discuss possible models for replication origin activation within U/N-domains. The compartmentalization of the genome into replication U/N-domains provides new insights into the organization of the replication program in the human genome.
Evidence for a Fourteenth mtDNA-Encoded Protein in the Female-Transmitted mtDNA of Marine Mussels (Bivalvia: Mytilidae)
BACKGROUND: A novel feature for animal mitochondrial genomes has been recently established: i.e., the presence of additional, lineage-specific, mtDNA-encoded proteins with functional significance. This feature has been observed in freshwater mussels with doubly uniparental inheritance of mtDNA (DUI). The latter unique system of mtDNA transmission, which also exists in some marine mussels and marine clams, is characterized by one mt genome inherited from the female parent (F mtDNA) and one mt genome inherited from the male parent (M mtDNA). In freshwater mussels, the novel mtDNA-encoded proteins have been shown to be mt genome-specific (i.e., one novel protein for F genomes and one novel protein for M genomes). It has been hypothesized that these novel, F- and M-specific, mtDNA-encoded proteins (and/or other F- and/or M-specific mtDNA sequences) could be responsible for the different modes of mtDNA transmission in bivalves but this remains to be demonstrated. METHODOLOGY/PRINCIPAL FINDINGS: We investigated all complete (or nearly complete) female- and male-transmitted marine mussel mtDNAs previously sequenced for the presence of ORFs that could have functional importance in these bivalves. Our results confirm the presence of a novel F genome-specific mt ORF, of significant length (>100aa) and located in the control region, that most likely has functional significance in marine mussels. The identification of this ORF in five Mytilus species suggests that it has been maintained in the mytilid lineage (subfamily Mytilinae) for ∼13 million years. Furthermore, this ORF likely has a homologue in the F mt genome of Musculista senhousia, a DUI-containing mytilid species in the subfamily Crenellinae. We present evidence supporting the functionality of this F-specific ORF at the transcriptional, amino acid and nucleotide levels. 
CONCLUSIONS/SIGNIFICANCE: Our results offer support for the hypothesis that "novel F genome-specific mitochondrial genes" are involved in key biological functions in bivalve species with DUI