13,153 research outputs found

    Local Binary Patterns as a Feature Descriptor in Alignment-free Visualisation of Metagenomic Data

    Shotgun sequencing has facilitated the analysis of complex microbial communities. However, clustering and visualising these communities without prior taxonomic information is a major challenge. Feature descriptor methods can be used to extract these taxonomic relations from the data. Here, we present a novel approach that couples local binary patterns (LBP) with randomised singular value decomposition (RSVD) and Barnes-Hut t-distributed stochastic neighbor embedding (BH-tSNE) to highlight the underlying taxonomic structure of metagenomic data. The effectiveness of our approach is demonstrated on several simulated metagenomic datasets and one real metagenomic dataset.

    Fibronectin Contributes To Notochord Intercalation In The Invertebrate Chordate, Ciona Intestinalis

    Background: Genomic analysis has upended chordate phylogeny, placing the tunicates as the sister group to the vertebrates. This taxonomic rearrangement raises questions about the emergence of a tunicate/vertebrate ancestor. Results: Characterization of developmental genes uniquely shared by tunicates and vertebrates is one promising approach for deciphering developmental shifts underlying acquisition of novel, ancestral traits. The matrix glycoprotein Fibronectin (FN) has long been considered a vertebrate-specific gene, playing a major instructive role in vertebrate embryonic development. However, the recent computational prediction of an orthologous “vertebrate-like” Fn gene in the genome of a tunicate, Ciona savignyi, challenges this viewpoint, suggesting that Fn may have arisen in the shared tunicate/vertebrate ancestor. Here we verify the presence of a tunicate Fn ortholog. Transgenic reporter analysis was used to characterize a Ciona Fn enhancer driving expression in the notochord. Targeted knockdown in the notochord lineage indicates that FN is required for proper convergent extension. Conclusions: These findings suggest that acquisition of Fn was associated with altered notochord morphogenesis in the vertebrate/tunicate ancestor.

    CLP-based protein fragment assembly

    The paper investigates a novel approach, based on Constraint Logic Programming (CLP), to predict the 3D conformation of a protein via fragment assembly. The fragments are extracted by a preprocessor, also developed for this work, from a database of known protein structures, and the preprocessor clusters and classifies the fragments according to similarity and frequency. The problem of assembling fragments into a complete conformation is mapped to a constraint solving problem and solved using CLP. The constraint-based model uses a Cα-side chain centroid protein model with a medium degree of discretization, which offers efficiency and a good approximation of space filling. The approach adapts existing energy models to the protein representation used and applies a large neighboring search strategy. The results show the feasibility and efficiency of the method. The declarative nature of the solution allows future extensions to be included, e.g., fragments of different sizes for better accuracy. Comment: special issue dedicated to ICLP 201

    From Nonspecific DNA–Protein Encounter Complexes to the Prediction of DNA–Protein Interactions

    ©2009 Gao, Skolnick. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. doi:10.1371/journal.pcbi.1000341
    DNA–protein interactions are involved in many essential biological activities. Because there is no simple mapping code between DNA base pairs and protein amino acids, the prediction of DNA–protein interactions is a challenging problem. Here, we present a novel computational approach for predicting DNA-binding protein residues and DNA–protein interaction modes without knowledge of the protein's specific DNA target sequence. Given the structure of a DNA-binding protein, the method first generates an ensemble of complex structures obtained by rigid-body docking with a nonspecific canonical B-DNA. Representative models are subsequently selected through clustering and ranking by their DNA–protein interfacial energy. Analysis of these encounter complex models suggests that the recognition sites for specific DNA binding are usually favorable interaction sites for the nonspecific DNA probe and that nonspecific DNA–protein interaction modes exhibit some similarity to specific DNA–protein binding modes. Although the method requires as input the knowledge that the protein binds DNA, in benchmark tests it identifies DNA-binding sites better than three previously established methods, which are based on sophisticated machine-learning techniques. We further apply our method to protein structures predicted through modeling and demonstrate that it performs satisfactorily on protein models whose root-mean-square Cα deviation from their native structures is up to 5 Å. This study provides valuable structural insights into how a specific DNA-binding protein interacts with a nonspecific DNA sequence. The similarity between the specific DNA–protein interaction mode and nonspecific interaction modes may reflect an important sampling step in a DNA-binding protein's search for its specific DNA targets.

    Computational analysis of human genomic variants and lncRNAs from sequence data

    High-throughput sequencing technologies have been developed and applied to human genome studies for nearly 20 years. These technologies have provided numerous research applications and have significantly expanded our knowledge of the human genome. In this thesis, computational methods that use sequence data to study human genomic variants and transcripts were evaluated and developed. The term indel covers insertions and deletions, two common types of genomic variant that are widespread in the human genome. Detecting indels in human genomes is a crucial step in diagnosing indel-related genomic disorders and may identify novel indel markers for studying certain diseases. Compared with previous techniques, high-throughput sequencing technologies, especially next-generation sequencing (NGS), enable indels to be detected accurately and efficiently across wide ranges of the genome. In the first part of the thesis, tools with indel calling abilities are evaluated with an assortment of indels and different NGS settings. The results show that the selection of tools and NGS settings significantly affects indel detection, which provides guidance for tool selection and future development. In bioinformatics analysis, an indel's position can be marked inconsistently on the reference genome, so that a single indel may have different but equivalent representations, which causes problems for downstream analysis. This problem is related to the complex sequence context of indels, for example short tandem repeats (STRs), where the same short stretch of nucleotides is amplified. In the second part of the thesis, a novel computational tool, VarSCAT, is described; it has various functions for annotating the sequence context of variants, including ambiguous positions, STRs, and other sequence context features.
Analysis of several high-confidence human variant sets with VarSCAT reveals that a large number of genomic variants, especially indels, have sequence features associated with STRs. In the human genome, not all genes and their transcripts are translated into proteins. Long non-coding ribonucleic acid (lncRNA) is a typical example. Sequence recognition built on machine learning models has improved significantly in recent years. In the last part of the thesis, several machine learning-based lncRNA prediction tools were evaluated on how well they predict the coding potential of transcripts. The results suggest that tools based on deep learning identify lncRNAs best.
    Finnish abstract (translated), "Ihmisen genomivarianttien ja lncRNA:iden laskennallinen analyysi sekvenssiaineistosta": High-throughput sequencing technologies have been developed and applied to human genome research for nearly 20 years. These technologies have enabled wide-ranging study of the human genome and significantly increased our knowledge of it. In this thesis, computational methods that use sequence data to study human genomic variants and transcripts were evaluated and developed. Indel is a collective term for insertion and deletion variants, which occur throughout the genome. Identifying indels is crucial for diagnosing genetic abnormalities and for finding new indel markers associated with various diseases. Compared with earlier technologies, high-throughput sequencing technologies, especially next-generation sequencing (NGS), allow indels to be detected more accurately and efficiently across wider genomic regions. In the first part of the thesis, computational tools for indel calling were evaluated using a wide range of indels and different NGS settings. The results showed that the choice of tools and NGS settings significantly affected indel detection, and they can therefore guide tool selection and development. In bioinformatic analysis, the position of the same indel can be marked at different places on the reference genome, which can cause problems for downstream analysis, such as the evaluation of indel calls. This problem is related to sequence context, because variants can fall within short tandem repeats (STRs), where the same short nucleotide stretch has been duplicated. In the second part of the thesis, the computational tool VarSCAT was developed, with functions for examining, among other things, ambiguous positions, flanking regions, and STR regions. Analysis of high-confidence human variant sets with VarSCAT revealed that the features of many genetic variants, especially indels, are associated with STR regions. Not all human genes and their gene products, such as non-coding RNAs (lncRNA), are translated into protein. Machine learning methods and sequence recognition have improved considerably in recent years. In the last part of the thesis, the predictions of several machine learning-based lncRNA prediction tools were evaluated. The results suggest that tools based on deep learning identify lncRNAs best.
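
    The "different but equivalent representations" problem mentioned in the abstract above can be illustrated with a small sketch. The Python below is not VarSCAT's implementation; the function name `equivalent_deletion_starts` and the toy reference string are hypothetical, chosen only for illustration. It assumes 0-based coordinates and a simple deletion, and lists every start position at which removing the same number of bases yields an identical sequence.

```python
def equivalent_deletion_starts(ref: str, start: int, length: int) -> range:
    """Return all 0-based start positions whose deletion of `length` bases
    gives the same sequence as deleting ref[start:start + length].

    Hypothetical helper for illustration; not taken from VarSCAT.
    """
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion must lie within the reference")

    # Shift left while the base entering the window on the left equals
    # the base leaving it on the right: the resulting sequence is unchanged.
    left = start
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1

    # Shift right under the mirrored condition.
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1

    return range(left, right + 1)


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return the reference with ref[start:start + length] removed."""
    return ref[:start] + ref[start + length:]


if __name__ == "__main__":
    ref = "GGCACACATT"  # toy reference containing a (CA)3 tandem repeat
    starts = list(equivalent_deletion_starts(ref, 4, 2))
    print(starts)
    print({apply_deletion(ref, s, 2) for s in starts})
```

    Inside the repeat, a two-base deletion can be placed at several positions without changing the resulting sequence, which is why callers that report different positions may still describe the same variant. Reporting the leftmost position of the returned range corresponds to the common left-alignment convention.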

    In silico characterization of the neural alpha tubulin gene promoter of the sea urchin embryo Paracentrotus lividus by phylogenetic footprinting

    During Paracentrotus lividus sea urchin embryo development, one alpha and one beta tubulin gene are expressed specifically in the neural cells, and they are an early end output of the gene regulatory network that specifies neural commitment. In this paper we have used a comparative genomics approach to identify conserved regulatory elements in the P. lividus neural alpha tubulin gene. To this end, we first isolated a genomic clone containing the entire gene plus 4.5 kb of 5′ upstream sequence. We then showed by gene transfer experiments that its non-coding region drives a spatiotemporal expression pattern corresponding substantially to that of the endogenous gene. In addition, we identified the orthologous S. purpuratus alpha tubulin gene by genome and EST sequence analysis, and we propose a revised annotation of some tubulin family members. Moreover, using computational techniques, we delineate at least three putative regulatory regions, located both in the upstream region and in the first intron, that contain putative binding sites for the Forkhead and Nkx transcription factor families.

    Characterization of the prohormone complement in cattle using genomic libraries and cleavage prediction approaches

    Background: Neuropeptides are cell-to-cell signalling molecules that regulate many critical biological processes, including development, growth and reproduction. These peptides result from the complex processing of prohormone proteins, making their characterization both challenging and resource demanding. In fact, only 42 neuropeptide genes have been empirically confirmed in cattle. Neuropeptide research using high-throughput technologies such as microarray and mass spectrometry requires accurate annotation of prohormone genes and products. However, the annotation and associated prediction efforts, when based solely on sequence homology to species with known neuropeptides, can be problematic. Results: Complementary bioinformatic resources were integrated in the first survey of the cattle neuropeptide complement. Functional neuropeptide characterization was based on gene expression profiles from microarray experiments. Once a gene is identified, knowledge of the enzymatic processing allows determination of the final products. Prohormone cleavage sites were predicted using several complementary cleavage prediction models and validated against known cleavage sites in cattle and other species. Our bioinformatics approach identified 92 cattle prohormone genes, with 84 of these supported by expressed sequence tags. Notable findings included an absence of evidence for a cattle relaxin 1 gene and evidence for a cattle galanin-like peptide pseudogene. The prohormone processing predictions are likely accurate, as the mammalian proprotein convertase enzymes, except for proprotein convertase subtilisin/kexin type 9, were also identified. Microarray analysis revealed the differential expression of 21 prohormone genes in the liver associated with nutritional status and 8 prohormone genes in the placentome of embryos generated using different reproductive techniques. The neuropeptide cleavage prediction models performed exceptionally well, correctly predicting cleavage in more than 86% of the prohormone sequence positions. Conclusion: Integrating bioinformatics tools, database resources and gene expression information substantially increased the number of cattle prohormone genes identified and gave insights into the expression profiles of neuropeptide genes. Approximately 20 prohormones with no empirical evidence were detected, and the prohormone cleavage sites were predicted with high accuracy. Most prohormones were supported by expressed sequence tag data, and many were differentially expressed across nutritional and reproductive conditions. The complete set of cattle prohormone sequences identified and the cleavage prediction approaches are available at http://neuroproteomics.scs.uiuc.edu/neuropred.html.