
    Clustering of Imperfect Transcripts Using a Novel Similarity Measure

    There has been a surge of interest in the last several years in methods for automatically generating content indices for multimedia documents, particularly video and audio documents. As a result, there is much interest in methods for analyzing transcribed documents from audio and video broadcasts and from telephone conversations and messages. The present paper deals with such an analysis by presenting a clustering technique to partition a set of transcribed documents into different meaningful topics. Our method determines the intersection between matching transcripts, evaluates the information contribution of each transcript, assesses the information closeness of overlapping words, and calculates similarity based on the chi-square method. The main novelty of our method lies in the proposed similarity measure, which is designed to withstand the imperfections of transcribed documents. Preliminary experimental results using an archive of transcribed news broadcasts demonstrate the efficacy of the proposed methodology.

    Utilizing microarray spot characteristics to improve cross-species hybridization results

    Cross-species hybridization (CSH), i.e., the hybridization of a (target) species RNA to a DNA microarray that represents another (reference) species, is often used to study species diversity. However, filtration of CSH data has to be applied to extract valid information. We present a novel approach to filtering the CSH data, which utilizes spot characteristics (SCs) of image-quantification data from scanned spotted cDNA microarrays. Five SCs that were affected by sequence similarity between probe and target sequences were identified (designated as BS-SCs). Filtration by all five BS-SC thresholds demonstrated improved clustering for two of the three examined experiments, suggesting that BS-SCs may serve for filtration of data obtained by CSH, to improve the validity of the results. This CSH data-filtration approach could become a promising tool for studying a variety of species, especially when no genomic information is available for the target species.

    Small Open Reading Frames, Non-Coding RNAs and Repetitive Elements in Bradyrhizobium japonicum USDA 110

    Small open reading frames (sORFs) and genes for non-coding RNAs are poorly investigated components of most genomes. Our analysis of 1391 ORFs recently annotated in the soybean symbiont Bradyrhizobium japonicum USDA 110 revealed that 78% of them contain fewer than 80 codons. Twenty-one of these sORFs are conserved in or outside Alphaproteobacteria and most of them are similar to genes found in transposable elements, in line with their broad distribution. Stabilizing selection was demonstrated for sORFs with proteomic evidence and bll1319_ISGA which is conserved at the nucleotide level in 16 alphaproteobacterial species, 79 species from other taxa and 49 other Proteobacteria. Further, we used Northern blot hybridization to validate ten small RNAs (BjsR1 to BjsR10) belonging to new RNA families. We found that BjsR1 and BjsR3 have homologs outside the genus Bradyrhizobium, and BjsR5, BjsR6, BjsR7, and BjsR10 have up to four imperfect copies in Bradyrhizobium genomes. BjsR8, BjsR9, and BjsR10 are present exclusively in nodules, while the other sRNAs are also expressed in liquid cultures. We also found that the level of BjsR4 decreases after exposure to tellurite and iron, and this down-regulation contributes to survival under high iron conditions. Analysis of additional small RNAs overlapping with 3'-UTRs revealed two new repetitive elements named Br-REP1 and Br-REP2. These REP elements may play roles in genomic plasticity and gene regulation and could be useful for strain identification by PCR-fingerprinting. Furthermore, we studied two potential toxin genes in the symbiotic island and confirmed toxicity of the yhaV homolog bll1687 but not of the newly annotated higB homolog blr0229_ISGA in E. coli. Finally, we revealed transcription interference resulting in an antisense RNA complementary to blr1853, a gene induced in symbiosis. The presented results expand our knowledge of sORFs, non-coding RNAs and repetitive elements in B. japonicum and related bacteria.

    A Novel Putative miRNA Target Enhancer Signal

    It is known that miRNA target sites are very short and that the effect of the miRNA-target site interaction alone appears to be unspecific. Recent experiments suggest that further context signals are involved in miRNA target site recognition and regulation. Here, we present a novel GC-rich RNA motif downstream of experimentally supported miRNA target sites in human mRNAs with no similarity to previously reported functional motifs. We demonstrate that the novel motif can be found in at least one third of all transcripts regulated by miRNAs. Furthermore, we show that motif occurrence and the frequency of miRNA target sites as well as the stability of their duplex structures correlate. The finding that the novel motif is significantly associated with miRNA target sites suggests a functional role of the motif in miRNA target site biology. Beyond that, the novel motif has the potential to improve the prediction of miRNA target sites significantly.

    Gene Set Enrichment and Projection: A Computational Tool for Knowledge Discovery in Transcriptomes

    Explaining the mechanism behind a genetic disease involves two phases: collecting and analyzing data associated with the disease, then interpreting those data in the context of biological systems. The objective of this dissertation was to develop a method of integrating complementary datasets surrounding any single biological process, with the goal of presenting the response to a signal in terms of a set of downstream biological effects. This dissertation specifically tests the hypothesis that computational projection methods overlaid with domain expertise can direct research towards relevant systems-level signals underlying complex genetic disease. To this end, I developed a software algorithm named Geneset Enrichment and Projection Displays (GSEPD) that can visualize multidimensional genetic expression to identify the biologically relevant gene sets that are altered in response to a biological process. This dissertation highlights a problem of data interpretation facing the medical research community, and shows how computational sciences can help. By bringing annotation and expression datasets together, a new analytical and software method was produced that helps unravel complicated experimental and biological data. The dissertation shows four coauthored studies in which experts in their fields sought to annotate the functional significance of a gene-centric experiment. Using GSEPD to show inherently high dimensional data as a simple colored graph, a subspace vector projection directly calculated how each sample behaves like test conditions. The end-user medical researcher understands their data as a series of somewhat-independent subsystems, and GSEPD provides a dimensionality reduction for high throughput experiments of limited sample size. Gene Ontology analyses are accessible on a sample-to-sample level, and this work highlights not just the expected biological systems, but many annotated results available in vast online databases.

    Inferring Noncoding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering

    The RFAM database defines families of ncRNAs by means of sequence similarities that are sufficient to establish homology. In some cases, such as microRNAs and box H/ACA snoRNAs, functional commonalities define classes of RNAs that are characterized by structural similarities, and typically consist of multiple RNA families. Recent advances in high-throughput transcriptomics and comparative genomics have produced very large sets of putative noncoding RNAs and regulatory RNA signals. For many of them, evidence for stabilizing selection acting on their secondary structures has been derived, and at least approximate models of their structures have been computed. The overwhelming majority of these hypothetical RNAs cannot be assigned to established families or classes. We present here a structure-based clustering approach that is capable of extracting putative RNA classes from genome-wide surveys for structured RNAs. The LocARNA (local alignment of RNA) tool implements a novel variant of the Sankoff algorithm that is sufficiently fast to deal with several thousand candidate sequences. The method is also robust against false positive predictions, i.e., a contamination of the input data with unstructured or nonconserved sequences. We have successfully tested the LocARNA-based clustering approach on the sequences of the RFAM-seed alignments. Furthermore, we have applied it to a previously published set of 3,332 predicted structured elements in the Ciona intestinalis genome (Missal K, Rose D, Stadler PF (2005) Noncoding RNAs in Ciona intestinalis. Bioinformatics 21 (Supplement 2): i77–i78). In addition to recovering, e.g., tRNAs as a structure-based class, the method identifies several RNA families, including microRNA and snoRNA candidates, and suggests several novel classes of ncRNAs for which to date no representative has been experimentally characterized

    Are we there yet? Reliably estimating the completeness of plant genome sequences

    Genome sequencing is becoming cheaper and faster thanks to the introduction of next-generation sequencing techniques. Dozens of new plant genome sequences have been released in recent years, ranging from small to gigantic repeat-rich or polyploid genomes. Most genome projects have a dual purpose: delivering a contiguous, complete genome assembly and creating a full catalog of correctly predicted genes. Frequently, the completeness of a species' gene catalog is measured using a set of marker genes that are expected to be present. This expectation can be defined along an evolutionary gradient, ranging from highly conserved genes to species-specific genes. Large-scale population resequencing studies have revealed that gene space is fairly variable even between closely related individuals, which limits the definition of the expected gene space, and, consequently, the accuracy of estimates used to assess genome and gene space completeness. We argue that, based on the desired applications of a genome sequencing project, different completeness scores for the genome assembly and/or gene space should be determined. Using examples from several dicot and monocot genomes, we outline some pitfalls and recommendations regarding methods to estimate completeness during different steps of genome assembly and annotation

    A Quantitative Methodology for Vetting Dark Network Intelligence Sources for Social Network Analysis

    Social network analysis (SNA) is used by the DoD to describe and analyze social networks, leading to recommendations for operational decisions. However, social network models are constructed from various information sources of indeterminate reliability. Inclusion of unreliable information can lead to incorrect models resulting in flawed analysis and decisions. This research develops a methodology to assist the analyst by quantitatively identifying and categorizing information sources so that determinations on including or excluding provided data can be made. This research pursued three main thrusts. First, it consolidated binary similarity measures to determine the concordance of social network information sources and developed a methodology to select suitable measures depending on application considerations. Second, a methodology was developed to assess the validity of individual sources of social network data. This methodology utilized pairwise source comparisons to measure information sources' concordance and a weighting schema to account for sources' unique perspectives of the underlying social network. Finally, the developed methodology was tested over a variety of generated networks with varying parameters in a design of experiments (DOE) paradigm. Various factors relevant to conditions faced by SNA analysts potentially employing this methodology were examined. The DOE was comprised of a 2^4 full factorial design augmented with a nearly orthogonal Latin hypercube. A linear model was constructed using quantile regression to mitigate the non-normality of the error terms.