18 research outputs found

    Large expert-curated database for benchmarking document similarity detection in biomedical literature search

    Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that covers a variety of research fields, against which newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article(s). The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performance. Additionally, we found that each of these methods tends to produce a distinct collection of recommended articles, suggesting that a hybrid method may be required to capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research. Peer reviewed
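
As an illustration of one of the baseline methods named in the abstract above, the following is a minimal sketch of Okapi BM25 scoring. It is not the consortium's implementation: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for illustration).
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    k1 and b are common default values, assumed here, not taken from the source.
    """
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Given a seed title or abstract as `query` and candidate titles or abstracts as `documents`, sorting the candidates by descending score would give a ranked recommendation list of the kind the benchmark is meant to evaluate.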

    Evolutionarily significant A-to-I RNA editing events originated through G-to-A mutations in primates

    Background: Recent studies have revealed thousands of A-to-I RNA editing events in primates, but the origination and general functions of these events are not well addressed. Results: Here, we perform a comparative editome study in human and rhesus macaque and uncover a substantial proportion of macaque A-to-I editing sites that are genomically polymorphic in some animals or encoded as non-editable nucleotides in human. These recent gains and losses of RNA editing through DNA point mutation are significantly more prevalent than expected from the nearby regions. Ancestral state analyses further demonstrate that an increase in recent gains of editing events contributes to the over-representation, with G-to-A mutation sites as favorable locations for the origination of robust A-to-I editing events. Population genetics analyses of the focal editing sites further reveal that a portion of these young editing events are evolutionarily significant, indicating general functional relevance for at least a fraction of these sites. Conclusions: Overall, we report a list of A-to-I editing events that recently originated through G-to-A mutations in primates, representing a valuable resource to investigate the features and evolutionary significance of A-to-I editing events at the population and species levels. The unique subset of primate editome also illuminates the general functions of RNA editing by connecting it to particular gene regulatory processes, based on the characterized outcome of a gene regulatory level in different individuals or primate species with or without these editing events.

    Emergence, Retention and Selection: A Trilogy of Origination for Functional De Novo Proteins from Ancestral LncRNAs in Primates

    While some human-specific protein-coding genes have been proposed to originate from ancestral lncRNAs, the transition process remains poorly understood. Here we identified 64 hominoid-specific de novo genes and report a mechanism for the origination of functional de novo proteins from ancestral lncRNAs with precise splicing structures and specific tissue expression profiles. Whole-genome sequencing of dozens of rhesus macaque animals revealed that these lncRNAs are generally not more selectively constrained than other lncRNA loci. The existence of these newly-originated de novo proteins is also not beyond anticipation under neutral expectation, as they generally have a longer theoretical lifespan than their current age, due to their GC-rich sequence property enabling stable ORFs with a lower chance of nonsense mutations. Interestingly, although the emergence and retention of these de novo genes are likely driven by neutral forces, a population genetics study in 67 human individuals and 82 macaque animals revealed signatures of purifying selection on these genes specifically in the human population, indicating that a proportion of these newly-originated proteins are already functional in human. We thus propose a mechanism for the creation of functional de novo proteins from ancestral lncRNAs during primate evolution, which may contribute to human-specific genetic novelties by taking advantage of existing genomic contexts.

    De novo proteins originate from lncRNA precursors irrespective of their functional status at the RNA level.

    (A) Flow chart showing the computational pipeline of lncRNAome identification in rhesus macaque. (B, C) For lncRNAs identified in this study, the distribution of distances between the 5’ end of lncRNAs and the nearest annotated transcript start site (TSS) (B) or CpG island (C) is shown. The numbers of TSS and CpG islands within 1 kb of the transcripts are shown in the inset histograms. Annotated genes and randomly selected intergenic sites are also shown as positive and negative controls, respectively. (D) On the basis of population genetics data in rhesus macaque, the distributions of π for synonymous sites (Syn Sites), non-synonymous sites (Nonsyn Sites), all lncRNAs, lncRNA precursors and non-coding genes (Functional) are summarized in boxplots. NS: not significant; **: p-value < 0.01.

    De novo protein-coding genes originating from lncRNAs.

    (A) Computational pipeline for ab initio identification and meta-analysis of de novo genes in the hominoid lineage. (B) Number of de novo genes on the phylogenetic tree, with the branch length proportional to the divergence time. (C) Stacked histogram showing the percentage of de novo gene orthologs that also show expression in chimpanzee or rhesus macaque. (D) Boxplot showing relative expression levels of the transcripts and their nearby regions corresponding to de novo genes (orthologs) in human (chimpanzee or macaque). The nearby regions are defined as upstream and downstream regions with equal length to the corresponding genes. For each region, the relative expression was calculated by normalizing the expression level of this region by the sum of the expression levels of the genic region and the nearby regions. (E) Percentage of splicing junctions with supporting RNA-Seq reads in human, chimpanzee and rhesus macaque. (F) For each pair of tissues, Spearman correlation coefficients were computed separately, and the extent of tissue-specific differences in de novo gene expression is shown (based on the color scale). Dotted lines highlight parallel comparisons between two different species.
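
The normalization described for panel (D) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name, the dictionary keys, and the handling of an all-zero locus are assumptions, and the expression unit is left unspecified because the caption does not state it.

```python
def relative_expression(genic, upstream, downstream):
    """Normalize each region's expression by the locus total.

    Each argument is the expression level of one region: the genic region
    and the two equal-length flanking regions. The unit is whatever the
    input uses; it cancels out in the ratio.
    """
    total = genic + upstream + downstream
    if total == 0:
        # Assumption: a locus with no expression anywhere is left undefined.
        return None
    return {
        "genic": genic / total,
        "upstream": upstream / total,
        "downstream": downstream / total,
    }
```

Under this reading, a value close to 1 for the genic region means that expression is concentrated in the transcript itself and not in its flanks.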

    Profiling of polymorphisms in human and rhesus macaque.

    (A) Comparison of human polymorphism sites profiled in this study with those in the 1000 Genomes Project. (B) The sequencing coverage of whole-genome sequencing from one macaque animal and of the targeted re-sequencing of 82 macaque animals is summarized in a green barplot and in heatmaps inside the Circos map, respectively. The depth of sequencing coverage is proportional to the color depth. Black rectangles outside the colored chromosome blocks represent the genomic locations of macaque orthologous regions of human de novo genes. The bottom panel illustrates the sequencing details of one region of interest. (C) Cumulative frequency of mean sequencing coverage on different genic regions of de novo genes is shown. Intergenic regions: 1-kb regions upstream and downstream of the gene. (D, E) Venn diagrams showing the distributions of macaque polymorphism sites identified by whole-genome sequencing and targeted re-sequencing, in terms of polymorphism sites (D) and genotypes (E).

    Emergence of human de novo proteins from GC-rich lncRNA precursors.

    (A) GC contents for randomly selected intergenic regions, all lncRNAs and lncRNA precursors in rhesus macaque are summarized in boxplots. (B) GC contents of different genomic regions are shown for de novo genes in human, as well as the orthologous non-coding regions in chimpanzee and rhesus macaque. For lncRNA precursors, the pseudo-CDS and pseudo-UTR regions were defined according to the orthologous relationship with the corresponding CDS and UTR regions of human de novo proteins. (C) GC contents for CDS regions of RefSeq proteins and de novo proteins in human are summarized in boxplots. A: all de novo genes, Y: younger de novo genes, O: older de novo genes. (D) Boxplot showing the distribution of fragile codon composition of de novo genes and RefSeq proteins in human. (E) Boxplot showing the distribution of half-life time of de novo genes and RefSeq proteins in human. (F) Dot plot showing the survival probability of the de novo ORFs. The probability of 0.05 is marked by a red dashed line.