68 research outputs found

    Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection

    Linguistically diverse datasets are critical for training and evaluating robust machine learning systems, but data collection is a costly process that often requires experts. Crowdsourcing the process of paraphrase generation is an effective means of expanding natural language datasets, but there has been limited analysis of the trade-offs that arise when designing tasks. In this paper, we present the first systematic study of the key factors in crowdsourcing paraphrase collection. We consider variations in instructions, incentives, data domains, and workflows. We manually analyzed paraphrases for correctness, grammaticality, and linguistic diversity. Our observations provide new insight into the trade-offs between accuracy and diversity in crowd responses that arise as a result of task design, providing guidance for future paraphrase generation procedures. Comment: Published at ACL 201

    AGEMAP: A Gene Expression Database for Aging in Mice

    We present the AGEMAP (Atlas of Gene Expression in Mouse Aging Project) gene expression database, which is a resource that catalogs changes in gene expression as a function of age in mice. The AGEMAP database includes expression changes for 8,932 genes in 16 tissues as a function of age. We found great heterogeneity in the amount of transcriptional changes with age in different tissues. Some tissues displayed large transcriptional differences in old mice, suggesting that these tissues may contribute strongly to organismal decline. Other tissues showed few or no changes in expression with age, indicating strong levels of homeostasis throughout life. Based on the pattern of age-related transcriptional changes, we found that tissues could be classified into one of three aging processes: (1) a pattern common to neural tissues, (2) a pattern for vascular tissues, and (3) a pattern for steroid-responsive tissues. We observed that different tissues age in a coordinated fashion in individual mice, such that certain mice exhibit rapid aging, whereas others exhibit slow aging for multiple tissues. Finally, we compared the transcriptional profiles for aging in mice to those from humans, flies, and worms. We found that genes involved in the electron transport chain show common age regulation in all four species, indicating that these genes may be exceptionally good markers of aging. However, we saw no overall correlation of age regulation between mice and humans, suggesting that aging processes in mice and humans may be fundamentally different.

    CODA: Accurate Detection of Functional Associations between Proteins in Eukaryotic Genomes Using Domain Fusion

    Background: In order to understand how biological systems function it is necessary to determine the interactions and associations between proteins. Gene fusion prediction is one approach to detection of such functional relationships. Its use is, however, known to be problematic in higher eukaryotic genomes due to the presence of large homologous domain families. Here we introduce CODA (Co-Occurrence of Domains Analysis), a method to predict functional associations based on the gene fusion idiom. Methodology/Principal Findings: We apply a novel scoring scheme which takes account of the genome-specific size of homologous domain families involved in fusion to improve accuracy in predicting functional associations. We show that CODA is able to accurately predict functional similarities in human with comparison to state-of-the-art methods and show that different methods can be complementary. CODA is used to produce evidence that a currently uncharacterised human protein may be involved in pathways related to depression and that another is involved in DNA replication. Conclusions/Significance: The relative performance of different gene fusion methodologies has not previously been explored. We find that they are largely complementary, with different methods being more or less appropriate in different genomes. Our method is the only one currently available for download and can be run on an arbitrary dataset by the user. The CODA software and datasets are freely available from ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/v6.1.0/CODA/. Predictions are also available via web services from http://funcnet.eu/

    Just how versatile are domains?

    Background: Creating new protein domain arrangements is a frequent mechanism of evolutionary innovation. While some domains always form the same combinations, others form many different arrangements. This ability, which is often referred to as versatility or promiscuity of domains, its a random evolutionary model in which a domain's promiscuity is based on its relative frequency of domains. Results: We show that there is a clear relationship across genomes between the promiscuity of a given domain and its frequency. However, the strength of this relationship differs for different domains. We thus redefine domain promiscuity by defining a new index, DVI ("domain versatility index"), which eliminates the effect of domain frequency. We explore links between a domain's versatility, when unlinked from abundance, and its biological properties. Conclusion: Our results indicate that domains occurring as single domain proteins and domains appearing frequently at protein termini have a higher DVI. This is consistent with previous observations that the evolution of domain re-arrangements is primarily driven by fusion of pre-existing arrangements and single domains as well as loss of domains at protein termini. Furthermore, we studied the link between domain age, defined as the first appearance of a domain in the species tree, and the DVI. Contrary to previous studies based on domain promiscuity, it seems as if the DVI is age independent. Finally, we find that contrary to previously reported findings, versatility is lower in Eukaryotes. In summary, our measure of domain versatility indicates that a random attachment process is sufficient to explain the observed distribution of domain arrangements and that several views on domain promiscuity need to be revised.

    Molecular evolution of the LNX gene family

    Background: LNX (Ligand of Numb Protein-X) proteins typically contain an amino-terminal RING domain adjacent to either two or four PDZ domains, a domain architecture that is unique to the LNX family. LNX proteins function as E3 ubiquitin ligases and their domain organisation suggests that their ubiquitin ligase activity may be targeted to specific substrates or subcellular locations by PDZ domain-mediated interactions. Indeed, numerous interaction partners for LNX proteins have been identified, but the in vivo functions of most family members remain largely unclear. Results: To gain insights into their function we examined the phylogenetic origins and evolution of the LNX gene family. We find that a LNX1/LNX2-like gene arose in an early metazoan lineage by gene duplication and fusion events that combined a RING domain with four PDZ domains. These PDZ domains are closely related to the four carboxy-terminal domains from multiple PDZ domain containing protein-1 (MUPP1). Duplication of the LNX1/LNX2-like gene and subsequent loss of PDZ domains appears to have generated a gene encoding a LNX3/LNX4-like protein, with just two PDZ domains. This protein has novel carboxy-terminal sequences that include a potential modular LNX3 homology domain. The two ancestral LNX genes are present in some, but not all, invertebrate lineages. They were, however, maintained in the vertebrate lineage, with further duplication events giving rise to five LNX family members in most mammals. In addition, we identify novel interactions of LNX1 and LNX2 with three known MUPP1 ligands using yeast two-hybrid assays. This demonstrates conservation of binding specificity between LNX and MUPP1 PDZ domains. Conclusions: The LNX gene family has an early metazoan origin with a LNX1/LNX2-like protein likely giving rise to a LNX3/LNX4-like protein through the loss of PDZ domains. The absence of LNX orthologs in some lineages indicates that LNX proteins are not essential in invertebrates. In contrast, the maintenance of both ancestral LNX genes in the vertebrate lineage suggests the acquisition of essential vertebrate-specific functions. The revelation that the LNX PDZ domains are phylogenetically related to domains in MUPP1, and have common binding specificities, suggests that LNX and MUPP1 may have similarities in their cellular functions.