80 research outputs found

    SQLformer: Deep Auto-Regressive Query Graph Generation for Text-to-SQL Translation

    Full text link
    In recent years, there has been growing interest in text-to-SQL translation, which is the task of converting natural language questions into executable SQL queries. This technology is important for its potential to democratize data extraction from databases. However, some of its key hurdles include domain generalisation, which is the ability to adapt to previously unseen databases, and alignment of natural language questions with the corresponding SQL queries. To overcome these challenges, we introduce SQLformer, a novel Transformer architecture specifically crafted to perform text-to-SQL translation tasks. Our model predicts SQL queries as abstract syntax trees (ASTs) in an autoregressive way, incorporating structural inductive bias in the encoder and decoder layers. This bias, guided by database table and column selection, aids the decoder in generating SQL query ASTs represented as graphs in a Breadth-First Search canonical order. Comprehensive experiments illustrate the state-of-the-art performance of SQLformer in the challenging text-to-SQL Spider benchmark. Our implementation is available at https://github.com/AdrianBZG/SQLformerComment: 11 pages, 4 figure

    Unsupervised Fact Verification by Language Model Distillation

    Full text link
    Unsupervised fact verification aims to verify a claim using evidence from a trustworthy knowledge base without any kind of data annotation. To address this challenge, algorithms must produce features for every claim that are both semantically meaningful, and compact enough to find a semantic alignment with the source information. In contrast to previous work, which tackled the alignment problem by learning over annotated corpora of claims and their corresponding labels, we propose SFAVEL (Self-supervised Fact Verification via Language Model Distillation), a novel unsupervised framework that leverages pre-trained language models to distil self-supervised features into high-quality claim-fact alignments without the need for annotations. This is enabled by a novel contrastive loss function that encourages features to attain high-quality claim and evidence alignments whilst preserving the semantic relationships across the corpora. Notably, we present results that achieve a new state-of-the-art on the standard FEVER fact verification benchmark (+8% accuracy) with linear evaluation

    Expression of multiple horizontally acquired genes is a hallmark of both vertebrate and invertebrate genomes.

    Get PDF
    BACKGROUND: A fundamental concept in biology is that heritable material, DNA, is passed from parent to offspring, a process called vertical gene transfer. An alternative mechanism of gene acquisition is through horizontal gene transfer (HGT), which involves movement of genetic material between different species. HGT is well-known in single-celled organisms such as bacteria, but its existence in higher organisms, including animals, is less well established, and is controversial in humans. RESULTS: We have taken advantage of the recent availability of a sufficient number of high-quality genomes and associated transcriptomes to carry out a detailed examination of HGT in 26 animal species (10 primates, 12 flies and four nematodes) and a simplified analysis in a further 14 vertebrates. Genome-wide comparative and phylogenetic analyses show that HGT in animals typically gives rise to tens or hundreds of active 'foreign' genes, largely concerned with metabolism. Our analyses suggest that while fruit flies and nematodes have continued to acquire foreign genes throughout their evolution, humans and other primates have gained relatively few since their common ancestor. We also resolve the controversy surrounding previous evidence of HGT in humans and provide at least 33 new examples of horizontally acquired genes. CONCLUSIONS: We argue that HGT has occurred, and continues to occur, on a previously unsuspected scale in metazoans and is likely to have contributed to biochemical diversification during animal evolution.This work was supported by the European Research Council (AdG233232).This is the final published version. It first appeared at http://genomebiology.com/2015/16/1/50

    Translating synthetic natural language to database queries with a polyglot deep learning framework

    Get PDF
    Abstract: The number of databases as well as their size and complexity is increasing. This creates a barrier to use especially for non-experts, who have to come to grips with the nature of the data, the way it has been represented in the database, and the specific query languages or user interfaces by which data are accessed. These difficulties worsen in research settings, where it is common to work with many different databases. One approach to improving this situation is to allow users to pose their queries in natural language. In this work we describe a machine learning framework, Polyglotter, that in a general way supports the mapping of natural language searches to database queries. Importantly, it does not require the creation of manually annotated data for training and therefore can be applied easily to multiple domains. The framework is polyglot in the sense that it supports multiple different database engines that are accessed with a variety of query languages, including SQL and Cypher. Furthermore Polyglotter supports multi-class queries. Good performance is achieved on both toy and real databases, as well as a human-annotated WikiSQL query set. Thus Polyglotter may help database maintainers make their resources more accessible

    Empirical Bayes method for reducing false discovery rates of correlation matrices with block diagonal structure

    Get PDF
    Background:\textbf{Background:} Correlation matrices are important in inferring relationships and networks between regulatory or signalling elements in biological systems. With currently available technology sample sizes for experiments are typically small, meaning that these correlations can be difficult to estimate. At a genome-wide scale estimation of correlation matrices can also be computationally demanding. Results:\textbf{Results:} We develop an empirical Bayes approach to improve covariance estimates for gene expression, where we assume the covariance matrix takes a block diagonal form. Our method shows lower false discovery rates than existing methods on simulated data. Applied to a real data set from Bacillus subtilis\textit{Bacillus subtilis} we demonstrate it's ability to detecting known regulatory units and interactions between them. Conclusions:\textbf{Conclusions:} We demonstrate that, compared to existing methods, our method is able to find significant covariances and also to control false discovery rates, even when the sample size is small (nn=10). The method can be used to find potential regulatory networks, and it may also be used as a pre-processing step for methods that calculate, for example, partial correlations, so enabling the inference of the causal and hierarchical structure of the networks.CP is funded by the Engineering and Physical Sciences Research Council (EPSRC) doctoral program

    esyN: network building, sharing and publishing.

    Get PDF
    The construction and analysis of networks is increasingly widespread in biological research. We have developed esyN ("easy networks") as a free and open source tool to facilitate the exchange of biological network models between researchers. esyN acts as a searchable database of user-created networks from any field. We have developed a simple companion web tool that enables users to view and edit networks using data from publicly available databases. Both normal interaction networks (graphs) and Petri nets can be created. In addition to its basic tools, esyN contains a number of logical templates that can be used to create models more easily. The ability to use previously published models as building blocks makes esyN a powerful tool for the construction of models and network graphs. Users are able to save their own projects online and share them either publicly or with a list of collaborators. The latter can be given the ability to edit the network themselves, allowing online collaboration on network construction. esyN is designed to facilitate unrestricted exchange of this increasingly important type of biological information. Ultimately, the aim of esyN is to bring the advantages of Open Source software development to the construction of biological networks.This is the final published version of the paper. It's also available from PLOS at http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0106035

    The InterMine Android app: Cross-organism genomic data in your pocket.

    Get PDF
    InterMine is a data integration and analysis software system that has been used to create both inter-connected and stand-alone biological databases for the analysis of large and complex biological data sets. Together, the InterMine databases provide access to extensive data across multiple organisms. To provide more convenient access to these data from Android mobile devices, we have developed the InterMine app, an application that can be run on any Android mobile phone or tablet. The InterMine app provides a single interface for data access, search and exploration of the InterMine databases. It can be used to retrieve information on genes and gene lists, and their relatives across species. Simple searches can be used to access a range of data about a specific gene, while links to the InterMine databases provide access to more detailed report pages and gene list analysis tools. The InterMine app thus facilitates rapid exploration of genes across multiple organisms and kinds of data

    The impact of quantitative optimization of hybridization conditions on gene expression analysis.

    Get PDF
    BACKGROUND: With the growing availability of entire genome sequences, an increasing number of scientists can exploit oligonucleotide microarrays for genome-scale expression studies. While probe-design is a major research area, relatively little work has been reported on the optimization of microarray protocols. RESULTS: As shown in this study, suboptimal conditions can have considerable impact on biologically relevant observations. For example, deviation from the optimal temperature by one degree Celsius lead to a loss of up to 44% of differentially expressed genes identified. While genes from thousands of Gene Ontology categories were affected, transcription factors and other low-copy-number regulators were disproportionately lost. Calibrated protocols are thus required in order to take full advantage of the large dynamic range of microarrays.For an objective optimization of protocols we introduce an approach that maximizes the amount of information obtained per experiment. A comparison of two typical samples is sufficient for this calibration. We can ensure, however, that optimization results are independent of the samples and the specific measures used for calibration. Both simulations and spike-in experiments confirmed an unbiased determination of generally optimal experimental conditions. CONCLUSIONS: Well calibrated hybridization conditions are thus easily achieved and necessary for the efficient detection of differential expression. They are essential for the sensitive pro filing of low-copy-number molecules. This is particularly critical for studies of transcription factor expression, or the inference and study of regulatory networks.RIGHTS : This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may : copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are

    Multiple functionally divergent and conserved copies of alpha tubulin in bdelloid rotifers.

    Get PDF
    BACKGROUND: Bdelloid rotifers are microscopic animals that have apparently survived without sex for millions of years and are able to survive desiccation at all life stages through a process called anhydrobiosis. Both of these characteristics are believed to have played a role in shaping several unusual features of bdelloid genomes discovered in recent years. Studies into the impact of asexuality and anhydrobiosis on bdelloid genomes have focused on understanding gene copy number. Here we investigate copy number and sequence divergence in alpha tubulin. Alpha tubulin is conserved and normally present in low copy numbers in animals, but multiplication of alpha tubulin copies has occurred in animals adapted to extreme environments, such as cold-adapted Antarctic fish. Using cloning and sequencing we compared alpha tubulin copy variation in four species of bdelloid rotifers and four species of monogonont rotifers, which are facultatively sexual and cannot survive desiccation as adults. Results were verified using transcriptome data from one bdelloid species, Adineta ricciae. RESULTS: In common with the typical pattern for animals, monogonont rotifers contain either one or two copies of alpha tubulin, but bdelloid species contain between 11 and 13 different copies, distributed across five classes. Approximately half of the copies form a highly conserved group that vary by only 1.1% amino acid pairwise divergence with each other and with the monogonont copies. The other copies have divergent amino acid sequences that evolved significantly faster between classes than within them, relative to synonymous changes, and vary in predicted biochemical properties. Copies of each class were expressed under the laboratory conditions used to construct the transcriptome. CONCLUSIONS: Our findings are consistent with recent evidence that bdelloids are degenerate tetraploids and that functional divergence of ancestral copies of genes has occurred, but show how further duplication events in the ancestor of bdelloids led to proliferation in both conserved and functionally divergent copies of this gene.RIGHTS : This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may : copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are
    corecore