
    Towards Generalized Open Domain Question Answering Systems

    Generalization remains a paramount yet unresolved challenge for open-domain question answering (ODQA) systems, impeding their capacity to adeptly handle novel queries and responses beyond the confines of their training data. This thesis conducts a comprehensive exploration of ODQA generalization. We commence with a meticulous investigation into the underlying challenges. Drawing upon studies on systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training set overlap, compositional generalization and novel-entity generalization. When evaluating six popular parametric and non-parametric models, we find non-parametric models demonstrate proficiency with novel entities but encounter difficulties with compositional generalization. Noteworthy correlations emerge, such as a positive association between question pattern frequency and test accuracy, juxtaposed with a strong negative correlation between entity frequency and test accuracy, attributable to closely related distractors. Factors influencing generalization include cascading errors originating from the retrieval component, question pattern frequency, and entity prevalence. Building on these insights, the focus pivots towards the enhancement of passage retrieval. We propose a novel contextual clue sampling strategy using language models to address the vocabulary mismatch challenge in lexical retrieval for ODQA. This two-step method, comprising filtering and fusion, generates a diverse set of query expansion terms, yielding retrieval accuracy similar to dense methods while notably reducing the index size. The subsequent phase concentrates on refining reader models in ODQA through flat minima optimization techniques, incorporating Stochastic Weight Averaging (SWA) and Sharpness-Aware Minimization (SAM). Rigorous benchmarking underscores the impact of dataset characteristics and model architecture on optimizer effectiveness, with SAM particularly excelling in Natural Language Processing tasks. The combination of SWA and SAM yields additional gains, underscoring the pivotal role of flatter minimizers in fostering enhanced generalization for reader models in ODQA.
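
    The filtering-and-fusion idea lends itself to a small illustration. The sketch below is a hypothetical simplification rather than the thesis implementation: filter_clues, fuse_rankings, and the hard-coded (clue, log-probability) pairs all stand in for the language-model sampling step; filtering is approximated by keeping the most probable clue among near-duplicates, and fusion by weighted reciprocal-rank fusion of the per-clue retrieval rankings.

        from collections import defaultdict

        def jaccard(a, b):
            """Token-overlap similarity between two clue strings."""
            ta, tb = set(a.lower().split()), set(b.lower().split())
            return len(ta & tb) / max(1, len(ta | tb))

        def filter_clues(clues, sim_threshold=0.5):
            """Filtering step: keep only the most probable clue among near-duplicates."""
            kept = []
            for text, logprob in sorted(clues, key=lambda c: c[1], reverse=True):
                if all(jaccard(text, seen) < sim_threshold for seen, _ in kept):
                    kept.append((text, logprob))
            return kept

        def fuse_rankings(ranked_lists, weights, k=60):
            """Fusion step: weighted reciprocal-rank fusion of per-query rankings."""
            scores = defaultdict(float)
            for docs, w in zip(ranked_lists, weights):
                for rank, doc_id in enumerate(docs, start=1):
                    scores[doc_id] += w / (k + rank)
            return sorted(scores, key=scores.get, reverse=True)

        # Toy usage: three sampled clues, two of which survive filtering; the
        # rankings retrieved for the two expanded queries are then fused.
        clues = [("paris is the capital of france", -2.1),
                 ("the capital city of france is paris", -2.4),
                 ("the eiffel tower is in paris", -3.8)]
        print(filter_clues(clues))
        print(fuse_rankings([["d3", "d1", "d7"], ["d1", "d2"]], weights=[1.0, 0.8]))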

    Towards Effective Utilization of Pretrained Language Models – Knowledge Distillation from BERT

    In the natural language processing (NLP) literature, neural networks are becoming increasingly deeper and more complex. Recent advancements in neural NLP are large pretrained language models (e.g. BERT), which lead to significant performance gains in various downstream tasks. Such models, however, require intensive computational resources to train and are difficult to deploy in practice due to poor inference-time efficiency. In this thesis, we address this problem through knowledge distillation (KD), where a large pretrained model serves as the teacher and transfers its knowledge to a small student model. We also want to demonstrate the competitiveness of small, shallow neural networks. We propose a simple yet effective approach that transfers the knowledge of a large pretrained network (namely, BERT) to a shallow neural architecture (namely, a bidirectional long short-term memory network). To facilitate this process, we propose heuristic data augmentation methods, so that the teacher model can better express its knowledge on the augmented corpus. Experimental results on various natural language understanding tasks show that our distilled model achieves performance comparable to the ELMo model (an LSTM-based pretrained model) in both single-sentence and sentence-pair tasks, while using roughly 60–100 times fewer parameters and 8–15 times less inference time. Although these experiments show that small BiLSTMs are more expressive on natural language tasks than previously thought, we wish to further exploit their capacity through a different KD framework. We propose MKD, a Multi-Task Knowledge Distillation approach. It distills the student model on multiple tasks jointly, so that the distilled model learns a more universal language representation by leveraging cross-task data. Furthermore, we evaluate our approach on two different student architectures: one a bi-attentive LSTM-based network, the other a three-layer Transformer. For the LSTM-based student, our approach keeps its inference-speed advantage while maintaining performance comparable to KD methods designed specifically for Transformers. For the Transformer-based student, our approach provides a modest gain and outperforms other KD methods without using external training data.
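
    A minimal sketch of the logit-level distillation objective, assuming a generic text-classification setup: TinyBiLSTMStudent is a placeholder architecture, the teacher logits are random stand-ins for frozen BERT outputs over the (augmented) corpus, and blending an MSE term on logits with ordinary cross-entropy is one common way to implement this kind of KD, not necessarily the exact objective used in the thesis.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class TinyBiLSTMStudent(nn.Module):
            """Placeholder student: embedding -> BiLSTM -> mean-pool -> classifier."""
            def __init__(self, vocab_size, num_labels, emb=128, hidden=256):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb)
                self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * hidden, num_labels)

            def forward(self, token_ids):
                h, _ = self.lstm(self.emb(token_ids))
                return self.out(h.mean(dim=1))

        def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
            """Blend the KD signal (MSE on teacher logits) with cross-entropy on labels."""
            kd = F.mse_loss(student_logits, teacher_logits)
            ce = F.cross_entropy(student_logits, labels)
            return alpha * kd + (1 - alpha) * ce

        # Toy training step; in practice the teacher logits come from a frozen
        # BERT run over the original and augmented sentences.
        student = TinyBiLSTMStudent(vocab_size=30522, num_labels=2)
        ids = torch.randint(0, 30522, (4, 16))
        teacher_logits = torch.randn(4, 2)
        labels = torch.randint(0, 2, (4,))
        loss = distillation_loss(student(ids), teacher_logits, labels)
        loss.backward()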

    When Do Flat Minima Optimizers Work?

    Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: (1) Stochastic Weight Averaging (SWA) and (2) Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.
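
    A minimal sketch of the two optimizers' core updates, assuming a generic PyTorch model, loss function, and base optimizer; real implementations additionally handle gradient scaling, batch-norm statistics, and learning-rate schedules. The SAM step perturbs the weights toward the locally worst direction within a radius rho before taking the descent step; the trailing comments indicate where torch.optim.swa_utils.AveragedModel would maintain the SWA weight average.

        import torch

        def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
            """One Sharpness-Aware Minimization step: climb to the worst-case
            neighbour within an L2 ball of radius rho, then descend from there."""
            x, y = batch
            loss_fn(model(x), y).backward()
            params = [p for p in model.parameters() if p.grad is not None]
            grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
            with torch.no_grad():
                eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
                for p, e in zip(params, eps):
                    p.add_(e)                    # move to the perturbed point
            model.zero_grad()
            loss_fn(model(x), y).backward()      # sharpness-aware gradient
            with torch.no_grad():
                for p, e in zip(params, eps):
                    p.sub_(e)                    # restore the original weights
            base_opt.step()                      # descend with that gradient
            base_opt.zero_grad()

        # Example call on a toy model:
        # model = torch.nn.Linear(10, 2)
        # base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
        # sam_step(model, torch.nn.functional.cross_entropy,
        #          (torch.randn(8, 10), torch.randint(0, 2, (8,))), base_opt)
        #
        # SWA on top: swa_model = torch.optim.swa_utils.AveragedModel(model) keeps
        # a running average of the weights; call swa_model.update_parameters(model)
        # once per epoch during the averaging phase and evaluate swa_model at the end.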

    Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit

    We demonstrate the integration of the Archives Unleashed Toolkit, a scalable platform for exploring web archives, with Google's TensorFlow deep learning toolkit to provide scholars with content-based image analysis capabilities. By applying pretrained deep neural networks for object detection, we are able to extract images of common objects from a 4 TB web archive of GeoCities, which we then compile into browsable collages. This case study illustrates the types of interesting analyses enabled by combining big data and deep learning capabilities. This work was primarily supported by the Natural Sciences and Engineering Research Council of Canada. Additional funding for this project has come from the Andrew W. Mellon Foundation. Our sincerest thanks to the Internet Archive for providing us with the GeoCities web archive.
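
    A rough Python sketch of the content-based tagging step, assuming images have already been extracted from the WARC files by the Archives Unleashed Toolkit; a pretrained ImageNet classifier stands in here for the object-detection network used in the actual pipeline, and tag_image is a hypothetical helper rather than part of either toolkit.

        import numpy as np
        import tensorflow as tf

        # Pretrained ImageNet classifier (a stand-in for the object detector
        # described above) used to label images extracted from the web archive.
        model = tf.keras.applications.ResNet50(weights="imagenet")

        def tag_image(path, top_k=3):
            """Return the top-k predicted object labels for one extracted image."""
            img = tf.keras.preprocessing.image.load_img(path, target_size=(224, 224))
            x = tf.keras.applications.resnet50.preprocess_input(
                tf.keras.preprocessing.image.img_to_array(img)[np.newaxis])
            preds = model.predict(x)
            decoded = tf.keras.applications.resnet50.decode_predictions(preds, top=top_k)[0]
            return [(label, float(score)) for _, label, score in decoded]

        # e.g. group extracted images by their top label to build per-object collages:
        # tags = {p: tag_image(p)[0][0] for p in extracted_image_paths}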

    Model-independent test of the parity symmetry of gravity with gravitational waves

    Gravitational wave (GW) data can be used to test the parity symmetry of gravity by investigating the difference between left-hand and right-hand circular polarization modes. In this article, we develop a method to decompose the circular polarizations of GWs produced during the inspiralling stage of compact binaries, with the help of the stationary phase approximation. The foremost advantage is that this method is simple, clean, independent of the GW waveform, and applicable to the existing detector network. Applying it to mock data, we test the parity symmetry of gravity by constraining the velocity birefringence of GWs. If a nearly edge-on binary neutron star system with an observed electromagnetic counterpart at 40 Mpc is detected by the second-generation detector network, one could derive a model-independent test of the parity symmetry of gravity: the lower limit of the energy scale of parity violation can be constrained to O(10^4 eV). Comment: 9 pages, 4 figures, accepted by EPJC.
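
    The decomposition at the heart of the method can be written compactly. The LaTeX expressions below are schematic, use one common sign convention (conventions differ between papers), and leave the phase correction δΨ(f) unspecified, since its precise form depends on the particular parity-violating theory:

        % left- and right-handed circular modes from the plus and cross polarizations
        h_{R,L} = \frac{h_+ \mp i\, h_\times}{\sqrt{2}}

        % velocity birefringence: in the stationary-phase-approximation waveform the
        % two circular modes acquire opposite, frequency-dependent phase corrections
        \tilde{h}_{R,L}(f) = \tilde{h}^{\mathrm{GR}}_{R,L}(f)\, e^{\pm i\, \delta\Psi(f)}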

    Comparing Enterovirus 71 with Coxsackievirus A16 by analyzing nucleotide sequences and antigenicity of recombinant proteins of VP1s and VP4s

    Background: Enterovirus 71 (EV71) and Coxsackievirus A16 (CA16) are two major etiological agents of Hand, Foot and Mouth Disease (HFMD). EV71 is associated with severe cases but CA16 is not. The mechanisms contributing to the different pathogenesis of these two viruses are unknown. VP1 and VP4 are two major structural proteins of these viruses and warrant close attention.
    Results: The sequences of vp1 genes from 14 EV71 and 14 CA16 isolates, and of vp4 genes from 10 EV71 and 1 CA16 isolates obtained in this study during the 2007 to 2009 HFMD seasons, were analyzed together with the corresponding sequences available in GenBank using DNAStar and MEGA 4.0. Phylogenetic analysis of the complete vp1 or vp4 sequences showed that the EV71 strains isolated in Beijing belonged to C4 and the CA16 strains belonged to lineage B2 (lineage C). VP1 and VP4 proteins from 4 virus strains expressed in E. coli BL21 cells were used to detect IgM and IgG in human sera by Western blot. The detection of IgM against VP1 of EV71 and CA16 gave results consistent with current infection, while none of the sera were positive against VP4 of EV71 or CA16. There was a significant difference in the positive rates between EV71 VP1 and CA16 VP1 (χ² = 5.02, P < 0.05), as well as between EV71 VP4 and CA16 VP4 (χ² = 15.30, P < 0.01), in the detection of IgG against the recombinant proteins with the same batch of serum samples. The sero-positive rate of IgG against VP1 was higher than that against VP4 for both EV71 (χ² = 26.47, P < 0.01) and CA16 (χ² = 16.78, P < 0.01), which might be due to the different positions of VP1 and VP4 in the viral capsid.
    Conclusions: EV71 and CA16 were highly diverse in the nucleotide sequences of their vp1 and vp4 genes. The sero-positive rates against VP1 and VP4 of EV71 were lower than the corresponding rates for CA16, which suggests a lower exposure rate to EV71 than to CA16 in the Beijing population. Human serum antibodies detected by Western blot using VP1 and VP4 as antigens indicated that the immunological reactions to VP1 and VP4 of EV71 and CA16 differ.
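
    For readers unfamiliar with the reported statistics, each comparison is a chi-square test on a 2x2 contingency table of sero-positive and sero-negative counts. The counts below are hypothetical placeholders, not the study's data; the snippet only shows how such a test is computed with scipy.

        from scipy.stats import chi2_contingency

        # Hypothetical counts (NOT the study's data): rows are the two antigens
        # (EV71 VP1 vs CA16 VP1), columns are IgG-positive and IgG-negative sera.
        table = [[30, 70],
                 [48, 52]]

        chi2, p, dof, expected = chi2_contingency(table)
        print(f"chi-square = {chi2:.2f}, p = {p:.4f}, dof = {dof}")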