37 research outputs found

    Test-time Augmentation for Factual Probing

    Factual probing is a method that uses prompts to test if a language model "knows" certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA. Comment: 12 pages, 4 figures, accepted to EMNLP 2023 Findings (short paper)
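
    The ensembling step described in this abstract can be illustrated with a minimal sketch. It assumes that each prompt variant yields a probability distribution over candidate answers and that the distributions are combined by simple averaging; the abstract does not state the aggregation rule, so averaging is an assumption. The names `ensemble_predictions` and `toy_predict`, the hand-written prompt variants, and the canned probabilities are all hypothetical stand-ins for a real language model and an automatic augmenter.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Distribution = Dict[str, float]


def ensemble_predictions(
    prompts: List[str], predict: Callable[[str], Distribution]
) -> Distribution:
    """Average the answer distributions obtained from several prompt variants."""
    totals: Dict[str, float] = defaultdict(float)
    for prompt in prompts:
        for answer, prob in predict(prompt).items():
            totals[answer] += prob
    return {answer: total / len(prompts) for answer, total in totals.items()}


def toy_predict(prompt: str) -> Distribution:
    """Hypothetical stand-in for a language model: returns canned probabilities."""
    canned = {
        "The capital of France is [MASK].": {"Paris": 0.6, "Lyon": 0.4},
        "France's capital city is [MASK].": {"Paris": 0.9, "Lyon": 0.1},
        "[MASK] is the capital of France.": {"Lyon": 0.55, "Paris": 0.45},
    }
    return canned[prompt]


# Hand-written paraphrases; a real system would generate these automatically.
PROMPTS = [
    "The capital of France is [MASK].",
    "France's capital city is [MASK].",
    "[MASK] is the capital of France.",
]

averaged = ensemble_predictions(PROMPTS, toy_predict)
best_answer = max(averaged, key=averaged.get)
print(best_answer, averaged)
```

    In this toy setup the third variant alone would pick the wrong answer, while the averaged distribution does not, which is the kind of reduced prompt sensitivity the abstract refers to.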

    Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction

    GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks. However, there is a relative lack of detailed published analysis of their performance on the task of grammatical error correction (GEC). To address this, we perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks. We compare the performance of different prompts in both zero-shot and few-shot settings, analyzing intriguing or problematic outputs encountered with different prompt formats. We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting, with GPT-4 achieving a new high score on the JFLEG benchmark. Through human evaluation experiments, we compare the GPT models' corrections to source, human reference, and baseline GEC system sentences and observe differences in editing strategies and how they are scored by human raters.

    Empirical Investigation of Neural Symbolic Reasoning Strategies

    Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is still unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategy significantly affects performance, with the gap becoming even larger as the extrapolation length increases. Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation. Our results indicate the importance of further exploring effective strategies for neural reasoning models. Comment: This paper is accepted as the findings at EACL 2023, and the earlier version (non-archival) of this work got the Best Paper Award in the Student Research Workshop of AACL 202
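
    To make the example problem in this abstract concrete, the following sketch solves an instance of the form "A=1, B=3, C=A+3, C?" and records each intermediate result. The parsing rules (comma-separated assignments, addition only, a trailing query ending in "?") are inferred from that single example and are an assumption, as is the hypothetical function name `solve`. Evaluating every assignment in order is just one possible strategy; it is not claimed to be any of the configurations studied in the paper.

```python
from typing import Dict, List, Tuple


def solve(problem: str) -> Tuple[int, List[str]]:
    """Evaluate assignments left to right and return (answer, intermediate steps)."""
    *assignments, query = [part.strip() for part in problem.split(",")]
    target = query.rstrip("?")

    values: Dict[str, int] = {}
    steps: List[str] = []
    for assignment in assignments:
        name, expression = (side.strip() for side in assignment.split("="))
        total = 0
        for term in expression.split("+"):
            term = term.strip()
            # A term is either a previously defined variable or an integer literal.
            total += values[term] if term in values else int(term)
        values[name] = total
        steps.append(f"{name}={total}")
    return values[target], steps


answer, steps = solve("A=1, B=3, C=A+3, C?")
print(answer, steps)
```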

    Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?

    Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps. Comment: accepted by EACL 202

    RealTime QA: What's the Answer Right Now?

    We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). REALTIME QA inquires about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges static, conventional assumptions in open-domain QA datasets and pursues instantaneous applications. We build strong baseline models upon large pretrained language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this paper presents real-time evaluation results over the past year. Our experimental results show that GPT-3 can often properly update its generation results based on newly retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open-domain QA system identify such unanswerable cases and communicate with the user or even the retrieval module to modify the retrieval results? We hope that REALTIME QA will spur progress in instantaneous applications of question answering and beyond. Comment: RealTime QA Website: https://realtimeqa.github.io

    A comprehensive survey on quantum computer usage: How many qubits are employed for what purposes?

    Quantum computers (QCs), which work based on the laws of quantum mechanics, are expected to be faster than classical computers in several computational tasks such as prime factoring and simulation of quantum many-body systems. In the last decade, research and development of QCs have rapidly advanced. Now hundreds of physical qubits are at our disposal, and one can find several remarkable experiments actually outperforming the classical computer in a specific computational task. On the other hand, it is unclear what the typical usages of QCs are. Here we conduct an extensive survey of the papers that are posted in the quant-ph section of arXiv and claim to have used QCs in their abstracts. To understand the current situation of the research and development of QCs, we evaluated descriptive statistics about the papers, including the number of qubits employed, QPU vendors, application domains and so on. Our survey shows that the annual number of publications is increasing, and the typical number of qubits employed is about six to ten, growing along with the increase in the quantum volume (QV). Most of the preprints are devoted to applications such as quantum machine learning, condensed matter physics, and quantum chemistry, while quantum error correction and quantum noise mitigation use more qubits than the other topics. These findings imply that the increase in QV is fundamentally relevant, and that more experiments on quantum error correction and noise mitigation using shallow circuits with more qubits will take place. Comment: 14 pages, 5 figures, figures regenerate

    Type III Gustilo–Anderson open fracture does not justify routine prophylactic Gram-negative antibiotic coverage

    Postoperative surgical site infection (SSI) is common in open long bone fractures, so early administration of prophylactic antibiotics is critical to prevent SSI. However, the necessity of initial broad-spectrum coverage for Gram-positive and -negative pathogens remains unclear. The purpose of this study was to clarify the effectiveness of prophylactic broad-spectrum antibiotics in a large, nationwide sample. We reviewed an open fracture database of prospectively collected data from 111 institutions managed by our society. A retrospective cohort study was designed to compare the rates of deep SSI between narrow- and broad-spectrum antibiotics, which were initiated within three hours after injury. A total of 1041 type III fractures were evaluated at three months after injury. Overall deep SSI rates did not differ significantly between the narrow-spectrum group (43/538, 8.0%) and broad-spectrum group (49/503, 9.8%) (p = 0.320). During propensity score-matched analysis, 425 pairs were analyzed. After matching, no significant difference in the SSI rate was seen between the narrow- and broad-spectrum groups, with 42 SSIs (9.9%) and 40 SSIs (9.4%), respectively (p = 0.816). The probability of deep SSI was not reduced by broad-spectrum antibiotics compared with narrow-spectrum antibiotics in type III open long bone fractures.
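
    The group comparisons in this abstract can be recomputed from the reported counts. The abstract does not say which statistical test produced the p-values, so the sketch below assumes a Pearson chi-square test on a 2x2 table without continuity correction; the function name `chi_square_2x2` is hypothetical. For one degree of freedom, the p-value is obtained from the complementary error function in the standard library.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Unmatched cohort: 43 of 538 (narrow-spectrum) vs 49 of 503 (broad-spectrum).
chi2_all, p_all = chi_square_2x2(43, 538 - 43, 49, 503 - 49)

# Matched cohort: 42 of 425 (narrow-spectrum) vs 40 of 425 (broad-spectrum).
chi2_matched, p_matched = chi_square_2x2(42, 425 - 42, 40, 425 - 40)

print(f"unmatched: chi2={chi2_all:.3f}, p={p_all:.3f}")
print(f"matched:   chi2={chi2_matched:.3f}, p={p_matched:.3f}")
```

    Under this assumed test, both p-values come out close to the reported 0.320 and 0.816, although the authors may have used a different procedure, particularly for the matched pairs.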

    Closed Compression Nailing Using a New-Generation Intramedullary Nail without Autologous Bone Grafting for Humeral Shaft Nonunion

    Introduction. Although the recommended treatment for humeral shaft nonunion is compression plating with autologous bone grafting, we treated a case of humeral shaft nonunion with an intramedullary nail (IMN) without bone grafting. Presentation of Case. Osteosynthesis with IMN was performed on a 24-year-old man with a humeral shaft fracture at another hospital. However, bony union was not obtained 1 year after the first surgery, and he was referred to our institution. We treated the nonunion with exchange nailing without autologous bone grafting, using the compression function of the nail, leading to bony union at 7 months postoperatively. At the final follow-up 2 years and 4 months postoperatively, the patient had full range of motion in the left shoulder and elbow joints. Discussion. Compression plating with autologous bone grafting is reported to be the gold standard for the treatment of humeral shaft nonunion. IMN has the advantage of being minimally invasive; however, the conventional type of IMN cannot apply compression force between fragments and does not have sufficient stability against rotational force. In this case, we used an IMN that could apply compression between the fragments and that provided rotational stability through multiple screws. We did not perform bone grafting because the nonunion was judged to be biologically active, and we achieved good functional results. Conclusion. We treated humeral shaft nonunion using IMN with compression, but without bone grafting, leading to successful clinical outcomes. This strategy might be an appropriate choice for the treatment of humeral shaft nonunion with biological activity.

    Minimally invasive plate osteosynthesis for humeral shaft nonunion: A report of two cases

    Introduction: We treated two cases of humeral shaft nonunion by minimally invasive plate osteosynthesis (MIPO) without autogenous bone grafting. Presentation of case: Case 1: Osteosynthesis with intramedullary nailing (IMN) was performed on a 17-year-old female for a humeral shaft fracture at another hospital; however, bony union was not obtained. We removed the nail and screws, then performed MIPO without autogenous bone grafting. At the final follow-up 4 years after the surgery, she had obtained full range of motion. Case 2: Osteosynthesis with Rush pins had been performed on a 73-year-old female for a humeral shaft fracture at another hospital. Five months later, a revision surgery using IMN was performed at the same hospital; however, this led to nonunion. We removed the IMN and performed MIPO without autogenous bone grafting. At the final follow-up 2 years after surgery, she had obtained full range of motion. Discussion: The cause of nonunion is a lack of mechanical stability and/or biological activity. In these cases, from the findings of radiography and bone scintigraphy, mechanical instability was thought to be the primary cause; therefore, in order to enhance stability, we used a locking plate. Because these cases were judged to be biologically active, we decided not to use bone grafting. Both our cases successfully achieved bony union and excellent functional recovery using this method. Conclusion: We performed MIPO without exposure of the nonunion site and without autogenous bone grafting in two cases of humeral shaft nonunion, and obtained successful clinical outcomes.