Test-time Augmentation for Factual Probing
Factual probing is a method that uses prompts to test if a language model
"knows" certain world knowledge facts. A problem in factual probing is that
small changes to the prompt can lead to large changes in model output. Previous
work aimed to alleviate this problem by optimizing prompts via text mining or
fine-tuning. However, such approaches are relation-specific and do not
generalize to unseen relation types. Here, we propose to use test-time
augmentation (TTA) as a relation-agnostic method for reducing sensitivity to
prompt variations by automatically augmenting and ensembling prompts at test
time. Experiments show improved model calibration, i.e., with TTA, model
confidence better reflects prediction accuracy. Improvements in prediction
accuracy are observed for some models, but for other models, TTA leads to
degradation. Error analysis identifies the difficulty of producing high-quality
prompt variations as the main challenge for TTA. Comment: 12 pages, 4 figures, accepted to EMNLP 2023 Findings (short paper)
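The prompt-ensembling step described in this abstract can be illustrated with a minimal sketch. It assumes each augmented prompt has already produced a probability for each candidate answer; the function name `ensemble_predictions`, the averaging rule, and the example data are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def ensemble_predictions(per_prompt_predictions):
    """Average candidate probabilities over prompt variants (assumed aggregation rule).

    per_prompt_predictions: list of dicts mapping candidate answer -> probability,
    one dict per augmented prompt. Candidates missing from a dict count as 0.
    """
    totals = defaultdict(float)
    for predictions in per_prompt_predictions:
        for answer, prob in predictions.items():
            totals[answer] += prob
    n_prompts = len(per_prompt_predictions)
    averaged = {answer: total / n_prompts for answer, total in totals.items()}
    best = max(averaged, key=averaged.get)
    # The averaged probability serves as the ensemble's confidence.
    return best, averaged[best]


# Hypothetical outputs for three paraphrases of the same factual query.
per_prompt = [
    {"Paris": 0.6, "Lyon": 0.3},
    {"Paris": 0.5, "Marseille": 0.4},
    {"Lyon": 0.5, "Paris": 0.4},
]
answer, confidence = ensemble_predictions(per_prompt)
```

Averaging lets a single prompt variant that favors a wrong answer be outvoted, and the averaged probability can be read as a confidence score, which is the quantity a calibration analysis would examine.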
Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction
GPT-3 and GPT-4 models are powerful, achieving high performance on a variety
of Natural Language Processing tasks. However, there is a relative lack of
detailed published analysis of their performance on the task of grammatical
error correction (GEC). To address this, we perform experiments testing the
capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model
(gpt-4-0314) on major GEC benchmarks. We compare the performance of different
prompts in both zero-shot and few-shot settings, analyzing intriguing or
problematic outputs encountered with different prompt formats. We report the
performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that
the GPT models can perform well in a sentence-level revision setting, with
GPT-4 achieving a new high score on the JFLEG benchmark. Through human
evaluation experiments, we compare the GPT models' corrections to source, human
reference, and baseline GEC system sentences and observe differences in editing
strategies and how they are scored by human raters.
Empirical Investigation of Neural Symbolic Reasoning Strategies
Neural reasoning accuracy improves when generating intermediate reasoning
steps. However, the source of this improvement remains unclear. Here, we
investigate and factorize the benefit of generating intermediate steps for
symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t.
step granularity and chaining strategy. With a purely symbolic numerical
reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of
reasoning strategies significantly affects the performance, with the gap
becoming even larger as the extrapolation length becomes longer. Surprisingly,
we also found that certain configurations lead to nearly perfect performance,
even in the case of length extrapolation. Our results indicate the importance
of further exploring effective strategies for neural reasoning models. Comment: This paper was accepted to the Findings of EACL 2023; an earlier
(non-archival) version of this work received the Best Paper Award in the Student
Research Workshop of AACL 202
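To make the dataset format in this abstract concrete, the following minimal sketch evaluates a problem such as `A=1, B=3, C=A+3, C?` and records one intermediate step per assignment. The function name `solve`, the restriction to addition, and the assumption that variables are defined before use are illustrative choices, not details from the paper.

```python
def solve(problem):
    """Evaluate a problem like 'A=1, B=3, C=A+3, C?'.

    Assumes only '+' appears in expressions and that every variable is
    assigned before it is used. Returns the queried value together with
    the resolved assignments as intermediate steps.
    """
    *assignments, query = [part.strip() for part in problem.split(",")]
    values = {}
    steps = []
    for assignment in assignments:
        name, expr = (side.strip() for side in assignment.split("="))
        total = 0
        for term in expr.split("+"):
            term = term.strip()
            # A term is either a previously defined variable or an integer literal.
            total += values[term] if term in values else int(term)
        values[name] = total
        steps.append(f"{name}={total}")
    return values[query.rstrip("?")], steps


result, steps = solve("A=1, B=3, C=A+3, C?")
```

Here `steps` plays the role of the intermediate reasoning chain; emitting one resolved assignment per step is only one possible granularity among those the abstract alludes to.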
Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?
Compositionality is a pivotal property of symbolic reasoning. However, how
well recent neural models capture compositionality remains underexplored in
symbolic reasoning tasks. This study empirically addresses this question by
systematically examining recently published pre-trained seq2seq models with a
carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We
introduce a skill tree on compositionality in arithmetic symbolic reasoning
that defines the hierarchical levels of complexity along with three
compositionality dimensions: systematicity, productivity, and substitutivity.
Our experiments revealed that among the three types of composition, the models
struggled most with systematicity, performing poorly even with relatively
simple compositions. That difficulty was not resolved even after training the
models with intermediate reasoning steps. Comment: accepted by EACL 202
RealTime QA: What's the Answer Right Now?
We introduce REALTIME QA, a dynamic question answering (QA) platform that
announces questions and evaluates systems on a regular basis (weekly in this
version). REALTIME QA inquires about the current world, and QA systems need to
answer questions about novel events or information. It therefore challenges
static, conventional assumptions in open-domain QA datasets and pursues
instantaneous applications. We build strong baseline models upon large
pretrained language models, including GPT-3 and T5. Our benchmark is an ongoing
effort, and this paper presents real-time evaluation results over the past
year. Our experimental results show that GPT-3 can often properly update its
generation results, based on newly-retrieved documents, highlighting the
importance of up-to-date information retrieval. Nonetheless, we find that GPT-3
tends to return outdated answers when retrieved documents do not provide
sufficient information to find an answer. This suggests an important avenue for
future research: can an open-domain QA system identify such unanswerable cases
and communicate with the user or even the retrieval module to modify the
retrieval results? We hope that REALTIME QA will spur progress in instantaneous
applications of question answering and beyond. Comment: RealTime QA Website: https://realtimeqa.github.io
A comprehensive survey on quantum computer usage: How many qubits are employed for what purposes?
Quantum computers (QCs), which work based on the laws of quantum mechanics,
are expected to be faster than classical computers in several computational
tasks such as prime factoring and simulation of quantum many-body systems. In
the last decade, research and development of QCs have rapidly advanced. Now
hundreds of physical qubits are at our disposal, and one can find several
remarkable experiments actually outperforming the classical computer in a
specific computational task. On the other hand, it is unclear what the typical
usages of the QCs are. Here we conduct an extensive survey on the papers that
are posted in the quant-ph section in arXiv and claim to have used QCs in their
abstracts. To understand the current situation of the research and development
of the QCs, we evaluated the descriptive statistics about the papers, including
the number of qubits employed, QPU vendors, application domains and so on. Our
survey shows that the annual number of publications is increasing, and the
typical number of qubits employed is about six to ten, growing along with the
increase in the quantum volume (QV). Most of the preprints are devoted to
applications such as quantum machine learning, condensed matter physics, and
quantum chemistry, while quantum error correction and quantum noise mitigation
use more qubits than the other topics. These findings imply that the increase in QV is
fundamentally relevant, and that more experiments on quantum error correction and
noise mitigation using shallow circuits with more qubits will take place. Comment: 14 pages, 5 figures, figures regenerate
Type III Gustilo–Anderson open fracture does not justify routine prophylactic Gram-negative antibiotic coverage
Postoperative surgical site infection (SSI) is common in open long bone fractures, so early administration of prophylactic antibiotics is critical to prevent SSI. However, the necessity of initial broad-spectrum coverage for Gram-positive and -negative pathogens remains unclear. The purpose of this study was to clarify the effectiveness of prophylactic broad-spectrum antibiotics in a large, nationwide sample. We reviewed an open fracture database of prospectively collected data from 111 institutions managed by our society. A retrospective cohort study was designed to compare the rates of deep SSI between narrow- and broad-spectrum antibiotics, which were initiated within three hours after injury. A total of 1041 type III fractures were evaluated at three months after injury. Overall deep SSI rates did not differ significantly between the narrow-spectrum group (43/538, 8.0%) and broad-spectrum group (49/503, 9.8%) (p = 0.320). In the propensity score-matched analysis, 425 pairs were analyzed. After matching, no significant difference in the SSI rate was seen between the narrow- and broad-spectrum groups, with 42 SSIs (9.9%) and 40 SSIs (9.4%), respectively (p = 0.816). The probability of deep SSI was not reduced by broad-spectrum antibiotics compared with narrow-spectrum antibiotics in type III open long bone fractures.
Closed Compression Nailing Using a New-Generation Intramedullary Nail without Autologous Bone Grafting for Humeral Shaft Nonunion
Introduction. Although the recommended treatment for humeral shaft nonunion is compression plating with autologous bone grafting, we treated a case of humeral shaft nonunion with an intramedullary nail (IMN) without bone grafting. Presentation of Case. Osteosynthesis with IMN was performed on a 24-year-old man with a humeral shaft fracture at another hospital. However, bony union was not obtained 1 year after the first surgery, and he was referred to our institution. We treated the nonunion with exchange nailing without autologous bone grafting, using the compression function of the nail, leading to bony union at 7 months postoperatively. At the final follow-up 2 years and 4 months postoperatively, the patient had full range of motion in the left shoulder and elbow joints. Discussion. Compression plating with autologous bone grafting is reported to be the gold standard for the treatment of humeral shaft nonunion. IMN has the advantage of being minimally invasive; however, the conventional type of IMN cannot apply compression force between fragments and does not have sufficient stability against rotational force. In this case, we used an IMN that could apply compression between the fragments and that provided rotational stability via multiple screws. We did not perform bone grafting because the nonunion was judged to be biologically active, and we achieved good functional results. Conclusion. We treated humeral shaft nonunion using IMN with compression, but without bone grafting, leading to successful clinical outcomes. This strategy might be an appropriate choice for the treatment of humeral shaft nonunion with biological activity.
Minimally invasive plate osteosynthesis for humeral shaft nonunion: A report of two cases
Introduction: We treated two cases of humeral shaft nonunion by minimally invasive plate osteosynthesis (MIPO) without autogenous bone grafting. Presentation of case: Case 1: Osteosynthesis with intramedullary nailing (IMN) was performed on a 17-year-old female for a humeral shaft fracture at another hospital; however, bony union was not obtained. We removed the nail and screws, then performed MIPO without autogenous bone grafting. At the final follow-up 4 years after the surgery, she had obtained full range of motion. Case 2: Osteosynthesis with Rush pins had been performed on a 73-year-old female for a humeral shaft fracture at another hospital. Five months later, a revision surgery using IMN was performed at the same hospital; however, this led to nonunion. We removed the IMN and performed MIPO without autogenous bone grafting. At the final follow-up 2 years after surgery, she had obtained full range of motion. Discussion: Nonunion is caused by a lack of mechanical stability and/or biological activity. In these cases, based on the findings of radiography and bone scintigraphy, mechanical instability was thought to be the primary cause; therefore, to enhance stability, we used a locking plate. Because these cases were judged to be biologically active, we decided not to use bone grafting. Both cases achieved bony union and excellent functional recovery using this method. Conclusion: We performed MIPO without exposing the nonunion site and without autogenous bone grafting in two cases of humeral shaft nonunion, and obtained successful clinical outcomes.