10 research outputs found
A Continuously Growing Dataset of Sentential Paraphrases
A major challenge in paraphrase research is the lack of parallel corpora. In
this paper, we present a new method to collect large-scale sentential
paraphrases from Twitter by linking tweets through shared URLs. The main
advantage of our method is its simplicity, as it gets rid of the classifier or
human in the loop needed to select data before annotation and subsequent
application of paraphrase identification algorithms in previous work. We
present the largest human-labeled paraphrase corpus to date of 51,524 sentence
pairs and the first cross-domain benchmarking for automatic paraphrase
identification. In addition, we show that more than 30,000 new sentential
paraphrases can be easily and continuously captured every month at ~70%
precision, and demonstrate their utility for downstream NLP tasks through
phrasal paraphrase extraction. We make our code and data freely available. Comment: 11 pages, accepted to EMNLP 201
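The URL-linking idea in this abstract can be illustrated with a minimal sketch: tweets that cite the same URL are grouped, and every pair within a group becomes a candidate paraphrase for later annotation. This is not the authors' released code. The function name `candidate_pairs`, the `(text, url)` input format, and the example tweets are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(tweets):
    """Return candidate sentence pairs from tweets that share a URL.

    `tweets` is an iterable of (text, url) tuples; this input format is an
    assumption for illustration, not the format used in the paper.
    """
    by_url = defaultdict(list)
    for text, url in tweets:
        # Skip exact duplicate texts (e.g. verbatim reposts) under the same URL.
        if text not in by_url[url]:
            by_url[url].append(text)

    pairs = []
    for url, texts in by_url.items():
        # Every pair of distinct tweets under one URL is a candidate paraphrase.
        for a, b in combinations(texts, 2):
            pairs.append((url, a, b))
    return pairs


# Hypothetical example data.
tweets = [
    ("Storm closes the airport overnight", "http://example.com/a"),
    ("Airport shut down overnight due to storm", "http://example.com/a"),
    ("New phone released today", "http://example.com/b"),
]
pairs = candidate_pairs(tweets)
```

Candidate pairs produced this way are not all true paraphrases, which is consistent with the abstract's reported precision of roughly 70% and the need for human labels on the benchmark corpus.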
Peculiarities of the inverted repeats in the complete chloroplast genome of Strobilanthes bantonensis Lindau
Strobilanthes bantonensis Lindau belongs to the family Acanthaceae. It is an antiviral herb that can be used to prevent influenza virus infections in the border areas between China and Vietnam. Local people call it “Purple Ban-lan-gen” because its root is very similar to that of Strobilanthes cusia (Nees) Kuntze, which is called “Southern Ban-lan-gen” and is listed in the Chinese Pharmacopeia. The two species have been used interchangeably locally. However, their pharmacological equivalence has caused concern for years. We previously sequenced the chloroplast genome of S. cusia. In this study, we sequenced the complete chloroplast genome of S. bantonensis to perform an in-depth comparative genetic analysis of the two Strobilanthes species. The chloroplast genome of S. bantonensis is a circular DNA molecule with a total length of 144,591 bp and encodes 84 protein-coding, 8 ribosomal RNA, and 37 transfer RNA genes. The chloroplast genome has a conserved quadripartite structure, including a large single-copy (LSC) region, a small single-copy (SSC) region, and a pair of inverted repeat (IR) regions, with lengths of 92,068 bp, 17,767 bp, and 17,378 bp, respectively. Phylogenetic analysis confirmed that S. bantonensis is closely related to S. cusia. Compared with other species from Acanthaceae, S. bantonensis has a significantly shortened IR region, suggesting the occurrence of IR contraction events. This study will help future taxonomic, evolutionary, phylogenetic, and bioprospecting studies of the sizeable Strobilanthes genus, which contains over 400 species.
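As a quick consistency check on the figures in this abstract, the quadripartite structure implies that the total genome length should equal the LSC length plus the SSC length plus twice the IR length, since the IR occurs as a pair. The short sketch below uses only the numbers quoted in the abstract; the variable names are chosen here for illustration.

```python
# Region lengths in base pairs, as quoted in the abstract.
lsc = 92_068   # large single-copy region
ssc = 17_767   # small single-copy region
ir = 17_378    # one inverted repeat; the genome contains two copies

# Total length of the circular genome: LSC + SSC + 2 * IR.
total = lsc + ssc + 2 * ir

# The abstract reports a total length of 144,591 bp.
assert total == 144_591
print(total)
```

The sum matches the reported total, so the three region lengths and the genome length are mutually consistent.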
UNITE: A Unified Benchmark for Text-to-SQL Evaluation
A practical text-to-SQL system should generalize well on a wide variety of
natural language questions, unseen database schemas, and novel SQL query
structures. To comprehensively evaluate text-to-SQL systems, we introduce a
UNIfied benchmark for Text-to-SQL Evaluation (UNITE). It is composed of
publicly available text-to-SQL datasets, containing natural language questions
from more than 12 domains, SQL queries from more than 3.9K patterns, and 29K
databases. Compared to the widely used Spider benchmark, we introduce
120K additional examples and a threefold increase in SQL patterns, such
as comparative and boolean questions. We conduct a systematic study of six
state-of-the-art (SOTA) text-to-SQL parsers on our new benchmark and show that:
1) Codex performs surprisingly well on out-of-domain datasets; 2) specially
designed decoding methods (e.g. constrained beam search) can improve
performance for both in-domain and out-of-domain settings; 3) explicitly
modeling the relationship between questions and schemas further improves the
Seq2Seq models. More importantly, our benchmark presents key challenges in
compositional generalization and robustness, which these SOTA models
cannot address well. Our code and data processing script are available at
https://github.com/awslabs/unified-text2sql-benchmark. Comment: 5 pages
ABIN1 (Q478) is Required to Prevent Hematopoietic Deficiencies through Regulating Type I IFNs Expression
A20-binding inhibitor of NF-κB activation (ABIN1) is a polyubiquitin-binding protein that regulates cell death and immune responses. Although Abin1 is located on chromosome 5q in the region commonly deleted in patients with 5q minus syndrome, the most distinct of the myelodysplastic syndromes (MDSs), the precise role of ABIN1 in MDSs remains unknown. In this study, mice with a mutation disrupting the polyubiquitin-binding site (Abin1Q478H/Q478H) are generated. These mice develop MDS-like diseases characterized by anemia, thrombocytopenia, and megakaryocyte dysplasia. Extramedullary hematopoiesis and bone marrow failure are also observed in Abin1Q478H/Q478H mice. Although Abin1Q478H/Q478H cells are sensitive to RIPK1 kinase-RIPK3-MLKL-dependent necroptosis, only anemia and splenomegaly are alleviated by RIPK3 deficiency but not by MLKL deficiency or the RIPK1 kinase-dead mutation. This indicates that the necroptosis-independent function of RIPK3 is critical for anemia development in Abin1Q478H/Q478H mice. Notably, Abin1Q478H/Q478H mice exhibit higher levels of type I interferon (IFN-I) expression in bone marrow cells compared to wild-type mice. Consistently, blocking type I IFN signaling through the co-deletion of Ifnar1 greatly ameliorates anemia, thrombocytopenia, and splenomegaly in Abin1Q478H/Q478H mice. Together, these results demonstrate that ABIN1(Q478) prevents the development of hematopoietic deficiencies by regulating type I IFN expression.