Large-scale Hierarchical Alignment for Data-driven Text Rewriting
We propose a simple unsupervised method for extracting pseudo-parallel
monolingual sentence pairs from comparable corpora representative of two
different text styles, such as news articles and scientific papers. Our
approach does not require a seed parallel corpus, but instead relies solely on
hierarchical search over pre-trained embeddings of documents and sentences. We
demonstrate the effectiveness of our method through automatic and extrinsic
evaluation on text simplification from the normal to the Simple Wikipedia. We
show that pseudo-parallel sentences extracted with our method not only
supplement existing parallel data, but can even lead to competitive performance
on their own.
Comment: RANLP 201
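The hierarchical search idea above can be sketched as follows: match documents across the two styles first, then compare sentences only within matched document pairs, pruning the vast majority of sentence comparisons. This is a toy illustration; the thresholds, two-dimensional embeddings, and function names are assumptions, not the paper's implementation.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def extract_pseudo_parallel(docs_a, docs_b, doc_threshold=0.8, sent_threshold=0.8):
    """Two-level search: match documents across the two styles first, then
    match sentences only within similar document pairs.

    docs_a / docs_b: lists of (doc_embedding, [(sent_embedding, sent_text), ...]).
    """
    pairs = []
    for doc_emb_a, sents_a in docs_a:
        for doc_emb_b, sents_b in docs_b:
            if cosine(doc_emb_a, doc_emb_b) < doc_threshold:
                continue  # prune: skip sentence search for dissimilar documents
            for emb_a, text_a in sents_a:
                # best-matching sentence within the matched document
                best = max(sents_b, key=lambda s: cosine(emb_a, s[0]))
                if cosine(emb_a, best[0]) >= sent_threshold:
                    pairs.append((text_a, best[1]))
    return pairs
```

The pruning at the document level is what makes the search tractable without a seed parallel corpus: sentence-level comparison happens only inside document pairs that already look comparable.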
Character-level Chinese-English Translation through ASCII Encoding
Character-level Neural Machine Translation (NMT) models have recently
achieved impressive results on many language pairs. They mainly do well for
Indo-European language pairs, where the languages share the same writing
system. However, for translating between Chinese and English, the gap between
the two different writing systems poses a major challenge because of a lack of
systematic correspondence between the individual linguistic units. In this
paper, we enable character-level NMT for Chinese, by breaking down Chinese
characters into linguistic units similar to those of Indo-European languages. We
use the Wubi encoding scheme, which preserves the original shape and semantic
information of the characters, while also being reversible. We show promising
results from training Wubi-based models on the character- and subword-level
with recurrent as well as convolutional models.
Comment: 7 pages, 3 figures, 3rd Conference on Machine Translation (WMT18), 2018
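The reversibility property can be illustrated with a toy encoder that maps each character to an ASCII code and back. The codes below are made up for illustration (they are not real Wubi codes), and the underscore delimiter is an assumption to keep decoding unambiguous; it is not the paper's actual scheme.

```python
# Illustrative character-to-ASCII codes in the spirit of Wubi (NOT real Wubi codes).
CODES = {"中": "khk", "国": "lgyi"}
ASCII_TO_CHAR = {v: k for k, v in CODES.items()}

def encode(text):
    """Map a Chinese string to delimiter-separated ASCII codes."""
    return "_".join(CODES[c] for c in text)

def decode(encoded):
    """Invert encode(); reversibility is the key property of the scheme."""
    return "".join(ASCII_TO_CHAR[tok] for tok in encoded.split("_"))
```

Once characters are ASCII sequences, standard character- and subword-level NMT machinery applies directly.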
SciLit: A Platform for Joint Scientific Literature Discovery, Summarization and Citation Generation
Scientific writing involves retrieving, summarizing, and citing relevant
papers, which can be time-consuming processes in large and rapidly evolving
fields. By making these processes interoperable, natural language processing
(NLP) provides opportunities for creating end-to-end assistive writing tools.
We propose SciLit, a pipeline that automatically recommends relevant papers,
extracts highlights, and suggests a reference sentence as a citation of a
paper, taking into consideration the user-provided context and keywords. SciLit
efficiently recommends papers from large databases of hundreds of millions of
papers using a two-stage pre-fetching and re-ranking literature search system
that flexibly deals with addition and removal of a paper database. We provide a
convenient user interface that displays the recommended papers as extractive
summaries and that offers abstractively-generated citing sentences which are
aligned with the provided context and which mention the chosen keyword(s). Our
assistive tool for literature discovery and scientific writing is available at
https://scilit.vercel.app
Comment: Accepted at ACL 2023 System Demonstrations
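The two-stage pre-fetching and re-ranking design can be sketched roughly as follows. The function names, the dot-product prefetch score, and the candidate counts are illustrative assumptions, not SciLit's actual implementation; `score_fn` stands in for an expensive neural reranker.

```python
import numpy as np

def search(query_emb, corpus, score_fn, prefetch_k=100, final_k=5):
    """Two-stage retrieval: cheap embedding prefetch, then expensive rerank.

    corpus: list of (doc_id, doc_embedding, doc_text) tuples.
    score_fn: costly scorer (e.g. a neural cross-encoder), applied only
    to the small prefetched candidate set rather than the whole corpus.
    """
    # Stage 1: prefetch by dot-product similarity over the full corpus.
    candidates = sorted(corpus, key=lambda d: -float(query_emb @ d[1]))[:prefetch_k]
    # Stage 2: rerank only the candidates with the expensive model.
    reranked = sorted(candidates, key=lambda d: -score_fn(query_emb, d))
    return [d[0] for d in reranked[:final_k]]
```

Because only the prefetch index touches the whole database, adding or removing a paper database amounts to updating that index; the reranker is unchanged.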
Spike Correlations in a Songbird Agree with a Simple Markov Population Model
The relationships between neural activity at the single-cell and the population levels are of central importance for understanding neural codes. In many sensory systems, collective behaviors in large cell groups can be described by pairwise spike correlations. Here, we test whether in a highly specialized premotor system of songbirds, pairwise spike correlations themselves can be seen as a simple corollary of an underlying random process. We test hypotheses on connectivity and network dynamics in the motor pathway of zebra finches using a high-level population model that is independent of detailed single-neuron properties. We assume that neural population activity evolves along a finite set of states during singing, and that during sleep population activity randomly switches back and forth between song states and a single resting state. Individual spike trains are generated by associating with each of the population states a particular firing mode, such as bursting or tonic firing. With an overall modification of one or two simple control parameters, the Markov model is able to reproduce observed firing statistics and spike correlations in different neuron types and behavioral states. Our results suggest that song- and sleep-related firing patterns are identical on short time scales and result from random sampling of a unique underlying theme. The efficiency of our population model may apply also to other neural systems in which population hypotheses can be tested on recordings from small neuron groups.
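The sleep-state switching described above can be caricatured as a tiny Markov chain that hops between a resting state and one of a finite set of song states. The transition probabilities, state representation, and function name below are illustrative assumptions, not fitted parameters from the paper.

```python
import random

def simulate_sleep(n_song_states, p_enter=0.1, p_exit=0.5, steps=1000, seed=0):
    """Toy population-state trajectory during sleep: the population randomly
    switches between a resting state (None) and one of n_song_states song
    states. Each state would be associated with a firing mode (e.g. bursting
    or tonic firing) to generate individual spike trains.

    p_enter: per-step probability of jumping from rest into a random song state.
    p_exit:  per-step probability of falling back to rest.
    """
    rng = random.Random(seed)
    state, trajectory = None, []
    for _ in range(steps):
        if state is None:
            if rng.random() < p_enter:
                state = rng.randrange(n_song_states)  # replay a random song state
        elif rng.random() < p_exit:
            state = None  # fall back to rest
        trajectory.append(state)
    return trajectory
```

In this picture, song and sleep differ only in how states are sequenced (deterministic progression versus random switching), which is why short-time-scale firing patterns can look identical in both behavioral states.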
Bilateral neurotoxic lesions in NCM before tutoring onset do not prevent successful tutor song learning
Sensorimotor learning crucially depends on the ability to acquire a sensory memory for shaping motor commands. Such learning is conveniently studied in young songbirds when they memorize the song of an adult singer and gradually transform their own vocalizations toward the memorized target song. Here we study the involvement of the Caudal Medial Nidopallium (NCM), a higher auditory cortical area, in acquisition of a song memory. NCM has previously been shown to be involved in tutor song memorization. To study the necessity of NCM in this process, we perform large irreversible NCM lesions using ibotenic acid injections in about 40-day-old juvenile zebra finches, before their first exposure to tutor song. Surprisingly, NCM-lesioned juveniles successfully copied the tutor song at least as well as untreated control animals, showing that a fully intact NCM is not required for tutor song memory formation and normal song development.
MemSum: Extractive Summarization of Long Documents using Multi-step Episodic Markov Decision Processes
We introduce MemSum (Multi-step Episodic Markov decision process extractive SUMmarizer), a reinforcement-learning-based extractive summarizer enriched at any given time step with information on the current extraction history. Similar to previous models in this vein, MemSum iteratively selects sentences into the summary. Our innovation is in considering a broader information set when summarizing that would intuitively also be used by humans in this task: 1) the text content of the sentence, 2) the global text context of the rest of the document, and 3) the extraction history consisting of the set of sentences that have already been extracted. With a lightweight architecture, MemSum nonetheless obtains state-of-the-art test-set performance (ROUGE score) on long document datasets (PubMed, arXiv, and GovReport). Supporting analysis demonstrates that the added awareness of extraction history gives MemSum robustness against redundancy in the source document.
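The history-aware extraction loop can be sketched as follows. Here `score_fn` stands in for the learned policy, and the stop rule and names are assumptions for illustration; the point is that the scorer sees the extraction history at every step, which is what lets it avoid redundant selections.

```python
def extract_summary(sentences, score_fn, max_sents=3):
    """Iterative extraction in the spirit of MemSum: at each step the policy
    scores every remaining sentence given (a) its text, (b) the rest of the
    document, and (c) the extraction history, and may also decide to stop.

    score_fn(sentences, i, history) -> float; a non-positive best score
    plays the role of the stop action in this sketch.
    """
    history, remaining = [], list(range(len(sentences)))
    while remaining and len(history) < max_sents:
        best = max(remaining, key=lambda i: score_fn(sentences, i, history))
        if score_fn(sentences, best, history) <= 0:
            break  # stop action: nothing left worth extracting
        history.append(best)
        remaining.remove(best)
    return [sentences[i] for i in history]
```

A scorer that penalizes sentences already covered by the history reproduces the redundancy-avoidance behavior the abstract describes.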
MemSum-DQA: Adapting An Efficient Long Document Extractive Summarizer for Document Question Answering
We introduce MemSum-DQA, an efficient system for document question answering (DQA) that leverages MemSum, a long document extractive summarizer. By prefixing each text block in the parsed document with the provided question and question type, MemSum-DQA selectively extracts text blocks as answers from documents. On full-document answering tasks, this approach yields a 9% improvement in exact match accuracy over prior state-of-the-art baselines. Notably, MemSum-DQA excels in addressing questions related to child-relationship understanding, underscoring the potential of extractive summarization techniques for DQA tasks.
Comment: This paper is the technical research paper of the CIKM 2023 DocIU challenges. The authors received the CIKM 2023 DocIU Winner Award, sponsored by Google, Microsoft, and the Centre for data-driven geoscience.
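The question-prefixing step can be illustrated in a few lines; the separator format below is an assumption for illustration, not the system's actual input template.

```python
def build_inputs(question, question_type, blocks):
    """MemSum-DQA-style input construction: prefix every parsed text block
    with the question and its type, so the extractor can condition its
    block-selection decisions on them."""
    return [f"{question_type} | {question} | {block}" for block in blocks]
```

With inputs built this way, answering reduces to the same extractive selection problem MemSum already solves, only over question-conditioned blocks instead of plain sentences.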
Local Citation Recommendation with Hierarchical-Attention Text Encoder and SciBERT-based Reranking
The goal of local citation recommendation is to recommend a missing reference from the local citation context and optionally also from the global context. To balance the tradeoff between speed and accuracy of citation recommendation in the context of a large-scale paper database, a viable approach is to first prefetch a limited number of relevant documents using efficient ranking methods and then to perform a fine-grained reranking using more sophisticated models. In that vein, BM25 has been found to be a tough-to-beat approach to prefetching, which is why recent work has focused mainly on the reranking step. Even so, we explore prefetching with nearest neighbor search among text embeddings constructed by a hierarchical attention network. When coupled with a SciBERT reranker fine-tuned on local citation recommendation tasks, our Hierarchical Attention encoder (HAtten) achieves high prefetch recall for a given number of candidates to be reranked. Consequently, our reranker requires fewer prefetch candidates to rerank, yet still achieves state-of-the-art performance on various local citation recommendation datasets such as ACL-200, FullTextPeerRead, RefSeer, and arXiv.
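Prefetch recall for a given candidate budget is the quantity the speed/accuracy tradeoff hinges on: if the true reference is in the top-k prefetched candidates, the reranker still has a chance to surface it. A minimal sketch of that metric, with an assumed data layout:

```python
def prefetch_recall_at_k(queries, k):
    """Fraction of queries whose ground-truth cited paper appears among the
    top-k prefetched candidates.

    queries: list of (ranked_candidate_ids, true_id) pairs, where
    ranked_candidate_ids is the prefetcher's output in descending score order.
    """
    hits = sum(true_id in ranked[:k] for ranked, true_id in queries)
    return hits / len(queries)
```

A prefetcher with higher recall at small k lets the reranker process fewer candidates for the same end-to-end accuracy, which is exactly the advantage claimed for HAtten over BM25.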
Correlative Microscopy of Densely Labeled Projection Neurons Using Neural Tracers
Three-dimensional morphological information about neural microcircuits is of high interest in neuroscience, but acquiring this information remains challenging. A promising new correlative technique for brain imaging is array tomography (Micheva and Smith, 2007), in which series of ultrathin brain sections are treated with fluorescent antibodies against neurotransmitters and synaptic proteins. Treated sections are repeatedly imaged in the fluorescence light microscope (FLM) and then in the electron microscope (EM). We explore a similar correlative imaging technique in which we differentially label distinct populations of projection neurons, the key routers of electrical signals in the brain. In songbirds, projection neurons can easily be labeled using neural tracers, because the vocal control areas are segregated into separate nuclei. We inject tracers into areas afferent and efferent to the main premotor area for vocal production, HVC, to retrogradely and anterogradely label different classes of projection neurons. We optimize tissue preparation protocols to achieve high fluorescence contrast in the FLM and good ultrastructure in the EM (using osmium tetroxide). Although tracer fluorescence is lost during EM preparation, we localize the tracer molecules after fixation and embedding by using fluorescent antibodies against them. We detect signals mainly in somata and dendrites, allowing us to classify synapses within a single ultrathin section as belonging to a particular type of projection neuron. Our method can be used to provide statistical information about connectivity among different neuron classes, and to elucidate how signals in the brain are processed and routed among different areas.