Joint Syntacto-Discourse Parsing and the Syntacto-Discourse Treebank
Discourse parsing has long been treated as a stand-alone problem, independent from constituency or dependency parsing. Most attempts at this problem are pipelined rather than end-to-end, sophisticated rather than simple, and not self-contained: they assume gold-standard text segmentations (Elementary Discourse Units) and use external parsers for syntactic features. In this paper we propose the first end-to-end discourse parser that jointly parses at both the syntax and discourse levels, as well as the first syntacto-discourse treebank, built by integrating the Penn Treebank with the RST Treebank. Built upon our recent span-based constituency parser, this joint syntacto-discourse parser requires no preprocessing whatsoever (such as segmentation or feature extraction) and achieves state-of-the-art end-to-end discourse parsing accuracy.
Comment: Accepted at EMNLP 2017
Can human association norms evaluate latent semantic analysis?
This paper compares word association norms created in a psycholinguistic experiment to association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how well those automatically generated lists reflect real semantic relations.
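The Church and Hanks method referenced above scores word pairs by their association ratio, i.e. pointwise mutual information over co-occurrence counts. A minimal sketch, assuming a simple windowed co-occurrence count (the toy corpus and window size are illustrative, not from the paper):

# Minimal sketch of the Church & Hanks association ratio (pointwise mutual
# information) over co-occurrence counts in a fixed window.
import math
from collections import Counter

def association_list(sentences, target, window=5):
    word_freq, pair_freq, total = Counter(), Counter(), 0
    for sent in sentences:
        tokens = sent.lower().split()
        word_freq.update(tokens)
        total += len(tokens)
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                pair_freq[(w, v)] += 1
                pair_freq[(v, w)] += 1
    scores = {}
    for (w, v), joint in pair_freq.items():
        if w == target:
            # PMI = log2( P(w, v) / (P(w) * P(v)) )
            p_joint = joint / total
            pmi = math.log2(p_joint / ((word_freq[w] / total) * (word_freq[v] / total)))
            scores[v] = pmi
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = ["the doctor examined the patient",
          "the nurse helped the doctor",
          "the patient thanked the nurse"]
print(association_list(corpus, "doctor"))  # association list for one cue word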
Deep Learning for Human and Biological Languages
We explore the application of deep learning to the disparate fields of natural language processing and computational biology. Both the sentences uttered by humans and the RNA and protein sequences found within the cells of their bodies can be considered formal languages in computer science: sets of strings composed from an alphabet and generated by grammar rules. To briefly characterize these languages, words in natural language sentences come in a large number of types but short sequences of tokens, while nucleotides in biological contexts come in few types but long sequences of tokens. A sentence has a possible vocabulary size greater than 100,000 but in practice usually contains fewer than 20-30 words; RNA sequences have 4 possible tokens but range anywhere from fewer than 100 to more than 10,000 nucleotides, and protein sequences similarly have 20 possible amino acid tokens. The practical differences between these contexts inform our modeling choices to make deep learning tractable and effective, and they further influence what additional algorithms are needed to attain strong results.
These widely different domains presumably have their own forms of syntactic structure, and their respective grammars dictate how words, nucleotides, and amino acids interact to form structures. In language this takes the form of syntactic parse trees, in RNA secondary-structure base pairings, and in proteins tertiary-structure contact-map pairings. We present a deep learning approach for predicting syntactic structures of human languages (parsing), together with dynamic programming techniques that allow fast linear-time decoding while maintaining close to state-of-the-art accuracy. Reordering the traditional exhaustive cubic-time CKY parsing algorithm into a left-to-right, bottom-up procedure allowed us to apply inexact beam search and then cube pruning to attain linear runtime complexity, as sketched below. Despite performing inexact search, our model attained results (91.97 F1) better than the previous state-of-the-art model (91.79 F1), which used exhaustive decoding on the same underlying neural network architecture.
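For concreteness, here is a minimal sketch of the exhaustive cubic-time CKY decoding that the left-to-right reordering and beam search make linear; the span-score function stands in for the neural model and is a placeholder, not the thesis's network:

# Exhaustive O(n^3) CKY decoding over precomputed span scores: the baseline
# that the thesis's linear-time beam-search reordering approximates.
def cky(n, span_score):
    best, split = {}, {}  # best[(i, j)] = max tree score over span [i, j)
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == 1:
                best[(i, j)] = span_score(i, j)
                continue
            k_best = max(range(i + 1, j),
                         key=lambda k: best[(i, k)] + best[(k, j)])
            best[(i, j)] = span_score(i, j) + best[(i, k_best)] + best[(k_best, j)]
            split[(i, j)] = k_best

    def tree(i, j):  # recover the argmax binary tree from the split table
        if j - i == 1:
            return (i, j)
        k = split[(i, j)]
        return ((i, j), tree(i, k), tree(k, j))

    return best[(0, n)], tree(0, n)

# Toy span scores favoring balanced splits, just to exercise the decoder.
score, t = cky(4, lambda i, j: 1.0 / (j - i))
print(score, t)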
Analogous to linguistic grammar rules, nucleotides in RNA sequences are subject to base-pairing preferences: Adenine (A) prefers to bind with Uracil (U), and Cytosine (C) prefers to bind with Guanine (G). The secondary-structure base-pairing behavior of RNA often involves interactions across the entire sequence. We present a deep learning approach for predicting the secondary structure of RNA sequences (folding), called RNA-Fix, which uses self-attention-based Transformer models to visualize and correct errors made by other structure prediction algorithms. We find that a simple architecture consisting of LSTM and Transformer layers succeeds at attaining a strong baseline, which improves further when predictions made by another program are made available as input. Visualizing the attention weights of our model, we find that strong attention in the last layer is paid to bracketed structural sections in the output.
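As an illustration of the kind of architecture described (not the exact RNA-Fix implementation, whose sizes and interfaces are not given here), a minimal PyTorch sketch of an LSTM-plus-Transformer tagger that consumes a nucleotide sequence together with another program's predicted structure:

# Hedged sketch of an LSTM + Transformer per-position tagger; all dimensions
# and names are illustrative assumptions, not the thesis's configuration.
import torch
import torch.nn as nn

class RNAFixSketch(nn.Module):
    """Maps (sequence, prior prediction) to per-position dot-bracket labels."""
    def __init__(self, n_nuc=4, n_struct=3, d_model=128, n_heads=8, n_layers=2):
        super().__init__()
        # Embed the nucleotide and the other program's predicted label jointly.
        self.nuc_emb = nn.Embedding(n_nuc, d_model // 2)
        self.struct_emb = nn.Embedding(n_struct, d_model // 2)
        self.lstm = nn.LSTM(d_model, d_model // 2,
                            batch_first=True, bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_struct)  # '(', ')', '.'

    def forward(self, nucs, prior):
        x = torch.cat([self.nuc_emb(nucs), self.struct_emb(prior)], dim=-1)
        x, _ = self.lstm(x)
        x = self.encoder(x)
        return self.head(x)

model = RNAFixSketch()
nucs = torch.randint(0, 4, (1, 60))   # batch of one length-60 sequence
prior = torch.randint(0, 3, (1, 60))  # another program's prediction as input
print(model(nucs, prior).shape)       # torch.Size([1, 60, 3])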
We further show a connection to our human language parsing work by presenting the Nussinov dynamic programming decoding algorithm adapted for deep learning, which guarantees balanced and valid base-pairing output. With cubic runtime complexity analogous to CKY, we show, on a dataset of RNA sequences limited to length 50, accuracies surpassing our RNA-Fix models. We also discuss how to linearize the runtime, which would allow us to scale to longer-sequence datasets.
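The classical Nussinov dynamic program being adapted works as follows: maximize the number of canonical base pairs, then trace back a balanced dot-bracket string. A self-contained sketch (the minimum hairpin length of 3 is the usual convention and an assumption here):

# Classical Nussinov dynamic program: O(n^3) like CKY, guaranteeing a
# balanced and valid base-pairing (dot-bracket) output by construction.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
MIN_LOOP = 3  # conventional minimum hairpin size; an assumption here

def nussinov(seq):
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - MIN_LOOP):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + dp[k + 1][j - 1] + 1)
            dp[i][j] = best

    structure = ["."] * n

    def trace(i, j):  # recover one optimal, balanced structure
        if j - i <= MIN_LOOP:
            return
        if dp[i][j] == dp[i][j - 1]:
            trace(i, j - 1)
            return
        for k in range(i, j - MIN_LOOP):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if left + dp[k + 1][j - 1] + 1 == dp[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        trace(i, k - 1)
                    trace(k + 1, j - 1)
                    return

    trace(0, n - 1)
    return dp[0][n - 1], "".join(structure)

print(nussinov("GGGAAAUCC"))  # -> (2, '((.....))')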
Even more complex than RNA, protein sequences feature many more possible interactions between the 20 different types of amino acids. A typical way to model how a protein sequence will eventually fold into a 3D molecule is to first search a database for many similar or homologous sequences, use the resulting multiple sequence alignment (MSA) as the input, and then predict the distance from each amino acid position to every other, a representation called a contact map. We present a deep learning approach for predicting the tertiary structure of protein sequences (contact map prediction), together with an algorithm that improves the input and output simultaneously by iteratively realigning the former based on the latter. Focusing on cases where few or no homologous sequences can be found for a given input protein sequence (MSA size 10), we find that this iterative process of realigning input sequences and output structures yields improvements especially in short-range, but also in medium- and long-range contacts.
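A contact map itself is conventionally obtained by thresholding inter-residue distances; the 8-angstrom cutoff below is the common CASP-style convention, an assumption rather than a detail from this abstract. A minimal numpy sketch:

# Binary contact map from per-residue 3D coordinates by thresholding
# pairwise distances. The coordinates here are random placeholders.
import numpy as np

def contact_map(coords, threshold=8.0):
    # coords: (L, 3) array of one representative atom per residue
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return dist < threshold

L = 50
coords = np.cumsum(np.random.randn(L, 3) * 2.0, axis=0)  # a random "chain"
cmap = contact_map(coords)
# Long-range contacts: sequence separation of 24+ positions
i, j = np.triu_indices(L, k=24)
print("long-range contacts:", int(cmap[i, j].sum()))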
Because Syntax does Matter: Improving Predicate-Argument Structures Parsing Using Syntactic Features
Parsing full-fledged predicate-argument structures in a deep syntax framework requires graphs to be predicted. Using the DeepBank (Flickinger et al., 2012) and the Predicate-Argument Structure treebank (Miyao and Tsujii, 2005) as a test field, we show how transition-based parsers, extended to handle connected graphs, benefit from the use of topologically different syntactic features such as dependencies, tree fragments, spines, or syntactic paths, which bring much-needed context to the parsing models and improve results notably on long-distance dependencies and elided coordinate structures. By confirming this positive impact on an accurate 2nd-order graph-based parser (Martins and Almeida, 2014), we establish a new state of the art on these data sets.
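As a hedged illustration of one feature type named above, a sketch of reading off the syntactic path between two tokens in a dependency tree via their lowest common ancestor; the exact path encoding used in the paper is not specified here, so this representation is an illustrative assumption:

# Dependency-tree path between two tokens: climb from the first token to the
# lowest common ancestor, then descend to the second.
def ancestors(node, heads):
    chain = [node]
    while heads[node] != 0:  # 0 marks the root
        node = heads[node]
        chain.append(node)
    return chain

def syntactic_path(a, b, heads, labels):
    up, down = ancestors(a, heads), ancestors(b, heads)
    common = next(n for n in up if n in down)  # lowest common ancestor
    path = ["^" + labels[n] for n in up[: up.index(common)]]
    path += ["v" + labels[n] for n in reversed(down[: down.index(common)])]
    return "/".join(path)

# Toy tree for "the cat sat": heads[token] = index of its head (0 = root)
heads = {1: 2, 2: 3, 3: 0}
labels = {1: "det", 2: "nsubj", 3: "root"}
print(syntactic_path(1, 3, heads, labels))  # -> ^det/^nsubj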
Strategies to Address Data Sparseness in Implicit Semantic Role Labeling
Natural language texts frequently contain predicates whose complete understanding requires access to other parts of the discourse. Human readers can retrieve such information across sentence boundaries and infer the implicit piece of information. This capability enables us to understand complicated texts without needing to repeat the same information in every single sentence. However, for computational systems, resolving such information is problematic, because computational approaches traditionally rely on sentence-level processing and rarely take into account the extra-sentential context.
In this dissertation, we investigate this omission phenomenon, known as implicit semantic role labeling (ISRL). Implicit semantic role labeling involves the identification of predicate arguments that are not locally realized but are resolvable from the context. For example, in "'What's the matter, Walters?' asked Baynes sharply.", the ADDRESSEE of the predicate ask, Walters, is not mentioned as one of its syntactic arguments, but is recoverable from the previous sentence. In this thesis, we try to improve methods for the automatic processing of such predicate instances in order to improve natural language processing applications. Our main contribution is introducing approaches to solve the data sparseness problem of the task. We improve the automatic identification of implicit roles by increasing the amount of training data without needing to annotate new instances. For this purpose, we propose two approaches. First, we use crowdsourcing to annotate instances of implicit semantic roles and show that, with an appropriate task design, reliable annotation of implicit semantic roles can be obtained from non-experts without the need to present them with precise linguistic definitions of the roles. Second, we combine seemingly incompatible corpora to address the data sparseness of ISRL by applying a domain adaptation technique. We show that out-of-domain data from a different genre can successfully improve a baseline implicit semantic role labeling model when used with an appropriate domain adaptation technique. The results also show that the improvement occurs regardless of the predicate's part of speech; that is, identification of implicit roles relies more on semantic features than on syntactic ones. Therefore, annotating instances of nominal predicates, for instance, can help to improve the identification of verbal predicates' implicit roles as well. Our findings also show that the variety of the additional data is more important than its size; that is, adding a large amount of data does not necessarily lead to a better model.
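The abstract does not name the domain adaptation technique used. One standard choice for mixing corpora from different genres is Daumé III's feature augmentation, sketched below purely as an illustration of how such a technique operates, not as the thesis's actual method:

# "Frustratingly easy" domain adaptation (Daume III, 2007): copy each feature
# into a shared version and a domain-specific version, letting the learner
# decide which information transfers across genres.
def augment(features, domain):
    out = {}
    for name, value in features.items():
        out["shared:" + name] = value     # fires in every domain
        out[domain + ":" + name] = value  # fires only in this domain
    return out

# Hypothetical ISRL-style features, for illustration only.
news = augment({"pred_lemma=ask": 1.0, "pred_pos=VB": 1.0}, "news")
fiction = augment({"pred_lemma=ask": 1.0, "pred_pos=VB": 1.0}, "fiction")
print(sorted(news))
# ['news:pred_lemma=ask', 'news:pred_pos=VB',
#  'shared:pred_lemma=ask', 'shared:pred_pos=VB']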
A Transition-Based Directed Acyclic Graph Parser for UCCA
We present the first parser for UCCA, a cross-linguistically applicable
framework for semantic representation, which builds on extensive typological
work and supports rapid annotation. UCCA poses a challenge for existing parsing
techniques, as it exhibits reentrancy (resulting in DAG structures),
discontinuous structures and non-terminal nodes corresponding to complex
semantic units. To our knowledge, the conjunction of these formal properties is
not supported by any existing parser. Our transition-based parser, which uses a
novel transition set and features based on bidirectional LSTMs, has value not
just for UCCA parsing: its ability to handle more general graph structures can
inform the development of parsers for other semantic DAG structures, and for languages that frequently use discontinuous structures.
Comment: 16 pages; accepted as a long paper at ACL 2017
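As a hedged illustration of why a transition system can produce DAGs, a minimal sketch in which edge actions attach without popping, so a node can receive several incoming edges (reentrancy); this is a simplification, not the paper's actual transition set (which also includes NODE, SWAP, and remote-edge actions):

# Minimal stack/buffer transition state where LEFT-EDGE / RIGHT-EDGE add
# edges without removing nodes, making reentrant (DAG) structures reachable.
class State:
    def __init__(self, tokens):
        self.stack, self.buffer, self.edges = [], list(tokens), []

    def shift(self):               # move next buffer token onto the stack
        self.stack.append(self.buffer.pop(0))

    def reduce(self):              # discard the top of the stack
        self.stack.pop()

    def right_edge(self, label):   # stack[-2] --label--> stack[-1]
        self.edges.append((self.stack[-2], label, self.stack[-1]))

    def left_edge(self, label):    # stack[-1] --label--> stack[-2]
        self.edges.append((self.stack[-1], label, self.stack[-2]))

s = State(["John", "runs", "and", "jumps"])
s.shift(); s.shift()
s.left_edge("A")   # runs -A-> John
s.reduce()
s.shift(); s.reduce()
s.shift()
s.left_edge("A")   # jumps -A-> John: a second incoming edge, hence a DAG
print(s.edges)     # [('runs', 'A', 'John'), ('jumps', 'A', 'John')]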