Maximum Entropy Models For Natural Language Ambiguity Resolution
This thesis demonstrates that several important kinds of natural language ambiguities can be resolved to state-of-the-art accuracies using a single statistical modeling technique based on the principle of maximum entropy.
We discuss the problems of sentence boundary detection, part-of-speech tagging, prepositional phrase attachment, natural language parsing, and text categorization under the maximum entropy framework. In practice, we have found that maximum entropy models offer the following advantages:
State-of-the-art Accuracy: The probability models for all of the tasks discussed perform at or near state-of-the-art accuracies, or outperform competing learning algorithms when trained and tested under similar conditions. Methods which outperform those presented here require much more supervision in the form of additional human involvement or additional supporting resources.
Knowledge-Poor Features: The facts used to model the data, or features, are linguistically very simple, or knowledge-poor, yet they succeed in approximating complex linguistic relationships.
Reusable Software Technology: The mathematics of the maximum entropy framework are essentially independent of any particular task, and a single software implementation can be used for all of the probability models in this thesis.
The experiments in this thesis suggest that experimenters can obtain state-of-the-art accuracies on a wide range of natural language tasks, with little task-specific effort, by using maximum entropy probability models.
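The abstract above does not spell out the model form or the training procedure, so the following is only a minimal sketch under stated assumptions: a conditional maximum entropy model in which p(y | x) is proportional to the exponentiated sum of the weights of the active features, trained with plain batch gradient ascent on the log-likelihood (chosen here for brevity; the thesis may use a different estimator). The predicate names (`next_cap`, `prev_abbrev`, and so on), the toy sentence-boundary data, and the function names are all hypothetical.

```python
import math
from collections import defaultdict

def predict_proba(weights, predicates, labels):
    """p(y | x) proportional to exp(sum of weights of active (predicate, y) features)."""
    scores = {y: sum(weights[(p, y)] for p in predicates) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train_maxent(data, labels, epochs=200, lr=0.5):
    """Batch gradient ascent: gradient = observed minus expected feature counts."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for predicates, gold in data:
            probs = predict_proba(weights, predicates, labels)
            for p in predicates:
                grad[(p, gold)] += 1.0        # observed count
                for y in labels:
                    grad[(p, y)] -= probs[y]  # expected count under the model
        for feat, g in grad.items():
            weights[feat] += lr * g / len(data)
    return weights

# Hypothetical toy task: is a period a sentence boundary?
LABELS = ["boundary", "no_boundary"]
DATA = [
    ({"next_cap", "prev_word"}, "boundary"),
    ({"next_lower", "prev_abbrev"}, "no_boundary"),
    ({"next_cap", "prev_abbrev"}, "no_boundary"),
    ({"next_cap", "prev_word"}, "boundary"),
    ({"next_lower", "prev_word"}, "no_boundary"),
]

weights = train_maxent(DATA, LABELS)
probs = predict_proba(weights, {"next_cap", "prev_word"}, LABELS)
```

The same two functions can be reused for any task that can be phrased as predicting a label from a set of context predicates, which is the kind of task-independence the abstract describes; only the feature extraction changes.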
Online learning of latent linguistic structure with approximate search
Automatic analysis of natural language data is a frequently occurring application of machine learning systems. These analyses often revolve around some linguistic structure, for instance a syntactic analysis of a sentence by means of a tree. Machine learning models that carry out structured prediction, as opposed to simpler machine learning tasks such as classification or regression, have therefore received considerable attention in the language processing literature.
As an additional twist, the sought linguistic structures are sometimes not directly modeled themselves. Rather, prediction takes place in a different space where the same linguistic structure can be represented in more than one way. However, in a standard supervised learning setting, the training data contains only the linguistic structure, not these prediction structures. Since multiple prediction structures may correspond to the same linguistic structure, it is thus unclear which prediction structure to use for learning. One option is to treat the prediction structure as latent and let the machine learning algorithm guide this selection.
In this dissertation we present an abstract framework for structured prediction. This framework supports latent structures and is agnostic of the particular language processing task. It defines a set of hyperparameters and task-specific functions which a user must implement in order to apply it to a new task. The advantage of this modularization is that it permits comparisons and reuse across tasks in a common framework.
The framework we devise is based on the structured perceptron for learning. The perceptron is an online learning algorithm which considers one training instance at a time, makes a prediction, and carries out an update if the prediction was wrong. We couple the structured perceptron with beam search, which is a general purpose search algorithm. Beam search is, however, only approximate, meaning that there is no guarantee that it will find the optimal structure in a large search space. Therefore special attention is required to handle search errors during training. This has led to the development of special update methods such as early and max-violation updates.
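The interaction between the structured perceptron, beam search, and early updates described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's framework: the task (toy sequence tagging), the feature templates, the data, and all function names are hypothetical. When the gold prefix falls off the beam during training, search stops and the update is made on the partial sequences only.

```python
from collections import defaultdict

def features(words, i, prev_tag, tag):
    # Hypothetical feature templates: word/tag and tag-bigram indicators.
    return [("w", words[i], tag), ("t", prev_tag, tag)]

def step_score(weights, words, i, prev_tag, tag):
    return sum(weights[f] for f in features(words, i, prev_tag, tag))

def beam_search(weights, words, tags, beam_size, gold=None):
    """Return the best (possibly partial) tag sequence.

    If gold is given, stop as soon as the gold prefix is no longer
    in the beam and return the best prefix so far (early update).
    """
    beam = [(0.0, [])]
    for i in range(len(words)):
        candidates = []
        for score, seq in beam:
            prev = seq[-1] if seq else "<s>"
            for t in tags:
                candidates.append((score + step_score(weights, words, i, prev, t), seq + [t]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
        if gold is not None and all(seq != gold[:i + 1] for _, seq in beam):
            return beam[0][1]  # search error: gold fell off the beam
    return beam[0][1]

def perceptron_update(weights, words, gold, pred):
    # Update only over the length of pred, which may be a prefix.
    for i in range(len(pred)):
        gold_prev = gold[i - 1] if i else "<s>"
        pred_prev = pred[i - 1] if i else "<s>"
        for f in features(words, i, gold_prev, gold[i]):
            weights[f] += 1.0
        for f in features(words, i, pred_prev, pred[i]):
            weights[f] -= 1.0

def train(data, tags, epochs=5, beam_size=2):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:  # online: one instance at a time
            pred = beam_search(weights, words, tags, beam_size, gold)
            if pred != gold:
                perceptron_update(weights, words, gold, pred)
    return weights

TAGS = ["D", "N", "V"]
DATA = [
    (["the", "dog", "barks"], ["D", "N", "V"]),
    (["the", "cat", "sleeps"], ["D", "N", "V"]),
    (["dogs", "bark"], ["N", "V"]),
]
weights = train(DATA, TAGS)
predictions = [beam_search(weights, words, TAGS, beam_size=2) for words, _ in DATA]
```

Stopping at the first search error guarantees that the update is a valid one (the predicted prefix scores at least as high as the gold prefix), but it also means the remainder of the training instance is never used, which is the data-discarding effect discussed later in this abstract.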
The contributions of this dissertation sit at the intersection of machine learning and natural language processing. With regard to language processing, we consider three tasks: coreference resolution, dependency parsing, and joint sentence segmentation and dependency parsing. For coreference resolution, we start from an existing latent tree model and extend it to accommodate non-local features drawn from a greater structural context. This requires us to give up exact search in favor of approximate search, but we show that, provided sufficiently advanced update methods are used for the structured perceptron, the richer scope of features yields a stronger coreference model. We take a transition-based approach to dependency parsing, where dependency trees are constructed incrementally by a transition system. Latent structures for transition-based parsing have previously not received enough attention, partly because the characterization of the prediction space is non-trivial. We provide a thorough analysis of this space with regard to the ArcStandard with Swap transition system. This characterization enables us to evaluate the role of latent structures in transition-based dependency parsing. Empirically we find that the utility of latent structures depends on the choice of approximate search -- for greedy search they improve performance, whereas with beam search they are on par with, or sometimes slightly ahead of, previous approaches. We then go on to extend this transition system to do joint sentence segmentation and dependency parsing. We develop a transition system capable of handling this task and evaluate it on noisy, non-edited texts. With a set of carefully selected baselines and data sets we employ this system to measure the effectiveness of syntactic information for sentence segmentation. We show that, in the absence of obvious orthographic clues such as punctuation and capitalization, syntactic information can be used to improve sentence segmentation.
With regard to machine learning, our contributions of course include the framework itself. The task-specific evaluations, however, allow us to probe the learning machinery along certain boundary points and draw more general conclusions. A recurring observation is that some of the standard update methods for the structured perceptron with approximate search -- e.g., early and max-violation updates -- are inadequate when the predicted structure reaches a certain size. We show that the primary problem with these updates is that they may discard training data and that this effect increases as the structure size increases. This problem can be handled by using more advanced update methods that commit to using all the available training data. Here, we propose a new update method, DLaSO, which consistently outperforms all other update methods we compare to. Moreover, while this problem potentially could be handled by an increased beam size, we also show that this cannot fully compensate for the structure size and that the more advanced methods indeed are required.
Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision
Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties.
The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings.
Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared both to unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a large number of target languages, in the setting where no annotated training data is available in the target language.
Where are you talking about? Advances and Challenges of Geographic Analysis of Text with Application to Disease Monitoring
The Natural Language Processing task we focus on in this thesis is Geoparsing. Geoparsing is the process of extraction and grounding of toponyms (place names). Consider this sentence: "The victims of the Spanish earthquake off the coast of Malaga were of American and Mexican origin." Four toponyms will be extracted (called Geotagging) and grounded to their geographic coordinates (called Toponym Resolution). However, our research goes further than any previous work by showing how to distinguish the literal place(s) of the event (Spain, Malaga) from other linguistic types/uses such as nationalities (Mexican, American), improving downstream task accuracy. We consolidate and extend the Standard Evaluation Framework, discuss key research problems, then present concrete solutions in order to advance each stage of geoparsing. For geotagging, as well as training a SOTA neural Location-NER tagger, we simplify Metonymy Resolution with a novel minimalist feature extraction combined with an LSTM-based classifier, matching SOTA results. For toponym resolution, we deploy the latest deep learning methods to achieve SOTA performance by augmenting neural models with hitherto unused geographic features called Map Vectors. With each research project, we provide high-quality datasets and system prototypes, further building resources in this field. We then show how these geoparsing advances coupled with our proposed Intra-Document Analysis can be used to associate news articles with locations in order to monitor the spread of public health threats. To this end, we evaluate our research contributions with production data from a real-time downstream application to improve geolocation of news events for disease monitoring. 
The data was made available to us by the Joint Research Centre (JRC), which operates one such system called MediSys that processes incoming news articles in order to monitor threats to public health and make these available to a variety of governmental, business and non-profit organisations. We also discuss steps towards an end-to-end, automated news monitoring system and make actionable recommendations for future work. In summary, the thesis aims are twofold: (1) Generate original geoparsing research aimed at advancing each stage of the pipeline by addressing pertinent challenges with concrete solutions and actionable proposals. (2) Demonstrate how this research can be applied to news event monitoring to increase the efficacy of existing biosurveillance systems, e.g. the European Commission's MediSys. I was generously funded by DREAM CDT, which was funded by NERC of UKRI.
Cross-Lingual Transfer of Natural Language Processing Systems
Accurate natural language processing systems rely heavily on annotated datasets. In the absence of such datasets, transfer methods can help to develop a model by transferring annotations from one or more rich-resource languages to the target language of interest. These methods are generally divided into two approaches: 1) annotation projection from translation data, aka parallel data, using supervised models in rich-resource languages, and 2) direct model transfer from annotated datasets in rich-resource languages.
In this thesis, we demonstrate different methods for transfer of dependency parsers and sentiment analysis systems. We propose an annotation projection method that performs well in the scenarios for which a large amount of in-domain parallel data is available. We also propose a method which is a combination of annotation projection and direct transfer that can leverage a minimal amount of information from a small out-of-domain parallel dataset to develop highly accurate transfer models. Furthermore, we propose an unsupervised syntactic reordering model to improve the accuracy of dependency parser transfer for non-European languages. Finally, we conduct a diverse set of experiments for the transfer of sentiment analysis systems in different data settings.
A summary of our contributions is as follows:
* We develop accurate dependency parsers using parallel text in an annotation projection framework. We make use of the fact that the density of word alignments is a valuable indicator of reliability in annotation projection.
* We develop accurate dependency parsers in the absence of a large amount of parallel data. We use the Bible data, which is orders of magnitude smaller than a conventional parallel dataset, to provide minimal cues for creating cross-lingual word representations. Our model is also capable of boosting the performance of annotation projection with a large amount of parallel data. Our model develops cross-lingual word representations to go beyond the traditional delexicalized direct transfer methods. Moreover, we propose a simple but effective word translation approach that brings explicit lexical features from the target language into our direct transfer method.
* We develop different syntactic reordering models that can change the source treebanks in rich-resource languages, thus preventing the learning of a wrong model for an unrelated language. Our experimental results show substantial improvements on non-European languages.
* We develop transfer methods for sentiment analysis in different data availability scenarios. We show that we can leverage cross-lingual word embeddings to create accurate sentiment analysis systems in the absence of annotated data in the target language of interest.
We believe that the novelties that we introduce in this thesis indicate the usefulness of transfer methods. This is appealing in practice, especially since our results suggest that the requirement for annotating new datasets for low-resource languages, which are expensive, if not impossible, to obtain, can be eliminated.
Deep Learning for Human and Biological Languages
We explore the application of deep learning to the disparate fields of natural language processing and computational biology. Both the sentences uttered by humans and the RNA and protein sequences found within the cells of their bodies can be considered formal languages in the computer science sense: sets of strings composed from an alphabet and generated by grammar rules. To briefly characterize these languages, words in natural language sentences have a large number of types but a limited sequence of tokens, while nucleotides in biological contexts have limited types in long sequences of tokens. A sentence draws on a possible vocabulary of more than 100,000 words but in practice usually has fewer than 20-30 words; RNA sequences have 4 possible tokens but range in length from fewer than 100 to more than 10,000 nucleotides. Protein sequences similarly have 20 possible amino acid tokens. The practical differences between these contexts inform our modeling choices to make deep learning tractable and effective, and they further influence what additional algorithms are needed to attain strong results.
These widely different domains presumably have their own forms of syntactic structure, and their respective grammars dictate how words, nucleotides, and amino acids interact among themselves to form structures. With language this takes the form of syntactic parse trees, with RNA secondary structure base pairings, and with proteins tertiary structure contact map pairings. We present a deep learning approach for predicting syntactic structures for human languages (parsing), and dynamic programming techniques that allow for fast linear-time decoding while maintaining close to state-of-the-art accuracy. Reordering the traditional exhaustive cubic-time CKY parsing algorithm into a left-to-right, bottom-up form allowed us to additionally apply inexact beam search and then cube pruning to attain linear runtime complexity. Despite being an inexact search, our model attained results (91.97 F1) better than the previous state-of-the-art model (91.79 F1), which used exhaustive decoding with the same underlying neural network architecture.
Analogous to linguistic grammar rules, nucleotides in RNA sequences are also subject to base pairing potentials, as Adenine (A) prefers to bind with Uracil (U) and Cytosine (C) prefers to bind with Guanine (G). The secondary structure base pairing behavior of RNA often involves interactions across the entire sequence. We present a deep learning approach for predicting secondary structure for RNA sequences (folding), and a method called RNA-Fix that uses self-attention-based Transformer models to visualize and correct errors made by other structure prediction algorithms. We find that a simple architecture consisting of LSTM and Transformer layers succeeds at attaining a strong baseline, which then further improves when predictions made by another program are made available as input. Visualizing the attention weights of our model, we find that strong attention in the last layer is paid towards bracketed structural sections in the output.
We further show a connection to our human language parsing work by presenting the Nussinov dynamic programming decoding algorithm adapted for deep learning, which guarantees a balanced and valid base pairing output. With cubic runtime complexity analogous to CKY, this approach achieves, on a dataset of RNA sequences limited to length 50, accuracies surpassing those of our RNA-Fix models. We also discuss how to linearize the runtime, which would allow us to scale to datasets of longer sequences.
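As a rough illustration of the decoding step mentioned above, the following is a minimal sketch of a Nussinov-style cubic-time dynamic program that returns a balanced dot-bracket structure. It is not the thesis's implementation: in the adapted version the pairing scores would presumably come from a neural network, whereas here a fixed Watson-Crick table stands in for them. The function names, the `pair_score` interface, and the `min_loop` constraint (minimum number of unpaired bases inside a hairpin) are assumptions made for this example.

```python
def watson_crick(seq, i, j):
    # Hypothetical stand-in for learned scores: 1.0 for A-U and C-G pairs, None otherwise.
    return 1.0 if (seq[i], seq[j]) in {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")} else None

def nussinov(seq, pair_score=watson_crick, min_loop=3):
    """Maximise the total pairing score over nested structures; O(n^3) time."""
    n = len(seq)
    best = [[0.0] * n for _ in range(n)]   # best[i][j]: best score for seq[i..j]
    back = [[None] * n for _ in range(n)]  # back[i][j]: partner of j, or None if j is unpaired
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best[i][j] = best[i][j - 1]            # case 1: j stays unpaired
            for k in range(i, j - min_loop):       # case 2: j pairs with k
                s = pair_score(seq, k, j)
                if s is None:
                    continue
                left = best[i][k - 1] if k > i else 0.0
                total = left + s + best[k + 1][j - 1]
                if total > best[i][j]:
                    best[i][j] = total
                    back[i][j] = k
    # Traceback: every "(" is emitted together with its ")", so the output is balanced.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j <= i:
            continue
        k = back[i][j]
        if k is None:
            stack.append((i, j - 1))
        else:
            structure[k], structure[j] = "(", ")"
            stack.append((i, k - 1))
            stack.append((k + 1, j - 1))
    return "".join(structure)

folded = nussinov("GGGAAAUCCC")
```

Because pairs are only ever added as nested (k, j) brackets over disjoint sub-intervals, the decoder cannot produce crossing or unbalanced output, whatever scores it is given. The three nested loops over i, j, and k are the source of the cubic runtime noted in the abstract.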
Even more complex than RNA, protein sequences feature even more possible interactions between the 20 different types of amino acids. A typical way to model how a protein sequence will eventually fold into a 3D molecule is to first search for many similar or homologous sequences in a database, and then use the resulting multiple sequence alignment (MSA) as the input, before predicting the distance between each amino acid and every other position, called a contact map. We present a deep learning approach for predicting tertiary structure for protein sequences (contact map prediction), and an algorithm that improves the input and output simultaneously by iteratively realigning the former based on the alignment of the latter. Focusing on the cases where few or no homologous sequences can be found for a given input protein sequence (MSA size 10), we find that the iterative process of realigning the input sequences and output structures results in improvement especially in short-range, but also in medium- and long-range contacts.
Semantic chunking
Long sentences pose a challenge for natural language processing (NLP) applications. They are associated with a complex information structure leading to increased requirements for processing resources. Although the issue is present in many areas of research, there is little uniformity in the solutions used by research communities dedicated to individual NLP applications. Different aspects of the problem are addressed by different tasks, such as sentence simplification or shallow chunking.
The main contribution of this thesis is the introduction of the task of semantic chunking as a general approach to reducing the cost of processing long sentences. The goal of semantic chunking is to find semantically contained fragments of a sentence representation that can be processed independently and recombined without loss of information. We anchor its principles in established concepts of semantic theory, in particular event and situation semantics. Most of the experiments in this thesis focus on semantic chunking defined on complex semantic representations in Dependency Minimal Recursion Semantics (DMRS), but we also demonstrate that the task can be performed on sentence strings. We present three chunking models: a) a rule-based proof-of-concept DMRS chunking system; b) a semi-supervised sequence labelling neural model for surface semantic chunking; c) a system capable of finding semantic chunk boundaries based on the inherent structure of DMRS graphs, generalisable in the form of descriptive templates. We show how semantic chunking can be applied within a divide-and-conquer processing paradigm, using as an example the task of realization from DMRS. The application of semantic chunking yields noticeable efficiency gains without decreasing the quality of results.
Syntax-based machine translation using dependency grammars and discriminative machine learning
Machine translation has improved enormously since the groundbreaking introduction of statistical methods in the early 2000s, going from very domain-specific systems that still performed relatively poorly despite the painstaking crafting of thousands of ad-hoc rules, to general-purpose systems automatically trained on large collections of bilingual texts which manage to deliver understandable translations that convey the general meaning of the original input.
These approaches, however, still perform well below the level of human translators, typically failing to convey detailed meaning and register, and producing translations that, while readable, are often ungrammatical and unidiomatic.
This quality gap, which is considerably large compared to most other natural language processing tasks, has been the focus of the research in recent years, with the development of increasingly sophisticated models that attempt to exploit the syntactic structure of human languages, leveraging the technology of statistical parsers, as well as advanced machine learning methods such as margin-based structured prediction algorithms and neural networks.
The translation software itself has become more complex in order to accommodate the sophistication of these advanced models: the main translation engine (the decoder) is now often combined with a pre-processor which reorders the words of the source sentences into a target-language word order, or with a post-processor that ranks and selects a translation according to a fine model from a list of candidate translations generated by a coarse model.
In this thesis we investigate the statistical machine translation problem from various angles, focusing on translation from non-analytic languages whose syntax is best described by fluid non-projective dependency grammars rather than the relatively strict phrase-structure grammars or projective dependency grammars which are most commonly used in the literature.
We propose a framework for modeling word reordering phenomena between language pairs as transitions on non-projective source dependency parse graphs. We quantitatively characterize reordering phenomena for the German-to-English language pair as captured by this framework, specifically investigating the incidence and effects of the non-projectivity of source syntax and the non-locality of word movement w.r.t. the graph structure. We evaluate several variants of hand-coded pre-ordering rules in order to assess the impact of these phenomena on translation quality.
We propose a class of dependency-based source pre-ordering approaches that reorder sentences based on flexible models trained by SVMs and several recurrent neural network architectures.
We also propose a class of translation reranking models, both syntax-free and source dependency-based, which make use of a type of neural network known as the graph echo state network, which is highly flexible and requires extremely few training resources, overcoming one of the main limitations of neural network models for natural language processing tasks.