156 research outputs found

    The Best Explanation:Beyond Right and Wrong in Question Answering

    Get PDF

    Unsupervised Methods for Learning and Using Semantics of Natural Language

    Get PDF
    Teaching the computer to understand language is the major goal in the field of natural language processing. In this thesis we introduce computational methods that aim to extract language structure — e.g. grammar, semantics or syntax — from text, which provides the computer with information in order to understand language. During the last decades, scientific efforts and the increase of computational resources made it possible to come closer to the goal of understanding language. In order to extract language structure, many approaches train the computer on manually created resources. Most of these so-called supervised methods show high performance when applied to similar textual data. However, they perform inferior when operating on textual data, which are different to the one they are trained on. Whereas training the computer is essential to obtain reasonable structure from natural language, we want to avoid training the computer using manually created resources. In this thesis, we present so-called unsupervised methods, which are suited to learn patterns in order to extract structure from textual data directly. These patterns are learned with methods that extract the semantics (meanings) of words and phrases. In comparison to manually built knowledge bases, unsupervised methods are more flexible: they can extract structure from text of different languages or text domains (e.g. finance or medical texts), without requiring manually annotated structure. However, learning structure from text often faces sparsity issues. The reason for these phenomena is that in language many words occur only few times. If a word is seen only few times no precise information can be extracted from the text it occurs. Whereas sparsity issues cannot be solved completely, information about most words can be gained by using large amounts of data. In the first chapter, we briefly describe how computers can learn to understand language. Afterwards, we present the main contributions, list the publications this thesis is based on and give an overview of this thesis. Chapter 2 introduces the terminology used in this thesis and gives a background about natural language processing. Then, we characterize the linguistic theory on how humans understand language. Afterwards, we show how the underlying linguistic intuition can be operationalized for computers. Based on this operationalization, we introduce a formalism for representing words and their context. This formalism is used in the following chapters in order to compute similarities between words. In Chapter 3 we give a brief description of methods in the field of computational semantics, which are targeted to compute similarities between words. All these methods have in common that they extract a contextual representation for a word that is generated from text. Then, this representation is used to compute similarities between words. In addition, we also present examples of the word similarities that are computed with these methods. Segmenting text into its topically related units is intuitively performed by humans and helps to extract connections between words in text. We equip the computer with these abilities by introducing a text segmentation algorithm in Chapter 4. This algorithm is based on a statistical topic model, which learns to cluster words into topics solely on the basis of the text. Using the segmentation algorithm, we demonstrate the influence of the parameters provided by the topic model. In addition, our method yields state-of-the-art performances on two datasets. In order to represent the meaning of words, we use context information (e.g. neighboring words), which is utilized to compute similarities. Whereas we described methods for word similarity computations in Chapter 3, we introduce a generic symbolic framework in Chapter 5. As we follow a symbolic approach, we do not represent words using dense numeric vectors but we use symbols (e.g. neighboring words or syntactic dependency parses) directly. Such a representation is readable for humans and is preferred in sensitive applications like the medical domain, where the reason for decisions needs to be provided. This framework enables the processing of arbitrarily large data. Furthermore, it is able to compute the most similar words for all words within a text collection resulting in a distributional thesaurus. We show the influence of various parameters deployed in our framework and examine the impact of different corpora used for computing similarities. Performing computations based on various contextual representations, we obtain the best results when using syntactic dependencies between words within sentences. However, these syntactic dependencies are predicted using a supervised dependency parser, which is trained on language-dependent and human-annotated resources. To avoid such language-specific preprocessing for computing distributional thesauri, we investigate the replacement of language-dependent dependency parsers by language-independent unsupervised parsers in Chapter 6. Evaluating the syntactic dependencies from unsupervised and supervised parses against human-annotated resources reveals that the unsupervised methods are not capable to compete with the supervised ones. In this chapter we use the predicted structure of both types of parses as context representation in order to compute word similarities. Then, we evaluate the quality of the similarities, which provides an extrinsic evaluation setup for both unsupervised and supervised dependency parsers. In an evaluation on English text, similarities computed based on contextual representations generated with unsupervised parsers do not outperform the similarities computed with the context representation extracted from supervised parsers. However, we observe the best results when applying context retrieved by the unsupervised parser for computing distributional thesauri on German language. Furthermore, we demonstrate that our framework is capable to combine different context representations, as we obtain the best performance with a combination of both flavors of syntactic dependencies for both languages. Most languages are not composed of single-worded terms only, but also contain many multi-worded terms that form a unit, called multiword expressions. The identification of multiword expressions is particularly important for semantics, as e.g. the term New York has a different meaning than its single terms New or York. Whereas most research on semantics avoids handling these expressions, we target on the extraction of multiword expressions in Chapter 7. Most previously introduced methods rely on part-of-speech tags and apply a ranking function to rank term sequences according to their multiwordness. Here, we introduce a language-independent and knowledge-free ranking method that uses information from distributional thesauri. Performing evaluations on English and French textual data, our method achieves the best results in comparison to methods from the literature. In Chapter 8 we apply information from distributional thesauri as features for various applications. First, we introduce a general setting for tackling the out-of-vocabulary problem. This problem describes the inferior performance of supervised methods according to words that are not contained in the training data. We alleviate this issue by replacing these unseen words with the most similar ones that are known, extracted from a distributional thesaurus. Using a supervised part-of-speech tagging method, we show substantial improvements in the classification performance for out-of-vocabulary words based on German and English textual data. The second application introduces a system for replacing words within a sentence with a word of the same meaning. For this application, the information from a distributional thesaurus provides the highest-scoring features. In the last application, we introduce an algorithm that is capable to detect the different meanings of a word and groups them into coarse-grained categories, called supersenses. Generating features by means of supersenses and distributional thesauri yields an performance increase when plugged into a supervised system that recognized named entities (e.g. names, organizations or locations). Further directions for using distributional thesauri are presented in Chapter 9. First, we lay out a method, which is capable of incorporating background information (e.g. source of the text collection or sense information) into a distributional thesaurus. Furthermore, we describe an approach on building thesauri for different text domains (e.g. medical or finance domain) and how they can be combined to have a high coverage of domain-specific knowledge as well as a broad background for the open domain. In the last section we characterize yet another method, suited to enrich existing knowledge bases. All three directions might be further extensions, which induce further structure based on textual data. The last chapter gives a summary of this work: we demonstrate that without language-dependent knowledge, a computer can learn to extract useful structure from text by using computational semantics. Due to the unsupervised nature of the introduced methods, we are able to extract new structure from raw textual data. This is important especially for languages, for which less manually created resources are available as well as for special domains e.g. medical or finance. We have demonstrated that our methods achieve state-of-the-art performance. Furthermore, we have proven their impact by applying the extracted structure in three natural language processing tasks. We have also applied the methods to different languages and large amounts of data. Thus, we have not proposed methods, which are suited for extracting structure for a single language, but methods that are capable to explore structure for “language” in general

    General methods for fine-grained morphological and syntactic disambiguation

    Get PDF
    We present methods for improved handling of morphologically rich languages (MRLS) where we define MRLS as languages that are morphologically more complex than English. Standard algorithms for language modeling, tagging and parsing have problems with the productive nature of such languages. Consider for example the possible forms of a typical English verb like work that generally has four four different forms: work, works, working and worked. Its Spanish counterpart trabajar has 6 different forms in present tense: trabajo, trabajas, trabaja, trabajamos, trabajáis and trabajan and more than 50 different forms when including the different tenses, moods (indicative, subjunctive and imperative) and participles. Such a high number of forms leads to sparsity issues: In a recent Wikipedia dump of more than 400 million tokens we find that 20 of these forms occur only twice or less and that 10 forms do not occur at all. This means that even if we only need unlabeled data to estimate a model and even when looking at a relatively common and frequent verb, we do not have enough data to make reasonable estimates for some of its forms. However, if we decompose an unseen form such as trabajaréis `you will work', we find that it is trabajar in future tense and second person plural. This allows us to make the predictions that are needed to decide on the grammaticality (language modeling) or syntax (tagging and parsing) of a sentence. In the first part of this thesis, we develop a morphological language model. A language model estimates the grammaticality and coherence of a sentence. Most language models used today are word-based n-gram models, which means that they estimate the transitional probability of a word following a history, the sequence of the (n - 1) preceding words. The probabilities are estimated from the frequencies of the history and the history followed by the target word in a huge text corpus. If either of the sequences is unseen, the length of the history has to be reduced. This leads to a less accurate estimate as less context is taken into account. Our morphological language model estimates an additional probability from the morphological classes of the words. These classes are built automatically by extracting morphological features from the word forms. To this end, we use unsupervised segmentation algorithms to find the suffixes of word forms. Such an algorithm might for example segment trabajaréis into trabaja and réis and we can then estimate the properties of trabajaréis from other word forms with the same or similar morphological properties. The data-driven nature of the segmentation algorithms allows them to not only find inflectional suffixes (such as -réis), but also more derivational phenomena such as the head nouns of compounds or even endings such as -tec, which identify technology oriented companies such as Vortec, Memotec and Portec and would not be regarded as a morphological suffix by traditional linguistics. Additionally, we extract shape features such as if a form contains digits or capital characters. This is important because many rare or unseen forms are proper names or numbers and often do not have meaningful suffixes. Our class-based morphological model is then interpolated with a word-based model to combine the generalization capabilities of the first and the high accuracy in case of sufficient data of the second. We evaluate our model across 21 European languages and find improvements between 3% and 11% in perplexity, a standard language modeling evaluation measure. Improvements are highest for languages with more productive and complex morphology such as Finnish and Estonian, but also visible for languages with a relatively simple morphology such as English and Dutch. We conclude that a morphological component yields consistent improvements for all the tested languages and argue that it should be part of every language model. Dependency trees represent the syntactic structure of a sentence by attaching each word to its syntactic head, the word it is directly modifying. Dependency parsing is usually tackled using heavily lexicalized (word-based) models and a thorough morphological preprocessing is important for optimal performance, especially for MRLS. We investigate if the lack of morphological features can be compensated by features induced using hidden Markov models with latent annotations (HMM-LAs) and find this to be the case for German. HMM-LAs were proposed as a method to increase part-of-speech tagging accuracy. The model splits the observed part-of-speech tags (such as verb and noun) into subtags. An expectation maximization algorithm is then used to fit the subtags to different roles. A verb tag for example might be split into an auxiliary verb and a full verb subtag. Such a split is usually beneficial because these two verb classes have different contexts. That is, a full verb might follow an auxiliary verb, but usually not another full verb. For German and English, we find that our model leads to consistent improvements over a parser not using subtag features. Looking at the labeled attachment score (LAS), the number of words correctly attached to their head, we observe an improvement from 90.34 to 90.75 for English and from 87.92 to 88.24 for German. For German, we additionally find that our model achieves almost the same performance (88.24) as a model using tags annotated by a supervised morphological tagger (LAS of 88.35). We also find that the German latent tags correlate with morphology. Articles for example are split by their grammatical case. We also investigate the part-of-speech tagging accuracies of models using the traditional treebank tagset and models using induced tagsets of the same size and find that the latter outperform the former, but are in turn outperformed by a discriminative tagger. Furthermore, we present a method for fast and accurate morphological tagging. While part-of-speech tagging annotates tokens in context with their respective word categories, morphological tagging produces a complete annotation containing all the relevant inflectional features such as case, gender and tense. A complete reading is represented as a single tag. As a reading might consist of several morphological features the resulting tagset usually contains hundreds or even thousands of tags. This is an issue for many decoding algorithms such as Viterbi which have runtimes depending quadratically on the number of tags. In the case of morphological tagging, the problem can be avoided by using a morphological analyzer. A morphological analyzer is a manually created finite-state transducer that produces the possible morphological readings of a word form. This analyzer can be used to prune the tagging lattice and to allow for the application of standard sequence labeling algorithms. The downside of this approach is that such an analyzer is not available for every language or might not have the coverage required for the task. Additionally, the output tags of some analyzers are not compatible with the annotations of the treebanks, which might require some manual mapping of the different annotations or even to reduce the complexity of the annotation. To avoid this problem we propose to use the posterior probabilities of a conditional random field (CRF) lattice to prune the space of possible taggings. At the zero-order level the posterior probabilities of a token can be calculated independently from the other tokens of a sentence. The necessary computations can thus be performed in linear time. The features available to the model at this time are similar to the features used by a morphological analyzer (essentially the word form and features based on it), but also include the immediate lexical context. As the ambiguity of word types varies substantially, we just fix the average number of readings after pruning by dynamically estimating a probability threshold. Once we obtain the pruned lattice, we can add tag transitions and convert it into a first-order lattice. The quadratic forward-backward computations are now executed on the remaining plausible readings and thus efficient. We can now continue pruning and extending the lattice order at a relatively low additional runtime cost (depending on the pruning thresholds). The training of the model can be implemented efficiently by applying stochastic gradient descent (SGD). The CRF gradient can be calculated from a lattice of any order as long as the correct reading is still in the lattice. During training, we thus run the lattice pruning until we either reach the maximal order or until the correct reading is pruned. If the reading is pruned we perform the gradient update with the highest order lattice still containing the reading. This approach is similar to early updating in the structured perceptron literature and forces the model to learn how to keep the correct readings in the lower order lattices. In practice, we observe a high number of lower updates during the first training epoch and almost exclusively higher order updates during later epochs. We evaluate our CRF tagger on six languages with different morphological properties. We find that for languages with a high word form ambiguity such as German, the pruning results in a moderate drop in tagging accuracy while for languages with less ambiguity such as Spanish and Hungarian the loss due to pruning is negligible. However, our pruning strategy allows us to train higher order models (order > 1), which give substantial improvements for all languages and also outperform unpruned first-order models. That is, the model might lose some of the correct readings during pruning, but is also able to solve more of the harder cases that require more context. We also find our model to substantially and significantly outperform a number of frequently used taggers such as Morfette and SVMTool. Based on our morphological tagger we develop a simple method to increase the performance of a state-of-the-art constituency parser. A constituency tree describes the syntactic properties of a sentence by assigning spans of text to a hierarchical bracket structure. developed a language-independent approach for the automatic annotation of accurate and compact grammars. Their implementation -- known as the Berkeley parser -- gives state-of-the-art results for many languages such as English and German. For some MRLS such as Basque and Korean, however, the parser gives unsatisfactory results because of its simple unknown word model. This model maps unknown words to a small number of signatures (similar to our morphological classes). These signatures do not seem expressive enough for many of the subtle distinctions made during parsing. We propose to replace rare words by the morphological reading generated by our tagger instead. The motivation is twofold. First, our tagger has access to a number of lexical and sublexical features not available during parsing. Second, we expect the morphological readings to contain most of the information required to make the correct parsing decision even though we know that things such as the correct attachment of prepositional phrases might require some notion of lexical semantics. In experiments on the SPMRL 2013 dataset of nine MRLS we find our method to give improvements for all languages except French for which we observe a minor drop in the Parseval score of 0.06. For Hebrew, Hungarian and Basque we find substantial absolute improvements of 5.65, 11.87 and 15.16, respectively. We also performed an extensive evaluation on the utility of word representations for morphological tagging. Our goal was to reduce the drop in performance that is caused when a model trained on a specific domain is applied to some other domain. This problem is usually addressed by domain adaption (DA). DA adapts a model towards a specific domain using a small amount of labeled or a huge amount of unlabeled data from that domain. However, this procedure requires us to train a model for every target domain. Instead we are trying to build a robust system that is trained on domain-specific labeled and domain-independent or general unlabeled data. We believe word representations to be key in the development of such models because they allow us to leverage unlabeled data efficiently. We compare data-driven representations to manually created morphological analyzers. We understand data-driven representations as models that cluster word forms or map them to a vectorial representation. Examples heavily used in the literature include Brown clusters, Singular Value Decompositions of count vectors and neural-network-based embeddings. We create a test suite of six languages consisting of in-domain and out-of-domain test sets. To this end we converted annotations for Spanish and Czech and annotated the German part of the Smultron treebank with a morphological layer. In our experiments on these data sets we find Brown clusters to outperform the other data-driven representations. Regarding the comparison with morphological analyzers, we find Brown clusters to give slightly better performance in part-of-speech tagging, but to be substantially outperformed in morphological tagging

    Exploiting word embeddings for modeling bilexical relations

    Get PDF
    There has been an exponential surge of text data in the recent years. As a consequence, unsupervised methods that make use of this data have been steadily growing in the field of natural language processing (NLP). Word embeddings are low-dimensional vectors obtained using unsupervised techniques on the large unlabelled corpora, where words from the vocabulary are mapped to vectors of real numbers. Word embeddings aim to capture syntactic and semantic properties of words. In NLP, many tasks involve computing the compatibility between lexical items under some linguistic relation. We call this type of relation a bilexical relation. Our thesis defines statistical models for bilexical relations that centrally make use of word embeddings. Our principle aim is that the word embeddings will favor generalization to words not seen during the training of the model. The thesis is structured in four parts. In the first part of this thesis, we present a bilinear model over word embeddings that leverages a small supervised dataset for a binary linguistic relation. Our learning algorithm exploits low-rank bilinear forms and induces a low-dimensional embedding tailored for a target linguistic relation. This results in compressed task-specific embeddings. In the second part of our thesis, we extend our bilinear model to a ternary setting and propose a framework for resolving prepositional phrase attachment ambiguity using word embeddings. Our models perform competitively with state-of-the-art models. In addition, our method obtains significant improvements on out-of-domain tests by simply using word-embeddings induced from source and target domains. In the third part of this thesis, we further extend the bilinear models for expanding vocabulary in the context of statistical phrase-based machine translation. Our model obtains a probabilistic list of possible translations of target language words, given a word in the source language. We do this by projecting pre-trained embeddings into a common subspace using a log-bilinear model. We empirically notice a significant improvement on an out-of-domain test set. In the final part of our thesis, we propose a non-linear model that maps initial word embeddings to task-tuned word embeddings, in the context of a neural network dependency parser. We demonstrate its use for improved dependency parsing, especially for sentences with unseen words. We also show downstream improvements on a sentiment analysis task.En els darrers anys hi ha hagut un sorgiment notable de dades en format textual. Conseqüentment, en el camp del Processament del Llenguatge Natural (NLP, de l'anglès "Natural Language Processing") s'han desenvolupat mètodes no supervistats que fan ús d'aquestes dades. Els anomenats "word embeddings", o embeddings de paraules, són vectors de dimensionalitat baixa que s'obtenen mitjançant tècniques no supervisades aplicades a corpus textuals de grans volums. Com a resultat, cada paraula del diccionari es correspon amb un vector de nombres reals, el propòsit del qual és capturar propietats sintàctiques i semàntiques de la paraula corresponent. Moltes tasques de NLP involucren calcular la compatibilitat entre elements lèxics en l'àmbit d'una relació lingüística. D'aquest tipus de relació en diem relació bilèxica. Aquesta tesi proposa models estadístics per a relacions bilèxiques que fan ús central d'embeddings de paraules, amb l'objectiu de millorar la generalització del model lingüístic a paraules no vistes durant l'entrenament. La tesi s'estructura en quatre parts. A la primera part presentem un model bilineal sobre embeddings de paraules que explota un conjunt petit de dades anotades sobre una relaxió bilèxica. L'algorisme d'aprenentatge treballa amb formes bilineals de poc rang, i indueix embeddings de poca dimensionalitat que estan especialitzats per la relació bilèxica per la qual s'han entrenat. Com a resultat, obtenim embeddings de paraules que corresponen a compressions d'embeddings per a una relació determinada. A la segona part de la tesi proposem una extensió del model bilineal a trilineal, i amb això proposem un nou model per a resoldre ambigüitats de sintagmes preposicionals que usa només embeddings de paraules. En una sèrie d'avaluacións, els nostres models funcionen de manera similar a l'estat de l'art. A més, el nostre mètode obté millores significatives en avaluacions en textos de dominis diferents al d'entrenament, simplement usant embeddings induïts amb textos dels dominis d'entrenament i d'avaluació. A la tercera part d'aquesta tesi proposem una altra extensió dels models bilineals per ampliar la cobertura lèxica en el context de models estadístics de traducció automàtica. El nostre model probabilístic obté, donada una paraula en la llengua d'origen, una llista de possibles traduccions en la llengua de destí. Fem això mitjançant una projecció d'embeddings pre-entrenats a un sub-espai comú, usant un model log-bilineal. Empíricament, observem una millora significativa en avaluacions en dominis diferents al d'entrenament. Finalment, a la quarta part de la tesi proposem un model no lineal que indueix una correspondència entre embeddings inicials i embeddings especialitzats, en el context de tasques d'anàlisi sintàctica de dependències amb models neuronals. Mostrem que aquest mètode millora l'analisi de dependències, especialment en oracions amb paraules no vistes durant l'entrenament. També mostrem millores en un tasca d'anàlisi de sentiment

    Selecting and Generating Computational Meaning Representations for Short Texts

    Full text link
    Language conveys meaning, so natural language processing (NLP) requires representations of meaning. This work addresses two broad questions: (1) What meaning representation should we use? and (2) How can we transform text to our chosen meaning representation? In the first part, we explore different meaning representations (MRs) of short texts, ranging from surface forms to deep-learning-based models. We show the advantages and disadvantages of a variety of MRs for summarization, paraphrase detection, and clustering. In the second part, we use SQL as a running example for an in-depth look at how we can parse text into our chosen MR. We examine the text-to-SQL problem from three perspectives—methodology, systems, and applications—and show how each contributes to a fuller understanding of the task.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/143967/1/cfdollak_1.pd

    A common semantic space for monolingual and cross-lingual meta-embeddings

    Get PDF
    This master’s thesis presents a new technique for creating monolingual and cross-lingual meta-embeddings. Our method integrates multiple word embeddings created from complementary techniques, textual sources, knowledge bases and languages. Existing word vectors are projected to a common semantic space using linear transformations and averaging. With our method the resulting meta-embeddings maintain the dimensionality of the original embeddings without losing information while dealing with the out-of-vocabulary (OOV) problem. Furthermore, empirical evaluation demonstrates the effectiveness of our technique with respect to previous work on various intrinsic and extrinsic multilingual evaluations

    A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models

    Full text link
    Word representation has always been an important research area in the history of natural language processing (NLP). Understanding such complex text data is imperative, given that it is rich in information and can be used widely across various applications. In this survey, we explore different word representation models and its power of expression, from the classical to modern-day state-of-the-art word representation language models (LMS). We describe a variety of text representation methods, and model designs have blossomed in the context of NLP, including SOTA LMs. These models can transform large volumes of text into effective vector representations capturing the same semantic information. Further, such representations can be utilized by various machine learning (ML) algorithms for a variety of NLP related tasks. In the end, this survey briefly discusses the commonly used ML and DL based classifiers, evaluation metrics and the applications of these word embeddings in different NLP tasks
    corecore