37 research outputs found

    Augmented Parsing of Unknown Word by Graph-based Semi-supervised Learning

    Knowledge-enhanced neural grammar Induction

    Natural language is usually presented as a word sequence, but the inherent structure of language is not necessarily sequential. Automatic grammar induction for natural language is a long-standing research topic in the field of computational linguistics and still remains an open problem today. From the perspective of cognitive science, the goal of a grammar induction system is to mimic children: learning a grammar that can generalize to infinitely many utterances by only consuming finite data. With regard to computational linguistics, an automatic grammar induction system could be beneficial for a wide variety of natural language processing (NLP) applications: providing syntactic analysis explicitly for a pipeline or a joint learning system; injecting structural bias implicitly into an end-to-end model. Typically, approaches to grammar induction only have access to raw text. Due to the huge search space of trees as well as data sparsity and ambiguity issues, grammar induction is a difficult problem. Thanks to the rapid development of neural networks and their capacity of over-parameterization and continuous representation learning, neural models have been recently introduced to grammar induction. Given its large capacity, introducing external knowledge into a neural system is an effective approach in practice, especially for an unsupervised problem. This thesis explores how to incorporate external knowledge into neural grammar induction models. We develop several approaches to combine different types of knowledge with neural grammar induction models on two grammar formalisms — constituency and dependency grammar. We first investigate how to inject symbolic knowledge, universal linguistic rules, into unsupervised dependency parsing. In contrast to previous state-of-the-art models that utilize time-consuming global inference, we propose a neural transition-based parser using variational inference. 
Our parser is able to employ rich features and supports inference in linear time for both training and testing. The core component in our parser is posterior regularization, where the posterior distribution of the dependency trees is constrained by the universal linguistic rules. The resulting parser outperforms previous unsupervised transition-based dependency parsers and achieves performance comparable to global inference-based models. Our parser also substantially increases parsing speed over global inference-based models. Recently, tree structures have been considered as latent variables that are learned through downstream NLP tasks, such as language modeling and natural language inference. More specifically, auxiliary syntax-aware components are embedded into the neural networks and are trained end-to-end on the downstream tasks. However, such latent tree models either struggle to produce linguistically plausible tree structures, or require an external biased parser to obtain good parsing performance. In the second part of this thesis, we focus on constituency structure and propose to use imitation learning to couple two heterogeneous latent tree models: we transfer the knowledge learned from a continuous latent tree model trained using language modeling to a discrete one, and further fine-tune the discrete model using a natural language inference objective. Through this two-stage training scheme, the discrete latent tree model achieves state-of-the-art unsupervised parsing performance. The transformer is a newly proposed neural model for NLP. Transformer-based pre-trained language models (PLMs) like BERT have achieved remarkable success on various NLP tasks by training on an enormous corpus using word prediction tasks. Recent studies show that PLMs can learn considerable syntactic knowledge in a syntax-agnostic manner. In the third part of this thesis, we leverage PLMs as a source of external knowledge. 
We propose a parameter-free approach to select syntax-sensitive self-attention heads from PLMs and perform chart-based unsupervised constituency parsing. In contrast to previous approaches, our head-selection approach only relies on raw text without any annotated development data. Experimental results on both English and eight other languages show that our approach achieves competitive performance

    Unsupervised structure induction and multimodal grounding

    Structured representations build upon symbolic abstraction (e.g., words in natural language and visual concepts in natural images), offer a principled way of encoding our perceptions about the physical world, and enable the human-like generalization of machine learning systems. The predominant paradigm for learning structured representations of the observed data has been supervised learning, but it is limited in several respects. First, supervised learning is challenging given the scarcity of labeled data. Second, conventional approaches to structured prediction have been relying on a single modality (e.g., either images or text), ignoring the learning cues that may have been specified in and can be readily obtained from other modalities of data. In this thesis, we investigate unsupervised approaches to structure induction in a multimodal setting. Unsupervised learning is inherently difficult in general, let alone inducing complex and discrete structures from data without direct supervision. By considering the multimodal setting, we leverage the alignments between different data modalities (e.g., text, audio, and images) to facilitate the learning of structure-induction models, e.g., knowing that the individual words in "a white pigeon" always appear with the same visual object, a language parser is likely to treat them as a whole (i.e., phrase). The multimodal learning setting is practically viable because multimodal alignments are generally abundant. For example, they can be found in online posts such as news and tweets that usually contain images and associated text, and in (YouTube) videos, where audio, scripts, and scenes are synchronized and grounded in each other. 
We develop structure-induction models, which are capable of exploiting bimodal image-text alignments, for two modalities: (1) for natural language, we consider unsupervised syntactic parsing with phrase-structure grammars and regularize the parser by using visual image groundings; and (2) for visual images, we induce scene graph representations by mapping arguments and predicates in the text to their visual counterparts (i.e., visual objects and relations among them) in an unsupervised manner. While useful, crossmodal alignments are not always abundantly available on the web, e.g., the alignments between non-speech audio and text. We tackle the challenge by sharing the visual modality between image-text alignment and image-audio alignment; images function as a pivot and connect audio and text. The contributions of this thesis span from model development to data collection. We demonstrated the feasibility of applying multimodal learning techniques to unsupervised structure induction and multimodal alignment collection. Our work opens up new avenues for multimodal and unsupervised structured representation learning

    Parsing with sparse annotated resources

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 67-73). This thesis focuses on algorithms for parsing within the context of sparse annotated resources. Despite recent progress in parsing techniques, existing methods require significant resources for training. Therefore, current technology is limited when it comes to parsing sentences in new languages or new grammars. We propose methods for parsing when annotated resources are limited. In the first scenario, we explore an automatic method for mapping language-specific part-of-speech (POS) tags into a universal tagset. Universal tagsets play a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Our central assumption is that a high-quality mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function. Given the exponential size of the mapping space, we propose a novel method for optimizing the objective over mappings. Our results demonstrate that automatically induced mappings rival their manually designed counterparts when evaluated in the context of multilingual parsing. In the second scenario, we consider the problem of cross-formalism transfer in parsing. We are interested in parsing constituency-based grammars such as HPSG and CCG using a small amount of data annotated in the target formalisms and a large quantity of coarse CFG annotations from the Penn Treebank. While the trees annotated in all of the target formalisms share a similar basic syntactic structure with the Penn Treebank CFG, they also encode additional constraints and semantic features. 
To handle this apparent difference, we design a probabilistic model that jointly generates CFG and target formalism parses. The model includes features of both parses, enabling transfer between the formalisms, and preserves parsing efficiency. Experimental results show that across a range of formalisms, our model benefits from the coarse annotations. By Yuan Zhang. S.M.

    Towards Bidirectional Hierarchical Representations for Attention-Based Neural Machine Translation

    This paper proposes a hierarchical attentional neural translation model which focuses on enhancing source-side hierarchical representations by covering both local and global semantic information using a bidirectional tree-based encoder. To maximize the predictive likelihood of target words, a weighted variant of an attention mechanism is used to balance the attentive information between lexical and phrase vectors. Using a tree-based rare word encoding, the proposed model is extended to sub-word level to alleviate the out-of-vocabulary (OOV) problem. Empirical results reveal that the proposed model significantly outperforms sequence-to-sequence attention-based and tree-based neural translation models in English-Chinese translation tasks. Comment: Accepted for publication at EMNLP 201

    Syntactic inductive biases for deep learning methods

    The debate between connectionism and symbolism is one of the major forces that drive the development of Artificial Intelligence. Deep learning and theoretical linguistics are the most representative fields of study for the two schools, respectively. While deep learning has made impressive breakthroughs and become the major reason behind the recent AI prosperity in industry and academia, linguistics and symbolism still hold important ground in areas including reasoning, interpretability and reliability. In this thesis, we try to build a connection between the two schools by introducing syntactic inductive biases for deep learning models. We propose two families of inductive biases, one for constituency structure and another one for dependency structure. The constituency inductive bias encourages deep learning models to use different units (or neurons) to separately process long-term and short-term information. This separation provides a way for deep learning models to build latent hierarchical representations from sequential inputs, in which a higher-level representation is composed of, and can be decomposed into, a series of lower-level representations. For example, without knowing the ground-truth structure, our proposed model learns to process logical expressions by composing representations of variables and operators into representations of expressions according to their syntactic structure. On the other hand, the dependency inductive bias encourages models to find the latent relations between entities in the input sequence. 
For natural language, the latent relations are usually modeled as a directed dependency graph, where a word has exactly one parent node and zero or several children nodes. After applying this constraint to a transformer-like model, we find the model is capable of inducing directed graphs that are close to human expert annotations, and it also outperforms the standard transformer model on different tasks. We believe that these experimental results demonstrate an interesting alternative for the future development of deep learning models

    General methods for fine-grained morphological and syntactic disambiguation

    We present methods for improved handling of morphologically rich languages (MRLS) where we define MRLS as languages that are morphologically more complex than English. Standard algorithms for language modeling, tagging and parsing have problems with the productive nature of such languages. Consider for example the possible forms of a typical English verb like work, which generally has four different forms: work, works, working and worked. Its Spanish counterpart trabajar has 6 different forms in present tense: trabajo, trabajas, trabaja, trabajamos, trabajáis and trabajan and more than 50 different forms when including the different tenses, moods (indicative, subjunctive and imperative) and participles. Such a high number of forms leads to sparsity issues: In a recent Wikipedia dump of more than 400 million tokens we find that 20 of these forms occur at most twice and that 10 forms do not occur at all. This means that even if we only need unlabeled data to estimate a model and even when looking at a relatively common and frequent verb, we do not have enough data to make reasonable estimates for some of its forms. However, if we decompose an unseen form such as trabajaréis 'you will work', we find that it is trabajar in future tense and second person plural. This allows us to make the predictions that are needed to decide on the grammaticality (language modeling) or syntax (tagging and parsing) of a sentence. In the first part of this thesis, we develop a morphological language model. A language model estimates the grammaticality and coherence of a sentence. Most language models used today are word-based n-gram models, which means that they estimate the transitional probability of a word following a history, the sequence of the (n - 1) preceding words. The probabilities are estimated from the frequencies of the history and the history followed by the target word in a huge text corpus. 
If either of the sequences is unseen, the length of the history has to be reduced. This leads to a less accurate estimate as less context is taken into account. Our morphological language model estimates an additional probability from the morphological classes of the words. These classes are built automatically by extracting morphological features from the word forms. To this end, we use unsupervised segmentation algorithms to find the suffixes of word forms. Such an algorithm might for example segment trabajaréis into trabaja and réis and we can then estimate the properties of trabajaréis from other word forms with the same or similar morphological properties. The data-driven nature of the segmentation algorithms allows them to not only find inflectional suffixes (such as -réis), but also more derivational phenomena such as the head nouns of compounds or even endings such as -tec, which identify technology oriented companies such as Vortec, Memotec and Portec and would not be regarded as a morphological suffix by traditional linguistics. Additionally, we extract shape features such as if a form contains digits or capital characters. This is important because many rare or unseen forms are proper names or numbers and often do not have meaningful suffixes. Our class-based morphological model is then interpolated with a word-based model to combine the generalization capabilities of the first and the high accuracy in case of sufficient data of the second. We evaluate our model across 21 European languages and find improvements between 3% and 11% in perplexity, a standard language modeling evaluation measure. Improvements are highest for languages with more productive and complex morphology such as Finnish and Estonian, but also visible for languages with a relatively simple morphology such as English and Dutch. We conclude that a morphological component yields consistent improvements for all the tested languages and argue that it should be part of every language model. 
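
The interpolation of a word-based model with a class-based morphological model, as described above, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the thesis's implementation: the names `morph_class` and `InterpolatedBigramLM` are hypothetical, the classes are crude (the last three characters plus two shape flags stand in for the unsupervised segmentation), the history is a single word, estimates are unsmoothed relative frequencies, and the interpolation weight `lam` is fixed by hand.

```python
from collections import Counter


def morph_class(word, suffix_len=3):
    """Hypothetical morphological class: a crude suffix plus two shape flags."""
    has_digit = any(ch.isdigit() for ch in word)
    has_upper = any(ch.isupper() for ch in word)
    return (word[-suffix_len:].lower(), has_digit, has_upper)


class InterpolatedBigramLM:
    """Word bigram model interpolated with a class-based bigram model."""

    def __init__(self, sentences, lam=0.7):
        self.lam = lam                 # weight of the word-based component
        self.bigram = Counter()        # (previous word, word) counts
        self.hist = Counter()          # previous-word counts
        self.cls_bigram = Counter()    # (previous class, class) counts
        self.cls_hist = Counter()      # previous-class counts
        self.word_count = Counter()    # word counts (excluding <s>)
        self.cls_count = Counter()     # class counts (excluding <s>)
        for sent in sentences:
            tokens = ["<s>"] + list(sent)
            for prev, cur in zip(tokens, tokens[1:]):
                cp, cc = morph_class(prev), morph_class(cur)
                self.bigram[(prev, cur)] += 1
                self.hist[prev] += 1
                self.cls_bigram[(cp, cc)] += 1
                self.cls_hist[cp] += 1
                self.word_count[cur] += 1
                self.cls_count[cc] += 1

    def p_word(self, prev, cur):
        """Relative-frequency estimate P(cur | prev); 0 for unseen histories."""
        h = self.hist[prev]
        return self.bigram[(prev, cur)] / h if h else 0.0

    def p_class(self, prev, cur):
        """P(class(cur) | class(prev)) * P(cur | class(cur))."""
        cp, cc = morph_class(prev), morph_class(cur)
        if not self.cls_hist[cp] or not self.cls_count[cc]:
            return 0.0
        transition = self.cls_bigram[(cp, cc)] / self.cls_hist[cp]
        emission = self.word_count[cur] / self.cls_count[cc]
        return transition * emission

    def prob(self, prev, cur):
        """Linear interpolation of the word-based and class-based estimates."""
        return (self.lam * self.p_word(prev, cur)
                + (1 - self.lam) * self.p_class(prev, cur))
```

With a toy corpus in which "cantamos aqui" never occurs but "trabajamos aqui" does, the word-based estimate for the unseen bigram is zero while the interpolated estimate is not, because both verbs fall into the same suffix class. A real system would also back off to shorter histories and tune the interpolation weight on held-out data.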
Dependency trees represent the syntactic structure of a sentence by attaching each word to its syntactic head, the word it is directly modifying. Dependency parsing is usually tackled using heavily lexicalized (word-based) models and a thorough morphological preprocessing is important for optimal performance, especially for MRLS. We investigate if the lack of morphological features can be compensated by features induced using hidden Markov models with latent annotations (HMM-LAs) and find this to be the case for German. HMM-LAs were proposed as a method to increase part-of-speech tagging accuracy. The model splits the observed part-of-speech tags (such as verb and noun) into subtags. An expectation maximization algorithm is then used to fit the subtags to different roles. A verb tag for example might be split into an auxiliary verb and a full verb subtag. Such a split is usually beneficial because these two verb classes have different contexts. That is, a full verb might follow an auxiliary verb, but usually not another full verb. For German and English, we find that our model leads to consistent improvements over a parser not using subtag features. Looking at the labeled attachment score (LAS), the number of words correctly attached to their head, we observe an improvement from 90.34 to 90.75 for English and from 87.92 to 88.24 for German. For German, we additionally find that our model achieves almost the same performance (88.24) as a model using tags annotated by a supervised morphological tagger (LAS of 88.35). We also find that the German latent tags correlate with morphology. Articles for example are split by their grammatical case. We also investigate the part-of-speech tagging accuracies of models using the traditional treebank tagset and models using induced tagsets of the same size and find that the latter outperform the former, but are in turn outperformed by a discriminative tagger. 
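
As a small illustration of the attachment scores quoted above, the sketch below computes unlabeled and labeled attachment scores as percentages. It assumes the common convention that a token counts towards LAS only if both its head and its dependency label are correct; the function name `attachment_scores` and the `(head_index, label)` input format are hypothetical choices for this example.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    gold and predicted are equal-length lists with one (head_index, label)
    pair per token; head_index 0 denotes the artificial root.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: the head alone must match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head and label must both match.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n
```

For a four-token sentence in which one token receives the wrong label and another the wrong head, this yields a UAS of 75.0 and a LAS of 50.0.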
Furthermore, we present a method for fast and accurate morphological tagging. While part-of-speech tagging annotates tokens in context with their respective word categories, morphological tagging produces a complete annotation containing all the relevant inflectional features such as case, gender and tense. A complete reading is represented as a single tag. As a reading might consist of several morphological features the resulting tagset usually contains hundreds or even thousands of tags. This is an issue for many decoding algorithms such as Viterbi which have runtimes depending quadratically on the number of tags. In the case of morphological tagging, the problem can be avoided by using a morphological analyzer. A morphological analyzer is a manually created finite-state transducer that produces the possible morphological readings of a word form. This analyzer can be used to prune the tagging lattice and to allow for the application of standard sequence labeling algorithms. The downside of this approach is that such an analyzer is not available for every language or might not have the coverage required for the task. Additionally, the output tags of some analyzers are not compatible with the annotations of the treebanks, which might require some manual mapping of the different annotations or even to reduce the complexity of the annotation. To avoid this problem we propose to use the posterior probabilities of a conditional random field (CRF) lattice to prune the space of possible taggings. At the zero-order level the posterior probabilities of a token can be calculated independently from the other tokens of a sentence. The necessary computations can thus be performed in linear time. The features available to the model at this time are similar to the features used by a morphological analyzer (essentially the word form and features based on it), but also include the immediate lexical context. 
As the ambiguity of word types varies substantially, we just fix the average number of readings after pruning by dynamically estimating a probability threshold. Once we obtain the pruned lattice, we can add tag transitions and convert it into a first-order lattice. The quadratic forward-backward computations are now executed on the remaining plausible readings and thus efficient. We can now continue pruning and extending the lattice order at a relatively low additional runtime cost (depending on the pruning thresholds). The training of the model can be implemented efficiently by applying stochastic gradient descent (SGD). The CRF gradient can be calculated from a lattice of any order as long as the correct reading is still in the lattice. During training, we thus run the lattice pruning until we either reach the maximal order or until the correct reading is pruned. If the reading is pruned we perform the gradient update with the highest order lattice still containing the reading. This approach is similar to early updating in the structured perceptron literature and forces the model to learn how to keep the correct readings in the lower order lattices. In practice, we observe a high number of lower updates during the first training epoch and almost exclusively higher order updates during later epochs. We evaluate our CRF tagger on six languages with different morphological properties. We find that for languages with a high word form ambiguity such as German, the pruning results in a moderate drop in tagging accuracy while for languages with less ambiguity such as Spanish and Hungarian the loss due to pruning is negligible. However, our pruning strategy allows us to train higher order models (order > 1), which give substantial improvements for all languages and also outperform unpruned first-order models. That is, the model might lose some of the correct readings during pruning, but is also able to solve more of the harder cases that require more context. 
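
The zero-order pruning step described above can be sketched as follows. This is only one plausible reading of "dynamically estimating a probability threshold": the sketch pools all zero-order posteriors of a sentence, picks the threshold at the rank that yields the requested average number of readings per token, and always keeps each token's most probable reading. The function name `prune_lattice`, the dictionary input format and the per-sentence scope are assumptions, not details taken from the thesis.

```python
def prune_lattice(posteriors, avg_readings=2.0):
    """Prune a zero-order lattice to roughly avg_readings tags per token.

    posteriors: one non-empty dict per token mapping tag -> posterior
    probability. Returns (pruned, threshold), where pruned is a list of
    tag sets, one per token.
    """
    n_tokens = len(posteriors)
    if n_tokens == 0:
        return [], 1.0
    # Total number of readings we may keep; at least one per token.
    budget = max(n_tokens, int(round(avg_readings * n_tokens)))
    all_probs = sorted(
        (p for dist in posteriors for p in dist.values()), reverse=True
    )
    # The threshold is the probability of the last reading inside the budget.
    threshold = all_probs[min(budget, len(all_probs)) - 1]
    pruned = []
    for dist in posteriors:
        kept = {tag for tag, p in dist.items() if p >= threshold}
        kept.add(max(dist, key=dist.get))  # never drop the best reading
        pruned.append(kept)
    return pruned, threshold
```

Ambiguous tokens keep more readings than unambiguous ones, which is the point of fixing the average rather than a per-token maximum. Ties at the threshold can push the average slightly above the target. Tag transitions would then be added only between the surviving readings to build the first-order lattice.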
We also find our model to substantially and significantly outperform a number of frequently used taggers such as Morfette and SVMTool. Based on our morphological tagger we develop a simple method to increase the performance of a state-of-the-art constituency parser. A constituency tree describes the syntactic properties of a sentence by assigning spans of text to a hierarchical bracket structure. A language-independent approach for the automatic annotation of accurate and compact grammars has been developed, and its implementation, known as the Berkeley parser, gives state-of-the-art results for many languages such as English and German. For some MRLS such as Basque and Korean, however, the parser gives unsatisfactory results because of its simple unknown word model. This model maps unknown words to a small number of signatures (similar to our morphological classes). These signatures do not seem expressive enough for many of the subtle distinctions made during parsing. We propose to replace rare words by the morphological reading generated by our tagger instead. The motivation is twofold. First, our tagger has access to a number of lexical and sublexical features not available during parsing. Second, we expect the morphological readings to contain most of the information required to make the correct parsing decision even though we know that things such as the correct attachment of prepositional phrases might require some notion of lexical semantics. In experiments on the SPMRL 2013 dataset of nine MRLS we find our method to give improvements for all languages except French, for which we observe a minor drop in the Parseval score of 0.06. For Hebrew, Hungarian and Basque we find substantial absolute improvements of 5.65, 11.87 and 15.16, respectively. We also performed an extensive evaluation on the utility of word representations for morphological tagging. 
Our goal was to reduce the drop in performance that is caused when a model trained on a specific domain is applied to some other domain. This problem is usually addressed by domain adaption (DA). DA adapts a model towards a specific domain using a small amount of labeled or a huge amount of unlabeled data from that domain. However, this procedure requires us to train a model for every target domain. Instead we are trying to build a robust system that is trained on domain-specific labeled and domain-independent or general unlabeled data. We believe word representations to be key in the development of such models because they allow us to leverage unlabeled data efficiently. We compare data-driven representations to manually created morphological analyzers. We understand data-driven representations as models that cluster word forms or map them to a vectorial representation. Examples heavily used in the literature include Brown clusters, Singular Value Decompositions of count vectors and neural-network-based embeddings. We create a test suite of six languages consisting of in-domain and out-of-domain test sets. To this end we converted annotations for Spanish and Czech and annotated the German part of the Smultron treebank with a morphological layer. In our experiments on these data sets we find Brown clusters to outperform the other data-driven representations. Regarding the comparison with morphological analyzers, we find Brown clusters to give slightly better performance in part-of-speech tagging, but to be substantially outperformed in morphological tagging
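
The rare-word replacement proposed earlier in this abstract for the constituency parser can be sketched as a preprocessing step. The sketch assumes that each token already carries a morphological reading predicted by a tagger, and that "rare" means a frequency below a fixed cut-off in the given data; the function name `replace_rare_words`, the `min_count` parameter and the tag strings in the usage note are hypothetical.

```python
from collections import Counter


def replace_rare_words(sentences, readings, min_count=2):
    """Replace tokens seen fewer than min_count times by their reading.

    sentences: list of token lists.
    readings: parallel list of tag lists, one predicted reading per token.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    replaced = []
    for sent, tags in zip(sentences, readings):
        replaced.append(
            [tok if counts[tok] >= min_count else tag
             for tok, tag in zip(sent, tags)]
        )
    return replaced
```

For two sentences "the dog barks" and "the dog sleeps" with readings such as `V.3sg` for the verbs, both verbs occur only once and are replaced by their reading, while "the" and "dog" are kept. In a real setup the counts would come from the parser's training data, and the same mapping would be applied at test time.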

    Learning and Analysis of Neural Sentence Representations Using Syntax (구문론을 활용한 신경망 기반 문장 표현의 학습 및 분석)

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2021. Taeuk Kim (김태욱). Syntax is a theory in linguistics that deals with the principles underlying the composition of sentences. As this theoretical framework provides formal instructions regarding the procedure of constructing a sentence with its constituents, it has been considered a valuable reference in sentence representation learning, whose objective is to discover an approach for transforming a sentence into a vector that captures its meaning in a computationally tractable manner. This dissertation provides two particular perspectives on harmonizing syntax with neural sentence representation models, especially focusing on constituency grammar. We first propose two methods for enriching the quality of sentence embeddings by exploiting the syntactic knowledge either represented as explicit parse trees or implicitly stored in neural models. Second, we regard syntactic formalism as a lens through which we reveal the inner workings of pre-trained language models which are state-of-the-art in sentence representation learning. 
With a series of demonstrations in practical scenarios, we show that syntax is useful even in the neural era where the models trained with huge corpora in an end-to-end manner are prevalent, functioning as either (i) a source of inductive biases that facilitate fast and effective learning of such models or (ii) an analytic tool that increases the interpretability of the black-box models.
    Contents
    Chapter 1 Introduction: 1.1 Dissertation Outline; 1.2 Related Publications
    Chapter 2 Background: 2.1 Introduction to Syntax; 2.2 Neural Networks for Sentence Representations (2.2.1 Recursive Neural Network; 2.2.2 Transformer; 2.2.3 Pre-trained Language Models); 2.3 Related Literature (2.3.1 Sentence Representation Learning; 2.3.2 Probing Methods for Neural NLP Models; 2.3.3 Grammar Induction and Unsupervised Parsing)
    Chapter 3 Sentence Representation Learning with Explicit Syntactic Structure: 3.1 Introduction; 3.2 Related Work; 3.3 Method (3.3.1 Tree-LSTM; 3.3.2 Structure-aware Tag Representation; 3.3.3 Leaf-LSTM; 3.3.4 SATA Tree-LSTM); 3.4 Experiments (3.4.1 General Configurations; 3.4.2 Sentence Classification Tasks; 3.4.3 Natural Language Inference); 3.5 Analysis (3.5.1 Ablation Study; 3.5.2 Representation Visualization); 3.6 Limitations and Future Work; 3.7 Summary
    Chapter 4 Sentence Representation Learning with Implicit Syntactic Knowledge: 4.1 Introduction; 4.2 Related Work; 4.3 Method (4.3.1 Contrastive Learning with Self-Guidance; 4.3.2 Learning Objective Optimization); 4.4 Experiments (4.4.1 General Configurations; 4.4.2 Semantic Textual Similarity Tasks; 4.4.3 Multilingual STS Tasks; 4.4.4 SentEval Benchmark); 4.5 Analysis (4.5.1 Ablation Study; 4.5.2 Robustness to Domain Shifts; 4.5.3 Computational Efficiency; 4.5.4 Representation Visualization); 4.6 Limitations and Future Work; 4.7 Summary
    Chapter 5 Syntactic Analysis of Sentence Representation Models: 5.1 Introduction; 5.2 Related Work; 5.3 Motivation; 5.4 Method (5.4.1 CPE-PLM; 5.4.2 Top-down CPE-PLM; 5.4.3 Pre-trained Language Models; 5.4.4 Distance Measure Functions; 5.4.5 Injecting Bias into Syntactic Distances); 5.5 Experiments (5.5.1 General Configurations; 5.5.2 Experimental Results on PTB; 5.5.3 Experimental Results on MNLI); 5.6 Analysis (5.6.1 Performance Comparison by Layer; 5.6.2 Estimating the Upper Limit of Distance Measure Functions; 5.6.3 Constituency Tree Examples); 5.7 Summary
    Chapter 6 Multilingual Syntactic Analysis with Enhanced Techniques: 6.1 Introduction; 6.2 Related Work; 6.3 Method (6.3.1 Chart-based CPE-PLM; 6.3.2 Top-K Ensemble for CPE-PLM); 6.4 Experiments (6.4.1 General Configurations; 6.4.2 Experiments on Monolingual Settings; 6.4.3 Experiments on Multilingual Settings); 6.5 Analysis (6.5.1 Factor Correlation Analysis; 6.5.2 Visualization of Attention Heads; 6.5.3 Recall Scores on Noun and Verb Phrases); 6.6 Limitations and Future Work; 6.7 Summary
    Chapter 7 Conclusion
    Bibliography
    Abstract (in Korean)