Morphonette: a morphological network of French
This paper describes in detail the first version of Morphonette, a new
French morphological resource and a new radically lexeme-based method of
morphological analysis. This research is grounded in a paradigmatic conception
of derivational morphology where the morphological structure is a structure of
the entire lexicon and not one of the individual words it contains. The
discovery of this structure relies on a measure of morphological similarity
between words, on formal analogy and on the properties of two morphological
paradigms
MORSE: Semantic-ally Drive-n MORpheme SEgment-er
We present in this paper a novel framework for morpheme segmentation which
uses the morpho-syntactic regularities preserved by word representations, in
addition to orthographic features, to segment words into morphemes. This
framework is the first to consider vocabulary-wide syntactico-semantic
information for this task. We also analyze the deficiencies of available
benchmarking datasets and introduce our own dataset that was created on the
basis of compositionality. We validate our algorithm across datasets and
present state-of-the-art results
Acquisition of morphological families and derivational series from a machine readable dictionary
The paper presents a linguistic and computational model aiming at making the
morphological structure of the lexicon emerge from the formal and semantic
regularities of the words it contains. The model is word-based. The proposed
morphological structure consists of (1) binary relations that connect each
headword with words that are morphologically related, and especially with the
members of its morphological family and its derivational series, and of (2) the
analogies that hold between the words. The model has been tested on the lexicon
of French using the TLFi machine readable dictionary. Comment: proceedings of the 6th Décembrette
Towards Semantic Validation of a Derivational Lexicon
Derivationally related lemmas like friend (N), friendly (A), and friendship (N) are derived from a common stem. Frequently, their meanings are also systematically related. However, there are also many examples of derivationally related lemma pairs whose meanings differ substantially, e.g., object (N) and objective (N). Most broad-coverage derivational lexicons do not reflect this distinction, mixing up semantically related and unrelated word pairs. In this paper, we investigate strategies to recover the above distinction by recognizing semantically related lemma pairs, a process we call semantic validation. We make two main contributions: First, we perform a detailed data analysis on the basis of a large German derivational lexicon. It reveals two promising sources of information (distributional semantics and structural information about derivational rules), but also systematic problems with these sources. Second, we develop a classification model for the task that reflects the noisy nature of the data. It achieves an improvement of 13.6% in precision and 5.8% in F1-score over a strong majority class baseline. Our experiments confirm that both information sources contribute to semantic validation, and that they are complementary enough that the best results are obtained from a combined model
Fuzzy Order-of-Magnitude Based Link Analysis for Qualitative Alias Detection
Numerical link-based similarity techniques have proven effective for identifying similar objects in the Internet and publication domains. However, for cases involving unduly high similarity measures, these methods usually generate inaccurate results. Also, they are often restricted to measuring over single properties only. This paper presents an order-of-magnitude based similarity mechanism that integrates multiple link properties to derive semantic-rich similarity descriptions. The approach extends conventional order-of-magnitude reasoning with the theory of fuzzy sets. The inherent ability of this work in computing-with-words also allows coherent interpretation and communication within a decision-making group. The proposed approach is applied to supporting the analysis of intelligence data. When evaluated over a difficult terrorism-related dataset, experimental results show that the approach helps to partly resolve the problem of false positives
Unsupervised morpheme segmentation in a non-parametric Bayesian framework
Learning morphemes from plain text is an emerging research area in natural language processing. Knowledge about the process of word formation is helpful in devising automatic segmentation of words into their constituent morphemes. This thesis applies an unsupervised morpheme induction method, based on the statistical behavior of words, to induce morphemes for word segmentation. The morpheme cache used for this purpose is based on the Dirichlet Process (DP) and stores frequency information of the induced morphemes and their occurrences in a Zipfian distribution.
This thesis uses a number of empirical, morpheme-level grammar models to classify the induced morphemes under the labels prefix, stem and suffix. These grammar models capture the different structural relationships among the morphemes. Furthermore, the morphemic categorization reduces the problem of over-segmentation. The output of the strategy demonstrates a significant improvement over the baseline system.
Finally, the thesis measures the performance of the unsupervised morphology learning system for Nepali
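The Dirichlet Process morpheme cache mentioned above can be illustrated with the standard DP (Chinese restaurant process) predictive rule, P(m) = (count(m) + alpha * P0(m)) / (N + alpha). The sketch below is only an illustration under stated assumptions: the class name MorphemeCache, the concentration value alpha, the geometric-length base distribution P0, the 26-letter alphabet and the toy transliterated morphemes are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


class MorphemeCache:
    """Frequency cache scored with the standard DP predictive rule.

    All parameters are illustrative assumptions, not values from the thesis.
    """

    def __init__(self, alpha=1.0, stop_prob=0.5, alphabet_size=26):
        self.alpha = alpha                  # DP concentration parameter
        self.stop_prob = stop_prob          # controls the geometric length prior
        self.alphabet_size = alphabet_size  # assumed Latin transliteration
        self.counts = Counter()             # morpheme -> times cached so far
        self.total = 0                      # total number of cached tokens

    def base_prob(self, morpheme):
        # Assumed base distribution P0: geometric length, uniform characters.
        n = len(morpheme)
        length_prob = self.stop_prob * (1 - self.stop_prob) ** (n - 1)
        return length_prob * (1 / self.alphabet_size) ** n

    def prob(self, morpheme):
        # Frequent morphemes are reused; unseen ones fall back on P0.
        numerator = self.counts[morpheme] + self.alpha * self.base_prob(morpheme)
        return numerator / (self.total + self.alpha)

    def add(self, morpheme):
        self.counts[morpheme] += 1
        self.total += 1


cache = MorphemeCache()
for m in ["ghar", "haru", "haru", "ma"]:  # toy morphemes, for illustration only
    cache.add(m)
print(cache.prob("haru") > cache.prob("ghar") > cache.prob("xyz"))
```

Because the probability of a morpheme grows with its cached count, this rule has the rich-get-richer behavior that produces heavy-tailed, Zipf-like frequency distributions.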
Unsupervised learning of Arabic non-concatenative morphology
Unsupervised approaches to learning the morphology of a language play an important role in computer processing of language from a practical and theoretical perspective, due to their minimal reliance on manually produced linguistic resources and human annotation. Such approaches have been widely researched for the problem of concatenative affixation, but less attention has been paid to the intercalated (non-concatenative) morphology exhibited by Arabic and other Semitic languages.
The aim of this research is to learn the root and pattern morphology of Arabic, with accuracy comparable to manually built morphological analysis systems. The approach is kept free from human supervision or manual parameter settings, assuming only that roots and patterns intertwine to form a word.
Promising results were obtained by applying a technique adapted from previous work in concatenative morphology learning, which uses machine learning to determine relatedness between words. The output, with probabilistic relatedness values between words, was then used to rank all possible roots and patterns to form a lexicon. Analysis using triliteral roots resulted in a root identification accuracy of approximately 86% for inflected words.
Although the machine learning-based approach is effective, it is conceptually complex. So an alternative, simpler and computationally efficient approach was then devised to obtain morpheme scores based on comparative counts of roots and patterns. In this approach, root and pattern scores are defined in terms of each other in a mutually recursive relationship, converging to an optimized morpheme ranking. This technique gives slightly better accuracy while being conceptually simpler and more efficient.
The approach, after further enhancements, was evaluated on a version of the Quranic Arabic Corpus, attaining a final accuracy of approximately 93%. A comparative evaluation shows this to be superior to two existing, widely used, manually built Arabic stemmers, thus demonstrating the practical feasibility of unsupervised learning of non-concatenative morphology
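The mutually recursive root and pattern scoring described in this abstract can be pictured with a small sketch. The abstract does not give the actual scoring formula, so the code below substitutes a generic HITS-style iteration as a stand-in: a root's score is the sum of the scores of the patterns it co-occurs with, and vice versa, with normalization after each step. The function names, the "-" placeholder notation, the iteration count and the toy transliterated vocabulary are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_splits(word, root_len=3):
    """Yield every (root, pattern) pair formed by picking root_len letters in order."""
    for idx in combinations(range(len(word)), root_len):
        root = "".join(word[i] for i in idx)
        # Root positions become "-" placeholders; remaining letters form the pattern.
        pattern = "".join("-" if i in idx else ch for i, ch in enumerate(word))
        yield root, pattern


def rank_morphemes(words, iterations=20):
    """HITS-style stand-in: root and pattern scores are defined in terms of each other."""
    pairs = {pair for w in words for pair in candidate_splits(w)}
    pattern_score = {p: 1.0 for _, p in pairs}
    root_score = {}
    for _ in range(iterations):
        new_root = defaultdict(float)
        for r, p in pairs:
            new_root[r] += pattern_score[p]  # root score from its patterns
        norm = sum(new_root.values())
        root_score = {r: s / norm for r, s in new_root.items()}
        new_pattern = defaultdict(float)
        for r, p in pairs:
            new_pattern[p] += root_score[r]  # pattern score from its roots
        norm = sum(new_pattern.values())
        pattern_score = {p: s / norm for p, s in new_pattern.items()}
    return root_score, pattern_score


def best_split(word, root_score, pattern_score):
    """Pick the candidate split whose root and pattern scores multiply highest."""
    return max(
        candidate_splits(word),
        key=lambda rp: root_score.get(rp[0], 0.0) * pattern_score.get(rp[1], 0.0),
    )


# Toy vocabulary: three consonantal roots crossed with three vowel patterns.
ROOTS = ["ktb", "drs", "fhm"]
VOWELS = [("a", "a"), ("i", "a"), ("u", "u")]
words = [r[0] + v1 + r[1] + v2 + r[2] for r in ROOTS for v1, v2 in VOWELS]
root_score, pattern_score = rank_morphemes(words)
print(best_split("katab", root_score, pattern_score))
```

On this toy vocabulary the consonantal skeletons and vowel templates reinforce each other, because each true root occurs with several patterns and each true pattern with several roots, while spurious splits share far fewer partners. Real data would be much noisier than this idealized grid.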