816 research outputs found

    Acronyms as an integral part of multi–word term recognition - A token of appreciation

    Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain–specific concept as a latent variable. In a previous study, we described FlexiTerm, an unsupervised method for recognition of multi–word terms from a domain–specific corpus. It uses a range of methods to normalize three types of term variation – orthographic, morphological and syntactic variation. Acronyms, which represent a highly productive type of term variation, were not supported. In this study, we describe how the functionality of FlexiTerm has been extended to recognize acronyms and incorporate them into the term conflation process. The main contribution of this study is not acronym recognition per se, but rather its integration with other types of term variation into the term conflation process. We evaluated the effects of term conflation in the context of information retrieval as one of its most prominent applications. On average, relative recall increased by 32 percentage points, whereas index compression factor increased by 7 percentage points. Therefore, evidence suggests that integration of acronyms provides a non–trivial improvement of term conflation.

    A novel hybrid algorithm for morphological analysis: artificial Neural-Net-XMOR

    In this study, we present a novel algorithm that combines a rule-based approach and an artificial neural network-based approach in morphological analysis. The usage of hybrid models including both techniques is evaluated for performance improvements. The proposed hybrid algorithm is based on the idea of the dynamic generation of an artificial neural network according to two-level phonological rules. In this study, the combination of linguistic parsing, a neural network-based error correction model, and statistical filtering is utilized to increase the coverage of pure morphological analysis. We experimented with the hybrid algorithm, applying rule-based and long short-term memory-based (LSTM-based) techniques, and the results show that we improved the morphological analysis performance for optical character recognition (OCR) and social media data. Thus, for the new hybrid algorithm with LSTM, the accuracy reached 99.91% for the OCR dataset and 99.82% for social media data. © TÜBİTAK

    Computational Sociolinguistics: A Survey

    Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201