
    Classification of Under-Resourced Language Documents Using English Ontology

    Automatic document classification, which aims to assign a document to a predefined category based on its contents, is an important task given the rapid growth in the number of electronic documents. It plays an important role in information extraction, summarization, text retrieval, question answering, e-mail spam detection, web page content filtering, automatic message routing, etc. Most existing methods and techniques in the field of document classification are keyword based, but because they do not consider semantics, they achieve low performance. Documents can instead be classified by their semantics, using an ontology as a knowledge base for classification; however, building an ontology for an under-resourced language is very challenging, so this approach has been limited to well-resourced languages (i.e. English), and documents written in under-resourced languages have not benefited from such ontology-based classification. This paper describes the design of an automatic document classifier for documents written in under-resourced languages. We propose an approach that classifies such documents on top of an English ontology, using a bilingual dictionary with a part-of-speech feature for word-by-word translation so that documents can be classified without any language barrier. The design has a concept-mapping component, which uses lexical and semantic features to map the translated senses onto ontology concepts, and a categorization component, which determines the category of a given document based on the weights of the mapped concepts. To evaluate the performance of the proposed approach, 20 test documents for Amharic and Tigrinya and 15 test documents for Afaan Oromo were used in each news category. To observe the effect of the incorporated features (i.e. lemma-based index-term selection, pre-processing strategies during concept mapping, and lexical- and semantics-based concept mapping), five experimental techniques were conducted. The experimental results indicate that the proposed approach, with all features and components incorporated, achieves an average F-measure of 92.37%, 86.07% and 88.12% for Amharic, Afaan Oromo and Tigrinya documents, respectively.
    Keywords: under-resourced language, multilingual, document or text classification, knowledge base, ontology-based text categorization, multilingual text classification, ontology.
    DOI: 10.7176/CEIS/10-6-02 Publication date: July 31st 201
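
    As a rough illustration of the categorization idea described above (word-by-word dictionary translation, concept mapping, and weight-based category selection), here is a minimal Python sketch. The dictionary, ontology, weights, and words are invented placeholders, not the paper's actual resources:

        # Minimal sketch: translate word by word via a bilingual dictionary,
        # map translated senses onto ontology concepts, and pick the category
        # with the highest accumulated concept weight. All resources below
        # are illustrative assumptions.
        from collections import defaultdict

        # Toy bilingual dictionary: source-language word -> English sense.
        BILINGUAL_DICT = {"zena": "news", "ispoort": "sport", "kuas": "ball"}

        # Toy ontology: English concept -> (category, weight).
        ONTOLOGY = {
            "news": ("politics", 1.0),
            "sport": ("sport", 2.0),
            "ball": ("sport", 1.0),
        }

        def classify(tokens):
            """Return the category whose mapped concepts carry the most weight."""
            scores = defaultdict(float)
            for tok in tokens:
                sense = BILINGUAL_DICT.get(tok)      # word-by-word translation
                if sense in ONTOLOGY:                # lexical concept mapping
                    category, weight = ONTOLOGY[sense]
                    scores[category] += weight       # weight-based categorization
            return max(scores, key=scores.get) if scores else None

        print(classify(["zena", "ispoort", "kuas"]))  # -> 'sport'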

    Sentence Level N-Gram Context Feature in Real-Word Spelling Error Detection and Correction: Unsupervised Corpus Based Approach

    Spell checking is the process of finding misspelled words and possibly correcting them. Most modern commercial spell checkers use a straightforward approach to finding misspellings: a word is considered erroneous when it is not found in the dictionary. However, this approach cannot check the correctness of a word in its context, and an error it misses is called a real-word spelling error. To handle such errors, the state of the art uses context features of a fixed n-gram size (i.e. trigrams), and this limited feature set reduces the effectiveness of the model. In this paper, we address this issue by adopting sentence-level n-gram features for real-word spelling error detection and correction. In this technique, all possible word n-grams in a sentence are used to teach the proposed model the properties of the target language, which enhances its effectiveness. The only corpus required to train the proposed model is an unsupervised corpus (raw text), which makes the model flexible enough to be adapted to any natural language; for demonstration purposes we adopt the under-resourced languages Amharic, Afaan Oromo and Tigrigna. The model has been evaluated in terms of recall, precision and F-measure, and compared with the literature (i.e. the fixed n-gram context feature) to assess whether the proposed technique performs as well. The experimental results indicate that the proposed model with the sentence-level n-gram context feature achieves better results: for real-word error detection and correction it achieves an average F-measure of 90.03%, 85.95% and 84.24% for Amharic, Afaan Oromo and Tigrigna, respectively.
    Keywords: sentence-level n-gram, real-word spelling error, spell checker, unsupervised corpus-based spell checker
    DOI: 10.7176/JIEA/10-4-02 Publication date: September 30th 202
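
    The core idea, scoring a word by every sentence n-gram that covers it rather than by a fixed trigram window, can be sketched as follows. The toy corpus and sentence are invented for illustration and are not the paper's data:

        # Minimal sketch of sentence-level n-gram context scoring: every
        # n-gram of every order (2 .. sentence length) that covers a word
        # contributes corpus evidence for it. Low-support words would be
        # flagged as candidate real-word errors against some threshold.
        from collections import Counter

        def ngrams(tokens, n):
            return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

        def all_ngrams(tokens):
            """All word n-grams of every order from 2 up to sentence length."""
            return [g for n in range(2, len(tokens) + 1) for g in ngrams(tokens, n)]

        # "Training": count n-grams observed in a raw (unsupervised) corpus.
        corpus = [["the", "cat", "sat", "on", "the", "mat"],
                  ["the", "dog", "sat", "on", "the", "rug"]]
        counts = Counter(g for sent in corpus for g in all_ngrams(sent))

        def word_score(tokens, i):
            """Context support for tokens[i]: summed corpus frequency of
            every sentence n-gram that covers position i."""
            return sum(counts[g]
                       for n in range(2, len(tokens) + 1)
                       for j, g in enumerate(ngrams(tokens, n))
                       if j <= i < j + n)

        sentence = ["the", "cat", "sat", "on", "the", "rug"]
        print([(w, word_score(sentence, i)) for i, w in enumerate(sentence)])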

    Data-driven Language Typology

    In this thesis we use statistical n-gram language models and the perplexity measure for language typology tasks. We interpret the perplexity of a language model as a distance measure when the model is applied to a phonetic transcript of a language it was not originally trained on. We use these distance measures for detecting language families, detecting closely related languages, and reproducing language family trees. We also study the sample sizes required to train the language models and estimate how large a corpus is needed for these methods to be used successfully. We find that trigram language models trained on automatically transcribed phonetic transcripts, together with the perplexity measure, can be used both for detecting language families and for detecting closely related languages.
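
    A minimal sketch of the perplexity-as-distance idea: train a trigram model on one phonetic transcript and evaluate its perplexity on another, reading lower perplexity as a smaller distance. The transcripts and the add-one smoothing are assumptions for illustration, not the thesis's exact setup:

        # Train a phone-level trigram model on transcript A and measure its
        # perplexity on transcript B; lower perplexity = smaller distance A->B.
        import math
        from collections import Counter

        def train_trigram(symbols):
            tri = Counter(zip(symbols, symbols[1:], symbols[2:]))
            bi = Counter(zip(symbols, symbols[1:]))
            vocab_size = len(set(symbols))
            def prob(a, b, c):  # add-one smoothed P(c | a, b)
                return (tri[(a, b, c)] + 1) / (bi[(a, b)] + vocab_size)
            return prob

        def perplexity(prob, symbols):
            logp = sum(math.log2(prob(a, b, c))
                       for a, b, c in zip(symbols, symbols[1:], symbols[2:]))
            return 2 ** (-logp / max(len(symbols) - 2, 1))

        # Toy phonetic transcripts standing in for two languages.
        text_a = list("ananasbananasanas")
        text_b = list("ananasbananananas")
        model_a = train_trigram(text_a)
        print(perplexity(model_a, text_b))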

    Assessment of Dyslexia in the Urdu Language


    Analyzing user reviews of messaging Apps for competitive analysis

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
    The rise of various messaging apps has resulted in intense competition, and the era of Web 2.0 enables business managers to gain competitive intelligence from user-generated content (UGC). Text-mining UGC for competitive intelligence has been drawing great interest from researchers. However, relevant studies mostly focus on domains such as hospitality and consumer products, and few have applied such techniques to competitive analysis of messaging apps. Here, we conducted a competitive analysis based on topic modeling and sentiment analysis by text-mining 27,479 user reviews of four iOS messaging apps, namely Messenger, WhatsApp, Signal and Telegram. The results show that the performance of topic modeling and sentiment analysis is encouraging, and that a combination of the extracted app aspect-based topics and the adjusted sentiment scores can effectively reveal meaningful competitive insights into user concerns, competitive strengths and weaknesses, and changes in user sentiment over time. We anticipate that this study will not only advance the existing literature on competitive analysis using text-mining techniques for messaging apps but also help existing players and new entrants in the market sharpen their competitive edge by better understanding their users' needs and industry trends.
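
    A minimal sketch of this kind of pipeline, assuming scikit-learn's LDA for aspect-like topics and a toy lexicon in place of a full sentiment analyzer; the reviews, lexicon, and topic count are invented placeholders, not the study's data or models:

        # LDA topic modeling over app reviews plus a simple lexicon-based
        # sentiment score per review. Illustrative only; the dissertation's
        # exact models and preprocessing may differ.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        reviews = [
            "love the stickers but calls keep dropping",
            "great privacy, encryption works well",
            "ads everywhere, the app feels slow",
            "group chats are easy, notifications are broken",
        ]

        # Aspect-like topics via LDA on a bag-of-words matrix.
        vec = CountVectorizer(stop_words="english")
        X = vec.fit_transform(reviews)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
        terms = vec.get_feature_names_out()
        for k, comp in enumerate(lda.components_):
            top = [terms[i] for i in comp.argsort()[-3:][::-1]]
            print(f"topic {k}: {top}")

        # Toy sentiment lexicon standing in for a real analyzer (e.g. VADER).
        LEXICON = {"love": 1, "great": 1, "easy": 1, "well": 1,
                   "dropping": -1, "slow": -1, "broken": -1, "ads": -1}
        for r in reviews:
            score = sum(LEXICON.get(w, 0) for w in r.split())
            print(f"{score:+d}  {r}")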

    Master of Arts

    Thesis.
    High-vowel lenition is attested in various forms in a number of languages, including Shoshoni, Lezgian, East Cree, Andean Spanish, and Japanese, among many others. It is also attested in the development of the various Romance languages from Proto-Romance. High-vowel deletion and devoicing are both attested in Quebec French, with some authors reporting devoicing but no deletion, and others reporting frequent deletion and devoicing. Research indicates that both the surrounding consonantal context and sociolinguistic factors contribute to (non)lenition of Quebec French high vowels, with some authors treating deletion and devoicing as separate phenomena and others treating them as different manifestations of the same phenomenon. Few studies have investigated high-vowel lenition in other varieties of French. This study investigates deletion and devoicing of the high-vowel phonemes /i/, /y/, and /u/ in the French spoken in Quebec and Paris, and identifies which phonetic and social factors, including left and right context, vowel phoneme, provenance, gender, and style, best predict these phenomena. It also addresses whether high-vowel deletion and devoicing are different manifestations of a single phenomenon or two separate phenomena in these varieties of French. Data are from recordings of native French speakers from the Phonologie du Français Contemporain (PFC) corpus project. Each speaker participated in two different interviews representing two levels of style. For each speaker, each interview type, and each high-vowel phoneme, twenty interconsonantal tokens were transcribed and coded as deleted or present, and as voiced or devoiced, along with the surrounding consonantal context. The tokens were then subjected to statistical analysis. Contrary to most expectations, there are no statistical differences between the rates of deletion and devoicing in Quebec and Paris, and neither phenomenon is unique to Quebec French. The best predictors of deletion were the place and manner of articulation of the surrounding consonants, while the best predictor of devoicing was voiceless surrounding consonants. These results indicate that deletion and devoicing are separate processes. Although not significant at the aggregate level, sociolinguistic factors were significant predictors in more specific models. Deletion and devoicing of French high vowels are both more complex and more widespread than previous studies have suggested.
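
    As a rough sketch of the kind of statistical analysis described, the following fits a logistic regression predicting deletion from coded contextual and social factors; the data rows and factor codings are fabricated placeholders, not the PFC data:

        # Logistic regression predicting deletion (1) vs. retention (0) of a
        # high vowel from one-hot-encoded context and social factors.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.linear_model import LogisticRegression

        tokens = [
            {"left_manner": "fricative", "right_voice": "voiceless",
             "vowel": "i", "city": "Quebec"},
            {"left_manner": "stop", "right_voice": "voiced",
             "vowel": "u", "city": "Paris"},
            {"left_manner": "fricative", "right_voice": "voiceless",
             "vowel": "y", "city": "Quebec"},
            {"left_manner": "nasal", "right_voice": "voiced",
             "vowel": "i", "city": "Paris"},
        ]
        deleted = [1, 0, 1, 0]  # outcome coding: vowel deleted or not

        vec = DictVectorizer(sparse=False)   # one-hot encode the factors
        X = vec.fit_transform(tokens)
        model = LogisticRegression().fit(X, deleted)

        # Coefficients hint at which factor levels favor deletion.
        for name, coef in zip(vec.get_feature_names_out(), model.coef_[0]):
            print(f"{coef:+.2f}  {name}")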

    First International Workshop on Lexical Resources

    Lexical resources are one of the main sources of linguistic information for research and applications in Natural Language Processing and related fields. In recent years, advances have been achieved in both the symbolic aspects of lexical resource development (lexical formalisms, rule-based tools) and statistical techniques for the acquisition and enrichment of lexical resources, both monolingual and multilingual. The latter have allowed for faster development of large-scale morphological, syntactic and/or semantic resources, for widely used as well as resource-scarce languages. Moreover, the notion of a dynamic lexicon is increasingly used to take into account the fact that the lexicon undergoes permanent evolution. This workshop aims at sketching a broad picture of the state of the art in the domain of lexical resource modeling and development. It is also dedicated to research on the application of lexical resources to improving corpus-based studies and language processing tools, both in NLP and in other language-related fields, such as linguistics, translation studies, and didactics.