2,300 research outputs found

    Natural language processing

    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems (text summarization, information extraction, information retrieval, etc.), including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems.

    Opinion Mining on Non-English Short Text

    As the type and the number of such venues increase, automated analysis of sentiment on textual resources has become an essential data mining task. In this paper, we investigate the problem of mining opinions on a collection of informal short texts. Both the positive and the negative sentiment strength of texts are detected. We focus on a non-English language that has few resources for text mining. This approach would help enhance sentiment analysis in languages where a list of opinionated words does not exist. We propose a new method that projects the text into dense, low-dimensional feature vectors according to the sentiment strength of the words. We detect the mixture of positive and negative sentiments on a multi-variant scale. Empirical evaluation of the proposed framework on Turkish tweets shows that our approach gets good results for opinion mining.

    Impact of Tokenization on Language Models: An Analysis for Turkish

    Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e. their outputs vary from the smallest pieces of characters to the surface form of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer achieves performance that challenges the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance. Comment: submitted to ACM TALLI

    Application of K-NN and FPTC based text categorization algorithms to Turkish news reports

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2001. Thesis (Master's) -- Bilkent University, 2001. Includes bibliographical references (leaves 64-68). New technological developments, such as easy access to the Internet, optical character readers, high-speed networks and inexpensive massive storage facilities, have resulted in a dramatic increase in the availability of on-line text: newspaper articles, incoming (electronic) mail, technical reports, etc. The enormous growth of on-line information has led to a comparable growth in the need for methods that help users organize such information. Text Categorization may be the remedy for this increased need for advanced techniques. Text Categorization is the classification of units of natural language texts with respect to a set of pre-existing categories. Categorization of documents is challenging, as the number of discriminating words can be very large. This thesis presents the compilation of a Turkish dataset, called Anadolu Agency Newsgroup, for the study of Text Categorization. Turkish is an agglutinative language in which words contain no direct indication of where the morpheme boundaries are; furthermore, morphemes take a shape dependent on the morphological and phonological context. In Turkish, the process of adding one suffix to another can result in a relatively long word, and a single Turkish word can give rise to a very large number of variants. Due to this complex morphological structure, Turkish requires text processing techniques different from those used for English and similar languages. Therefore, besides converting all words to lower case and removing punctuation marks, some preliminary work is required, such as stemming, removal of stopwords and formation of a keyword list. This thesis also presents the evaluation and comparison of the well-known k-NN classification algorithm and a variant of k-NN called the Feature Projection Text Categorization (FPTC) algorithm. The k-NN classifier is an instance-based learning method. It computes the similarity between the test instance and each training instance and, considering the k top-ranking nearest instances to predict the categories of the input, finds the category that is most similar. The FPTC algorithm is based on the idea of representing training instances as their projections on each feature dimension. If the value of a training instance is missing for a feature, that instance is not stored on that feature. Experiments show that the FPTC algorithm achieves accuracy comparable to that of the k-NN algorithm; furthermore, FPTC significantly outperforms k-NN in time efficiency. İlhan, Ufuk. M.S.
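    The k-NN step described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the bag-of-words representation, cosine similarity, similarity-weighted voting, the function names `cosine` and `knn_classify`, and the toy English training set are all assumptions made for illustration, and the stemming, stopword removal and keyword-list steps mentioned above are omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, training, k=3):
    """Predict a label for `text` from (document, label) training pairs.

    Assumed scheme: lowercase whitespace tokens, cosine similarity,
    and a vote among the k nearest documents weighted by similarity.
    """
    query = Counter(text.lower().split())
    scored = [
        (cosine(query, Counter(doc.lower().split())), label)
        for doc, label in training
    ]
    # Keep the k training documents most similar to the query.
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter()
    for similarity, label in nearest:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical toy data, not the Anadolu Agency dataset.
TRAINING = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the budget law", "politics"),
    ("the minister announced a new law", "politics"),
]

if __name__ == "__main__":
    print(knn_classify("the striker scored in the match", TRAINING, k=3))
```

    Every query is compared against every stored training document, which is the cost that the abstract reports FPTC reducing.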

    Corpora for sentiment analysis of Arabic text in social media

    Different Natural Language Processing (NLP) applications such as text categorization, machine translation, etc., need annotated corpora to check quality and performance. Similarly, sentiment analysis requires annotated corpora to test the performance of classifiers. Manual annotation performed by native speakers is used as a benchmark test to measure how accurate a classifier is. In this paper we summarise currently available Arabic corpora and describe work in progress to build, annotate, and use Arabic corpora consisting of Facebook (FB) posts. The distinctive nature of these corpora is that they are based on posts written in Dialectal Arabic (DA) that do not follow specific grammatical or spelling standards. The corpora are annotated with five labels (positive, negative, dual, neutral, and spam). In addition to building the corpus, the paper illustrates how manual tagging can be used to extract opinionated words and phrases to be used in a lexicon-based classifier.

    English/Arabic/English Machine Translation: A Historical Perspective

    This paper examines the history and development of Machine Translation (MT) applications for the Arabic language in the context of the history of machine translation in general. It starts with a discussion of the beginnings of MT in the US and then, drawing on the work of MT historians, surveys the decline of work on MT and the drying up of funding; then the revival with globalization, the development of information technology and the rising need to break the language barriers in the world; and finally the dramatic developments that came with the advances in computer technology. The paper also examines some of the major approaches to MT within a historical perspective. The case of Arabic is treated along the same lines, focusing on the work that was done on Arabic by Western research institutes and Western profit-motivated companies. Special attention is given to the work of the one Arab company, Sakr of Al-Alamiyya Group, which was established in 1982 and has since then worked seriously on developing software applications for Arabic under the umbrella of natural language processing for the Arabic language. Major available software applications for Arabic/English/Arabic MT, as well as MT-related software, are surveyed within a historical framework.

    Named Entity Recognition in Turkish with Bayesian Learning and Hybrid Approaches

    Named entity recognition is one of the significant textual information extraction tasks. In this paper, we present two approaches for named entity recognition on Turkish texts. The first is a Bayesian learning approach which is trained on a considerably limited training set. The second approach comprises two hybrid systems based on joint utilization of this Bayesian learning approach and a previously proposed rule-based named entity recognizer. All three proposed systems achieve promising performance rates. This paper is significant as it reports the first use of the Bayesian approach for the task of named entity recognition on Turkish texts, for which practical approaches in particular are still insufficient.

    A comparative study of the phenomenon of false friends in SMG and CSG

    The present dissertation examines the phenomenon of false friends, presenting the evolution of research on the topic, the theoretical background (definition, linguistic levels involved, total vs. partial, reasons of emergence, the mechanism of borrowing and its importance, etc.) and the practical side of the problems their existence causes in communication. The aim of the dissertation is not only to present false friends as a linguistic occurrence between languages (interlinguistic false friends) in its many facets, but mainly to delimit the existence of false friends at the intralinguistic level, namely between varieties of the same language. The thesis presents the specialized case of false friends between two varieties of Modern Greek, i.e. Standard Modern Greek (SMG) and Cypriot (Standard) Greek (C(S)G). The first variety is used in Greece and the second in Cyprus; the sociolinguistic situation that characterizes the two varieties is described in order to clarify the lack of awareness of the phenomenon. Apart from the theoretical background for the phenomenon and the specifics of these intralinguistic faux amis, the thesis presents and analyses 194 false friends between the two varieties of Modern Greek. The 194 pairs appear as lemmas in Greek accompanied by phonetic transcription, exclusive SMG or C(S)G meanings, as well as common meanings (where these exist), real examples of usage in the Cypriot context and an analysis that attempts to explain their provenance and relation.