14 research outputs found

    Author Profiling for English and Arabic Emails

    This paper reports on some aspects of a research project aimed at automating the analysis of texts for the purpose of author profiling and identification. The Text Attribution Tool (TAT) was developed for language-independent author profiling and has now been trained on two email corpora, English and Arabic. The complete analysis provides probabilities for the author’s basic demographic traits (gender, age, geographic origin, level of education and native language) as well as for five psychometric traits. The prototype system also provides a probability of a match with other texts, whether from known or unknown authors. Data collection was a very important part of the project, and we give an overview of the collection process as well as a detailed description of the corpus of email data that was collected. We describe the overall TAT system and its components before outlining the ways in which the email data is processed and analysed. Because Arabic presents particular challenges for NLP, this paper also describes more specifically the text processing components developed to handle Arabic emails. Finally, we describe the Machine Learning setup used to produce classifiers for the different author traits and we present the experimental results, which are promising for most traits examined. The work presented in this paper was carried out while the authors were working at Appen Pty Ltd., Chatswood NSW 2067, Australia.

    Linguistic knowledge and word sense disambiguation

    The main research question I try to answer in my thesis is which linguistic knowledge sources are most useful for word sense disambiguation (WSD), more specifically word sense disambiguation of Dutch. The goal of the project was to develop a tool that can automatically determine the meaning of a particular ambiguous word in context, a so-called word sense disambiguation system. To achieve this, I make use of the information contained in the context, namely the words surrounding the ambiguous word, and additional underlying information (such as syntactic class and structure) to build a statistical language model. This model is then used to determine the meaning of examples of that particular ambiguous word in new contexts. My results on the (unseen) Senseval-2 test data show that adding structural syntactic information in the form of dependency relations instead of PoS of the context leads to an error rate reduction of 8% for the word form model. Furthermore, the lemma-based approach (introduced in this thesis) outperforms the word form-based approach independently of the features included in the model. We can observe an error rate reduction of 10% with regard to the lemma-based model including PoS in context, and an error rate reduction of 6% with regard to the best model based on word forms. Comparing the results on the test data to results obtained with a different system, using Memory-Based Learning (MBL) as a classification algorithm, both the word form-based classifiers and the lemma-based classifiers from my system produce higher accuracy. The lemma-based model actually leads to an error rate reduction of 10% compared to the MBL WSD system. In my maximum entropy system, the addition of deep linguistic knowledge in particular greatly improves accuracy. The best results for WSD of Dutch on the Senseval-2 data set are obtained by combining this knowledge with the lemma-based approach, which takes advantage of morphological information. Our system achieves significantly higher disambiguation accuracy than any results for Dutch that have been reported in the literature up to now and is thus state-of-the-art for Dutch WSD.