3 research outputs found

    A Comparison of Text-Categorization Methods Applied to N-Gram Frequency Statistics

    Abstract. This paper gives an analysis of multi-class e-mail categorization performance, comparing a character n-gram document representation against a word-frequency based representation. Furthermore, the impact of using available e-mail specific meta-information on classification performance is explored and the findings are presented.

    Language-independent pre-processing of large document bases for text classification

    Text classification is a well-known topic in the research of knowledge discovery in databases. Algorithms for text classification generally involve two stages. The first is concerned with identification of textual features (i.e. words and/or phrases) that may be relevant to the classification process. The second is concerned with classification rule mining and categorisation of "unseen" textual data. The first stage is the subject of this thesis and often involves an analysis of text that is both language-specific (and possibly domain-specific), and that may also be computationally costly, especially when dealing with large datasets. Existing approaches to this stage are not, therefore, generally applicable to all languages. In this thesis, we examine a number of alternative keyword selection methods and phrase generation strategies, coupled with two potential significant word list construction mechanisms and two final significant word selection mechanisms, to identify such words and/or phrases in a given textual dataset that are expected to serve to distinguish between classes, by simple, language-independent statistical properties. We present experimental results, using common (large) textual datasets presented in two distinct languages, to show that the proposed approaches can produce good performance with respect to both classification accuracy and processing efficiency. In other words, the study presented in this thesis demonstrates the possibility of efficiently solving the traditional text classification problem in a language-independent (also domain-independent) manner.

    A model for automated topic spotting in a mobile chat based mathematics tutoring environment

    Systems of writing have existed for thousands of years. The history of civilisation and the history of writing are so intertwined that it is hard to separate the one from the other. These systems of writing, however, are not static. They change. One of the latest developments in systems of writing is short electronic messages such as seen on Twitter and in MXit. One novel application which uses these short electronic messages is the Dr Math® project. Dr Math is a mobile online tutoring system where pupils can use MXit on their cell phones and receive help with their mathematics homework from volunteer tutors around the world. These conversations between pupils and tutors are held in MXit lingo or MXit language – this cryptic, abbreviated system 0f ryting w1ch l0ks lyk dis. Project μ (pronounced mu and indicating MXit Understander) investigated how topics could be determined in MXit lingo and Project μ's research outputs spot mathematics topics in conversations between Dr Math tutors and pupils. Once the topics are determined, supporting documentation can be presented to the tutors to assist them in helping pupils with their mathematics homework. Project μ made the following contributions to new knowledge: a statistical and linguistic analysis of MXit lingo provides letter frequencies, word frequencies, message length statistics as well as linguistic bases for new spelling conventions seen in MXit based conversations; a post-stemmer for use with MXit lingo removes suffixes from the ends of words taking into account MXit spelling conventions allowing words such as equashun and equation to be reduced to the same root stem; a list of over ten thousand stop words for MXit lingo appropriate for the domain of mathematics; a misspelling corrector for MXit lingo which corrects words such as acount and equates it to account; and a model for spotting mathematical topics in MXit lingo. The model was instantiated and integrated into the Dr Math tutoring platform. 
Empirical evidence as to the effectiveness of the μ Topic Spotter and the other contributions is also presented. The empirical evidence includes specific statistical tests with MXit lingo, specific tests of the misspelling corrector, stemmer, and feedback mechanism, and an extensive exercise of content analysis with respect to mathematics topics.