
    An Intelligent System For Arabic Text Categorization

    Text categorization (classification) is the process of assigning documents to a predefined set of categories based on their content. In this paper, an intelligent Arabic text categorization system is presented. The system uses machine learning algorithms; several stemming and feature-selection algorithms are evaluated. Moreover, documents are represented using several term-weighting schemes, and the k-nearest neighbor and Rocchio classifiers are used for classification. Experiments are performed on a self-collected data corpus, and the results show that the suggested hybrid of statistical and light stemmers is the most suitable stemming algorithm for Arabic. The results also show that a hybrid of document frequency and information gain is the preferable feature-selection criterion and that normalized TF-IDF is the best weighting scheme. Finally, the Rocchio classifier outperforms the k-nearest neighbor classifier in the classification process. The experimental results illustrate that the proposed model is efficient, with a generalization accuracy of about 98%.
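    The pipeline this abstract describes maps closely onto standard tooling. Below is a minimal sketch assuming scikit-learn, with a tiny hypothetical English corpus standing in for the Arabic data; NearestCentroid serves as a Rocchio-style centroid classifier, and the stemming and hybrid feature-selection steps are omitted.

```python
# Minimal sketch of the described pipeline: normalized TF-IDF weighting,
# then Rocchio-style (nearest-centroid) and k-NN classification.
# The corpus below is a hypothetical stand-in for the Arabic data;
# stemming and hybrid feature selection are omitted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

docs = ["late goal wins the match", "team scores a winning goal",
        "stock price falls sharply", "market price rises on trade news"]
labels = ["sports", "sports", "economy", "economy"]

vec = TfidfVectorizer(norm="l2")        # normalized TF-IDF weighting
X = vec.fit_transform(docs)
query = vec.transform(["a dramatic goal decides the match"])

for clf in (NearestCentroid(), KNeighborsClassifier(n_neighbors=3)):
    clf.fit(X, labels)
    print(type(clf).__name__, "->", clf.predict(query)[0])
```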

    Term weighting in short documents for document categorization, keyword extraction, and query expansion

    This thesis focuses on term weighting in short documents. I propose weighting approaches for assessing the importance of terms in three tasks: (1) document categorization, which aims to classify documents such as tweets into categories; (2) keyword extraction, which aims to identify and extract the most important words of a document; and (3) keyword association modeling, which aims to identify links between keywords and use them for query expansion. As the focus of text mining shifts toward datasets of user-generated content, such as social media, the type of data used in text mining research is changing. The main characteristic of this data is its shortness: a user status update usually contains fewer than 20 words. With short documents, the biggest challenge in term weighting is that most words of a document occur only once within it. Such words are called hapax legomena, and we call this the Term Frequency = 1, or TF=1, challenge. Because many traditional feature-weighting approaches, such as Term Frequency - Inverse Document Frequency, are based on the occurrence frequency of each word within a document, they do not perform well with short documents. The first contribution of this thesis is a term-weighting approach for document categorization. It combats the TF=1 challenge by excluding the traditional term frequency from the weighting method, replacing it with word distribution among categories and within a single category as the main components. The second contribution is a keyword extraction approach that evaluates words at three levels: corpus level, cluster level, and document level. I propose novel weighting approaches for all of these levels, designed for use with short documents. Finally, the third contribution is an approach for keyword association weighting used for query expansion. It uses keyword co-occurrences as the main component and builds an association network that aims to identify strong links between keywords. The main finding of this study is that existing term-weighting approaches have trouble performing well with short documents. The novel algorithms proposed in this thesis produce promising results both for keyword extraction and for text categorization. In addition, when using keyword weighting with query expansion, we show that we can produce better search results, especially when the original search terms would not produce any results.
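    As an illustration of the core idea (not the thesis's exact formulas): when TF is 1 for nearly every word, the signal must come from elsewhere, for example from how a term distributes across categories. A minimal sketch with a hypothetical toy corpus:

```python
# Illustrative sketch (not the thesis's exact formula) of weighting a
# term by its distribution among categories instead of by its
# within-document frequency, which is uninformative when TF = 1.
from collections import Counter, defaultdict

docs = [("sports", "late goal wins match"),
        ("sports", "team scores winning goal"),
        ("economy", "stock price falls"),
        ("economy", "market price rises")]

in_cat = defaultdict(Counter)   # term counts within each category
overall = Counter()             # term counts in the whole corpus
for cat, text in docs:
    for term in text.split():
        in_cat[cat][term] += 1
        overall[term] += 1

def weight(term, cat):
    """Relative frequency of `term` in `cat` versus the whole corpus:
    high when the term concentrates in one category."""
    return in_cat[cat][term] / overall[term]

print(weight("goal", "sports"))   # 1.0 -- occurs only in 'sports'
print(weight("price", "sports"))  # 0.0 -- never occurs in 'sports'
```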

    Automatic categorization of diverse experimental information in the bioscience literature

    Background: Curation of information from the bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for the specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and it is therefore time consuming. We developed an automatic method, based on the Support Vector Machine (SVM) machine learning method, for identifying papers containing these curation data types among a large pool of published scientific papers. The classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in production use for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is being adopted in the biocuration processes at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure that uses training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach is significant because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance. Results: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase, and three data types from Mouse Genome Informatics (MGI). It is used in the curation workflow at WormBase for automatic association of newly published papers with ten data types: RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction. Conclusions: Our method is applicable to a variety of data types with training sets containing several hundred to a few thousand documents. It is completely automatic and can thus be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
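    A minimal sketch of this kind of SVM-based paper triage, assuming scikit-learn; the paper texts and labels below are hypothetical placeholders, and in practice one such binary classifier would be trained per data type:

```python
# Minimal sketch of SVM-based paper triage: a binary classifier per
# data type that flags papers likely to contain that data type.
# Texts and labels here are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

papers = ["rnai knockdown of gene x causes sterile phenotype",
          "we review the history of c. elegans research",
          "antibody staining reveals expression in neurons",
          "a new software pipeline for sequence assembly"]
has_rnai = [1, 0, 0, 0]  # 1 = paper contains RNAi results (placeholder)

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(papers, has_rnai)
print(clf.predict(["genome-wide rnai screen identifies sterile mutants"]))
```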

    Web news classification using neural networks based on PCA

    In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network whose inputs are obtained from both principal components and class profile-based features (CPBF). A fixed number of regular words from each class is used as feature vectors together with the features reduced by PCA. These feature vectors are then used as input to the neural network for classification. The experimental evaluation demonstrates that the WPCM provides acceptable classification accuracy on sports news datasets.
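    A minimal sketch of the PCA-plus-neural-network idea, assuming scikit-learn; TruncatedSVD stands in for PCA since it handles sparse TF-IDF matrices, the CPBF inputs are omitted, and the data is a hypothetical placeholder:

```python
# Minimal sketch: reduce TF-IDF features to principal components and
# feed them to a neural network. The class profile-based features
# (CPBF) of the paper are omitted; the data below is a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD  # PCA-style reduction for sparse text
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

pages = ["team wins final match", "striker scores twice",
         "stocks fall on weak data", "central bank raises rates"]
labels = ["sports", "sports", "business", "business"]

model = make_pipeline(TfidfVectorizer(),
                      TruncatedSVD(n_components=2),  # principal components
                      MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000))
model.fit(pages, labels)
print(model.predict(["injured striker misses the match"]))
```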
