6 research outputs found

    Feature selection algorithms for Malaysian dengue outbreak detection model

    Dengue fever is considered one of the most common mosquito-borne diseases worldwide. Dengue outbreak detection can support practical efforts to contain the rapid spread of the disease by providing the knowledge needed to predict the next outbreak. Many studies have modeled and predicted dengue outbreaks using different data mining techniques. This research aimed to identify the features that lead to better predictive accuracy for dengue outbreaks using three feature selection algorithms: particle swarm optimization (PSO), genetic algorithm (GA) and rank search (RS). Based on the selected features, three predictive modeling techniques (J48, DTNB and Naive Bayes) were applied for dengue outbreak detection. The dataset used in this research was obtained from the Public Health Department, Seremban, Negeri Sembilan, Malaysia. The experimental results showed that predictive accuracy improved when feature selection was applied before predictive modeling. The study also identified a set of features that represents dengue outbreak detection for Malaysian health agencies.

    Investigating Citation Linkage Between Research Articles

    In recent years, there has been a dramatic increase in scientific publications across the globe. To help navigate this overabundance of information, methods have been devised to find papers with related content, but they cannot provide the specific information a researcher may need without requiring the researcher to read hundreds of linked papers. The search and browsing capabilities of online domain-specific scientific repositories are limited to finding a paper that cites other papers; they do not point to the specific text that is being cited. Providing this capability would reduce the time researchers need to acquire the background information required for their work. In this thesis, we present our effort to develop a citation linkage framework for finding those sentences in a cited article that are the focus of a citation in a citing paper. This undertaking has involved the construction of the datasets and corpora required to build models for focused information extraction, text classification and information retrieval. In the first part of this thesis, two preprocessing steps that are expected to assist with the citation linkage task are explored: method mention extraction and rhetorical categorization of scientific discourse. In the second part, two methodologies for achieving the citation linkage goal are investigated. First, regression techniques are used to predict the degree of similarity between citation sentences and their corresponding target sentences, achieving a medium Pearson correlation between predicted and expected values. The resulting models are then used to rank sentences in the cited paper by their predicted scores. Second, search engine-like retrieval techniques are used to rank sentences in the cited paper based on the words contained in the citation sentence. Our experiments show that it is possible to find the set of sentences that a citation refers to in a cited paper with reasonable performance. Possible applications of this work include the creation of better navigation tools for science paper repositories, the development of scientific argumentation across research articles, and multi-document summarization of science articles.
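
The retrieval-style ranking mentioned in this abstract can be illustrated with a minimal sketch. It assumes a plain TF-IDF weighting with cosine similarity, which the abstract does not specify; the function names `tokenize` and `rank_sentences` and the example sentences are hypothetical and are not taken from the thesis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_sentences(citation, sentences):
    """Rank candidate sentences by TF-IDF cosine similarity to the citation.

    Returns a list of (sentence_index, score) pairs, best match first.
    This is an illustrative baseline, not the thesis's actual system.
    """
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    # Document frequency: in how many candidate sentences each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        # Terms unseen in the candidate sentences cannot match, so drop them.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)

    query = vectorize(tokenize(citation))
    scored = [(i, cosine(query, vectorize(doc))) for i, doc in enumerate(docs)]
    # Highest score first; ties keep the original sentence order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


if __name__ == "__main__":
    cited_paper = [
        "We train a support vector machine on the corpus.",
        "The weather was pleasant during the conference.",
        "Results are reported in the next section.",
    ]
    ranking = rank_sentences("They used a support vector machine classifier.", cited_paper)
    for index, score in ranking:
        print(f"{score:.3f}  {cited_paper[index]}")
```

A word-overlap baseline of this kind only rewards shared vocabulary, which is presumably why the thesis also investigates learned regression models.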

    Algorithmes d'estimation pour la classification parcimonieuse

    This thesis deals with the development of estimation algorithms with embedded feature selection in the context of high-dimensional data, in the supervised and unsupervised frameworks. The contributions of this work take the form of two algorithms: GLOSS for the supervised domain and Mix-GLOSS for its unsupervised counterpart. Both algorithms are based on solving an optimal scoring regression regularized with a quadratic formulation of the group-Lasso penalty, which encourages the removal of uninformative features. The theoretical foundations proving that a group-Lasso penalized optimal scoring regression can be used to solve a linear discriminant analysis problem are developed here for the first time. The theory that adapts this technique to the unsupervised domain by means of the EM algorithm is not new, but it has never been clearly set out for sparsity-inducing penalties. This thesis demonstrates that group-Lasso penalized optimal scoring regression can be used inside an EM algorithm. Our algorithms have been tested on real and artificial high-dimensional databases, with convincing results in terms of parsimony and without compromising prediction performance.
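
For orientation, the kind of objective this abstract describes can be written out as follows. This is the standard textbook form of penalized optimal scoring with a group-Lasso term, stated under assumed notation (the symbols Y, X, \Theta, B, \beta^j, \lambda and \tau_j are chosen here for illustration); the thesis's own notation and its exact quadratic formulation may differ.

```latex
% Assumed notation: Y is the n x K class-indicator matrix, X the n x p feature
% matrix, \Theta the K x (K-1) matrix of scores, and B the p x (K-1) coefficient
% matrix whose j-th row \beta^j groups all coefficients of feature j.
\begin{equation}
  \min_{\Theta,\,B}\; \lVert Y\Theta - XB \rVert_F^2
  + \lambda \sum_{j=1}^{p} \lVert \beta^j \rVert_2
  \quad \text{subject to} \quad
  \tfrac{1}{n}\,\Theta^\top Y^\top Y\,\Theta = I_{K-1}
\end{equation}

% A well-known variational identity that turns each group norm into a quadratic
% term in \beta^j; for \beta^j \neq 0 the minimum is attained at
% \tau_j = \lVert \beta^j \rVert_2.
\begin{equation}
  \lVert \beta^j \rVert_2
  = \min_{\tau_j > 0}\; \frac{1}{2}
    \left( \frac{\lVert \beta^j \rVert_2^2}{\tau_j} + \tau_j \right)
\end{equation}
```

Because the penalty acts on whole rows of B, a row driven to zero removes the corresponding feature from every discriminant direction at once, which is the feature-removal effect the abstract refers to.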

    Multiple classifier systems based on directed attribute selection in credit risk assessment

    Following the author's previous research, this doctoral dissertation is the next step in credit risk classification research. Based on observations of behavior found in nature and society, the idea of combining experts' decisions has gained significant importance in the research community, especially in the area of data classification. The increasing focus of researchers, as well as promising findings, directed the author's interest to this research area. The purpose of the research conducted and described in this dissertation is to investigate the application of multiple classifier systems based on attribute selection to credit risk assessment. In accordance with this purpose, several studies were conducted that jointly represent a complex approach to the selected problem. The main goal of this work is to develop a fast and robust technique for combining classifiers, based on directed attribute selection, that can create efficient and accurate systems for retail credit risk assessment. The technique must also be simple enough for easy implementation and wide application by the research community, including researchers who are not primarily focused on this field. The two key elements of the new technique are: (1) attribute selection, used as a strategy for training diverse classifiers, and (2) ensemble thinning, used to include only those classifiers that contribute to overall system quality. Attribute selection in this context refers to the use of several different fast techniques that rank attributes by their quality. To ensure that different attributes are selected, it is necessary to consider techniques based on different evaluation criteria for attribute ranking. The subsets of attributes selected in this manner are used to train the classifiers, which ensures that the resulting models differ. In the next step, the technique selects only those models that, when combined, contribute positively to the performance of the ensemble. The selection is carried out by a new greedy algorithm for ensemble thinning proposed in this work. Including ensemble thinning in the new technique increases the efficiency and quality of its decisions. The new technique was tested on credit data sets in accordance with the research hypotheses of this doctoral dissertation. The results obtained with the new technique are compared with the results of the individual classifiers included in the final ensemble in order to justify combining them. Additionally, the decisions made by the ensemble thinning algorithm are analyzed, as is the relationship between the Q statistic and ensemble accuracy. In a further round of research, the results of the new technique are evaluated against the Bagging and Boosting techniques. Results are evaluated with four performance measures: accuracy, type I error, type II error and AUC. The time needed to train and test the models is also measured and compared. The results show a significant improvement in classification performance over the individual classifiers as a direct result of the new technique. Furthermore, the quality of the results is comparable to that of the most popular techniques, and three of the four performance measures show the new technique to be superior. In accordance with its design, the new technique performs best on ensembles with a small number of members and is not time-consuming compared to Bagging and Boosting.
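
The abstract does not describe the proposed greedy algorithm, so the following is only a minimal sketch of one possible greedy thinning scheme: backward elimination under majority voting, together with the pairwise Q statistic in its usual definition. The function names, the toy predictions and the stopping rule are all assumptions made for illustration and should not be read as the dissertation's method.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list by majority vote.

    Ties are resolved in favor of the label voted first, i.e. by the
    classifier that appears earliest in `predictions`.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(predicted, truth):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def q_statistic(pred_a, pred_b, truth):
    """Pairwise Q statistic between two classifiers.

    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), where N11 counts cases
    both classifiers get right, N00 cases both get wrong, and N10 / N01
    cases only the first / only the second gets right.
    """
    n11 = n00 = n10 = n01 = 0
    for a, b, t in zip(pred_a, pred_b, truth):
        a_ok, b_ok = a == t, b == t
        if a_ok and b_ok:
            n11 += 1
        elif not a_ok and not b_ok:
            n00 += 1
        elif a_ok:
            n10 += 1
        else:
            n01 += 1
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        return 0.0  # Q is undefined here; 0.0 is an assumed fallback
    return (n11 * n00 - n01 * n10) / denominator


def greedy_thinning(predictions, truth):
    """Greedy backward elimination (an assumed scheme, see lead-in).

    Starting from the full ensemble, repeatedly drop the member whose removal
    raises majority-vote accuracy the most; stop when no removal helps.
    Returns (indices of kept classifiers, ensemble accuracy).
    """
    selected = list(range(len(predictions)))
    best = accuracy(majority_vote([predictions[i] for i in selected]), truth)
    while len(selected) > 1:
        trials = []
        for drop in selected:
            kept = [i for i in selected if i != drop]
            acc = accuracy(majority_vote([predictions[i] for i in kept]), truth)
            trials.append((acc, drop))
        candidate_acc, drop = max(trials, key=lambda trial: trial[0])
        if candidate_acc <= best:
            break
        selected.remove(drop)
        best = candidate_acc
    return selected, best


if __name__ == "__main__":
    # Toy validation labels and predictions of four hypothetical classifiers.
    truth = [1, 0, 1, 1, 0, 0, 1, 0]
    predictions = [
        [1, 0, 1, 1, 0, 0, 0, 1],  # classifier 0
        [1, 0, 1, 0, 1, 0, 1, 0],  # classifier 1
        [0, 1, 1, 1, 0, 0, 1, 0],  # classifier 2
        [0, 1, 0, 0, 1, 1, 0, 1],  # classifier 3
    ]
    print(greedy_thinning(predictions, truth))
    print(q_statistic(predictions[0], predictions[1], truth))
```

In this toy data the first three classifiers err on different cases, so their pairwise Q is negative and their combined vote is more accurate than any one of them, which illustrates why diversity and accuracy are examined together in this kind of study.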

    Scalable Feature Selection for Multi-class Problems
