8 research outputs found

    A new method for identifying bivariate differential expression in high dimensional microarray data using quadratic discriminant analysis

    Abstract

    Background: One of the drawbacks we face when analyzing gene-to-phenotype associations in genomic data is the poor performance of the designed classifier due to the small-sample, high-dimensional data structures (n ≪ p) at hand. This is known as the peaking phenomenon, a common situation in the analysis of gene expression data. Highly predictive bivariate gene interactions whose marginals are useless for discrimination are also affected by this phenomenon, so they are commonly discarded by state-of-the-art sequential search algorithms. Such patterns are known as weak marginal/strong bivariate interactions. This paper addresses the problem of uncovering them in high-dimensional settings.

    Results: We propose a new approach which uses quadratic discriminant analysis (QDA) as a search engine to detect such signals. The choice of QDA is justified by a simulation study over a benchmark of classifiers, which reveals its appealing properties. The procedure rests on an exhaustive search which explores the feature space in a blockwise manner: it divides the features into blocks and assesses the accuracy of the QDA for the predictors within each pair of blocks, with the block size determined by the resistance of the QDA to peaking. This search highlights chunks of features expected to contain the type of subtle interactions we are concerned with; a closer look at this smaller subset, by means of an exhaustive search guided by the QDA error rate over all pairwise input combinations within it, enables their final detection. The proposed method is applied both to synthetic data and to a public-domain microarray dataset. When applied to gene expression data, it yields pairs of genes which are not univariately differentially expressed but exhibit subtle patterns of bivariate differential expression.

    Conclusions: We have proposed a novel approach for identifying weak marginal/strong bivariate interactions. Unlike standard approaches such as the top scoring pair (TSP) and CorScor, our procedure does not assume a specified shape of phenotype separation and may enrich the type of bivariate differential expression patterns that can be uncovered in high-dimensional data.
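    A minimal sketch of the pairwise QDA screen described above, using scikit-learn. The exhaustive pair scan, the toy data, and the cross-validated error criterion are illustrative assumptions; the paper's blockwise decomposition and its peaking-based choice of block size are not reproduced here.

```python
# Illustrative sketch: score candidate feature pairs by cross-validated QDA
# error, so that pairs which discriminate jointly but not marginally surface.
from itertools import combinations

import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def qda_pair_errors(X, y, features, cv=5):
    """Return {(i, j): cross-validated QDA error} for all pairs in features."""
    errors = {}
    for i, j in combinations(features, 2):
        acc = cross_val_score(QuadraticDiscriminantAnalysis(),
                              X[:, [i, j]], y, cv=cv).mean()
        errors[(i, j)] = 1.0 - acc
    return errors

# Toy data with a planted weak marginal/strong bivariate pattern: feature 3 is
# useless on its own, but together with feature 7 the class-conditional
# covariances differ, which a quadratic decision boundary can exploit.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 20))
X[:, 3] = rng.normal(loc=np.where(y == (X[:, 7] > 0), 1.5, -1.5))

errors = qda_pair_errors(X, y, features=range(20))
print("lowest-error pairs:", sorted(errors, key=errors.get)[:3])
```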

    Data mining scenarios for the discovery of subtypes and the comparison of algorithms

    A data mining scenario is a logical sequence of steps to infer patterns from data. In this thesis, we present two scenarios. Our first scenario aims to identify homogeneous subtypes in data. It was applied to clinical research on osteoarthritis (OA) and Parkinson's disease (PD) and in drug discovery. Because OA and PD are characterized by clinical heterogeneity, a more sensitive classification of the cohort of patients may contribute to the search for the underlying disease mechanisms. In drug discovery, subtyping may improve the understanding of the similarity (and distance) between different phenotypic effects as induced by drugs and chemicals. Our second scenario aims to compare text classification algorithms. First, we show that common classifiers achieve comparable performance on most problems. Second, tightly constrained SVM solutions are high performers. In that situation, most training documents are bounded support vectors, the SVM reduces to a nearest mean classifier, and no training is necessary, which calls the merits of the SVM in sparse bag-of-words feature spaces into question. The SVM is also shown to suffer from performance deterioration for particular combinations of training set size and number of features; this relates to outlying documents of distinct classes overlapping in the feature space.
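    A small sketch of the SVM observation above: when a linear SVM on sparse bag-of-words vectors is tightly constrained (very small C), most training documents become bounded support vectors and its accuracy approaches that of a nearest mean (nearest centroid) classifier. The dataset, vectorizer, and C value below are illustrative assumptions, not the thesis's exact experimental setup.

```python
# Illustrative comparison: tightly constrained linear SVM vs. nearest mean
# classifier on a sparse bag-of-words text problem.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid
from sklearn.svm import LinearSVC

cats = ["sci.med", "sci.space"]                   # illustrative 2-class task
data = fetch_20newsgroups(subset="all", categories=cats)
X = CountVectorizer().fit_transform(data.data)    # sparse bag of words
X_tr, X_te, y_tr, y_te = train_test_split(X, data.target, random_state=0)

svm = LinearSVC(C=1e-4).fit(X_tr, y_tr)           # tightly constrained SVM
nmc = NearestCentroid().fit(X_tr, y_tr)           # nearest mean classifier

print("small-C SVM accuracy:", svm.score(X_te, y_te))
print("nearest mean accuracy:", nmc.score(X_te, y_te))
```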

    Efficient feature reduction and classification methods

    The sheer volume of data today and its expected growth over the next years are some of the key challenges in data mining and knowledge discovery applications. Besides the huge number of data samples that are collected and processed, the high-dimensional nature of data arising in many applications creates the need to develop effective and efficient techniques that are able to deal with this massive amount of data. In addition to the significant increase in the demand for computational resources, those large datasets might also influence the quality of several data mining applications (especially if the number of features is very high compared to the number of samples). As the dimensionality of data increases, many types of data analysis and classification problems become significantly harder. This can lead to problems for both supervised and unsupervised learning. Dimensionality reduction and feature (subset) selection methods are two types of techniques for reducing the attribute space. While in feature selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In both approaches, the goal is to select a low-dimensional subset of the attribute space that covers most of the information in the original data. In recent years, feature selection and dimensionality reduction techniques have become a real prerequisite for data mining applications. There are several open questions in this research field, and due to the often increasing number of candidate features in various application areas (e.g., email filtering or drug classification/molecular modeling), new questions arise. In this thesis, we focus on some open research questions in this context, such as the relationship between feature reduction techniques and the resulting classification accuracy, and the relationship between the variability captured in the linear combinations of dimensionality reduction techniques (e.g., PCA, SVD) and the accuracy of machine learning algorithms operating on them. Another important goal is to better understand new techniques for dimensionality reduction, such as nonnegative matrix factorization (NMF), which can be applied for finding parts-based, linear representations of nonnegative data. This "sum-of-parts" representation is especially useful if the interpretability of the original data should be retained. Moreover, performance aspects of feature reduction algorithms are investigated. As data grow, implementations of feature selection and dimensionality reduction techniques for high-performance parallel and distributed computing environments become more and more important.
    In this thesis, we focus on two types of open research questions: methodological advances without any specific application context, and application-driven advances for a specific application context. Summarizing, the new methodological contributions are the following. The utilization of nonnegative matrix factorization in the context of classification methods is investigated; in particular, it is of interest how the improved interpretability of NMF factors due to the non-negativity constraints (which is of central importance in various problem settings) can be exploited. Motivated by this problem context, two new fast initialization techniques for NMF based on feature selection are introduced. It is shown how approximation accuracy can be increased and/or computational effort reduced compared to standard randomized seeding of the NMF and to state-of-the-art initialization strategies suggested earlier. For example, for a given number of iterations and a required approximation error, a speedup of 3.6 compared to standard initialization and of 3.4 compared to state-of-the-art initialization strategies could be achieved. Beyond that, novel classification methods based on the NMF are proposed and investigated. We show that they are not only competitive in terms of classification accuracy with state-of-the-art classifiers, but also provide important advantages in terms of computational effort (especially for low-rank approximations). Moreover, parallelization and distributed execution of NMF are investigated: several algorithmic variants for efficiently computing NMF on multi-core systems are studied and compared to each other, in particular several approaches for exploiting task and/or data parallelism in NMF. We show that for some scenarios new algorithmic variants clearly outperform existing implementations. Last but not least, a computationally very efficient adaptation of the ALS algorithm implementation in Matlab 2009a is investigated. This variant reduces the runtime significantly (in some settings by a factor of 8) and also offers several possibilities for concurrent execution.

    In addition to purely methodological questions, we also address questions arising in the adaptation of feature selection and classification methods to two specific application problems: email classification and in silico screening for drug discovery. Different research challenges arise in these application areas, such as the dynamic nature of data for email classification problems, or the imbalance in the number of available samples of different classes for drug discovery problems. The application-driven advances of this thesis comprise the adaptation and application of latent semantic indexing (LSI) to the task of email filtering. Experimental results show that LSI achieves significantly better classification results than the widespread de facto standard method for this special application context. In the context of drug discovery problems, several groups of well-discriminating descriptors could be identified by utilizing the "sum-of-parts" representation of NMF. The number of important descriptors could be further increased by applying sparseness constraints on the NMF factors.
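    A minimal sketch of the NMF-based feature reduction discussed above, using scikit-learn. The NNDSVD option shown here is a standard SVD-based seeding strategy available in scikit-learn, not necessarily the feature-selection-based initialization variants contributed by the thesis, and the data, rank, and classifier are illustrative assumptions.

```python
# Illustrative sketch: compare random vs. SVD-based (NNDSVD) initialization of
# NMF, then classify samples in the low-rank, nonnegative "parts" space.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 500))            # nonnegative data: n samples x p features
y = rng.integers(0, 2, 200)           # illustrative class labels

for init in ("random", "nndsvd"):     # seeding strategies supported by sklearn
    nmf = NMF(n_components=10, init=init, max_iter=300, random_state=0).fit(X)
    print(f"{init:7s} reconstruction error: {nmf.reconstruction_err_:.3f}")

# Classify on the 10 NMF factors instead of the 500 raw features; the factors
# stay nonnegative, which preserves a degree of interpretability.
W = NMF(n_components=10, init="nndsvd", max_iter=300,
        random_state=0).fit_transform(X)
print("CV accuracy on NMF factors:",
      cross_val_score(LogisticRegression(), W, y, cv=5).mean())
```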

    Tilastollisesti merkityksellisten riippuvuussÀÀntöjen tehokas haku binÀÀridatasta (Efficient search for statistically significant dependency rules in binary data)

    Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data are available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example is genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all, or represent only spurious connections which occur by chance. Therefore, the principal objective is to search for rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of the dependence, without any incidental extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither statistical dependency nor statistical significance is a monotonic property, which means that traditional pruning techniques do not work. As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measure. The mathematical theory is complemented by a new algorithmic invention which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measure, like Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm is well-scalable, especially with Fisher's exact test; it can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or whether the data still contains better, but undiscovered, dependencies.

    Searching for and analyzing statistical dependencies is one of the central tasks of the empirical sciences. Statistical dependencies help us understand cause-and-effect relationships, for example which genes or lifestyle factors predispose to certain diseases and which, in turn, protect against them. Such dependencies can be expressed illustratively as dependency rules of the form ABCD->E, where A, B, C, and D correspond to observed factors and E is a consequence that depends on them statistically. Nowadays vast amounts of data are available for analysis from nearly every area of life. The problem is that not all possible dependencies can be examined with ordinary statistical tools or computer programs. For example, if the data contain 20 variables and each of them can take only two values (for instance, a gene is or is not present in a sample), there are already over 20 million possible dependency rules. Often, however, data contain at least hundreds or even tens of thousands of variables, and examining all possible dependency rules is computationally infeasible. This research develops the efficient computational methods needed for searching binary data, in which each variable can take only two values, for the statistically most significant dependency rules. Besides genetic research, such data occur naturally in, among other fields, biology (plant and animal species present at different observation sites) and market research (so-called market basket data, i.e. which products each customer has bought). If the data contain multi-valued variables, they can always be represented in binary form when needed. Compared to earlier data mining methods, the methods developed in this research are both more efficient and more reliable. Traditionally, dependencies in large data sets have been analyzed with association rules, but association rules do not necessarily express any statistical dependency, or the dependency may be statistically insignificant (a product of chance). In addition, association rule search methods are inefficient at finding all significant dependencies. With the computer program developed in this research, however, it is possible to search for the most significant dependencies in data sets containing even tens of thousands of variables on an ordinary desktop computer. Almost any statistical measure, such as Fisher's exact test or the chi-squared measure, can be used as the search criterion by which the statistical significance of a dependency is evaluated.
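    A toy sketch of the rule-scoring step described above: assessing a candidate rule X->A by Fisher's exact test on the 2x2 table of rule coverage versus consequent. The planted data and the specific rules are illustrative assumptions; the thesis's actual contribution, the pruning theory and the search algorithm built around such measures, is not reproduced here.

```python
# Illustrative sketch: score a dependency rule X -> A with Fisher's exact test.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n = 1000
data = rng.integers(0, 2, size=(n, 4))                  # binary attributes B0..B3
data[:, 3] = (data[:, 0] & data[:, 1]) | (rng.random(n) < 0.1)  # plant B0,B1 -> B3

def rule_p_value(data, x_cols, a_col):
    """One-sided p-value for the rule 'all attributes in x_cols = 1 -> a_col = 1'."""
    covered = data[:, x_cols].all(axis=1)               # rows where X holds
    a = data[:, a_col] == 1
    table = [[int((covered & a).sum()), int((covered & ~a).sum())],
             [int((~covered & a).sum()), int((~covered & ~a).sum())]]
    _, p = fisher_exact(table, alternative="greater")
    return p

print("p-value of {B0,B1} -> B3:", rule_p_value(data, [0, 1], 3))
print("p-value of {B2}    -> B3:", rule_p_value(data, [2], 3))
```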

    StReBio'09: Statistical Relational Learning and Mining in Bioinformatics

    Bioinformatics is an application domain where information is naturally represented in terms of relations between heterogeneous objects. Modern experimentation and data acquisition techniques allow the study of complex interactions in biological systems. This raises interesting challenges for machine learning and data mining researchers, as the amount of data is huge, some information cannot be observed, and measurements may be noisy. This report presents a review of the ACM SIGKDD 2009 Workshop on Statistical Relational Learning and Mining in Bioinformatics (StReBio'09), which was held in Paris on June 28th, 2009. The aim of this workshop was to provide a forum to share challenges, results, and ideas at the frontier between the field of statistical relational learning and the field of bioinformatics.