5 research outputs found

    Topics in Matrix Sampling Algorithms

    Full text link
    We study three fundamental problems of Linear Algebra, lying in the heart of various Machine Learning applications, namely: 1)"Low-rank Column-based Matrix Approximation". We are given a matrix A and a target rank k. The goal is to select a subset of columns of A and, by using only these columns, compute a rank k approximation to A that is as good as the rank k approximation that would have been obtained by using all the columns; 2) "Coreset Construction in Least-Squares Regression". We are given a matrix A and a vector b. Consider the (over-constrained) least-squares problem of minimizing ||Ax-b||, over all vectors x in D. The domain D represents the constraints on the solution and can be arbitrary. The goal is to select a subset of the rows of A and b and, by using only these rows, find a solution vector that is as good as the solution vector that would have been obtained by using all the rows; 3) "Feature Selection in K-means Clustering". We are given a set of points described with respect to a large number of features. The goal is to select a subset of the features and, by using only this subset, obtain a k-partition of the points that is as good as the partition that would have been obtained by using all the features. We present novel algorithms for all three problems mentioned above. Our results can be viewed as follow-up research to a line of work known as "Matrix Sampling Algorithms". [Frieze, Kanna, Vempala, 1998] presented the first such algorithm for the Low-rank Matrix Approximation problem. Since then, such algorithms have been developed for several other problems, e.g. Graph Sparsification and Linear Equation Solving. Our contributions to this line of research are: (i) improved algorithms for Low-rank Matrix Approximation and Regression (ii) algorithms for a new problem domain (K-means Clustering).Comment: PhD Thesis, 150 page

    Principled design of evolutionary learning sytems for large scale data mining

    Get PDF
    Currently, the data mining and machine learning fields are facing new challenges because of the amount of information that is collected and needs processing. Many sophisticated learning approaches cannot simply cope with large and complex domains, because of the unmanageable execution times or the loss of prediction and generality capacities that occurs when the domains become more complex. Therefore, to cope with the volumes of information of the current realworld problems there is a need to push forward the boundaries of sophisticated data mining techniques. This thesis is focused on improving the efficiency of Evolutionary Learning systems in large scale domains. Specifically the objective of this thesis is improving the efficiency of the Bioinformatic Hierarchical Evolutionary Learning (BioHEL) system, a system designed with the purpose of handling large domains. This is a classifier system that uses an Iterative Rule Learning approach to generate a set of rules one by one using consecutive Genetic Algorithms. This system have shown to be very competitive so far in large and complex domains. In particular, BioHEL has obtained very important results when solving protein structure prediction problems and has won related merits, such as being placed among the best algorithms for this purpose at the Critical Assessment of Techniques for Protein Structure Prediction (CASP) in 2008 and 2010, and winning the bronze medal at the HUMIES Awards for Human-competitive results in 2007. However, there is still a need to analyse this system in a principled way to determine how the current mechanisms work together to solve larger domains and determine the aspects of the system that can be improved towards this aim. To fulfil the objective of this thesis, the work is divided in two parts. In the first part of the thesis exhaustive experimentation was carried out to determine ways in which the system could be improved. From this exhaustive analysis three main weaknesses are pointed out: a) the problem-dependancy of parameters in BioHEL's fitness function, which results in having a system difficult to set up and which requires an extensive preliminary experimentation to determine the adequate values for these parameters; b) the execution time of the learning process, which at the moment does not use any parallelisation techniques and depends on the size of the training sets; and c) the lack of global supervision over the generated solutions which comes from the usage of the Iterative Rule Learning paradigm and produces larger rule sets in which there is no guarantee of minimality or maximal generality. The second part of the thesis is focused on tackling each one of the weaknesses abovementioned to have a system capable of handling larger domains. First a heuristic approach to set parameters within BioHEL's fitness function is developed. Second a new parallel evaluation process that runs on General Purpose Graphic Processing Units was developed. Finally, post-processing operators to tackle the generality and cardinality of the generated solutions are proposed. By means of these enhancements we managed to improve the BioHEL system to reduce both the learning and the preliminary experimentation time, increase the generality of the final solutions and make the system more accessible for end-users. Moreover, as the techniques discussed in this thesis can be easily extended to other Evolutionary Learning systems we consider them important additions to the research in this field towards tackling large scale domains
    corecore