
    Study of gene expression representation with Treelets and hierarchical clustering algorithms

    Since the mid-1990s, the field of genomic signal processing has exploded due to the development of DNA microarray technology, which made it possible to measure the mRNA expression of thousands of genes in parallel. Researchers have developed a vast body of knowledge in classification methods; however, microarray data are characterized by extremely high dimensionality and a comparatively small number of data points, which makes their analysis quite distinctive. In this work we have developed various hierarchical clustering algorithms to improve the microarray classification task. First, the original feature set of gene expression values is enriched with new features that are linear combinations of the original ones. These new features, called metagenes, are produced by the different proposed hierarchical clustering algorithms. To demonstrate the utility of this methodology for classifying microarray datasets, the construction of a reliable classifier via a feature selection process is introduced. The methodology has been tested on three public cancer datasets: Colon, Leukemia, and Lymphoma. The proposed method obtains better classification results than when this enrichment is not performed, confirming the value of metagene generation for improving the final classifier. Second, a new technique uses hierarchical clustering to reduce the huge microarray datasets, removing at the outset the genes that are not relevant to the cancer classification task. Experimental results for this method, applied to one public database, are presented and analyzed, demonstrating the utility of this new approach.
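
    The metagene construction described above can be sketched in a few lines: cluster the genes by correlation, then average each cluster into one new feature appended to the original set. This is a hedged illustration only, not the thesis's exact algorithm; the equal-weight averaging, the correlation distance, and the name `make_metagenes` are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def make_metagenes(X, n_metagenes):
    # X: samples x genes expression matrix.
    # Cluster genes (columns) by 1 - Pearson correlation, then append one
    # metagene per cluster: the mean expression of the cluster's genes,
    # i.e. an equal-weight linear combination of the original features.
    dist = 1.0 - np.corrcoef(X.T)                      # gene-gene distances
    condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed form
    Z = linkage(condensed, method="average")           # hierarchical tree
    labels = fcluster(Z, t=n_metagenes, criterion="maxclust")
    metagenes = np.column_stack(
        [X[:, labels == c].mean(axis=1) for c in np.unique(labels)]
    )
    return np.hstack([X, metagenes])                   # enriched feature set
```

    A feature selection step would then pick the most discriminative columns (original genes or metagenes) before training the final classifier.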

    Predicting Distributions of Estuarine Associated Fish and Invertebrates in Southeast Alaska

    Thesis (Ph.D.), University of Alaska Fairbanks, 2013. Estuaries in Southeast Alaska provide habitat for juveniles and adults of several commercial fish and invertebrate species; however, because of the area's size and challenging environment, very little is known about the spatial structure and distribution of estuarine species in relation to the biotic and abiotic environment. This study uses advanced machine learning algorithms (random forest and multivariate random forest) and landscape- and seascape-scale environmental variables to develop predictive models of species occurrence and community composition within Southeast Alaskan estuaries. Species data were obtained from trawl and seine sampling in 49 estuaries throughout the study area. Environmental data were compiled and extracted from existing spatial datasets. Individual models for species occurrence were validated using independent data from seine surveys in 88 estuaries. Prediction accuracy for individual species models ranged from 63% to 94%, with 76% of the fish species models and 72% of the invertebrate models achieving a predictive accuracy of 70% or better. The models elucidated complex species-habitat relationships that can be used to identify habitat protection priorities and to guide future research. The multivariate models demonstrated that community composition was strongly related to regional patterns of precipitation and tidal energy, as well as to the local abundance of intertidal habitat and vegetation. The models provide insight into how changes in species abundance are influenced by both environmental variation and the co-occurrence of other species. Taxonomic diversity in the region was high (74%) while functional diversity was relatively low (23%). Functional diversity was not linearly correlated with species richness, indicating that the number of species in an estuary was not a good predictor of functional diversity or redundancy.
Functional redundancy differed across estuary clusters, suggesting that some estuaries have a greater potential for loss of functional diversity with species removal than others.
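
    The occurrence-modelling idea above (presence/absence prediction by an ensemble of bootstrapped trees with majority voting, validated by prediction accuracy) can be illustrated with a deliberately tiny stand-in: a bagged ensemble of one-split decision stumps rather than full random forests. All names and parameters below are illustrative, not from the dissertation.

```python
import numpy as np

def fit_stump(X, y):
    # Exhaustive search for the (feature, threshold, polarity) split
    # with the lowest 0-1 error on binary presence/absence labels.
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = (X[:, j] > t).astype(int)
            for pol, p in ((1, pred), (-1, 1 - pred)):
                err = np.mean(p != y)
                if err < best[3]:
                    best = (j, t, pol, err)
    return best[:3]

def predict_occurrence(X, y, X_new, n_trees=25, seed=0):
    # Bag stumps over bootstrap resamples and take a majority vote --
    # the same bagging-plus-voting scheme a random forest uses, minus
    # tree depth and per-split feature subsampling.
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_new))
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))        # bootstrap sample
        j, t, pol = fit_stump(X[idx], y[idx])
        pred = (X_new[:, j] > t).astype(int)
        votes += pred if pol == 1 else 1 - pred
    return (votes / n_trees >= 0.5).astype(int)
```

    Validation against independent survey data, as in the study, would simply compare these predictions with held-out presence/absence records.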

    Learning from Multi-Class Imbalanced Big Data with Apache Spark

    With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the number of examples is not evenly distributed across the classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results when learning from the rare cases is the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or for parallelism, and so their effectiveness has been hindered. This dissertation addresses some of these challenges through in-depth experimentation with novel implementations of machine learning algorithms using Apache Spark, a distributed computing framework based on the MapReduce model and designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary-class datasets, has been implemented using Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study of how instance-level difficulty affects learning from large datasets was also performed.
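
    The oversampling idea that the dissertation scales out can be shown on a single machine in a few lines. Below is a minimal sketch of multi-class random oversampling: duplicate minority-class rows (sampling with replacement) until every class matches the majority-class count. The function name and the NumPy-based, non-distributed form are assumptions; the dissertation's implementations run on Apache Spark.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    # For each class with fewer rows than the largest class, draw extra
    # rows with replacement until all class counts are equal.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts_X, parts_y = [X], [y]
    for c, n in zip(classes, counts):
        if n < target:
            idx = np.flatnonzero(y == c)
            extra = rng.choice(idx, size=target - n, replace=True)
            parts_X.append(X[extra])
            parts_y.append(y[extra])
    return np.vstack(parts_X), np.concatenate(parts_y)
```

    More elaborate methods (e.g., interpolating synthetic minority examples rather than duplicating rows) follow the same balance-the-counts pattern.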

    Analysis of Microarray Data using Machine Learning Techniques on Scalable Platforms

    Microarray-based gene expression profiling has emerged as an efficient technique for the classification, diagnosis, prognosis, and treatment of cancer. Frequent changes in the behavior of this disease generate a huge volume of data. The data retrieved from microarrays carry questions of veracity, and they change over time (velocity). Moreover, they are high-dimensional, with a very large number of features relative to the number of samples. Analyzing such high-dimensional microarray datasets within a short period is therefore essential. These datasets often contain a huge number of genes, only a fraction of which are significantly expressed, so identifying the precise and interesting genes responsible for cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process: feature selection/extraction followed by classification. Our investigation starts with the analysis of microarray data using kernel-based classifiers after feature selection with the statistical t-test. In this work, various kernel-based classifiers such as the Extreme Learning Machine (ELM), the Relevance Vector Machine (RVM), and a newly proposed method called the Kernel Fuzzy Inference System (KFIS) are implemented. The proposed models are investigated on three microarray datasets: Leukemia, Breast, and Ovarian cancer. Finally, the performance of these classifiers is measured and compared with the Support Vector Machine (SVM). The results reveal that the proposed models classify the datasets efficiently, with performance comparable to existing kernel-based classifiers. As data size increases, handling and processing these datasets becomes a bottleneck. Hence, a distributed, scalable cluster such as Hadoop is needed to store (HDFS) and process (MapReduce as well as Spark) the datasets efficiently.
The next contribution of this thesis is the implementation of feature selection methods that can process the data in a distributed manner. Various statistical tests, including the ANOVA, Kruskal-Wallis, and Friedman tests, are implemented using the MapReduce and Spark frameworks and executed on top of a Hadoop cluster. The performance of these scalable models is measured and compared with a conventional system. The results show that the proposed scalable models can efficiently process data of large dimensions (GBs, TBs, etc.) that cannot be processed with traditional implementations of those algorithms. After the relevant features are selected, the next contribution of this thesis is a scalable implementation of the proximal support vector machine classifier, an efficient variant of SVM. The proposed classifier is implemented on the two scalable frameworks, MapReduce and Spark, and executed on the Hadoop cluster. The results, compared with those obtained on a conventional system, show that the scalable cluster is well suited to Big data. Furthermore, Spark proves more efficient than MapReduce thanks to its intelligent handling of datasets through Resilient Distributed Datasets (RDDs) and in-memory processing. The next contribution of the thesis is therefore the implementation of various scalable classifiers based on Spark. In this work, classifiers including Logistic Regression (LR), the Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbors (KNN), the Artificial Neural Network (ANN), and the Radial Basis Function Network (RBFN), the last with two variants (hybrid and gradient descent learning algorithms), are proposed and implemented using the Spark framework. The proposed scalable models are executed on the Hadoop cluster as well as on a conventional system, and the results are investigated.
The obtained results show that the scalable algorithms are far more efficient than the conventional system for processing Big datasets. The efficacy of the proposed scalable algorithms in handling Big datasets is investigated and compared with the conventional system (where data are not distributed but kept on a standalone machine and processed in a traditional manner). The comparative analysis shows that Big datasets are processed far more efficiently by the scalable algorithms on the Hadoop cluster than by the conventional system.
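
    The first phase of the pipeline, t-test feature filtering ahead of the kernel classifiers, can be sketched directly. This is a hedged single-machine illustration for binary labels; the function name is an assumption, and the thesis additionally implements ANOVA, Kruskal-Wallis, and Friedman tests on MapReduce/Spark, which are not shown here.

```python
import numpy as np
from scipy import stats

def ttest_select(X, y, k):
    # Rank genes (columns of X) by the absolute two-sample t-statistic
    # between the two classes (y is 0/1) and keep the k most
    # discriminative columns for the downstream classifier.
    t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    top = np.argsort(-np.abs(t))[:k]
    return X[:, top], top
```

    The reduced matrix would then be fed to one of the kernel-based classifiers (ELM, RVM, KFIS, or SVM) described above.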

    The multiple pheromone Ant clustering algorithm

    Ant Colony Optimisation algorithms mimic the way ants use pheromones for marking paths to important locations. Pheromone traces are followed and reinforced by other ants, but also evaporate over time. As a consequence, optimal paths attract more pheromone, whilst the less useful paths fade away. In the Multiple Pheromone Ant Clustering Algorithm (MPACA), ants detect features of objects represented as nodes within graph space. Each node has one or more ants assigned to each feature. Ants attempt to locate nodes with matching feature values, depositing pheromone traces on the way. This use of multiple pheromone values is a key innovation. Ants record other ant encounters, keeping a record of the features and colony membership of ants. The recorded values determine when ants should combine their features to look for conjunctions and whether they should merge into colonies. This ability to detect and deposit pheromone representative of feature combinations, and the resulting colony formation, renders the algorithm a powerful clustering tool. The MPACA operates as follows: (i) initially each node has ants assigned to each feature; (ii) ants roam the graph space searching for nodes with matching features; (iii) when departing matching nodes, ants deposit pheromones to inform other ants that the path goes to a node with the associated feature values; (iv) ant feature encounters are counted each time an ant arrives at a node; (v) if the feature encounters exceed a threshold value, feature combination occurs; (vi) a similar mechanism is used for colony merging. The model varies from traditional ACO in that: (i) a modified pheromone-driven movement mechanism is used; (ii) ants learn feature combinations and deposit multiple pheromone scents accordingly; (iii) ants merge into colonies, the basis of cluster formation. The MPACA is evaluated over synthetic and real-world datasets and its performance compares favourably with alternative approaches.
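
    The evaporation-and-reinforcement dynamic at the heart of the algorithm reduces to a simple update rule: decay every trail, then add pheromone on the edges ants actually traversed. The sketch below is a generic single-pheromone step, not the MPACA's multi-pheromone, per-feature bookkeeping, and the parameter values are illustrative.

```python
import numpy as np

def update_pheromone(tau, visits, rho=0.1, q=1.0):
    # tau: pheromone matrix over edges of the graph space.
    # visits: list of (i, j) edges traversed by ants this step.
    # rho: evaporation rate; q: deposit per traversal (illustrative values).
    tau = (1.0 - rho) * tau          # evaporation: unused paths fade away
    for edge in visits:              # reinforcement: followed paths grow
        tau[edge] += q
    return tau
```

    Iterating this rule is what makes frequently used paths (and, in the MPACA, frequently co-occurring features) dominate over time.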

    Declarative Querying For Biological Sequences.

    Life science research labs today manage increasing volumes of sequence data. Much of the data management and querying today is accomplished procedurally using Perl, Python, or Java programs that integrate data from different sources and query tools. The dangers of this procedural approach are well known to the database community: (a) severe limitations on the ability to rapidly express queries, and (b) inefficient query plans due to the lack of sophisticated optimization tools. This situation is likely to get worse with advances in high-throughput technologies that make it easier to quickly produce vast amounts of sequence data. The need for a declarative and efficient system to manage and query biological sequence data is urgent. To address this need, we designed the Periscope/SQ system. Periscope/SQ extends current relational systems to enable sophisticated queries on sequence data and can optimize and execute these queries efficiently. This thesis describes the problems that need to be solved to build the Periscope/SQ system. First, we describe the algebraic framework which forms the backbone of Periscope/SQ. Second, we describe algorithms to construct large-scale suffix tree indexes for efficiently answering sequence queries. Third, we describe techniques for selectivity estimation and optimization in the context of queries over biological sequences. Next, we demonstrate how some of the techniques developed for Periscope/SQ can be applied to produce a powerful mining algorithm that we call FLAME. Finally, we describe GeneFinder, a biological application built on top of Periscope/SQ. GeneFinder is currently being used to predict the targets of transcription factors. Today, genomic and proteomic sequences are the most abundantly available source of high-quality biological data.
By making it possible to declaratively and efficiently query vast amounts of sequence data, Periscope/SQ opens the door to vast improvements in the pace of bioinformatics research.
Ph.D. Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/55670/2/tatas_1.pd
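
    The suffix-index idea behind Periscope/SQ's sequence queries can be illustrated with a toy suffix array. The thesis builds large-scale suffix trees; a suffix array is a related structure that is easier to sketch, and the naive construction and names below are illustrative only.

```python
import bisect

def suffix_array(s):
    # Naive construction: sort suffix start positions lexicographically.
    # O(n^2 log n) -- fine for a toy, unlike the disk-scale indexes above.
    return sorted(range(len(s)), key=lambda i: s[i:])

def count_occurrences(s, sa, pattern):
    # All occurrences of `pattern` form one contiguous run of suffixes in
    # the sorted order; binary search finds the run's boundaries.
    keys = [s[i:i + len(pattern)] for i in sa]
    lo = bisect.bisect_left(keys, pattern)
    hi = bisect.bisect_right(keys, pattern)
    return hi - lo
```

    A declarative engine can answer "how many times does this motif occur?" through such an index instead of rescanning the sequence, which is the efficiency argument the thesis makes at scale.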

    Proceedings of the 35th WIC Symposium on Information Theory in the Benelux and the 4th joint WIC/IEEE Symposium on Information Theory and Signal Processing in the Benelux, Eindhoven, the Netherlands May 12-13, 2014

    Compressive sensing (CS) as an approach to data acquisition has recently received much attention. In CS, the signal recovery problem from the observed data requires solving for a sparse vector from an underdetermined system of equations. The underlying sparse signal recovery problem is quite general, with many applications, and is the focus of this talk. The main emphasis will be on Bayesian approaches to sparse signal recovery. We will examine sparse priors such as the super-Gaussian and Student-t priors and appropriate MAP estimation methods. In particular, re-weighted l2 and re-weighted l1 methods developed to solve the resulting optimization problem will be discussed. The talk will also examine a hierarchical Bayesian framework and then study in detail an empirical Bayesian method, the Sparse Bayesian Learning (SBL) method. If time permits, we will also discuss Bayesian methods for sparse recovery problems with structure: intra-vector correlation in the context of the block sparse model, and inter-vector correlation in the context of the multiple measurement vector problem.
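
    The re-weighted l2 idea mentioned above can be sketched as a FOCUSS-style iteration: repeatedly solve a weighted minimum-norm problem, with the weights taken from the previous iterate so that small coefficients are driven toward zero. This is a hedged sketch; the iteration count, epsilon, and exact weighting are illustrative choices, not taken from the talk.

```python
import numpy as np

def reweighted_l2(A, b, iters=30, eps=1e-6):
    # Start from the minimum-l2-norm solution of the underdetermined
    # system Ax = b, then iterate x <- W (A W)^+ b with W = diag(|x|+eps).
    # Coefficients that stay small get small weights and shrink further,
    # yielding a sparse solution that still satisfies Ax = b.
    x = np.linalg.pinv(A) @ b
    for _ in range(iters):
        W = np.diag(np.abs(x) + eps)
        x = W @ np.linalg.pinv(A @ W) @ b
    return x
```

    Re-weighted l1 methods follow the same pattern with a weighted l1 objective per iteration, and SBL arrives at a related iteration from an empirical Bayesian derivation.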

    A METHOD FOR DETECTING OPTIMAL SPLITS OVER TIME IN SURVIVAL ANALYSIS USING TREE-STRUCTURED MODELS

    One of the most popular uses for tree-based methods is in survival analysis of censored time data, where the goal is to identify factors that are predictive of survival. Tree-based methods, due to their ability to identify subgroups in a hierarchical manner, can sometimes provide a useful alternative to Cox's proportional hazards model (1972) for the exploration of survival data. Since the data are partitioned into approximately homogeneous groups, Kaplan-Meier estimators can be used to compare prognoses between the groups represented by "nodes" in the tree. The demand for tree-based methods comes from clinical studies where the investigators are interested in grouping patients with differing prognoses. Tree-based analyses are usually conducted at landmark time points, for example five-year overall survival, but the effects of some covariates might be attenuated or increased at other landmark time points. In some applications, it may also be of interest to determine the time point, with respect to the outcome of interest, at which the greatest discrimination between subgroups occurs; with a conventional approach, the time point at which the discrimination is greatest might be missed. To remedy this potential problem, we propose a tree-structured method that splits on the potential time-varying effects of the covariates. Accordingly, with our method we find the best point of discrimination for a covariate with respect not only to a particular value of that covariate but also to the time when the endpoint of interest is observed. We analyze survival data from the National Surgical Adjuvant Breast and Bowel Project (NSABP) Protocol B-09 to demonstrate our method. Simulations are used to assess the statistical properties of this proposed methodology. We propose a new method in survival analysis, an area of statistics commonly used to assess the prognoses of patients or participants in large public health studies. Our proposed method has public health significance because it could potentially facilitate a more refined assessment of the effect of biological and clinical markers on the survival times of different patient populations.
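
    The Kaplan-Meier estimator used to compare prognoses across the tree's node subgroups is easy to state directly: at each distinct event time t, multiply the running survival estimate by (1 - d/n), where d is the number of events at t and n is the number still at risk. Below is a minimal sketch; the function name and interface are illustrative.

```python
import numpy as np

def kaplan_meier(times, events):
    # times: follow-up times; events: 1 = event observed, 0 = censored.
    # Returns the distinct event times and the product-limit estimate
    # S(t) = prod over event times t_i <= t of (1 - d_i / n_i).
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    ts, surv, s = [], [], 1.0
    for t in np.unique(times):
        d = np.sum((times == t) & (events == 1))   # events at time t
        n = np.sum(times >= t)                     # at risk just before t
        if d > 0:
            s *= 1.0 - d / n
            ts.append(t)
            surv.append(s)
    return np.array(ts), np.array(surv)
```

    Comparing such curves between the subgroups defined by a candidate split, across candidate times, is the discrimination the proposed method searches over.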

    Approximate Matching in Genomic Sequence Data

    Ph.D. (Doctor of Philosophy)