
    Integrative mixture of experts to combine clinical factors and gene markers

    Motivation: Microarrays are increasingly used in cancer research to better characterize and classify tumors by selecting marker genes. However, as very few of these genes have been validated as predictive biomarkers so far, it is mostly conventional clinical and pathological factors that are used as prognostic indicators of clinical course. Combining clinical data with gene expression data may add valuable information, but it is a challenging task due to their categorical versus continuous characteristics. We have further developed the mixture of experts (ME) methodology, a promising approach to tackle complex non-linear problems. Several variants of integrative ME are proposed, along with the inclusion of various gene selection methods to select a hybrid signature.
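
    A minimal sketch of the combination idea, assuming pre-fitted weights: clinical factors are one-hot encoded, gene expression values stay continuous, and a gating network softly assigns each patient to logistic experts. The two-expert setup and all parameters below are illustrative, not the paper's integrative ME variants.

        import numpy as np

        def softmax(z):
            z = z - z.max()
            e = np.exp(z)
            return e / e.sum()

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def moe_predict(x_clinical, x_genes, gate_W, expert_Ws):
            # Hybrid feature vector: categorical (one-hot) plus continuous parts.
            x = np.concatenate([x_clinical, x_genes])
            gate = softmax(gate_W @ x)                    # soft expert assignment
            experts = np.array([sigmoid(w @ x) for w in expert_Ws])
            return float(gate @ experts)                  # gate-weighted class probability

        rng = np.random.default_rng(0)
        x_clin = np.array([1.0, 0.0, 0.0])                # e.g. one-hot tumor grade
        x_gene = rng.normal(size=5)                       # five selected marker genes
        gate_W = rng.normal(size=(2, 8))                  # two experts, eight features
        expert_Ws = rng.normal(size=(2, 8))
        print(moe_predict(x_clin, x_gene, gate_W, expert_Ws))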

    Pattern mining under different conditions

    New requirements and demands on pattern mining arise in modern applications and cannot be met by conventional methods. For example, in scientific research, scientists are most interested in unknown knowledge, which usually hides in significant but infrequent patterns, whereas existing itemset mining algorithms are designed for highly frequent patterns. Furthermore, scientists need to repeat an experiment many times to ensure reproducibility, so a series of datasets is generated at once, each awaiting clustering and each potentially containing an unknown number of clusters with various densities and shapes; using existing clustering algorithms is time-consuming because parameters must be tuned for each dataset. Many scientific datasets are also extremely noisy, containing considerably more noise points than in-cluster data points, while most existing clustering algorithms can only handle noise up to a moderate level. Finally, temporal pattern mining is important in scientific research, yet existing temporal pattern mining algorithms only consider point-based events, although most real-world activities are interval-based, with a starting and an ending timestamp. This thesis develops novel pattern mining algorithms for various data mining tasks under these different conditions.

    The first part of the thesis investigates mining less frequent itemsets in transactional datasets. In contrast to existing frequent itemset mining algorithms, it focuses on itemsets that occur relatively rarely. The algorithms NIIMiner, RaCloMiner, and LSCMiner are proposed to identify such itemsets efficiently: NIIMiner uses a negative itemset tree to extract, in a top-down depth-first manner, all patterns occurring less often than a given support threshold; RaCloMiner combines existing bottom-up frequent itemset mining with a top-down itemset mining algorithm to mine less frequent patterns more efficiently; and LSCMiner addresses mining less frequent closed patterns.

    The second part studies interval-based temporal pattern mining in the stream setting. Interval-based temporal patterns are sequential patterns in which each event carries both a starting and an ending timestamp. Existing approaches lack the ability to handle interval-based events and stream data together; a novel interval-based temporal pattern mining algorithm for stream data is described in this part.

    The last part studies new clustering problems on numeric datasets. The first is shape-alternation adaptivity: in applications such as scientific data analysis, scientists deal with a series of datasets generated from one experiment, in which cluster sizes and shapes differ. A kNN density-based clustering algorithm, kadaClus, is proposed to provide this adaptability so that users do not need to tune parameters for each dataset. The second is clustering in extremely noisy datasets, which contain considerably more noise points than in-cluster data points; a novel clustering algorithm, kenClus, is proposed to identify clusters of arbitrary shape in such datasets. Both clustering algorithms are kNN-based and require only one parameter, k.

    In each part, the efficiency and effectiveness of the presented techniques are thoroughly analyzed, and intensive experiments on synthetic and real-world datasets demonstrate the benefits of the proposed algorithms over conventional approaches.
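
    As a point of reference for the first part, here is a deliberately naive baseline for the less-frequent-itemset task, assuming transactions are simple item lists: enumerate candidate itemsets up to a small size cap and keep those whose support falls below a threshold. The combinatorial blow-up of exactly this enumeration is what a negative-itemset-tree approach such as NIIMiner is designed to avoid; the function and dataset are illustrative.

        from itertools import combinations

        def infrequent_itemsets(transactions, max_sup, max_len=3):
            # Brute force: test every itemset up to max_len items and keep
            # those whose support is below max_sup.
            items = sorted({i for t in transactions for i in t})
            tsets = [set(t) for t in transactions]
            result = {}
            for k in range(1, max_len + 1):
                for cand in combinations(items, k):
                    sup = sum(set(cand) <= t for t in tsets)
                    if sup < max_sup:
                        result[cand] = sup
            return result

        db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a"]]
        print(infrequent_itemsets(db, max_sup=2))   # -> {('a','b','c'): 1}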

    Inferring phylogenetic trees under the general Markov model via a minimum spanning tree backbone

    Phylogenetic trees are models of the evolutionary relationships among species, with species typically placed at the leaves of trees. We address the following problems regarding the inference of phylogenetic trees. (1) Leaf-labeled phylogenetic trees may not be appropriate models of evolutionary relationships among rapidly evolving pathogens, whose samples may contain ancestor-descendant pairs. (2) The models of gene evolution that are widely used unrealistically assume that the base composition of DNA sequences does not evolve. Regarding problem (1), we present a method for inferring generally labeled phylogenetic trees, which allow sampled species to be placed at non-leaf nodes of the tree. Regarding problem (2), we present a structural expectation maximization method (SEM-GM) for inferring leaf-labeled phylogenetic trees under the general Markov model (GM), the most complex model of DNA substitution that allows the evolution of base composition. In order to improve the scalability of SEM-GM, we present a minimum spanning tree (MST) framework called MST-backbone, which scales linearly with the number of leaves. However, the unrealistic location of the root inferred on empirical data suggests that the GM model may be overtrained. MST-backbone was inspired by the topological relationship between MSTs and phylogenetic trees introduced by Choi et al. (2011). We discovered that this topological relationship does not necessarily hold if the MST is not unique. We propose so-called vertex-order based MSTs (VMSTs), which guarantee a topological relationship with phylogenetic trees.
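
    A small sketch of the backbone ingredient, assuming Hamming distances over a toy alignment: compute pairwise distances and extract an MST with SciPy. Ties in the distance matrix can make the MST non-unique, which is exactly the situation in which the topological relationship can fail and which the proposed VMSTs address. The sequences and distance choice here are placeholders, not the thesis's GM-based procedure.

        import numpy as np
        from scipy.sparse.csgraph import minimum_spanning_tree

        seqs = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA", "s4": "TGGA"}
        names = list(seqs)
        n = len(names)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                # Hamming distance between aligned sequences.
                d = sum(a != b for a, b in zip(seqs[names[i]], seqs[names[j]]))
                D[i, j] = D[j, i] = d

        mst = minimum_spanning_tree(D)          # sparse matrix of backbone edges
        for i, j in zip(*mst.nonzero()):
            print(names[i], "-", names[j], int(mst[i, j]))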

    Methods for missing time-series data and large spatial data

    Performing accurate statistical inference requires high-quality datasets, but real-world datasets often contain missing values, to varying degrees both spatially and temporally. Modelled datasets can alternatively provide complete coverage, but they are often biased. This thesis derives a simplified approach to the skew Kalman filter that tackles the computational issues in the existing skew Kalman filter by using a secondary dataset to estimate the skewness parameter. In application, the thesis applies the skew Kalman filter to surface-level ozone, bias-correcting the modelled ozone data and using the bias-corrected data to infill missing values in the observed dataset. The thesis further explores working with large spatial datasets. When carrying out spatial inference, using all the available data allows for more accurate inference, but spatial models such as Gaussian processes scale cubically with the number of data points and quickly become computationally infeasible for moderate to large datasets. Divide-and-conquer methods allow data to be split into subsets, with inference carried out on each subset before recombination. While well documented in the independent setting, these methods are less popular in the spatial setting. This thesis evaluates how well divide-and-conquer methods in the spatial setting approximate the results of inference on the full dataset, and demonstrates this using USA temperature data.
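
    For orientation, the sketch below shows a standard univariate Gaussian Kalman filter step with a missing-observation branch, the baseline that a skew Kalman filter generalizes; the skewness estimation from a secondary dataset is the thesis's contribution and is not reproduced here. All model parameters are illustrative.

        def kalman_step(m, P, y, F=1.0, Q=0.1, H=1.0, R=0.5):
            # Predict: propagate the state mean m and variance P forward in time.
            m_pred = F * m
            P_pred = F * P * F + Q
            if y is None:                      # missing observation: prediction only
                return m_pred, P_pred
            # Update: fold in the observation y via the Kalman gain K.
            S = H * P_pred * H + R
            K = P_pred * H / S
            m_new = m_pred + K * (y - H * m_pred)
            P_new = (1 - K * H) * P_pred
            return m_new, P_new

        m, P = 0.0, 1.0
        for y in [0.8, None, 1.1, 0.9]:        # None marks a missing ozone reading
            m, P = kalman_step(m, P, y)
            print(round(m, 3), round(P, 3))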

    Sequential pattern mining with uncertain data

    In recent years, a number of emerging applications, such as sensor monitoring systems, RFID networks and location-based services, have led to the proliferation of uncertain data. However, traditional data mining algorithms are usually inapplicable to uncertain data because of its probabilistic nature. Uncertainty has to be handled carefully; otherwise, it might significantly degrade the quality of the underlying data mining applications. We therefore extend traditional data mining algorithms into uncertain versions that still produce accurate results. In particular, we use sequential pattern mining as a motivating example to illustrate how to incorporate uncertain information in the mining process. We use possible world semantics to interpret two typical types of uncertainty: tuple-level existential uncertainty and attribute-level temporal uncertainty. In an uncertain database, whether a pattern is frequent is itself probabilistic; we therefore define the concept of probabilistic frequent sequential patterns, and design various algorithms to mine them efficiently in uncertain databases. We also implement our algorithms on distributed computing platforms, such as MapReduce and Spark, so that they can be applied to large-scale databases. Our work also includes uncertainty computation in supervised machine learning algorithms: we develop an artificial neural network to classify numeric uncertain data, and a naive Bayesian classifier for categorical uncertain data streams. We also propose a discretization algorithm to pre-process numerical uncertain data, since many classifiers work with categorical data only. Experimental results on both synthetic and real-world uncertain datasets demonstrate that our methods are effective and efficient.
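
    To make the possible-world notion concrete under tuple-level existential uncertainty: if a pattern occurs in transaction i with probability p_i, its support follows a Poisson-binomial distribution, and a simple dynamic programme gives P(support >= minsup). A pattern can then be called probabilistic frequent when this probability exceeds a chosen threshold tau; the threshold and numbers below are illustrative, not the paper's exact formulation.

        def prob_frequent(probs, minsup, tau=0.7):
            # dp[k] = probability that exactly k of the transactions seen so far
            # contain the pattern (Poisson-binomial dynamic programme).
            dp = [1.0]
            for p in probs:
                ndp = [0.0] * (len(dp) + 1)
                for k, q in enumerate(dp):
                    ndp[k] += q * (1 - p)      # worlds where this tuple is absent
                    ndp[k + 1] += q * p        # worlds where this tuple is present
                dp = ndp
            p_freq = sum(dp[minsup:])          # P(support >= minsup)
            return p_freq, p_freq >= tau

        print(prob_frequent([0.9, 0.5, 0.8, 0.4], minsup=2))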

    A flexible predictive density combination for large financial data sets in regular and crisis periods

    A flexible predictive density combination is introduced for large financial data sets, which allows for model set incompleteness. Dimension reduction procedures, including learned allocation, map the large sets of predictive densities and combination weights to relatively small subsets. Given the representation of the probability model in extended nonlinear state-space form, efficient simulation-based Bayesian inference is proposed using parallel dynamic clustering as well as nonlinear filtering implemented on graphics processing units. The approach is applied to combine predictive densities based on a large number of individual US stock returns, observed daily over a period that includes the Covid-19 crisis. Evidence on dynamic cluster composition, weight patterns and model set incompleteness provides valuable signals for improved modelling, enabling higher predictive accuracy and better assessment of uncertainty and risk for investment fund management.
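
    Only the core pooling step, heavily simplified: combine Gaussian predictive densities from several models with weights that adapt to each model's recent predictive likelihood. The paper embeds this idea in an extended nonlinear state-space model with dynamic clustering and GPU-based nonlinear filtering, none of which is sketched here; the discount factor and toy numbers are assumptions.

        import numpy as np
        from scipy.stats import norm

        def update_weights(w, pred_means, pred_sds, y, discount=0.95):
            like = norm.pdf(y, loc=pred_means, scale=pred_sds)  # per-model likelihood of y
            w = (w ** discount) * like                          # discounted Bayesian-style update
            return w / w.sum()

        w = np.ones(3) / 3                                      # three hypothetical models
        means = np.array([0.0, 0.1, -0.2])
        sds = np.array([1.0, 0.8, 1.2])
        for y in [0.05, 0.12, -0.40]:                           # toy daily return observations
            w = update_weights(w, means, sds, y)
        print(w)                                                # weights tilt toward better models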

    Tree-based Density Estimation: Algorithms and Applications

    Data mining can be seen as an extension of statistics. It comprises the preparation of data and the process of gathering new knowledge from it, supported by various machine learning methods. Many of these algorithms are based on probabilistic principles or use density estimation in their computations. Density estimation has been practised in the field of statistics for several centuries. In the simplest case, a histogram estimator, like the simple equal-width histogram, can be used for this task and has proven a practical tool for representing the distribution of data both visually and for computation. Like other nonparametric approaches, it can provide a flexible solution. However, flexibility in existing approaches is generally restricted because the bin size is fixed: either the width of the bins or the number of values in them. Attempts have been made to generate histograms with a variable bin width and a variable number of values per interval, but the computational approaches in these methods have proven too difficult and too slow even with modern computer technology. In this thesis, new flexible histogram estimation methods are developed and tested as part of various machine learning tasks, namely discretization, naive Bayes classification, clustering and multiple-instance learning. The new density estimation methods are not only applied to machine learning tasks; they also borrow design principles from algorithms that are ubiquitous in artificial intelligence: divide-and-conquer methods are a well-known way to tackle large problems by dividing them into small subproblems, an approach applied successfully by decision trees in classification. This thesis presents algorithms that build density estimators using a binary split tree to cut a range of values into subranges of varying length. No class values are required for this splitting process, making it an unsupervised method. The result is a histogram estimator that adapts well even to complex density functions: a novel density estimation method with flexible estimation ability and good computational behaviour. Algorithms are presented for both univariate and multivariate data. The univariate histogram estimator is applied to discretization for density estimation and is also used as the density estimator inside a naive Bayes classifier. The multivariate histogram, used as the basis for a clustering method, is applied to improve the runtime behaviour of a well-known algorithm for multiple-instance classification. Performance in these applications is evaluated by comparing the new approaches with existing methods.
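
    In the spirit of the variable-width estimators described above, a minimal unsupervised binary-split-tree histogram: recursively split a range at the median of the points inside it until a bin holds few points, then estimate density as count / (n_total * bin_width). The median splitting rule and the stopping threshold are simplifications, not the thesis's algorithm.

        def split_tree_bins(xs, lo, hi, min_pts=4):
            # Recursively cut [lo, hi) into variable-width bins; no class
            # labels are used, so the procedure is unsupervised.
            inside = sorted(x for x in xs if lo <= x < hi)
            if len(inside) <= min_pts:
                return [(lo, hi, len(inside))]
            mid = inside[len(inside) // 2]
            if mid <= lo or mid >= hi:          # degenerate split: stop here
                return [(lo, hi, len(inside))]
            return (split_tree_bins(inside, lo, mid, min_pts)
                    + split_tree_bins(inside, mid, hi, min_pts))

        data = [0.1, 0.12, 0.15, 0.2, 0.22, 0.8, 0.85, 2.4]
        bins = split_tree_bins(data, 0.0, 3.0)
        n = len(data)
        for lo, hi, c in bins:
            print(f"[{lo:.2f}, {hi:.2f}) density={c / (n * (hi - lo)):.3f}")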