2,118 research outputs found

    Highly comparative feature-based time-series classification

    Full text link
    A highly comparative, feature-based approach to time series classification is introduced that uses an extensive database of algorithms to extract thousands of interpretable features from time series. These features are derived from across the scientific time-series analysis literature, and include summaries of time series in terms of their correlation structure, distribution, entropy, stationarity, scaling properties, and fits to a range of time-series models. After computing thousands of features for each time series in a training set, those that are most informative of the class structure are selected using greedy forward feature selection with a linear classifier. The resulting feature-based classifiers automatically learn the differences between classes using a reduced number of time-series properties, and circumvent the need to calculate distances between time series. Representing time series in this way results in orders of magnitude of dimensionality reduction, allowing the method to perform well on very large datasets containing long time series or time series of different lengths. For many of the datasets studied, classification performance exceeded that of conventional instance-based classifiers, including one nearest neighbor classifiers using Euclidean distances and dynamic time warping and, most importantly, the features selected provide an understanding of the properties of the dataset, insight that can guide further scientific investigation

    How is a data-driven approach better than random choice in label space division for multi-label classification?

    Full text link
    We propose using five data-driven community detection approaches from social networks to partition the label space for the task of multi-label classification as an alternative to random partitioning into equal subsets as performed by RAkELd: modularity-maximizing fastgreedy and leading eigenvector, infomap, walktrap and label propagation algorithms. We construct a label co-occurence graph (both weighted an unweighted versions) based on training data and perform community detection to partition the label set. We include Binary Relevance and Label Powerset classification methods for comparison. We use gini-index based Decision Trees as the base classifier. We compare educated approaches to label space divisions against random baselines on 12 benchmark data sets over five evaluation measures. We show that in almost all cases seven educated guess approaches are more likely to outperform RAkELd than otherwise in all measures, but Hamming Loss. We show that fastgreedy and walktrap community detection methods on weighted label co-occurence graphs are 85-92% more likely to yield better F1 scores than random partitioning. Infomap on the unweighted label co-occurence graphs is on average 90% of the times better than random paritioning in terms of Subset Accuracy and 89% when it comes to Jaccard similarity. Weighted fastgreedy is better on average than RAkELd when it comes to Hamming Loss

    A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm

    Full text link
    K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods.Comment: 17 pages, 1 figure, 7 table
    • …
    corecore