501 research outputs found
Discovering robust dependencies from data
Science revolves around forming hypotheses, designing experiments, collecting data, and tests. It was not until recently, with the advent of modern hardware and data analytics, that science shifted towards a big-data-driven paradigm that led to an unprecedented success across various fields. What is perhaps the most astounding feature of this new era, is that interesting hypotheses can now be automatically discovered from observational data. This dissertation investigates knowledge discovery procedures that do exactly this. In particular, we seek algorithms that discover the most informative models able to compactly âdescribeâ aspects of the phenomena under investigation, in both supervised and unsupervised settings. We consider interpretable models in the form of subsets of the original variable set. We want the models to capture all possible interactions, e.g., linear, non-linear, between all types of variables, e.g., discrete, continuous, and lastly, we want their quality to be meaningfully assessed. For this, we employ information-theoretic
measures, and particularly, the fraction of information for the supervised setting, and the normalized total correlation for the unsupervised. The former measures the uncertainty reduction of the target variable conditioned on a model, and the latter measures the information overlap of the variables included in a model.
Without access to the true underlying data generating process, we estimate the aforementioned measures from observational data. This process is prone to statistical errors, and in our case, the errors manifest as biases towards larger models. This can lead to situations where the results are utterly random, hindering
therefore further analysis. We correct this behavior with notions from statistical learning theory. In particular, we propose regularized estimators that are unbiased under the hypothesis of independence, leading to robust estimation from limited data samples and arbitrary dimensionalities. Moreover, we do this for models
consisting of both discrete and continuous variables. Lastly, to discover the top scoring models, we derive effective optimization algorithms for exact, approximate, and heuristic search. These algorithms are
powered by admissible, tight, and efficient-to-compute bounding functions for our proposed estimators that can be used to greatly prune the search space. Overall, the products of this dissertation can successfully assist data analysts with data exploration, discovering powerful description models, or concluding that
no satisfactory models exist, implying therefore new experiments and data are required for the phenomena under investigation. This statement is supported by Materials Science researchers who corroborated our discoveries.In der Wissenschaft geht es um Hypothesenbildung, Entwerfen von Experimenten, Sammeln von Daten und Tests. JuÌngst hat sich die Wissenschaft, durch das Aufkommen moderner Hardware und Datenanalyse, zu einem Big-Data-basierten Paradigma hin entwickelt, das zu einem beispiellosen Erfolg in verschiedenen Bereichen gefuÌhrt hat. Ein erstaunliches Merkmal dieser neuen ra ist, dass interessante Hypothesen jetzt automatisch aus Beobachtungsdaten entdeckt werden kânnen. In dieser Dissertation werden Verfahren zur Wissensentdeckung untersucht, die genau dies tun. Insbesondere suchen wir nach Algorithmen, die Modelle identifizieren, die in der Lage sind, Aspekte der untersuchten Phânomene sowohl in beaufsichtigten als auch in unbeaufsichtigten Szenarien kompakt zu âbeschreibenâ. Hierzu betrachten wir interpretierbare Modelle in Form von Untermengen der urspruÌnglichen Variablenmenge. Ziel ist es, dass diese Modelle alle mâglichen Interaktionen erfassen (z.B. linear, nicht-lineare), zwischen allen Arten von Variablen unterscheiden (z.B. diskrete, kontinuierliche) und dass schlussendlich ihre Qualitât sinnvoll bewertet wird. Dazu setzen wir informationstheoretische Maâe ein, insbesondere den Informationsanteil fuÌr das uÌberwachte und die normalisierte Gesamtkorrelation fuÌr das unuÌberwachte Szenario. Ersteres misst die Unsicherheitsreduktion der Zielvariablen, die durch ein Modell bedingt ist, und letztere misst die InformationsuÌberlappung der enthaltenen Variablen. Ohne Kontrolle des Datengenerierungsprozesses werden die oben genannten Maâe aus Beobachtungsdaten geschâtzt. Dies ist anfâllig fuÌr statistische Fehler, die zu Verzerrungen in grââeren Modellen fuÌhren. So entstehen Situationen, wobei die Ergebnisse vâllig zufâllig sind und somit weitere Analysen stâren. Wir korrigieren dieses Verhalten mit Methoden aus der statistischen Lerntheorie. Insbesondere schlagen wir regularisierte Schâtzer vor, die unter der Hypothese der Unabhângigkeit nicht verzerrt sind und somit zu einer robusten Schâtzung aus begrenzten Datenstichproben und willkuÌrlichen-Dimensionalitâten fuÌhren. DaruÌber hinaus wenden wir dies fuÌr Modelle an, die sowohl aus diskreten als auch aus kontinuierlichen Variablen bestehen. Um die besten Modelle zu entdecken, leiten wir effektive Optimierungsalgorithmen mit verschiedenen Garantien ab. Diese Algorithmen basieren auf speziellen Begrenzungsfunktionen der vorgeschlagenen Schâtzer und erlauben es den Suchraum stark einzuschrânken. Insgesamt sind die Produkte dieser Arbeit sehr effektiv fuÌr die Wissensentdeckung. Letztere Aussage
wurde von Materialwissenschaftlern bestâtigt
Optimizing Dynamic Time Warpingâs Window Width for Time Series Data Mining Applications
Dynamic Time Warping (DTW) is a highly competitive distance measure for most time series data mining problems. Obtaining the best performance from DTW requires setting its only parameter, the maximum amount of warping (w). In the supervised case with ample data, w is typically set by cross-validation in the training stage. However, this method is likely to yield suboptimal results for small training sets. For the unsupervised case, learning via cross-validation is not possible because we do not have access to labeled data. Many practitioners have thus resorted to assuming that âthe larger the betterâ, and they use the largest value of w permitted by the computational resources. However, as we will show, in most circumstances, this is a naĂŻve approach that produces inferior clusterings. Moreover, the best warping window width is generally non-transferable between the two tasks, i.e., for a single dataset, practitioners cannot simply apply the best w learned for classification on clustering or vice versa. In addition, we will demonstrate that the appropriate amount of warping not only depends on the data structure, but also on the dataset size. Thus, even if a practitioner knows the best setting for a given dataset, they will likely be at a lost if they apply that setting on a bigger size version of that data. All these issues seem largely unknown or at least unappreciated in the community. In this work, we demonstrate the importance of setting DTWâs warping window width correctly, and we also propose novel methods to learn this parameter in both supervised and unsupervised settings. The algorithms we propose to learn w can produce significant improvements in classification accuracy and clustering quality. We demonstrate the correctness of our novel observations and the utility of our ideas by testing them with more than one hundred publicly available datasets. Our forceful results allow us to make a perhaps unexpected claim; an underappreciated âlow hanging fruitâ in optimizing DTWâs performance can produce improvements that make it an even stronger baseline, closing most or all the improvement gap of the more sophisticated methods proposed in recent years
Intelligent automatic operational modal analysis
Operational modal analysis methods have been proven especially useful to identify existing structures and infrastructures under serviceability conditions. However, the installation of sensing systems for monitoring continuously an ever larger number of existing constructions has motivated significant efforts towards the automation of the available methods.
Within this framework, the present paper introduces a new paradigm for the automatic output-only modal identification of linear structures under ambient vibrations, namely the intelligent automatic operational modal analysis (i-AOMA). It exploits the covariance-based stochastic subspace (SSI-cov) algorithm for the output-only identification of the modal parameters and its workflow consists of two main phases. Initially, quasi-random samples of the control parameters for the SSI-cov algorithm are generated. Once the SSI-cov algorithm is performed for each sample, the corresponding stabilization diagrams are processed in order to prepare a database for training the intelligent core of the i-AOMA method. This is a machine learning technique (namely a random forest algorithm) that predicts which combination of the control parameters for the SSI-cov algorithm is able to provide good modal estimates. Afterward, new quasi-random samples of the control parameters for the SSI-cov algorithm are generated repeatedly until a statistical convergence criterion is achieved. If the generic sample is classified as feasible by the intelligent core of the i-AOMA method, then the SSI-cov algorithm is performed. Finally, stable modal results are distilled from the stabilization diagrams and relevant statistics are computed to evaluate the uncertainty level due to the variability of the control parameters.
The proposed i-AOMA method has been applied to identify the modal features of the Al-Hamra Firduos Tower, an iconic 412.6 m tall building located in Kuwait City (Kuwait). The final results well agree with a previous experimental study, and it was also possible to identify two new vibration modes of the structure. The implemented open-source Python code is made freely available
Fuzzy clustering with spatial-temporal information
Clustering geographical units based on a set of quantitative features observed at several time occasions requires to deal with the complexity of both space and time information. In particular, one should consider (1) the spatial nature of the units to be clustered, (2) the characteristics of the space of multivariate time trajectories, and (3) the uncertainty related to the assignment of a geographical unit to a given cluster on the basis of the above com- plex features. This paper discusses a novel spatially constrained multivariate time series clustering for units characterised by different levels of spatial proximity. In particular, the Fuzzy Partitioning Around Medoids algorithm with Dynamic Time Warping dissimilarity measure and spatial penalization terms is applied to classify multivariate Spatial-Temporal series. The clustering method has been theoretically presented and discussed using both simulated and real data, highlighting its main features. In particular, the capability of embedding different levels of proximity among units, and the ability of considering time series with different length
Recommended from our members
Fast, Scalable, and Accurate Algorithms for Time-Series Analysis
Time is a critical element for the understanding of natural processes (e.g., earthquakes and weather) or human-made artifacts (e.g., stock market and speech signals). The analysis of time series, the result of sequentially collecting observations of such processes and artifacts, is becoming increasingly prevalent across scientific and industrial applications. The extraction of non-trivial features (e.g., patterns, correlations, and trends) in time series is a critical step for devising effective time-series mining methods for real-world problems and the subject of active research for decades. In this dissertation, we address this fundamental problem by studying and presenting computational methods for efficient unsupervised learning of robust feature representations from time series. Our objective is to (i) simplify and unify the design of scalable and accurate time-series mining algorithms; and (ii) provide a set of readily available tools for effective time-series analysis. We focus on applications operating solely over time-series collections and on applications where the analysis of time series complements the analysis of other types of data, such as text and graphs.
For applications operating solely over time-series collections, we propose a generic computational framework, GRAIL, to learn low-dimensional representations that natively preserve the invariances offered by a given time-series comparison method. GRAIL represents a departure from classic approaches in the time-series literature where representation methods are agnostic to the similarity function used in subsequent learning processes. GRAIL relies on the attractive idea that once we construct the data-to-data similarity matrix most time-series mining tasks can be trivially solved. To overcome scalability issues associated with approaches relying on such matrices, GRAIL exploits time-series clustering to construct a small set of landmark time series and learns representations to reduce the data-to-data matrix to a data-to-landmark points matrix. To demonstrate the effectiveness of GRAIL, we first present domain-independent, highly accurate, and scalable time-series clustering methods to facilitate exploration and summarization of time-series collections. Then, we show that GRAIL representations, when combined with suitable methods, significantly outperform, in terms of efficiency and accuracy, state-of-the-art methods in major time-series mining tasks, such as querying, clustering, classification, sampling, and visualization. Overall, GRAIL rises as a new primitive for highly accurate, yet scalable, time-series analysis.
For applications where the analysis of time series complements the analysis of other types of data, such as text and graphs, we propose generic, simple, and lightweight methodologies to learn features from time-varying measurements. Such applications often organize operations over different types of data in a pipeline such that one operation provides input---in the form of feature vectors---to subsequent operations. To reason about the temporal patterns and trends in the underlying features, we need to (i) track the evolution of features over different time periods; and (ii) transform these time-varying features into actionable knowledge (e.g., forecasting an outcome). To address this challenging problem, we propose principled approaches to model time-varying features and study two large-scale, real-world, applications. Specifically, we first study the problem of predicting the impact of scientific concepts through temporal analysis of characteristics extracted from the metadata and full text of scientific articles. Then, we explore the promise of harnessing temporal patterns in behavioral signals extracted from web search engine logs for early detection of devastating diseases. In both applications, combinations of features with time-series relevant features yielded the greatest impact than any other indicator considered in our analysis. We believe that our simple methodology, along with the interesting domain-specific findings that our work revealed, will motivate new studies across different scientific and industrial settings
Proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering
These are the online proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE), which was held in the Trippenhuis, Amsterdam, in August 2012
Metalearning
This open access book as one of the fastest-growing areas of research in machine learning, metalearning studies principled methods to obtain efficient models and solutions by adapting machine learning and data mining processes. This adaptation usually exploits information from past experience on other tasks and the adaptive processes can involve machine learning approaches. As a related area to metalearning and a hot topic currently, automated machine learning (AutoML) is concerned with automating the machine learning processes. Metalearning and AutoML can help AI learn to control the application of different learning methods and acquire new solutions faster without unnecessary interventions from the user. This book offers a comprehensive and thorough introduction to almost all aspects of metalearning and AutoML, covering the basic concepts and architecture, evaluation, datasets, hyperparameter optimization, ensembles and workflows, and also how this knowledge can be used to select, combine, compose, adapt and configure both algorithms and models to yield faster and better solutions to data mining and data science problems. It can thus help developers to develop systems that can improve themselves through experience. This book is a substantial update of the first edition published in 2009. It includes 18 chapters, more than twice as much as the previous version. This enabled the authors to cover the most relevant topics in more depth and incorporate the overview of recent research in the respective area. The book will be of interest to researchers and graduate students in the areas of machine learning, data mining, data science and artificial intelligence. ; Metalearning is the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting machine learning and data mining processes. While the variety of machine learning and data mining techniques now available can, in principle, provide good model solutions, a methodology is still needed to guide the search for the most appropriate model in an efficient way. Metalearning provides one such methodology that allows systems to become more effective through experience. This book discusses several approaches to obtaining knowledge concerning the performance of machine learning and data mining algorithms. It shows how this knowledge can be reused to select, combine, compose and adapt both algorithms and models to yield faster, more effective solutions to data mining problems. It can thus help developers improve their algorithms and also develop learning systems that can improve themselves. The book will be of interest to researchers and graduate students in the areas of machine learning, data mining and artificial intelligence
- âŠ