2,345 research outputs found

    Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms

    Get PDF
    The reliable fraction of information is an attractive score for quantifying (functional) dependencies in high-dimensional data. In this paper, we systematically explore the algorithmic implications of using this measure for optimization. We show that the problem is NP-hard, which justifies the usage of worst-case exponential-time as well as heuristic search methods. We then substantially improve the practical performance for both optimization styles by deriving a novel admissible bounding function that has an unbounded potential for additional pruning over the previously proposed one. Finally, we empirically investigate the approximation ratio of the greedy algorithm and show that it produces highly competitive results in a fraction of time needed for complete branch-and-bound style search.Comment: Accepted to Proceedings of the IEEE International Conference on Data Mining (ICDM'18

    Discovering robust dependencies from data

    Get PDF
    Science revolves around forming hypotheses, designing experiments, collecting data, and tests. It was not until recently, with the advent of modern hardware and data analytics, that science shifted towards a big-data-driven paradigm that led to an unprecedented success across various fields. What is perhaps the most astounding feature of this new era, is that interesting hypotheses can now be automatically discovered from observational data. This dissertation investigates knowledge discovery procedures that do exactly this. In particular, we seek algorithms that discover the most informative models able to compactly “describe” aspects of the phenomena under investigation, in both supervised and unsupervised settings. We consider interpretable models in the form of subsets of the original variable set. We want the models to capture all possible interactions, e.g., linear, non-linear, between all types of variables, e.g., discrete, continuous, and lastly, we want their quality to be meaningfully assessed. For this, we employ information-theoretic measures, and particularly, the fraction of information for the supervised setting, and the normalized total correlation for the unsupervised. The former measures the uncertainty reduction of the target variable conditioned on a model, and the latter measures the information overlap of the variables included in a model. Without access to the true underlying data generating process, we estimate the aforementioned measures from observational data. This process is prone to statistical errors, and in our case, the errors manifest as biases towards larger models. This can lead to situations where the results are utterly random, hindering therefore further analysis. We correct this behavior with notions from statistical learning theory. In particular, we propose regularized estimators that are unbiased under the hypothesis of independence, leading to robust estimation from limited data samples and arbitrary dimensionalities. Moreover, we do this for models consisting of both discrete and continuous variables. Lastly, to discover the top scoring models, we derive effective optimization algorithms for exact, approximate, and heuristic search. These algorithms are powered by admissible, tight, and efficient-to-compute bounding functions for our proposed estimators that can be used to greatly prune the search space. Overall, the products of this dissertation can successfully assist data analysts with data exploration, discovering powerful description models, or concluding that no satisfactory models exist, implying therefore new experiments and data are required for the phenomena under investigation. This statement is supported by Materials Science researchers who corroborated our discoveries.In der Wissenschaft geht es um Hypothesenbildung, Entwerfen von Experimenten, Sammeln von Daten und Tests. Jüngst hat sich die Wissenschaft, durch das Aufkommen moderner Hardware und Datenanalyse, zu einem Big-Data-basierten Paradigma hin entwickelt, das zu einem beispiellosen Erfolg in verschiedenen Bereichen geführt hat. Ein erstaunliches Merkmal dieser neuen ra ist, dass interessante Hypothesen jetzt automatisch aus Beobachtungsdaten entdeckt werden k nnen. In dieser Dissertation werden Verfahren zur Wissensentdeckung untersucht, die genau dies tun. Insbesondere suchen wir nach Algorithmen, die Modelle identifizieren, die in der Lage sind, Aspekte der untersuchten Ph nomene sowohl in beaufsichtigten als auch in unbeaufsichtigten Szenarien kompakt zu “beschreiben”. Hierzu betrachten wir interpretierbare Modelle in Form von Untermengen der ursprünglichen Variablenmenge. Ziel ist es, dass diese Modelle alle m glichen Interaktionen erfassen (z.B. linear, nicht-lineare), zwischen allen Arten von Variablen unterscheiden (z.B. diskrete, kontinuierliche) und dass schlussendlich ihre Qualit t sinnvoll bewertet wird. Dazu setzen wir informationstheoretische Ma e ein, insbesondere den Informationsanteil für das überwachte und die normalisierte Gesamtkorrelation für das unüberwachte Szenario. Ersteres misst die Unsicherheitsreduktion der Zielvariablen, die durch ein Modell bedingt ist, und letztere misst die Informationsüberlappung der enthaltenen Variablen. Ohne Kontrolle des Datengenerierungsprozesses werden die oben genannten Ma e aus Beobachtungsdaten gesch tzt. Dies ist anf llig für statistische Fehler, die zu Verzerrungen in gr  eren Modellen führen. So entstehen Situationen, wobei die Ergebnisse v llig zuf llig sind und somit weitere Analysen st ren. Wir korrigieren dieses Verhalten mit Methoden aus der statistischen Lerntheorie. Insbesondere schlagen wir regularisierte Sch tzer vor, die unter der Hypothese der Unabh ngigkeit nicht verzerrt sind und somit zu einer robusten Sch tzung aus begrenzten Datenstichproben und willkürlichen-Dimensionalit ten führen. Darüber hinaus wenden wir dies für Modelle an, die sowohl aus diskreten als auch aus kontinuierlichen Variablen bestehen. Um die besten Modelle zu entdecken, leiten wir effektive Optimierungsalgorithmen mit verschiedenen Garantien ab. Diese Algorithmen basieren auf speziellen Begrenzungsfunktionen der vorgeschlagenen Sch tzer und erlauben es den Suchraum stark einzuschr nken. Insgesamt sind die Produkte dieser Arbeit sehr effektiv für die Wissensentdeckung. Letztere Aussage wurde von Materialwissenschaftlern best tigt

    Search based software engineering: Trends, techniques and applications

    Get PDF
    © ACM, 2012. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is available from the link below.In the past five years there has been a dramatic increase in work on Search-Based Software Engineering (SBSE), an approach to Software Engineering (SE) in which Search-Based Optimization (SBO) algorithms are used to address problems in SE. SBSE has been applied to problems throughout the SE lifecycle, from requirements and project planning to maintenance and reengineering. The approach is attractive because it offers a suite of adaptive automated and semiautomated solutions in situations typified by large complex problem spaces with multiple competing and conflicting objectives. This article provides a review and classification of literature on SBSE. The work identifies research trends and relationships between the techniques applied and the applications to which they have been applied and highlights gaps in the literature and avenues for further research.EPSRC and E

    Learning Task Relatedness in Multi-Task Learning for Images in Context

    Full text link
    Multimedia applications often require concurrent solutions to multiple tasks. These tasks hold clues to each-others solutions, however as these relations can be complex this remains a rarely utilized property. When task relations are explicitly defined based on domain knowledge multi-task learning (MTL) offers such concurrent solutions, while exploiting relatedness between multiple tasks performed over the same dataset. In most cases however, this relatedness is not explicitly defined and the domain expert knowledge that defines it is not available. To address this issue, we introduce Selective Sharing, a method that learns the inter-task relatedness from secondary latent features while the model trains. Using this insight, we can automatically group tasks and allow them to share knowledge in a mutually beneficial way. We support our method with experiments on 5 datasets in classification, regression, and ranking tasks and compare to strong baselines and state-of-the-art approaches showing a consistent improvement in terms of accuracy and parameter counts. In addition, we perform an activation region analysis showing how Selective Sharing affects the learned representation.Comment: To appear in ICMR 2019 (Oral + Lightning Talk + Poster

    A Semi-Supervised Feature Engineering Method for Effective Outlier Detection in Mixed Attribute Data Sets

    Get PDF
    Outlier detection is one of the crucial tasks in data mining which can lead to the finding of valuable and meaningful information within the data. An outlier is a data point that is notably dissimilar from other data points in the data set. As such, the methods for outlier detection play an important role in identifying and removing the outliers, thereby increasing the performance and accuracy of the prediction systems. Outlier detection is used in many areas like financial fraud detection, disease prediction, and network intrusion detection. Traditional outlier detection methods are founded on the use of different distance measures to estimate the similarity between the points and are confined to data sets that are purely continuous or categorical. These methods, though effective, lack in elucidating the relationship between outliers and known clusters/classes in the data set. We refer to this relationship as the context for any reported outlier. Alternate outlier detection methods establish the context of a reported outlier using underlying contextual beliefs of the data. Contextual beliefs are the established relationships between the attributes of the data set. Various studies have been recently conducted where they explore the contextual beliefs to determine outlier behavior. However, these methods do not scale in the situations where the data points and their respective contexts are sparse. Thus, the outliers reported by these methods tend to lose meaning. Another limitation of these methods is that they assume all features are equally important and do not consider nor determine subspaces among the features for identifying the outliers. Furthermore, determining subspaces is computationally exacerbated, as the number of possible subspaces increases with increasing dimensionality. This makes searching through all the possible subspaces impractical. In this thesis, we propose a Hybrid Bayesian Network approach to capture the underlying contextual beliefs to detect meaningful outliers in mixed attribute data sets. Hybrid Bayesian Networks utilize their probability distributions to encode the information of the data and outliers are those points which violate this information. To deal with the sparse contexts, we use an angle-based similarity method which is then combined with the joint probability distributions of the Hybrid Bayesian Network in a robust manner. With regards to the subspace selection, we employ a feature engineering method that consists of two-stage feature selection using Maximal Information Coefficient and Markov blankets of Hybrid Bayesian Networks to select highly correlated feature subspaces. This proposed method was tested on a real world medical record data set. The results indicate that the algorithm was able to identify meaningful outliers successfully. Moreover, we compare the performance of our algorithm with the existing baseline outlier detection algorithms. We also present a detailed analysis of the reported outliers using our method and demonstrate its efficiency when handling data points with sparse contexts

    Discovering Dependencies with Reliable Mutual Information

    Get PDF
    corecore