
    Mass data exploration in oncology: An information synthesis approach

    New technologies and equipment allow for mass treatment of samples, and research teams share acquired data on an ever-larger scale. In this context, scientists face a major data exploitation problem. More precisely, using these data sets through data mining tools, or introducing them into a classical experimental approach, requires a preliminary understanding of the information space in order to direct the process. But acquiring this grasp on the data is a complex activity, which is seldom supported by current software tools. The goal of this paper is to introduce a solution to this scientific data grasp problem. Illustrated in the Tissue MicroArrays application domain, the proposal is based on the notion of synthesis, which is inspired by Information Retrieval paradigms. The envisioned synthesis model gives a central role, through the notion of task, to the study the researcher wants to conduct. It allows for the implementation of a task-oriented Information Retrieval prototype system. Case studies and user studies were used to validate this prototype, which opens interesting prospects for extending the model and applying it to other application domains.

    On the design of R-based scalable frameworks for data science applications

    This thesis comprises three papers on the design of R-based scalable frameworks for data science applications. We discuss the design of conceptual and computational frameworks for the R language for statistical computing and graphics, and build software artifacts for two typical data science use cases: optimization problem solving and large-scale text analysis. Each part follows a design science approach. We use a verification method for the software frameworks introduced, i.e., prototypical instantiations of the designed artifacts are evaluated on the basis of real-world applications in mixed integer optimization (consensus journal ranking) and text mining (culturomics).

    The first paper introduces an extensible, object-oriented R Optimization Infrastructure (ROI). Methods from the field of optimization play an important role in many techniques routinely used in statistics, machine learning, and data science. Often, implementations of these methods rely on highly specialized optimization algorithms, designed to be applicable only within a specific application. However, in many instances, recent advances, in particular in the field of convex optimization, make it possible to conveniently and straightforwardly use modern solvers instead, with the advantage of enabling broader usage scenarios and thus promoting reusability. With ROI one can formulate and solve optimization problems in a consistent way. It is capable of modeling linear, quadratic, conic, and general nonlinear optimization problems. Furthermore, the paper discusses how extension packages can add optimization solvers, read/write functions, and additional resources such as model collections. Selected examples from the field of statistics conclude the paper.

    With the second paper we aim to answer two questions. First, it addresses the issue of how to construct suitable aggregates of individual journal rankings, using an optimization-based consensus ranking approach. Second, the presented application serves as an evaluation of the ROI prototype. Regarding the first research question, we apply the proposed method to a subset of marketing-related journals from a list of collected journal rankings. Next, the paper studies the stability of the derived consensus solution and the degeneration effects that occur when excluding journals and/or rankings. Finally, we investigate the similarities and dissimilarities of the consensus with a naive meta-ranking and with individual rankings. The results show that, even though journals are not uniformly ranked, one may derive a consensus ranking with considerably high agreement with the individual rankings.

    In the third paper we examine how the text mining package tm can be extended to handle large (text) corpora. This enables statisticians to answer many interesting research questions via statistical analysis or modeling of data sets that cannot easily be analyzed otherwise, e.g., due to software- or hardware-induced data size limitations. Adequate programming models like MapReduce facilitate the parallelization of text mining tasks and allow for processing large data sets by using a distributed file system, possibly spanning several machines, e.g., in a cluster of workstations. The paper presents a plug-in package to tm called tm.plugin.dc, implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large-scale text mining tasks. We evaluate the presented prototype on the basis of an application in culturomics and show that it can handle data sets of significant size efficiently.
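
    As a concrete illustration of the first paper's contribution, here is a minimal sketch of formulating and solving a small linear program through the ROI interface. It assumes the CRAN packages ROI and ROI.plugin.glpk are installed; the toy problem itself is invented for illustration:

        library(ROI)  # assumes ROI and the ROI.plugin.glpk solver plugin are installed

        ## A small linear program: maximize 3x + 7y
        ## subject to 3x + 4y <= 49 and x + 6y <= 48 (x, y >= 0 by default)
        lp <- OP(objective   = L_objective(c(3, 7)),
                 constraints = L_constraint(L   = rbind(c(3, 4),
                                                        c(1, 6)),
                                            dir = c("<=", "<="),
                                            rhs = c(49, 48)),
                 maximum = TRUE)

        ## Any registered solver plugin can be selected the same way
        sol <- ROI_solve(lp, solver = "glpk")
        solution(sol)            # optimal values of x and y
        solution(sol, "objval")  # optimal objective value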

    Relative-fuzzy: a novel approach for handling complex ambiguity for software engineering of data mining models

    There are two main defined classes of uncertainty, namely fuzziness and ambiguity, where ambiguity is a 'one-to-many' relationship between the syntax and semantics of a proposition. This definition appears to ignore the 'many-to-many' relationship type of ambiguity. In this thesis, we use the term complex uncertainty for the many-to-many relationship type of ambiguity. This research proposes a new approach for handling the complex ambiguity type of uncertainty that may exist in data, for the software engineering of predictive Data Mining (DM) classification models. The proposed approach is based on Relative-Fuzzy Logic (RFL), a novel type of fuzzy logic. RFL gives a new formulation of the problem of the ambiguity type of uncertainty in terms of States Of Proposition (SOP), and describes its membership (semantic) value using a new definition of the Domain of Proposition (DOP), which is based on the relativity principle as defined by possible-worlds logic.

    To propose RFL, one question needs to be answered: how can these two approaches, i.e., fuzzy logic and possible-worlds logic, be combined to produce a new set of membership values (and later a logic) that is able to handle fuzziness and multiple viewpoints at the same time? This is achieved by giving possible-worlds logic the ability to quantify multiple viewpoints, modeling fuzziness in each of these viewpoints, and expressing the result as a new set of membership values. Furthermore, a new architecture of Hierarchical Neural Network (HNN) called ML/RFL-Based Net has been developed in this research, along with a new learning algorithm and a new recall algorithm. The architecture, learning algorithm, and recall algorithm of ML/RFL-Based Net follow the principles of RFL; this new type of HNN is considered to be an RFL computation machine.

    The ability of the relative-fuzzy-based DM prediction model to tackle the problem of the complex ambiguity type of uncertainty has been tested. A special-purpose Integrated Development Environment (IDE), called RFL4ASR, which generates a DM prediction model for speech recognition, has also been developed in this research; this special-purpose IDE extends the definition of the traditional IDE. Using multiple sets of TIMIT speech data, the prediction model of type ML/RFL-Based Net achieves a classification accuracy of 69.2308%, which is higher than the best results of the WEKA data mining machines on the same speech data.
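
    RFL itself is the thesis's own formalism and, to our knowledge, has no public implementation to rely on. Purely as a toy illustration of the underlying idea, i.e., attaching a fuzzy membership value to each of several viewpoints (possible worlds), one might tabulate memberships per world in R; all names and values here are invented:

        ## Toy illustration only -- not the thesis's RFL machinery.
        ## One fuzzy membership value per (viewpoint, proposition) pair,
        ## where each viewpoint plays the role of a possible world.
        memberships <- matrix(c(0.9, 0.2,   # world1: mu(tall), mu(fast)
                                0.6, 0.7,   # world2
                                0.3, 0.8),  # world3
                              nrow = 3, byrow = TRUE,
                              dimnames = list(paste0("world", 1:3),
                                              c("tall", "fast")))

        memberships["world2", "tall"]  # membership relative to one viewpoint
        ## A naive aggregate across viewpoints (one arbitrary choice
        ## among many possible aggregation operators)
        colMeans(memberships)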

    QCBA: Postoptimization of Quantitative Attributes in Classifiers based on Association Rules

    The need to prediscretize numeric attributes before they can be used in association rule learning is a source of inefficiencies in the resulting classifier. This paper describes several new rule tuning steps aiming to recover information lost in the discretization of numeric (quantitative) attributes, and a new rule pruning strategy, which further reduces the size of the classification models. We demonstrate the effectiveness of the proposed methods on the postoptimization of models generated by three state-of-the-art association rule classification algorithms: Classification Based on Associations (Liu, 1998), Interpretable Decision Sets (Lakkaraju et al., 2016), and Scalable Bayesian Rule Lists (Yang, 2017). Benchmarks on 22 datasets from the UCI repository show that the postoptimized models are consistently smaller -- typically by about 50% -- and have better classification performance on most datasets.
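
    The QCBA postoptimization step itself is not available in standard R packages, but the kind of model it refines can be sketched. Below is a minimal example of training a CBA-style associative classifier, assuming the arulesCBA package; the internal discretization it performs is precisely the information loss QCBA's tuning steps aim to recover:

        library(arulesCBA)  # assumed installed; loads arules as well

        ## Train a Classification Based on Associations (CBA) model on iris;
        ## numeric attributes are discretized internally before rule mining
        classifier <- CBA(Species ~ ., data = iris,
                          support = 0.05, confidence = 0.9)

        inspect(rules(classifier))         # the mined classification rules
        pred <- predict(classifier, iris)  # class predictions
        mean(pred == iris$Species)         # resubstitution accuracy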

    A Model-Based Frequency Constraint for Mining Associations from Transaction Data

    Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of an association's significance, and a single user-specified support threshold is used to decide whether associations should be investigated further. Support has some known problems with rare items, favors shorter itemsets, and sometimes produces misleading associations. In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model), which accounts for the typically highly skewed item frequency distribution of transaction data. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on this constraint, we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier for the user to set and interpret.
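
    The NB-frequent mining procedure is not part of standard toolkits; for contrast, here is a minimal sketch of the conventional approach the paper improves on, i.e., mining under one global minimum support. It assumes the arules package and its bundled Groceries transaction data:

        library(arules)    # assumed installed
        data("Groceries")  # example transaction data shipped with arules

        ## One global, user-specified minimum support for all itemsets;
        ## rare but interesting associations below it are simply missed,
        ## which is the problem the model-based (NB) constraint addresses
        itemsets <- apriori(Groceries,
                            parameter = list(support = 0.01,
                                             target  = "frequent itemsets"))
        inspect(head(sort(itemsets, by = "support"), 5))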

    Dynamic load balancing for the distributed mining of molecular structures

    In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, for which no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.
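
    The actual implementation runs on a peer-to-peer framework, but the receiver-initiated policy the paper describes can be sketched as a toy, single-process simulation: an idle worker asks a randomly chosen peer to donate half of its unexplored search-tree partition. All names and numbers below are illustrative, not the authors' code:

        ## Toy simulation of receiver-initiated load balancing.
        ## Each worker holds a queue of pending search-tree nodes.
        set.seed(42)
        queues <- list(w1 = as.list(1:20), w2 = list(), w3 = list())

        steal <- function(queues, idle) {
          donors <- names(queues)[vapply(queues, length, 0L) > 1]
          if (length(donors) == 0) return(queues)  # no one can donate
          donor <- sample(donors, 1)               # ask a random peer
          half  <- length(queues[[donor]]) %/% 2   # donor gives away half
          queues[[idle]]  <- queues[[donor]][seq_len(half)]
          queues[[donor]] <- queues[[donor]][-seq_len(half)]
          queues
        }

        for (round in 1:5) {
          for (w in names(queues)) {
            if (length(queues[[w]]) == 0) {
              queues <- steal(queues, w)       # receiver initiates the transfer
            } else {
              queues[[w]] <- queues[[w]][-1]   # expand one search-tree node
            }
          }
          cat("round", round, "loads:", vapply(queues, length, 0L), "\n")
        }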