
    Secret sharing vs. encryption-based techniques for privacy preserving data mining

    Privacy preserving querying and data publishing have been studied in the context of statistical databases and statistical disclosure control. Recently, large-scale data collection and integration efforts have increased privacy concerns, motivating data mining researchers to investigate the privacy implications of data mining and how data mining can be performed without violating privacy. In this paper, we first provide an overview of privacy preserving data mining focusing on distributed data sources, then compare two technologies used in privacy preserving data mining. The first technology is encryption-based and was used in earlier approaches; the second is secret sharing, which has recently been considered a more efficient approach.
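
    As a rough illustration of the secret-sharing side of this comparison, the sketch below uses additive secret sharing to let three sites compute a global sum without revealing their private counts. The modulus and party setup are illustrative, not taken from the paper.

```python
import random

PRIME = 2**61 - 1  # field modulus; any sufficiently large prime works here

def share(value, n_parties):
    """Split `value` into n additive shares that sum to value mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three sites each hold a private count; no single share reveals anything.
private_counts = [42, 17, 99]
all_shares = [share(v, 3) for v in private_counts]

# Each party locally sums the one share it received from every input...
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]
# ...and only the combination of all partials reveals the global sum.
print(reconstruct(partial_sums))  # 158
```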

    Decision making on operational data: a remote approach to distributed data monitoring

    Information gathering and assimilation are normally performed by data mining tools and online analytical processing (OLAP) operating on historic data stored in a data warehouse. Data mining and OLAP queries are very complex, access a significant fraction of a database, and require significant time and resources to execute; it has therefore been impossible to obtain the benefits of data analysis in operational data environments. When it comes to analysis of operational (dynamic) data, running complex queries on frequently changing data is next to impossible. The complexity of active data integration increases dramatically in distributed applications, which are very common in automated or e-commerce settings. We suggest a remote data analysis approach that finds hidden patterns and relationships in distributed operational data without adversely affecting routine transaction processing. Distributed data integration on frequently updated data is performed by analysing the SQL commands coming to the distributed databases and aggregating data centrally to produce a real-time view of fast-changing data. This approach has been successfully evaluated on data sources for over 30 hotel properties. This paper presents the performance results of the method and a comparative study against state-of-the-art data integration techniques. The remote approach to data integration and analysis has been built into a scalable data monitoring system, demonstrating its ease of application and the performance of operational data integration.
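
    A minimal sketch of the core idea, assuming a hypothetical `bookings` table and a simplified INSERT syntax: intercepted SQL commands are folded into a central running aggregate instead of re-querying the operational databases.

```python
import re
from collections import defaultdict

# Central running aggregate per property; names are illustrative only.
occupancy = defaultdict(int)

INSERT_RE = re.compile(
    r"INSERT INTO bookings \(property_id, rooms\) VALUES \((\d+), (\d+)\)",
    re.IGNORECASE,
)

def observe(sql: str) -> None:
    """Fold an intercepted SQL command into the central view instead of
    running an expensive query against the operational database."""
    m = INSERT_RE.match(sql.strip())
    if m:
        prop, rooms = int(m.group(1)), int(m.group(2))
        occupancy[prop] += rooms

observe("INSERT INTO bookings (property_id, rooms) VALUES (7, 2)")
observe("INSERT INTO bookings (property_id, rooms) VALUES (7, 1)")
print(occupancy[7])  # 3 -- a real-time view without touching the source DB
```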

    Three-Way Joins on MapReduce: An Experimental Study

    We study three-way joins on MapReduce. Joins are very useful in a multitude of applications, from data integration and traversing social networks to mining graphs and automata-based constructions. However, joins are expensive even for moderate data sets, and we need efficient algorithms to compute joins in a distributed fashion on clusters of many machines. MapReduce has become an increasingly popular distributed computing system and programming paradigm. We consider a state-of-the-art MapReduce multi-way join algorithm by Afrati and Ullman and show when it is appropriate for use on very large data sets. Through a detailed experimental study, we demonstrate that this algorithm scales much better than suggested by the original paper. However, if the join result needs to be summarized or aggregated, as opposed to merely enumerated, then the aggregation step can be integrated into a cascade of two-way joins, which then outperforms the multi-way algorithm and becomes the preferred solution.
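
    A hedged sketch of the replication idea behind the Afrati-Ullman "shares" scheme for a chain join R(a,b) ⋈ S(b,c) ⋈ T(c,d): reducers form a grid indexed by hashes of the two join keys, the middle relation lands on exactly one reducer, and the outer relations are replicated along one axis. The grid shape and relation names here are illustrative, not from the paper.

```python
B_SHARES, C_SHARES = 3, 3  # reducer grid; product = total reducers

def h(value, buckets):
    return hash(value) % buckets

def map_tuple(relation, tup):
    """Assign a tuple to reducer grid coordinates. R(a,b) is replicated
    across the C axis, T(c,d) across the B axis, while S(b,c) knows both
    join keys and lands on exactly one reducer."""
    if relation == "R":          # R(a, b): b is the join key with S
        a, b = tup
        return [(h(b, B_SHARES), j) for j in range(C_SHARES)]
    if relation == "S":          # S(b, c): both keys known
        b, c = tup
        return [(h(b, B_SHARES), h(c, C_SHARES))]
    if relation == "T":          # T(c, d): c is the join key with S
        c, d = tup
        return [(i, h(c, C_SHARES)) for i in range(B_SHARES)]

print(map_tuple("R", (1, "x")))    # replicated to 3 reducers
print(map_tuple("S", ("x", "y")))  # exactly one reducer
```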

    Towards Interoperability in Genome Databases: The MAtDB (MIPS Arabidopsis Thaliana Database) Experience

    Increasing numbers of whole-genome sequences are available, but interpreting them fully requires more than listing all genes. Genome databases are faced with the challenges of integrating heterogeneous data and enabling data mining. Compared to a data warehousing approach, where integration is achieved through replication of all relevant data in a unified schema, distributed approaches provide greater flexibility and maintainability. These are important in a field where new data is generated rapidly and our understanding of the data changes. Interoperability between distributed data sources allows data maintenance to be separated from integration and analysis. Simple ways to access the data can facilitate the development of new data mining tools and the transition from model genome analysis to comparative genomics. With the MIPS Arabidopsis thaliana genome database (MAtDB, http://mips.gsf.de/proj/thal/db) our aim is to go beyond a data repository towards creating an integrated knowledge resource. To this end, the Arabidopsis genome has served as a backbone against which to structure and integrate heterogeneous data. The challenges to be met are continuous updating of data; the design of flexible data models that can evolve with new data; the integration of heterogeneous data, e.g. through the use of ontologies; comprehensive views and visualization of complex information; simple interfaces for application access locally or via the Internet; and knowledge transfer across species.

    Data Mining for Fog Prediction and Low Clouds Detection

    This paper describes our contribution to the research of parametrized models and methods for the detection and prediction of significant meteorological phenomena, especially fog and low cloud cover. The project covered methods for integration of the distributed meteorological data necessary for running the prediction models, training the models, and then mining the data in order to efficiently and quickly predict even sparsely occurring phenomena. The detection and prediction methods are based on knowledge discovery: data mining of meteorological data using neural networks and decision trees. The mined data were mainly METAR aerodrome messages, meteorological data from specialized stations, and cloud data from special airport sensors (laser ceilometers).
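
    A minimal sketch of the decision-tree side of such a pipeline, using scikit-learn on made-up METAR-like features (visibility, dew-point spread, wind speed). The feature set and values are invented for illustration; `class_weight="balanced"` hints at one common way to handle sparsely occurring phenomena like fog.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row: [visibility_km, temp_minus_dewpoint_C, wind_speed_kt]
X = [
    [0.2, 0.5, 2],    # fog
    [0.4, 0.3, 1],    # fog
    [8.0, 4.0, 10],   # no fog
    [10.0, 6.0, 12],  # no fog
]
y = [1, 1, 0, 0]  # 1 = fog observed

# Balanced class weights compensate for fog being a rare event in practice.
clf = DecisionTreeClassifier(max_depth=3, class_weight="balanced").fit(X, y)
print(clf.predict([[0.3, 0.4, 3]]))  # -> [1]
```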

    A Survey on Data Integration in Data Warehouse

    Data warehousing embraces the technology of integrating data from multiple distributed data sources and using that data in an annotated and aggregated form to support business decision-making and enterprise management. Although many techniques have been revisited or newly developed in the context of data warehouses, such as view maintenance and OLAP, little attention has been paid to data mining techniques for supporting the most important and costly tasks of data integration in data warehouse design.

    Integration of Data Mining into Scientific Data Analysis Processes

    In recent years, the use of advanced semi-interactive data analysis algorithms, such as those from the field of data mining, has gained more and more importance in the life sciences in general, and in particular in bioinformatics, genetics, medicine, and biodiversity. Today there is a trend away from collecting and evaluating data only in the context of a specific problem or study, towards extensively collecting data from different sources in repositories where it is potentially useful for subsequent analysis, e.g. in the Gene Expression Omnibus (GEO) repository of high-throughput gene expression data. At the time the data are collected, they are analysed in a specific context which influences the experimental design. However, the types of analyses that the data will be used for after they have been deposited are not known; content and data format are tailored only to the first experiment, not to future re-use. Thus, complex process chains are needed for the analysis of the data, and such process chains need to be supported by the environments used to set up analysis solutions. Building specialized software for each individual problem is not a solution, as this effort can only be justified for huge projects running for several years. Hence, data mining functionality has been packaged into toolkits that provide it as a collection of different components; depending on the research questions of the users, solutions consist of distinct compositions of these components. Today, existing solutions for data mining processes comprise different components that represent different steps in the analysis process, and there exist graphical or script-based toolkits for combining such components. However, the data mining tools that can serve as components in analysis processes are based on single-computer environments, local data sources, and single users, whereas analysis scenarios in medical informatics and bioinformatics have to deal with multi-computer environments, distributed data sources, and multiple users who have to cooperate. Users need support for integrating data mining into analysis processes in the context of such scenarios, and this support is lacking today. Typically, analysts working with single-computer environments face the problem of large data volumes, since the tools address neither scalability nor access to distributed data sources. Distributed environments such as grid environments provide scalability and access to distributed data sources, but integrating existing components into such environments is complex, and new components often cannot be developed directly in them. Moreover, in scenarios involving multiple computers, multiple distributed data sources, and multiple users, the reuse of components, scripts, and analysis processes becomes more important, as more steps and more configuration are necessary, and thus much greater effort is needed to develop and set up a solution. In this thesis we introduce an approach to supporting interactive and distributed data mining for multiple users, based on infrastructure principles that allow building on data mining components and processes that are already available instead of designing a completely new infrastructure, so that users can keep working with their well-known tools. In order to achieve the integration of data mining into scientific data analysis processes, this thesis proposes a stepwise approach to supporting the user in the development of analysis solutions that include data mining.
We see our major contributions as the following: first, we propose an approach to integrating data mining components developed for a single-processor environment into grid environments; by this we support users in reusing standard data mining components with little effort. The approach is based on a metadata schema definition which is used to grid-enable existing data mining components. Second, we describe an approach for interactively developing data mining scripts in grid environments; it efficiently supports users when it is necessary to enhance available components, to develop new data mining components, and to compose these components. Third, building on that, an approach for facilitating the reuse of existing data mining processes based on process patterns is presented; it supports users in scenarios that cover different steps of the data mining process involving several components or scripts. The data mining process patterns support the description of data mining processes at different levels of abstraction, ranging from the CRISP model as the most general to executable workflows as the most concrete representation.
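
    A hedged illustration of what such a metadata schema might look like: a small record describing a single-processor data mining component so that grid middleware could stage its inputs and invoke it remotely. All field and component names here are invented for this sketch, not taken from the thesis.

```python
from dataclasses import dataclass, field

@dataclass
class ComponentMetadata:
    """Hypothetical metadata record used to grid-enable a component."""
    name: str
    entry_point: str                                    # e.g. "module:function"
    input_formats: list = field(default_factory=list)   # accepted data formats
    output_format: str = "csv"
    parameters: dict = field(default_factory=dict)      # name -> default value

kmeans_meta = ComponentMetadata(
    name="kmeans",
    entry_point="cluster.kmeans:run",
    input_formats=["csv", "arff"],
    parameters={"k": 3, "max_iter": 100},
)

# A grid wrapper would read this record to stage inputs, bind parameters,
# and launch the unmodified component on a remote node.
print(kmeans_meta.parameters["k"])
```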

    DAME: A Distributed Web based Framework for Knowledge Discovery in Databases

    Massive data sets explored in many e-science communities, as in the astrophysics case, are gathered by a very large number of techniques and stored in very diverse and often incompatible data repositories. Moreover, we need to integrate services across distributed, heterogeneous, dynamic virtual organizations formed from the different resources within a single enterprise and/or from external resource sharing and service provider relationships. The DAME project aims at creating a distributed e-infrastructure to guarantee integrated and asynchronous access to data collected by very different experiments and scientific communities, in order to correlate them and improve their scientific usability. The project consists of a data mining framework with powerful software instruments capable of working on massive data sets, organized following Virtual Observatory standards, in a distributed computing environment. The integration process can be technically challenging because of the need to achieve a specific quality of service when running on top of different native platforms. In these terms, the result of the DAME project effort is a service-oriented architecture, using appropriate standards and incorporating Cloud/Grid paradigms and Web services, whose main target is the integration of interdisciplinary distributed systems within and across organizational domains.