5 research outputs found

    The Digital Persona and Trust Bank: A Privacy Management Framework

    Recently, the government of India embarked on an ambitious project to design and deploy the Integrated National Agricultural Resources Information System (INARIS) data warehouse for the agricultural sector. The system’s purpose is to support macro-level planning. This paper presents some of the challenges faced in designing the data warehouse, specifically its dimensional and deployment challenges. We also present some early user evaluations of the warehouse. Governmental data warehouse implementations are rare, especially at the national level. Furthermore, their motivations differ significantly from those in the private sector. Designing the INARIS agricultural data warehouse posed unique and significant challenges because, traditionally, the collection and dissemination of information are localized.

    Advanced Data Mining Techniques for Compound Objects

    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data collections. The most important step within the KDD process is data mining, which is concerned with the extraction of valid patterns. KDD is necessary to analyze the steadily growing amount of data caused by the enhanced performance of modern computer systems. However, as the amount of data grows, the complexity of data objects increases as well. Modern KDD methods should therefore examine objects more complex than simple feature vectors in order to solve real-world KDD applications adequately. Multi-instance and multi-represented objects are two important types of object representations for complex objects. Multi-instance objects consist of a set of object representations that all belong to the same feature space. Multi-represented objects are constructed as a tuple of feature representations, where each feature representation belongs to a different feature space. The contribution of this thesis is the development of new KDD methods for the classification and clustering of complex objects. To this end, the thesis introduces solutions for real-world applications that are based on multi-instance and multi-represented object representations. On the basis of these solutions, it is shown that a more general object representation often provides better results for many relevant KDD applications. The first part of the thesis is concerned with two KDD problems for which employing multi-instance objects provides efficient and effective solutions. The first is data mining in CAD parts, e.g. the use of hierarchical clustering for the automatic construction of product hierarchies. The introduced solution decomposes a single part into a set of feature vectors and compares these sets by using a metric on multi-instance objects. Furthermore, multi-step query processing using a novel filter step is employed, enabling the user to process similarity queries efficiently. On the basis of this similarity search system, it is possible to run several distance-based data mining algorithms, such as the hierarchical clustering algorithm OPTICS, to derive product hierarchies. The second important application is the classification of and search for complete websites in the World Wide Web (WWW). A website is a set of HTML documents that is published by the same person, group or organization and usually serves a common purpose. To perform data mining for websites, the thesis presents several methods to classify websites. After introducing naive methods that model websites as webpages, two more sophisticated approaches to website classification are introduced. The first approach uses a preprocessing step that maps single HTML documents within each website to so-called page classes. The second approach directly compares websites as sets of word vectors and uses nearest neighbor classification. To search the WWW for new, relevant websites, a focused crawler is introduced that efficiently retrieves relevant websites. This crawler minimizes the number of HTML documents and increases the accuracy of website retrieval. The second part of the thesis is concerned with data mining in multi-represented objects. An important example of this kind of complex object is proteins, which can be represented as a tuple of a protein sequence and a text annotation. To analyze multi-represented objects, a clustering method is introduced that is based on the density-based clustering algorithm DBSCAN. This method uses all provided representations to find a global clustering of the given data objects. However, in many applications a sophisticated class ontology already exists for the given data objects, e.g. proteins. To map new objects into an ontology, a new method for the hierarchical classification of multi-represented objects is described. The system employs the hierarchical structure of the ontology to efficiently classify new proteins, using support vector machines.
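
The idea of comparing multi-instance objects (sets of feature vectors from one feature space) and classifying them by nearest neighbor can be illustrated with a minimal sketch. The abstract does not name the metric it uses, so the Hausdorff distance below is only a familiar stand-in, and the function names `hausdorff` and `nearest_neighbor_label` are hypothetical.

```python
import math


def hausdorff(a, b):
    """Hausdorff distance between two non-empty sets of equal-length vectors.

    Assumption: this is a stand-in; the abstract does not say which
    metric on multi-instance objects the thesis actually uses.
    """
    def directed(xs, ys):
        # Largest distance from any vector in xs to its closest vector in ys.
        return max(min(math.dist(x, y) for y in ys) for x in xs)

    return max(directed(a, b), directed(b, a))


def nearest_neighbor_label(query, labeled_objects):
    """Return the label of the labeled multi-instance object closest to `query`.

    `labeled_objects` is a non-empty list of (label, set_of_vectors) pairs.
    """
    label, _ = min(
        ((lbl, hausdorff(query, obj)) for lbl, obj in labeled_objects),
        key=lambda pair: pair[1],
    )
    return label


# Toy data: each object is a small set of 2-D feature vectors.
training = [
    ("bracket", [(0.0, 0.0), (1.0, 0.0)]),
    ("shaft", [(10.0, 10.0)]),
]
print(nearest_neighbor_label([(0.0, 0.1), (1.0, 0.0)], training))
```

Because the distance is defined on whole sets, the two objects being compared may contain different numbers of feature vectors.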

    Measurement techniques and case studies for the characterization of Internet applications

    This thesis characterizes the two current killer applications of the Internet: the World Wide Web (WWW) and Peer-to-Peer (P2P) file sharing. With advances in network technology and a radical cost reduction for Internet connectivity, the Internet is growing at a remarkable speed in terms of number of users, available content and network traffic. Due to the huge amount of available data, developing algorithms to efficiently locate desired information is a difficult research task. Thus, the characterization of the two most popular Internet applications, which enables the design and evaluation of novel search algorithms, constitutes the two key contributions of this work. As its first contribution, this thesis provides a synthetic workload model for the query behavior of peers in P2P file sharing systems, which can be used for evaluating new P2P system designs. Whereas previous work has focused solely on aggregate workload statistics, this thesis presents a characterization of individual peer behavior in a form that can be used for constructing representative synthetic workloads. The characterization is based on a comprehensive 40-day measurement study of the Gnutella P2P file sharing system comprising more than 10 GBytes of trace data. As a key feature, the characterization distinguishes between user behavior and queries that are automatically generated by the client software. The analysis of the measured data exposes heterogeneous behavior that occurs on different days, in different geographical regions or at different periods of the day. Moreover, considering additional correlations among the workload measures allows the generation of realistic workloads. As its second contribution, this thesis characterizes and models the structural properties of German Web sites to enable their automated classification. These structural properties encompass the size, the organization, the composition of URLs, and the link structure of Web sites. In fact, the approach is independent of the content of Web pages. In contrast to previous work, this thesis characterizes structural properties of entire Web sites instead of individual Web pages. The measurement study is based upon more than 2,300 Web sites comprising 11 million crawled pages categorized into five major classes: Brochure, Listing, Blog, Institution, and Personal. As a key insight that can be exploited for improving Internet search engines and Web directories, this thesis reveals significant correlations between the structural properties and the class of a Web site.
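
As a rough illustration of content-independent, site-level structural properties, the sketch below derives a few numbers from nothing but the URLs of one site. The chosen features and the name `site_structure_features` are assumptions made for this example; the abstract lists property groups (size, organization, URL composition, link structure) but not the concrete measures used in the thesis.

```python
from urllib.parse import urlparse


def site_structure_features(urls):
    """Compute simple structural features from a non-empty list of page URLs.

    Assumption: these four features are illustrative only and are not
    taken from the thesis. No page content is inspected.
    """
    parsed = [urlparse(u) for u in urls]
    # Path depth = number of non-empty path segments, e.g. /a/b.html -> 2.
    depths = [len([seg for seg in p.path.split("/") if seg]) for p in parsed]
    n = len(parsed)
    return {
        "num_pages": n,
        "mean_path_depth": sum(depths) / n,
        "max_path_depth": max(depths),
        "query_url_fraction": sum(1 for p in parsed if p.query) / n,
    }


example_urls = [
    "http://example.org/",
    "http://example.org/a/b.html",
    "http://example.org/a/c/d.html?x=1",
    "http://example.org/e",
]
print(site_structure_features(example_urls))
```

A feature dictionary of this kind, computed per site rather than per page, could then be fed to any standard classifier.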

    Enhanced Query Processing on Complex Spatial and Temporal Data

    Innovative technologies in the area of multimedia and mechanical engineering as well as novel methods for data acquisition in different scientific subareas, including geo-science, environmental science, medicine, biology and astronomy, enable a more exact representation of the data and thus a more precise data analysis. The resulting quantitative and qualitative growth of spatial and temporal data in particular leads to new challenges for the management and processing of complex structured objects and requires efficient and effective methods for data analysis. Spatial data describe objects in space by a well-defined extension, a specific location and their relationships to other objects. Classical representatives of complex structured spatial objects are three-dimensional CAD data from the mechanical engineering sector and two-dimensional bounded regions from geography. For industrial applications, efficient collision and intersection queries are of great importance. Temporal data describe time-dependent processes, for instance the duration of specific events or the time-varying attributes of objects. Time series are among the most popular and complex types of temporal data and are the most important form of description for time-varying processes. An elementary type of query in time series databases is the similarity query, which serves as a basic query for data mining applications. The main goal of this thesis is to develop an effective and efficient algorithm supporting collision queries on spatial data as well as similarity queries on temporal data, in particular time series. The presented concepts are based on the efficient management of interval sequences, which are suitable for spatial and temporal data. The effective analysis of the underlying objects is efficiently supported by adequate access methods. First, this thesis deals with collision queries on complex spatial objects, which can be reduced to intersection queries on interval sequences. We introduce statistical methods for the grouping of subsequences. Combined with the concept of multi-step query processing, these methods enable the user to accelerate the query process drastically. Furthermore, we develop a cost model for the multi-step query process of interval sequences in distributed systems. The proposed approach successfully supports a cost-based query strategy. Second, we introduce a novel similarity measure for time series. It allows the user to focus on specific time series amplitudes for the similarity measurement. The new similarity model defines two time series to be similar iff they show similar temporal behavior w.r.t. being below or above a specific threshold. This type of query is primarily required in natural science applications. The main goal of this new query method is the detection of anomalies and the adaptation to new requirements in the area of data mining in time series databases. In addition, a semi-supervised cluster analysis method is presented that is based on the introduced similarity model for time series. The efficiency and effectiveness of the proposed techniques are discussed extensively, and their advantages over existing methods are demonstrated experimentally by means of datasets derived from real-world applications.
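
The threshold-based notion of similarity can be made concrete with a small sketch: each time series is reduced to the sequence of intervals during which it lies above a threshold, and two series are compared only by when they are above or below it. The helper names `above_intervals` and `threshold_distance`, the strict `>` comparison, and the simple disagreement ratio are assumptions for illustration; the abstract does not define the actual distance used in the thesis.

```python
def above_intervals(series, tau):
    """Return inclusive index intervals (start, end) where series > tau."""
    intervals, start = [], None
    for i, value in enumerate(series):
        if value > tau and start is None:
            start = i                      # an above-threshold run begins
        elif value <= tau and start is not None:
            intervals.append((start, i - 1))  # the run ended at i - 1
            start = None
    if start is not None:                  # run extends to the last sample
        intervals.append((start, len(series) - 1))
    return intervals


def threshold_distance(s, t, tau):
    """Fraction of time steps at which s and t disagree on being above tau.

    Assumption: an illustrative dissimilarity for equal-length, non-empty
    series, not the similarity measure defined in the thesis.
    """
    if len(s) != len(t):
        raise ValueError("series must have the same length")
    disagreements = sum((a > tau) != (b > tau) for a, b in zip(s, t))
    return disagreements / len(s)


s = [0, 2, 3, 0, 5]
t = [0, 9, 9, 0, 0]
print(above_intervals(s, tau=1))
print(threshold_distance(s, t, tau=1))
```

Note that `s` and `t` have very different amplitudes yet disagree at only one time step, which is the kind of behavior a threshold-oriented comparison is meant to capture.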