
    Dwarf: A Complete System for Analyzing High-Dimensional Data Sets

    The need for data analysis in industries such as telecommunications, retail, manufacturing and financial services has generated a flurry of research, highly sophisticated methods and commercial products. However, all current attempts are haunted by the so-called "high-dimensionality curse": space and time complexity increase exponentially with the number of analysis "dimensions". This limits all existing approaches to coarse levels of analysis and/or to approximate answers of reduced precision. As the need for detailed analysis keeps increasing, along with the volume and detail of the stored data, these approaches are quickly rendered unusable. I have developed a unique method for efficiently performing analysis that is not affected by the high dimensionality of data and scales only polynomially, in fact almost linearly, with the number of dimensions, without sacrificing any accuracy in the returned results. I have implemented a complete system (called "Dwarf") and performed an extensive experimental evaluation that demonstrated tremendous improvements over existing methods in all aspects of the analysis: initial computation, storage, querying and updating. I have extended my research to the "data-streaming" model, where updates are performed on-line, which complicates any concurrent analysis but is critical for applications like security, network management/monitoring, router traffic control and sensor networks. I have devised streaming algorithms that provide complex statistics within user-specified relative-error bounds over a data stream. I introduced the class of "distinct implicated statistics", which is much more general than the established class of "distinct count" statistics. The latter has proved invaluable in applications such as analyzing and monitoring the distinct count of species in a population, or even in query optimization. The "distinct implicated statistics" class provides invaluable information about the correlations in the stream and is necessary for applications such as security. My algorithms are designed to use bounded amounts of memory and processing, so that they can even be implemented in hardware for resource-limited environments such as network routers or sensors, and to work in "noisy" environments, where some data may be flawed either implicitly due to the extraction process or explicitly
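    The abstract does not spell out the streaming algorithms themselves, but a minimal sketch of the classical Flajolet-Martin distinct-count estimator illustrates the kind of bounded-memory statistic that the "distinct implicated statistics" class generalizes. The class name, parameter values and hash construction below are illustrative, not the author's design.

    import hashlib

    class FMSketch:
        """Flajolet-Martin style distinct-count estimator (illustrative).

        Memory is fixed up front (num_maps bitmaps of map_bits bits each),
        matching the bounded-memory requirement for routers and sensors."""

        PHI = 0.77351  # standard Flajolet-Martin bias-correction constant

        def __init__(self, num_maps=64, map_bits=32):
            self.num_maps = num_maps
            self.map_bits = map_bits
            self.bitmaps = [0] * num_maps

        def add(self, item):
            for seed in range(self.num_maps):
                data = hashlib.sha256(f"{seed}:{item}".encode()).digest()
                h = int.from_bytes(data[:8], "big")
                # Record the position of the lowest set bit of the hash.
                r = (h & -h).bit_length() - 1 if h else self.map_bits - 1
                self.bitmaps[seed] |= 1 << min(r, self.map_bits - 1)

        def estimate(self):
            # Average the position of the lowest unset bit over all bitmaps.
            total = 0
            for bm in self.bitmaps:
                r = 0
                while bm & (1 << r):
                    r += 1
                total += r
            return 2 ** (total / self.num_maps) / self.PHI

    sketch = FMSketch()
    for i in range(10000):
        sketch.add(f"10.0.0.{i % 500}")  # 500 distinct source addresses
    print(round(sketch.estimate()))      # roughly 500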

    Business Intelligence for Small and Middle-Sized Enterprises

    Data warehouses are the core of decision support systems, which nowadays are used by all kinds of enterprises throughout the world. Although many studies have been conducted on the need for decision support systems (DSSs) in small businesses, most of them adopt existing solutions and approaches that are appropriate for large-scale enterprises but inadequate for small and middle-sized ones. Small enterprises require cheap, lightweight architectures and tools (hardware and software) providing on-line data analysis. In order to ensure these features, we review web-based business intelligence approaches. For real-time analysis, the traditional OLAP architecture is cumbersome and storage-costly; therefore, we also review in-memory processing. Consequently, this paper discusses the existing approaches and tools working in main memory and/or with web interfaces (including freeware tools) that are relevant for small and middle-sized enterprises in decision making.
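    As a toy illustration of the lightweight, in-memory style of analysis the paper argues for, the sketch below aggregates a small fact table on the fly with plain dictionaries instead of a storage-costly precomputed OLAP cube. The table, dimensions and figures are invented for the example.

    from collections import defaultdict

    # Hypothetical fact table: (region, product, month, revenue).
    sales = [
        ("EU", "widget", "2024-01", 120.0),
        ("EU", "gadget", "2024-01",  80.0),
        ("US", "widget", "2024-02", 200.0),
    ]

    DIMS = {"region": 0, "product": 1, "month": 2}

    def rollup(facts, group_by):
        """Aggregate revenue on demand along the requested dimensions,
        avoiding any precomputed cube."""
        totals = defaultdict(float)
        for row in facts:
            key = tuple(row[DIMS[d]] for d in group_by)
            totals[key] += row[3]
        return dict(totals)

    print(rollup(sales, ["region"]))            # {('EU',): 200.0, ('US',): 200.0}
    print(rollup(sales, ["product", "month"]))  # per product-and-month totals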

    Design of petroleum company's metadata and an effective knowledge mapping methodology

    Success of information flow depends on intelligent data storage and its management in a multi-disciplinary environment. Multi-dimensional data entities, data types and ambiguous semantics often pose uncertainty and inconsistency in data retrieval from volumes of petroleum data sources. In our approach, conceptual schemas and sub-schemas have been described based on various operational functions of the petroleum industry. These schemas are integrated to ensure their consistency and validity, so that the information retrieved from an integrated metadata structure (in the form of a data warehouse) derives its authenticity from its implementation. The data integration process validating the petroleum metadata has been demonstrated for one of the Gulf offshore basins, supporting effective knowledge mapping and its successful interpretation for the derivation of useful geological knowledge. Warehoused data are used for mining data patterns, trends and correlations among knowledge-base data attributes, which led to the interpretation of interesting geological features. These technologies appear well suited to the exploration of further petroleum resources in the mature Gulf basins.
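    The abstract describes integrating per-function sub-schemas into a single validated metadata structure; a minimal sketch of such a consistency check appears below. The attribute names and types are invented for illustration and are not the paper's actual schemas.

    def integrate_schemas(sub_schemas):
        """Merge per-function sub-schemas into one metadata schema,
        rejecting attributes whose declared types conflict."""
        merged = {}
        for source, schema in sub_schemas.items():
            for attr, dtype in schema.items():
                if attr in merged and merged[attr] != dtype:
                    raise ValueError(f"inconsistent type for '{attr}': "
                                     f"{merged[attr]} vs {dtype} (from {source})")
                merged[attr] = dtype
        return merged

    # Hypothetical sub-schemas for two operational functions.
    sub_schemas = {
        "drilling":   {"well_id": "str", "depth_m": "float"},
        "production": {"well_id": "str", "output_bbl": "float"},
    }
    print(integrate_schemas(sub_schemas))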

    The Dwarf Data Cube Eliminates the High Dimensionality Curse

    The data cube operator encapsulates all possible groupings of a data set and has proved to be an invaluable tool in analyzing vast amounts of data. However, its apparent exponential complexity has significantly limited its applicability to low-dimensional datasets. Recently, the dwarf data cube model was introduced, showing that high-dimensional "dwarf data cubes" are orders of magnitude smaller than the original data cubes, even when they calculate and store every possible aggregation with 100% precision. In this paper we present a surprising analytical result proving that the size of dwarf cubes grows polynomially with the dimensionality of the data set and, therefore, that a full data cube at 100% precision is not inherently cursed by high dimensionality. This striking result of polynomial complexity reformulates the context of cube management and redefines most of the problems associated with data warehousing and On-Line Analytical Processing. We also develop an efficient algorithm for estimating the size of dwarf data cubes before actually computing them. Finally, we complement our analytical approach with an experimental evaluation using real and synthetic data sets, and demonstrate our results. (UMIACS-TR-2003-12)
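    To make the exponential behavior concrete: the data cube operator materializes one group-by per subset of the dimensions, i.e. 2^d views for d dimensions. The sketch below (with invented toy data) computes exactly that naive cube; the dwarf model avoids the blowup by coalescing redundancy across these views, which this sketch does not attempt.

    from itertools import combinations
    from collections import defaultdict

    def full_cube(rows, dims, measure):
        """Naively materialize every group-by of a fact table:
        2 ** len(dims) aggregate views."""
        cube = {}
        for k in range(len(dims) + 1):
            for subset in combinations(range(len(dims)), k):
                agg = defaultdict(float)
                for row in rows:
                    agg[tuple(row[i] for i in subset)] += row[measure]
                cube[tuple(dims[i] for i in subset)] = dict(agg)
        return cube

    rows = [("EU", "widget", 10.0), ("EU", "gadget", 5.0), ("US", "widget", 7.0)]
    cube = full_cube(rows, ["region", "product"], measure=2)
    print(len(cube))          # 4 views: (), (region,), (product,), (region, product)
    print(cube[("region",)])  # {('EU',): 15.0, ('US',): 7.0}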

    A Sensor Network Data Compression Algorithm Based on Suboptimal Clustering and Virtual Landmark Routing Within Clusters

    A data compression algorithm for sensor networks based on suboptimal clustering and virtual landmark routing within clusters is proposed in this paper. First, the temporal redundancy in data obtained by the same node at sequential instants is eliminated. Second, the sensor network nodes are clustered; virtual node landmarks within clusters are established based on the cluster heads, and routing within clusters is realized by combining a greedy algorithm with a flooding algorithm. Third, a global structure tree based on the cluster heads is established. During data transmission from nodes to cluster heads and from cluster heads to the sink, the spatial redundancy in the data is eliminated. Only part of the raw data needs to be transmitted from the nodes to the sink, and all raw data can be recovered at the sink from a compression code and the transmitted part. Consequently, node energy is saved, largely because the transmission of redundant data is avoided, and the overall performance of the sensor network is markedly improved.
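    The abstract only names the temporal-redundancy step; a minimal lossless delta-encoding sketch of that idea follows, with the readings and units invented for the example. The sink recovers the raw series from the first value plus the per-instant changes.

    def delta_encode(readings):
        """Keep the first reading, then only the change between consecutive
        instants: small values that are cheaper to transmit than raw data."""
        return readings[:1] + [b - a for a, b in zip(readings, readings[1:])]

    def delta_decode(encoded):
        """Recover the raw series at the sink from the compression code."""
        out, total = [], 0
        for i, d in enumerate(encoded):
            total = d if i == 0 else total + d
            out.append(total)
        return out

    temps = [215, 215, 216, 216, 220]   # one node's readings, tenths of a degree
    code = delta_encode(temps)
    print(code)                          # [215, 0, 1, 0, 4], mostly small deltas
    assert delta_decode(code) == temps   # lossless recovery at the sink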

    Processing of an iceberg query on distributed and centralized databases

    Master's thesis (Master of Science)

    Query Rewriting in Itemset Mining

    In recent years, researchers have begun to study inductive databases, a new generation of databases for leveraging decision support applications. In this context, the user interacts with the DBMS using advanced constraint-based languages for data mining, where constraints are introduced to increase the relevance of the results and, at the same time, to reduce their volume. In this paper we study the problem of mining frequent itemsets using an inductive database. We propose a query-answering technique that rewrites a query in terms of unions and intersections of the result sets of other queries that were previously executed and materialized. Unfortunately, the exploitation of past queries is not always applicable. We therefore present sufficient conditions for the optimization to apply, and show that these conditions are strictly connected with the presence of functional dependencies between the attributes involved in the queries. Experiments on an initial prototype of an optimizer demonstrate that this approach to query answering is not only viable but in many practical cases essential, since it drastically reduces execution time.
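    The rewriting idea can be sketched in a few lines: if past queries' result sets are materialized, a query whose constraint is the conjunction (or disjunction) of two cached constraints can be answered by intersecting (or uniting) the cached sets, provided the queries ran on the same data with the same support threshold. The cached queries and itemsets below are invented, and the paper's actual applicability conditions, which rest on functional dependencies, are not modeled here.

    # Hypothetical materialized results: constraint -> frequent itemsets
    # found under that constraint (same dataset, same minimum support).
    cache = {
        "contains:tea":  {frozenset({"tea"}), frozenset({"tea", "milk"})},
        "contains:milk": {frozenset({"milk"}), frozenset({"tea", "milk"})},
    }

    def rewrite_and(c1, c2):
        """Conjunctive constraint: intersect the materialized result sets."""
        return cache[c1] & cache[c2]

    def rewrite_or(c1, c2):
        """Disjunctive constraint: unite the materialized result sets."""
        return cache[c1] | cache[c2]

    # Itemsets containing both tea and milk, answered without re-mining:
    print(rewrite_and("contains:tea", "contains:milk"))      # {frozenset({'tea', 'milk'})}
    print(len(rewrite_or("contains:tea", "contains:milk")))  # 3 itemsets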