
    Scalable approximate FRNN-OWA classification

    Fuzzy Rough Nearest Neighbour classification with Ordered Weighted Averaging operators (FRNN-OWA) is an algorithm that classifies unseen instances according to their membership in the fuzzy upper and lower approximations of the decision classes. Previous research has shown that the use of OWA operators increases the robustness of this model. However, calculating membership in an approximation requires a nearest neighbour search. In practice, the query time complexity of exact nearest neighbour search algorithms in more than a handful of dimensions is near-linear, which limits the scalability of FRNN-OWA. Therefore, we propose approximate FRNN-OWA, a modified model that calculates upper and lower approximations of decision classes using the approximate nearest neighbours returned by Hierarchical Navigable Small Worlds (HNSW), a recent approximative nearest neighbour search algorithm with logarithmic query time complexity at constant near-100% accuracy. We demonstrate that approximate FRNN-OWA is sufficiently robust to match the classification accuracy of exact FRNN-OWA while scaling much more efficiently. We test four parameter configurations of HNSW, and evaluate their performance by measuring classification accuracy and construction and query times for samples of various sizes from three large datasets. We find that with two of the parameter configurations, approximate FRNN-OWA achieves near-identical accuracy to exact FRNN-OWA for most sample sizes within query times that are up to several orders of magnitude faster
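
The OWA-based approximations described in this abstract can be illustrated with a minimal sketch. It is a simplified reading of FRNN-OWA, not the authors' implementation. It assumes crisp class labels, features scaled to [0, 1], a similarity of one minus the mean absolute difference, linearly decreasing OWA weights, and a brute-force neighbour search where the paper would use HNSW. All function names are hypothetical.

```python
def linear_weights(k):
    # Linearly decreasing weights k, k-1, ..., 1, normalised to sum to 1.
    total = k * (k + 1) / 2
    return [(k - i) / total for i in range(k)]


def similarity(a, b):
    # Assumption: every feature is already scaled to [0, 1].
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def soft_max(values, k):
    # OWA "soft maximum": weighted mean of the k largest values.
    top = sorted(values, reverse=True)[:k]
    return sum(w * v for w, v in zip(linear_weights(len(top)), top))


def soft_min(values, k):
    # OWA "soft minimum": weighted mean of the k smallest values.
    bottom = sorted(values)[:k]
    return sum(w * v for w, v in zip(linear_weights(len(bottom)), bottom))


def classify(train, labels, query, k=2):
    scores = {}
    for c in set(labels):
        # Upper approximation: how similar the query is to members of c.
        same = [similarity(x, query) for x, y in zip(train, labels) if y == c]
        # Lower approximation: how dissimilar it is to everything outside c.
        other = [1.0 - similarity(x, query)
                 for x, y in zip(train, labels) if y != c]
        upper = soft_max(same, k)
        lower = soft_min(other, k) if other else 1.0
        scores[c] = (upper + lower) / 2
    return max(scores, key=scores.get), scores


train = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0]]
labels = ["a", "a", "b", "b"]
label, scores = classify(train, labels, [0.05, 0.05], k=2)
print(label, scores)
```

Only the k most similar training instances per class influence each score, which is why an approximate neighbour index can stand in for the exhaustive sort used here.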

    Attribute Value Reordering For Efficient Hybrid OLAP

    The normalization of a data cube is the ordering of the attribute values. For large multidimensional arrays where dense and sparse chunks are stored differently, proper normalization can lead to improved storage efficiency. We show that it is NP-hard to compute an optimal normalization even for 1x3 chunks, although we find an exact algorithm for 1x2 chunks. When dimensions are nearly statistically independent, we show that dimension-wise attribute frequency sorting is an optimal normalization and takes time O(d n log(n)) for data cubes of size n^d. When dimensions are not independent, we propose and evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is already 19%-30% more efficient than ROLAP, but normalization can improve it further by 9%-13% for a total gain of 29%-44% over ROLAP
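
Dimension-wise attribute frequency sorting, as named in this abstract, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the cube is given as a list of coordinates of allocated (non-empty) cells, values are ordered by decreasing frequency, and ties are broken by the original value. The function names are hypothetical.

```python
from collections import Counter


def frequency_sort_normalization(cells, d):
    # cells: list of d-tuples, one per allocated cell of the cube.
    # Returns one relabelling dict per dimension.
    perms = []
    for dim in range(d):
        counts = Counter(cell[dim] for cell in cells)
        # Most frequent attribute values get the smallest new indices.
        ordered = sorted(counts, key=lambda v: (-counts[v], v))
        perms.append({old: new for new, old in enumerate(ordered)})
    return perms


def apply_normalization(cells, perms):
    # Relabel every cell coordinate with its new per-dimension index.
    return sorted(tuple(p[c] for p, c in zip(perms, cell)) for cell in cells)


cells = [(0, 2), (1, 2), (2, 2), (2, 0)]
perms = frequency_sort_normalization(cells, d=2)
print(apply_normalization(cells, perms))
```

Each dimension needs one counting pass and one sort, in line with the O(d n log(n)) bound quoted above. After relabelling, the frequent values sit next to each other, so allocated cells tend to cluster in fewer chunks.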

    Stochastic Query Covering for Fast Approximate Document Retrieval

    We design algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the document collection in such a way that we can efficiently provide high-quality answers to user queries using only the selected subset. This approach has applications when space is a constraint or when the query-processing time increases significantly with the size of the collection. We study our algorithms through the lens of stochastic analysis and prove that even though they use only a small fraction of the entire collection, they can provide answers to most user queries, achieving a performance close to the optimal. To complement our theoretical findings, we experimentally show the versatility of our approach by considering two important cases in the context of Web search. In the first case, we favor the retrieval of documents that are relevant to the query, whereas in the second case we aim for document diversification. Both the theoretical and the experimental analysis provide strong evidence of the potential value of query covering in diverse application scenarios
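
The abstract does not spell out the selection algorithms, so the sketch below uses a standard greedy weighted-coverage heuristic as a stand-in, not the authors' method. It assumes a known query distribution and a binary notion of "document answers query". The input format and the name `greedy_query_cover` are hypothetical.

```python
def greedy_query_cover(doc_answers, query_prob, budget):
    # doc_answers: dict mapping a document id to the set of queries it answers.
    # query_prob: dict mapping a query to its probability of being issued.
    # budget: maximum number of documents to keep.
    selected, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0.0
        for doc in sorted(doc_answers):
            if doc in selected:
                continue
            # Probability mass of queries this document would newly cover.
            gain = sum(query_prob[q] for q in doc_answers[doc] - covered)
            if gain > best_gain:
                best, best_gain = doc, gain
        if best is None:  # nothing left adds coverage
            break
        selected.append(best)
        covered |= doc_answers[best]
    return selected, sum(query_prob[q] for q in covered)


docs = {"d1": {"q1", "q2"}, "d2": {"q2", "q3"}, "d3": {"q3"}}
probs = {"q1": 0.5, "q2": 0.3, "q3": 0.2}
subset, coverage = greedy_query_cover(docs, probs, budget=2)
print(subset, coverage)
```

The returned coverage is the probability that a random query from the distribution can be answered from the selected subset alone.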

    Impact of different time series aggregation methods on optimal energy system design

    Modelling renewable energy systems is a computationally demanding task due to the high fluctuation of supply and demand time series. To reduce the scale of these, this paper discusses different methods for their aggregation into typical periods. Each aggregation method is applied to a different type of energy system model, making the methods fairly incomparable. To overcome this, the different aggregation methods are first extended so that they can be applied to all types of multidimensional time series and then compared by applying them to different energy system configurations and analyzing their impact on the cost-optimal design. It was found that regardless of the method, time series aggregation allows for significantly reduced computational resources. Nevertheless, averaged values lead to underestimation of the real system cost in comparison to the use of representative periods from the original time series. The aggregation method itself, e.g. k-means clustering, plays a minor role. More significant is the system considered: Energy systems utilizing centralized resources require fewer typical periods for a feasible system design in comparison to systems with a higher share of renewable feed-in. Furthermore, for energy systems based on seasonal storage, the way existing models integrate typical periods is not suitable.
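
The contrast between averaged values and representative periods taken from the original series can be illustrated with a toy sketch. It assumes the periods of one cluster are already given as equal-length lists and compares their centroid (element-wise mean) with their medoid (the actual period closest to all others). The data and function names are made up for illustration and are not from the paper.

```python
def centroid(periods):
    # Element-wise mean of all periods in the cluster (an "averaged" period).
    n = len(periods)
    return [sum(values) / n for values in zip(*periods)]


def medoid(periods):
    # The real period with the smallest total squared distance to the others.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(periods, key=lambda p: sum(dist(p, q) for q in periods))


# Three toy demand profiles whose peaks occur at different time steps.
cluster = [[0, 6, 0], [0, 0, 6], [0, 5, 1]]
avg = centroid(cluster)
rep = medoid(cluster)
print(max(avg), max(rep))
```

Because the peaks fall at different time steps, averaging flattens them: the centroid's peak is lower than the medoid's. This is one way a design sized on averaged periods can come out too small.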

    Keyword search in graphs, relational databases and social networks

    Keyword search, a well known mechanism for retrieving relevant information from a set of documents, has recently been studied for extracting information from structured data (e.g., relational databases and XML documents). It offers an alternative way to query languages (e.g., SQL) to explore databases, which is effective for lay users who may not be familiar with the database schema or the query language. This dissertation addresses some issues in keyword search in structured data. Namely, novel solutions to existing problems in keyword search in graphs or relational databases are proposed. In addition, a problem related to graph keyword search, team formation in social networks, is studied. The dissertation consists of four parts. The first part addresses keyword search over a graph which finds a substructure of the graph containing all or some of the query keywords. Current methods for keyword search over graphs may produce answers in which some content nodes (i.e., nodes that contain input keywords) are not very close to each other. In addition, current methods explore both content and non-content nodes while searching for the result and are thus both time and memory consuming for large graphs. To address the above problems, we propose algorithms for finding r-cliques in graphs. An r-clique is a group of content nodes that cover all the input keywords and the distance between each pair of nodes is less than or equal to r. Two approximation algorithms that produce r-cliques with a bounded approximation ratio in polynomial delay are proposed. In the second part, the problem of duplication-free and minimal keyword search in graphs is studied. Current methods for keyword search in graphs may produce duplicate answers that contain the same set of content nodes. In addition, an answer found by these methods may not be minimal in the sense that some of the nodes in the answer may contain query keywords that are all covered by other nodes in the answer. 
Removing these nodes does not change the coverage of the answer but can make the answer more compact. We define the problem of finding duplication-free and minimal answers, and propose algorithms for finding such answers efficiently. Meaningful keyword search in relational databases is the subject of the third part of this dissertation. Keyword search over relational databases returns a join tree spanning tuples containing the query keywords. As many answers of varying quality can be found, and the user is often only interested in seeing the top-k answers, how to gauge the relevance of answers to rank them is of paramount importance. This becomes more pertinent for databases with large and complex schemas. We focus on the relevance of join trees as the fundamental means to rank the answers. We devise means to measure relevance of relations and foreign keys in the schema over the information content of the database. The problem of keyword search over graph data is similar to the problem of team formation in social networks. In this setting, keywords represent skills and the nodes in a graph represent the experts that possess skills. Given an expert network, in which a node represents an expert that has a cost for using the expert service and an edge represents the communication cost between the two corresponding experts, we tackle the problem of finding a team of experts that covers a set of required skills and also minimizes the communication cost as well as the personnel cost of the team. We propose two types of approximation algorithms to solve this bi-criteria problem in the fourth part of this dissertation.
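
The r-clique definition given in this abstract (content nodes that together cover all query keywords, with every pairwise distance at most r) can be checked with a short sketch. It only verifies a candidate answer and does not reproduce the dissertation's approximation algorithms. It assumes an unweighted, undirected graph stored as an adjacency dict. The names `bfs_distances` and `is_r_clique` are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(graph, source):
    # Shortest hop counts from source in an unweighted graph.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_r_clique(graph, node_keywords, nodes, query, r):
    # 1) The chosen nodes must jointly contain every query keyword.
    covered = set().union(*(node_keywords.get(n, set()) for n in nodes))
    if not set(query) <= covered:
        return False
    # 2) Every pair of chosen nodes must be within distance r.
    dist = {n: bfs_distances(graph, n) for n in nodes}
    return all(dist[a].get(b, float("inf")) <= r
               for a, b in combinations(nodes, 2))


# Path graph a - b - c - d; only a, c and d contain keywords.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
node_keywords = {"a": {"x"}, "c": {"y"}, "d": {"z"}}
print(is_r_clique(graph, node_keywords, ["a", "c"], ["x", "y"], r=2))
```

Nodes a and c are two hops apart, so the candidate passes for r = 2 and would fail for r = 1.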

    The ROMES method for statistical modeling of reduced-order-model error

    This work presents a technique for statistically modeling errors introduced by reduced-order models. The method employs Gaussian-process regression to construct a mapping from a small number of computationally inexpensive `error indicators' to a distribution over the true error. The variance of this distribution can be interpreted as the (epistemic) uncertainty introduced by the reduced-order model. To model normed errors, the method employs existing rigorous error bounds and residual norms as indicators; numerical experiments show that the method leads to a near-optimal expected effectivity in contrast to typical error bounds. To model errors in general outputs, the method uses dual-weighted residuals---which are amenable to uncertainty control---as indicators. Experiments illustrate that correcting the reduced-order-model output with this surrogate can improve prediction accuracy by an order of magnitude; this contrasts with existing `multifidelity correction' approaches, which often fail for reduced-order models and suffer from the curse of dimensionality. The proposed error surrogates also lead to a notion of `probabilistic rigor', i.e., the surrogate bounds the error with specified probability

    Web Service Retrieval by Structured Models

    Much of the information available on the World Wide Web cannot effectively be found with the help of search engines because the information is dynamically generated on a user’s request. This applies to online decision support services as well as Deep Web information. We present in this paper a retrieval system that uses a variant of structured modeling to describe such information services, and similarity of models for retrieval. The computational complexity of the similarity problem is discussed, and graph algorithms for retrieval on repositories of service descriptions are introduced. We show how bounds for combinatorial optimization problems can provide filter algorithms in a retrieval context. We report on an evaluation of the retrieval system in a classroom experiment and give computational results on a benchmark library.
