
    Keyword-based object search and exploration in multidimensional text databases

    We propose TEXplorer, a novel system that integrates keyword-based object ranking with the aggregation and exploration power of OLAP in a text database with rich structured attributes, e.g., a product review database. TEXplorer operates over a multidimensional text database, where each row is associated with structural dimensions (attributes) and text data (e.g., a document). The system uses the text cube data model, in which a cell aggregates the set of documents whose values match in a subset of dimensions. Cells in a text cube capture different levels of summarization of the documents and can represent objects at different conceptual levels. Users query the system by submitting a set of keywords. Instead of returning a ranked list of all cells, we propose a keyword-based interactive exploration framework that offers flexible OLAP navigation guides and helps users identify the levels and objects they are interested in. A novel significance measure for dimensions is proposed, based on the distribution of the IR relevance of cells. During each interaction stage, dimensions are ranked according to their significance scores to guide drilling down, and cells in the same cuboid are ranked according to their relevance to guide exploration. We propose efficient algorithms and materialization strategies for ranking the top-k dimensions and cells. Finally, extensive experiments on real datasets demonstrate the efficiency and effectiveness of our approach.
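    The text-cube model described above can be illustrated with a small sketch. The data, the simple term-frequency relevance score, and all names below are illustrative assumptions; the paper's actual significance measure and ranking algorithms are more involved:

    ```python
    from itertools import combinations
    from collections import defaultdict

    # Toy review database: (brand, model, review text) rows.
    rows = [
        ("Acme", "X1", "great battery great screen"),
        ("Acme", "X2", "poor battery"),
        ("Bolt", "X1", "great battery"),
    ]
    dims = ("brand", "model")

    def cells(rows, dims):
        """Aggregate documents into text-cube cells: one cell per
        value combination over each subset of the dimensions."""
        cube = defaultdict(list)
        for subset_size in range(len(dims) + 1):
            for subset in combinations(range(len(dims)), subset_size):
                for row in rows:
                    key = (subset, tuple(row[i] for i in subset))
                    cube[key].append(row[-1])
        return cube

    def relevance(docs, keyword):
        # Simple aggregate term frequency; the paper uses a proper IR model.
        return sum(doc.split().count(keyword) for doc in docs)

    cube = cells(rows, dims)
    # Rank the cells of the "brand" cuboid for the query keyword "battery".
    brand_cuboid = {k[1][0]: relevance(v, "battery")
                    for k, v in cube.items() if k[0] == (0,)}
    print(brand_cuboid)  # Acme reviews mention "battery" twice, Bolt once.
    ```

    Each cell thus summarizes the documents at one conceptual level, and ranking cells within a cuboid is exactly the per-stage guidance the abstract describes.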

    Efficient Algorithms for k-Regret Minimizing Sets

    A regret minimizing set Q is a small-size representation of a much larger database P, such that user queries executed on Q return answers whose scores are not much worse than those on the full dataset. In particular, a k-regret minimizing set has the property that the regret ratio between the score of the top-1 item in Q and the score of the top-k item in P is minimized, where the score of an item is the inner product of the item's attributes with a user's weight (preference) vector. The problem is challenging because we want to find a single representative set Q whose regret ratio is small with respect to all possible user weight vectors. We show that k-regret minimization is NP-complete for all dimensions d >= 3, settling an open problem from Chester et al. [VLDB 2014]. Our main algorithmic contributions are two approximation algorithms, both with provable guarantees, one based on coresets and another based on hitting sets. We perform extensive experimental evaluation of our algorithms, using both real-world and synthetic data, and compare their performance against the solution proposed in [VLDB 2014]. The results show that our algorithms are significantly faster and scale to much larger sets than the greedy algorithm of Chester et al. for answers of comparable quality.
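    The regret-ratio definition above is concrete enough to sketch. The sampling-based estimate below is an illustrative assumption (the paper's algorithms reason over all weight vectors, not a sample), and the dataset is a toy:

    ```python
    import random

    def score(item, w):
        # Linear utility: inner product of attributes with the weight vector.
        return sum(a * b for a, b in zip(item, w))

    def k_regret_ratio(P, Q, w, k=1):
        """Regret for one user: how far the best item in Q falls short of
        the k-th best item in the full dataset P, relative to the latter."""
        kth_best = sorted((score(p, w) for p in P), reverse=True)[k - 1]
        best_in_q = max(score(q, w) for q in Q)
        return max(0.0, (kth_best - best_in_q) / kth_best)

    def max_regret(P, Q, k=1, samples=1000, seed=0):
        # Approximate the worst case by sampling random nonnegative weights.
        rng = random.Random(seed)
        d = len(P[0])
        worst = 0.0
        for _ in range(samples):
            w = [rng.random() for _ in range(d)]
            worst = max(worst, k_regret_ratio(P, Q, w, k))
        return worst

    P = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]
    print(max_regret(P, Q=[(0.6, 0.6)]))  # one representative: some regret
    print(max_regret(P, Q=P))             # the full set: zero regret
    ```

    The k-regret minimizing set problem is to pick the Q of a given size that minimizes this worst-case ratio.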

    Ad-hoc Holistic Ranking Aggregation

    Data exploration is one of the major processes that enables users to analyze massive amounts of data in order to find the most important and relevant information. Aggregation and ranking are two of the most frequently used tools in data exploration, and the interaction between them has been studied widely from different perspectives. This thesis presents a comprehensive survey of this interaction and introduces holistic ranking aggregation, a new form of this interaction. Finally, various algorithms are proposed to efficiently process ad-hoc holistic ranking aggregation for both monotone and generic scoring functions.
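    A minimal sketch of ranking interacting with aggregation, under illustrative assumptions (toy sales table, SUM aggregates, a weighted-sum score). This is not the thesis's algorithm; it only shows the query shape, where a monotone scoring function over the aggregates is what makes threshold-style pruning possible:

    ```python
    from collections import defaultdict
    from heapq import nlargest

    # Toy sales table: (region, revenue, units) rows.
    sales = [
        ("north", 100, 5),
        ("south", 80, 9),
        ("north", 40, 2),
        ("east",  90, 1),
    ]

    def topk_groups(rows, k, scoring):
        """Aggregate rows per group (SUM here), then rank the groups
        by a scoring function applied to their aggregate attributes."""
        agg = defaultdict(lambda: [0, 0])
        for group, revenue, units in rows:
            agg[group][0] += revenue
            agg[group][1] += units
        return nlargest(k, agg.items(), key=lambda kv: scoring(*kv[1]))

    # A monotone scoring function: weighted sum of the aggregates.
    monotone = lambda rev, units: 0.9 * rev + 0.1 * units
    top2 = topk_groups(sales, 2, monotone)
    print(top2)
    ```

    With a generic (non-monotone) scoring function, no per-aggregate bound can safely prune a group, which is why the thesis treats the two cases separately.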

    Development of a computational infrastructure aimed at increasing energy efficiency in buildings through the use of ad hoc networks

    Dissertation for the degree of Master in Electrical and Computer Engineering. Signs identified in the literature and in social behavior point to a marked decline in the energy efficiency of buildings, making this a societal issue from which none of us can distance ourselves. In recent years this decline has reached figures that should alarm us into quickly devising a solution to this global problem. On the other hand, wireless network technologies provide enormous support for the development of a wide variety of applications based on mobile communications. Moreover, with the growth of the mobile device market (smartphones and PDAs, for example), these wireless technologies are now, and increasingly over time, within reach of a large number of people. This dissertation develops a computational solution called Building Sensors Manager (BSM), which aims to support energy-efficient building management based on wireless network technologies (more precisely, ad hoc networks). In this solution, people become active elements in the process, themselves contributing to an improvement in their own quality of life. The goal of BSM is to collect readings of the physical quantities observed in the building so that it can act on the building in a calculated way; in this manner, the measures that maximize the building's energy efficiency are applied. In this process, the role of the occupants is to relay information about the physical state of the building until it reaches a location equipped with the means to act on it. This relaying is carried out by forming ad hoc networks, so that the information travels across mobile devices until it reaches its destination, while also informing the people in the network about the current state of the building.
    Among BSM's features, the following stand out: mobile devices in the same ad hoc network can communicate freely with each other, through both text messages and files; mobile devices in the building can serve as bridges between data-collection devices and a building management system; and all relevant information about the building can be gathered at a predetermined location, where it can be used to act on the building in a considered way.

    Efficient Indexing for Structured and Unstructured Data

    The collection of digital data is growing at an exponential rate. Data originates from a wide range of sources, such as text feeds, biological sequencers, internet traffic over routers, sensors, and many others. To mine intelligent information from these sources, users have to query the data, and indexing techniques aim to reduce the query time by preprocessing it. The diversity of real-world data sources makes it imperative to develop application-specific indexing solutions based on the data to be queried. Data can be structured, i.e., relational tables, or unstructured, i.e., free text. Moreover, increasingly many applications need to seamlessly analyze both kinds of data, making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data, and similar imperfections; probabilistic models have been proposed recently for this purpose. These models are also useful for applications where uncertainty is inherent in the data, e.g., sensor networks. This dissertation proposes efficient indexing solutions for several problems that lie at the intersection of databases and information retrieval, such as joining ranked inputs and full-text document search. The well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated through thorough experimentation.

    Personalization of decision-support analyses over multidimensional data

    This thesis investigates the personalization of OLAP analyses within multidimensional databases. An OLAP analysis is modeled as a graph whose nodes represent analysis contexts and whose edges represent user operations. An analysis context groups together the user query and its result, and is described by a specific tree structure that is independent of data visualization structures and query languages. We provide a model for user preferences over the multidimensional schema and its values; each preference is associated with a specific analysis context. Based on these models, we propose a generic framework comprising two personalization processes. The first, query personalization, enriches the user query with the relevant preferences in order to produce a new query whose result better satisfies the user's needs. The second, query recommendation, assists the user throughout the OLAP data exploration phase. Our recommendation framework supports three scenarios: assisting the user in query composition, suggesting the forthcoming query, and suggesting alternative queries. Recommendations are built progressively from the user's preferences. To implement the framework, we developed a prototype system that supports both the query personalization and the query recommendation processes, and we present experimental results showing the efficiency and effectiveness of our approaches. Keywords: OLAP, decision-support analysis, query personalization, recommender system, user preference, analysis context, context-tree matching.

    Rank-aware, Approximate Query Processing on the Semantic Web

    Search over the Semantic Web corpus frequently leads to queries with large result sets, so to discover the relevant data elements, users must rely on ranking techniques that sort results according to their relevance. At the same time, applications often deal with information needs that do not require complete and exact results. In this thesis, we address the problem of processing queries over Web data in an approximate and rank-aware fashion.

    Querying Large Collections of Semistructured Data

    An increasing amount of data is published as semistructured documents formatted with presentational markup. Examples include data objects such as mathematical expressions encoded with MathML or web pages encoded with XHTML. Our intention is to improve the state of the art in retrieving, manipulating, or mining such data. We focus first on mathematics retrieval, which is appealing in various domains, such as education, digital libraries, engineering, patent documents, and medical sciences. Capturing the similarity of mathematical expressions also greatly enhances document classification in such domains. Unlike text retrieval, where keywords carry enough semantics to distinguish text documents and rank them, math symbols do not contain much semantic information on their own. Unfortunately, considering the structure of mathematical expressions to calculate relevance scores of documents results in ranking algorithms that are computationally more expensive than the typical ranking algorithms employed for text documents. As a result, current math retrieval systems either limit themselves to exact matches, or they ignore the structure completely; they sacrifice either recall or precision for efficiency. We propose instead an efficient end-to-end math retrieval system based on a structural similarity ranking algorithm. We describe novel optimization techniques to reduce the index size and the query processing time. Thus, with the proposed optimizations, mathematical contents can be fully exploited to rank documents in response to mathematical queries. We demonstrate the effectiveness and the efficiency of our solution experimentally, using a special-purpose testbed that we developed for evaluating math retrieval systems. We finally extend our retrieval system to accommodate rich queries that consist of combinations of math expressions and textual keywords. As a second focal point, we address the problem of recognizing structural repetitions in typical web documents. 
    Most web pages use presentational markup standards, in which the tags control the formatting of documents rather than semantically describing their contents. Hence, their structures typically contain more irregularities than those of descriptive (data-oriented) markup languages. Even though applications would greatly benefit from a grammar inference algorithm that captures structure and makes it explicit, the existing algorithms for XML schema inference, which target data-oriented markup, are ineffective at inferring grammars for web documents with presentational markup. There is currently no general-purpose grammar inference framework that can handle the irregularities commonly found in web documents and that can operate with only a few examples. Although inferring grammars for individual web pages has been partially addressed by data extraction tools, the existing solutions rely on simplifying assumptions that limit their application. Hence, we describe a principled approach to the problem by defining a class of grammars that can be inferred from very small sample sets and can capture the structure of most web documents. The effectiveness of this approach, together with a comparison against various classes of grammars including DTDs and XSDs, is demonstrated through extensive experiments on web documents. We finally use the proposed grammar inference framework to extend our math retrieval system and to optimize it further.
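    Why structure-aware math ranking costs more than keyword matching can be seen in a toy sketch. The tree encoding and the path-set Jaccard measure below are illustrative assumptions, not the thesis's actual similarity algorithm:

    ```python
    # A math expression as a labeled tree: (label, children).
    def paths(tree, prefix=()):
        """All root-to-node label paths; comparing path sets is one
        cheap proxy for structural similarity between expressions."""
        label, children = tree
        p = prefix + (label,)
        result = {p}
        for child in children:
            result |= paths(child, p)
        return result

    def similarity(a, b):
        # Jaccard overlap of the two path sets.
        pa, pb = paths(a), paths(b)
        return len(pa & pb) / len(pa | pb)

    # x + y  versus  x + z: same operator structure, one differing leaf.
    expr1 = ("plus", [("x", []), ("y", [])])
    expr2 = ("plus", [("x", []), ("z", [])])
    print(similarity(expr1, expr2))
    ```

    Even this crude measure requires materializing and intersecting per-document structures, whereas keyword ranking only touches an inverted list per term; this is the efficiency gap the proposed index optimizations target.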

    Mining interesting events on large and dynamic data

    Nowadays, almost every human interaction produces some form of data. These data are available either to every user, e.g., images uploaded to Flickr, or only to users with specific privileges, e.g., transactions in a bank. The huge amount of produced data can easily overwhelm humans trying to make sense of it, creating a need for methods that analyze the content of the data, identify emerging topics in it, and present those topics to users. In this work, we focus on identifying emerging topics over large and dynamic data. More specifically, we analyze two types of data: data published in social networks such as Twitter and Flickr, and structured data stored in relational databases that are updated through continuous insertion queries. In social networks, users post text, images, or videos and annotate each item with a set of tags describing its content. We define sets of co-occurring tags to represent topics and track the correlations of co-occurring tags over time, treating abrupt increases in correlation as indicators of emerging trends. We split the tags across multiple nodes and make each node responsible for computing the correlations of its assigned tags. We implemented our approach in Storm, a distributed processing engine, and conducted a user study to estimate the quality of our results. In structured data stored in relational databases, top-k group-by queries are defined, and an emerging topic is considered to be a change in the top-k results. We maintain the top-k result sets in the presence of updates while minimizing the interaction with the underlying database, and we implemented and experimentally evaluated our approach.
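    The second setting above, where a change in a top-k group-by result is itself the event of interest, can be sketched minimally. The class, data, and naive full re-sort below are illustrative assumptions; the thesis's contribution is precisely to avoid recomputing from the database on every insertion:

    ```python
    from collections import defaultdict

    class TopKGroups:
        """Maintain a top-k group-by COUNT result under insertions and
        report when the top-k membership changes (an 'event')."""
        def __init__(self, k):
            self.k = k
            self.counts = defaultdict(int)
            self.topk = []

        def insert(self, group):
            self.counts[group] += 1
            # Naive: re-rank all groups; an efficient maintainer would
            # only touch groups that can enter or leave the top-k.
            new_topk = sorted(self.counts,
                              key=lambda g: (-self.counts[g], g))[:self.k]
            changed = set(new_topk) != set(self.topk)
            self.topk = new_topk
            return changed  # True when the top-k membership changed

    t = TopKGroups(k=2)
    stream = ["a", "b", "a", "c", "c", "c"]
    events = [g for g in stream if t.insert(g)]
    print(events, t.topk)
    ```

    Each True return corresponds to an emerging-topic event; the maintenance problem is to detect exactly these changes with as few queries against the underlying database as possible.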