Search CORE

6 research outputs found

On the Hardness of Category Tree Construction

Author: Avron Uri
Gershtein Shay
Guy Ido
Milo Tova
Novgorodov Slava
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 25th International Conference on Database Theory (ICDT 2022)
Publication date: 01/01/2022
Field of study

Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility. In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. Each category, rather than perfectly match the corresponding input set, is allowed to exceed a given threshold for a given similarity function. The goal is to produce a tree that maximizes the total weight of the sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to, which produces the hardness of the problem, as initially each item may be contained in an arbitrary number of input sets. For this model, we prove inapproximability bounds, of order ??(?n) or ??(n), for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where the category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees. Finally, we also generalize our results to DAG-based and non-hierarchical categorization

Dagstuhl Research Online Publication Server

Economics of cookie matching (Final Project in Computational Games Theory Course)

Author: Alex Fonar
Slava Novgorodov
Publication venue
Publication date
Field of study

In many cases, sites try to get more revenue from advertising. One way to do it is cookie matching. Cookies are small files placed on a user’s computer that permit a website to record information about a previous visit. Website saves copy of this information too. Cookies can be shared between sites. This helps to get more information about user

CiteSeerX

DiRec: Diversified Recommendations for Semantic-less Collaborative Filtering

Author: Rubi Boim
Slava Novgorodov
Tova Milo
Publication venue
Publication date: 07/05/2012
Field of study

Abstract—In this demo we present DiRec, a plug-in tha

CiteSeerX

Crossref

CrowdPlanr: Planning Made Easy with Crowd

Author: Ilia Lotosh
Slava Novgorodov
Tova Milo
Publication venue
Publication date: 15/07/2013
Field of study

Abstract—Recent research has shown that crowd sourcing can be used effectively to solve problems that are difficult for computers, e.g., optical character recognition and identification of the structural configuration of natural proteins [1]. In this demo we propose to use the power of the crowd to address yet another difficult problem that frequently occurs in a daily life-planning a sequence of actions, when the goal is hard to formalize. For example, planning the sequence of places/attractions to visit in the course of a vacation, where the goal is to enjoy the resulting vacation the most, or planning the sequence of courses to take in an academic schedule planning, where the goal is to obtain solid knowledge of a given subject domain. Such goals may be easily understandable by humans, but hard or even impossible to formalize for a computer. We present a novel algorithm for efficiently harnessing the crowd to assist in solving such planning problems. The algorithm builds the desired plans incrementally, optimally choosing at each step the ‘best ’ questions so that the overall number of questions that need to be asked is minimized. We demonstrate the effectiveness of our solution in CrowdPlanr, a system for vacation travel planning. Given a destination, dates, preferred activities and other constraints CrowdPlanr employs the crowd to build a vacation plan (sequence of places to visit) that is expected to maximize the ”enjoyment ” of the vacation. I

CiteSeerX

Crossref

Answering Planning Queries with the Crowd

Author: Haim Kaplan
Ilia Lotosh
Slava Novgorodov
Tova Milo
Publication venue
Publication date: 01/01/2013
Field of study

Recent research has shown that crowd sourcing can be used effectively to solve problems that are difficult for computers, e.g., optical character recognition and identification of the structural configuration of natural proteins. In this paper we propose to use the power of the crowd to address yet another difficult problem that frequently occurs in a daily life- answering planning queries whose output is a sequence of objects/actions, when the goal, i.e, the notion of “best output”, is hard to formalize. For example, planning the sequence of places/attractions to visit in the course of a vacation, where the goal is to enjoy the resulting vacation the most, or planning the sequence of courses to take in an academic schedule planning, where the goal is to obtain solid knowledge of a given subject domain. Such goals may be easily understandable by humans, but hard or even impossible to formalize for a computer. We present a novel algorithm for efficiently harnessing the crowd to assist in answering such planning queries. The algorithm builds the desired plans incrementally, choosing at each step the ‘best ’ questions so that the overall number of questions that need to be asked is minimized. We prove the algorithm to be optimal within its class and demonstrate experimentally its effectiveness and efficiency. 1

CiteSeerX

QOCO: A Query Oriented Data Cleaning System with Oracles

Author: Moria Bergman
Slava Novgorodov
Tova Milo
Wang-Chiew Tan
Publication venue
Publication date: 11/04/2020
Field of study

ABSTRACT As key decisions are often made based on information contained in a database, it is important for the database to be as complete and correct as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, data cleaning tools provide only best-effort results and usually cannot eradicate all errors that may exist in a database. Even more importantly, existing data cleaning tools do not typically address the problem of determining what information is missing from a database. To tackle these problems, we present QOCO, a novel query oriented cleaning system that leverages materialized views that are defined by user queries as a trigger for identifying the remaining incorrect/missing information. Given a user query, QOCO interacts with domain experts (which we model as oracle crowds) to identify potentially wrong or missing answers in the result of the user query, as well as determine and correct the wrong data that is the cause for the error(s). We will demonstrate QOCO over a World Cup Games database, and illustrate the interaction between QOCO and the oracles. Our demo audience will play the role of oracles, and we show how QOCO's underlying operations and optimization mechanisms can effectively prune the search space and minimize the number of questions that need to be posed to accelerate the cleaning process

CiteSeerX