1,394 research outputs found

    Targeted matrix completion

    Matrix completion is a problem that arises in many data-analysis settings where the input consists of a partially-observed matrix (e.g., recommender systems, traffic matrix analysis, etc.). Classical approaches to matrix completion assume that the input partially-observed matrix is low rank. The success of these methods depends on the number of observed entries and the rank of the matrix; the larger the rank, the more entries need to be observed in order to accurately complete the matrix. In this paper, we deal with matrices that are not necessarily low rank themselves, but rather they contain low-rank submatrices. We propose Targeted, which is a general framework for completing such matrices. In this framework, we first extract the low-rank submatrices and then apply a matrix-completion algorithm to these low-rank submatrices as well as the remainder matrix separately. Although for the completion itself we use state-of-the-art completion methods, our results demonstrate that Targeted achieves significantly smaller reconstruction errors than other classical matrix-completion methods. One of the key technical contributions of the paper lies in the identification of the low-rank submatrices from the input partially-observed matrices. Comment: Proceedings of the 2017 SIAM International Conference on Data Mining (SDM

    Generating Preview Tables for Entity Graphs

    Users are tapping into massive, heterogeneous entity graphs for many applications. It is challenging to select entity graphs for a particular need, given abundant datasets from many sources and the oftentimes scarce information for them. We propose methods to produce preview tables for compact presentation of important entity types and relationships in entity graphs. The preview tables assist users in attaining a quick and rough preview of the data. They can be shown in a limited display space for a user to browse and explore, before she decides to spend time and resources to fetch and investigate the complete dataset. We formulate several optimization problems that look for previews with the highest scores according to intuitive goodness measures, under various constraints on preview size and distance between preview tables. The optimization problem under distance constraint is NP-hard. We design a dynamic-programming algorithm and an Apriori-style algorithm for finding optimal previews. Results from experiments, comparison with related work and user studies demonstrated the scoring measures' accuracy and the discovery algorithms' efficiency. Comment: This is the camera-ready version of a SIGMOD16 paper. There might be tiny differences in layout, spacing and linebreaking, compared with the version in the SIGMOD16 proceedings, since we must submit TeX files and use arXiv to compile the file

    Keyword search in graphs, relational databases and social networks

    Keyword search, a well known mechanism for retrieving relevant information from a set of documents, has recently been studied for extracting information from structured data (e.g., relational databases and XML documents). It offers an alternative way to query languages (e.g., SQL) to explore databases, which is effective for lay users who may not be familiar with the database schema or the query language. This dissertation addresses some issues in keyword search in structured data. Namely, novel solutions to existing problems in keyword search in graphs or relational databases are proposed. In addition, a problem related to graph keyword search, team formation in social networks, is studied. The dissertation consists of four parts. The first part addresses keyword search over a graph which finds a substructure of the graph containing all or some of the query keywords. Current methods for keyword search over graphs may produce answers in which some content nodes (i.e., nodes that contain input keywords) are not very close to each other. In addition, current methods explore both content and non-content nodes while searching for the result and are thus both time and memory consuming for large graphs. To address the above problems, we propose algorithms for finding r-cliques in graphs. An r-clique is a group of content nodes that cover all the input keywords and the distance between each pair of nodes is less than or equal to r. Two approximation algorithms that produce r-cliques with a bounded approximation ratio in polynomial delay are proposed. In the second part, the problem of duplication-free and minimal keyword search in graphs is studied. Current methods for keyword search in graphs may produce duplicate answers that contain the same set of content nodes. In addition, an answer found by these methods may not be minimal in the sense that some of the nodes in the answer may contain query keywords that are all covered by other nodes in the answer. 
Removing these nodes does not change the coverage of the answer but can make the answer more compact. We define the problem of finding duplication-free and minimal answers, and propose algorithms for finding such answers efficiently. Meaningful keyword search in relational databases is the subject of the third part of this dissertation. Keyword search over relational databases returns a join tree spanning tuples containing the query keywords. As many answers of varying quality can be found, and the user is often only interested in seeing the top-k answers, how to gauge the relevance of answers to rank them is of paramount importance. This becomes more pertinent for databases with large and complex schemas. We focus on the relevance of join trees as the fundamental means to rank the answers. We devise means to measure relevance of relations and foreign keys in the schema over the information content of the database. The problem of keyword search over graph data is similar to the problem of team formation in social networks. In this setting, keywords represent skills and the nodes in a graph represent the experts that possess skills. Given an expert network, in which a node represents an expert that has a cost for using the expert service and an edge represents the communication cost between the two corresponding experts, we tackle the problem of finding a team of experts that covers a set of required skills and also minimizes the communication cost as well as the personnel cost of the team. We propose two types of approximation algorithms to solve this bi-criteria problem in the fourth part of this dissertation.
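
The r-clique definition from the first part of this abstract can be made concrete with a small check. The sketch below is only an illustration, not the dissertation's algorithm: it assumes an unweighted, undirected graph with hop-count distances, and the names `bfs_distances`, `is_r_clique`, `graph`, and `node_keywords` are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(graph, source):
    """Hop distances from `source` in an unweighted, undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def is_r_clique(graph, node_keywords, nodes, query, r):
    """True if `nodes` are content nodes that cover `query` and lie pairwise within distance r."""
    nodes = list(nodes)
    query = set(query)
    # Every member must be a content node, i.e. hold at least one query keyword.
    if any(not (node_keywords.get(n, set()) & query) for n in nodes):
        return False
    # Together the members must cover all query keywords.
    covered = set().union(*(node_keywords.get(n, set()) for n in nodes)) if nodes else set()
    if not query <= covered:
        return False
    # Every pair of members must be at distance <= r (assumption: hop count).
    for u, v in combinations(nodes, 2):
        if bfs_distances(graph, u).get(v, float("inf")) > r:
            return False
    return True


# Toy path graph a - b - c - d; only a, c, and d hold keywords.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
node_keywords = {"a": {"graph"}, "c": {"search"}, "d": {"keyword"}}
query = {"graph", "search", "keyword"}
```

On this toy graph the farthest pair of content nodes, `a` and `d`, is three hops apart, so `{a, c, d}` is a 3-clique but not a 2-clique.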

    Enabling Scalability: Graph Hierarchies and Fault Tolerance

    In this dissertation, we explore approaches to two techniques for building scalable algorithms. First, we look at different graph problems. We show how to exploit the input graph's inherent hierarchy for scalable graph algorithms. The second technique takes a step back from concrete algorithmic problems. Here, we consider the case of node failures in large distributed systems and present techniques to quickly recover from these. In the first part of the dissertation, we investigate how hierarchies in graphs can be used to scale algorithms to large inputs. We develop algorithms for three graph problems based on two approaches to build hierarchies. The first approach reduces instance sizes for NP-hard problems by applying so-called reduction rules. These rules can be applied in polynomial time. They either find parts of the input that can be solved in polynomial time, or they identify structures that can be contracted (reduced) into smaller structures without loss of information for the specific problem. After solving the reduced instance using an exponential-time algorithm, these previously contracted structures can be uncontracted to obtain an exact solution for the original input. In addition to a simple preprocessing procedure, reduction rules can also be used in branch-and-reduce algorithms where they are successively applied after each branching step to build a hierarchy of problem kernels of increasing computational hardness. We develop reduction-based algorithms for the classical NP-hard problems Maximum Independent Set and Maximum Cut. The second approach is used for route planning in road networks where we build a hierarchy of road segments based on their importance for long distance shortest paths. By only considering important road segments when we are far away from the source and destination, we can substantially speed up shortest path queries. 
In the second part of this dissertation, we take a step back from concrete graph problems and look at more general problems in high performance computing (HPC). Here, due to the ever increasing size and complexity of HPC clusters, we expect hardware and software failures to become more common in massively parallel computations. We present two techniques for applications to recover from failures and resume computation. Both techniques are based on in-memory storage of redundant information and a data distribution that enables fast recovery. The first technique can be used for general purpose distributed processing frameworks: We identify data that is redundantly available on multiple machines and only introduce additional work for the remaining data that is only available on one machine. The second technique is a checkpointing library engineered for fast recovery using a data distribution method that achieves balanced communication loads. Both our techniques have in common that they work in settings where computation after a failure is continued with fewer machines than before. This is in contrast to many previous approaches that, in particular for checkpointing, focus on systems that keep spare resources available to replace failed machines. Overall, we present different techniques that enable scalable algorithms. While some of these techniques are specific to graph problems, we also present tools for fault tolerant algorithms and applications in a distributed setting. To show that those can be helpful in many different domains, we evaluate them for graph problems and other applications like phylogenetic tree inference.
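
To illustrate what a reduction rule for Maximum Independent Set can look like, here is a minimal sketch of two textbook rules: a vertex of degree 0 or 1 always belongs to some maximum independent set, so it can be taken and removed together with its neighbor. The abstract does not say which rules the dissertation uses, so this is an assumed example, and the names `reduce_low_degree` and `toy` are hypothetical.

```python
def reduce_low_degree(adj):
    """Apply the degree-0 and degree-1 rules until neither applies.

    adj maps each vertex to the set of its neighbors (undirected, no self-loops).
    Returns (kernel, forced): the reduced graph and the vertices already
    placed in the independent set.
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    forced = set()
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v not in adj:  # already removed in this pass
                continue
            if len(adj[v]) <= 1:
                forced.add(v)  # taking v is always safe
                removed = {v} | adj[v]  # v and its (at most one) neighbor leave the graph
                for u in removed:
                    for w in adj.pop(u):
                        if w in adj:
                            adj[w].discard(u)
                changed = True
    return adj, forced


# A path a - b - c - d next to a triangle x - y - z.
toy = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}
kernel, forced = reduce_low_degree(toy)
```

Here the path is solved entirely by the rules (`forced` is `{"a", "c"}`), while the triangle remains as the kernel that an exact, exponential-time solver would still have to handle.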

    Collective Approaches to Named Entity Disambiguation

    Internet content has become one of the most important resources of information. Much of this information is in the form of natural language text and one of the important components of natural language text is named entities. So automatic recognition and classification of named entities has attracted researchers for many years. Named entities are mentioned in different textual forms in different documents. Also, the same textual mention may refer to different named entities. This problem is well known in NLP as a disambiguation problem. Named Entity Disambiguation (NED) refers to the task of mapping different named entity mentions in running text to their correct interpretations in a specific knowledge base (KB). NED is important for many applications like search engines and software agents that aim to aggregate information on real world entities from sources such as the Web. The main goal of this research is to develop new methods for named entity disambiguation, emphasising the importance of interdependency of named entity candidates of different textual mentions in the document. The thesis focuses on two connected problems related to disambiguation. The first is Candidates Generation, the process of finding a small set of named entity candidate entries in the knowledge base for a specific textual mention, where this set contains the correct entry in the knowledge base. The second problem is Collective Disambiguation, where all named entity textual mentions in the document are disambiguated jointly, using interdependence and semantic relations between the different NE candidates of different textual mentions. Wikipedia is used as a reference knowledge base in this research. An information retrieval framework is used to generate the named entity candidates for a textual mention. A novel document similarity function (NEBSim) based on NE co-occurrence is introduced to calculate the similarity between two documents given a specific named entity textual mention. 
NEBSim is also used in conjunction with the traditional cosine similarity measure to learn a model for ranking the named entity candidates. Naïve Bayes and SVM classifiers are used to re-rank the retrieved documents. Our experiments, carried out on TAC-KBP 2011 data, show NEBSim achieves significant improvement in accuracy as compared with a cosine similarity approach. Two novel approaches to collectively disambiguate textual mentions of named entities against Wikipedia are developed and tested using the AIDA dataset. The first represents the conditional dependencies between different named entities across Wikipedia as a Markov network, where named entities are treated as hidden variables and textual mentions as observations. The number of states and observations is huge, and naïvely using the Viterbi algorithm to find the hidden state sequence which emits the query observation sequence is computationally infeasible given a state space of this size. Based on an observation that is specific to the disambiguation problem, we develop an approach that uses a tailored approximation to reduce the size of the state space, making the Viterbi algorithm feasible. Results show good improvement in disambiguation accuracy relative to the baseline approach, and to some state-of-the-art approaches. Our approach also shows how, with suitable approximations, HMMs can be used in such large-scale state space problems. The second collective disambiguation approach uses a graph model, where all possible NE candidates are represented as nodes in the graph, and associations between different candidates are represented by edges between the nodes. Each node has an initial confidence score, e.g. entity popularity. PageRank is used to rank nodes, and the final rank is combined with the initial confidence for candidate selection. 
Experiments show the effectiveness of using PageRank in conjunction with initial confidence, achieving 87% accuracy, outperforming both baseline and state-of-the-art approaches.
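
The graph-based approach can be sketched roughly as follows. The abstract does not state how the PageRank score and the initial confidence are combined, so the product used here is an assumption, as are the toy candidates, the confidence values, and the names `pagerank` and `select_candidates`.

```python
def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected candidate graph."""
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n_nodes = len(nodes)
    rank = {n: 1.0 / n_nodes for n in nodes}
    for _ in range(iterations):
        # Mass held by isolated nodes is spread uniformly over all nodes.
        dangling = sum(rank[n] for n in nodes if not neighbors[n])
        base = (1.0 - damping) / n_nodes + damping * dangling / n_nodes
        rank = {
            n: base + damping * sum(rank[m] / len(neighbors[m]) for m in neighbors[n])
            for n in nodes
        }
    return rank


def select_candidates(mentions, confidence, rank):
    """Pick, per mention, the candidate maximizing confidence * rank (assumed combination)."""
    return {
        mention: max(candidates, key=lambda c: confidence[c] * rank[c])
        for mention, candidates in mentions.items()
    }


# Hypothetical mentions, candidate entities, and popularity-style confidences.
mentions = {
    "Jordan": ["Michael_Jordan", "Jordan_country"],
    "Bulls": ["Chicago_Bulls", "Bull_animal"],
    "NBA": ["NBA_league"],
}
confidence = {
    "Michael_Jordan": 0.45, "Jordan_country": 0.55,
    "Chicago_Bulls": 0.7, "Bull_animal": 0.3,
    "NBA_league": 1.0,
}
edges = [
    ("Michael_Jordan", "Chicago_Bulls"),
    ("Michael_Jordan", "NBA_league"),
    ("Chicago_Bulls", "NBA_league"),
]
nodes = list(confidence)
rank = pagerank(nodes, edges)
chosen = select_candidates(mentions, confidence, rank)
```

In this toy setup `Jordan_country` has the higher initial confidence, yet `Michael_Jordan` is selected because its links to the other mentions' candidates raise its rank, which is the collective effect the abstract describes.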

    Search Rank Fraud Prevention in Online Systems

    The survival of products in online services such as Google Play, Yelp, Facebook and Amazon, is contingent on their search rank. This, along with the social impact of such services, has also turned them into a lucrative medium for fraudulently influencing public opinion. Motivated by the need to aggressively promote products, communities that specialize in social network fraud (e.g., fake opinions and reviews, likes, followers, app installs) have emerged, to create a black market for fraudulent search optimization. Fraudulent product developers exploit these communities to hire teams of workers willing and able to commit fraud collectively, emulating realistic, spontaneous activities from unrelated people. We call this behavior “search rank fraud”. In this dissertation, we argue that fraud needs to be proactively discouraged and prevented, instead of only reactively detected and filtered. We introduce two novel approaches to discourage search rank fraud in online systems. First, we detect fraud in real-time, when it is posted, and impose resource consuming penalties on the devices that post activities. We introduce and leverage several novel concepts that include (i) stateless, verifiable computational puzzles that impose minimal performance overhead, but enable the efficient verification of their authenticity, (ii) a real-time, graph based solution to assign fraud scores to user activities, and (iii) mechanisms to dynamically adjust puzzle difficulty levels based on fraud scores and the computational capabilities of devices. In a second approach, we introduce the problem of fraud de-anonymization: reveal the crowdsourcing site accounts of the people who post large amounts of fraud, thus their bank accounts, and provide compelling evidence of fraud to the users of products that they promote. We investigate the ability of our solutions to ensure that fraud does not pay off
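
    As a rough illustration of a stateless, verifiable puzzle whose difficulty follows a fraud score, the sketch below uses a hashcash-style search with an HMAC tag. The abstract does not describe the dissertation's actual construction, so everything here is an assumption: the key `SERVER_KEY`, the functions `issue_puzzle`, `solve`, `verify`, and `difficulty_for`, and the linear mapping from score to difficulty are all hypothetical.

```python
import hashlib
import hmac
import os

SERVER_KEY = b"demo-secret"  # hypothetical server-side secret


def difficulty_for(fraud_score, base_bits=8, max_extra_bits=8):
    """Map a fraud score in [0, 1] to a number of required leading zero bits."""
    clamped = min(1.0, max(0.0, fraud_score))
    return base_bits + round(max_extra_bits * clamped)


def issue_puzzle(activity_id, difficulty_bits):
    """Return (payload, tag); the tag lets the server verify later without storing state."""
    nonce = os.urandom(8).hex()
    payload = f"{activity_id}:{difficulty_bits}:{nonce}"
    tag = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return payload, tag


def _meets_difficulty(payload, counter, difficulty_bits):
    digest = hashlib.sha256(f"{payload}:{counter}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0


def solve(payload, difficulty_bits):
    """Client side: brute-force a counter whose hash has enough leading zero bits."""
    counter = 0
    while not _meets_difficulty(payload, counter, difficulty_bits):
        counter += 1
    return counter


def verify(payload, tag, counter):
    """Server side: check the tag (authenticity), then a single hash (solution)."""
    expected = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        return False
    difficulty_bits = int(payload.rsplit(":", 2)[1])
    return _meets_difficulty(payload, counter, difficulty_bits)


bits = difficulty_for(0.25)
payload, tag = issue_puzzle("review-1", bits)
solution = solve(payload, bits)
```

    Verification costs one HMAC and one hash, while solving takes about 2^bits hash evaluations on average, so a higher fraud score translates directly into more work for the posting device.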

    Optimization opportunities in human in the loop computational paradigm

    An emerging trend is to leverage human capabilities in the computational loop at different capacities, ranging from tapping knowledge from a richly heterogeneous pool of knowledge resident in the general population to soliciting expert opinions. These practices are, in general, termed human-in-the-loop (HITL) computations. A HITL process requires holistic treatment and optimization from multiple standpoints considering all stakeholders: (a) applications, (b) platforms, and (c) humans. In application-centric optimization, the factors of interest usually are latency (how long it takes for a set of tasks to finish), cost (the monetary or computational expenses incurred in the process), and quality of the completed tasks. Platform-centric optimization studies throughput, or revenue maximization, while human-centric optimization deals with the characteristics of the human workers, referred to as human factors, such as their skill improvement and learning, to name a few. Finally, fairness and ethical considerations are also of utmost importance in these processes. This dissertation aims to design solutions for each of the aforementioned stakeholders. The first contribution of this dissertation is the study of recommending deployment strategies for applications consistent with task requesters’ deployment parameters. From the worker’s standpoint, this dissertation focuses on investigating online group formation where members seek to increase their learning potential via collaboration. Finally, it studies how to consolidate preferences from different workers/applications in a fair manner, such that the final order is both consistent with individual preferences and complies with a group fairness criterion. 
The technical contributions of this dissertation are to rigorously study these problems from theoretical standpoints, present principled algorithms with theoretical guarantees, and conduct extensive experimental analysis using large-scale real-world datasets to demonstrate their effectiveness and scalability.

    Community Detection In Social Networks Using Parallel Clique-finding Ants

    Thesis (M.Sc.) -- İstanbul Technical University, Institute of Informatics, 2010. The appeal of social network analysis as a research topic across many disciplines is growing in parallel with the continuous growth of the Internet, which allows people to share and collaborate more. Nowadays, detection of community structures, which may be established on social networks, is a popular topic in Computer Science. High computational costs and non-scalability on large-scale social networks are the biggest drawbacks of popular community detection methods. 
The main aim of this thesis is to reduce the original network graph to a manageable size so that computational costs decrease without loss of solution quality, thus increasing scalability on such networks. In this study, we focus on Ant Colony Optimization techniques to find quasi-cliques in the network and assign these quasi-cliques as nodes in a reduced graph to use with community detection algorithms. Experiments are performed on commonly used social networks with the addition of several large-scale networks. The experimental results on social networks of various sizes show that the execution times of the community detection methods decrease while the overall quality of the solution is preserved.
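
The graph-reduction step can be illustrated with a short sketch. The abstract does not define "quasi-clique", so the density-based definition used here (at least a fraction `gamma` of all possible edges present) is an assumption, and the Ant Colony Optimization search itself is not shown. The names `is_quasi_clique`, `contract`, and `adj` are hypothetical.

```python
from itertools import combinations


def is_quasi_clique(adj, nodes, gamma):
    """True if the subgraph induced by `nodes` has at least gamma * C(k, 2) edges."""
    nodes = list(nodes)
    possible = len(nodes) * (len(nodes) - 1) // 2
    if possible == 0:
        return True
    present = sum(1 for u, v in combinations(nodes, 2) if v in adj[u])
    return present >= gamma * possible


def contract(adj, groups):
    """Merge each (disjoint) group of nodes into one super-node of a reduced graph."""
    owner = {n: n for n in adj}
    for name, members in groups.items():
        for n in members:
            owner[n] = name
    reduced = {owner[n]: set() for n in adj}
    for u, neighbors in adj.items():
        for v in neighbors:
            if owner[u] != owner[v]:  # drop edges inside a group
                reduced[owner[u]].add(owner[v])
    return reduced


# Toy network: a, b, c form a triangle, d hangs off c, and e hangs off d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
reduced = contract(adj, {"Q1": {"a", "b", "c", "d"}})
```

The set `{a, b, c, d}` contains 4 of its 6 possible edges, so it passes for `gamma = 0.6` but not for `gamma = 0.8`. After contraction the five-node network becomes two nodes, `Q1` and `e`, joined by one edge, and a community detection algorithm would then run on this smaller graph.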