1,138 research outputs found

    Catching Numeric Inconsistencies in Graphs


    Towards effective analysis of big graphs: from scalability to quality

    This thesis investigates two central issues underlying graph analysis: scalability and quality. We first study incremental problems for graph queries, which aim to compute the changes to the old query answer in response to updates to the input graph. An incremental problem is called bounded if its cost is determined only by the sizes of the query and the changes. However desirable boundedness may be, our first results are negative: for common graph queries such as graph traversal, connectivity, keyword search and pattern matching, the incremental problems are unbounded. In light of these negative results, we propose two new characterizations of the effectiveness of incremental computation, and show that the computations above can still be conducted effectively, either by reducing the computations on big graphs to small data, or by incrementalizing batch algorithms to minimize unnecessary recomputation. We next study problems concerning the quality of graphs. To uniquely identify entities represented by vertices in a graph, we propose a class of keys that are recursively defined in terms of graph patterns and interpreted with subgraph isomorphism. As an application, we study the entity matching problem, which is to find all pairs of entities in a graph that are identified by a given set of keys. Although the problem is proved intractable and cannot be parallelized in logarithmic rounds, we provide two parallel scalable algorithms for it. In addition, to catch numeric inconsistencies in real-life graphs, we extend graph functional dependencies with linear arithmetic expressions and comparison predicates, referred to as NGDs. NGDs strike a balance between expressivity and complexity: if we allow non-linear arithmetic expressions, even of degree at most 2, the satisfiability and implication problems become undecidable.
    A localizable incremental algorithm is developed to detect errors using NGDs, whose cost is determined by the small neighborhoods of nodes in the updates rather than by the entire graph. Finally, a rule-based method to clean graphs is proposed. We extend graph entity dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of ground truth, we fix violations of GEDs in the graph by combining data repairing and object identification. The method finds certain fixes to errors detected by GEDs; that is, as long as the GEDs and the ground truth are correct, the fixes are guaranteed correct as their logical consequences. Several fundamental results underlying the method are established, and an algorithm is developed to implement it. We also parallelize the method and guarantee that its running time decreases as processors are added.
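    The flavor of an NGD-style check can be illustrated with a minimal sketch. This is not the thesis's algorithm or data model; the graph, node types, and the linear constraint (order.total = item.price × order.qty + order.shipping) are all hypothetical, chosen only to show a dependency with linear arithmetic and a comparison predicate being validated over a property graph.

    ```python
    # Hypothetical property graph: typed nodes with numeric attributes,
    # plus edges from "order" nodes to the "item" they reference.
    graph = {
        "nodes": {
            "o1": {"type": "order", "total": 25.0, "qty": 2, "shipping": 5.0},
            "o2": {"type": "order", "total": 30.0, "qty": 3, "shipping": 5.0},
            "i1": {"type": "item", "price": 10.0},
        },
        "edges": [("o1", "i1"), ("o2", "i1")],
    }

    def ngd_violations(g):
        """Return ids of order nodes violating the linear constraint
        total == price * qty + shipping over each (order -> item) edge."""
        bad = []
        for src, dst in g["edges"]:
            o, i = g["nodes"][src], g["nodes"][dst]
            if o["type"] == "order" and i["type"] == "item":
                expected = i["price"] * o["qty"] + o["shipping"]
                if abs(o["total"] - expected) > 1e-9:
                    bad.append(src)
        return bad

    print(ngd_violations(graph))  # → ['o2']  (10*3 + 5 = 35, but total is 30)
    ```

    A localizable detection algorithm in the thesis's sense would only re-evaluate such constraints in the small neighborhood of updated nodes, rather than rescanning every edge as this sketch does.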

    Infrared: A Meta Bug Detector

    The recent breakthroughs in deep learning have sparked a wave of interest in learning-based bug detectors. Compared to traditional static analysis tools, these bug detectors are learned directly from data and are thus easier to create. On the other hand, they are difficult to train, requiring a large amount of data that is not readily available. In this paper, we propose a new approach, called meta bug detection, which offers three crucial advantages over existing learning-based bug detectors: it is bug-type generic (capable of catching types of bugs entirely unobserved during training), self-explainable (capable of explaining its own predictions without external interpretability methods), and sample efficient (requiring substantially less training data than standard bug detectors). Our extensive evaluation shows that our meta bug detector (MBD) is effective in catching a variety of bugs, including null pointer dereferences, array index out-of-bounds, file handle leaks, and even data races in concurrent programs; in the process, MBD also significantly outperforms several noteworthy baselines, including Facebook Infer, a prominent static analysis tool, and FICS, the latest anomaly detection method.

    Big Graph Analyses: From Queries to Dependencies and Association Rules


    Graduate Research in Education: Learning the Research Story Through the Story of a Slow Cat

    This book is designed to facilitate understanding of education research and guide the development and writing of a research project. In education, as in the other social sciences, we investigate issues directly involving or influencing humans. Research involving humans can be a complex and awkward endeavor. However, if we view this endeavor through the lens of a research story, we find a familiar genre we can relate to and understand. This concise, open textbook weaves the story of a cat throughout to explain the research components of typical education research projects.

    Capstone Projects in Education: Learning the Research Story

    This book is designed to facilitate understanding of education research and guide the development and writing of a capstone project. In education, as in the other social sciences, we investigate issues directly involving or influencing humans. Research involving humans can be a complex and awkward endeavor. However, if we view this endeavor through the lens of a research story, we find a familiar genre we can relate to and understand. This concise, open textbook weaves the story of a cat throughout to explain the components of typical education capstone projects.

    Prime Number-Based Hierarchical Data Labeling Scheme for Relational Databases

    Hierarchical data structures are an important aspect of many computer science fields, including data mining, terrain modeling, and image analysis. A good representation of such data accurately captures the parent-child and ancestor-descendant relationships between nodes. There exist a number of ways to capture and manage hierarchical data while preserving such relationships. For instance, one may use a custom system designed for a specific kind of hierarchy. Object-oriented databases may also be used to model hierarchical data. Relational database systems, on the other hand, add the benefits of mature mathematical theory, reliable implementations, and superior functionality and scalability. Relational databases were not originally designed with hierarchical data management in mind; as a result, hierarchical information cannot be stored natively in database relations. Database labeling schemes resolve this issue by labeling all nodes in a way that reveals their relationships. A label usually encodes a node's position in the hierarchy as a number or a string that can be stored, indexed, searched, and retrieved from a database. Many different labeling schemes have been developed, and all of them can be classified into three broad categories: recursive expansion, materialized path, and nested sets. Each model has its strengths and weaknesses, and each implementation attempts to reduce the weaknesses inherent to its model. One of the most prominent implementations of the materialized path model uses the unique characteristics of prime numbers for labeling. However, the performance and space utilization of this prime number labeling scheme can be significantly improved. This research introduces a new scheme, called reusable prime number labeling (rPNL), that reduces the effects of these weaknesses. The advantages of the proposed scheme are discussed in detail, proven mathematically, and confirmed experimentally.
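    The basic idea behind prime-number materialized-path labeling can be sketched as follows: each node is assigned a distinct prime, and its label is the product of the primes along the path from the root, so that ancestry reduces to a divisibility test. This is an illustrative sketch of the underlying scheme only, not the rPNL algorithm proposed here; the function names and traversal order are assumptions of this example.

    ```python
    from itertools import count

    def prime_gen():
        """Yield primes by trial division (adequate for small trees)."""
        found = []
        for n in count(2):
            if all(n % p for p in found):
                found.append(n)
                yield n

    def label_tree(tree, root):
        """tree: {node: [children]}. Each node's label is the product of
        the distinct primes assigned along its path from the root."""
        gen = prime_gen()
        labels = {}
        stack = [(root, 1)]
        while stack:
            node, parent_label = stack.pop()
            labels[node] = parent_label * next(gen)
            for child in tree.get(node, []):
                stack.append((child, labels[node]))
        return labels

    def is_ancestor(labels, a, b):
        """a is a proper ancestor of b iff a's label divides b's label."""
        return a != b and labels[b] % labels[a] == 0

    tree = {"A": ["B", "C"], "B": ["D"]}
    labels = label_tree(tree, "A")
    print(is_ancestor(labels, "A", "D"))  # → True
    print(is_ancestor(labels, "C", "D"))  # → False
    ```

    Because labels grow multiplicatively with depth and primes are never reused, the plain scheme wastes label space quickly, which is the weakness that reusing primes (as in rPNL) targets.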

    Understanding the New York rabies epizootic 1985-2005

    Surveillance data are an important part of medical geography; these data underlie many of the analyses that define the subdiscipline. It is understood that surveillance data may contain biases, but only limited studies have been devoted to determining in what ways the data fail to represent actual disease prevalence. New York was selected for this research for several reasons. First, it has a strong rabies data set. Second, it has a centralized system of licensing animal and dog control officers. Third, it is well represented in terms of local media. This dissertation attempts to better understand the New York rabies epizootic using not only rabies surveillance data, but also data collected from animal control officers and media reports, particularly newspaper articles. These data can help provide a fuller picture of how a rabies epizootic functions within a state, particularly in terms of the relationship of the disease to society at large. The generation of surveillance data itself is not often the subject of investigation. One part of that system that receives little attention from researchers is the part that physically collects animals: animal and dog control officers. As the lowest level in the surveillance system, control officers are often overlooked in terms of their contribution to it. The media presentation of the rabies epizootic is the other subject of this work. The relationship between a disease and media reports of that disease is often not clear. In New York, reporting of rabies in local newspapers often reflected the submissions of suspicious animals for rabies testing. This research found that the levels of training among animal and dog control officers in New York were low, considering that this was a state with epizootic rabies. The attitudes of the control officers revealed that, as a group, they considered themselves part of the public health system, but they were often not treated as such.
    The media investigation revealed that articles about rabies in small, local newspapers can reflect rabies submissions in the adjacent area. This was not true for larger newspapers.