
    Mining and Managing Large-Scale Temporal Graphs

    Large-scale temporal graphs are everywhere in our daily life. From online social networks, mobile networks, and brain networks to computer systems, entities in these large complex systems communicate with each other, and their interactions evolve over time. Unlike traditional graphs, temporal graphs are dynamic: both their topologies and the attributes on nodes and edges may change over time. On the one hand, these dynamics have inspired new applications that rely on mining and managing temporal graphs. On the other hand, they also raise new technical challenges. First, it is difficult to discover or retrieve knowledge from complex temporal graph data. Second, the extra time dimension introduces new scalability problems. To address these challenges, we need new methods that model temporal information in graphs so that we can deliver useful knowledge, new queries with temporal and structural constraints through which users can obtain the desired knowledge, and new algorithms that are cost-effective for both mining and management tasks. In this dissertation, we discuss our recent work on mining and managing large-scale temporal graphs.
    First, we investigate two mining problems: node ranking and link prediction. In these works, temporal graphs model data generated from computer systems and online social networks. We formulate data mining tasks that extract knowledge from temporal graphs. The discovered knowledge can help domain experts identify critical alerts in system monitoring applications and recover complete traces of information propagation in online social networks. To address computational efficiency problems, we leverage unique properties of temporal graphs to simplify the mining process. The resulting mining algorithms scale well to temporal graphs with millions of nodes and billions of edges. Experimental studies over real-life and synthetic data confirm the effectiveness and efficiency of our algorithms.
    Second, we focus on temporal graph management problems. In these studies, temporal graphs model datacenter networks, mobile networks, and subscription relationships between stream queries and data sources. We formulate graph queries that retrieve knowledge supporting applications in cloud service placement, information routing in mobile networks, and query assignment in stream processing systems. We investigate three types of queries: subgraph matching, temporal reachability, and graph partitioning. By exploiting the relatively stable components in these temporal graphs, we develop flexible data management techniques that enable fast query processing and handle graph dynamics. We evaluate the soundness of the proposed techniques on both real and synthetic data.
    Through these studies, we have learned valuable lessons. For temporal graph mining, the temporal dimension does not necessarily increase computational complexity; it may even reduce it if temporal information is used wisely. For temporal graph management, temporal graphs in real applications may include relatively stable components, which help us develop flexible data management techniques that enable fast query processing and handle dynamic changes in temporal graphs.
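    The temporal reachability queries mentioned above can be illustrated with a minimal sketch (the data and function names below are illustrative, not from the dissertation): a node reaches another only along a time-respecting path, i.e. one whose edge timestamps are non-decreasing, so static connectivity is not enough.

```python
def temporal_reachable(edges, src, dst, t_start=0):
    """Check whether `dst` is reachable from `src` along a time-respecting
    path (non-decreasing edge timestamps). `edges` is a list of (u, v, t)
    triples. A single pass over edges sorted by timestamp suffices,
    because an edge at time t can only extend paths that arrived by t."""
    arrival = {src: t_start}  # earliest known arrival time per node
    for u, v, t in sorted(edges, key=lambda e: e[2]):
        if u in arrival and arrival[u] <= t:
            if v not in arrival or t < arrival[v]:
                arrival[v] = t
    return dst in arrival

edges = [("a", "b", 1), ("b", "c", 3), ("c", "d", 2)]
```

    Here "a" reaches "c" via timestamps 1 then 3, but not "d": the edge into "d" fires at time 2, before "c" is reachable, even though a static path a-b-c-d exists. This is also an example of the lesson above that time can reduce cost: one sorted sweep replaces repeated graph traversals.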

    AI Solutions for MDS: Artificial Intelligence Techniques for Misuse Detection and Localisation in Telecommunication Environments

    This report considers the application of Artificial Intelligence (AI) techniques to the problem of misuse detection and misuse localisation within telecommunications environments. A broad survey of techniques is provided, covering inter alia rule-based systems, model-based systems, case-based reasoning, pattern matching, clustering and feature extraction, artificial neural networks, genetic algorithms, artificial immune systems, agent-based systems, data mining, and a variety of hybrid approaches. The report then considers the central issue of event correlation, which is at the heart of many misuse detection and localisation systems. The notion of inferring misuse by correlating individual, temporally distributed events within a multiple-data-stream environment is explored, and a range of techniques is surveyed, covering model-based approaches, `programmed' AI, and machine learning paradigms. It is found that, in general, correlation is best achieved via rule-based approaches, but that these suffer from a number of drawbacks, such as the difficulty of developing and maintaining an appropriate knowledge base and the inability to generalise from known misuses to new, unseen misuses. Two distinct approaches are evident. One attempts to encode knowledge of known misuses, typically within rules, and uses it to screen events. This approach cannot generally detect misuses for which it has not been programmed, i.e. it is prone to issuing false negatives. The other attempts to `learn' the features of event patterns that constitute normal behaviour and, by observing patterns that do not match expected behaviour, detect when a misuse has occurred. This approach is prone to issuing false positives, i.e. inferring misuse from innocent patterns of behaviour that the system was not trained to recognise.
    Contemporary approaches favour hybridisation, often combining detection or localisation mechanisms for both abnormal and normal behaviour, the former to capture known cases of misuse, the latter to capture unknown cases. In some systems, these mechanisms even update each other to increase detection rates and lower false-positive rates. It is concluded that hybridisation offers the most promising future direction, but that a rule- or state-based component is likely to remain, being the most natural approach to the correlation of complex events. The challenge, then, is to mitigate the weaknesses of canonical programmed systems so that learning, generalisation, and adaptation are more readily facilitated.
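    The rule-based correlation of temporally distributed events across streams can be sketched as follows. This is a toy illustration of the general idea, not any real MDS component; the event types, rule fields, and window semantics are assumptions.

```python
def correlate(events, rule):
    """Naive rule-based event correlation. `events` is a list of
    (timestamp, stream, event_type) tuples from multiple streams;
    the rule fires at time t0 when every required event type occurs
    within `window` seconds of t0. Returns the firing timestamps."""
    hits = []
    ordered = sorted(events)
    for t0, _, _ in ordered:
        in_window = [e for e in ordered if t0 <= e[0] <= t0 + rule["window"]]
        seen = {etype for _, _, etype in in_window}
        if rule["requires"] <= seen:  # all required types observed
            hits.append(t0)
    return hits

events = [(1, "s1", "login_fail"), (2, "s2", "login_fail"), (3, "s1", "root_login")]
rule = {"requires": {"login_fail", "root_login"}, "window": 5}
```

    The sketch also shows the weakness discussed above: the rule only fires on the pattern it encodes, so a misuse with no matching rule yields a false negative.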

    Leveraging EST Evidence to Automatically Predict Alternatively Spliced Genes, Master's Thesis, December 2006

    Current methods for high-throughput automatic annotation of newly sequenced genomes are largely limited to tools that predict only one transcript per gene locus. Evidence suggests that 20-50% of genes in higher eukaryotic organisms are alternatively spliced, leaving the remaining transcripts to be annotated by hand, an expensive, time-consuming process. Genomes are being sequenced at a much higher rate than they can be annotated. We present three methods for using alignments of inexpensive Expressed Sequence Tags (ESTs) in combination with HMM-based gene prediction with N-SCAN EST to recreate the vast majority of hand annotations in the D. melanogaster genome. In our first method, we “piece together” N-SCAN EST predictions with clustered EST alignments to increase the number of transcripts predicted per locus. This is shown to be a sensitive and accurate method, predicting the vast majority of known transcripts in the D. melanogaster genome. We then present an approach that uses these clusters of EST alignments to construct a multi-pass gene prediction phase, again piecing the results together with clusters of EST alignments. While time-consuming, multi-pass gene prediction is very accurate and more sensitive than single-pass prediction. Finally, we present a new Hidden Markov Model instance, which augments the current N-SCAN EST HMM, that predicts multiple splice forms in a single pass of prediction. This method is less time-consuming and performs nearly as well as the multi-pass approach.
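    The clustering of EST alignments into loci can be sketched as interval merging: alignments that overlap on the genome are grouped into one cluster. This is a simplified stand-in, assuming single-strand (start, end) coordinates, not the thesis's actual clustering procedure.

```python
def cluster_alignments(alignments):
    """Group EST alignments, given as (start, end) genomic intervals on
    one strand, into clusters of mutually overlapping intervals.
    Sorting by start lets a single sweep merge each alignment into the
    current cluster or open a new one."""
    clusters = []
    for start, end in sorted(alignments):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the current cluster: extend its right boundary.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return [tuple(c) for c in clusters]
```

    For example, alignments (100, 200) and (150, 300) merge into one cluster spanning (100, 300), while (400, 500) starts a second cluster, suggesting a separate locus.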

    IPAC Image Processing and Data Archiving for the Palomar Transient Factory

    The Palomar Transient Factory (PTF) is a multiepochal robotic survey of the northern sky that acquires data for the scientific study of transient and variable astrophysical phenomena. The camera and telescope provide for wide-field imaging in optical bands. In the five years of operation since first light on 2008 December 13, images taken with Mould-R and SDSS-g′ camera filters have been routinely acquired on a nightly basis (weather permitting), and two different Hα filters were installed in 2011 May (656 and 663 nm). The PTF image-processing and data-archival program at the Infrared Processing and Analysis Center (IPAC) is tailored to receive and reduce the data and, from it, generate and preserve astrometrically and photometrically calibrated images, extracted source catalogs, and co-added reference images. Relational databases have been deployed to track these products in operations and the data archive. The fully automated system has benefited from lessons learned on past IPAC projects and includes features that could be adopted by other ground-based observatories. Both off-the-shelf and in-house software have been utilized for economy and rapid development. The PTF data archive is curated by the NASA/IPAC Infrared Science Archive (IRSA). A state-of-the-art custom Web interface has been deployed for downloading the raw images, processed images, and source catalogs from IRSA. Access to PTF data products is currently limited to an initial public data release (M81, M44, M42, SDSS Stripe 82, and the Kepler Survey Field). It is the intent of the PTF collaboration to release the full PTF data archive when sufficient funding becomes available.

    Data Mining

    Data mining is a branch of computer science used to automatically extract meaningful, useful knowledge and previously unknown, hidden, interesting patterns from large amounts of data to support the decision-making process. This book presents recent theoretical and practical advances in the field of data mining. It discusses a number of data mining methods, including classification, clustering, and association rule mining. The book brings together many successful data mining studies in areas such as health, banking, education, software engineering, animal science, and the environment.
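    Of the methods listed above, association rule mining is the most self-contained to sketch. The following toy example (item names and threshold are illustrative) shows the pair-counting core of Apriori-style mining: keep the item pairs whose support, the fraction of transactions containing both items, meets a minimum.

```python
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Count co-occurring item pairs across transactions and keep those
    whose support (fraction of transactions containing the pair) is at
    least `min_support`. Returns {pair: support}."""
    counts = {}
    for t in transactions:
        # Sort items so each pair has one canonical ordering.
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    n = len(transactions)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]]
```

    With `min_support=0.5`, the pairs (bread, milk) and (bread, eggs) each appear in two of the three baskets (support 2/3) and survive, while (eggs, milk) does not; a full Apriori implementation would extend the surviving pairs to larger itemsets and derive rules with confidence scores.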

    An Investigative Study Of Patents From An Engineering Design Perspective

    Preservation and reuse of valuable design experience aids in the design of new products and processes. Product design repositories are presently used as a means to preserve and later reuse design knowledge. Patent databases, such as those of the United States Patent and Trademark Office and the European Patent Office, offer design knowledge in the form of patents. Unfortunately, these sources of novel design solutions have not been effectively used in the context of engineering design. In this research, the role of patents in a systematic design process is reviewed to understand their utility in the design process. A major hurdle in the reuse of patent design knowledge is the lack of formal tools to support designers in understanding and applying the available information to new problems. Information theory fundamentals are used to study patent claim text, which describes the subject matter of the patent, and to develop an understanding of the information content within the claim text and other representations of the claim. Graph-based representations are recognized as an effective way to represent design information. They are considered ideal for modeling patent claims, as they enable the direct use of the information as input to existing design processes and tools, such as function models, the core product model, and the function-behavior-structure scheme. This new approach provides a designer-friendly model of patent claims and also enables the use of intelligent search mechanisms. Existing graph-based product representation schemas are studied for their suitability to model patent claims. A new representation tailored for patent claims is proposed, since the existing schemas were found insufficient to model patent claims efficiently. Patent claims modeled using multiple representation schemas are compared with models developed using the proposed representation, in terms of the information content captured from claim text.
    The representation technique proposed here may aid in the retrieval of relevant patent design information, thereby promoting the use of patent information to aid designers. Further refinement and evaluation of the scheme, along with the development of a grammar and ontologies for a vocabulary, is needed. This representation scheme, with existing search and retrieval methods, should help designers generate both novel and practical concepts based on patent information.
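    The general idea of a graph-based claim model can be sketched as labelled triples: nodes are claim elements (components) and edges are the relations the claim asserts between them. This is a toy illustration of graph-based representation in general, not the schema proposed in the thesis; the class and relation names are invented.

```python
class ClaimGraph:
    """Minimal labelled graph over patent-claim elements.
    Nodes are claim components; each edge is a
    (source, relation, target) triple taken from the claim text."""

    def __init__(self):
        self.nodes = set()
        self.edges = []  # (source, relation, target) triples

    def add_relation(self, source, relation, target):
        self.nodes.update((source, target))
        self.edges.append((source, relation, target))

    def neighbours(self, node):
        """Components that `node` directly relates to."""
        return [t for s, _, t in self.edges if s == node]

# A claim fragment like "a housing enclosing a motor that drives a shaft"
# becomes two labelled edges:
g = ClaimGraph()
g.add_relation("housing", "encloses", "motor")
g.add_relation("motor", "drives", "shaft")
```

    Once claims are in this form, graph search and matching can serve as the intelligent retrieval mechanism the abstract describes, e.g. finding all claims whose graphs contain a "drives" relation.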