
    Private Data Exploring, Sampling, and Profiling

    Data analytics is widely used not only as a business tool that empowers organizations to drive efficiencies, glean deeper operational insights, and identify new opportunities, but also for the greater good of society, as it helps solve some of the world's most pressing issues, such as developing COVID-19 vaccines and fighting poverty and climate change. Data analytics is a process involving a pipeline of tasks over the underlying datasets, such as data acquisition and cleaning, data exploration and profiling, building statistics, and training machine learning models. In many cases, conducting data analytics faces two practical challenges. First, many sensitive datasets have restricted access and do not allow unfettered exploration; second, data assets are often owned and stored in silos by multiple business units within an organization, each with its own access control. Therefore, data scientists have to do analytics on private and siloed data. There is a fundamental trade-off between data privacy and data analytics. On the one hand, achieving good-quality analytics requires understanding the whole picture of the data; on the other hand, despite recent advances in privacy and security primitives such as differential privacy and secure computation, these primitives, when naively applied, often significantly degrade a task's efficiency and accuracy, due to expensive computations and injected noise, respectively. Moreover, these techniques are often piecemeal and fall short of integrating holistically into end-to-end data analytics tasks. In this thesis, we approach this problem by treating privacy and utility as constraints on data analytics. First, we study each task and express its utility as data constraints; then, we select a principled data privacy and security model for each task; and finally, we develop mechanisms to combine them into end-to-end analytics tasks. This dissertation addresses the specific technical challenges of trading off privacy and utility in three popular analytics tasks. The first challenge is to ensure query accuracy in private data exploration. Current systems for answering queries with differential privacy place an inordinate burden on the data scientist to understand differential privacy, manage their privacy budget, and even implement new algorithms for noisy query answering. Moreover, current systems do not provide any guarantees to the data analyst on the quality they care about, namely the accuracy of query answers. We propose APEx, a generic accuracy-aware query engine for private data exploration. The key distinction of APEx is that it allows the data scientist to explicitly attach a desired accuracy bound to a SQL query. Using experiments with query benchmarks and a case study, we show that APEx allows high exploration quality with a reasonable privacy loss. The second challenge is to preserve the structure of the data in private data synthesis. Existing differentially private data synthesis methods aim to generate data that is useful for target applications, but they fail to preserve one of the most fundamental properties of structured data: the underlying correlations and dependencies among tuples and attributes. As a result, the synthesized data is not useful for any downstream task that requires this structure to be preserved. We propose Kamino, a data synthesis system that ensures differential privacy and preserves the structure and correlations present in the original dataset.
We empirically show that, while preserving the structure of the data, Kamino achieves comparable or even better utility than state-of-the-art differentially private data synthesis methods in applications such as training classification models and answering marginal queries. The third challenge is efficient and secure private data profiling. Discovering functional dependencies (FDs) usually requires access to all data partitions in order to find constraints that hold on the whole dataset, and simply applying general secure multi-party computation protocols incurs high computation and communication costs. We propose SMFD, which formulates the FD discovery problem in the secure multi-party setting and provides secure and efficient cryptographic protocols to discover FDs over distributed partitions. Experimental results show that SMFD is practically efficient compared to non-secure distributed FD discovery and can significantly outperform a general-purpose multi-party computation framework.
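As an aside on how such an accuracy-first interface can be grounded, the translation from an accuracy requirement to a privacy cost is straightforward in the simplest case of a single counting query answered with the Laplace mechanism: a bound of alpha on the error with failure probability beta fixes the noise scale, and hence the privacy parameter epsilon. The following Python sketch illustrates that translation; the function names and the restriction to sensitivity-1 counting queries are illustrative assumptions, not APEx's actual interface.

```python
import math
import numpy as np

def laplace_scale_for_accuracy(alpha, beta):
    """Largest Laplace scale b (i.e., smallest privacy cost) such that
    P(|noise| > alpha) = exp(-alpha / b) is at most beta."""
    return alpha / math.log(1.0 / beta)

def private_count(true_count, alpha, beta, sensitivity=1.0):
    """Answer a counting query to within +/- alpha with probability >= 1 - beta,
    returning the noisy answer and the differential-privacy cost epsilon spent."""
    b = laplace_scale_for_accuracy(alpha, beta)
    epsilon = sensitivity / b          # epsilon-DP cost implied by the accuracy bound
    noisy_answer = true_count + np.random.laplace(loc=0.0, scale=b)
    return noisy_answer, epsilon

# Example: answer within +/- 50 of the true count with 95% probability.
answer, eps = private_count(true_count=1200, alpha=50.0, beta=0.05)
```

Tighter accuracy requirements (smaller alpha or beta) translate directly into a larger epsilon, which is the privacy-accuracy trade-off that an accuracy-aware engine must manage across a sequence of exploration queries.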

    Collected Papers (on Neutrosophic Theory and Applications), Volume VIII

    This eighth volume of Collected Papers includes 75 papers comprising 973 pages on (theoretic and applied) neutrosophics, written between 2010-2022 by the author alone or in collaboration with the following 102 co-authors (alphabetically ordered) from 24 countries: Mohamed Abdel-Basset, Abduallah Gamal, Firoz Ahmad, Ahmad Yusuf Adhami, Ahmed B. Al-Nafee, Ali Hassan, Mumtaz Ali, Akbar Rezaei, Assia Bakali, Ayoub Bahnasse, Azeddine Elhassouny, Durga Banerjee, Romualdas Bausys, Mircea Boșcoianu, Traian Alexandru Buda, Bui Cong Cuong, Emilia Calefariu, Ahmet Çevik, Chang Su Kim, Victor Christianto, Dae Wan Kim, Daud Ahmad, Arindam Dey, Partha Pratim Dey, Mamouni Dhar, H. A. Elagamy, Ahmed K. Essa, Sudipta Gayen, Bibhas C. Giri, Daniela Gîfu, Noel Batista Hernández, Hojjatollah Farahani, Huda E. Khalid, Irfan Deli, Saeid Jafari, Tèmítópé Gbóláhàn Jaíyéolá, Sripati Jha, Sudan Jha, Ilanthenral Kandasamy, W.B. Vasantha Kandasamy, Darjan Karabašević, M. Karthika, Kawther F. Alhasan, Giruta Kazakeviciute-Januskeviciene, Qaisar Khan, Kishore Kumar P K, Prem Kumar Singh, Ranjan Kumar, Maikel Leyva-Vázquez, Mahmoud Ismail, Tahir Mahmood, Hafsa Masood Malik, Mohammad Abobala, Mai Mohamed, Gunasekaran Manogaran, Seema Mehra, Kalyan Mondal, Mohamed Talea, Mullai Murugappan, Muhammad Akram, Muhammad Aslam Malik, Muhammad Khalid Mahmood, Nivetha Martin, Durga Nagarajan, Nguyen Van Dinh, Nguyen Xuan Thao, Lewis Nkenyereya, Jagan M. Obbineni, M. Parimala, S. K. Patro, Peide Liu, Pham Hong Phong, Surapati Pramanik, Gyanendra Prasad Joshi, Quek Shio Gai, R. Radha, A.A. Salama, S. Satham Hussain, Mehmet Șahin, Said Broumi, Ganeshsree Selvachandran, Selvaraj Ganesan, Shahbaz Ali, Shouzhen Zeng, Manjeet Singh, A. Stanis Arul Mary, Dragiša Stanujkić, Yusuf Șubaș, Rui-Pu Tan, Mirela Teodorescu, Selçuk Topal, Zenonas Turskis, Vakkas Uluçay, Norberto Valcárcel Izquierdo, V. Venkateswara Rao, Volkan Duran, Ying Li, Young Bae Jun, Wadei F. Al-Omeri, Jian-qiang Wang, Lihshing Leigh Wang, Edmundas Kazimieras Zavadskas

    Straggler-Resilient Distributed Computing

    The number and scale of distributed computing systems being built have increased significantly in recent years. Primarily, this is because (i) our computing needs are increasing at a much higher rate than computers are becoming faster, so we need to use more of them to meet demand, and (ii) systems that are fundamentally distributed, e.g., because their components are geographically distributed, are becoming increasingly prevalent. This paradigm shift is the source of many engineering challenges. Among them is the straggler problem, which is caused by latency variations in distributed systems, where faster nodes are held up by slower ones. The straggler problem can significantly impair the effectiveness of distributed systems: a single node experiencing a transient outage (e.g., due to being overloaded) can lock up an entire system.
In this thesis, we consider schemes for making a range of computations resilient against such stragglers, thus allowing a distributed system to proceed in spite of some nodes failing to respond on time. The schemes we propose are tailored to particular computations: distributed matrix-vector multiplication (a fundamental operation in many computing applications), distributed machine learning (in the form of a straggler-resilient first-order optimization method), and distributed tracking of a time-varying process (e.g., tracking the location of a set of vehicles for a collision avoidance system). The proposed schemes rely on exploiting redundancy, either introduced as part of the scheme or existing naturally in the underlying problem, to compensate for missing results; in other words, they are a form of forward error correction for computations. Further, for one of the proposed schemes we exploit redundancy to also improve the effectiveness of multicasting, thus reducing the amount of data that needs to be communicated over the network. Such inter-node communication, like the straggler problem, can significantly limit the effectiveness of distributed systems. For the schemes we propose, we show significant improvements in latency and reliability compared to previous schemes.
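To make the redundancy idea concrete for the matrix-vector case, the sketch below encodes the row blocks of A with a Vandermonde-style (MDS) generator so that A @ x can be recovered from any k of the n workers, tolerating up to n - k stragglers. This is a minimal illustration under simplifying assumptions (equal-sized blocks, real-valued arithmetic, illustrative function names); it is not the exact scheme proposed in the thesis.

```python
import numpy as np

def encode(A, n, k):
    """Split A into k equal row blocks and produce n coded blocks, one per worker.
    Assumes A.shape[0] is divisible by k."""
    blocks = np.split(A, k, axis=0)
    # Vandermonde generator with distinct nodes: any k of its n rows are invertible.
    G = np.vander(np.arange(1.0, n + 1.0), k, increasing=True)
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return coded, G

def decode(results, worker_ids, G, k):
    """Recover A @ x from the first k partial results to arrive."""
    R = np.stack(results[:k])                        # shape (k, rows_per_block)
    originals = np.linalg.solve(G[worker_ids[:k], :], R)
    return originals.ravel()

# Example: 5 workers, any 3 of which suffice (2 stragglers tolerated).
A, x = np.random.randn(6, 4), np.random.randn(4)
coded, G = encode(A, n=5, k=3)
partial = [c @ x for c in coded]                     # each worker's local product
arrived = [1, 3, 4]                                  # workers 0 and 2 straggle
result = decode([partial[i] for i in arrived], arrived, G, k=3)
assert np.allclose(result, A @ x)
```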

    Scalability aspects of data cleaning

    Data cleaning has become one of the most important pre-processing steps for many data science, data analytics, and machine learning applications. According to a survey by Gartner, more than 25% of the critical data in the world's top companies is flawed, which can result in economic losses amounting to trillions of dollars a year. Over the past few decades, several algorithms and tools have been developed to clean data. However, many of these solutions find it difficult to scale as the amount of data increases. For example, these solutions often involve a quadratic number of tuple-pair comparisons or the generation of all possible column combinations; both of these tasks can take days to finish if the dataset has millions of tuples or a few hundred columns, which is usually the case for real-world applications. Data cleaning tasks often involve a trade-off between scalability and the quality of the solution. One can achieve scalability by performing fewer computations, but at the cost of a lower-quality solution. Therefore, existing approaches exploit this trade-off when they need to scale to larger datasets, settling for a lower-quality solution. Some approaches have considered re-thinking solutions from scratch to achieve scalability and high quality. However, re-designing these solutions from scratch is a daunting task, as it involves systematically analyzing the space of possible optimizations and then tuning the physical implementations for a specific computing framework, data size, and set of resources. Another component of these solutions that becomes critical with increasing data size is how the data is stored and fetched. For smaller datasets, most of the data fits in memory, so accessing it from a data store is not a bottleneck. For large datasets, however, these solutions need to constantly fetch and write data to a data store. As observed in this dissertation, data cleaning tasks have a lifecycle-driven data access pattern that is not well served by traditional data stores, making those stores a bottleneck when cleaning large datasets. In this dissertation, we consider scalability as a first-class citizen for data cleaning tasks and propose that scalable, high-quality solutions can be achieved by adopting three principles: 1) rewriting existing algorithms in terms of a new set of primitives that allow efficient implementations on multiple computing frameworks, 2) efficiently involving domain experts' knowledge to reduce computation and improve quality, and 3) using an adaptive data store that can transform the data layout based on the access pattern. We make contributions towards each of these principles. First, we present a set of primitive operations for discovering constraints from data; these primitives facilitate rewriting existing discovery algorithms into efficient distributed implementations. Next, we present a framework that involves domain experts for faster clustering selection in data de-duplication: it asks the domain expert a bounded number of queries and uses the responses to select the best clustering with high accuracy. Finally, we present an adaptive data store that can change the layout of the data based on the workload's access pattern, thereby speeding up data cleaning tasks.
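To make the notion of constraint-discovery primitives concrete, the snippet below shows the elementary check such primitives ultimately reduce to for functional dependencies: a candidate FD lhs -> rhs holds exactly when every combination of lhs values maps to a single rhs value. This is an illustrative pandas-based sketch of the check, not the distributed primitives developed in the dissertation.

```python
import pandas as pd

def fd_holds(df, lhs, rhs):
    """True iff the functional dependency lhs -> rhs holds exactly on df,
    i.e., every combination of lhs values maps to at most one rhs value."""
    return bool((df.groupby(list(lhs), dropna=False)[rhs]
                   .nunique(dropna=False) <= 1).all())

# Toy example: zip -> city holds, but city -> street does not.
df = pd.DataFrame({"zip":    [10115, 10115, 80331],
                   "city":   ["Berlin", "Berlin", "Munich"],
                   "street": ["A", "B", "C"]})
print(fd_holds(df, ["zip"], "city"))     # True
print(fd_holds(df, ["city"], "street"))  # False
```

In a distributed setting, the same check can be evaluated per partition and the partial results combined, which is the kind of rewriting the proposed primitives are meant to enable.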

    Estudio de problemas de clasificación supervisada y de localización en redes mediante optimización matemática (A study of supervised classification and network location problems via mathematical optimization)

    This PhD dissertation addresses several problems in the fields of Supervised Classification and Location Theory using tools and techniques from Mathematical Optimization. A brief description of these problems and the methodologies proposed for their analysis and resolution is given below. The first chapter discusses the principles of Supervised Classification and Location Theory in detail, emphasizing the topics studied in this thesis. The following two chapters address Supervised Classification problems. In particular, Chapter 2 proposes exact solution approaches for various models of Support Vector Machines (SVM) with ramp loss, a well-known classification method that limits the influence of outliers. The resulting models are analyzed to obtain initial bounds on the big M parameters included in the formulation. Then, solution approaches based on three strategies for obtaining tighter values of the big M parameters are proposed. Two of them require solving a sequence of continuous optimization problems, while the third uses Lagrangian relaxation. The derived resolution methods are valid for the l1-norm and l2-norm ramp loss formulations. They are tested and compared with existing solution methods on simulated and real-life datasets, showing the efficiency of the developed methodology. Chapter 3 presents a new SVM-based classifier that simultaneously limits the influence of outliers and performs feature selection. The influence of outliers is kept under control using the ramp loss margin error criterion, while feature selection is carried out by including a new family of binary variables and several constraints. The resulting model is formulated as a mixed-integer program with big M parameters. The characteristics of the model are analyzed, and two solution approaches (one exact and one heuristic) are proposed. The performance of the resulting classifier is compared with several classical classifiers on different datasets. The next two chapters deal with location problems, in particular two variants of the Maximal Covering Location Problem (MCLP) on networks. These variants model two different scenarios, with and without uncertainty in the input data. First, Chapter 4 presents an upgrading version of the MCLP with edge length modifications on networks. This problem aims at locating p facilities on the nodes of the network so as to maximize coverage, considering that the lengths of the edges can be reduced within a budget. Hence, we have to decide on the optimal location of the p facilities and the optimal edge length reductions. To solve it, we propose three different mixed-integer formulations and a preprocessing phase for fixing variables and removing some constraints. Moreover, we analyze the characteristics of these formulations and strengthen them by proposing valid inequalities. Finally, we compare the three formulations and their corresponding improvements by testing their performance on different datasets. The following chapter, Chapter 5, also considers an MCLP, albeit from the perspective of uncertainty. In particular, this chapter addresses a version of the single-facility MCLP on a network where the demand is distributed along the edges and is uncertain, with only an interval estimate known. We propose a minmax regret model in which the service facility can be located anywhere along the network.
Furthermore, we present two polynomial-time algorithms for finding the location that minimizes the maximum regret, assuming that the demand realization on each edge is either an unknown constant or an unknown linear function. We also include two illustrative examples as well as a computational study to show the potential of the proposed methodology.
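To see why the ramp loss limits the influence of outliers, it helps to compare it with the standard hinge loss: the ramp loss is the hinge loss truncated at a constant, so a point lying far on the wrong side of the margin can contribute at most that constant to the objective. The snippet below is a small illustrative sketch (with the truncation level set to 2, a common choice); it is not the mixed-integer formulations studied in the thesis.

```python
import numpy as np

def hinge_loss(margins):
    """Standard SVM hinge loss: unbounded for badly misclassified points."""
    return np.maximum(0.0, 1.0 - margins)

def ramp_loss(margins, cap=2.0):
    """Hinge loss truncated at `cap`: an outlier contributes at most `cap`."""
    return np.minimum(np.maximum(0.0, 1.0 - margins), cap)

margins = np.array([2.0, 0.5, -0.5, -10.0])   # y_i * f(x_i) for four points
print(hinge_loss(margins))   # [ 0.   0.5  1.5 11. ]  -> the outlier dominates
print(ramp_loss(margins))    # [ 0.   0.5  1.5  2. ]  -> its influence is capped
```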

    Basic Probability Theory

    Long title: Basic Probability Theory: Independent Random Variables and Sample Spaces. Chapters: Elementary Probability - Basic Probability - Canonical Sample Spaces - Working on Probability Spaces - A: Solutions to Exercises

    Approximate Data Mining Techniques on Clinical Data

    The past two decades have witnessed an explosion in the number of medical and healthcare datasets available to researchers and healthcare professionals. Considerable data collection efforts are underway, and this prompts the development of appropriate data mining techniques and tools that can automatically extract relevant information from the data and thereby provide insights into the clinical behaviors and processes the data captures. Since these tools should support the decision-making activities of medical experts, all the extracted information must be represented in a human-friendly way, that is, in a concise and easy-to-understand form. To this purpose, we propose a new framework that brings together several new mining techniques and tools. These techniques focus mainly on two aspects: the temporal one and the predictive one. All of the techniques were applied to clinical data, in particular ICU data from the MIMIC-III database. This application showed the flexibility of the framework, which is able to retrieve different outcomes from the overall dataset. The first two techniques rely on the concept of Approximate Temporal Functional Dependencies (ATFDs). ATFDs, with their suitable treatment of temporal information, have been proposed as a methodological tool for mining clinical data. An example of the knowledge derivable through such dependencies is "within 15 days, patients with the same diagnosis and the same therapy usually receive the same daily amount of drug". However, current ATFD models do not analyze the temporal evolution of the data, as in "for most patients with the same diagnosis, the same drug is prescribed after the same symptom". To this end, we propose a new kind of ATFD called Approximate Pure Temporally Evolving Functional Dependencies (APEFDs). Another limitation of such dependencies is that they cannot deal with quantitative data when some tolerance is allowed for numerical values. This limitation arises, in particular, in clinical data warehouses, where analysis and mining have to consider one or more quantitative measures (such as lab test results and vital signs) together with multiple dimensional (alphanumeric) attributes (such as patient, hospital, physician, and diagnosis) and several time dimensions (such as the day since hospitalization and the calendar date). For this scenario, we introduce another new kind of ATFD, named Multi-Approximate Temporal Functional Dependency (MATFD), which considers dependencies between dimensions and quantitative measures in temporal clinical data. These new dependencies may provide knowledge such as "within 15 days, patients with the same diagnosis and the same therapy receive a daily amount of drug within a fixed range". The remaining techniques are based on pattern mining, which has also been proposed as a methodological tool for mining clinical data. However, many methods proposed so far focus on mining temporal rules that describe relationships between data sequences or instantaneous events, without considering the presence of more complex temporal patterns in the dataset. Such patterns, for instance the trend of a particular vital sign, are often very relevant for clinicians. Moreover, it is of great interest to discover whether some kind of event, such as a drug administration, is capable of changing these trends, and how.
To this end, we propose a new kind of temporal pattern, called Trend-Event Patterns (TEPs), that focuses on events and their influence on trends that can be retrieved from measures such as vital signs. With TEPs we can express concepts such as "the administration of paracetamol to a patient with an increasing temperature leads to a decreasing trend in temperature after the administration occurs". We also analyze another pattern mining technique that includes prediction. This technique discovers a compact set of patterns that aim to describe the condition (or class) of interest. Our framework relies on a classification model that considers and combines various predictive pattern candidates and selects only those that are important for improving the overall class prediction performance. We show that our classification approach achieves a significant reduction in the number of extracted patterns, compared to state-of-the-art methods based on the minimal predictive pattern mining approach, while preserving the overall classification accuracy of the model. For each of the techniques described above, we developed a tool that retrieves the corresponding kind of rule. All the results were obtained by pre-processing and mining clinical data, in particular ICU data from the MIMIC-III database.
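Approximate dependencies of this kind are usually quantified by how far the data is from satisfying the dependency exactly, for example the minimum fraction of tuples that must be removed for it to hold (a g3-style error measure). The sketch below illustrates that measure for a plain, non-temporal approximate FD on a pandas DataFrame; it is a deliberately simplified illustration, not the APEFD or MATFD algorithms developed in the thesis.

```python
import pandas as pd

def afd_error(df, lhs, rhs):
    """g3-style error of lhs -> rhs: the minimum fraction of rows that must be
    removed so that each lhs group keeps a single rhs value."""
    kept = (df.groupby(list(lhs), dropna=False)[rhs]
              .apply(lambda s: s.value_counts(dropna=False).max())
              .sum())
    return 1.0 - kept / len(df)

# Toy example: diagnosis -> therapy holds for 3 of the 4 rows, so the error is 0.25.
df = pd.DataFrame({"diagnosis": ["flu", "flu", "flu", "angina"],
                   "therapy":   ["rest", "rest", "antiviral", "nitrates"]})
print(afd_error(df, ["diagnosis"], "therapy"))   # 0.25
```

An approximate dependency with tolerance e would then be accepted whenever this error does not exceed e.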