Some Contribution of Statistical Techniques in Big Data: A Review
Big Data is a popular research topic. Everyone is talking about big data, and it is believed that science, business, industry, government, society, etc. will undergo a thorough change under its impact. Big data refers to very large, complex data sets, containing hidden patterns and both structured and unstructured data, that are difficult to collect, store, and analyse; proper advanced techniques are therefore needed to gain knowledge from them. Big data research raises major challenges in storage, processing, search, sharing, transfer, analysis, and visualization. This paper discusses the nature of big data, its issues and management, and the techniques used to handle it, and presents a review of various advanced statistical techniques for key big data applications involving large data sets. These advanced techniques handle structured as well as unstructured big data in different areas.
Machine Learning for Synthetic Data Generation: A Review
Data plays a crucial role in machine learning. However, in real-world applications there are several problems with data: data may be of low quality; a limited number of data points can lead to under-fitting of the machine learning model; and data may be hard to access due to privacy, safety, and regulatory concerns. Synthetic data generation offers a promising new avenue, as synthetic data can be shared and used in ways that real-world data cannot. This paper systematically reviews existing work that leverages machine learning models for synthetic data generation. Specifically, we discuss synthetic data generation from several perspectives: (i) applications, including computer vision, speech, natural language, healthcare, and business; (ii) machine learning methods, particularly neural network architectures and deep generative models; and (iii) privacy and fairness issues. In addition, we identify the challenges and opportunities in this emerging field and suggest future research directions.
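As a rough illustration of the idea the review surveys, the sketch below fits a simple generative model to a toy dataset and samples synthetic records from it. The Gaussian mixture merely stands in for the deep generative models (GANs, VAEs, diffusion models) discussed in the paper, and all names and numbers are illustrative assumptions.

```python
# Minimal sketch: fit a generative model to "real" data and sample synthetic
# records from it. A Gaussian mixture stands in here for the deep generative
# models surveyed in the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(1000, 2))  # toy real data

model = GaussianMixture(n_components=3, random_state=0).fit(real)
synthetic, _ = model.sample(1000)  # synthetic records, shareable where the originals are not

# Sanity check: synthetic data should roughly match real summary statistics.
print(real.mean(axis=0), synthetic.mean(axis=0))
```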
Data Summarizations for Scalable, Robust and Privacy-Aware Learning in High Dimensions
The advent of large-scale datasets has offered unprecedented amounts of information for building statistically powerful machines but, at the same time, has introduced a remarkable computational challenge: how can we efficiently process massive data? This thesis presents a suite of data reduction methods that make learning algorithms scale on large datasets by extracting a succinct, model-specific representation that summarizes the full data collection: a coreset. Our frameworks support datasets of arbitrary dimensionality by design, and can be used for general-purpose Bayesian inference under real-world constraints, including privacy preservation and robustness to outliers, encompassing diverse uncertainty-aware data analysis tasks such as density estimation, classification and regression.
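To make the coreset idea concrete, here is a minimal sketch assuming the simplest possible construction, uniform subsampling with inverse-probability weights; the thesis's Bayesian constructions are considerably more sophisticated, but the weighted-summary principle is the same.

```python
# Illustrative sketch of a coreset: a small weighted subset whose weighted
# statistics approximate those of the full dataset. Uniform subsampling with
# inverse-probability weights is the simplest instance of the idea.
import numpy as np

rng = np.random.default_rng(1)
X = rng.lognormal(size=100_000)           # large dataset
m = 500                                   # coreset size
idx = rng.choice(len(X), size=m, replace=False)
w = np.full(m, len(X) / m)                # each point "stands in" for N/m points

full_sum = X.sum()
coreset_sum = (w * X[idx]).sum()          # weighted coreset estimate
print(full_sum, coreset_sum)              # close, at a fraction of the cost
```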
We motivate the necessity for novel data reduction techniques in the first place by developing a reidentification attack on coarsened representations of private behavioural data. Analysing longitudinal records of human mobility, we detect privacy-revealing structural patterns that remain preserved in reduced graph representations of individuals' information of manageable size. These unique patterns enable mounting linkage attacks via structural similarity computations on longitudinal mobility traces, revealing an overlooked, yet existing, privacy threat.
We then propose a scalable variational inference scheme for approximating posteriors on large datasets via learnable weighted pseudodata, termed pseudocoresets. We show that the use of pseudodata enables overcoming the constraints on minimum summary size for a given approximation quality that data dimensionality imposes on all existing Bayesian coreset constructions. Moreover, it allows us to develop a scheme for pseudocoreset-based summarization that satisfies the standard framework of differential privacy by construction; in this way, we can release reduced-size, privacy-preserving representations of sensitive datasets that are amenable to arbitrary post-processing.
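The privacy ingredient can be sketched in isolation. The toy example below releases a clipped mean under the standard Gaussian mechanism of differential privacy; the clip bound and privacy parameters are assumptions, and the thesis's pseudocoreset construction integrates the guarantee into the summary optimisation itself rather than adding noise to a statistic post hoc.

```python
# Sketch of the privacy ingredient only: releasing a data summary under
# (eps, delta)-differential privacy via the Gaussian mechanism.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=3.0, size=(10_000, 4))

C = 5.0                                        # assumed per-record L2 clip bound
X_clip = X * np.minimum(1.0, C / np.linalg.norm(X, axis=1, keepdims=True))

eps, delta = 1.0, 1e-5
sensitivity = 2 * C / len(X)                   # L2 sensitivity of the clipped mean
sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps

private_mean = X_clip.mean(axis=0) + rng.normal(scale=sigma, size=4)
print(private_mean)                            # safe to release and post-process
```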
Subsequently, we consider summarizations for large-scale Bayesian inference in scenarios where observed datapoints depart from the statistical assumptions of our model. Using robust divergences, we develop a method for constructing coresets resilient to model misspecification. Crucially, this method is able to automatically discard outliers from the generated data summaries. We thus deliver robustified, scalable representations for inference that are suitable for applications involving contaminated and unreliable data sources.
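A toy version of the robustness idea, assuming a crude median/MAD fit as a stand-in for the robust divergences used in the thesis: points that are implausible under a preliminary robust fit receive no weight in the summary.

```python
# Toy illustration: when building a summary, give contaminated points
# negligible weight. A median/MAD Gaussian fit stands in for the thesis's
# robust-divergence machinery.
import numpy as np

rng = np.random.default_rng(3)
clean = rng.normal(0.0, 1.0, size=9_500)
outliers = rng.normal(50.0, 1.0, size=500)     # contamination
X = np.concatenate([clean, outliers])

mu = np.median(X)
sigma = 1.4826 * np.median(np.abs(X - mu))     # MAD-based robust scale
logp = -0.5 * ((X - mu) / sigma) ** 2          # robust-fit log-density (up to a constant)
keep = logp > -8.0                             # drop points implausible under the fit

print(X.mean(), X[keep].mean())                # contaminated vs. robust summary mean
```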
We demonstrate the performance of the proposed summarization techniques on multiple parametric statistical models and on diverse simulated and real-world datasets, from music genre features to hospital readmission records, considering a wide range of data dimensionalities.
Nokia Bell Labs,
Lundgren Fund,
Darwin College, University of Cambridge,
Department of Computer Science & Technology, University of Cambridge
Towards Transparent and Trustworthy Cloud
Despite its immense benefits in terms of flexibility, resource consumption, and simplified management, cloud computing raises several concerns due to a lack of trust and transparency. Like all computing paradigms based on outsourcing, the use of cloud computing is largely a matter of trust. There is increasing pressure from cloud customers for solutions that would increase their confidence that a cloud service/application is behaving in a secure and correct manner. Cloud assurance techniques, developed to assess the trustworthiness of cloud services, can play a major role in building trust. In this paper, we start from the assumption that an opaque cloud is incompatible with security, and present a reliable evidence collection process and infrastructure that extend existing assurance techniques towards the definition of a trustworthy cloud. The proposed process and infrastructure are applied to a case study on cloud certification, showing their utility.
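As a hypothetical illustration of one building block such an evidence collection infrastructure might rely on, the sketch below keeps collected evidence in a hash-chained, append-only log so that later tampering is detectable. The record fields and probe names are invented, and the paper's actual infrastructure is richer than this.

```python
# Hypothetical sketch: a hash-chained, append-only evidence log. Each entry
# commits to its predecessor, so modifying any recorded evidence breaks
# verification of every later entry.
import hashlib, json, time

def append_evidence(chain, record):
    prev = chain[-1]["digest"] if chain else "0" * 64
    entry = {"ts": time.time(), "record": record, "prev": prev}
    entry["digest"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    chain.append(entry)

def verify(chain):
    for i, entry in enumerate(chain):
        body = {k: v for k, v in entry.items() if k != "digest"}
        if entry["prev"] != (chain[i - 1]["digest"] if i else "0" * 64):
            return False
        if entry["digest"] != hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest():
            return False
    return True

chain = []
append_evidence(chain, {"probe": "vm-integrity", "result": "pass"})  # invented probes
append_evidence(chain, {"probe": "tls-config", "result": "pass"})
print(verify(chain))  # True; altering any stored byte makes it False
```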
Spatial generalization and aggregation of massive movement data.
Movement data (trajectories of moving agents) are hard to visualize: numerous intersections and overlaps between trajectories make the display heavily cluttered and illegible. It is necessary to use appropriate data abstraction methods. We suggest a method for spatial generalization and aggregation of movement data, which transforms trajectories into aggregate flows between areas. It is assumed that no predefined areas are given; we have devised a special method for partitioning the underlying territory into appropriate areas, based on extracting significant points from the trajectories. The resulting abstraction conveys essential characteristics of the movement, and the degree of abstraction can be controlled through the parameters of the method. We introduce local and global numeric measures of the quality of the generalization, and suggest an approach to improving the quality in selected parts of the territory where this is deemed necessary. The suggested method can be used in interactive visual exploration of movement data and for creating legible flow maps for presentation purposes.
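A minimal sketch of the pipeline described above, with stated assumptions: k-means centres stand in for the paper's significant-point extraction, nearest-centre cells stand in for the territory partition, and consecutive positions falling in different cells are aggregated into flows.

```python
# Hedged sketch of trajectory generalization: partition the territory using
# characteristic points (k-means as a stand-in), then aggregate consecutive
# positions into flows between the resulting areas.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# toy trajectories: random walks of (x, y) positions
trajectories = [np.cumsum(rng.normal(size=(50, 2)), axis=0) for _ in range(30)]

points = np.vstack(trajectories)
k = 8                                           # abstraction level (a method parameter)
cells = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)

flows = Counter()
for traj in trajectories:
    labels = cells.predict(traj)
    for a, b in zip(labels[:-1], labels[1:]):
        if a != b:                              # count only inter-area moves
            flows[(a, b)] += 1

for (a, b), n in flows.most_common(5):
    print(f"area {a} -> area {b}: {n} moves")
```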
Structured and unstructured data integration with electronic medical records
In recent years there has been great population growth and technological evolution all over the world. At the same time, areas beyond information technology have also developed, namely medicine, which has led to an increase in average life expectancy, which in turn leads to a greater need for healthcare.
In order to provide the best possible treatments and healthcare services, hospitals nowadays store large amounts of data about patients and diseases (in the form of electronic medical records), as well as about the logistics of some departments, in their storage systems. Computer science techniques such as data mining and natural language processing have therefore been used to extract knowledge and value from these information-rich sources, not only to develop, for example, new models for disease prediction, but also to improve existing processes in healthcare centres and hospitals. This data can be stored in one of three ways: structured, unstructured or semi-structured.
In this paper, the author tested the integration of structured and unstructured data from two different departments of the same Portuguese hospital, in order to extract knowledge and improve hospital processes, aiming to reduce the loss of value from data that is stored but never used in healthcare providers' systems.
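A hypothetical sketch of the integration step, assuming invented column names and a simple keyword lookup in place of a full NLP pipeline: structured admission records are joined with features extracted from unstructured clinical notes, so that both can feed the same mining models.

```python
# Illustrative integration of structured records with terms extracted from
# unstructured notes. Schema and vocabulary are assumptions, not the
# hospital's actual data model.
import pandas as pd

structured = pd.DataFrame({
    "patient_id": [1, 2],
    "age": [64, 57],
    "department": ["cardiology", "oncology"],
})
notes = pd.DataFrame({
    "patient_id": [1, 2],
    "note": ["Patient reports chest pain and dyspnea.",
             "Follow-up after chemotherapy; mild nausea."],
})

VOCAB = ["chest pain", "dyspnea", "nausea"]     # toy clinical term list
for term in VOCAB:
    notes[term.replace(" ", "_")] = notes["note"].str.contains(term, case=False)

integrated = structured.merge(notes.drop(columns="note"), on="patient_id")
print(integrated)
```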
Privacy by Design in Data Mining
Privacy is an ever-growing concern in our society: the lack of reliable privacy safeguards in many current services and devices is one reason why their diffusion is often more limited than expected. Moreover, people feel reluctant to provide true personal data unless it is absolutely necessary. Thus, privacy is becoming a fundamental aspect to take into account when one wants to use, publish and analyze data involving sensitive information. Many recent research works have focused on the study of privacy protection: some of these studies aim at individual privacy, i.e., the protection of sensitive individual data, while others aim at corporate privacy, i.e., the protection of strategic information at the organization level. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data, characterized by unprecedented opportunities to sense, store and analyze complex data describing human activities in great detail and resolution. As a result, anonymization simply cannot be accomplished by de-identification. In the last few years, several techniques for creating anonymous or obfuscated versions of data sets have been proposed, which essentially aim to find an acceptable trade-off between data privacy on the one hand and data utility on the other. So far, the common result obtained is that no general method exists which is capable of both dealing with "generic personal data" and preserving "generic analytical results".
In this thesis we propose the design of technological frameworks to counter the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of data mining technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start. Therefore, we propose the privacy-by-design paradigm, which sheds a new light on the study of privacy protection: once specific assumptions are made about the sensitive data and the target mining queries that are to be answered with the data, it is conceivable to design a framework to: a) transform the source data into an anonymous version with a quantifiable privacy guarantee, and b) guarantee that the target mining queries can be answered correctly using the transformed data instead of the original ones.
This thesis investigates two new research issues that arise in modern data mining and data privacy: individual privacy protection in data publishing while preserving specific data mining analyses, and corporate privacy protection in data mining outsourcing.
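As a toy example of the privacy/utility trade-off this thesis designs for, the sketch below generalises quasi-identifiers and suppresses small groups (a basic k-anonymity transformation), then answers a target mining query on the released data. The schema and thresholds are illustrative assumptions, not the thesis's framework.

```python
# Toy privacy-by-design flow: anonymise with the target query in mind, then
# check the query is still answerable on the transformed data.
import pandas as pd

df = pd.DataFrame({
    "age": [23, 27, 31, 36, 38, 41, 44, 49],
    "zip": ["4150", "4150", "4200", "4200", "4200", "4300", "4300", "4300"],
    "dx":  [0, 1, 0, 0, 1, 1, 0, 1],             # sensitive attribute
})

k = 3
df["age_band"] = (df["age"] // 10) * 10           # generalisation of a quasi-identifier
released = df.groupby(["age_band", "zip"]).filter(lambda g: len(g) >= k)  # suppress small groups

# Target mining query, answered on the k-anonymous release:
print(released.groupby("age_band")["dx"].mean())
```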
A Data-driven Methodology Towards Mobility- and Traffic-related Big Spatiotemporal Data Frameworks
The human population is increasing at unprecedented rates, particularly in urban areas. This increase, along with the rise of a more economically empowered middle class, brings new and complex challenges to the mobility of people within urban areas. To tackle these challenges, transportation and mobility authorities and operators are trying to adopt innovative Big Data-driven mobility- and traffic-related solutions. Such solutions will support decision-making processes that aim to ease the load on an already overloaded transport infrastructure. The information collected from day-to-day mobility and traffic can help mitigate some of these mobility challenges in urban areas.
Road infrastructure and traffic management operators (RITMOs) face several limitations in effectively extracting value from the exponentially growing volumes of mobility- and traffic-related Big Spatiotemporal Data (MobiTrafficBD) that are being acquired and gathered. Research on Big Data, spatiotemporal data and especially MobiTrafficBD is scattered, and the existing literature does not offer a concrete, common methodological approach to set up, configure, deploy and use a complete Big Data-based framework to manage the lifecycle of mobility-related spatiotemporal data, mainly focused on geo-referenced time series (GRTS) and spatiotemporal events (ST Events), extract value from it and support the decision-making processes of RITMOs.
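For concreteness, assumed minimal schemas for the two data types the methodology centres on are sketched below: GRTS as regular readings from fixed, geo-located sources, and ST Events as discrete occurrences with a position and a time span. Field names are illustrative, not taken from the thesis.

```python
# Assumed minimal schemas for GRTS and ST Events, plus one typical framework
# task: relating readings to an event's time window.
import pandas as pd

grts = pd.DataFrame({
    "sensor_id": ["S1", "S1", "S2"],
    "lat": [38.736, 38.736, 38.722],
    "lon": [-9.142, -9.142, -9.139],
    "ts": pd.to_datetime(["2023-05-01 08:00", "2023-05-01 08:05", "2023-05-01 08:00"]),
    "vehicles_per_min": [42, 57, 13],
})

st_events = pd.DataFrame({
    "event_id": ["E1"],
    "lat": [38.730],
    "lon": [-9.140],
    "start": pd.to_datetime(["2023-05-01 08:02"]),
    "end": pd.to_datetime(["2023-05-01 08:40"]),
    "kind": ["accident"],
})

e = st_events.iloc[0]
affected = grts[(grts["ts"] >= e["start"]) & (grts["ts"] <= e["end"])]
print(affected)
```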
This doctoral thesis proposes a data-driven, prescriptive methodological approach towards the design, development and deployment of MobiTrafficBD Frameworks focused on GRTS and ST Events. Besides a thorough literature review on spatiotemporal data, Big Data and the merging of these two fields through MobiTrafficBD, the methodological approach comprises a set of general characteristics, technical requirements, logical components, data flows and technological infrastructure models, as well as guidelines and best practices that aim to guide researchers, practitioners and stakeholders, such as RITMOs, throughout the design, development and deployment phases of any MobiTrafficBD Framework.
This work is intended to be a supporting methodological guide, based on widely used Reference Architectures and guidelines for Big Data, but enriched with the inherent characteristics and concerns brought about by Big Spatiotemporal Data, as in the case of GRTS and ST Events. The proposed methodology was evaluated and demonstrated in various real-world use cases that deployed MobiTrafficBD-based Data Management, Processing, Analytics and Visualisation methods, tools and technologies, under the umbrella of several research projects funded by the European Commission and the Portuguese Government.