    Synthetic Order Data Generator for Picking Data

    Sample data are in high demand for testing and benchmarking purposes. Like many other fields, warehousing, and specifically the order picking process, is not exempt from the need for sample data. Sample data are used in order picking processes to test new methodologies such as new routing and storage allocation approaches. Unfortunately, access to real order picking data is limited by confidentiality and privacy concerns, which makes it difficult to obtain practical results for new methodologies. Moreover, order data follow a highly complex and correlated structure that cannot be easily extracted and replicated. We propose a two-part synthetic data generator that extracts and mimics the general fabric of a set of real data and produces a conceptually unlimited number of orders over any number of SKUs while keeping that structure largely intact. Such data can fill the gap of missing data in order picking process benchmarking.
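    The paper's generator is not reproduced here, but the two-part idea can be sketched minimally: part one extracts simple marginal statistics (order-size and SKU-popularity distributions) from real orders, and part two samples an arbitrary number of new orders from them. The statistics chosen and all names below are illustrative assumptions, not the authors' method, which aims to preserve far richer correlation structure than these marginals.

        import random
        from collections import Counter

        def fit_order_model(real_orders):
            # Part 1 (assumed): extract order-size and SKU-popularity statistics.
            sizes = Counter(len(o) for o in real_orders)       # order-size distribution
            skus = Counter(s for o in real_orders for s in o)  # SKU popularity
            return sizes, skus

        def generate_orders(model, n_orders, seed=None):
            # Part 2 (assumed): sample synthetic orders that mimic those statistics.
            sizes, skus = model
            rng = random.Random(seed)
            size_vals, size_wts = zip(*sizes.items())
            sku_vals, sku_wts = zip(*skus.items())
            orders = []
            for _ in range(n_orders):
                k = rng.choices(size_vals, weights=size_wts)[0]
                order = set()
                while len(order) < k:  # k distinct SKUs, biased by popularity
                    order.add(rng.choices(sku_vals, weights=sku_wts)[0])
                orders.append(sorted(order))
            return orders

        real = [["A", "B"], ["A", "C", "D"], ["B"], ["A", "B", "D"]]
        print(generate_orders(fit_order_model(real), 5, seed=7))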

    A supervised generative optimization approach for tabular data

    Synthetic data generation has emerged as a crucial topic for financial institutions, driven by factors such as privacy protection and data augmentation. Many algorithms have been proposed for synthetic data generation, but reaching consensus on which method to use for a specific dataset and use case remains challenging. Moreover, the majority of existing approaches are "unsupervised" in the sense that they do not take the downstream task into account. To address these issues, this work presents a novel synthetic data generation framework that integrates a supervised component tailored to the specific downstream task and employs a meta-learning approach to learn the optimal mixture distribution over existing synthetic distributions.
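    As a rough illustration of supervised mixture learning (not the paper's algorithm: the grid search, the toy candidate generators, and the logistic-regression downstream task are all assumptions), one can score each candidate mixture of synthetic sources by the validation performance of a model trained on data sampled from it:

        import itertools
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score

        def downstream_score(train_X, train_y, val_X, val_y):
            # Score a synthetic training set by validation AUC of the downstream model.
            clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)
            return roc_auc_score(val_y, clf.predict_proba(val_X)[:, 1])

        def best_mixture(generators, n, val_X, val_y, step=0.25):
            # Grid-search mixture weights over candidate generators; keep the
            # mixture whose sampled training data best serves the downstream task.
            best_w, best_s = None, -np.inf
            grid = np.arange(0.0, 1.0 + 1e-9, step)
            for w in itertools.product(grid, repeat=len(generators)):
                if not np.isclose(sum(w), 1.0):
                    continue
                counts = np.random.multinomial(n, w)
                parts = [g(k) for g, k in zip(generators, counts) if k > 0]
                X = np.vstack([p[0] for p in parts])
                y = np.concatenate([p[1] for p in parts])
                s = downstream_score(X, y, val_X, val_y)
                if s > best_s:
                    best_w, best_s = w, s
            return best_w, best_s

        rng = np.random.default_rng(0)
        def gen_a(k):  # hypothetical synthetic source A
            X = rng.normal(0.0, 1.0, (k, 2))
            return X, (X[:, 0] > 0).astype(int)
        def gen_b(k):  # hypothetical synthetic source B
            X = rng.normal(0.5, 1.0, (k, 2))
            return X, (X.sum(axis=1) > 1.0).astype(int)

        val_X = rng.normal(0.0, 1.0, (200, 2))
        val_y = (val_X[:, 0] > 0).astype(int)
        print(best_mixture([gen_a, gen_b], 300, val_X, val_y))

    The paper's meta-learning component presumably replaces this brute-force grid with a learned weighting, but the objective, downstream utility rather than distributional fit alone, is the same.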

    Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study

    In this paper, we propose a method for measuring the similarity of low-sample tabular data to synthetically generated data with a larger number of samples than the original. This process is also known as data augmentation. However, significance levels obtained from non-parametric tests are suspect when the sample size is small. Our method uses a combination of geometry, topology, and robust statistics for hypothesis testing in order to assess the validity of the generated data. We also compare the results with common global metric methods available in the literature for large-sample data.
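    The abstract does not name its exact statistics; a minimal sketch of distribution-free testing in the same spirit is a permutation test on a robust, median-based statistic (the statistic chosen here is an assumption, not the paper's geometry/topology machinery):

        import numpy as np

        def median_gap(a, b):
            # Robust test statistic: distance between coordinate-wise medians.
            return float(np.linalg.norm(np.median(a, axis=0) - np.median(b, axis=0)))

        def permutation_test(real, synth, n_perm=2000, seed=0):
            # Permutation p-value for H0: real and synthetic samples
            # come from the same distribution.
            rng = np.random.default_rng(seed)
            observed = median_gap(real, synth)
            pooled = np.vstack([real, synth])
            n, hits = len(real), 0
            for _ in range(n_perm):
                perm = rng.permutation(len(pooled))
                if median_gap(pooled[perm[:n]], pooled[perm[n:]]) >= observed:
                    hits += 1
            return (hits + 1) / (n_perm + 1)

        real = np.random.default_rng(1).normal(0, 1, (30, 4))    # small real sample
        synth = np.random.default_rng(2).normal(0, 1, (300, 4))  # larger synthetic sample
        print(permutation_test(real, synth))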

    Random generation of realistic spatial data for use in classroom assessments

    We describe the design, development, implementation, and usage of a software tool called RADIAN (RAnDom spatIal dAta geNerator) for generating realistic spatial datasets for use within teaching and learning environments that require spatial data. RADIAN provides configurable functionality for users to generate spatial datasets containing geometric points with associated parameters/attributes, which can then be easily imported into a PostgreSQL/PostGIS database or visualised using a GIS or equivalent software. Much work has been carried out in areas such as statistics, machine learning, and software testing on how to generate realistic datasets for testing hypotheses, training and validating machine learning models, analysing algorithms, and rigorously testing software against real-world data. However, less work has been published on the generation of realistic spatial data, and this thesis contributes to that body of work. We outline the theoretical approach used in RADIAN to generate random geometric points within a given spatial extent (polygon) and demonstrate its effectiveness with a suite of example scenarios of generated spatial data. We believe RADIAN will be very useful to both teachers and spatial analysts who require realistic randomly generated data for spatial analysis purposes. The software code is available as open-source software via GitHub. The thesis concludes with suggestions for further research and development work that follow from the research and development of RADIAN.
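    The thesis documents RADIAN's own sampling approach; the textbook way to generate uniformly random points within a polygonal extent is rejection sampling against the bounding box, sketched here with the shapely library (a generic illustration, not RADIAN's code):

        import random
        from shapely.geometry import Point, Polygon

        def random_points_in_polygon(polygon, n, seed=None):
            # Uniform random points inside a polygon via
            # bounding-box rejection sampling.
            rng = random.Random(seed)
            minx, miny, maxx, maxy = polygon.bounds
            points = []
            while len(points) < n:
                p = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
                if polygon.contains(p):  # reject points outside the extent
                    points.append(p)
            return points

        extent = Polygon([(0, 0), (4, 0), (4, 3), (1, 4)])
        for p in random_points_in_polygon(extent, 5, seed=42):
            print(p.x, p.y)

    Rejection sampling preserves uniformity exactly; its cost grows with the ratio of bounding-box area to polygon area, which is acceptable for classroom-scale extents.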

    Generating public transport data based on population distributions for RDF benchmarking

    When benchmarking RDF data management systems such as public transport route planners, system evaluation needs to happen under various realistic circumstances, which requires a wide range of datasets with different properties. Real-world datasets are almost ideal, as they offer these realistic circumstances, but they are often hard to obtain and inflexible for testing. For these reasons, synthetic dataset generators are typically preferred over real-world datasets due to their intrinsic flexibility. Unfortunately, many synthetic datasets that are generated within benchmarks are insufficiently realistic, raising questions about the generalizability of benchmark results to real-world scenarios. In order to benchmark geospatial and temporal RDF data management systems such as route planners with sufficient external validity and depth, we designed PODiGG, a highly configurable generation algorithm for synthetic public transport datasets with realistic geospatial and temporal characteristics comparable to those of their real-world variants. The algorithm is inspired by real-world public transit network design and scheduling methodologies. This article discusses the design and implementation of PODiGG and validates the properties of its generated datasets. Our findings show that the generator achieves a sufficient level of realism, based on the existing coherence metric and new metrics we introduce specifically for the public transport domain. PODiGG thereby provides a flexible foundation for benchmarking RDF data management systems with geospatial and temporal data.
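    PODiGG's full pipeline is described in the article; its starting point, deriving stop locations from population distributions, can be illustrated with a small sketch (the population-grid input and the weighted sampling below are illustrative assumptions, not PODiGG's implementation):

        import numpy as np

        def sample_stops(population_grid, n_stops, seed=0):
            # Sample stop cells with probability proportional to population
            # density, so dense regions receive more stops, echoing real
            # transit network design.
            rng = np.random.default_rng(seed)
            density = np.asarray(population_grid, dtype=float)
            probs = (density / density.sum()).ravel()
            cells = rng.choice(density.size, size=n_stops, replace=False, p=probs)
            return [divmod(int(c), density.shape[1]) for c in cells]  # (row, col)

        population = [[90, 10, 5],
                      [60, 40, 5],
                      [10,  5, 1]]
        print(sample_stops(population, 4))

    A full generator would then connect stops into routes and attach timetables; the population-weighted placement above is only the first of those stages.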

    Synthetic Data Generation Using Wasserstein Conditional GANs With Gradient Penalty (WCGANS-GP)

    With data protection requirements becoming stricter, data privacy has become more crucial than ever. This has led to restrictions on the availability and dissemination of real-world datasets. Synthetic data offers a viable solution to overcome barriers to data access and sharing. Existing data generation methods require a great deal of user-defined rules, manual interaction, and domain-specific knowledge. Moreover, they are not able to balance the trade-off between data usability and privacy. Deep learning based methods such as GANs have seen remarkable success in synthesizing images by automatically learning the complicated distributions and patterns of real data, but they often suffer from instability during the training process.
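    The gradient penalty referenced in the title is the standard WGAN-GP term, which penalizes the critic when the gradient norm at points interpolated between real and fake samples deviates from 1. A minimal PyTorch sketch of that term (generic WGAN-GP, without the paper's conditioning):

        import torch

        def gradient_penalty(critic, real, fake, lambda_gp=10.0):
            # WGAN-GP term: lambda * E[(||grad_x critic(x_hat)||_2 - 1)^2],
            # where x_hat interpolates between real and fake samples.
            eps = torch.rand(real.size(0), 1, device=real.device)
            x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
            scores = critic(x_hat)
            grads, = torch.autograd.grad(
                outputs=scores, inputs=x_hat,
                grad_outputs=torch.ones_like(scores),
                create_graph=True)
            return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

        critic = torch.nn.Sequential(
            torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
        real, fake = torch.randn(16, 8), torch.randn(16, 8)
        print(gradient_penalty(critic, real, fake).item())

    Enforcing this soft Lipschitz constraint is what stabilizes training relative to weight clipping in the original WGAN.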

    Storing and querying evolving knowledge graphs on the web

    Extraction of spatio-temporal relations from environmental and health data

    Thanks to new technologies (smartphones, sensors, etc.), large amounts of spatiotemporal data are now available. The associated databases can be called spatiotemporal databases because each row is described by spatial information (e.g. a city, a neighborhood, a river) and temporal information (e.g. the date of an event). These data are often complex and heterogeneous and generate new needs in knowledge extraction methods able to handle such constraints (e.g. following phenomena in time and space). Many phenomena with complex dynamics are thus associated with spatiotemporal data. For instance, the dynamics of an infectious disease can be described as the interactions between humans and the transmission vector, together with the spatiotemporal mechanisms involved in its development; modifying one of these components can trigger changes in the interactions between the components and ultimately alter the overall behavior of the system. To deal with these new challenges, new processes and methods must be developed to exploit all the available data. In this context, spatiotemporal data mining is defined as the set of techniques and methods used to obtain useful knowledge from large volumes of spatiotemporal data. This thesis falls within the general framework of spatiotemporal data mining and sequential pattern mining. More specifically, two generic pattern mining methods are proposed. The first allows us to extract sequential patterns that include the spatial characteristics of the data. In the second, we propose a new type of pattern called spatio-sequential patterns, which is used to study the evolution of a set of events describing an area and its nearby environment. Both approaches were tested on real datasets associated with two spatiotemporal phenomena: the pollution of rivers in France and the epidemiological monitoring of dengue in New Caledonia. In addition, two quality measures and a pattern visualization prototype are provided to assist experts in the selection of interesting patterns.
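    The thesis defines spatio-sequential patterns formally; the underlying sequential pattern mining primitive, support counting of an ordered pattern over per-zone event sequences, can be sketched generically (the event names, data, and threshold below are hypothetical):

        def occurs_in(pattern, sequence):
            # True if the pattern's itemsets appear, in order, within the sequence.
            it = iter(sequence)
            return all(any(step <= event for event in it) for step in pattern)

        def support(pattern, database):
            # Fraction of zone sequences that contain the pattern.
            return sum(occurs_in(pattern, seq) for seq in database) / len(database)

        # Each zone yields a timestamped sequence of event sets.
        database = [
            [{"rain"}, {"pollution_peak"}, {"algae_bloom"}],
            [{"rain"}, {"algae_bloom"}],
            [{"pollution_peak"}, {"rain"}],
        ]
        pattern = [{"rain"}, {"algae_bloom"}]  # rain, later followed by a bloom
        print(support(pattern, database))      # 2 of 3 zones -> approx. 0.67

    A pattern is kept as frequent when its support exceeds a user-chosen threshold; the spatio-sequential extension additionally ties each itemset to a zone or to its neighboring zones.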