
    The ATEN Framework for Creating the Realistic Synthetic Electronic Health Record

    Realistic synthetic data are increasingly recognized as a solution to a lack of data or to privacy concerns in healthcare and other domains, yet little effort has been expended in establishing a generic framework for characterizing, achieving and validating realism in Synthetic Data Generation (SDG). The objectives of this paper are to: (1) present a characterization of the concept of realism as it applies to synthetic data; and (2) present and demonstrate application of the generic ATEN Framework for achieving and validating realism in SDG. The characterization of realism is developed through insights obtained from analysis of the literature on SDG. The generic methods for achieving and validating realism were developed using knowledge discovery in databases (KDD): data mining enhanced with concept analysis and the identification of characteristic and classification rules. Application of the framework is demonstrated using a synthetic Electronic Health Record (EHR) for the domain of midwifery. The knowledge discovery process improves and expedites generation; a more complete understanding of the knowledge required to create the synthetic data significantly reduces the number of generation iterations. The validation process shows similar efficiencies, using the discovered knowledge as the elements for assessing the generated synthetic data. Successful validation supports claims of realism and resolves whether the synthetic data are a sufficient replacement for real data. The ATEN Framework supports the researcher in identifying the knowledge elements that need to be synthesized, and supports claims of sufficient realism through the use of that knowledge in a structured approach to validation. When used for SDG, the ATEN Framework enables a complete analysis of the source data for the knowledge necessary for correct generation, and assures the researcher that the synthetic data being created are realistic enough to replace real data for a given use case.
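    The validation idea at the core of such a framework lends itself to a simple illustration: knowledge discovered from the real data (marginal distributions, characteristic rules) becomes the yardstick against which the synthetic data are judged. The following is a minimal sketch of that idea, not the authors' ATEN implementation; the column names, rules and thresholds are illustrative assumptions for a hypothetical midwifery EHR.

```python
# Minimal sketch of rule-based validation for synthetic records, assuming
# hypothetical midwifery attributes. Knowledge discovered from the real
# data (marginal distributions, characteristic rules) is used to score
# the synthetic data.
import pandas as pd
from scipy.stats import ks_2samp

def validate_synthetic(real, synth, numeric_cols, rules, alpha=0.05):
    """Return per-column KS results and per-rule pass rates."""
    report = {}
    # Marginal realism: two-sample Kolmogorov-Smirnov test per attribute.
    for col in numeric_cols:
        stat, p = ks_2samp(real[col].dropna(), synth[col].dropna())
        report[f"ks:{col}"] = {"statistic": stat, "p_value": p,
                               "similar": p > alpha}
    # Characteristic rules discovered from the source data.
    for name, predicate in rules.items():
        report[f"rule:{name}"] = float(synth.apply(predicate, axis=1).mean())
    return report

# Illustrative rules for a hypothetical midwifery EHR:
rules = {
    "plausible_gestation": lambda r: 20 <= r["gestation_weeks"] <= 44,
    "plausible_birth_weight": lambda r: 300 <= r["birth_weight_g"] <= 6500,
}
# report = validate_synthetic(real_df, synth_df,
#                             ["gestation_weeks", "birth_weight_g"], rules)
```

    A low rule pass rate or a rejected KS test would send the researcher back to the generation step, mirroring the iterative generate-and-validate loop described above.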

    A rainfall disaggregation scheme for sub-hourly time scales: coupling a Bartlett-Lewis based model with adjusting procedures

    Many hydrological applications, such as flood studies, require long rainfall records at fine time scales, ranging from daily down to a 1-min time step. In the real world, however, data availability at sub-hourly scales is limited. To cope with this issue, stochastic disaggregation techniques are typically employed to produce plausible, statistically consistent rainfall events that aggregate up to the field data collected at coarser scales. A methodology for the stochastic disaggregation of rainfall at fine time scales was recently introduced, combining the Bartlett-Lewis process, which generates rainfall events, with adjusting procedures that modify the lower-level variables (i.e., hourly) so as to be consistent with the higher-level ones (i.e., daily). In the present paper, we extend this scheme, initially designed and tested for the disaggregation of daily rainfall into hourly depths, to any sub-hourly time scale. In addition, we take advantage of recent developments in Poisson-cluster processes, incorporating into the methodology a Bartlett-Lewis model variant that introduces dependence between cell intensity and duration in order to capture the variability of rainfall at sub-hourly time scales. The disaggregation scheme is implemented in an R package, named HyetosMinute, which supports disaggregation from the daily down to the 1-min time scale. The applicability of the methodology was assessed on 5-min rainfall records collected in Bochum, Germany, comparing the performance of the above-mentioned model variant against the original Bartlett-Lewis process (non-random, with 5 parameters). The analysis shows that the disaggregation process adequately reproduces the most important statistical characteristics of rainfall over a wide range of time scales, while the model with dependent intensity and duration performs better in terms of skewness, rainfall extremes and dry proportions.
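    The adjusting step is easy to illustrate. A common choice is a proportional adjusting procedure: the synthetic fine-scale depths within each coarse interval are rescaled so that they aggregate exactly to the observed total. The sketch below is an illustrative assumption, not the HyetosMinute code.

```python
# Minimal sketch of proportional adjusting in rainfall disaggregation:
# synthetic fine-scale depths (e.g., from a Bartlett-Lewis model) are
# rescaled so that they sum exactly to the observed coarse-scale total.
import numpy as np

def proportional_adjust(fine_depths, coarse_total):
    """Rescale fine-scale rainfall depths to match the coarse-scale total.

    fine_depths  -- synthetic depths within one coarse interval (e.g., 288
                    five-minute values in one day), all non-negative
    coarse_total -- the observed total for that interval
    """
    s = fine_depths.sum()
    if s == 0:
        # A fully dry synthetic day cannot be rescaled to a wet observed
        # day; the generation step would be repeated in that case.
        return fine_depths
    return fine_depths * (coarse_total / s)

# Example: 288 synthetic 5-min depths adjusted to a 23.4 mm observed day.
rng = np.random.default_rng(1)
synthetic = rng.exponential(0.1, 288) * (rng.random(288) < 0.2)
adjusted = proportional_adjust(synthetic, 23.4)
assert abs(adjusted.sum() - 23.4) < 1e-9
```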

    Annual and seasonal discharge prediction in the middle Danube River basin based on a modified TIPS (Tendency, Intermittency, Periodicity, Stochasticity) methodology

    This paper presents short-term predictions of annual and seasonal discharge derived by a modified TIPS (Tendency, Intermittency, Periodicity and Stochasticity) methodology. The TIPS method (Yevjevich, 1984) is modified so that the annual time scale is used instead of the daily one. A seasonal component is extracted from the discharge time series in an attempt to identify the long-term stochastic behaviour. The methodology is applied to modelling annual discharges at six gauging stations in the middle Danube River basin, using data observed in the common period from 1931 to 2012. The model performance measures suggest that the modelled time series match the observations reasonably well. The model is then used for short-term predictions three annual steps ahead (2013-2015). For the larger river basins under moderate hydrological conditions, the annual discharge predictions agree reasonably with the records, with relative errors from -8% to +3%. In wet and dry periods, however, these basins show significant departures from the annual observations. The smaller river basins display greater deviations, of up to 26% of the observed annual discharges, although the accuracy of the annual predictions does not strictly depend on the prevailing hydrological conditions.
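    As a rough illustration of the TIPS idea, the sketch below decomposes a discharge series into a linear tendency, a periodic seasonal component and an AR(1) stochastic residual, then uses the decomposition for short-term prediction. This is a hypothetical, simplified variant, not the paper's exact model; the four-season period and the AR(1) choice are assumptions.

```python
# Minimal sketch of a TIPS-style decomposition: linear tendency,
# periodic (seasonal) component, AR(1) stochasticity. Hypothetical,
# simplified variant; period=4 assumes four seasons per year.
import numpy as np

def tips_decompose(q, period=4):
    t = np.arange(len(q), dtype=float)
    b, a = np.polyfit(t, q, 1)               # tendency: slope b, intercept a
    detrended = q - (a + b * t)
    idx = np.arange(len(q)) % period
    seas_mean = np.array([detrended[idx == s].mean() for s in range(period)])
    seas_std = np.array([detrended[idx == s].std(ddof=1) for s in range(period)])
    z = (detrended - seas_mean[idx]) / seas_std[idx]  # standardized residuals
    rho = np.corrcoef(z[:-1], z[1:])[0, 1]            # lag-1 autocorrelation
    return (a, b), seas_mean, seas_std, rho, z

def tips_predict(q, steps, period=4):
    (a, b), sm, ss, rho, z = tips_decompose(q, period)
    n, zt, out = len(q), z[-1], []
    for k in range(steps):
        zt = rho * zt                         # AR(1) conditional expectation
        s = (n + k) % period
        out.append(a + b * (n + k) + sm[s] + ss[s] * zt)
    return np.array(out)

# Example: predict three steps ahead from a synthetic seasonal series.
rng = np.random.default_rng(0)
series = 100 + 0.05 * np.arange(328) + np.tile([30, -10, -25, 5], 82) \
         + rng.normal(0, 8, 328)
print(tips_predict(series, 3))
```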

    A spectral-domain approach for the calibration of shot noise-based models of daily streamflows

    Since the 1950s, the scientific community has been strongly interested in the modeling of several classes of stochastic processes, among which hydro-meteorological phenomena have attracted particular attention. Both their synthetic reproduction and their forecasting are central to a wide class of problems, such as the design and management of water resources systems and flood risk analysis. When modeling generic stochastic variables, two crucial aspects must be addressed: the identification of the most appropriate model for correctly reproducing the statistical features of the real process (model selection), and the estimation of the parameters of the selected model (model calibration) from the available data on both the input and output processes of the system to be identified. Over the last decades, researchers have made considerable efforts to provide the scientific community with suitable calibration techniques for several classes of models, resulting in the current availability of numerous procedures for parameter estimation. A general distinction can be made between time-domain and frequency-domain (or spectral-domain) calibration approaches, which differ mainly in the type of information used for parameter estimation. While the former are usually based on a numerical comparison between historical and synthetic series of the output process, spectral-domain procedures exploit, in several ways, the frequency content of recorded input and output series. This difference gives frequency-domain techniques considerable advantages, especially when sufficiently long and simultaneous records of both input and output variables are unavailable. The latter condition is not rare in hydrologic model calibration, since the model input sequences (e.g., rainfall and air temperature series) and the output sequence (e.g., the streamflow series) may both be available but not simultaneous, or even unavailable (as in poorly gauged or ungauged basins). These considerations make frequency-domain calibration techniques advisable in hydrologic applications.
    Within this framework, the thesis focuses on the spectral-domain calibration of a widely developed class of models for daily streamflow processes, the so-called shot noise models. These models represent the river flow process as the convolution of Poisson-distributed occurrences, representing the rainfall process, with a linear response function that represents the natural basin transformations and whose parameters are to be estimated. The technical literature provides several techniques for the calibration of this class of models, both in the time and in the frequency domain. Nevertheless, none of the existing procedures takes advantage of a remarkable property of shot noise models, namely the impulsive nature of the autocorrelation function of the input process. Exploiting this feature, the proposed calibration technique allows the basin response function parameters to be estimated from knowledge of the power spectral density of the recorded streamflow series alone. Hence, on the one hand, the main drawbacks of classical time-domain calibration approaches are avoided and, on the other, the dependence of existing frequency-domain techniques on the availability of both input and output data is overcome. The effectiveness of the proposed procedure is demonstrated through its application to three daily streamflow series from three watersheds in Italy. In particular, the ability of the approach to reproduce the statistical properties of the recorded flow series is assessed through a simulation analysis.
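    The key property can be illustrated concretely. When the input of a shot noise process has an impulsive autocorrelation (a flat spectrum), the output power spectral density is proportional to the squared gain of the response function; for a single exponential response h(t) = c·e^(-kt) this gain is a Lorentzian, so the recession rate k can be fitted to the periodogram of the streamflow alone. The sketch below assumes this simplest single-reservoir response and is not the thesis's procedure.

```python
# Minimal sketch of spectral-domain calibration for a shot noise model,
# assuming a single exponential basin response h(t) = c*exp(-k*t). Since
# the Poisson input has a flat spectrum, the streamflow PSD is
# proportional to |H(f)|^2 (a Lorentzian), so k is estimated from the
# output series alone. A hypothetical variant, not the thesis's method.
import numpy as np
from scipy.signal import welch
from scipy.optimize import curve_fit

def lorentzian(f, scale, k):
    # |H(f)|^2 for h(t) = c*exp(-k*t): scale / (k^2 + (2*pi*f)^2)
    return scale / (k**2 + (2 * np.pi * f)**2)

def calibrate_from_streamflow(q, fs=1.0):
    """Estimate the recession rate k (1/day) from a daily flow series."""
    f, pxx = welch(q - q.mean(), fs=fs, nperseg=512)
    f, pxx = f[1:], pxx[1:]                   # drop the zero frequency
    k0 = 0.1                                  # rough initial guess for k
    popt, _ = curve_fit(lorentzian, f, pxx, p0=[pxx[0] * k0**2, k0])
    return popt[1]

# Synthetic check: a shot noise series with known k, roughly recovered.
rng = np.random.default_rng(0)
k_true, n = 0.2, 20000
pulses = rng.exponential(2.0, n) * (rng.random(n) < 0.1)  # Poisson-like input
kernel = np.exp(-k_true * np.arange(200))                 # basin response
q = np.convolve(pulses, kernel)[:n]
print(calibrate_from_streamflow(q))           # approximately k_true = 0.2
```

    Note that only the output series q enters the fit: no rainfall record is needed, which is precisely the advantage over calibration techniques requiring simultaneous input and output data.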