Methods for generating and evaluating synthetic longitudinal patient data: a systematic review
The proliferation of data in recent years has led to the advancement and
utilization of various statistical and deep learning techniques, thus
expediting research and development activities. However, not all industries
have benefited equally from the surge in data availability, partly due to legal
restrictions on data usage and privacy regulations, such as in medicine. To
address this issue, various statistical disclosure and privacy-preserving
methods have been proposed, including the use of synthetic data generation.
Synthetic data are generated based on some existing data, with the aim of
replicating them as closely as possible and acting as a proxy for real
sensitive data. This paper presents a systematic review of methods for
generating and evaluating synthetic longitudinal patient data, a prevalent data
type in medicine. The review adheres to the PRISMA guidelines and covers
literature from five databases until the end of 2022. The paper describes 17
methods, ranging from traditional simulation techniques to modern deep learning
methods. The collected information includes, but is not limited to, method
type, source code availability, and approaches used to assess resemblance,
utility, and privacy. Furthermore, the paper discusses practical guidelines and
key considerations for developing synthetic longitudinal data generation
methods.
Historical collaborative geocoding
Recent developments in digital technology have provided large data sets that
can be accessed and used with increasing ease. These data sets often contain
indirect localisation information, such as historical addresses. Historical
geocoding is the process of transforming the indirect localisation information
to direct localisation that can be placed on a map, which enables spatial
analysis and cross-referencing. Many efficient geocoders exist for current
addresses, but they do not deal with the temporal aspect and are based on a
strict hierarchy (..., city, street, house number) that is hard or impossible
to use with historical data. Indeed, historical data are full of uncertainties
(temporal aspect, semantic aspect, spatial precision, confidence in the
historical source, ...) that cannot be resolved, as there is no way to go back in time to
check. We propose an open source, open data, extensible solution for geocoding
that is based on the building of gazetteers composed of geohistorical objects
extracted from historical topographical maps. Once the gazetteers are
available, geocoding a historical address is a matter of finding the
geohistorical object in the gazetteers that is the best match to the historical
address. The matching criteria are customisable and include several dimensions
(fuzzy semantic, fuzzy temporal, scale, spatial precision ...). As the goal is
to facilitate historical work, we also propose web-based user interfaces that
help geocode addresses (one at a time or in batch mode) and display the results
over current or historical topographical maps, so that they can be checked and
collaboratively edited. The system is tested on the city of Paris for the 19th
and 20th centuries; it shows a high return rate and is fast enough to be used
interactively.
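The matching step can be illustrated with a minimal sketch: score each gazetteer entry along a few of the dimensions listed above (here only fuzzy semantic and fuzzy temporal) and keep the best-scoring geohistorical object. All names, fields, weights, and scoring functions below are hypothetical illustrations, not the project's actual implementation.

```python
# Hypothetical sketch of multi-criteria matching between a historical address
# and gazetteer entries; fields, weights, and scoring are illustrative only.
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class GeohistoricalObject:
    name: str
    valid_from: int   # first year the object is attested
    valid_to: int     # last year the object is attested
    x: float
    y: float

def semantic_score(query: str, candidate: str) -> float:
    """Fuzzy string similarity in [0, 1]."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()

def temporal_score(year: int, obj: GeohistoricalObject, tolerance: int = 10) -> float:
    """1.0 inside the validity interval, decaying linearly within a tolerance."""
    if obj.valid_from <= year <= obj.valid_to:
        return 1.0
    gap = min(abs(year - obj.valid_from), abs(year - obj.valid_to))
    return max(0.0, 1.0 - gap / tolerance)

def best_match(query: str, year: int, gazetteer: list[GeohistoricalObject],
               w_sem: float = 0.7, w_time: float = 0.3) -> GeohistoricalObject:
    """Return the gazetteer object with the best combined score."""
    return max(gazetteer,
               key=lambda o: w_sem * semantic_score(query, o.name)
                             + w_time * temporal_score(year, o))
```

In practice the combined score would also include the spatial-precision and scale dimensions mentioned above, but the structure (weighted scores over a gazetteer, keep the maximum) stays the same.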
Characterizing Seismicity in Alberta for Induced-Seismicity Applications
This report documents the compilation of a high-quality catalog of earthquakes in Alberta and the surrounding region: the Composite Alberta Seismicity Catalog (CASC). It currently includes events through July 2015. The catalog and its documentation are available for download at www.inducedseismicity.ca. To determine the magnitude of completeness (Mc) of the catalog, we map Mc(xi, yi, t) across a grid of the region, where xi and yi represent the longitude and latitude of the center nodes of the grid and t indicates the time period. The empirical relation determined from the catalog and station data is of the form Mc(D4) = a·D4 + c, where D4 is the distance from (xi, yi) to the fourth-nearest station. Seven Mc maps are created to represent spatial variations of Mc from 1985 to 2015. Based on the derived Mc maps, we estimate the equivalent rate of occurrence of M ≥ 3 earthquakes in the various grid cells.
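The relation Mc(D4) = a·D4 + c can be evaluated over a grid of nodes as in the minimal sketch below; the coefficients and coordinates are placeholders, since the values fitted in the report are not given here.

```python
# Sketch of the distance-based completeness relation Mc(D4) = a*D4 + c,
# evaluated at grid nodes. Coefficients A and C are placeholders, not the
# values fitted in the report.
import numpy as np

A, C = 0.005, 1.0  # hypothetical coefficients of the empirical relation

def mc_at_node(node_xy: np.ndarray, station_xy: np.ndarray) -> float:
    """Magnitude of completeness at one grid node, from the distance (km)
    to the fourth-nearest seismograph station."""
    d = np.linalg.norm(station_xy - node_xy, axis=1)  # distances to all stations
    d4 = np.sort(d)[3]                                # fourth-nearest distance
    return A * d4 + C

# Example: map Mc over a small grid of node coordinates (projected, in km).
stations = np.array([[0.0, 0.0], [50.0, 10.0], [20.0, 80.0], [90.0, 60.0], [120.0, 30.0]])
grid = np.array([[10.0, 10.0], [60.0, 60.0]])
mc_map = np.array([mc_at_node(node, stations) for node in grid])
```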
Injecting equipment schemes for injecting drug users: qualitative evidence review
This review of the qualitative literature about needle and syringe programmes (NSPs) for injecting drug users (IDUs) complements the review of effectiveness and cost-effectiveness. It aims to provide a more situated narrative perspective on the overall guidance questions.
Econometrics meets sentiment: an overview of methodology and applications
The advent of massive amounts of textual, audio, and visual data has spurred the development of econometric methodology to transform qualitative sentiment data into quantitative sentiment variables, and to use those variables in an econometric analysis of the relationships between sentiment and other variables. We survey this emerging research field and refer to it as sentometrics, which is a portmanteau of sentiment and econometrics. We provide a synthesis of the relevant methodological approaches, illustrate with empirical results, and discuss useful software.
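As a toy illustration of the basic step described above, the sketch below turns raw text into a quantitative sentiment variable with a small lexicon and aggregates it by period; the lexicon and documents are made up, and this is not the sentometrics methodology or any particular software package.

```python
# Toy illustration: score each document with a small sentiment lexicon,
# then average the scores by period to obtain a quantitative sentiment
# variable. Lexicon, documents, and dates are invented for illustration.
from collections import defaultdict

LEXICON = {"gain": 1, "growth": 1, "strong": 1, "loss": -1, "weak": -1, "decline": -1}

def doc_sentiment(text: str) -> float:
    """Average lexicon score of the words in a document (0 if no hits)."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def sentiment_index(docs: list[tuple[str, str]]) -> dict[str, float]:
    """Aggregate per-document scores into a per-period sentiment variable."""
    by_period = defaultdict(list)
    for period, text in docs:
        by_period[period].append(doc_sentiment(text))
    return {p: sum(v) / len(v) for p, v in by_period.items()}

docs = [("2020-Q1", "strong growth despite weak exports"),
        ("2020-Q2", "sharp decline and heavy loss")]
print(sentiment_index(docs))  # e.g. {'2020-Q1': 0.33..., '2020-Q2': -1.0}
```

The resulting per-period series is the kind of quantitative sentiment variable that can then enter a standard econometric regression alongside other variables.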
DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication
Metrics for set similarity are a core aspect of several data mining tasks. To
remove duplicate results in a Web search, for example, a common approach looks
at the Jaccard index between all pairs of pages. In social network analysis, a
much-celebrated metric is the Adamic-Adar index, widely used to compare node
neighborhood sets in the important problem of predicting links. However, with
the increasing amount of data to be processed, calculating the exact similarity
between all pairs can be intractable. The challenge of working at this scale
has motivated research into efficient estimators for set similarity metrics.
The two most popular estimators, MinHash and SimHash, are indeed used in
applications such as document deduplication and recommender systems where large
volumes of data need to be processed. Given the importance of these tasks, the
demand for advancing estimators is evident. We propose DotHash, an unbiased
estimator for the intersection size of two sets. DotHash can be used to
estimate the Jaccard index and, to the best of our knowledge, is the first
method that can also estimate the Adamic-Adar index and a family of related
metrics. We formally define this family of metrics, provide theoretical bounds
on the probability of estimate errors, and analyze its empirical performance.
Our experimental results indicate that DotHash is more accurate than the other
estimators at link prediction and duplicate-document detection, with the same
complexity and similar comparison time.
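For reference, the two metrics the abstract names can be computed exactly as in the sketch below; this shows the plain Jaccard and Adamic-Adar definitions, not the DotHash estimator itself, and the example sets and degrees are invented.

```python
# Exact (non-estimated) computation of the two set-similarity metrics the
# abstract refers to; illustrates the metrics, not the DotHash estimator.
import math

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def adamic_adar(a: set, b: set, degree: dict) -> float:
    """Sum of 1 / log(deg(z)) over common neighbors z, as used in link prediction."""
    return sum(1.0 / math.log(degree[z]) for z in a & b if degree[z] > 1)

# Example: neighborhoods of two nodes and the degrees of their neighbors.
neigh_u = {"a", "b", "c"}
neigh_v = {"b", "c", "d"}
degrees = {"a": 3, "b": 4, "c": 2, "d": 5}
print(jaccard(neigh_u, neigh_v))               # 2/4 = 0.5
print(adamic_adar(neigh_u, neigh_v, degrees))  # 1/log(4) + 1/log(2)
```

Estimators such as MinHash approximate the Jaccard value without materializing the full sets, which is what makes these metrics usable at web scale.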