Improving official statistics in emerging markets using machine learning and mobile phone data
Mobile phones are one of the fastest-growing technologies in the developing world, with global penetration rates reaching 90%. Mobile phone data, also called call detail records (CDR), are generated every time phones are used and are recorded by carriers at scale. CDR have generated groundbreaking insights in public health, official statistics, and logistics. However, the fact that most phones in developing countries are prepaid means that the data lack key information about the user, including gender and other demographic variables. This precludes numerous uses of the data in social science and development economics research. It also severely hinders the development of humanitarian applications, such as using mobile phone data to target aid towards the most vulnerable groups during crises. We developed a framework to extract more than 1,400 features from standard mobile phone data and used them to predict useful individual characteristics and group estimates. Here we present a systematic cross-country study of the applicability of machine learning for dataset augmentation at low cost. We validate our framework by showing how it can be used to reliably predict gender and other information for more than half a million people in two countries. We show that standard machine learning algorithms trained on only 10,000 users are sufficient to predict individuals' gender, using only metadata, with an accuracy ranging from 74.3% to 88.4% in a developed country and from 74.5% to 79.7% in a developing country. This is significantly higher than previous approaches and, once calibrated, gives highly accurate estimates of gender balance in groups. Performance suffers only marginally if we reduce the training set to 5,000 users, but decreases significantly with smaller training sets. Finally, we show, using factor analysis, that our indicators capture a large range of behavioral traits, and that the framework can be used to predict other indicators of vulnerability, such as age or socio-economic status.
Mobile phone data have great potential for good, and our framework allows these data to be augmented with vulnerability and other information at a fraction of the cost.
Improving the output quality of official statistics based on machine learning algorithms
National statistical institutes currently investigate how to improve the
output quality of official statistics based on machine learning algorithms. A
key obstacle is concept drift, i.e., when the joint distribution of independent
variables and a dependent (categorical) variable changes over time. Under
concept drift, a statistical model requires regular updating to prevent it from
becoming biased. However, updating a model requires additional data, which are
not always available. The literature offers a variety of bias correction
methods as a promising solution. In this paper, we compare two popular
correction methods: the misclassification estimator and the calibration
estimator. For prior probability shift (a specific type of concept drift), we
investigate the two correction methods theoretically as well as experimentally.
Our theoretical results are expressions for the bias and variance of both
methods. As an experimental result, we present a decision boundary (as a
function of (a) model accuracy, (b) class distribution and (c) test set size)
for the relative performance of the two methods. Close inspection of the
results provides deep insight into the effect of prior probability shift on
output quality, leading to practical recommendations on the use of machine
learning algorithms in official statistics.
Comment: 19 pages, 3 figures, submitted to the Journal of Official Statistics on 14 December 202
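The contrast between the two estimators under prior probability shift can be shown numerically. The sketch below is illustrative, not taken from the paper: the error rates, prevalences, and the specific forms of the estimators (inverting the confusion probabilities vs. reusing training-period conditional probabilities) are assumptions chosen to make the bias visible.

```python
# Hypothetical numerical sketch: class prevalence shifts from 0.5 (training)
# to 0.8 (new period) while the classifier's error rates stay fixed, i.e.
# prior probability shift.
import numpy as np

rng = np.random.default_rng(1)
tpr, fpr = 0.9, 0.2          # assumed classifier error rates (stable under shift)
p_train, p_test = 0.5, 0.8   # training vs. new-period prevalence

n = 200_000
y = rng.random(n) < p_test                      # true labels in the new period
pred = np.where(y, rng.random(n) < tpr, rng.random(n) < fpr)
q = pred.mean()                                  # observed predicted-positive rate

# Misclassification estimator: invert the (stable) confusion probabilities.
p_mis = (q - fpr) / (tpr - fpr)

# Calibration estimator: reuse P(Y=1 | prediction) computed in the training
# period -- these conditional probabilities no longer hold after the shift.
ppv_train = tpr * p_train / (tpr * p_train + fpr * (1 - p_train))
for_train = ((1 - tpr) * p_train /
             ((1 - tpr) * p_train + (1 - fpr) * (1 - p_train)))
p_cal = q * ppv_train + (1 - q) * for_train

print(f"true={p_test}, misclassification={p_mis:.3f}, calibration={p_cal:.3f}")
```

In this setup the misclassification estimator recovers the new prevalence almost exactly, while the calibration estimator stays biased towards the training prevalence, matching the kind of trade-off the abstract's decision boundary formalizes.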
Monitoring spatial sustainable development: Semi-automated analysis of satellite and aerial images for energy transition and sustainability indicators
Solar panels are installed by a large and growing number of households due to
the convenience of cheap, renewable energy to power household appliances.
In contrast to other energy sources, solar installations are highly
decentralized, spread over hundreds of thousands of locations. Globally, more
than 25% of solar photovoltaic (PV) installations are decentralized. The
effect of the quick energy transition from a carbon-based economy to a green
economy is, however, still very difficult to quantify. In fact, the rapid
adoption of solar panels by households is difficult to track, with local
registries missing a large number of newly built solar panels. This makes
assessing the impact of renewable energies nearly impossible.
Although models of the output of a region exist, they are often black box
estimations. This project's aim is twofold: first, to automate the extraction
of solar panel locations from aerial or satellite images and, second, to
produce a map of solar panels along with statistics on their number. Further,
the project takes place in a wider framework which
investigates how official statistics can benefit from new digital data sources.
At project completion, a method for detecting solar panels from aerial images
via machine learning will be developed and the methodology initially developed
for BE, DE and NL will be standardized for application to other EU countries.
In practice, machine learning techniques are used to identify solar panels in
satellite and aerial images for the province of Limburg (NL), Flanders (BE) and
North Rhine-Westphalia (DE).
Comment: This document provides the reader with an overview of the various
datasets which will be used throughout the project: the collection of
satellite and aerial images, as well as auxiliary information such as the
locations of buildings and roofs, which is required to train, test and
validate the machine learning algorithm that is being developed.
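The detection-and-counting workflow can be sketched at a toy scale. Real pipelines of this kind typically run a convolutional network over RGB orthophotos; the version below is a deliberately simplified, hypothetical stand-in that tiles an image into patches, classifies each patch, and aggregates patch counts into region statistics. All data are simulated.

```python
# Hypothetical sketch of patch-based solar panel detection: classify fixed-size
# image patches as panel / no-panel, then aggregate into per-region counts.
# We simulate 16x16 grayscale patches where "panels" appear as dark squares
# on bright roofs; a real system would use a CNN on aerial imagery.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_patch(has_panel):
    patch = rng.normal(0.6, 0.1, (16, 16))            # bright roof background
    if has_panel:
        patch[4:12, 4:12] = rng.normal(0.15, 0.05, (8, 8))  # dark panel area
    return patch.ravel()

labels = rng.random(2000) < 0.3                        # ~30% of patches have panels
patches = np.stack([make_patch(h) for h in labels])

# Train on the first 1500 patches, evaluate and count on the remaining 500.
clf = LogisticRegression(max_iter=1000).fit(patches[:1500], labels[:1500])
acc = clf.score(patches[1500:], labels[1500:])
est_count = int(clf.predict(patches[1500:]).sum())     # estimated panel patches
print(f"patch accuracy={acc:.3f}, estimated panel patches in region={est_count}")
```

The final aggregation step is the part that matters for official statistics: once patch-level predictions exist, counts per administrative region (e.g. per province) follow from a simple spatial join.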
Data Science Training for Official Statistics: a New Scientific Paradigm of Information and Knowledge Development in National Statistical Systems
Ashofteh, A., & Bravo, J. M. (2021). Data Science Training for Official Statistics: a New Scientific Paradigm of Information and Knowledge Development in National Statistical Systems. Statistical Journal of the IAOS, 37(3), 771–789. https://doi.org/10.3233/SJI-210841
The ability to incorporate new and Big Data sources and to benefit from emerging technologies such as Web Technologies, Remote Data Collection methods, User Experience Platforms, and Trusted Smart Statistics will become increasingly important in producing and disseminating official statistics. The skills and competencies required to automate, analyse, and optimize such complex systems are often not part of the traditional skill set of most National Statistical Offices. The adoption of these technologies requires new knowledge and methodologies and the upgrading of the quality assurance framework, technology, security, privacy, and legal matters. However, there are methodological challenges and discussions among scholars about the diverse methodological confinements and the wide array of skills and competencies considered relevant for those working with big data at NSOs. This paper develops a Data Science Model for Official Statistics (DSMOS), graphically summarizing the role of data science in statistical business processes. The model combines data science, existing scientific paradigms, and trusted smart statistics, and is developed around a restricted number of constructs. We consider a combination of statistical engineering, data engineering, data analysis, software engineering and soft skills such as statistical thinking, statistical literacy and specific knowledge of official statistics and the dissemination of official statistics products as key requirements of data science in official statistics. We then analyse and discuss the educational requirements of the proposed model, clarifying their contribution, interactions, and current and future importance in official statistics.
The DSMOS was validated through a quantitative method, using a survey addressed to experts working at the European statistical systems. The empirical results show that the core competencies considered relevant for the DSMOS include acquisition and processing capabilities related to statistics, high-frequency data, spatial data, Big Data, and microdata/nano-data, in addition to problem-solving skills, spatio-temporal modelling, machine learning, programming with R and SAS software, data visualisation using novel technologies, data and statistical literacy, ethics in official statistics, new data methodologies, and new data quality tools, standards and frameworks for official statistics. Some disadvantages and vulnerabilities are also addressed in the paper.
Some implications of new data sources for economic analysis and official statistics
On the back of new technologies, new data sources are emerging. These are of very high frequency, with greater granularity than traditional sources, and can in many cases be accessed across the board by the different economic agents. Such developments open up new avenues and new opportunities for official statistics and for economic analysis. From a central bank's standpoint, the use and incorporation of these data into its traditional tasks pose significant challenges, arising from their management, storage, security and confidentiality. Further, there are problems with their statistical representativeness. Given that these data are available to many agents, and not exclusively to official statistics institutions, there is a risk that different measures of the same phenomenon may be generated, with heterogeneous quality standards, giving rise to confusion among the public. Some of these sources, which consist of unstructured data such as text, require new processing techniques so that they can be integrated into economic analysis in an appropriate (quantitative) format. In addition, their use entails the incorporation of machine learning techniques, among others, into traditional analysis methodologies. This article reviews, from a central bank's standpoint, some of the possibilities and implications of this new phenomenon for economic analysis and official statistics, with examples of recent studies.
Crime prediction and monitoring in Porto, Portugal, using machine learning, spatial and text analytics
Crimes are a common societal concern impacting quality of life and economic growth.
Despite the global decrease in crime statistics, specific types of crime and feelings of insecurity have
often increased, leaving safety and security agencies with the need to apply novel approaches and
advanced systems to better predict and prevent occurrences. The use of geospatial technologies,
combined with data mining and machine learning techniques, allows for significant advances in the
criminology of place. In this study, official police data from Porto, in Portugal, between 2016 and 2018,
was georeferenced and treated using spatial analysis methods, which allowed the identification of
spatial patterns and relevant hotspots. Then, machine learning processes were applied for space-time
pattern mining. Using lasso regression analysis, significant crime variables were identified, with
random forest and decision tree models supporting the variable selection. Lastly, tweets related to
insecurity were collected, and topic modeling and sentiment analysis were performed. Together, these
methods assist in the interpretation of patterns and in prediction, and ultimately support the work of both police and
planning professionals.
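The variable-selection step described above, lasso shrinkage cross-checked against tree-based importances, can be illustrated with a small sketch. The data, coefficients, and variable names below (`pop_density`, `bars_nearby`, `street_lighting`) are invented for illustration and are not the study's actual Porto variables.

```python
# Hypothetical sketch: lasso regression shrinks uninformative predictors of a
# simulated crime count to zero; a random forest's feature importances serve
# as a cross-check on which variables matter.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
names = ["pop_density", "bars_nearby", "street_lighting", "noise_1", "noise_2"]
X = rng.normal(size=(500, 5))
# Only the first three variables actually drive the simulated crime counts.
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] - 1.0 * X[:, 2] \
    + rng.normal(scale=0.5, size=500)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = [n for n, c in zip(names, lasso.coef_) if abs(c) > 1e-6]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
top3 = [names[i] for i in np.argsort(rf.feature_importances_)[::-1][:3]]
print("lasso selected:", selected)
print("forest top-3:", sorted(top3))
```

Agreement between the two methods, as in this toy example, is what gives confidence that the retained variables reflect real structure rather than one model's artifacts.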