Geometry- and Accuracy-Preserving Random Forest Proximities with Applications
Many machine learning algorithms use calculated distances or similarities between data observations to make predictions, cluster similar data, visualize patterns, or generally explore the data. Most distance or similarity measures do not incorporate known data labels and are thus considered unsupervised. Supervised distance measures exist which incorporate data labels and thereby exaggerate the separation between data points of different classes, but this approach tends to distort the natural structure of the data. Instead of following similar approaches, we leverage a popular algorithm used for making data-driven predictions, known as random forests, to naturally incorporate data labels into similarity measures known as random forest proximities. In this dissertation, we explore previously defined random forest proximities and demonstrate their weaknesses in popular proximity-based applications. Additionally, we develop a new proximity definition that can be used to recreate the random forest's predictions. We call these Random Forest-Geometry- and Accuracy-Preserving proximities, or RF-GAP. We show by proof and empirical demonstration that RF-GAP proximities can be used to perfectly reconstruct the random forest's predictions and, as a result, we argue that they provide a truer representation of the random forest's learning when used in proximity-based applications. We provide evidence that RF-GAP proximities improve applications including imputing missing data, detecting outliers, and visualizing the data. We also introduce a new random forest proximity-based technique for generating 2- or 3-dimensional data representations that can be used as a tool to visually explore the data. We show that this method does well at portraying the relationship between data variables and the data labels, and we demonstrate quantitatively and qualitatively that it surpasses other existing methods for this task.
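For context, the classic (pre-RF-GAP) random forest proximity between two observations is the fraction of trees in which both land in the same terminal leaf. A minimal sketch of that baseline definition, using scikit-learn on the Iris data (RF-GAP itself refines this with out-of-bag weighting, which is not reproduced here):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Leaf index of every sample in every tree: shape (n_samples, n_trees)
leaves = rf.apply(X)

# proximity(i, j) = fraction of trees where i and j share a leaf
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Each point always shares its own leaf, so the diagonal is 1
assert np.allclose(np.diag(prox), 1.0)
```

The resulting symmetric matrix `prox` can feed proximity-based applications such as imputation, outlier scoring, or embedding, which is where the dissertation's RF-GAP definition replaces this baseline.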
Comparison of Imputation Methods for Mixed Data Missing at Random
A statistician's job is to produce statistical models. When these models are precise and unbiased, we can relate them to new data appropriately. However, when data sets have missing values, the assumptions of statistical methods are violated, producing biased results. The statistician's objective is to implement methods that produce unbiased and accurate results. Research in missing data is becoming popular as modern methods that produce unbiased and accurate results emerge, such as the MICE package in the statistical software R. Using real data, we compare four common imputation methods from the MICE package in R at different levels of missingness. The results were compared in terms of the regression coefficients and adjusted R^2 values obtained from the complete data set. The CART and PMM methods consistently performed better than the OTF and RF methods. The procedures were repeated on a second sample of real data and the same conclusions were drawn.
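The MICE package is R software, but the comparison workflow the abstract describes can be sketched in Python with scikit-learn's `IterativeImputer` (a MICE-style chained-equations imputer). The data here is synthetic and the two estimators stand in loosely for the CART- and RF-based MICE methods; this is an illustration of the evaluation loop, not a reproduction of the study:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor      # CART-style imputation model
from sklearn.ensemble import RandomForestRegressor  # RF-style imputation model

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 4))
X_full[:, 3] = X_full[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=200)

# Knock out 20% of values completely at random
X_miss = X_full.copy()
mask = rng.random(X_miss.shape) < 0.2
X_miss[mask] = np.nan

results = {}
for name, est in [("CART", DecisionTreeRegressor(max_depth=5)),
                  ("RF", RandomForestRegressor(n_estimators=20, random_state=0))]:
    imp = IterativeImputer(estimator=est, random_state=0, max_iter=5)
    X_hat = imp.fit_transform(X_miss)
    # Error only on the cells that were actually masked
    results[name] = float(np.sqrt(np.mean((X_hat[mask] - X_full[mask]) ** 2)))
```

The study itself compares downstream regression coefficients and adjusted R^2 rather than raw imputation error, but the structure (impute under each method, then compare against the complete data) is the same.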
Selective Data Editing of Continuous Variables with Random Forests in Official Statistics
Technological advances and new demands due to economic and socio-cultural changes regularly challenge the National Statistical Institutes to adapt to their evolving environment. The application of machine learning methods as important and promising tools for official statistics is discussed in the context of these changes, in the context of opportunities arising from new digital data sources, and considering the difficult task of having to balance a variety of quality requirements at national and international levels. Selective statistical data editing is an approach to detect influential units and select them for manual follow-up in order to make the process more efficient. In this thesis, a simple and a two-step approach are developed to apply random forests to selective editing of continuous variables in the context of short-term business survey data. We present a score function based on decision forest models which allows for an efficient selection of units relevant for the estimation of the final estimates. The approach is found to be applicable also at the disaggregated levels of the autonomous communities and economic branches.
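One common form of such a score function flags units whose reported value deviates most from a model-based expectation, weighted by the unit's contribution to the aggregate. A hypothetical sketch of that idea with a random forest (the synthetic survey data, variable names, and the particular score formula are illustrative assumptions, not the thesis's actual definition):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500
aux = rng.normal(size=(n, 3))                 # auxiliary variables, e.g. past turnover
true_val = 100 + 10 * aux[:, 0] + rng.normal(scale=2, size=n)
reported = true_val.copy()
reported[rng.choice(n, 10, replace=False)] *= 10   # inject influential reporting errors

# Model the expected value of the target from auxiliary information
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(aux, reported)
predicted = rf.predict(aux)
weight = rng.uniform(1, 5, size=n)            # survey design weights

# Score: weighted absolute deviation from the model's expectation;
# the highest-scoring units are routed to manual follow-up
score = weight * np.abs(reported - predicted)
flagged = np.argsort(score)[::-1][:20]
```

The efficiency gain of selective editing comes from reviewing only the top-scoring units instead of the full sample, while the remaining units pass through unedited.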
Statistical Analysis of the Effectiveness of Seawalls and Coastal Forests in Mitigating Tsunami Impacts in Iwate and Miyagi Prefectures
The Pacific coast of the Tohoku region of Japan experiences repeated tsunamis, with the most recent events having occurred in 1896, 1933, 1960, and 2011. These events have caused large loss of life and damage throughout the coastal region. There is uncertainty about the degree to which seawalls reduce deaths and building damage during tsunamis in Japan. On the one hand, they provide physical protection against tsunamis as long as they are not overtopped and do not fail. On the other hand, the presence of a seawall may induce a false sense of security, encouraging additional development behind the seawall and reducing evacuation rates during an event. We analyze municipality-level and sub-municipality-level data on the impacts of the 1896, 1933, 1960, and 2011 tsunamis, finding that seawalls larger than 5 m in height generally have served a protective role in these past events, reducing both death rates and the damage rates of residential buildings. However, seawalls smaller than 5 m in height appear to have encouraged development in vulnerable areas and exacerbated damage. We also find that the extent of flooding is a critical factor in estimating both death rates and building damage rates, suggesting that additional measures, such as multiple lines of defense and elevated topography, may have significant benefits in reducing the impacts of tsunamis. Moreover, the area of coastal forests was found to be inversely related to death and destruction rates, indicating that forests either mitigated the impacts of these tsunamis or displaced development that would otherwise have been damaged.
Improving the matching of registered unemployed to job offers through machine learning algorithms
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence

Due to the existence of a double-sided asymmetric information problem on the labour market, characterized by a mutual lack of trust between employers and unemployed people, not enough job matches are facilitated by public employment services (PES), which seem to be caught in a low-end equilibrium. In order to act as a reliable third party, PES need to build a good and solid reputation among their main clients by offering better and less time-consuming pre-selection services. The use of machine-learning, data-driven relevancy algorithms that calculate the viability of a specific candidate for a particular job opening is becoming increasingly popular in this field. Based on the Portuguese PES databases (CVs, vacancies, pre-selection and matching results), complemented by relevant external data published by Statistics Portugal and the European Classification of Skills/Competences, Qualifications and Occupations (ESCO), this thesis evaluates the potential application of models such as Random Forests, Gradient Boosting, Support Vector Machines, Neural Network Ensembles and other tree-based ensembles to the job matching activities carried out by the Portuguese PES, in order to understand the extent to which the latter can be improved through the adoption of automated processes. The obtained results seem promising and point to the possible use of robust algorithms such as Random Forests in the pre-selection of suitable candidates, owing to their advantages at various levels, namely their accuracy and their capacity to handle large datasets with thousands of variables, including badly unbalanced ones, extensive missing values, and many-valued categorical variables.
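The pre-selection step described above can be sketched as a supervised ranking problem: train a random forest on past (candidate, vacancy) pairs labelled as successful or unsuccessful matches, then rank new candidates for a vacancy by predicted match probability. The features and data below are synthetic stand-ins for the CV and vacancy attributes the thesis draws from the PES databases:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Historical (candidate, vacancy) pairs: pairwise features such as
# skill overlap, commuting distance, experience gap (all synthetic here)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank 50 candidates for a new vacancy by predicted match viability
candidates = rng.normal(size=(50, 6))
viability = rf.predict_proba(candidates)[:, 1]
shortlist = np.argsort(viability)[::-1][:10]   # ten most viable candidates
```

A caseworker would then review only the shortlist, which is the efficiency gain the abstract attributes to automated pre-selection.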