Biologically informed ecological niche models for an example pelagic, highly mobile species
Background: Although pelagic seabirds are broadly recognised as indicators of the health of marine systems, numerous gaps exist in knowledge of their at-sea distributions at the species level. These gaps have profound negative impacts on the robustness of marine conservation policies. Correlative modelling techniques have provided some information, but few studies have explored model development for non-breeding pelagic seabirds. Here, I present a first phase in developing robust niche models for highly mobile species as a baseline for further development.
Methodology: Using observational data from a 12-year time period, 217 unique model parameterisations across three correlative modelling algorithms (boosted regression trees, Maxent and minimum volume ellipsoids) were tested in a time-averaged approach for their ability to recreate the at-sea distribution of non-breeding Wandering Albatrosses (Diomedea exulans) to provide a baseline for further development.
Principal Findings/Results: Overall, minimum volume ellipsoids outperformed both boosted regression trees and Maxent. However, whilst the latter two algorithms generally overfit the data, minimum volume ellipsoids tended to underfit the data.
Conclusions: The results of this exercise suggest a necessary evolution in how correlative modelling for highly mobile species such as pelagic seabirds should be approached. These insights are crucial for understanding seabird–environment interactions at macroscales, which can facilitate the ability to address population declines and inform effective marine conservation policy in the wake of rapid global change.
How to Host a Data Competition: Statistical Advice for Design and Analysis of a Data Competition
Data competitions rely on real-time leaderboards to rank competitor entries
and stimulate algorithm improvement. While such competitions have become quite
popular and prevalent, particularly in supervised learning formats, their
implementations by the host are highly variable. Without careful planning, a
supervised learning competition is vulnerable to overfitting, where the winning
solutions are so closely tuned to the particular set of provided data that they
cannot generalize to the underlying problem of interest to the host. This paper
outlines some important considerations for strategically designing relevant and
informative data sets to maximize the learning outcome from hosting a
competition based on our experience. It also describes a post-competition
analysis that enables robust and efficient assessment of the strengths and
weaknesses of solutions from different competitors, as well as greater
understanding of the regions of the input space that are well-solved. The
post-competition analysis, which complements the leaderboard, uses exploratory
data analysis and generalized linear models (GLMs). The GLMs not only expand
the range of results we can explore, they also provide more detailed analysis
of individual sub-questions including similarities and differences between
algorithms across different types of scenarios, universally easy or hard
regions of the input space, and different learning objectives. When coupled
with a strategically planned data generation approach, the methods provide
richer and more informative summaries to enhance the interpretation of results
beyond just the rankings on the leaderboard. The methods are illustrated with a
recently completed competition to evaluate algorithms capable of detecting,
identifying, and locating radioactive materials in an urban environment.
Comment: 36 pages
Identifying Real Estate Opportunities using Machine Learning
The real estate market is exposed to many fluctuations in prices because of
existing correlations with many variables, some of which cannot be controlled
or might even be unknown. Housing prices can increase rapidly (or in some
cases, also drop very fast), yet the numerous listings available online where
houses are sold or rented are not likely to be updated that often. In some
cases, individuals interested in selling a house (or apartment) might include
it in some online listing, and forget about updating the price. In other cases,
some individuals might be interested in deliberately setting a price below the
market price in order to sell the home faster, for various reasons. In this
paper, we aim at developing a machine learning application that identifies
opportunities in the real estate market in real time, i.e., houses that are
listed with a price substantially below the market price. This program can be
useful for investors interested in the housing market. We have focused on a use
case considering real estate assets located in the Salamanca district in Madrid
(Spain) and listed in the most relevant Spanish online site for home sales and
rentals. The application is formally implemented as a regression problem that
tries to estimate the market price of a house given features retrieved from
public online listings. For building this application, we have performed a
feature engineering stage in order to discover relevant features that allow
for attaining a high predictive performance. Several machine learning
algorithms have been tested, including regression trees, k-nearest neighbors,
support vector machines and neural networks, identifying advantages and
handicaps of each of them.
Comment: 24 pages, 13 figures, 5 tables
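As a rough illustration of the idea in this abstract (estimate a market price by regression, then flag listings priced well below it), here is a minimal sketch using k-nearest neighbors, one of the algorithms the abstract names. The feature set (floor area and room count), the toy listings, the 20% discount threshold and the function names `knn_estimate` and `find_opportunities` are all assumptions made for illustration; none of them come from the paper.

```python
from math import sqrt

def knn_estimate(target, comparables, k=3):
    """Estimate a price as the mean price of the k nearest comparables.

    `comparables` is a list of (features, price) pairs. Features are used
    unscaled here, so the largest-valued feature dominates the distance.
    """
    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(comparables, key=lambda c: distance(target, c[0]))[:k]
    return sum(price for _, price in nearest) / len(nearest)

def find_opportunities(listings, k=3, discount=0.2):
    """Return (index, asking, estimate) for listings priced at least
    `discount` below the estimate built from all other listings."""
    flagged = []
    for i, (features, asking) in enumerate(listings):
        others = listings[:i] + listings[i + 1:]  # leave the listing itself out
        estimate = knn_estimate(features, others, k)
        if asking < (1 - discount) * estimate:
            flagged.append((i, asking, estimate))
    return flagged

# Hypothetical listings: ((area in m^2, rooms), asking price in euros).
listings = [
    ((80, 2), 400_000),
    ((82, 2), 410_000),
    ((78, 2), 395_000),
    ((81, 2), 300_000),   # priced well below similar flats
    ((120, 4), 650_000),
    ((118, 4), 640_000),
    ((122, 4), 660_000),
]

for index, asking, estimate in find_opportunities(listings):
    print(f"listing {index}: asking {asking}, estimated {estimate:.0f}")
```

In practice the features would need scaling and the estimate would come from a model validated on held-out data; the leave-one-out loop above only keeps a listing from influencing its own estimate.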
A Strategy analysis for genetic association studies with known inbreeding
Background: Association studies aim to identify the genetic variants related to a specific disease through statistical multiple hypothesis testing or segregation analysis in pedigrees. This type of study has been very successful for Mendelian monogenic disorders, but less successful in identifying genetic variants related to complex diseases, where onset depends on interactions between different genes and the environment. Current technology allows more than a million markers to be genotyped, and this number has been increasing rapidly in recent years with imputation based on template sets and whole genome sequencing. This type of data introduces a great amount of noise into the statistical analysis and usually requires a large number of samples. Current methods seldom take into account gene-gene and gene-environment interactions, which are fundamental, especially in complex diseases. In this paper we propose to use a non-parametric additive model to detect the genetic variants related to diseases, which accounts for interactions of unknown order. Although this is not new to the current literature, we show that in an isolated population, where the most closely related subjects also share most of their genetic code, the use of additive models may be improved if the available genealogical tree is taken into account. Specifically, we form a sample of cases and controls with the highest inbreeding by means of the Hungarian method, and estimate the set of genes and environmental variables associated with the disease by means of Random Forest.
Results: We have evidence, from statistical theory, simulations and two applications, that the procedure we built is suitable for eliminating stratification between cases and controls and that it is also precise enough to identify genetic variants responsible for a disease. This procedure has been successfully applied to beta-thalassemia, a well-known Mendelian disease, and to common asthma, where we have identified candidate genes that underlie susceptibility to asthma. Some of these candidate genes have also been reported as related to common asthma in the current literature.
Conclusions: The data analysis approach, based on selecting the most closely related cases and controls together with the Random Forest model, is a powerful tool for detecting genetic variants associated with a disease in isolated populations. Moreover, this method also provides a prediction model that is accurate in estimating unknown disease status and that can generally be used to build test kits for a wide class of Mendelian diseases.
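The abstract says only that cases and controls are selected "by means of the Hungarian method". One plausible reading, assumed here, is a one-to-one pairing of cases with controls that maximises total relatedness. The sketch below implements the Hungarian (Kuhn-Munkres) assignment algorithm in its standard potentials form and applies it to a hypothetical kinship matrix. The matrix values and the names `hungarian` and `match_cases_to_controls` are illustrative and not taken from the paper, and the Random Forest step is not shown.

```python
def hungarian(cost):
    """Minimum-cost assignment of each row to a distinct column.

    `cost` is an n x m matrix (list of lists) with n <= m.
    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)      # row potentials (1-based)
    v = [0.0] * (m + 1)      # column potentials (1-based)
    p = [0] * (m + 1)        # p[j] = row currently matched to column j
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: path is complete
                break
        while j0 != 0:       # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment

def match_cases_to_controls(kinship):
    """Pair each case (row) with a control (column), maximising total kinship."""
    cost = [[-k for k in row] for row in kinship]  # maximise by negating
    return list(enumerate(hungarian(cost)))

# Hypothetical kinship coefficients: rows are cases, columns are controls.
kinship = [
    [0.9, 0.8],
    [0.8, 0.1],
]
pairs = match_cases_to_controls(kinship)
print(pairs)
```

A greedy pairing would match case 0 with control 0 first (0.9) and leave case 1 with a poorly related control (0.1); the optimal assignment swaps them for a higher total, which is why an assignment algorithm is used instead of a greedy choice.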
- …