1,997 research outputs found

    Random Survival Forests Incorporated by the Nadaraya-Watson Regression

    An attention-based random survival forest (Att-RSF) is presented in the paper. The first main idea behind this model is to adapt the Nadaraya-Watson kernel regression to the random survival forest so that the regression weights, or kernels, can be regarded as trainable attention weights, under the important condition that predictions of the random survival forest are represented in the form of functions, for example, the survival function and the cumulative hazard function. Each trainable weight assigned to a tree and a training or testing example is defined by two factors: the ability of the corresponding tree to predict, and the peculiarity of the example that falls into a leaf of the tree. The second main idea behind Att-RSF is to apply Huber's contamination model to represent the attention weights as a linear function of the trainable attention parameters. Harrell's C-index (concordance index), which measures the prediction quality of the random survival forest, is used to form the loss function for training the attention weights. The C-index together with the contamination model leads to a standard quadratic optimization problem for computing the weights, for which many simple solution algorithms exist. Numerical experiments with real datasets containing survival data illustrate Att-RSF
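    The weighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a softmax over per-tree scores as the kernel, and the toy inputs are all assumptions made for illustration. The sketch shows (a) Huber-style contaminated weights that are linear in the trainable parameters, (b) a weighted average of per-tree survival curves, and (c) Harrell's C-index as the quality measure. Training the parameters by quadratic optimization is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax, used here as normalized kernel weights."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contaminated_weights(scores, trainable, eps):
    """Huber-style mixture (assumed form): (1 - eps) * softmax(scores) + eps * trainable.

    `trainable` is assumed to lie on the probability simplex, so the result
    also sums to one and is linear in the trainable parameters.
    """
    base = softmax(scores)
    return [(1.0 - eps) * b + eps * w for b, w in zip(base, trainable)]


def aggregate_survival(tree_curves, weights):
    """Weighted average of per-tree survival curves given on a shared time grid."""
    n_times = len(tree_curves[0])
    return [
        sum(w * curve[t] for w, curve in zip(weights, tree_curves))
        for t in range(n_times)
    ]


def harrell_c_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]; it is concordant when risks[i] > risks[j].
    Ties in risk count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


if __name__ == "__main__":
    # Hypothetical per-tree scores (e.g. negative distances) for three trees.
    weights = contaminated_weights([-0.1, -0.5, -2.0], [1 / 3, 1 / 3, 1 / 3], eps=0.2)
    curves = [[1.0, 0.8, 0.5], [1.0, 0.6, 0.3], [1.0, 0.4, 0.1]]
    print(weights)
    print(aggregate_survival(curves, weights))
    print(harrell_c_index([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1]))
```

    Because the contaminated weights are linear in the trainable vector, any loss that is quadratic in the aggregated prediction stays quadratic in those parameters, which is consistent with the abstract's remark about a standard quadratic optimization problem.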

    Addressing the challenges of uncertainty in regression models for high dimensional and heterogeneous data from observational studies

    The lack of replicability in research findings from different scientific disciplines has gained wide attention in the last few years and led to extensive discussions. In this "replication crisis", different types of uncertainty play an important role, which occur at different points of data collection and statistical analysis. Nevertheless, the consequences are often ignored in current research practices with the risk of low credibility and reliability of research findings. For the analysis and the development of solutions to this problem, we define measurement uncertainty, sampling uncertainty, data pre-processing uncertainty, method uncertainty, and model uncertainty, and investigate them in particular in the context of regression analyses. Therefore, we consider data from observational studies with the focus on high dimensionality and heterogeneous variables, which are characteristics of growing importance. High dimensional data, i.e., data with more variables than observations, play an important role in the area of medical research, where large amounts of molecular data (omics data) can be collected with ever decreasing expense and effort. Where several types of omics data are available, we are additionally faced with heterogeneity. Moreover, heterogeneous data can be found in many observational studies, where data originate from different sources, or where variables of different types are collected. This work comprises four contributions with different approaches to this topic and a different focus of investigation. Contribution 1 can be considered as a practical example to illustrate data pre-processing and method uncertainty in the context of prediction and variable selection from high dimensional and heterogeneous data. In the first part of this paper, we introduce the development of priority-Lasso, a hierarchical method for prediction using multi-omics data. 
Priority-Lasso is based on standard Lasso and assumes a pre-specified priority order of blocks of data. The idea is to successively fit Lasso models on these blocks of data and to take the linear predictor from every fit as an offset in the fit of the block with next lowest priority. In the second part, we apply this method in a current study of acute myeloid leukemia (AML) and compare its performance to standard Lasso. We illustrate data pre-processing and method uncertainty, caused by different choices of variable definitions and specifications of settings in the application of the method. These choices result in different effect estimates and thus different prediction performances and selected variables. In the second contribution, we compare method uncertainty with sampling uncertainty in the context of variable selection and ranking of omics biomarkers. For this purpose, we develop a user-friendly and versatile framework. We apply this framework on data from AML patients with high dimensional and heterogeneous characteristics and explore three different scenarios: First, variable selection in multivariable regression based on multi-omics data, second, variable ranking based on variable importance measures from random forests, and, third, identification of genes based on differential gene expression analysis. In contributions 3 and 4, we apply the vibration of effects framework, which was initially used to analyze model uncertainty in a large epidemiological study (NHANES), to assess and compare different types of uncertainty. The two contributions intensively address the methodological extension of this framework to different types of uncertainty. In contribution 3, we describe the extension of the vibration of effects framework to sampling and data pre-processing uncertainty. 
As a practical illustration, we take a large data set from psychological research with heterogeneous variable structure (SAPA-project), and examine sampling, model and data pre-processing uncertainty in the context of logistic regression for varying sample sizes. Beyond the comparison of single types of uncertainty, we introduce a strategy which allows quantifying cumulative model and data pre-processing uncertainty and analyzing their relative contributions to the total uncertainty with a variance decomposition. Finally, we extend the vibration of effects framework to measurement uncertainty in contribution 4. In a practical example, we conduct a comparison study between sampling, model and measurement uncertainty on the NHANES data set in the context of survival analysis. We focus on different scenarios of measurement uncertainty which differ in the choice of variables considered to be measured with error. Moreover, we analyze the behavior of different types of uncertainty with increasing sample sizes in a large simulation study

    Above- and belowground tree biomass models for three mangrove species in Tanzania: a nonlinear mixed effects modelling approach

    Key message: Tested on data from Tanzania, both existing species-specific and common biomass models developed elsewhere revealed statistically significant large prediction errors. Species-specific and common above- and belowground biomass models for three mangrove species were therefore developed. The species-specific models fitted the data better than the common models. The former models are recommended for accurate estimation of biomass stored in mangrove forests of Tanzania.
    Context: Mangroves are essential for climate change mitigation through carbon storage and sequestration. Biomass models are important tools for quantifying biomass and carbon stock. While numerous aboveground biomass models exist, very few studies have focused on belowground biomass, and among these, mangroves of Africa are hardly or not represented.
    Aims: The aims of the study were to develop above- and belowground biomass models and to evaluate the predictive accuracy of existing aboveground biomass models developed for mangroves in other regions and neighboring countries when applied to data from Tanzania.
    Methods: Data were collected through destructive sampling of 120 trees (aboveground biomass), of which 30 trees were sampled for belowground biomass. The data originated from four sites along the Tanzanian coastline covering three dominant species: Avicennia marina (Forssk.) Vierh, Sonneratia alba J. Smith, and Rhizophora mucronata Lam. The biomass models were developed through mixed modelling leading to fixed effects/common models and random effects/species-specific models.
    Results: Both the above- and belowground biomass models improved when random effects (species) were considered. Inclusion of total tree height as a predictor variable, in addition to diameter at breast height, further improved the model predictive accuracy. The tests of existing models from other regions on our data generally showed large and significant prediction errors for aboveground tree biomass.
    Conclusion: Inclusion of random effects resulted in improved goodness of fit for both above- and belowground biomass models. Species-specific models therefore are recommended for accurate biomass estimation of mangrove forests in Tanzania for both management and ecological applications. For belowground biomass (S. alba) however, the fixed effects/common model is recommended

    Overcoming the data crisis in biodiversity conservation

    How can we track population trends when monitoring data are sparse? Population declines can go undetected, despite ongoing threats. For example, only one of every 200 harvested species is monitored. This gap leads to uncertainty about the seriousness of declines and hampers effective conservation. Collecting more data is important, but we can also make better use of existing information. Prior knowledge of physiology, life history, and community ecology can be used to inform population models. Additionally, in multispecies models, information can be shared among taxa based on phylogenetic, spatial, or temporal proximity. By exploiting generalities across species that share evolutionary or ecological characteristics within Bayesian hierarchical models, we can fill crucial gaps in the assessment of species’ status with unparalleled quantitative rigor

    Suupohjan kuukkelipopulaation efektiivinen populaatiokoko ja populaation elinkelpoisuus [Effective population size and population viability of the Siberian jay population of Suupohja]

    Genetic variation is vital for both contemporary and long-term wellbeing of populations. Whereas heterozygosity (Ho) and allelic richness (A) are commonly used to measure the level of genetic diversity in a population, effective population size (Ne) describes the speed of loss of genetic variation. Various effective population sizes are proposed as standards for safe retention of genetic variation in a Minimum Viable Population (MVP). Since the 1940s, several types of effective population size estimators have been developed. Earlier estimators relied on demographic parameters, whereas genetic estimators are based on the analysis of either one or two genetic samples from a population. All Ne estimators have their unique sensitivities and limiting assumptions, which complicate the choice of estimator, comparison of results of different studies and the assessment of the reliability of the results. Ne estimators have recently been used e.g. in the monitoring of many aquatic populations, but their reliability and comparability has not often been tested with extensive ecological and genetic data, and it is not well established how much added value they bring to the conservation of easily observable species. I tested this with an extensive dataset on the Siberian jays (Perisoreus infaustus) living in Suupohja, Finland (62°22'N, 21°30'E). The Suupohja Siberian jays form one of the few isolates of Siberian jays in Southern Finland. I utilised three demographic and three genetic Ne estimators to estimate the Ne and the Ne/N ratio in the Suupohja Siberian jays, and compared the findings to the Ho and A estimates calculated with the same data, and to various suggested MVP standards. The results showed that the ratio of effective and census population sizes (Ne/N) is close to 0.6 in the Suupohja Siberian jays. Uneven survival of offspring and population size fluctuations are the main factors in the formation of this ratio. 
The average genetic Ne estimate would, then, suggest a census population size 44% higher than the average N in the Suupohja study area. This result is probably connected to the high proportion of breeding immigrants in the data, which would cause the Ne estimates to reflect a larger genetic neighbourhood than the study area. The genetic Ne estimates also suggest that the Suupohja Siberian jays might not be able to maintain their genetic diversity in the long term if gene flow were to cease due to further isolation, especially if isolation also caused a faster demographic decline. Conservation attempts should aim at ensuring gene flow to the remaining Siberian jay isolates in Southern Finland, in order to protect them from increasing genetic uniformity and inbreeding. It is possible that while the average dispersal distances in the Siberian jay are short, occasional long-distance dispersal events have an important role in the prevention of genetic structuring in a Siberian jay population. Ne estimation based on demographic data was laborious in the case of the Suupohja Siberian jays, whereas the genetic Ne estimates showed large variation depending on year and estimation method used. Reliable estimation of Ne with genetic methods would have required information on the large-scale genetic structure of the population. In any case, Ne estimates gave a clearer picture of the genetic viability of the Suupohja Siberian jays than the Ho and A estimates, which did not indicate any decrease of genetic diversity during the study period.
    Finnish abstract: Genetic variation is vital for the wellbeing of populations. Whereas heterozygosity (Ho) and allelic richness (A) are commonly used as measures of a population's genetic variation, effective population size (Ne) measures the rate at which that variation is lost. Several effective population sizes have been proposed as standards for the safe retention of genetic variation in a minimum viable population (MVP). Estimators of effective population size have been developed since the 1940s. Earlier estimators were based on the demographic properties of a population. Genetic estimators are based on the analysis of one or two genetic samples. All estimators of effective population size have their own characteristic sensitivities and limitations, which complicates the choice of estimator, the comparability of results, and the assessment of their reliability. Effective population size estimation has been used in monitoring many aquatic populations, but the reliability and comparability of the estimators have only rarely been tested with comprehensive demographic and genetic data, and there is no clear view of how much added value they bring to the conservation of species that are in any case easy to monitor. I tested this with a dataset on the Siberian jays (Perisoreus infaustus) living in Suupohja, Finland (62°22'N, 21°30'E). The Suupohja Siberian jays form one of the few Siberian jay areas in Southern Finland. I used three demographic and three genetic estimators of effective population size to determine the effective size of the population and the ratio of effective to census size (Ne/N), and compared the results with heterozygosity and allelic richness values and with the discussion on minimum viable population size. According to the results, the ratio of effective to census population size in the Siberian jay is about 0.6. Variation in reproductive success and fluctuations in population size contributed to this ratio. According to the effective population size measured with genetic estimators, the census population size of the Siberian jays would be 44% higher than the average population size of the study area. This result is probably linked to the surprisingly high proportion of immigrants among breeding individuals, because of which the genetic data would reflect a genetic neighbourhood larger than the study area. The results also indicate that the Suupohja Siberian jays would probably be unable to maintain their genetic variation without immigration, especially if the cessation of immigration also led to a decline in population size. Conservation efforts should aim at securing gene flow to the remaining Siberian jay areas in Southern Finland in order to prevent genetic homogenisation. Because average dispersal distances in the Siberian jay are short, it is possible that rarer long-distance dispersal events play a key role in preventing genetic differentiation. Estimating effective population size from demographic data was laborious, whereas the genetic estimates showed large variation depending on year and estimator. Reliable estimation of effective population size would have required better prior knowledge of the population's genetic structure. Nevertheless, measuring effective population size gave a clearer picture of the genetic wellbeing of the Suupohja Siberian jays than the measured heterozygosity and allelic richness values, which did not show any decline in genetic variation during the study period

    Estimating abundance of African great apes

    All species and subspecies of African great apes are listed by the International Union for the Conservation of Nature as endangered or critically endangered, and populations continue to decline. As human populations and industry expand into great ape habitat, efficient, reliable estimators of great ape abundance are needed to inform conservation status and land-use planning, to assess adverse and beneficial effects of human activities, and to help funding agencies and donors make informed and efficient contributions. Fortunately, technological advances have improved our ability to sample great apes remotely, and new statistical methods for estimating abundance are constantly in development. Following a brief general introduction, this thesis reviews established and emerging approaches to estimating great ape abundance, then describes new methods for estimating animal density from photographic data by distance sampling with camera traps, and for selecting among models of the distance sampling detection function when distance data are overdispersed. Subsequent chapters quantify the effect of violating the assumption of demographic closure when estimating abundance using spatially explicit capture–recapture models for closed populations, and describe the design and implementation of a camera trapping survey of chimpanzees at the landscape scale in Kibale National Park, Uganda. The new methods developed have generated considerable interest, and allow abundances of multiple species, including great apes, to be estimated from data collected during a single photographic survey. 
Spatially explicit capture–recapture analyses of photographic data from small study areas yielded accurate and precise estimates of chimpanzee abundance, and this combination of methods could be used to enumerate great apes over large areas and in dense forests more reliably and efficiently than previously possible. "This work was supported by a St Leonard’s College Scholarship from the University of St Andrews, and the Max Planck Institute for Evolutionary Anthropology." -- Fundin

    Resilience of a deer hunting system in Southeast Alaska: integrating social, ecological, and genetic dimensions

    Thesis (Ph.D.) University of Alaska Fairbanks, 2009. I examined the interactions of key components of a hunting system of Sitka black-tailed deer (Odocoileus hemionus sitkensis) on Prince of Wales Island, Alaska to address concerns of subsistence hunters and to provide a new tool to more effectively monitor deer populations. To address hunter concerns, I documented local knowledge and perceptions of changes in harvest opportunities of deer over the last 50 years as a result of landscape change (e.g., logging, roads). To improve deer monitoring, I designed an efficient method to sample and survey deer pellets, tested the feasibility of identifying individual deer from fecal DNA, and used DNA-based mark and recapture techniques to estimate population trends of deer. I determined that intensive logging from 1950 into the 1990s provided better hunter access to deer and habitat that facilitated deer hunting. However, recent declines in logging activity and successional changes in logged forests have reduced access to deer and increased undesirable habitat for deer hunting. My findings suggested that using DNA from fecal pellets is an effective method for monitoring deer in southeast Alaska. My sampling protocol optimized encounter rates with pellet groups allowing feasible and efficient estimates of deer abundance. I estimated deer abundance with precision (±20%) each year in 3 distinct watersheds, and identified a 30% decline in the deer population between 2006 and 2008. My data suggested that 3 consecutive severe winters caused the decline. Further, I determined that managed forest harvested >30 years ago supported fewer deer relative to young-managed forest and unmanaged forest. 
I provided empirical data to support both the theory that changes in plant composition because of succession of logged forest may reduce habitat carrying capacity of deer over the long-term (i.e., decades), and that severity of winter weather may be the most significant force behind annual changes in deer population size in southeast Alaska. Adaptation at an individual and institutional level may be needed to build resilience into the hunting system as most (>90%) of logged forest in southeast Alaska transitions over the next couple of decades into a successional stage that sustains fewer deer and deer hunting opportunities.
    1. General introduction -- 2. Influence of hunter adaptability on resilience of subsistence hunting systems -- 3. Linking hunter knowledge with forest change to understand changing deer harvest opportunities in intensively logged landscapes -- 4. Individual identification of Sitka black-tailed deer using DNA from fecal pellets -- 5. A practical approach for sampling along animal trails -- 6. Estimating abundance of Sitka black-tailed deer using DNA from fecal pellets -- 7. Summary -- 8. Future recommendations -- Appendix