307 research outputs found

    Fishery Interaction Modeling of Cetacean Bycatch in the California Drift Gillnet Fishery to Inform a Dynamic Ocean Management Tool

    Get PDF
    Understanding the drivers that lead to interactions between target species in a fishery and marine mammals is critical to efforts to reduce bycatch. In the California drift gillnet fishery, static management approaches and gear changes have reduced bycatch, but neither measure addresses the underlying dynamics causing bycatch events. To avoid further, potentially drastic measures such as hard caps, dynamic management approaches that consider the scales relevant to physical dynamics, animal movement, and human use could be implemented. A key component of this approach is determining the factors that lead to fisheries interactions. Using 25 years (1990-2014) of National Oceanic and Atmospheric Administration fisheries observer data from the California drift gillnet fishery, we model the relative probability of bycatch (presence–absence) of four cetacean species in the California Current System (short-beaked common dolphin Delphinus delphis, northern right whale dolphin Lissodelphis borealis, Risso’s dolphin Grampus griseus, and Pacific white-sided dolphin Lagenorhynchus obliquidens). Because protected-species bycatch events are rare, each species’ dataset contains a large number of absences (zeros). Using a data-assimilative configuration of the Regional Ocean Modeling System, we assessed the capability of a flexible machine-learning algorithm to handle these zero-inflated datasets and to explore the physical drivers of cetacean bycatch in the fishery. Results suggest that cetacean bycatch probability has a complex relationship with the physical environment, with mesoscale variability acting as a strong driver. Through the modeling process, we observed varied responses across the range of sample sizes in the zero-inflated datasets, determining the minimum number of presences needed to build an accurate model. The selection of predictor variables and model evaluation statistics was found to play an important role in assessing the biological significance of our species distribution models. These results highlight the statistical capability (and incapability) of modeling techniques to predict the complex interactions driving cetacean bycatch in the California drift gillnet fishery. By determining where fisheries interactions are most likely to occur, we can inform near real-time management approaches that reduce bycatch while still allowing fishermen to meet their catch quotas.
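    As an illustration of the kind of presence–absence model described above (a minimal sketch, not the authors' exact pipeline), the code below fits a random forest to zero-inflated observer data; the file name and predictor columns (sst, ssh_anomaly, chl, depth) are hypothetical stand-ins for the ROMS-derived covariates.

```python
# Minimal sketch of a presence–absence bycatch model on zero-inflated data.
# Hypothetical file and column names stand in for the ROMS-derived predictors.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

sets = pd.read_csv("observer_sets.csv")          # one row per observed net set
X = sets[["sst", "ssh_anomaly", "chl", "depth"]] # environmental covariates
y = sets["bycatch_present"]                      # 1 if cetacean bycatch occurred

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the rare presences relative to the
# overwhelming number of absences (zeros) in the dataset.
model = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                               random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```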

    Predictive analytics in agribusiness industries

    Get PDF
    Agriculturally related industries are routinely among the most hazardous work environments. Workplace injuries directly affect labor-market outcomes, including income reduction, job loss, and the health of injured workers. Beyond medical and indemnity costs, workplace incidents carry indirect costs such as equipment damage and repair, incident investigation time, training replacement personnel, increased insurance premiums in the year following the incident, slowed production schedules, damage to the company’s reputation, and reduced worker motivation to return to work. The main purpose of incident analysis is to derive and develop preventative measures from injury data. Applying proper analytical tools to discover the causes of occupational incidents is essential to gaining information that helps prevent such incidents in the future. Insight gained from analyses of workers’ compensation data can efficiently direct preventative activities at high-risk industries. Since incidents arise from a combination of factors rather than a single cause, research on occupational incidents must go deeper in identifying the underlying causes and their relationships through more comprehensive analyses. This study therefore aimed to identify underlying patterns in occupational injury occurrence and costs using data mining and predictive modeling techniques rather than traditional statistical methods. Using a workers’ compensation claims dataset, the objectives of this study were to: investigate the use of predictive modeling techniques in forecasting future claims costs from historical data; identify distinctive patterns of high-cost occupational injuries; and examine how well machine learning methods find predictive relationships between the factors influencing occupational injuries and the occurrence and severity of workers’ compensation claims. The results lead to a better understanding of injury patterns and to the identification of prevalent causes of occupational injuries and of high-risk industries and occupations. Stakeholders such as policymakers, insurance companies, safety standard writers, and manufacturers of safety equipment can therefore use the findings to plan remedial actions and revise safety standards. The implementation of safety measures by agribusiness organizations can prevent occupational injuries, save lives, and reduce the occurrence and cost of such incidents in agricultural work environments.
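    A minimal sketch of the claims-cost forecasting step, assuming a hypothetical workers' compensation dataset and feature names; the study's actual models and variables may differ.

```python
# Illustrative sketch: forecast claim severity from historical claim features.
# File and column names (industry_code, body_part, cause, age) are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

claims = pd.read_csv("wc_claims.csv")
X = pd.get_dummies(claims[["industry_code", "body_part", "cause", "age"]])
y = claims["total_cost"]

model = GradientBoostingRegressor(random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("CV MAE:", -scores.mean())
```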

    Practical Statistics for Particle Physics

    Full text link
    This is the write-up of a set of lectures given at the Asia Europe Pacific School of High Energy Physics in Quy Nhon, Vietnam, in September 2018, to an audience of PhD students in all branches of particle physics. They cover the different meanings of 'probability', particularly frequentist and Bayesian; the binomial, Poisson, and Gaussian distributions; hypothesis testing; estimation; errors (including asymmetric and systematic errors); and goodness of fit. Several different methods used in setting upper limits are explained, followed by a discussion of why 5 sigma is conventionally required for a 'discovery'.
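    Two standard calculations from this material can be reproduced in a few lines: the one-sided p-value corresponding to the 5 sigma discovery threshold, and the classic frequentist 90% CL upper limit on a Poisson mean.

```python
# One-sided p-value for a 5 sigma threshold, and a 90% CL Poisson upper limit.
from scipy import stats

# Tail probability beyond 5 sigma of a standard Gaussian.
p_5sigma = stats.norm.sf(5)
print(f"5 sigma one-sided p-value: {p_5sigma:.2e}")   # ~2.9e-07

# 90% CL upper limit on mu for n observed events (no background):
# solve P(N <= n; mu_up) = 0.10 via the chi-squared relation.
# n = 0 gives the familiar mu_up = 2.30.
n = 0
mu_up = stats.chi2.ppf(0.90, 2 * (n + 1)) / 2
print(f"n = {n}: mu_up = {mu_up:.2f}")
```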

    Cosmological Effects of Powerful AGN Outbursts in Galaxy Clusters: Insights from an XMM-Newton Observation of MS0735+7421

    Get PDF
    We report the results of an analysis of XMM-Newton observations of MS0735+7421, the galaxy cluster that hosts the most energetic AGN outburst currently known. The previous Chandra image shows twin giant X-ray cavities (~200 kpc diameter) filled with radio emission and surrounded by a weak shock front. The XMM data are consistent with these findings. The total energy in cavities and shock (~6 × 10^61 erg) is enough to quench the cooling flow and, since most of the energy is deposited outside the cooling region (~100 kpc), to heat the gas within 1 Mpc by ~1/4 keV per particle. The cluster exhibits an upward departure (a factor of ~2) from the mean L-T relation. The boost in emissivity produced by ICM compression in the bright shells due to cavity expansion may help explain the high luminosity and high central gas mass fraction that we measure. The scaled temperature and metallicity profiles are in general agreement with those observed in relaxed clusters, and the quantities we measure are consistent with the observed M-T relation. We conclude that violent outbursts such as the one in MS0735+7421 do not cause dramatic instantaneous departures from cluster scaling relations (other than the L-T relation). However, if they are relatively common, they may play a role in creating the global cluster properties. (69 pages, 30 figures, accepted for publication in ApJ.)
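    A back-of-the-envelope check of the quoted heating, assuming a round-number gas mass of ~10^14 solar masses within 1 Mpc (our assumption for illustration, not a value from the paper), recovers the same order as the quoted ~1/4 keV per particle.

```python
# Distribute ~6e61 erg among the ICM particles within 1 Mpc.
# The gas mass is an assumed round number, not a value from the paper.
ERG_PER_KEV = 1.602e-9
M_SUN_G = 1.989e33
M_P_G = 1.673e-24
MU = 0.6                      # mean molecular weight of the ionized ICM

E_erg = 6e61
M_gas_g = 1e14 * M_SUN_G      # assumed gas mass within 1 Mpc
n_particles = M_gas_g / (MU * M_P_G)
print(f"~{E_erg / n_particles / ERG_PER_KEV:.2f} keV per particle")  # ~0.19 keV
```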

    The impact of macroeconomic leading indicators on inventory management

    Get PDF
    Forecasting tactical sales is important for long-term decisions such as procurement, and for informing lower-level inventory management decisions. Macroeconomic indicators have been shown to improve forecast accuracy at the tactical level, as these indicators can provide early warnings of changing markets, while tactical sales are sufficiently aggregated to facilitate the identification of useful leading indicators. Past research has shown that significant gains can be achieved by incorporating such information. However, at the lower levels at which inventory decisions are taken, this is often not feasible due to the level of noise in the data. To take advantage of macroeconomic leading indicators at this level, we need to translate the tactical forecasts into operational-level ones. In this research we investigate how best to assimilate top-level forecasts that incorporate such exogenous information with bottom-level (Stock Keeping Unit level) extrapolative forecasts. The aim is to demonstrate whether incorporating these variables has a positive impact on bottom-level planning and, eventually, on inventory levels. We construct appropriate hierarchies of sales and use that structure to reconcile the forecasts, and in turn the different available information, across levels. We are interested in both the point forecasts and the prediction intervals, as the latter inform safety stock decisions. The contribution of this research is therefore twofold: we investigate the usefulness of macroeconomic leading indicators for SKU-level forecasts, and alternative ways to estimate the variance of hierarchically reconciled forecasts. We provide evidence using a real case study.
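    A minimal sketch of the reconciliation idea, using the simple OLS (structural) projection rather than whatever weighting scheme the study ultimately adopts: base forecasts for a total series and its SKUs are mapped to a coherent set, so top-level information (e.g., from macroeconomic indicators) influences the SKU-level forecasts.

```python
# OLS forecast reconciliation for a two-level hierarchy:
# one total series above three SKUs (toy numbers for illustration).
import numpy as np

# Summing matrix: row 0 is the total, rows 1-3 are the SKUs.
S = np.array([[1, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

y_hat = np.array([120.0, 30.0, 40.0, 35.0])  # base forecasts (total, SKU1-3)

# Projection y_tilde = S (S'S)^{-1} S' y_hat makes the forecasts coherent.
P = np.linalg.inv(S.T @ S) @ S.T
y_tilde = S @ (P @ y_hat)
print(y_tilde)                               # SKUs now sum to the total
```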

    Gradient boosting in automatic machine learning: feature selection and hyperparameter optimization

    Get PDF
    The goal of automatic machine learning (AutoML) is to automate all aspects of model selection in (supervised) predictive modeling. This thesis deals with gradient boosting techniques in the context of AutoML, with a focus on gradient tree boosting and component-wise gradient boosting. Both techniques share a common methodology, but their goals are quite different. While gradient tree boosting is widely used in machine learning as a powerful prediction algorithm, component-wise gradient boosting's strength lies in feature selection and the modeling of high-dimensional data. Extensions of component-wise gradient boosting to multidimensional prediction functions are considered as well. The challenge of hyperparameter optimization for these algorithms is discussed, focusing on Bayesian optimization and efficient early-stopping strategies. The difficulty of optimizing these algorithms is shown by a large-scale random search over hyperparameters of machine learning algorithms, which can build the foundation of new AutoML and meta-learning approaches. Furthermore, advanced feature selection strategies are summarized and a new method based on shadow features is introduced. Finally, an AutoML approach based on these results and on best practices for feature selection and hyperparameter optimization is proposed, with the goal of simplifying and stabilizing AutoML while maintaining high prediction accuracy. This approach is compared to AutoML methods that use much more complex search spaces and ensembling techniques. Four software packages for the statistical programming language R have been newly developed or extended as part of this thesis: mlrMBO, a general framework for Bayesian optimization; autoxgboost, an automatic machine learning framework that heavily utilizes gradient tree boosting; compboost, a modular framework for component-wise boosting written in C++; and gamboostLSS, a framework for component-wise boosting of generalized additive models for location, scale, and shape.
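    The shadow-feature idea can be sketched in a few lines: permuted copies of each feature serve as an importance baseline, and only real features that beat the best shadow are kept. This is an illustrative sketch, not the thesis implementation (which is in R).

```python
# Shadow-feature selection: append permuted copies of each feature and keep
# only real features whose importance beats the best shadow feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
rng = np.random.default_rng(0)
shadows = rng.permuted(X, axis=0)            # destroys feature-target links
X_aug = np.hstack([X, shadows])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_aug, y)
imp = rf.feature_importances_
threshold = imp[X.shape[1]:].max()           # best shadow importance
selected = np.where(imp[:X.shape[1]] > threshold)[0]
print("selected features:", selected)
```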

    Statistical methods for NHS incident reporting data

    Get PDF
    The National Reporting and Learning System (NRLS) is the English and Welsh NHS's national repository of incident reports from healthcare. It aims to capture details of incident reports at national level and to facilitate clinical review and learning to improve patient safety. These incident reports range from minor 'near-misses' to critical incidents that may lead to severe harm or death. NRLS data are currently reported as crude counts and proportions, but their major use is clinical review of the free-text descriptions of incidents. Few well-developed quantitative analysis approaches exist for NRLS, and this thesis investigates such methods. A literature review revealed a wealth of clinical detail but also systematic constraints of the NRLS's structure, including non-mandatory reporting, missing data, and misclassification. Summary statistics for reports from 2010/11 to 2016/17 supported this and suggested that NRLS data were not suitable for statistical modelling in isolation. Modelling was advanced by creating a hybrid dataset incorporating hospital casemix data from Hospital Episode Statistics (HES). A theoretical model was established, based on 'exposure' variables (using casemix proxies) and 'culture' as a random effect. The initial modelling approach examined Poisson regression, mixture, and multilevel models. Overdispersion was significant, generated mainly by clustering and aggregation in the hybrid dataset, but models were chosen to reflect these structures. Further modelling approaches were examined, using Generalized Additive Models to smooth predictor variables, regression-tree-based models including Random Forests, and Artificial Neural Networks. Models were also extended to a subset of death and severe-harm incidents, exploring how sparse counts affect models. Text mining techniques were examined for the analysis of incident descriptions, showing how term frequency might be used; terms were used to generate latent topic models, used in turn to predict the harm level of incidents. Model outputs were used to create a 'Standardised Incident Reporting Ratio' (SIRR) and to cast it in the mould of current regulatory frameworks, using process control techniques such as funnel plots and cusum charts. A prototype online reporting tool was developed to allow NHS organisations to examine their SIRRs, provide supporting analyses, and link data points back to individual incident reports.
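    A minimal sketch of the SIRR idea, assuming hypothetical file and column names: expected report counts come from a casemix regression, and the ratio of observed to expected gives the standardised ratio that funnel plots and cusum charts would then monitor.

```python
# Standardised Incident Reporting Ratio: observed reports over the expectation
# from a Poisson casemix model. File and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

trusts = pd.read_csv("trust_year.csv")  # one row per organisation-year
model = smf.glm("reports ~ admissions + casemix_index",
                data=trusts, family=sm.families.Poisson()).fit()

trusts["expected"] = model.predict(trusts)
trusts["SIRR"] = trusts["reports"] / trusts["expected"]
print(trusts[["org", "SIRR"]].head())    # SIRR > 1: more reports than expected
```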

    Analysis of Human Gut Metagenomes for the Prediction of Host Traits with Tree Ensemble Machine Learning Models

    Get PDF
    The human gut microbiota comprises a myriad of microorganisms, including not only bacteria but also archaea. Present at lower abundances, technically more challenging to quantify, and under-represented in databases, archaea are often overlooked in descriptions of the human gut microbiome. Nonetheless, the main archaeon in terms of prevalence and abundance is Methanobrevibacter smithii, of the family Methanobacteriaceae. It has been associated with various host phenotypes, such as slow transit and dietary habits. Remarkably, contrasting evidence shows an association between M. smithii and body mass index (BMI): it is enriched in lean or in obese individuals depending on the population study. Reasonable hypotheses relying on the metabolism of the archaeon support these conflicting findings; for instance, its slow replication time supports its association with slow transit. M. smithii and all members of the Methanobacteriaceae family are methanogens: their metabolism relies on the reduction of simple carbon molecules to methane. In the human gut, methanogenesis starts from bacterial fermentation products. In particular, H2 and CO2 are the primary substrates of M. smithii; formate can also be used, but with a lower energy yield. By taking up fermentation products, M. smithii can boost specific fermentation pathways, consequently affecting the production of short-chain fatty acids (SCFA). These byproducts of bacterial fermentation are absorbed by the host, where they mediate host energy and inflammatory metabolisms. Accordingly, the overall effect of M. smithii may depend on the fermentation potential of the gut microbiome, itself defined by the microbiome composition. Hence, M. smithii may influence its host by consuming fermentation products. Because we know so little about the interactions between M. smithii and fermenting bacteria, gaining knowledge of their diversity, their specificity, and the underlying mechanisms would improve our understanding of the role of methanogens in the human gut. This work aims to provide insights into the associations between M. smithii and gut bacteria. Because methanogens are fastidious to culture, I performed a meta-analysis of human gut metagenomes using machine learning models. To decipher the variable interactions captured by the models, I developed a tool for interpreting tree ensemble models. My new method allowed me to infer biologically relevant associations between the methanogen and components of the human gut environment. In particular, I found a clear association between M. smithii and an uncultured family of the Christensenellales order, as well as members of the Oscillospirales order predicted to have a slow replication time and to be associated with slow transit. Furthermore, predictions from the model revealed a gradient in the relative abundances of a core group of taxa associated with the colonization of human guts by Methanobacteriaceae. This gradient generally followed microbiome composition types (enterotypes) previously correlated with human population traits. This suggests that associations between methanogens and phenotypes known to be associated with certain enterotypes, such as the correlation of BMI with the ETB enterotype, may be spurious. I then further explored the association between M. smithii and members of the Christensenellales order by comparing co-cultures of M. smithii with Christensenella minuta, a human gut isolate of the Christensenellaceae family, and with Bacteroides thetaiotaomicron, a common H2 producer from the human gut. Results demonstrated syntrophy via H2 transfer between Christensenellaceae and the methanogen, accompanied by a switch in SCFA production. Altogether, my findings complement current knowledge of the interactions between the human gut methanogen M. smithii and fermenting bacteria. They support the hypothesis that M. smithii preferentially interacts with specific H2 producers in the human gut, e.g., members of the Christensenellales order, as well as with a core group of bacteria favoring its colonization of the gut environment. Syntrophy may underlie the identified associations, with potential effects on bacterial fermentation. In addition, my method for interpreting machine learning models applies to any problem studied with tree ensemble models; its potential for helping to understand complex systems is thus not limited to the microbiome field and will hopefully prove useful to other researchers.
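    As an illustrative stand-in for the metagenome meta-analysis (the interpretation tool developed in this work goes further, extracting variable interactions), the sketch below trains a tree ensemble to predict M. smithii presence from taxa abundances and ranks taxa by permutation importance; file and column names are hypothetical.

```python
# Tree ensemble on metagenome-derived taxa abundances, with permutation
# importance as a simple interpretability baseline. Names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

abund = pd.read_csv("taxa_relative_abundance.csv", index_col=0)
y = abund.pop("M_smithii_present")   # 1 if the methanogen was detected

X_tr, X_te, y_tr, y_te = train_test_split(abund, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

pi = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
ranked = pd.Series(pi.importances_mean, index=abund.columns).nlargest(10)
print(ranked)   # associated taxa (e.g., Christensenellales) should rank high
```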

    Spatio-temporal modeling of arthropod-borne zoonotic diseases: a proposed methodology to enhance macro-scale analyses

    Get PDF
    Zoonotic diseases are infectious diseases that can be transmitted from or through animals to humans, and arthropods often act as vectors for transmission. Emerging infectious diseases have been increasing in both prevalence and geographic range at alarming rates over the last 30 years, and the majority of these diseases are zoonotic in nature. Many zoonotic diseases are considered notifiable by the Centers for Disease Control and Prevention (CDC). However, though state regulations or contractual obligations may require the reporting of certain diseases, significant underreporting is known to exist. Because of the rich volume of information captured in health insurance plan databases, administrative medical claims data could supplement the current reporting systems and allow for more comprehensive spatio-temporal analyses of zoonotic infections. The purpose of this dissertation is to introduce electronic administrative medical claims data as a potential new source that could be leveraged in ecological field studies for the surveillance of arthropod-borne zoonotic diseases. If using medical claims data to study zoonoses is a viable approach, it could improve both the temporal and the spatial scale of study, through long-term longitudinal data covering large geographic expanses and more geographically refined ZIP code scales. Additionally, claims data could supplement the current reporting of notifiable diseases to the CDC. This effort may help bridge the disease incidence gap created by health care providers' underreporting and thus allow for more effective tracking and monitoring of infectious zoonotic diseases across time and space. I specifically examined five tick-borne diseases (Lyme disease [LD], babesiosis, ehrlichiosis, Rocky Mountain spotted fever [RMSF], and tularemia) and two mosquito-borne diseases (West Nile virus and La Crosse viral encephalitis) known to occur in the southeastern US. I first compared disease incidence rates from cases reported to the Tennessee Department of Health (TDH) state registry system with medically diagnosed cases captured in a southeastern managed care organization (MCO) claims data warehouse. I determined that LD and RMSF are significantly underreported in Tennessee. Three cases of babesiosis were discovered in the claims data, a significant finding as this disease had never previously been reported in Tennessee. Next, I used a cluster scan statistic to statistically validate when (temporally) and where (spatially) these data sources differ. The findings highlight how the significant clusters from the two data sources do not overlap, supporting the need to integrate administrative and state registry data sources to provide a more comprehensive set of case information. Once the usefulness of administrative data was demonstrated, I focused on how these data could improve spatio-temporal macro-scale analyses by examining information at the ZIP code level rather than through traditional county-level assessments. I expanded on the current literature on spatially explicit modeling by employing more advanced data mining techniques: four modeling approaches were compared (stepwise logistic regression, classification and regression tree, gradient boosted tree, and neural network) to describe the occurrence of tick-borne diseases in relation to socio-demographic, geographic, and habitat characteristics. The covariates most useful in explaining LD and RMSF were similar, and included co-occurrences of RMSF and LD, respectively; the amount of forested and non-forested wetlands, pasture/grasslands, and urbanized/developed lands; population counts; and median income levels. Finally, I conclude with a ZIP-code-level spatio-temporal modeling exercise to determine the areas and time periods in Tennessee where significant clusters of the studied diseases occurred. ZIP-code-level clusters were compared with the previously defined county-level clusters to discuss the importance of spatial scale. The findings suggest that focused disease/vector prevention efforts in non-endemic areas are warranted. Very little work exists using administrative claims data in the study of zoonotic diseases, so this body of work adds to an area with little existing knowledge. Administrative medical claims data are relatively easy to access given the appropriate permissions, incur essentially no cost once access is granted, and provide the researcher with a rich dataset to study. This data source deserves proper consideration in wildlife and biological sciences research.
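    A minimal sketch of the four-model comparison, with hypothetical file and covariate names standing in for the socio-demographic, geographic, and habitat variables described above:

```python
# Compare the four model families on ZIP-code-level disease occurrence.
# File and column names are hypothetical illustrations.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

zips = pd.read_csv("zip_covariates.csv")
X = zips[["forested_wetland_pct", "pasture_pct", "developed_pct",
          "population", "median_income"]]
y = zips["lyme_present"]

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "cart": DecisionTreeClassifier(random_state=0),
    "gbt": GradientBoostingClassifier(random_state=0),
    "neural_net": MLPClassifier(max_iter=2000, random_state=0),
}
for name, m in models.items():
    auc = cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```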

    Potential of neural network triggers for the Water-Cherenkov detector array of the Pierre Auger Observatory

    Get PDF
