1,089 research outputs found

    F-measure Maximization in Multi-Label Classification with Conditionally Independent Label Subsets

    We discuss a method to improve the exact F-measure maximization algorithm called GFM, proposed in (Dembczynski et al. 2011) for multi-label classification, assuming the label set can be partitioned into conditionally independent subsets given the input features. If the labels were all independent, the estimation of only m parameters (m denoting the number of labels) would suffice to derive Bayes-optimal predictions in O(m^2) operations. In the general case, m^2+1 parameters are required by GFM to solve the problem in O(m^3) operations. In this work, we show that the number of parameters can be reduced further to m^2/n in the best case, assuming the label set can be partitioned into n conditionally independent subsets. As this label partition needs to be estimated from the data beforehand, we first use the procedure proposed in (Gasse et al. 2015) that finds such a partition, and then infer the required parameters locally in each label subset. The latter are aggregated and serve as input to GFM to form the Bayes-optimal prediction. We show on a synthetic experiment that the reduction in the number of parameters brings about significant benefits in terms of performance.
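    The parameter counts above can be compared with a small sketch. The partitioned count assumes each conditionally independent subset of size m_k contributes m_k^2 parameters, so that n equal subsets of size m/n give n·(m/n)^2 = m^2/n in total (the best case stated in the abstract); the function name is illustrative, not from the paper.

```python
def gfm_parameter_counts(m, subset_sizes=None):
    """Parameters needed for Bayes-optimal F-measure prediction.

    m            -- total number of labels
    subset_sizes -- sizes of conditionally independent label subsets, if known

    Returns (independent, general_gfm, partitioned). The partitioned count
    assumes each subset of size m_k contributes m_k**2 parameters, which
    recovers m**2 / n for n equal-sized subsets.
    """
    independent = m                       # all labels independent: m parameters
    general_gfm = m**2 + 1                # general case handled by GFM
    partitioned = (sum(k**2 for k in subset_sizes)
                   if subset_sizes else general_gfm)
    return independent, general_gfm, partitioned
```

    For example, 12 labels split into 3 independent subsets of 4 need 48 = 12^2/3 parameters instead of 145.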

    Multiple morbidities in companion dogs: a novel model for investigating age-related disease

    The proportion of men and women surviving over 65 years has been steadily increasing over the last century. In their later years, many of these individuals are afflicted with multiple chronic conditions, placing increasing pressure on healthcare systems. The accumulation of multiple health problems with advanced age is well documented, yet the causes are poorly understood. Animal models have long been employed in attempts to elucidate these complex mechanisms with limited success. Recently, the domestic dog has been proposed as a promising model of human aging for several reasons. Mean lifespan shows twofold variation across dog breeds. In addition, dogs closely share the environments of their owners, and substantial veterinary resources are dedicated to comprehensive diagnosis of conditions in dogs. However, while dogs are therefore useful for studying multimorbidity, little is known about how aging influences the accumulation of multiple concurrent disease conditions across dog breeds. The current study examines how age, body weight, and breed contribute to variation in multimorbidity in over 2,000 companion dogs visiting private veterinary clinics in England. In common with humans, we find that the number of diagnoses increases significantly with age in dogs. However, we find no significant weight or breed effects on morbidity number. This surprising result reveals that while breeds may vary in their average longevity and causes of death, their age-related trajectories of morbidities differ little, suggesting that age of onset of disease may be the source of variation in lifespan across breeds. Future studies with increased sample sizes and longitudinal monitoring may help us discern more breed-specific patterns in morbidity. Overall, the large increase in multimorbidity seen with age in dogs mirrors that seen in humans and lends even more credence to the value of companion dogs as models for human morbidity and mortality

    Calculating partial expected value of perfect information via Monte Carlo sampling algorithms

    Partial expected value of perfect information (EVPI) calculations can quantify the value of learning about particular subsets of uncertain parameters in decision models. Published case studies have used different computational approaches. This article examines the computation of partial EVPI estimates via Monte Carlo sampling algorithms. The mathematical definition shows 2 nested expectations, which must be evaluated separately because of the need to compute a maximum between them. A generalized Monte Carlo sampling algorithm uses nested simulation with an outer loop to sample parameters of interest and, conditional upon these, an inner loop to sample remaining uncertain parameters. Alternative computation methods and shortcut algorithms are discussed and mathematical conditions for their use considered. Maxima of Monte Carlo estimates of expectations are biased upward, and the authors show that the use of small samples results in biased EVPI estimates. Three case studies illustrate 1) the bias due to maximization and the inaccuracy of shortcut algorithms, 2) the behaviour when correlated parameters are present, and 3) the behaviour when there is nonlinearity in net benefit functions. If relatively small correlation or nonlinearity is present, then the shortcut algorithm can be substantially inaccurate. Empirical investigation of the numbers of Monte Carlo samples suggests that fewer samples on the outer level and more on the inner level could be efficient and that relatively small numbers of samples can sometimes be used. Several remaining areas for methodological development are set out. A wider application of partial EVPI is recommended both for greater understanding of decision uncertainty and for analyzing research priorities.
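    The nested two-loop algorithm can be sketched as follows. Here `nb`, `draw_theta` and `draw_psi` are placeholder names for a model's net-benefit function and parameter samplers, not names from the article; small `n_outer`/`n_inner` values exhibit the upward bias from maximizing noisy Monte Carlo means that the article discusses.

```python
import numpy as np

def partial_evpi(nb, draw_theta, draw_psi, n_outer=1000, n_inner=100, seed=0):
    """Nested Monte Carlo estimate of partial EVPI (illustrative sketch).

    nb(theta, psi) -> array of net benefits, one entry per decision option.
    draw_theta samples the parameters of interest; draw_psi the remainder.
    """
    rng = np.random.default_rng(seed)
    # First term: E_theta[ max_d E_{psi | theta} NB(d, theta, psi) ]
    with_info = 0.0
    for _ in range(n_outer):
        theta = draw_theta(rng)
        inner = np.mean([nb(theta, draw_psi(rng)) for _ in range(n_inner)], axis=0)
        with_info += inner.max()          # decide after learning theta
    with_info /= n_outer
    # Second term: max_d E[ NB(d, theta, psi) ] under full uncertainty
    baseline = np.mean([nb(draw_theta(rng), draw_psi(rng))
                        for _ in range(n_outer * n_inner)], axis=0).max()
    return with_info - baseline
```

    On a toy model with two options, NB = (0, theta) and theta ~ N(0, 1), the exact partial EVPI for theta is E[max(0, theta)] ≈ 0.399, which the sketch recovers up to Monte Carlo error.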

    Measurement error in a multi-level analysis of air pollution and health: a simulation study.

    BACKGROUND: Spatio-temporal models are increasingly being used to predict exposure to ambient outdoor air pollution at high spatial resolution for inclusion in epidemiological analyses of air pollution and health. Measurement error in these predictions can nevertheless have impacts on health effect estimation. Using statistical simulation we aim to investigate the effects of such error within a multi-level model analysis of long and short-term pollutant exposure and health. METHODS: Our study was based on a theoretical sample of 1000 geographical sites within Greater London. Simulations of "true" site-specific daily mean and 5-year mean NO2 and PM10 concentrations, incorporating both temporal variation and spatial covariance, were informed by an analysis of daily measurements over the period 2009-2013 from fixed location urban background monitors in the London area. In the context of a multi-level single-pollutant Poisson regression analysis of mortality, we investigated scenarios in which we specified: the Pearson correlation between modelled and "true" data and the ratio of their variances (model versus "true") and assumed these parameters were the same spatially and temporally. RESULTS: In general, health effect estimates associated with both long and short-term exposure were biased towards the null with the level of bias increasing to over 60% as the correlation coefficient decreased from 0.9 to 0.5 and the variance ratio increased from 0.5 to 2. However, for a combination of high correlation (0.9) and small variance ratio (0.5) non-trivial bias (> 25%) away from the null was observed. Standard errors of health effect estimates, though unaffected by changes in the correlation coefficient, appeared to be attenuated for variance ratios > 1 but inflated for variance ratios < 1. 
CONCLUSION: While our findings suggest that in most cases modelling errors result in attenuation of the effect estimate towards the null, in some situations a non-trivial bias away from the null may occur. The magnitude and direction of bias appears to depend on the relationship between modelled and "true" data in terms of their correlation and the ratio of their variances. These factors should be taken into account when assessing the validity of modelled air pollution predictions for use in complex epidemiological models
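    The mechanism can be illustrated with a deliberately simplified linear analogue of the scenario above (the study itself uses a multi-level Poisson model; the correlation `rho` and variance ratio `ratio` play the same roles as in the abstract).

```python
import numpy as np

def modelled_exposure(x, rho, ratio, rng):
    """Surrogate exposure z with corr(z, x) = rho and var(z) / var(x) = ratio."""
    return (rho * np.sqrt(ratio) * x
            + np.sqrt(ratio * (1.0 - rho**2)) * rng.standard_normal(x.size))

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)            # "true" exposure, unit variance
y = 1.0 * x + rng.standard_normal(x.size)   # outcome with true effect beta = 1

slopes = {}
for rho, ratio in [(0.9, 0.5), (0.5, 2.0)]:
    z = modelled_exposure(x, rho, ratio, rng)
    # OLS slope of y on z; theory predicts beta * rho / sqrt(ratio)
    slopes[(rho, ratio)] = np.cov(z, y)[0, 1] / np.var(z)
```

    With rho = 0.9 and ratio = 0.5 the estimated slope is inflated by about 27% (bias away from the null), while rho = 0.5 and ratio = 2 attenuates it by about 65%, mirroring the directions of bias reported above.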

    The strike rate index: a new index for journal quality based on journal size and the h-index of citations

    Quantifying the impact of scientific research is almost always controversial, and there is a need for a uniform method that can be applied across all fields. Increasingly, however, the quantification has been summed up in the impact factor of the journal in which the work is published, which is known to show differences between fields. Here the h-index, a way to summarize an individual's highly cited work, was calculated for journals over a twenty-year time span and compared to the size of the journal in four fields: Agriculture, Condensed Matter Physics, Genetics and Heredity, and Mathematical Physics. There is a linear log-log relationship between the h-index and the size of the journal: the larger the journal, the more likely it is to have a high h-index. The four fields cannot be separated from each other, suggesting that this relationship applies to all fields. A strike rate index (SRI) based on the log relationship of the h-index and the size of the journal shows a similar distribution in the four fields, with similar thresholds for quality, allowing journals across diverse fields to be compared to each other. The SRI explains more than four times the variation in citation counts compared to the impact factor.
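    The h-index underlying the SRI can be computed directly from a journal's per-paper citation counts; the SRI itself then rescales this value by journal size via the log-log relationship, whose exact form is given in the paper rather than reproduced here.

```python
def h_index(citations):
    """Largest h such that h of the papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank          # this paper still clears the threshold
        else:
            break             # all later papers have even fewer citations
    return h
```

    For example, a journal whose five papers received 10, 8, 5, 4 and 3 citations has h = 4: four papers with at least four citations each.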

    New distance measures for classifying X-ray astronomy data into stellar classes

    The classification of X-ray sources into classes (such as extragalactic sources, background stars, ...) is an essential task in astronomy. Typically, one of the classes corresponds to extragalactic radiation, whose photon emission behaviour is well characterized by a homogeneous Poisson process. We propose to use normalized versions of the Wasserstein and Zolotarev distances to quantify the deviation of the distribution of photon interarrival times from the exponential class. Our main motivation is the analysis of a massive dataset from X-ray astronomy obtained by the Chandra Orion Ultradeep Project (COUP). This project yielded a large catalog of 1616 X-ray cosmic sources in the Orion Nebula region, with their series of photon arrival times and associated energies. We consider the plug-in estimators of these metrics, determine their asymptotic distributions, and illustrate their finite-sample performance with a Monte Carlo study. We estimate these metrics for each COUP source from three different classes. We conclude that our proposal provides a striking amount of information on the nature of the photon-emitting sources. Further, these variables have the ability to identify X-ray sources wrongly catalogued before. As an appealing conclusion, we show that some sources, previously classified as extragalactic emissions, have a much higher probability of being young stars in the Orion Nebula.
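    A plug-in estimate of this kind can be sketched as follows. The normalization chosen here (dividing the order-statistic Wasserstein-1 distance by the mean interarrival time to make it unit-free) is a plausible illustration, not necessarily the exact normalization used in the paper.

```python
import numpy as np

def norm_w1_to_exponential(arrival_times):
    """Normalized Wasserstein-1 distance between observed photon interarrival
    times and an exponential law with the same mean (plug-in sketch)."""
    gaps = np.sort(np.diff(np.sort(arrival_times)))  # ordered interarrival times
    n, mean = gaps.size, gaps.mean()
    # Exponential quantiles at midpoint probabilities (i - 0.5) / n
    probs = (np.arange(1, n + 1) - 0.5) / n
    exp_quantiles = -mean * np.log1p(-probs)
    # W1 between the two quantile functions, scaled by the mean gap
    return np.abs(gaps - exp_quantiles).mean() / mean
```

    Arrival times from a homogeneous Poisson process (the extragalactic baseline) give a distance near zero, whereas strictly periodic arrivals, maximally far from exponential spacing, give a distance near 2/e ≈ 0.74.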

    Light smoking at base-line predicts a higher mortality risk to women than to men; evidence from a cohort with long follow-up

    BACKGROUND: There is conflicting evidence as to whether smoking is more harmful to women than to men. The UK Cotton Workers’ Cohort was recruited in the 1960s and contained a high proportion of men and women smokers who were well matched in terms of age, job and length of time in job. The cohort has been followed up for 42 years. METHODS: Mortality in the cohort was analysed using an individual relative survival method and Cox regression. Whether smoking, ascertained at baseline in the 1960s, was more hazardous to women than to men was examined by estimating the relative risk ratio (women to men, smokers to never-smokers) for light (1–14), medium (15–24) and heavy (25+ cigarettes per day) smoking, and for former smoking. RESULTS: For all-cause mortality, relative risk ratios were 1.35 for light smoking at baseline (95% CI 1.07-1.70), 1.15 for medium smoking (95% CI 0.89-1.49) and 1.00 for heavy smoking (95% CI 0.63-1.61). The relative risk ratio for light smoking at baseline was 1.42 for circulatory system disease (95% CI 1.01 to 1.98) and 1.89 for respiratory disease (95% CI 0.99 to 3.63). Heights of participants provided no explanation for the gender difference. CONCLUSIONS: Light smoking at baseline was shown to be significantly more hazardous to women than to men, but the effect decreased as consumption increased, indicating a dose-response relationship. Heavy smoking was equally hazardous to both genders. This result may help explain the conflicting evidence seen elsewhere. However, gender differences in smoking cessation may provide an alternative explanation.

    Evaluation of clustering algorithms for gene expression data

    BACKGROUND: Cluster analysis is an integral part of high dimensional data analysis. In the context of large scale gene expression data, a filtered set of genes is grouped together according to their expression profiles using one of the numerous clustering algorithms that exist in the statistics and machine learning literature. A closely related problem is that of selecting a clustering algorithm that is "optimal" in some sense from a rather impressive list of clustering algorithms that currently exist. RESULTS: In this paper, we propose two validation measures, each with two parts: one measuring the statistical consistency (stability) of the clusters produced and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving SAGE data of breast cancer patients and the other involving time-course cDNA microarray data on yeast. Six well-known clustering algorithms (UPGMA, K-Means, Diana, Fanny, Model-Based and SOM) were evaluated. CONCLUSION: No single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in the selection of an optimal algorithm, for a given data set, from a collection of available clustering algorithms.
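    A generic building block for the stability part of such validation measures is a pair-counting agreement score between two clusterings of the same genes, e.g. clusterings of the full data and of a perturbed resample. The plain Rand index is shown here as an illustration; it is not the paper's specific index.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of gene pairs on which two clusterings agree: the pair is
    either co-clustered in both or separated in both (1.0 = identical
    partitions up to relabelling)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

    An unstable algorithm yields low agreement across resamples, so the complementary quantity 1 − RI, averaged over resamples, can play the role of a "smaller is better" stability index as described above.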

    Differentially expressed alternatively spliced genes in Malignant Pleural Mesothelioma identified using massively parallel transcriptome sequencing

    BACKGROUND: Analyses of Expressed Sequence Tag (EST) databases suggest that most human genes have multiple alternative splice variants. The alternative splicing of pre-mRNA is tightly regulated during development and in different tissue types, and changes in splicing patterns have been described in disease states. Recently, we used whole-transcriptome shotgun pyrosequencing to characterize 4 malignant pleural mesothelioma (MPM) tumors, 1 lung adenocarcinoma and 1 normal lung. We hypothesized that alternative splicing profiles might be detected in the sequencing data for the expressed genes in these samples. METHODS: We developed a software pipeline to map the transcriptome read sequences of the 4 MPM samples and 1 normal lung sample onto known exon junction sequences in the comprehensive AceView database of expressed sequences and to count how many reads map to each junction. The 13,274,187 transcriptome reads generated by the Roche/454 sequencing platform for the 5 samples were compared with 151,486 exon junctions from the AceView database. An exon junction expression index (EJEI) was calculated for each exon junction in each sample to measure the differential expression of alternative splicing events. The ten exon junctions with the largest EJEI difference between the 4 mesothelioma samples and the normal lung sample were then examined for differential expression using quantitative real-time PCR (qRT-PCR) in the 5 sequenced samples. Two of the differentially expressed exon junctions (ACTG2.aAug05 and CDK4.aAug05) were further examined with qRT-PCR in an additional 18 MPM and 18 normal lung specimens. RESULTS: We found 70,953 exon junctions covered by at least one sequence read in at least one of the 5 samples. All 10 of the most differentially expressed exon junctions were validated as present by RT-PCR, and 8 were differentially expressed exactly as predicted by the sequence analysis. The differential expression of the AceView exon junctions for the ACTG2 and CDK4 genes was also observed to be statistically significant in the additional 18 MPM and 18 normal lung samples examined using qRT-PCR. The differential expression of these two junctions successfully classified these mesothelioma and normal lung specimens with high sensitivity (89% and 78%, respectively). CONCLUSION: Whole-transcriptome shotgun sequencing, combined with a downstream bioinformatics pipeline, provides powerful tools for the identification of differentially expressed exon junctions resulting from alternative splice variants. The alternatively spliced genes discovered in the study could serve as useful diagnostic markers as well as potential therapeutic targets for MPM.