441 research outputs found
Principal Component Analysis and Radiative Transfer modelling of Spitzer IRS Spectra of Ultra Luminous Infrared Galaxies
The mid-infrared spectra of ultraluminous infrared galaxies (ULIRGs) contain
a variety of spectral features that can be used as diagnostics to characterise
the spectra. However, such diagnostics are biased by our prior prejudices on
the origin of the features. Moreover, by using only part of the spectrum they
do not utilise the full information content of the spectra. Blind statistical
techniques such as principal component analysis (PCA) consider the whole
spectrum, find correlated features and separate them out into distinct
components.
We further investigate the principal components (PCs) of ULIRGs derived in
Wang et al. (2011). We quantitatively show that five PCs are optimal for
describing the IRS spectra. These five components (PC1-PC5) and the mean
spectrum provide a template basis set that reproduces spectra of all z<0.35
ULIRGs within the noise. For comparison, the spectra are also modelled with a
combination of radiative transfer models of both starbursts and the dusty torus
surrounding active galactic nuclei. The five PCs typically provide better fits
than the models. We argue that the radiative transfer models require a colder
dust component and have difficulty in modelling strong PAH features.
Aided by the models we also interpret the physical processes that the
principal components represent. The third principal component is shown to
indicate the nature of the dominant power source, while PC1 is related to the
inclination of the AGN torus.
Finally, we use the five PCs to define a new classification scheme based on
five-dimensional Gaussian mixture modelling, trained on widely used optical
classifications. The five PCs, average spectra for the four classifications,
and the code to classify objects are made available at:
http://www.phys.susx.ac.uk/~pdh21/PCA/
Comment: 11 pages, 12 figures, accepted for publication in MNRAS
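The reconstruction described above (mean spectrum plus five weighted PCs) can be sketched with scikit-learn on stand-in data; the array shapes, noise level, and random mixtures below are illustrative assumptions, not the paper's actual IRS spectra.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 100 "spectra" of 200 wavelength bins,
# built from 5 underlying components plus noise (the real inputs
# would be Spitzer IRS spectra of ULIRGs).
rng = np.random.default_rng(0)
true_components = rng.normal(size=(5, 200))
mixing = rng.normal(size=(100, 5))
spectra = mixing @ true_components + 0.01 * rng.normal(size=(100, 200))

# Five principal components, the number the paper finds is optimal.
pca = PCA(n_components=5)
scores = pca.fit_transform(spectra)

# Any spectrum is reconstructed as mean spectrum + weighted sum of PCs.
reconstructed = pca.mean_ + scores @ pca.components_

# With five PCs the reconstruction matches to within the noise.
rms_error = np.sqrt(np.mean((spectra - reconstructed) ** 2))
print(rms_error)
```

The template-basis idea is that `pca.mean_` and `pca.components_` can then be fixed and reused to fit new spectra.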
Creating longitudinal datasets and cleaning existing data identifiers in a cystic fibrosis registry using a novel Bayesian probabilistic approach from astronomy
Patient registry data are commonly collected as annual snapshots that need to be amalgamated to understand the longitudinal progress of each patient. However, patient identifiers can change or may not be available for legal reasons when longitudinal data are collated from patients living in different countries. Here, we apply astronomical statistical matching techniques to link individual patient records that can be used where identifiers are absent or to validate uncertain identifiers. We adopt a Bayesian model framework used for probabilistically linking records in astronomy. We adapt this and validate it across blinded, annually collected data: a high-quality (Danish) subset of data held in the European Cystic Fibrosis Society Patient Registry (ECFSPR). Our initial experiments achieved a precision of 0.990 at a recall value of 0.987. However, detailed investigation of the discrepancies uncovered typing errors in 27 of the identifiers in the original Danish subset. After fixing these errors to create a new gold standard, our algorithm correctly linked individual records across years, achieving a precision of 0.997 at a recall value of 0.987 without recourse to identifiers. Our Bayesian framework provides the probability that a pair of records belongs to the same patient. Unlike other record linkage approaches, our algorithm can also use physical models, such as body mass index curves, as prior information for record linkage. We have shown our framework can create longitudinal samples where none existed and validate pre-existing patient identifiers. We have demonstrated that, in this specific case, this automated approach is better than the existing identifiers.
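The core of such a framework, turning a prior match rate and a per-field likelihood into a posterior match probability, can be sketched as a toy Bayes calculation; the BMI likelihood widths and prior rate below are invented for illustration and are not the ECFSPR model.

```python
import math

def normal_pdf(x, sigma):
    """Zero-mean Gaussian density, used for both hypotheses below."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def match_probability(bmi_a, bmi_b, prior_match=0.01,
                      sigma_same=1.0, sigma_diff=6.0):
    """Posterior probability that two annual records belong to the same
    patient, based only on how plausible the BMI change is under each
    hypothesis (all parameters here are illustrative guesses)."""
    diff = bmi_b - bmi_a
    like_same = normal_pdf(diff, sigma_same)  # same patient: small drift
    like_diff = normal_pdf(diff, sigma_diff)  # different patients: wide
    evidence = prior_match * like_same + (1 - prior_match) * like_diff
    return prior_match * like_same / evidence

p_same = match_probability(22.3, 22.5)  # plausible year-on-year change
p_diff = match_probability(22.3, 31.0)  # implausibly large change
print(p_same > p_diff)
```

The paper's framework extends this idea to multiple fields and to physical priors such as BMI growth curves.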
Intersensory integration and reading: a theory / IREC Papers Vol. 1, No. 2
Includes bibliographic references (pp. 34–37)
De-blending Deep Herschel Surveys: A Multi-wavelength Approach
Cosmological surveys in the far infrared are known to suffer from confusion.
The Bayesian de-blending tool, XID+, currently provides one of the best ways to
de-confuse deep Herschel SPIRE images, using a flat flux density prior. This
work is to demonstrate that existing multi-wavelength data sets can be
exploited to improve XID+ by providing an informed prior, resulting in more
accurate and precise extracted flux densities. Photometric data for galaxies in
the COSMOS field were used to constrain spectral energy distributions (SEDs)
using the fitting tool CIGALE. These SEDs were used to create Gaussian prior
estimates in the SPIRE bands for XID+. The multi-wavelength photometry and the
extracted SPIRE flux densities were run through CIGALE again to allow us to
compare the performance of the two priors. ALMA flux densities at 870 μm and
1250 μm, inferred from the best-fitting SEDs from the second CIGALE run, were
compared with the measured ALMA flux densities as an independent performance
validation. Similar validations were conducted with the
SED modelling and fitting tool MAGPHYS and modified black body functions to
test for model dependency. We demonstrate a clear improvement in agreement
between the flux densities extracted with XID+ and existing data at other
wavelengths when using the new informed Gaussian prior over the original
uninformed prior. The residuals between the inferred and measured ALMA flux
densities were calculated. For the Gaussian prior, these residuals, expressed
as a multiple of the ALMA error (σ), have a smaller standard deviation (7.95σ
compared to 12.21σ for the flat prior), a reduced mean (1.83σ compared to
3.44σ), and a reduced positive skew (7.97 compared to 11.50). These results
were determined not to be significantly model dependent. The informed prior
therefore yields statistically more reliable SPIRE flux densities.
Comment: 8 pages, 7 figures, 3 tables. Accepted for publication in A&A
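The benefit of an informed Gaussian prior over a flat one can be illustrated with a one-dimensional conjugate update, a deliberately simplified stand-in for XID+'s full probabilistic map model; the flux values and uncertainties below are invented.

```python
import math

def gaussian_update(mu_like, sigma_like, mu_prior, sigma_prior):
    """Combine a Gaussian likelihood with a Gaussian prior (standard
    conjugate update); returns (mu_post, sigma_post)."""
    w_like = 1.0 / sigma_like ** 2
    w_prior = 1.0 / sigma_prior ** 2
    sigma_post = math.sqrt(1.0 / (w_like + w_prior))
    mu_post = (w_like * mu_like + w_prior * mu_prior) / (w_like + w_prior)
    return mu_post, sigma_post

# Hypothetical source: the confusion-limited map alone gives 20 +/- 6 mJy,
# while a CIGALE-style SED fit to ancillary photometry predicts 15 +/- 3 mJy.
mu_post, sigma_post = gaussian_update(20.0, 6.0, 15.0, 3.0)
print(mu_post, sigma_post)  # posterior is tighter than either input alone
```

A flat prior corresponds to `sigma_prior` going to infinity, leaving the map likelihood unchanged; any finite, well-placed Gaussian prior shrinks the posterior uncertainty.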
Extreme star formation events in quasar hosts over
We explore the relationship between active galactic nuclei and star formation
in a sample of 513 optically luminous type 1 quasars up to redshifts of 4
hosting extremely high star formation rates (SFRs). The quasars are selected to
be individually detected by the Herschel SPIRE instrument at >3σ at 250 μm,
leading to typical SFRs of order 1000 M☉ yr⁻¹. We find the average SFRs
increase by almost a factor of 10 from the lowest to the highest redshifts in
the sample, mirroring the rise in the comoving SFR density over the same
epoch. However, we find that the SFRs remain approximately constant with
accretion luminosity at the highest accretion luminosities probed. We also
find that the SFRs do not correlate with black
hole mass. Both of these results are most plausibly explained by the existence
of a self-regulation process by the starburst at high SFRs, which controls SFRs
on time-scales comparable to or shorter than the AGN or starburst duty cycles.
We additionally find that SFRs do not depend on Eddington ratio at any
redshift, consistent with no relation between SFR and black hole growth rate
per unit black hole mass. Finally, we find that high-ionisation broad
absorption line (HiBAL) quasars have far-infrared properties indistinguishable
from those of classical quasars, consistent with HiBAL quasars being normal
quasars observed along a particular line of sight, with the outflows in HiBAL
quasars having no measurable effect on the star formation in their hosts.
Comment: 12 pages, 6 figures
Can the use of Bayesian analysis methods correct for incompleteness in electronic health records diagnosis data? Development of a novel method using simulated and real-life clinical data
Background
Patient health information is collected routinely in electronic health records (EHRs) and used for research purposes; however, many health conditions are known to be under-diagnosed or under-recorded in EHRs. In research, missing diagnoses result in under-ascertainment of true cases, which attenuates estimated associations between variables and biases results towards the null. Bayesian approaches allow the specification of prior information in the model, such as the likely rates of missingness in the data. This paper describes a Bayesian analysis approach which aimed to reduce attenuation of associations in EHR studies focussed on conditions characterised by under-diagnosis.
Methods
Study 1: We created synthetic data, produced to mimic structured EHR data where diagnoses were under-recorded. We fitted logistic regression (LR) models with and without Bayesian priors representing rates of misclassification in the data. We examined the LR parameters estimated by models with and without priors.
Study 2: We used EHR data from UK primary care in a case-control design with dementia as the outcome. We fitted LR models examining risk factors for dementia, with and without generic prior information on misclassification rates. We examined LR parameters estimated by models with and without the priors, and estimated classification accuracy using Area Under the Receiver Operating Characteristic.
Results
Study 1: In synthetic data, estimates of LR parameters were much closer to the true parameter values when Bayesian priors were added to the model; with no priors, parameters were substantially attenuated by under-diagnosis.
Study 2: The Bayesian approach ran well on real life clinic data from UK primary care, with the addition of prior information increasing LR parameter values in all cases. In multivariate regression models, Bayesian methods showed no improvement in classification accuracy over traditional LR.
Conclusions
The Bayesian approach showed promise but had implementation challenges in real clinical data: prior information on rates of misclassification was difficult to find. Our simple model made a number of assumptions, such as diagnoses being missing at random. Further development is needed to integrate the method into studies using real-life EHR data. Our findings nevertheless highlight the importance of developing methods to address missing diagnoses in EHR data.
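The attenuation that motivates this approach can be reproduced in a few lines: simulate a risk factor, hide a fraction of true cases (the missing-diagnosis mechanism), and compare the fitted logistic coefficients. All numbers are illustrative; this sketch shows the problem the paper addresses, not its Bayesian fix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=(n, 1))                      # one risk factor
true_beta = 1.0
p = 1.0 / (1.0 + np.exp(-true_beta * x[:, 0]))   # true disease probability
y_true = rng.random(n) < p                       # true disease status

# Under-diagnosis: only 60% of true cases appear in the EHR;
# missed cases are wrongly treated as controls.
recorded = y_true & (rng.random(n) < 0.6)

beta_full = LogisticRegression().fit(x, y_true).coef_[0, 0]
beta_ehr = LogisticRegression().fit(x, recorded).coef_[0, 0]
print(beta_full, beta_ehr)  # the EHR coefficient is biased towards zero
```

A Bayesian model with an informative prior on the misclassification rate is one way to recover coefficients closer to `true_beta`.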
Identifying undetected dementia in UK primary care patients: a retrospective case-control study comparing machine-learning and standard epidemiological approaches
Background
Identifying dementia early, using real-world data, is a public health challenge. As only two-thirds of people with dementia currently receive a formal diagnosis in United Kingdom health systems, and many receive it late in the disease process, there is ample room for improvement. The policy of the UK government and National Health Service (NHS) is to increase rates of timely dementia diagnosis. We used data from general practice (GP) patient records to create a machine-learning model to identify patients who have or who are developing dementia, but are currently undetected as having the condition by the GP.
Methods
We used electronic patient records from Clinical Practice Research Datalink (CPRD). Using a case-control design, we selected patients aged >65y with a diagnosis of dementia (cases) and matched them 1:1 by sex and age to patients with no evidence of dementia (controls). We developed a list of 70 clinical entities related to the onset of dementia and recorded in the 5 years before diagnosis. After creating binary features, we trialled machine learning classifiers to discriminate between cases and controls (logistic regression, naïve Bayes, support vector machines, random forest and neural networks). We examined the most important features contributing to discrimination.
Results
The final analysis included data on 93,120 patients, with a median age of 82.6 years; 64.8% were female. The naïve Bayes model performed least well. The logistic regression, support vector machine, neural network and random forest performed very similarly with an AUROC of 0.74. The top features retained in the logistic regression model were disorientation and wandering, behaviour change, schizophrenia, self-neglect, and difficulty managing.
Conclusions
Our model could aid GPs or health service planners with the early detection of dementia. Future work could improve the model by exploring the longitudinal nature of patient data and modelling decline in function over time.
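A minimal version of the case-control pipeline (binary pre-diagnosis features, a logistic regression classifier, AUROC on held-out data) looks roughly like this; the ten synthetic features and their enrichment rates are stand-ins for the paper's 70 clinical entities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, n_features = 4000, 10
# Per-feature enrichment in cases, standing in for clinical entities
# such as disorientation or behaviour change recorded pre-diagnosis.
signal = rng.random(n_features)
y = np.repeat([1, 0], n // 2)                 # 1:1 matched cases/controls
base = 0.1                                    # background feature rate
rates = base + 0.25 * signal * y[:, None]     # cases have elevated rates
X = (rng.random((n, n_features)) < rates).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(auroc)
```

Inspecting `model.coef_` then gives the feature-importance ranking analogous to the paper's top retained features.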
Learning the fundamental mid-infrared spectral components of galaxies with non-negative matrix factorization
The mid-infrared (MIR) spectra observed with the Spitzer Infrared Spectrograph (IRS) provide a valuable data set for untangling the physical processes and conditions within galaxies. This paper presents the first attempt to blindly learn fundamental spectral components of MIR galaxy spectra, using non-negative matrix factorization (NMF). NMF is a recently developed multivariate technique shown to be successful in blind source separation problems. Unlike the more popular multivariate analysis technique, principal component analysis, NMF imposes the condition that weights and spectral components are non-negative. This more closely resembles the physical process of emission in the MIR, resulting in physically intuitive components. By applying NMF to galaxy spectra in the Cornell Atlas of Spitzer/IRS sources, we find similar components amongst different NMF sets. These similar components include two for active galactic nucleus (AGN) emission and one for star formation. The first AGN component is dominated by fine structure emission lines and hot dust, the second by broad silicate emission at 10 and 18 μm. The star formation component contains all the polycyclic aromatic hydrocarbon features and molecular hydrogen lines. Other components include rising continua at longer wavelengths, indicative of colder grey-body dust emission. We show an NMF set with seven components can reconstruct the general spectral shape of a wide variety of objects, though it struggles to fit the varying strength of emission lines. We also show that the seven components can be used to separate out different types of objects. We model this separation with Gaussian mixture modelling and use the result to provide a classification tool.
We also show that the NMF components can be used to separate out the emission from AGN and star formation regions and define a new star formation/AGN diagnostic which is consistent with all MIR diagnostics already in use but has the advantage that it can be applied to MIR spectra with low signal-to-noise ratio or with limited spectral range. The seven NMF components and code for classification are available at https://github.com/pdh21/NMF_software/
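The non-negativity constraint that distinguishes this approach from PCA can be demonstrated with scikit-learn's NMF on synthetic data; the component count, basis shapes, and noise level below are illustrative stand-ins for the paper's seven-component IRS decomposition.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
# Hypothetical non-negative "spectra": positive mixtures of positive
# basis shapes, loosely mimicking MIR emission (not real IRS data).
n_spectra, n_bins, n_comp = 60, 150, 3
basis = rng.random((n_comp, n_bins)) ** 2
mixing = rng.random((n_spectra, n_comp))
spectra = mixing @ basis + 0.01 * rng.random((n_spectra, n_bins))

# Unlike PCA, NMF constrains weights AND components to be non-negative,
# matching the physics of emission (flux cannot be negative).
nmf = NMF(n_components=n_comp, init="nndsvda", max_iter=2000, random_state=0)
weights = nmf.fit_transform(spectra)
components = nmf.components_

rel_residual = (np.linalg.norm(spectra - weights @ components)
                / np.linalg.norm(spectra))
print(rel_residual)  # small: a few components reconstruct the set well
```

The recovered `weights` play the role the paper's component contributions play in its star formation/AGN diagnostic.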
An empirical, Bayesian approach to modelling crop yield: Maize in USA
We apply an empirical, data-driven approach for describing crop yield as a function of monthly temperature and precipitation by employing generative probabilistic models with parameters determined through Bayesian inference. Our approach is applied to state-scale maize yield and meteorological data for the US Corn Belt from 1981 to 2014 as an exemplar, but would be readily transferable to other crops, locations and spatial scales. Experimentation with a number of models shows that maize growth rates can be characterised by a two-dimensional Gaussian function of temperature and precipitation with monthly contributions accumulated over the growing period. This approach accounts for non-linear growth responses to the individual meteorological variables, and allows for interactions between them. Our models correctly identify that temperature and precipitation have the largest impact on yield in the six months prior to the harvest, in agreement with the typical growing season for US maize (April to September). Maximal growth rates occur for monthly mean temperatures of 18 °C–19 °C, corresponding to daily maximum temperatures of 24 °C–25 °C (in broad agreement with previous work), and monthly total precipitation of 115 mm. Our approach also provides a self-consistent way of investigating climate change impacts on current US maize varieties in the absence of adaptation measures. Keeping precipitation and growing area fixed, a temperature increase of 2 °C, relative to 1981–2014, results in the mean yield decreasing by 8%, while the yield variance increases by a factor of around 3. We thus provide a flexible, data-driven framework for exploring the impacts of natural climate variability and climate change on globally significant crops based on their observed behaviour. In concert with other approaches, this can help inform the development of adaptation strategies that will ensure food security under a changing climate.
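The growth model described here, a two-dimensional Gaussian response to monthly temperature and precipitation accumulated over the season, can be sketched directly. The optima match the values quoted in the abstract, but the response widths and the example season are invented for illustration (the paper infers its parameters via Bayesian methods).

```python
import math

def monthly_growth(temp_c, precip_mm,
                   t_opt=18.5, p_opt=115.0, t_width=5.0, p_width=60.0):
    """2D Gaussian growth response to monthly mean temperature (°C)
    and total precipitation (mm); widths are illustrative guesses."""
    return math.exp(-0.5 * ((temp_c - t_opt) / t_width) ** 2
                    - 0.5 * ((precip_mm - p_opt) / p_width) ** 2)

# Yield proxy: growth accumulated over a hypothetical Apr-Sep season,
# one (temperature, precipitation) pair per month.
season = [(12, 90), (17, 110), (21, 120), (23, 100), (19, 80), (15, 70)]
yield_proxy = sum(monthly_growth(t, p) for t, p in season)

# A uniform +2 °C warming, precipitation fixed, lowers accumulated growth.
warmer = sum(monthly_growth(t + 2, p) for t, p in season)
print(yield_proxy, warmer)
```

Because the response is peaked, warming pushes the hottest months further past the optimum than it helps the coolest ones, which is the mechanism behind the quoted yield decline.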
Main sequence of star forming galaxies beyond the Herschel confusion limit
Context. Deep far-infrared (FIR) cosmological surveys are known to be affected by confusion, causing issues when examining the main sequence (MS) of star forming galaxies. In the past this has typically been partially tackled by the use of stacking. However, stacking only provides the average properties of the objects in the stack. Aims. This work aims to trace the MS over 0.2 ≤ z < 6.0 using the latest de-blended Herschel photometry, which reaches ≈ 10 times deeper than the 5σ confusion limit in SPIRE. This provides more reliable star formation rates (SFRs), especially for the fainter galaxies, and hence a more reliable MS. Methods. We built a pipeline that uses the spectral energy distribution (SED) modelling and fitting tool CIGALE to generate flux density priors in the Herschel SPIRE bands. These priors were then fed into the de-blending tool XID+ to extract flux densities from the SPIRE maps. In the final step, multi-wavelength data were combined with the extracted SPIRE flux densities to constrain SEDs and provide stellar masses (M*) and SFRs. These M* and SFRs were then used to populate the SFR–M* plane over 0.2 ≤ z < 6.0. Results. No significant evidence of a high-mass turn-over was found, resulting in the best fit being a simple two-parameter power law of the form log(SFR) = α[log(M*) − 10.5] + β. The normalisation of the power law increased with redshift, rapidly at z ≲ 1.8, from 0.58 ± 0.09 at z ≈ 0.37 to 1.31 ± 0.08 at z ≈ 1.8. The slope was also found to increase with redshift, perhaps with an excess around 1.8 ≤ z < 2.9. Conclusions. The increasing slope indicates that galaxies become more self-similar as redshift increases. This implies that the specific SFR of high-mass galaxies increases with redshift, from z = 0.2 to z = 6.0, becoming closer to that of low-mass galaxies. The excess in the slope at 1.8 ≤ z < 2.9, if present, coincides with the peak of the cosmic star formation history.
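The quoted two-parameter power law can be recovered on synthetic data with an ordinary least-squares fit in log space. The normalisation below is set to the abstract's z ≈ 0.37 value of 0.58, while the slope and scatter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic main-sequence galaxies: log(SFR) = alpha*(log M* - 10.5) + beta
# plus Gaussian scatter (alpha = 0.6 and 0.3 dex scatter are assumptions;
# beta = 0.58 matches the paper's z ~ 0.37 normalisation).
alpha_true, beta_true = 0.6, 0.58
log_mass = rng.uniform(9.0, 11.5, size=300)
log_sfr = alpha_true * (log_mass - 10.5) + beta_true + rng.normal(0, 0.3, 300)

# Fit the same two-parameter form; polyfit returns (slope, intercept).
alpha_fit, beta_fit = np.polyfit(log_mass - 10.5, log_sfr, 1)
print(alpha_fit, beta_fit)
```

Pivoting the fit at log M* = 10.5 decorrelates slope and normalisation, which is why the paper quotes β at that mass.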