48 research outputs found

    Integrating Bayesian networks and Simpson's paradox in data mining

    This paper proposes to integrate two very different kinds of methods for data mining, namely the construction of Bayesian networks from data and the detection of occurrences of Simpson’s paradox. The former aims at discovering potentially causal knowledge in the data, whilst the latter aims at detecting surprising patterns in the data. By integrating these two kinds of methods we can hopefully discover patterns which are more likely to be useful to the user, a challenging data mining goal which is under-explored in the literature. The proposed integration method involves two approaches. The first approach uses the detection of occurrences of Simpson’s paradox as a preprocessing step for a more effective construction of Bayesian networks, whilst the second approach uses the construction of a Bayesian network from data as a preprocessing step for the detection of occurrences of Simpson’s paradox
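
As a rough illustration of what "detecting an occurrence of Simpson's paradox" can mean in practice, the sketch below checks whether the direction of a success-rate comparison between two groups is the same in every stratum but reverses once the strata are pooled. This is not the paper's detection method. The function names and the data layout are assumptions made for this example, and the counts are the widely quoted kidney-stone treatment figures used in textbook illustrations, not data from the paper.

```python
from typing import Dict, Tuple

# counts[stratum][group] = (successes, trials); this layout is an assumption
# chosen for the example, not a format taken from the paper.
Counts = Dict[str, Dict[str, Tuple[int, int]]]


def sign(x: float) -> int:
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rate_gap(cell_a: Tuple[int, int], cell_b: Tuple[int, int]) -> float:
    """Success rate of cell_a minus success rate of cell_b."""
    return cell_a[0] / cell_a[1] - cell_b[0] / cell_b[1]


def simpson_reversal(counts: Counts, a: str, b: str) -> bool:
    """True if a-vs-b has one strict direction in every stratum
    and the opposite strict direction after pooling all strata."""
    stratum_signs = {sign(rate_gap(s[a], s[b])) for s in counts.values()}
    if len(stratum_signs) != 1 or 0 in stratum_signs:
        return False  # strata disagree with each other, or contain a tie
    pooled_a = (sum(s[a][0] for s in counts.values()),
                sum(s[a][1] for s in counts.values()))
    pooled_b = (sum(s[b][0] for s in counts.values()),
                sum(s[b][1] for s in counts.values()))
    return sign(rate_gap(pooled_a, pooled_b)) == -stratum_signs.pop()


# Textbook kidney-stone counts (illustrative only, not from the paper).
KIDNEY_STONES: Counts = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

if __name__ == "__main__":
    print(simpson_reversal(KIDNEY_STONES, "A", "B"))
```

In this data set treatment A has the higher success rate within each stone size, yet treatment B has the higher pooled rate, so the check reports a reversal.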

    A New Discrete Particle Swarm Algorithm Applied to Attribute Selection in a Bioinformatics Data Set

    Many data mining applications involve the task of building a model for predictive classification. The goal of such a model is to classify examples (records or data instances) into classes or categories of the same type. The use of variables (attributes) not related to the classes can reduce the accuracy and reliability of a classification or prediction model. Superfluous variables can also increase the costs of building a model - particularly on large data sets. We propose a discrete Particle Swarm Optimization (PSO) algorithm designed for attribute selection. The proposed algorithm deals with discrete variables, and its population of candidate solutions contains particles of different sizes. The performance of this algorithm is compared with the performance of a standard binary PSO algorithm on the task of selecting attributes in a bioinformatics data set. The criteria used for comparison are: (1) maximizing predictive accuracy; and (2) finding the smallest subset of attributes
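
For orientation, here is a minimal sketch of the standard binary PSO that the abstract uses as its baseline, applied to attribute selection. It is not the proposed discrete PSO with variable-size particles. The parameter values, the function names, and above all `toy_fitness` are assumptions: `toy_fitness` is a hypothetical stand-in for classifier accuracy that rewards a made-up set of "relevant" attribute indices and penalises subset size, loosely mirroring the two comparison criteria.

```python
import math
import random
from typing import Callable, List, Tuple

Bits = List[int]  # one 0/1 flag per attribute (1 = attribute selected)


def binary_pso(fitness: Callable[[Bits], float], n_bits: int,
               n_particles: int = 20, n_iters: int = 50,
               w: float = 0.7, c1: float = 1.5, c2: float = 1.5,
               v_max: float = 4.0, seed: int = 0) -> Tuple[Bits, float]:
    """Standard binary PSO: velocities are real-valued, and each bit is
    resampled with probability sigmoid(velocity). Maximises `fitness`."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # global best so far

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_bits):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))  # clamp velocity
                prob_one = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


# Hypothetical set of useful attribute indices, invented for this example.
RELEVANT = {0, 3, 7}


def toy_fitness(bits: Bits) -> float:
    """Stand-in for accuracy: +1 per relevant attribute selected,
    minus a small penalty per selected attribute (favours small subsets)."""
    hits = sum(1 for i, b in enumerate(bits) if b and i in RELEVANT)
    return hits - 0.1 * sum(bits)


if __name__ == "__main__":
    best, score = binary_pso(toy_fitness, n_bits=10)
    print(best, score)
```

In a real experiment the fitness function would wrap a classifier evaluated on held-out data. The size penalty here is only one simple way to express a preference for smaller attribute subsets.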

    Data standards can boost metabolomics research, and if there is a will, there is a way.

    Thousands of articles using metabolomics approaches are published every year. With the increasing amounts of data being produced, mere description of investigations as text in manuscripts is not sufficient to enable re-use anymore: the underlying data needs to be published together with the findings in the literature to maximise the benefit from public and private expenditure and to take advantage of an enormous opportunity to improve scientific reproducibility in metabolomics and cognate disciplines. Reporting recommendations in metabolomics started to emerge about a decade ago and were mostly concerned with inventories of the information that had to be reported in the literature for consistency. In recent years, metabolomics data standards have developed extensively, to include the primary research data, derived results and the experimental description and importantly the metadata in a machine-readable way. This includes vendor independent data standards such as mzML for mass spectrometry and nmrML for NMR raw data that have both enabled the development of advanced data processing algorithms by the scientific community. Standards such as ISA-Tab cover essential metadata, including the experimental design, the applied protocols, association between samples, data files and the experimental factors for further statistical analysis. Altogether, they pave the way for both reproducible research and data reuse, including meta-analyses. Further incentives to prepare standards compliant data sets include new opportunities to publish data sets, but also require a little "arm twisting" in the author guidelines of scientific journals to submit the data sets to public repositories such as the NIH Metabolomics Workbench or MetaboLights at EMBL-EBI. In the present article, we look at standards for data sharing, investigate their impact in metabolomics and give suggestions to improve their adoption

    A genetic algorithm-Bayesian network approach for the analysis of metabolomics and spectroscopic data: application to the rapid detection of Bacillus spores and identification of Bacillus species

    Background: The rapid identification of Bacillus spores and bacterial identification are paramount because of their implications in food poisoning, pathogenesis and their use as potential biowarfare agents. Many automated analytical techniques such as Curie-point pyrolysis mass spectrometry (Py-MS) have been used to identify bacterial spores, giving rise to large amounts of analytical data. This high number of features makes interpretation of the data extremely difficult. We analysed Py-MS data from 36 different strains of aerobic endospore-forming bacteria encompassing seven different species. These bacteria were grown axenically on nutrient agar and vegetative biomass and spores were analyzed by Curie-point Py-MS. Results: We develop a novel genetic algorithm-Bayesian network algorithm that accurately identifies and selects a small subset of key relevant mass spectra (biomarkers) to be further analysed. Once identified, this subset of relevant biomarkers was then used to identify Bacillus spores successfully and to identify Bacillus species via a Bayesian network model specifically built for this reduced set of features. Conclusions: This final compact Bayesian network classification model is parsimonious, computationally fast to run and its graphical visualization allows easy interpretation of the probabilistic relationships among selected biomarkers. In addition, we compare the features selected by the genetic algorithm-Bayesian network approach with the features selected by partial least squares-discriminant analysis (PLS-DA). The classification accuracy results show that the set of features selected by the GA-BN is far superior to PLS-DA

    Profiling of spatial metabolite distributions in wheat leaves under normal and nitrate limiting conditions

    The control and interaction between nitrogen and carbon assimilatory pathways is essential in both photosynthetic and non-photosynthetic tissue in order to support metabolic processes without compromising growth. Physiological differences between the basal and mature region of wheat (Triticum aestivum) primary leaves confirmed that there was a change from heterotrophic to autotrophic metabolism. Fourier Transform Infrared (FT-IR) spectroscopy confirmed the suitability and phenotypic reproducibility of the leaf growth conditions. Principal Component–Discriminant Function Analysis (PC–DFA) revealed distinct clustering between base and tip sections of the developing wheat leaf, and between plants grown in the presence or absence of nitrate. Gas Chromatography-Time of Flight/Mass Spectrometry (GC-TOF/MS) combined with multivariate and univariate analyses, and Bayesian network (BN) analysis, distinguished different tissues and confirmed the physiological switch from high rates of respiration to photosynthesis along the leaf. The operation of nitrogen metabolism impacted on the levels and distribution of amino acids, organic acids and carbohydrates within the wheat leaf. In plants grown in the presence of nitrate there were reduced levels of a number of sugar metabolites in the leaf base and an increase in maltose levels, possibly reflecting an increase in starch turnover. The value of using this combined metabolomics analysis for further functional investigations in the future is discussed

    Detection of glycosylation and iron-binding protein modifications using Raman spectroscopy

    In this study we demonstrate the use of Raman spectroscopy to determine protein modifications as a result of glycosylation and iron binding. Most proteins undergo some modifications after translation which can directly affect protein function. Identifying these modifications is particularly important in the production of biotherapeutic agents as they can affect stability, immunogenicity and pharmacokinetics. However, post-translational modifications can often be difficult to detect with regard to the subtle structural changes they induce in proteins. From their Raman spectra, apo- and holo-forms of the iron-binding proteins transferrin and ferritin could be readily distinguished, and variations in spectral features as a result of structural changes could also be determined. In particular, differences in solvent exposure of aromatic amino acid residues could be identified between the open and closed forms of the iron-binding proteins. Protein modifications as a result of glycosylation can be even more difficult to identify. Through the application of the chemometric techniques of principal component analysis and partial least squares regression, variations in Raman spectral features as a result of glycosylation-induced structural modifications could be identified. These were then used to distinguish between glycosylated and non-glycosylated transferrin and to measure the relative concentrations of the glycoprotein within a mixture of the native non-glycosylated protein

    COordination of Standards in MetabOlomicS (COSMOS): facilitating integrated metabolomics data access

    Metabolomics has become a crucial phenotyping technique in a range of research fields including medicine, the life sciences, biotechnology and the environmental sciences. This necessitates the transfer of experimental information between research groups, as well as potentially to publishers and funders. After the initial efforts of the metabolomics standards initiative, minimum reporting standards were proposed which included the concepts for metabolomics databases. Built by the community, standards and infrastructure for metabolomics are still needed to allow storage, exchange, comparison and re-utilization of metabolomics data. The Framework Programme 7 EU Initiative ‘coordination of standards in metabolomics’ (COSMOS) is developing a robust data infrastructure and exchange standards for metabolomics data and metadata. This is to support workflows for a broad range of metabolomics applications within the European metabolomics community and the wider metabolomics and biomedical communities’ participation. Here we announce our concepts and efforts asking for re-engagement of the metabolomics community, academics and industry, journal publishers, software and hardware vendors, as well as those interested in standardisation worldwide (addressing missing metabolomics ontologies, complex-metadata capturing and XML based open source data exchange format), to join and work towards updating and implementing metabolomics standards

    Glycemia but not the Metabolic Syndrome is Associated with Cognitive Decline: Findings from the European Male Ageing Study

    © 2017 American Association for Geriatric Psychiatry. Objective: Previous research has indicated that components of the metabolic syndrome (MetS), such as hyperglycemia and hypertension, are negatively associated with cognition. However, evidence that MetS itself is related to cognitive performance has been inconsistent. This longitudinal study investigates whether MetS or its components affect cognitive decline in aging men and whether any interaction with inflammation exists. Methods: Over a mean of 4.4 years (SD ± 0.3), men aged 40–79 years from the multicenter European Male Ageing Study were recruited. Cognitive functioning was assessed using the Rey-Osterrieth Complex Figure (ROCF), the Camden Topographical Recognition Memory (CTRM) task, and the Digit Symbol Substitution Test (DSST). High-sensitivity C-reactive protein (hs-CRP) levels were measured using a chemiluminescent immunometric assay. Results: Overall, 1,913 participants contributed data to the ROCF analyses and 1,965 subjects contributed to the CTRM and DSST analyses. In multiple regression models the presence of baseline MetS was not associated with cognitive decline over time (p > 0.05). However, logistic ordinal regressions indicated that high glucose levels were related to a greater risk of decline on the ROCF Copy (β = −0.42, p < 0.05) and the DSST (β = −0.39, p < 0.001). There was neither a main effect of hs-CRP levels nor an interaction effect of hs-CRP and MetS at baseline on cognitive decline. Conclusion: No evidence was found for a relationship between MetS or inflammation and cognitive decline in this sample of aging men. However, glycemia was negatively associated with visuoconstructional abilities and processing speed

    Particle swarm for attribute selection in Bayesian classification: an application to protein function prediction

    The discrete particle swarm optimization (DPSO) algorithm is an optimization technique which belongs to the fertile paradigm of Swarm Intelligence. Designed for the task of attribute selection, the DPSO deals with discrete variables in a straightforward manner. This work empowers the DPSO algorithm by extending it in two ways. First, it enables the DPSO to select attributes for a Bayesian network algorithm, which is more sophisticated than the Naive Bayes classifier previously used by the original DPSO algorithm. Second, it applies the DPSO to a set of challenging protein functional classification data, involving a large number of classes to be predicted. The work then compares the performance of the DPSO algorithm against the performance of a standard Binary PSO algorithm on the task of selecting attributes on those data sets. The criteria used for this comparison are (1) maximizing predictive accuracy, and (2) finding the smallest subset of attributes