20 research outputs found

    Does replication groups scoring reduce false positive rate in SNP interaction discovery?

    BACKGROUND. Computational methods that infer single nucleotide polymorphism (SNP) interactions from phenotype data may uncover new biological mechanisms in non-Mendelian diseases. However, practical aspects of such analysis face many problems. Present experimental studies typically use SNP arrays with hundreds of thousands of SNPs but record only hundreds of samples. Candidate SNP pairs inferred by interaction analysis may include a high proportion of false positives. Recently, Gayan et al. (2008) proposed to reduce the number of false positives by combining results of interaction analysis performed on subsets of data (replication groups), rather than analyzing the entire data set directly. If performing as hypothesized, replication groups scoring could improve interaction analysis and also any type of feature ranking and selection procedure in systems biology. Because Gayan et al. do not compare their approach to the standard interaction analysis techniques, we here investigate whether replication groups indeed reduce the number of reported false positive interactions. RESULTS. A set of simulated and false interaction-imputed experimental SNP data sets was used to compare the inference of SNP-SNP interactions by means of replication groups to the standard approach, in which the entire data set was used directly to score all candidate SNP pairs. In all our experiments, the inference of interactions from the entire data set (i.e., without using the replication groups) reported fewer false positives. CONCLUSIONS. Compared with the direct scoring approach, the use of replication groups does not reduce false positive rates and may, depending on the data set, often perform worse.
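
    The comparison described in this abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not the procedure of Gayan et al. or of the authors: the score is an information-theoretic interaction gain, replication groups are formed by a simple interleaved split, group results are combined by taking the minimum score, and the data are simulated with one planted interaction between SNP 0 and SNP 1.

```python
import itertools
import math
import random
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def mutual_info(xs, ys):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def interaction_score(rows, i, j):
    """Direct scoring: I(Xi,Xj;Y) - I(Xi;Y) - I(Xj;Y) on all given rows."""
    xi = [g[i] for g, _ in rows]
    xj = [g[j] for g, _ in rows]
    y = [p for _, p in rows]
    return mutual_info(list(zip(xi, xj)), y) - mutual_info(xi, y) - mutual_info(xj, y)

def replication_score(rows, i, j, n_groups=3):
    """Assumed combination rule: score each disjoint group, keep the weakest."""
    groups = [rows[k::n_groups] for k in range(n_groups)]
    return min(interaction_score(g, i, j) for g in groups)

def rank_pairs(rows, n_snps, score):
    """Rank all SNP pairs from highest to lowest score."""
    pairs = itertools.combinations(range(n_snps), 2)
    return sorted(pairs, key=lambda p: score(rows, *p), reverse=True)

# Simulated data (hypothetical): 300 samples, 5 SNPs with genotypes 0/1/2.
# The phenotype is an XOR-like interaction of SNP 0 and SNP 1 with 10% noise.
rng = random.Random(0)
N_SNPS = 5
rows = []
for _ in range(300):
    genotype = tuple(rng.randint(0, 2) for _ in range(N_SNPS))
    phenotype = (genotype[0] > 0) != (genotype[1] > 0)
    if rng.random() < 0.1:
        phenotype = not phenotype
    rows.append((genotype, int(phenotype)))

direct_ranking = rank_pairs(rows, N_SNPS, interaction_score)
replicated_ranking = rank_pairs(rows, N_SNPS, replication_score)
```

    Comparing how many non-planted pairs appear near the top of `direct_ranking` and `replicated_ranking` across many simulated data sets is one way to estimate false positives for each approach; this toy setup is far smaller than a real SNP array study.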

    Induction of prediction models using domain knowledge about related features

    Domain knowledge can help us build more accurate prediction models. Molecular biology is one of the fields where induction of prediction models is relatively hard due to few learning instances in a typical data set, but there exists vast domain knowledge. Basic entities of the field---genes, proteins, and metabolic products---are described and categorized in various freely accessible databases. This thesis focuses on methods that transform data from the space of features into the space of feature groups, which can be assembled from existing databases and represent prior knowledge. Features in the molecular biology data sets used in the thesis represent genes. Methods working with gene groups assume that expression profiles of genes belonging to the same group are similar. We show that expression profiles of gene pairs from groups in the KEGG and BioGRID databases are more similar than those of random gene pairs, but the differences are small. The differences do not change with the database version. We propose a technique for transformation of data into a space of feature groups with collective matrix factorization, which simultaneously factorizes matrices representing data and feature groups into a product of latent factors with ranks smaller than the ranks of the original matrices. The models induced from the transformed data can be as accurate as models on the non-transformed data. In contrast to existing approaches, the proposed approach can also use features that are not in predefined groups of features but are similar to features in a group. Techniques that transform data into a space of feature groups require estimation of transformation parameters such as feature weights. Techniques that use values of the target variable for parameter estimation produce values for the feature groups that are at least partially fitted to the target variable. The induced models could therefore overestimate the importance of class-overfitted features, which can decrease their accuracy on novel data. We propose a solution that uses stacking. The proposed solution can work with any transformation technique and, for some data sets, boosts accuracy substantially. In the thesis we thoroughly study transformation of data into predefined feature groups. We show, in the largest study so far, that, on average, models induced from data sets transformed with feature groups do not obtain better prediction accuracies than models induced on non-transformed data sets. As the accuracies on transformed and non-transformed data sets are similar, the transformed data may still be preferred, as models on feature groups are easier to interpret.
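
    As a rough illustration of what "transforming data into the space of feature groups" means, the sketch below averages the expression of the genes in each group. This is a deliberately simple baseline chosen for illustration; it is not the collective matrix factorization technique proposed in the thesis, and the function name, gene names, and group names are all hypothetical.

```python
from statistics import mean

def to_group_space(samples, groups):
    """Map each sample {gene: expression} to {group: mean expression of its genes}.

    Genes absent from a sample are skipped; a group with no measured
    genes in a sample is left out of that sample's transformed row.
    """
    transformed = []
    for sample in samples:
        row = {}
        for name, genes in groups.items():
            values = [sample[g] for g in genes if g in sample]
            if values:
                row[name] = mean(values)
        transformed.append(row)
    return transformed

# Hypothetical toy data: two samples, three measured genes, two gene groups.
samples = [
    {"geneA": 2.0, "geneB": 4.0, "geneC": 1.0},
    {"geneA": 0.0, "geneB": 2.0, "geneC": 5.0},
]
groups = {"pathway1": ["geneA", "geneB"], "pathway2": ["geneC", "geneD"]}
group_data = to_group_space(samples, groups)
```

    Because this averaging never looks at the target variable, it avoids the class-overfitting issue the abstract describes for transformations whose parameters are estimated from the target.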

    Functional complement analysis can predict genetic testing results and long-term outcome in patients with complement deficiencies

    Background: Prevalence of complement deficiencies (CDs) is markedly higher in the Slovenian primary immunodeficiency (PID) registry than in other national and international PID registries. Objective: The purposes of our study were to confirm CD and define complete and partial CD in registered patients in Slovenia, to evaluate the frequency of clinical manifestations, and to assess the risk for characteristic infections separately for subjects with complete and partial CD. Methods: CD was confirmed with genetic analyses in patients with C2 deficiency, C8 deficiency, and hereditary angioedema, or with repeated functional complement studies and measurement of complement components in other CDs. Results of genetic studies (homozygous subjects vs. heterozygous carriers) and complement functional studies were analyzed to define complete CD (complement below the level of heterozygous carriers) and partial CD (complement above the level of homozygous patients). Presence of characteristic infections was assessed separately for complete and partial CD. Results: Genetic analyses confirmed a markedly higher prevalence of CD in the Slovenian PID registry (26% of all PID) than in other national and international PID registries (0.5–6% of all PID). Complement functional studies and complement component concentrations reliably distinguished between homozygous and heterozygous CD carriers. Subjects with partial CD had a higher risk for characteristic infections than previously reported. Conclusion: Results of our study imply under-recognition of CD worldwide. Complement functional studies and complement component concentrations reliably predicted risk for characteristic infections in patients with complete or partial CD. Vaccination against encapsulated bacteria should be advocated also for subjects with partial CD and not limited to complete CD.

    Discovering interactions with random forests (Odkrivanje interakcij z naključnimi gozdovi)


    ABC transporters in Dictyostelium discoideum development

    ATP-binding cassette (ABC) transporters can translocate a broad spectrum of molecules across the cell membrane, including physiological cargo and toxins. ABC transporters are known for the role they play in resistance to anticancer agents during chemotherapy. There are 68 ABC transporters annotated in the genome of the social amoeba Dictyostelium discoideum. We have characterized more than half of these ABC transporters through a systematic study of mutations in their genes. We have analyzed morphological and transcriptional phenotypes for these mutants during growth and development and found that most of the mutants exhibited rather subtle phenotypes. A few of the genes may share physiological functions, as reflected in their transcriptional phenotypes. Since most of the abc-transporter mutants showed subtle morphological phenotypes, we utilized these transcriptional phenotypes to identify genes that are important for development by looking for transcripts whose abundance was unperturbed in most of the mutants. We found a set of 668 genes that includes many validated D. discoideum developmental genes. We have also found that abcG6 and abcG18 may have roles in intercellular signaling during terminal differentiation of spores and stalks.

    Quasars/quasar: 1.9.0

    Quasar 1.9.0
    Based on Orange 3.36.1 and orange-spectroscopy 0.6.11
    Important changes since orange-spectroscopy 0.6.8:
    [ENH] Spectra: individual display in its own thread
    [ENH] Spectra: dask table support
    [ENH] Spectra: waterfall plot
    [ENH] Visualization parameters dialog for Spectra and HyperSpectra
    [ENH] Improved the context menu display on Mac
    [ENH] A utility function that can easily replace wavenumbers (x)
    [ENH] Improve palette support (for dark/light mode)
    [FIX] Fix opening GSF files in Multifile
    [FIX] owinterpolate: Call commit.deferred in line edit callbacks
    [FIX] Multifile: fix a bug when wavenumbers differ for less than 1e-6
    [FIX] io.opus: Convert visible image loading Exceptions to warnings
    [FIX] HyperSpectra: handle degenerate (all nan) coordinates and data

    Quasar

    Data volumes collected in many scientific fields have long exceeded the capacity of human comprehension. This is especially true in biomedical research, where multiple replicates and techniques are required to conduct reliable studies. Ever-increasing data rates from new instruments compound our dependence on statistics to make sense of the numbers. The currently available data analysis tools lack user-friendliness, various capabilities, or ease of access. Problem-specific software or scripts freely available in supplementary materials or on research lab websites are often highly specialized, no longer functional, or simply too hard to use. Commercial software limits access and reproducibility, and is often unable to keep up with quickly changing, cutting-edge research demands. Finally, as machine learning techniques penetrate the data analysis pipelines of the natural sciences, we see a growing demand for user-friendly and flexible tools that fuse machine learning with spectroscopy datasets. In our opinion, open-source software with strong community engagement is the way forward. To address these challenges, we develop Quasar, an open-source and user-friendly software package. Here, we present case studies that highlight some Quasar features by analyzing infrared spectroscopy data with various machine learning techniques.

    Orange: data mining toolbox in Python

    Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationally intensive tasks. Orange is intended both for experienced users and programmers and for students of data mining.