
    A novel Border Identification algorithm based on an “Anti-Bayesian” paradigm

    Border Identification (BI) algorithms, a subset of Prototype Reduction Schemes (PRS), aim to reduce the number of training vectors so that the reduced set (the border set) contains only those patterns which lie near the borders of the classes and carry sufficient information to perform a meaningful classification. However, the true border patterns (the “near” borders) cannot perform this task on their own, as they cannot always distinguish the testing samples. Researchers have therefore sought ways to strengthen the border set. A recent development in this field adds further border patterns, i.e., the “far” borders, to the border set, continuing until the classification accuracy no longer increases; in this case, the cardinality of the border set is relatively high. In this paper, we design a novel BI algorithm based on a new definition of the term “border”: we select as border patterns those patterns which lie at the border of the alternate class. Thus, patterns which are neither on the true discriminant nor too close to the central positions of the distributions are added to the border set. The border patterns selected in this manner, although very few in number (for example, five from each class), have the potential to yield a classification comparable to that of well-known traditional classifiers such as the SVM, and very close to the optimal Bayes’ bound.
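    As a sketch of the generic BI pipeline this abstract builds on, the Python snippet below reduces two Gaussian classes to five border patterns each and then classifies by the nearest border point. The selection rule shown (smallest distance to the nearest sample of the alternate class) is a simple illustrative stand-in, not the paper's anti-Bayesian criterion; the function names and parameters are invented for the example.

        import numpy as np

        def border_set(X_own, X_other, k=5):
            # Distance from each pattern of this class to its nearest
            # neighbour in the alternate class; keep the k closest patterns.
            d = np.linalg.norm(X_own[:, None, :] - X_other[None, :, :], axis=2)
            return X_own[np.argsort(d.min(axis=1))[:k]]

        def classify(x, borders):
            # Nearest-border-point rule over the reduced sets.
            return int(np.argmin([np.linalg.norm(B - x, axis=1).min() for B in borders]))

        rng = np.random.default_rng(0)
        X0 = rng.normal([0.0, 0.0], 1.0, size=(200, 2))   # class 0
        X1 = rng.normal([3.0, 3.0], 1.0, size=(200, 2))   # class 1
        borders = [border_set(X0, X1), border_set(X1, X0)]
        print(classify(np.array([2.5, 2.5]), borders))    # point on class 1's side

    Keeping only ten prototypes in total makes the classification cost independent of the training-set size, which is the main appeal of PRS.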

    Nonparametric “anti-Bayesian” quantile-based pattern classification

    Author's accepted manuscript, available from 24/06/2021. This is a post-peer-review, pre-copyedit version of an article published in Pattern Analysis and Applications. The final authenticated version is available online at: https://doi.org/10.1007/s10044-020-00903-7.

    Sparse machine learning models in bioinformatics

    The meaning of parsimony is twofold in machine learning: the structure and/or the parameters of a model can be sparse. Sparse models have many strengths. First, sparsity is an important regularization principle for reducing model complexity and therefore avoiding overfitting. Second, in many fields, for example bioinformatics, high-dimensional data may be generated by a small number of hidden factors, so a properly sparse model is more reasonable than a dense one. Third, a sparse model is often easier to interpret. In this dissertation, we investigate sparse machine learning models and their applications in high-dimensional biological data analysis. We focus our research on five types of sparse models, as follows.

    First, sparse representation is a parsimonious principle stating that a sample can be approximated by a sparse linear combination of basis vectors. We explore existing sparse representation models and propose our own sparse representation methods for high-dimensional biological data analysis. We derive different sparse representation models from a Bayesian perspective; two generic dictionary learning frameworks are proposed, and kernel and supervised dictionary learning approaches are devised. Furthermore, we propose fast active-set and decomposition methods for the optimization of sparse coding models.

    Second, gene-sample-time data are promising in clinical studies but computationally challenging. We propose sparse tensor decomposition methods and kernel methods for the dimensionality reduction and classification of such data. As extensions of matrix factorization, tensor decomposition techniques can reduce the dimensionality of gene-sample-time data dramatically, and the kernel methods run very efficiently on such data.

    Third, we explore two sparse regularized linear models for multi-class problems in bioinformatics. Our first method is the nearest-border classification technique for data with many classes. Our second method is a hierarchical model that simultaneously selects features and classifies samples; our experiment on breast tumor subtyping shows that this model outperforms the one-versus-all strategy in some cases.

    Fourth, we propose spectral clustering approaches for clustering microarray time-series data. The approaches are based on two recently introduced transformations designed specifically for gene expression time-series data, namely the alignment-based and variation-based transformations. Both transformations were devised to take temporal relationships in the data into account, and both have been shown to increase the ability of a clustering method to detect co-expressed genes. We investigate the performance of these transformations, combined with spectral clustering, on two microarray time-series datasets, and discuss their strengths and weaknesses. Our experiments on two well-known real-life datasets show the superiority of the alignment-based transformation over the variation-based transformation for finding meaningful groups of co-expressed genes.

    Fifth, we propose the max-min high-order dynamic Bayesian network (MMHO-DBN) learning algorithm for reconstructing time-delayed gene regulatory networks. Owing to the small sample size of the training data and the power-law nature of gene regulatory networks, the structure of the network is restricted by sparsity. We also apply qualitative probabilistic networks (QPNs) to interpret the learned interactions. Our experiments on both synthetic and real gene expression time-series data show that MMHO-DBN achieves better precision than some existing methods and runs very fast, and that the QPN analysis can accurately predict types of influences and synergies.

    Additionally, since much high-dimensional biological data is subject to missing values, we survey various strategies for learning models from incomplete data. We extend existing imputation methods, originally designed for two-way data, to gene-sample-time data, and we propose a pair-wise weighting method for computing kernel matrices from incomplete data. Computational evaluations show that both approaches are very robust.
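    The sparse representation principle at the heart of the first contribution has a compact computational core: approximate a sample x as Dw with only a few nonzero coefficients in w. The snippet below is a minimal sketch of generic sparse coding via iterative soft-thresholding (ISTA) on a lasso-style objective; the random dictionary, iteration count, and regularization weight are illustrative assumptions, and the dissertation's own active-set and decomposition optimizers are not reproduced here.

        import numpy as np

        def sparse_code(x, D, lam=0.05, n_iter=300):
            # ISTA for min_w 0.5 * ||x - D w||^2 + lam * ||w||_1.
            L = np.linalg.norm(D, ord=2) ** 2          # Lipschitz constant of the gradient
            w = np.zeros(D.shape[1])
            for _ in range(n_iter):
                z = w - D.T @ (D @ w - x) / L          # gradient step on the smooth part
                w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
            return w

        rng = np.random.default_rng(1)
        D = rng.normal(size=(50, 120))                 # overcomplete dictionary (random here)
        D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
        w_true = np.zeros(120)
        w_true[[3, 40, 77]] = [1.5, -2.0, 1.0]         # three active atoms
        x = D @ w_true + 0.01 * rng.normal(size=50)
        w_hat = sparse_code(x, D)
        print(np.flatnonzero(np.abs(w_hat) > 0.1))     # recovers the few active atoms

    Dictionary learning then alternates steps of this kind with updates of D itself, which is where the frameworks proposed in the dissertation come in.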

    Machine Learning Approaches for Cancer Analysis

    In addition, we propose several machine learning models as contributions toward solving biological problems.

    First, we present Zseq, a linear-time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its number of unique k-mers as its score, and also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Based on a z-score threshold, Zseq then sweeps through the sequences again and filters out those with a z-score below the user-defined threshold. Zseq provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. The filtered reads were evaluated by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly; the assembled transcripts show a better ability to discriminate between cancer and normal samples than another state-of-the-art method.

    Studying the abundance of select mRNA species throughout prostate cancer progression may provide insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that a suitable combination of clustering method, distance function, and cluster validity index can identify outlier transcripts, i.e., transcripts whose trend differs from that of the majority; here, a transcript's trend is its abundance across the stages of prostate cancer. We compare this model with a standard hierarchical time-series clustering method based on the Euclidean distance. Using time-series-profile hierarchical clustering, we identified stage-specific mRNA species, termed outlier transcripts, that exhibit unique trending patterns compared to most other transcripts during disease progression; unlike Euclidean-distance hierarchical clustering, this method identifies those outliers rather than finding patterns among the trending transcripts. A wet-lab experiment on a biomarker (the CAM2G gene) confirmed the results of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular with prostate cancer; further investigation may establish these outlier transcripts as potential stage-specific biomarkers that predict the progression of the disease.

    Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a large proportion of cancer cases and deaths worldwide. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that predicts the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes; our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes.

    Studying breast cancer survivability among patients who received different treatments may help clarify the relationship between survivability and treatment therapy based on gene expression. In the fourth contribution, we built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment (hormone therapy, radiotherapy, or surgery) will survive beyond five years after the treatment. Our classifier is a tree-based hierarchical approach that partitions breast cancer patients according to survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%, and we further investigated the roles of many of these biomarkers through the literature.

    Studying gene expression across various time intervals of breast cancer survival may provide insight into the recovery of patients, and the discovery of gene indicators can be a crucial step in predicting survivability and managing breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method that separates dissimilar groups of genes in time-series data as outliers; these isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability.

    In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease.
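    The Zseq filtering step described in the first contribution reduces to a short computation: score each read by its number of unique k-mers, standardize the scores, and drop reads below a z-score threshold. The sketch below illustrates only that core idea; the choice of k, the threshold, and Zseq's additional penalties for ambiguous nucleotides and extreme GC content are simplified assumptions here.

        import numpy as np

        def unique_kmers(seq, k=8):
            # Complexity score: number of distinct k-mers in the read.
            return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

        def zseq_filter(reads, k=8, z_min=-0.5):
            # Keep reads whose complexity z-score reaches the threshold.
            scores = np.array([unique_kmers(r, k) for r in reads], dtype=float)
            z = (scores - scores.mean()) / scores.std()
            return [r for r, zi in zip(reads, z) if zi >= z_min]

        reads = [
            "ACGTACGTACGTACGTACGTACGT",   # low complexity (tandem repeat)
            "ACGGTCCATGACTGATCCGATAGC",   # high complexity
            "AAAAAAAAAAAAAAAAAAAAAAAA",   # degenerate read
            "TTGACCGTAGGCTAACGGATCTGA",   # high complexity
        ]
        print(zseq_filter(reads))          # retains only the two complex reads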

    Contextual modulation of visual variability: perceptual biases over time and across the visual field

    The visual system extracts statistical information about the environment to manage noise, ensure perceptual stability, and predict future events. These summary representations can inform sensory information received at subsequent times or in other regions of the visual field. This has been conceptualized in terms of Bayesian inference within the predictive coding framework. Nevertheless, contextual influence can also drive anti-Bayesian biases, as in sensory adaptation. Variance is a crucial statistical descriptor, yet it has been relatively overlooked in ensemble vision research. We assessed the mechanisms whereby visual variability exerts, and is subject to, contextual modulation over time and across the visual field.

    Perceptual biases over time: serial dependence (SD). In a series of visual experiments, we examined SD on visual variance: the influence of the variance of previously presented ensembles on current variance judgments. We encountered two history-dependent biases: a positive bias exerted by recent presentations and a negative bias driven by less recent context. Contrary to claims that positive SD has a low-level sensory origin, our experiments demonstrated a decisional bias requiring perceptual awareness and subject to time and capacity limitations. The negative bias was likely of sensory origin (adaptation). A two-layer model combining population codes and Bayesian Kalman filters replicated the positive and negative effects on their approximate timescales.

    Perceptual biases across the visual field: Uniformity Illusion (UI). In UI, presentation of a pattern with uniform foveal components and more variable peripheral elements results in the latter taking on the appearance of the foveal input. We studied the mechanistic basis of UI on orientation and determined that it arises without changes in sensory encoding in the primary visual cortex.

    Conclusions. We studied perceptual biases on visual variability across space and time and found a combination of negative sensory biases and positive decisional biases, likely serving the balance between change sensitivity and perceptual stability.
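    One ingredient of the two-layer model above, the Kalman filter, already produces an attractive serial bias on its own: each estimate is a weighted blend of the prior built from earlier stimuli and the current observation. The one-dimensional sketch below shows only that ingredient, with made-up noise parameters rather than values fitted to the experiments.

        import numpy as np

        def kalman_track(observations, q=0.05, r=0.5):
            # 1-D Kalman filter; q is process noise, r is observation noise.
            est, p = observations[0], 1.0        # initial state and uncertainty
            trace = [est]
            for y in observations[1:]:
                p = p + q                        # predict: uncertainty grows over time
                gain = p / (p + r)               # Kalman gain
                est = est + gain * (y - est)     # update: blend prior and new sample
                p = (1.0 - gain) * p
                trace.append(est)
            return np.array(trace)

        stim = np.array([1.0, 1.0, 1.0, 3.0])    # stimulus variance jumps on the last trial
        print(kalman_track(stim))                # final estimate lags below 3.0: positive serial bias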

    Generalization in qualitative psychology

    "Questions on generalization depend on the context of available data and the goals of generalizing from research findings. Sometimes, generalization is not only of minor interest, but it can be misleading. Of course, science is interested in principles, we want to know the underlying logic of individual and social processes. But how "generally" do we want to apply the "particular" findings - and is there a need for generalization?" (author's abstract). Contents: Leo Gürtler, Günter L. Huber: Should we generalize? Anyway, we do it all the time in everyday life (17-35); Thomas Burkart, Gerhard Kleining: Generalisierung durch qualitative Heuristik (37-52); Rudolf Schmitt: Attempts not to over-generalize the results of metaphor analyses (53-70); Pascal Dey, Julia Nentwich: The identity politics of qualitative research. A discourse analytic inter-text (71-105); M.Concepción Dominguez, Antonio Medina Rivilla: Integrated methodology: From self-observation to debate groups to the design of intercultural educational materials and teacher training (107-128); Tiberio Feliz Murias, M. Carmen Ricoy Lorenzo: From feedback about resources to the improvement of the curricular design of practical training as a generalization process (129-144); M. Carmen Ricoy Lorenzo, Tiberio Feliz Murias: Competencies design as a qualitative process of generalization. Designing the competencies of educators in the technological resources (145-160); Silke Birgitta Gahleitner, Julia Markner: Youth welfare services and problems of borderline personality disorder. Suggestions from the perspective of the client – a single case study (161-176); Inge Herfort, Andreas Weiss, Martin Mühlberger: Intercultural competence for transnational co-operations between small and medium-sized enterprises in Austria and Hungary (177-189); Lorenzo Almazán Moreno, Ana Ortiz Colón: A study of the training needs of adults in 21st-century society: integrated methodological research model involving discussion groups, questionnaires and case studies (191-194); Samuel Gento, M. Concepción Dominguez, Antonio Medina: Problems of discipline and learning in the educational system (195-233); Michaela Gläser-Zikuda: The relation of instructional quality to students' emotions in secondary schools - a qualitative-quantitative study (235-248)