
    Genetic programming for mining DNA chip data from cancer patients

    In machine learning terms, DNA (gene) chip data is unusual in having thousands of attributes (the gene expression values) but few (<100) records (the patients). A GP-based method for both feature selection and for generating simple models based on a few genes is demonstrated on cancer data.

    Highly comparative feature-based time-series classification

    A highly comparative, feature-based approach to time-series classification is introduced that uses an extensive database of algorithms to extract thousands of interpretable features from time series. These features are drawn from across the scientific time-series analysis literature and include summaries of time series in terms of their correlation structure, distribution, entropy, stationarity, scaling properties, and fits to a range of time-series models. After computing thousands of features for each time series in a training set, those most informative of the class structure are selected using greedy forward feature selection with a linear classifier. The resulting feature-based classifiers automatically learn the differences between classes using a reduced number of time-series properties and circumvent the need to calculate distances between time series. Representing time series in this way yields orders of magnitude of dimensionality reduction, allowing the method to perform well on very large datasets containing long time series or time series of different lengths. For many of the datasets studied, classification performance exceeded that of conventional instance-based classifiers, including one-nearest-neighbor classifiers using Euclidean distance and dynamic time warping. Most importantly, the selected features provide an understanding of the properties of the dataset, insight that can guide further scientific investigation.
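
    To make the pipeline concrete, here is a minimal Python sketch of the core idea: summarize each time series with a handful of interpretable features, then pick features by greedy forward selection with a linear classifier. The four summary features, the toy sine-versus-noise dataset, and the scikit-learn classifier are illustrative assumptions; the method described above draws on thousands of features from the literature.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def extract_features(ts):
        """Map one time series to a few interpretable summary statistics."""
        diffs = np.diff(ts)
        return np.array([
            ts.mean(),                            # distribution: location
            ts.std(),                             # distribution: spread
            np.corrcoef(ts[:-1], ts[1:])[0, 1],   # lag-1 autocorrelation
            (diffs ** 2).mean(),                  # roughness / stationarity proxy
        ])

    def greedy_forward_selection(X, y, max_features=2):
        """Add one feature at a time, keeping whichever most improves CV accuracy."""
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                         X[:, selected + [j]], y, cv=3).mean()
                      for j in remaining}
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy dataset: noisy sine waves (class 1) versus white noise (class 0).
    rng = np.random.default_rng(0)
    series = [np.sin(np.linspace(0, 8 * np.pi, 100)) + 0.5 * rng.normal(size=100)
              for _ in range(30)] + [rng.normal(size=100) for _ in range(30)]
    X = np.array([extract_features(s) for s in series])
    y = np.array([1] * 30 + [0] * 30)
    print("selected feature indices:", greedy_forward_selection(X, y))
    ```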

    Feature Selection and Molecular Classification of Cancer Using Genetic Programming

    Despite important advances in microarray-based molecular classification of tumors, its application in clinical settings remains formidable. This is in part due to the limitation of current analysis programs in discovering robust biomarkers and developing classifiers with a practical set of genes. Genetic programming (GP) is a machine learning technique that uses an evolutionary algorithm to simulate natural selection and population dynamics, leading to simple and comprehensible classifiers. Here we applied GP to cancer expression profiling data to select feature genes and build molecular classifiers by mathematical integration of these genes. Analysis of thousands of GP classifiers generated for a prostate cancer data set revealed repetitive use of a set of highly discriminative feature genes, many of which are known to be disease associated. GP classifiers often comprise five or fewer genes and successfully predict cancer types and subtypes. More importantly, GP classifiers generated in one study are able to predict samples from an independent study, which may have used different microarray platforms. In addition, GP yielded classification accuracy better than or similar to conventional classification methods. Furthermore, the mathematical expression of GP classifiers provides insights into relationships between classifier genes. Taken together, our results demonstrate that GP may be valuable for generating effective classifiers containing a practical set of genes for diagnostic/prognostic cancer classification.
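
    The flavor of such classifiers can be illustrated with a short sketch: candidate classifiers are small arithmetic expressions over gene expression values, scored by training accuracy. A simple random search over expression trees stands in for a full GP system (which would add populations, crossover, and mutation); the synthetic data, gene indices, and tree depth are illustrative assumptions, not the paper's setup.

    ```python
    import random
    import numpy as np

    random.seed(1)
    rng = np.random.default_rng(1)
    n_samples, n_genes = 80, 200
    X = rng.normal(size=(n_samples, n_genes))
    y = (X[:, 3] - 0.8 * X[:, 17] > 0).astype(int)   # two hidden informative genes

    OPS = [np.add, np.subtract, np.multiply]

    def random_expr(depth=2):
        """Build a random expression tree; leaves are gene indices."""
        if depth == 0 or random.random() < 0.3:
            return random.randrange(n_genes)
        return (random.choice(OPS), random_expr(depth - 1), random_expr(depth - 1))

    def evaluate(expr, X):
        """Evaluate an expression tree over all samples at once."""
        if isinstance(expr, int):
            return X[:, expr]
        op, left, right = expr
        return op(evaluate(left, X), evaluate(right, X))

    def accuracy(expr):
        """Classify each sample by the sign of the expression's value."""
        return ((evaluate(expr, X) > 0).astype(int) == y).mean()

    # Random search over small expressions, keeping the best classifier found.
    best = random_expr()
    best_acc = accuracy(best)
    for _ in range(2000):
        cand = random_expr()
        acc = accuracy(cand)
        if acc > best_acc:
            best, best_acc = cand, acc
    print("best training accuracy:", best_acc)
    ```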

    DI2: prior-free and multi-item discretization of biological data and its applications

    Funding Information: This work was supported by Fundação para a Ciência e a Tecnologia (FCT), through IDMEC, under LAETA project (UIDB/50022/2020), IPOscore (DSAIPA/DS/0042/2018), and ILU (DSAIPA/DS/0111/2018). This work was further supported by the Associate Laboratory for Green Chemistry (LAQV), financed by national funds from FCT/MCTES (UIDB/50006/2020 and UIDP/50006/2020), INESC-ID plurianual (UIDB/50021/2020), and the contract CEECIND/01399/2017 to RSC. The funding entities did not partake in the design of the study, the collection, analysis, and interpretation of data, or the writing of the manuscript.

    Background: A considerable number of data mining approaches for biomedical data analysis, including state-of-the-art associative models, require a form of data discretization. Although diverse discretization approaches have been proposed, they generally work under a strict set of statistical assumptions which are arguably insufficient to handle the diversity and heterogeneity of clinical and molecular variables within a given dataset. In addition, although an increasing number of symbolic approaches in bioinformatics are able to assign multiple items to values occurring near discretization boundaries for superior robustness, there are no reference principles on how to perform multi-item discretizations. Results: In this study, an unsupervised discretization method, DI2, for variables with arbitrarily skewed distributions is proposed. Statistical tests applied to assess differences in performance confirm that DI2 generally outperforms well-established discretization methods with statistical significance. Within classification tasks, DI2 displays competitive or superior predictive accuracy, particularly for classifiers able to accommodate border values. Conclusions: This work proposes a new unsupervised method for data discretization, DI2, that takes into account the underlying data regularities, the presence of outlier values disrupting expected regularities, and the relevance of border values. DI2 is available at https://github.com/JupitersMight/DI2
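
    A minimal sketch of the multi-item idea, under stated assumptions: values are binned by quantiles, and any value within a tolerance of an interior border receives both adjacent items. This illustrates only the border-value robustness the abstract describes; it is not the DI2 algorithm itself, which additionally fits candidate distributions and handles outliers. The bin count and tolerance are illustrative choices.

    ```python
    import numpy as np

    def multi_item_discretize(values, n_bins=3, border_tol=0.05):
        """Return, for each value, the sorted list of bin labels it maps to."""
        edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
        spread = values.max() - values.min()
        items = []
        for v in values:
            b = int(np.clip(np.searchsorted(edges, v, side="right") - 1,
                            0, n_bins - 1))
            bins = {b}
            # Assign a second item when the value sits near an interior border.
            for k, edge in enumerate(edges[1:-1]):
                if abs(v - edge) <= border_tol * spread:
                    bins.update({k, k + 1})
            items.append(sorted(bins))
        return items

    # Values near the first border (~0.33) receive two items; others one.
    x = np.array([0.10, 0.32, 0.33, 0.50, 0.68, 0.90])
    print(multi_item_discretize(x, n_bins=3))
    ```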

    Knowledge Generation with Rule Induction in Cancer Omics

    The explosion of omics data availability in cancer research has boosted knowledge of the molecular basis of cancer, although strategies for its definitive resolution are still not well established. The complexity of cancer biology, driven by the high heterogeneity of cancer cells, leads to the development of pharmacoresistance in many patients, hampering the efficacy of therapeutic approaches. Machine learning techniques have been applied to extract knowledge from cancer omics data in order to address fundamental issues in cancer research, such as the classification of clinically relevant subgroups of patients and the identification of biomarkers for disease risk and prognosis. Rule induction algorithms are a group of pattern discovery approaches that represent discovered relationships in the form of human-readable associative rules. The application of such techniques to the modern plethora of collected cancer omics data can effectively boost our understanding of cancer-related mechanisms. In fact, the capability of these methods to extract a large amount of human-readable knowledge will eventually help to uncover unknown relationships between molecular attributes and the malignant phenotype. In this review, we describe applications and strategies for the use of rule induction approaches in cancer omics data analysis. In particular, we explore the canonical applications and the future challenges and opportunities posed by multi-omics integration problems.
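
    As a hedged illustration of what rule induction produces, the sketch below enumerates single-attribute rules of the form "IF attribute = value THEN class" and keeps those meeting minimum support and confidence. Real rule inducers (e.g., CN2, RIPPER, association rule miners) search multi-attribute conjunctions; the gene names and records here are synthetic placeholders.

    ```python
    from collections import Counter

    # Toy discretized omics records; gene names and values are illustrative.
    records = [
        {"TP53": "low",  "MYC": "high", "class": "tumor"},
        {"TP53": "low",  "MYC": "high", "class": "tumor"},
        {"TP53": "high", "MYC": "low",  "class": "normal"},
        {"TP53": "high", "MYC": "high", "class": "tumor"},
        {"TP53": "high", "MYC": "low",  "class": "normal"},
    ]

    def induce_rules(records, min_support=2, min_confidence=0.8):
        """Keep single-attribute rules whose majority class is reliable enough."""
        rules = []
        attrs = [a for a in records[0] if a != "class"]
        for attr in attrs:
            for value in {r[attr] for r in records}:
                covered = [r for r in records if r[attr] == value]
                majority, count = Counter(r["class"] for r in covered).most_common(1)[0]
                confidence = count / len(covered)
                if count >= min_support and confidence >= min_confidence:
                    rules.append((f"IF {attr}={value} THEN class={majority}",
                                  len(covered), round(confidence, 2)))
        return rules

    for rule in induce_rules(records):
        print(rule)   # (human-readable rule, support, confidence)
    ```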

    Application based technical Approaches of data mining in Pharmaceuticals, and Research approaches in biomedical and Bioinformatics

    Past studies show that the flow of information in the pharmaceutical field was slow and rudimentary; the transformation of information was too complex for, and out of the reach of, the technology of the time, so modern technology could not penetrate the pharmaceutical field. Later, technology became a compulsory part of business and contributed to business progress and development. Nowadays, pharmaceutical industries are technology-enabled: they manage their billing and inventories smoothly and easily, develop new products and services, and can easily maintain and merge drug details, such as cost and usage, with the patient records prescribed by doctors in hospitals. Although data collection methods have improved, data manipulation techniques have yet to keep pace with them. Data mining, referred to specifically as pattern analysis on large data sets, uses techniques such as clustering, segmentation, and classification to support better manipulation of the data, and it thereby helps pharmaceutical firms and industries. This paper describes the vital role of data mining in the pharmaceutical industry and how it improves the quality of decision-making services in pharmaceutical fields. It also gives a brief overview of data mining toolkits and their various applications in biomedical research, in terms of relational approaches to data mining with an emphasis on propositionalisation and relational subgroup discovery, which prove to be effective for data analysis in biomedicine and its applications, and in bioinformatics as well. DOI: 10.17762/ijritcc2321-8169.15038

    Comparative Analysis of Various Data Stream Mining Procedures and Various Dimension Reduction Techniques

    In recent years, data mining has proven to be a great research area. Data mining is the process of extracting needed information from a given set of data, to be used further for various purposes, whether commercial or scientific; while fetching the mined information, proper methodologies with good approximations have to be used. In our survey, we study various data stream clustering techniques and various dimension reduction techniques, along with their characteristics, for improving the quality of clustering. We also present our proposed approach for clustering streamed data using suitable procedures: for stream data mining, a dimension reduction technique is applied first, and the Fuzzy C-means algorithm is then applied to improve the quality of the clustering. Keywords: Data Stream, Dimension Reduction, Clustering
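
    A minimal sketch of the proposed pipeline, assuming PCA as the dimension reduction step: reduce the data first, then cluster with a standard batch Fuzzy C-means implemented directly in NumPy. A real stream-mining system would process the data incrementally in windows; the synthetic blobs and parameter choices are illustrative.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
        """Standard batch FCM: alternate membership and centroid updates."""
        rng = np.random.default_rng(seed)
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)        # memberships; each row sums to 1
        for _ in range(n_iter):
            W = U ** m                           # fuzzified weights
            centers = (W.T @ X) / W.sum(axis=0)[:, None]
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            U = 1.0 / dist ** (2.0 / (m - 1.0))  # closer centers, higher membership
            U /= U.sum(axis=1, keepdims=True)
        return centers, U

    # Synthetic high-dimensional batch: three clusters in 30 dimensions.
    rng = np.random.default_rng(42)
    X_high = np.vstack([rng.normal(loc=mu, size=(50, 30)) for mu in (0.0, 3.0, 6.0)])
    X_low = PCA(n_components=2).fit_transform(X_high)   # dimension reduction step
    centers, memberships = fuzzy_c_means(X_low, c=3)
    print("cluster centers in the reduced space:\n", centers)
    ```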

    Data Mining and Analysis on Multiple Time Series Object Data

    A huge amount of data is available in our society, and the need to turn such data into useful information and knowledge is urgent. Data mining is an important field addressing that need, and significant progress has been achieved in the last decade. In several important application areas, data arises in the format of Multiple Time Series Object (MTSO) data, where each data object is an array of time series over a large set of features and has an associated class or state. Very little research has been conducted on this kind of data. Examples include computational toxicology, where each data object consists of a set of time series over thousands of genes, and operational stress management, where each data object consists of a set of time series over different measuring points on the human body. The purpose of this dissertation is to conduct a systematic data mining study of microarray time series data, with applications in computational toxicology. More specifically, we consider several issues: feature selection algorithms for different classification cases, gene marker or feature set selection for toxic chemical exposure detection, toxic chemical exposure time prediction, the development and application of a wildness concept, and the organization of diversified and parsimonious committees. We formalize and analyze these research problems, design algorithms to address them, and perform experimental evaluations of the proposed algorithms. All these studies are based on a microarray time series data set provided by Dr. McDougal.
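
    To make the MTSO setting concrete, here is a small sketch under synthetic assumptions: each object is a genes-by-time-points matrix with a class label, each gene's series is summarized by its fitted slope, and genes are ranked by a univariate t-test as a simple stand-in for the dissertation's feature selection algorithms.

    ```python
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(7)
    n_objects, n_genes, n_times = 40, 100, 8
    X = rng.normal(size=(n_objects, n_genes, n_times))   # objects x genes x time
    y = np.array([0] * 20 + [1] * 20)                    # 1 = exposed
    X[y == 1, 5] += np.linspace(0, 2, n_times)           # gene 5 trends upward

    # Summarize each gene's time series by its least-squares slope.
    t = np.arange(n_times)
    slopes = np.polyfit(t, X.reshape(-1, n_times).T, 1)[0].reshape(n_objects, n_genes)

    # Rank genes by how strongly their slopes separate the two classes.
    stat, pval = ttest_ind(slopes[y == 0], slopes[y == 1], axis=0)
    print("top candidate marker genes:", np.argsort(pval)[:5])
    ```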

    Random Subspace Learning on Outlier Detection and Classification with Minimum Covariance Determinant Estimator

    The questions brought by high-dimensional data are interesting and challenging. Our study targets a particular type of data in this situation, namely "large p, small n" data. Since the dimensionality is massively larger than the number of observations in the data, any measurement of the covariance and its inverse will be miserably affected. The definition of high dimension in statistics has changed throughout the decades. Modern datasets with thousands of dimensions demand the ability to gain deeper understanding but are hindered by the curse of dimensionality. We review and explore further ways to negotiate the curse, extending previous studies to pave a new way of estimating robustness and applying it to outlier detection and classification. We explore random subspace learning and expand other classification and outlier detection algorithms to adopt its framework. Our proposed methods can handle both high-dimension low-sample-size and traditional low-dimension high-sample-size datasets. Essentially, we avoid the computational bottleneck of techniques like the Minimum Covariance Determinant (MCD) estimator by computing the needed determinants and associated measures in much lower-dimensional subspaces. Both the theoretical and computational development of our approach reveal that it is computationally more efficient than regularized methods in the high-dimensional low-sample-size setting, and it often competes favorably with existing methods as far as the percentage of correctly detected outliers is concerned.
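
    A minimal sketch of the random subspace idea, assuming scikit-learn's MinCovDet as the MCD implementation: fit MCD in many low-dimensional random feature subspaces, where n > p holds and the determinant computation is cheap, then average each point's squared robust distances across subspaces to score outliers. The subspace size, subspace count, and the synthetic "large p, small n" data are illustrative choices, not the paper's settings.

    ```python
    import numpy as np
    from sklearn.covariance import MinCovDet

    def subspace_mcd_scores(X, n_subspaces=30, subspace_dim=5, seed=0):
        """Average squared robust distances over random feature subspaces."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        scores = np.zeros(n)
        for _ in range(n_subspaces):
            features = rng.choice(p, size=subspace_dim, replace=False)
            mcd = MinCovDet(random_state=0).fit(X[:, features])
            scores += mcd.mahalanobis(X[:, features])
        return scores / n_subspaces

    rng = np.random.default_rng(3)
    X = rng.normal(size=(60, 200))   # "large p, small n": p = 200 >> n = 60
    X[:5] += 4.0                     # first five rows are shifted outliers
    scores = subspace_mcd_scores(X)
    print("highest-scoring points:", np.argsort(scores)[::-1][:5])
    ```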