8 research outputs found

    Mapping microarray gene expression data into dissimilarity spaces for tumor classification

    Microarray gene expression data sets usually contain a large number of genes but a small number of samples. In this article, we present a two-stage classification model that combines feature selection with the dissimilarity-based representation paradigm. In the preprocessing stage, the ReliefF algorithm is used to generate a subset of top-ranked genes; in the learning/classification stage, the samples represented by the previously selected genes are mapped into a dissimilarity space, which is then used to construct a classifier capable of separating the classes more easily than a feature-based model. The ultimate aim of this paper is not to find the best subset of genes, but to analyze the performance of the dissimilarity-based models by means of a comprehensive collection of experiments on the classification of microarray gene expression data. To this end, we compare the classification results of an artificial neural network, a support vector machine and Fisher's linear discriminant classifier built on the feature (gene) space with those built on the dissimilarity space when varying the number of genes selected by ReliefF, using eight different microarray databases. The results show that the dissimilarity-based classifiers systematically outperform the feature-based models. In addition, classification through the proposed representation appears to be more robust (i.e. less sensitive to the number of genes) than classification with the conventional feature-based representation.
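
The mapping into a dissimilarity space described in this abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: Euclidean distance as the dissimilarity measure, all training samples used as prototypes, and a nearest-mean classifier as a simple stand-in for the neural network, SVM and Fisher discriminant that the paper evaluates. The ReliefF gene selection step is not shown, and the toy data and labels are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def to_dissimilarity_space(samples, prototypes, dist=euclidean):
    """Represent each sample by its distances to every prototype."""
    return [[dist(s, p) for p in prototypes] for s in samples]


def nearest_mean_fit(vectors, labels):
    """Stand-in classifier (assumption): one mean vector per class."""
    sums, counts = {}, {}
    for v, y in zip(vectors, labels):
        acc = sums.setdefault(y, [0.0] * len(v))
        for i, x in enumerate(v):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: [x / counts[y] for x in acc] for y, acc in sums.items()}


def nearest_mean_predict(means, vector):
    """Assign the class whose mean vector is closest."""
    return min(means, key=lambda y: euclidean(means[y], vector))


# Hypothetical toy data: 4 samples described by 2 already-selected genes.
train = [[1.0, 1.2], [0.9, 1.0], [3.0, 3.1], [3.2, 2.9]]
labels = ["normal", "normal", "tumor", "tumor"]

# Each sample becomes a vector of distances to the 4 prototypes.
train_d = to_dissimilarity_space(train, train)
model = nearest_mean_fit(train_d, labels)

# A new sample is mapped with the same prototypes before classification.
query_d = to_dissimilarity_space([[2.9, 3.0]], train)[0]
prediction = nearest_mean_predict(model, query_d)
```

Note that the dimensionality of the new representation equals the number of prototypes rather than the number of genes, which is one plausible reason such a representation could be less sensitive to how many genes are selected.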

    Gene selection and classification in autism gene expression data

    Autism spectrum disorders (ASD) are neurodevelopmental disorders that are currently diagnosed on the basis of abnormal stereotyped behaviour as well as observable deficits in communication and social functioning. Although a variety of candidate genes have been attributed to the disorder, no single gene is applicable to more than 1–2% of the general ASD population. Despite extensive efforts, definitive genes that contribute to autism susceptibility have yet to be identified. The major problems in dealing with autism gene expression data include the limited number of samples and the substantial noise caused by experimental measurement errors and natural variation. In this study, a systematic combination of three filters, namely the t-test (TT), the Wilcoxon Rank Sum (WRS) test and Feature Correlation (COR), is applied along with an efficient wrapper algorithm based on geometric binary particle swarm optimization and a support vector machine (GBPSO-SVM), with the aim of selecting and classifying the genes most strongly associated with autism. A new approach based on the criteria of median ratio, mean ratio and variance deviation is also applied to reduce the initial dataset before the filters are used. The most discriminative genes identified in the first and last selection steps shared a recurring gene (CAPS2), which was therefore designated the highest-risk ASD gene. The fused gene subsets selected by the GBPSO-SVM algorithm increased the classification accuracy to about 92.10%, which is higher than the values reported in the literature for the same autism dataset. Notably, an ensemble using random forest (RF) performed better than previous studies. However, the ensemble approach that employs an SVM as the integrator of the fused genes from the output branches of GBPSO-SVM outperformed the RF integrator. The overall improvement was ascribed to the selection strategies used to reduce the dataset and to the use of the efficient wrapper-based GBPSO-SVM algorithm.
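
As a rough illustration of the filter stage mentioned in this abstract, the sketch below ranks genes by the absolute value of a two-sample t-statistic. It covers only one of the three filters; the Wilcoxon and correlation filters and the GBPSO-SVM wrapper are not shown. The Welch (unequal-variance) form of the statistic, the gene names, and the expression values are all assumptions made for the example.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of expression values."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def rank_genes(expression, labels, top_k):
    """Return the top_k genes ordered by decreasing |t|.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of 0/1 class labels aligned with the samples
    """
    scores = {}
    for gene, values in expression.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(welch_t(cases, controls))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical data: three cases (1) followed by three controls (0).
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_a": [5.1, 5.3, 4.9, 2.0, 2.2, 1.9],  # strongly different
    "gene_b": [3.0, 3.4, 2.7, 3.1, 2.9, 3.3],  # essentially unchanged
    "gene_c": [4.0, 4.5, 3.6, 3.0, 3.8, 3.3],  # moderately different
}

top = rank_genes(expression, labels, top_k=2)
```

A filter of this kind scores each gene independently of any classifier, which is why it is cheap enough to run before a more expensive wrapper search.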

    Mitigating the effect of covariates in face recognition

    Current face recognition systems capture the faces of cooperative individuals in a controlled environment as part of the face recognition process. It is therefore possible to control the lighting, pose, background, and quality of images. However, real-world applications must deal with both ideal and imperfect data, and the performance of current face recognition systems degrades in such non-ideal and challenging cases. This research focuses on designing algorithms to mitigate the effect of covariates in face recognition. To address the challenge of facial aging, an age transformation algorithm is proposed that registers two face images and minimizes the aging variations. Unlike the conventional method, the gallery face image is transformed with respect to the probe face image, and facial features are extracted from the registered gallery and probe face images. Variations due to disguises change visual perception, alter actual data, make pertinent facial information disappear, mask features to varying degrees, or introduce extraneous artifacts into the face image. To recognize face images with variations due to age progression and disguises, a granular face verification approach is designed which uses a dynamic feed-forward neural architecture to extract 2D log polar Gabor phase features at different granularity levels. The granular levels provide non-disjoint spatial information, which is combined using the proposed likelihood ratio based Support Vector Machine match score fusion algorithm. The face verification algorithm is validated using five face databases, including the Notre Dame face database, the FG-Net face database and three disguise face databases. The information in visible spectrum images is compromised by improper illumination, whereas infrared images provide invariance to illumination and expression. A multispectral face image fusion algorithm is proposed to address the variations in illumination. The Support Vector Machine based image fusion algorithm learns the properties of the multispectral face images at different resolution and granularity levels to determine the optimal information and combines it to generate a fused image. Experiments on the Equinox and Notre Dame multispectral face databases show that the proposed algorithm outperforms existing algorithms. We next propose a face mosaicing algorithm to address the challenge of pose variations. The mosaicing algorithm generates a composite face image during enrollment using the evidence provided by frontal and semi-profile face images of an individual. Face mosaicing obviates the need to store multiple face templates representing multiple poses of a user's face image. Experiments conducted on three different databases indicate that face mosaicing offers significant benefits by accounting for the pose variations that are commonly observed in face images. Finally, the concept of online learning is introduced to address the problem of classifier re-training and update. A learning scheme for the Support Vector Machine is designed to train the classifier in online mode. This enables the classifier to update the decision hyperplane to account for newly enrolled subjects. On a heterogeneous near infrared face database, a case study using Principal Component Analysis and C2 feature algorithms shows that the proposed online classifier significantly improves verification performance in terms of both accuracy and computational time.
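
The idea of updating a decision hyperplane as new samples arrive, rather than re-training from scratch, can be sketched with a generic online linear SVM trained by stochastic sub-gradient steps on the regularized hinge loss. This is not the learning scheme proposed in the thesis, whose details the abstract does not give; the class name, learning rate, regularization constant, and toy feature vectors are all assumptions.

```python
class OnlineLinearSVM:
    """Linear SVM updated one sample at a time (hinge loss, L2 penalty)."""

    def __init__(self, n_features, learning_rate=0.1, regularization=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate
        self.lam = regularization

    def decision(self, x):
        """Signed score of x relative to the current hyperplane."""
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def partial_fit(self, x, y):
        """Update the hyperplane with a single labelled sample, y in {-1, +1}."""
        margin = y * self.decision(x)
        # The regularization term always shrinks the weights slightly.
        self.w = [(1.0 - self.lr * self.lam) * wi for wi in self.w]
        # The hinge loss contributes only when the margin is violated.
        if margin < 1.0:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1


# Hypothetical stream of (feature vector, label) pairs arriving over time.
stream = [([2.0, 2.0], 1), ([-2.0, -2.0], -1), ([3.0, 1.0], 1), ([-1.0, -3.0], -1)]

svm = OnlineLinearSVM(n_features=2)
for features, label in stream:
    svm.partial_fit(features, label)  # no re-training on earlier samples

accepted = svm.predict([2.5, 1.5])
rejected = svm.predict([-3.0, -1.0])
```

Each update costs time proportional to the number of features, which is the kind of saving over full re-training that an online scheme is meant to provide.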

    Granular Support Vector Machines Based on Granular Computing, Soft Computing and Statistical Learning

    With the emergence of biomedical informatics, Web intelligence, and E-business, new challenges are arising for knowledge discovery and data mining modeling problems. In this dissertation work, a framework named Granular Support Vector Machines (GSVM) is proposed to systematically and formally combine statistical learning theory, granular computing theory and soft computing theory in order to address challenging predictive data modeling problems effectively and/or efficiently, with a specific focus on binary classification problems. In general, GSVM works in three steps. Step 1 is granulation, which builds a sequence of information granules from the original dataset or from the original feature space. Step 2 is modeling, in which Support Vector Machines (SVM) are built in some of these information granules when necessary. Finally, step 3 is aggregation, which consolidates the information in these granules at a suitable level of abstraction. A good granulation method that finds suitable granules is crucial for modeling a good GSVM. Under this framework, many different granulation algorithms are designed for binary classification problems with different characteristics, including the GSVM-CMW (cumulative margin width) algorithm, the GSVM-AR (association rule mining) algorithm, a family of GSVM-RFE (recursive feature elimination) algorithms, the GSVM-DC (data cleaning) algorithm and the GSVM-RU (repetitive undersampling) algorithm. Empirical studies in the biomedical domain and many other application domains demonstrate that the framework is promising. As a preliminary step, this dissertation work will be extended in the future to build a Granular Computing based Predictive Data Modeling framework (GrC-PDM) with which hybrid adaptive intelligent data mining systems can be created for high quality prediction.

    Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction

    Protein tertiary structure plays a very important role in determining a protein's possible functional sites and its chemical interactions with other related proteins. Experimental methods for determining protein structure are time consuming and expensive. As a result, the gap between known protein sequences and their structures has widened substantially because of high-throughput sequencing techniques. The problems of experimental methods motivate us to develop computational algorithms for protein structure prediction. In this work, a clustering system is used to predict local protein structure. First, recurring sequence clusters are explored with an improved K-means clustering algorithm. Carefully constructed sequence clusters are used to predict local protein structure. After obtaining the sequence clusters and motifs, we study how the sequence variation within a cluster may influence its structural similarity. Analysis of the relationship between sequence variation and structural similarity shows that sequence clusters with tight sequence variation have high structural similarity, whereas sequence clusters with wide sequence variation have poor structural similarity. Based on this knowledge, the established clustering system is used to predict the tertiary structure of local sequence segments. Test results indicate that the highest-quality clusters can give highly reliable prediction results and that high-quality clusters can give reliable prediction results. In order to improve the performance of the clustering system for local protein structure prediction, a novel computational model called Clustering Support Vector Machines (CSVMs) is proposed. In our previous work, the sequence-to-structure relationship was explored with the conventional K-means algorithm. The K-means clustering algorithm may not capture the nonlinear sequence-to-structure relationship effectively. As a result, we consider using a Support Vector Machine (SVM) to capture this nonlinear relationship. However, SVM is not well suited to huge datasets containing millions of samples, which is why we propose CSVMs. Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. Compared with the clustering system introduced previously, our experimental results show that the accuracy of local structure prediction improves noticeably when CSVMs are applied.
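
The "one model per information granule" idea in this abstract can be sketched as follows: partition the training data with K-means, keep a separate local model for each cluster, and route each query to the model of its nearest centroid. The sketch rests on assumptions not found in the abstract: plain Lloyd's K-means seeded with the first k points (not the improved variant the authors mention), a nearest-neighbour rule as a stand-in for the per-granule SVM, and hypothetical toy points and labels.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means, seeded with the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        assignment = assign(points, centroids)
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)


def fit(points, labels, k):
    """Partition the data into granules; each granule keeps its own samples."""
    centroids, assignment = kmeans(points, k)
    granules = {}
    for p, y, c in zip(points, labels, assignment):
        granules.setdefault(c, []).append((p, y))
    return centroids, granules


def predict(centroids, granules, query):
    """Route the query to its nearest granule, then apply the local model.

    The local model here is a nearest-neighbour stand-in for an SVM.
    """
    c = min(granules, key=lambda g: dist(query, centroids[g]))
    _, label = min(granules[c], key=lambda item: dist(query, item[0]))
    return label


# Hypothetical toy data forming two well-separated groups.
points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
labels = ["A", "C", "A", "C", "B", "C"]

centroids, granules = fit(points, labels, k=2)
near_origin = predict(centroids, granules, [0.9, 0.1])
far_corner = predict(centroids, granules, [10.2, 10.4])
```

Because each local model only ever sees the samples of its own granule, training cost is bounded by the largest cluster rather than by the full dataset, which matches the motivation given for avoiding a single SVM on millions of samples.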

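    Directed evolution of penicillin V acylase from "Streptomyces lavendulae" and aculeacin A acylase from "Actinoplanes utahensis" [Evolución dirigida de penicilina V acilasa de "Streptomyces lavendulae" y aculeacina A acilasa de "Actinoplanes utahensis"]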

    Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Farmacia, Departamento de Microbiología II, defended on 14-07-2016. The sequencing of bacterial genomes and their subsequent deposit in public databases have increased exponentially and now constitute an indispensable tool in basic and applied research. However, elucidating the information encoded in their coding sequences, together with the particularities of each microorganism, remains a barrier for researchers to overcome, so bioinformatic studies integrated with experimental evidence are unavoidable in the laboratory. In particular, it is important to recognize the versatility of Gram-positive bacteria and its implications, which extend beyond natural environments and increasingly reach into biotechnological processes. For this reason, the genomes of the bacterial strains Streptomyces lavendulae ATCC 13664 and Actinoplanes utahensis NRRL 12052 were sequenced in this study, which made it possible to determine several features of these microorganisms. Analysis of the 16S rRNA sequence, as well as a comparison of the whole genome against a local database of genomes, suggests that the S. lavendulae strain is misassigned and should be reassigned as a new species: although it was detected phylogenetically close to other strains of S. lavendulae, it was located even closer to other S. griseus species. Likewise, the genome of A. utahensis contains a putative acyl-homoserine lactone acylase (AuAHLA), which is reported here for the first time. The bioinformatic analyses indicate that this enzyme has characteristics similar to those of aculeacin A acylase (AuAAC) from A. utahensis and penicillin V acylase (SlPVA) from S. lavendulae. Notably, the transmembrane echinocandin B (ECB) deacylase was not detected within the genome of A. utahensis. This enzyme was described previously by other authors, and its sequence differs only slightly from that of AuAAC (although the complete sequence of ECB deacylase has not been deposited, fragments of the amino-terminus of each subunit have been reported). The present study therefore proposes that this ECB deacylase should be reassigned. In both sequenced genomes, clusters related to the biosynthesis of NRPS (non-ribosomal peptide synthase) and PKS (polyketide synthase) were detected. Specifically, both AuAAC and AuAHLA were located within clusters associated with the biosynthesis of siderophores (i.e. gobichelin and laspartomycin, respectively, according to the prediction). These molecules are employed by bacteria as iron-chelating compounds, and humans exploit them for their biological activity. In contrast, although the platform employed did not predict any cluster containing SlPVA, additional analyses do not rule out that SlPVA is involved in the biosynthesis of a siderophore, as is the case for the acylases from A. utahensis... Depto. de Microbiología y Parasitología, Fac. de Farmacia