19 research outputs found

    TriGen: A genetic algorithm to mine triclusters in temporal gene expression data

    Get PDF
    Analyzing microarray data represents a computational challenge due to the characteristics of these data. Clustering techniques are widely applied to create groups of genes that exhibit a similar behavior under the conditions tested. Biclustering emerges as an improvement of classical clustering since it relaxes the constraints for grouping genes to be evaluated only under a subset of the conditions and not under all of them. However, this technique is not appropriate for the analysis of longitudinal experiments in which the genes are evaluated under certain conditions at several time points. We present the TriGen algorithm, a genetic algorithm that finds triclusters of gene expression that take into account the experimental conditions and the time points simultaneously. We have used TriGen to mine datasets related to synthetic data, yeast (Saccharomyces cerevisiae) cell cycle and human inflammation and host response to injury experiments. TriGen has proved to be capable of extracting groups of genes with similar patterns in subsets of conditions and times, and these groups have shown to be related in terms of their functional annotations extracted from the Gene Ontology.Ministerio de Ciencia y Tecnología TIN2011-28956-C00Ministerio de Ciencia y Tecnología TIN2009-13950Junta de Andalucía TIC-752

    Discovering three-dimensional patterns in real-time from data streams: An online triclustering approach

    Get PDF
    Triclustering algorithms group sets of coordinates of 3-dimensional datasets. In this paper, a new triclustering approach for data streams is introduced. It follows a streaming scheme of learning in two steps: offline and online phases. First, the offline phase provides a sum mary model with the components of the triclusters. Then, the second stage is the online phase to deal with data in streaming. This online phase consists in using the summary model obtained in the offline stage to update the triclusters as fast as possible with genetic operators. Results using three types of synthetic datasets and a real-world environmental sensor dataset are reported. The performance of the proposed triclustering streaming algo rithm is compared to a batch triclustering algorithm, showing an accurate performance both in terms of quality and running timesMinisterio de Ciencia, Innovación y Universidades TIN2017-88209-C

    MSL: A Measure to Evaluate Three-dimensional Patterns in Gene Expression Data

    Get PDF
    Microarray technology is highly used in biological research environments due to its ability to monitor the RNA concentration levels. The analysis of the data generated represents a computational challenge due to the characteristics of these data. Clustering techniques are widely applied to create groups of genes that exhibit a similar behavior. Biclustering relaxes the constraints for grouping, allowing genes to be evaluated only under a subset of the conditions. Triclustering appears for the analysis of longitudinal experiments in which the genes are evaluated under certain conditions at several time points. These triclusters provide hidden information in the form of behavior patterns from temporal experiments with microarrays relating subsets of genes, experimental conditions, and time points. We present an evaluation measure for triclusters called Multi Slope Measure, based on the similarity among the angles of the slopes formed by each profile formed by the genes, conditions, and times of the triclusterMinisterio de Ciencia y Tecnología TIN2011-28956-C02-02Junta de Andalucía TIC-752

    A Novel Method for Seismogenic Zoning Based on Triclustering: Application to the Iberian Peninsula

    Get PDF
    A previous definition of seismogenic zones is required to do a probabilistic seismic hazard analysis for areas of spread and low seismic activity. Traditional zoning methods are based on the available seismic catalog and the geological structures. It is admitted that thermal and resistant parameters of the crust provide better criteria for zoning. Nonetheless, the working out of the rheological profiles causes a great uncertainty. This has generated inconsistencies, as different zones have been proposed for the same area. A new method for seismogenic zoning by means of triclustering is proposed in this research. The main advantage is that it is solely based on seismic data. Almost no human decision is made, and therefore, the method is nearly non-biased. To assess its performance, the method has been applied to the Iberian Peninsula, which is characterized by the occurrence of small to moderate magnitude earthquakes. The catalog of the National Geographic Institute of Spain has been used. The output map is checked for validity with the geology. Moreover, a geographic information system has been used for two purposes. First, the obtained zones have been depicted within it. Second, the data have been used to calculate the seismic parameters (b-value, annual rate). Finally, the results have been compared to Kohonen’s self-organizing maps.Ministerio de Economía y Competitividad TIN2014-55894-C2-RJunta de Andalucía P12-TIC-1728Universidad Pablo de Olavide APPB81309

    G-Tric: enhancing triclustering evaluation using three-way synthetic datasets with ground truth

    Get PDF
    Tese de mestrado, Ciência de Dados, Universidade de Lisboa, Faculdade de Ciências, 2020Three-dimensional datasets, or three-way data, started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations _ features _ contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount.These comparisons are usually performed using real data, without a known ground-truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. G-Tric can replicate real-world datasets and create new ones that match researchers’ needs across several properties, including data type (numeric or symbolic), dimension, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the number of missing values, noise, and errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties was generated and made available, highlighting G-Tric’s potential to advance triclustering state-of-the-art by easing the process of evaluating the quality of new triclustering approaches. Besides reviewing the current state-of-the-art regarding triclustering approaches, comparison studies and evaluation metrics, this work also analyzes how the lack of frameworks to generate synthetic data influences existent evaluation methodologies, limiting the scope of performance insights that can be extracted from each algorithm. As well as exemplifying how the set of decisions made on these evaluations can impact the quality and validity of those results. Alternatively, a different methodology that takes advantage of synthetic data with ground truth is presented. This approach, combined with the proposal of an extension to an existing clustering extrinsic measure, enables to assess solutions’ quality under new perspectives

    Técnicas big data para el procesamiento de flujos de datos masivos en tiempo real

    Get PDF
    Programa de Doctorado en Biotecnología, Ingeniería y Tecnología QuímicaLínea de Investigación: Ingeniería, Ciencia de Datos y BioinformáticaClave Programa: DBICódigo Línea: 111Machine learning techniques have become one of the most demanded resources by companies due to the large volume of data that surrounds us in these days. The main objective of these technologies is to solve complex problems in an automated way using data. One of the current perspectives of machine learning is the analysis of continuous flows of data or data streaming. This approach is increasingly requested by enterprises as a result of the large number of information sources producing time-indexed data at high frequency, such as sensors, Internet of Things devices, social networks, etc. However, nowadays, research is more focused on the study of historical data than on data received in streaming. One of the main reasons for this is the enormous challenge that this type of data presents for the modeling of machine learning algorithms. This Doctoral Thesis is presented in the form of a compendium of publications with a total of 10 scientific contributions in International Conferences and journals with high impact index in the Journal Citation Reports (JCR). The research developed during the PhD Program focuses on the study and analysis of real-time or streaming data through the development of new machine learning algorithms. Machine learning algorithms for real-time data consist of a different type of modeling than the traditional one, where the model is updated online to provide accurate responses in the shortest possible time. The main objective of this Doctoral Thesis is the contribution of research value to the scientific community through three new machine learning algorithms. These algorithms are big data techniques and two of them work with online or streaming data. In this way, contributions are made to the development of one of the current trends in Artificial Intelligence. With this purpose, algorithms are developed for descriptive and predictive tasks, i.e., unsupervised and supervised learning, respectively. Their common idea is the discovery of patterns in the data. The first technique developed during the dissertation is a triclustering algorithm to produce three-dimensional data clusters in offline or batch mode. This big data algorithm is called bigTriGen. In a general way, an evolutionary metaheuristic is used to search for groups of data with similar patterns. The model uses genetic operators such as selection, crossover, mutation or evaluation operators at each iteration. The goal of the bigTriGen is to optimize the evaluation function to achieve triclusters of the highest possible quality. It is used as the basis for the second technique implemented during the Doctoral Thesis. The second algorithm focuses on the creation of groups over three-dimensional data received in real-time or in streaming. It is called STriGen. Streaming modeling is carried out starting from an offline or batch model using historical data. As soon as this model is created, it starts receiving data in real-time. The model is updated in an online or streaming manner to adapt to new streaming patterns. In this way, the STriGen is able to detect concept drifts and incorporate them into the model as quickly as possible, thus producing triclusters in real-time and of good quality. The last algorithm developed in this dissertation follows a supervised learning approach for time series forecasting in real-time. It is called StreamWNN. A model is created with historical data based on the k-nearest neighbor or KNN algorithm. Once the model is created, data starts to be received in real-time. The algorithm provides real-time predictions of future data, keeping the model always updated in an incremental way and incorporating streaming patterns identified as novelties. The StreamWNN also identifies anomalous data in real-time allowing this feature to be used as a security measure during its application. The developed algorithms have been evaluated with real data from devices and sensors. These new techniques have demonstrated to be very useful, providing meaningful triclusters and accurate predictions in real time.Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e informátic

    Clustering Algorithms: Their Application to Gene Expression Data

    Get PDF
    Gene expression data hide vital information required to understand the biological process that takes place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data proffers a prodigious preference to strengthen the understanding of functional genomics. The complexity of biological networks and the volume of genes present increase the challenges of comprehending and interpretation of the resulting mass of data, which consists of millions of measurements; these data also inhibit vagueness, imprecision, and noise. Therefore, the use of clustering techniques is a first step toward addressing these challenges, which is essential in the data mining process to reveal natural structures and iden-tify interesting patterns in the underlying data. The clustering of gene expression data has been proven to be useful in making known the natural structure inherent in gene expression data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. The other benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to the gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will guarantee stability and high degree of accuracy in its analysis procedure
    corecore