An approach to validity indices for clustering techniques in Big Data
Clustering analysis is one of the most used
Machine Learning techniques to discover groups among data
objects. Some clustering methods require the number of clusters into which the data is going to be partitioned. There exist
several cluster validity indices that help us to approximate
the optimal number of clusters of the dataset. However, such
indices are not suitable for Big Data because of their size
limitations and runtime costs. This paper presents two clustering validity indices that handle large amounts of data in low
computational time. Our indices are based on redefinitions
of traditional indices by simplifying the intra-cluster distance
calculation. Two types of tests have been carried out over 28
synthetic datasets to analyze the performance of the proposed
indices. First, we test the indices with small and medium size
datasets to verify that our indices have a similar effectiveness
to the traditional ones. Subsequently, tests on datasets of up
to 11 million records and 20 features have been executed to
check their efficiency. The results show that, using the Apache
Spark framework, both indices can handle Big Data in very low
computational time with an effectiveness similar to that of the
traditional indices.
Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
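The abstract above says the proposed indices simplify the intra-cluster distance calculation of traditional indices. It does not give the formulas, so the following is only a minimal sketch of that general idea and not the authors' indices. It assumes a silhouette-style score in which each point is compared with cluster centroids instead of with every other point, so that n*k distances are computed rather than n^2. The function name centroid_silhouette and the toy data are hypothetical.

```python
import numpy as np


def centroid_silhouette(X, labels):
    """Silhouette-like score using point-to-centroid distances only.

    Illustrative assumption, not the index defined in the paper.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    if ids.size < 2:
        raise ValueError("need at least two clusters")

    # One centroid per cluster, in the sorted order of the cluster ids.
    centroids = np.stack([X[labels == c].mean(axis=0) for c in ids])

    # Distance from every point to every centroid: shape (n_points, n_clusters).
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

    rows = np.arange(X.shape[0])
    own = np.searchsorted(ids, labels)  # column of each point's own cluster
    a = dist[rows, own]                 # cohesion: distance to own centroid

    other = dist.copy()
    other[rows, own] = np.inf
    b = other.min(axis=1)               # separation: nearest foreign centroid

    denom = np.maximum(a, b)
    safe = np.where(denom > 0, denom, 1.0)
    s = np.where(denom > 0, (b - a) / safe, 0.0)
    return float(s.mean())


# Toy data: two well-separated pairs of points.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_score = centroid_silhouette(X_demo, [0, 0, 1, 1])  # matches the geometry
bad_score = centroid_silhouette(X_demo, [0, 1, 0, 1])   # cuts across it
```

To approximate the number of clusters, a score of this kind would be evaluated for several candidate partitions and the best-scoring one kept. On the toy data the partition that follows the two groups scores far higher than the one that cuts across them.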
How crickets become freeze tolerant: the transcriptomic underpinnings of acclimation in Gryllus veletis
Some ectotherms can survive internal ice formation. In temperate regions, freeze tolerance is often induced by decreasing temperature and/or photoperiod during autumn. However, we have limited understanding of how seasonal changes in physiology contribute to freeze tolerance, and how these changes are regulated. During a six-week autumn-like acclimation, late-instar juveniles of the spring field cricket Gryllus veletis (Orthoptera: Gryllidae) become freeze tolerant, which is correlated with accumulation of low molecular weight cryoprotectants, elevation of the temperature at which freezing begins, and metabolic rate suppression. We used RNA-Seq to assemble a de novo transcriptome of this emerging laboratory model for freeze tolerance research. We then focused on gene expression during acclimation in fat body tissue due to its role in cryoprotectant production and regulation of energetics. Acclimated G. veletis differentially expressed more than 3,000 transcripts in fat body. This differential expression may contribute to metabolic suppression in acclimated G. veletis, but we did not detect changes in expression that would support cryoprotectant accumulation or enhanced control of ice formation, suggesting that these latter processes are regulated post-transcriptionally. Acclimated G. veletis differentially regulated transcripts that likely coordinate additional freeze tolerance mechanisms, including upregulation of enzymes that may promote membrane and cytoskeletal remodelling, cryoprotectant transporters, cytoprotective proteins, and antioxidants. Thus, while accumulation of cryoprotectants and controlling ice formation are commonly associated with insect freeze tolerance, our results support the hypothesis that many other systems contribute to surviving internal ice formation. Together, this information suggests new avenues for understanding the mechanisms underlying insect freeze tolerance.
New internal and external validation indices for clustering in Big Data
This thesis, presented as a compendium of research articles, analyses
the concept of clustering validation indices and provides new measures of
goodness for datasets that could be considered Big Data. In addition, these
measures have been applied in real projects and their future application is
proposed for the improvement of clustering algorithms.
Clustering is one of the most popular unsupervised machine learning
techniques. This technique allows us to group data into clusters so that the
instances that belong to the same cluster have characteristics or attributes
with similar values, and are dissimilar to those that belong to the other
clusters. The similarity of the data is normally given by the proximity in
space, which is measured using a distance function. In the literature, there
are so-called clustering validation indices, which can be defined as measures
for the quantification of the quality of a clustering result. These indices are
divided into two types: internal validation indices, which measure the quality
of clustering based on the attributes with which the clusters have been built;
and external validation indices, which are those that quantify the quality of
clustering from attributes that have not intervened in the construction of
the clusters, and that are normally of nominal type or labels.
In this doctoral thesis, two internal validation indices are proposed for
clustering based on other indices existing in the literature, which enable
large amounts of data to be handled, and provide the results in a reasonable
time. The proposed indices have been tested with synthetic datasets and
compared with other indices in the literature. The conclusions of this work
indicate that these indices offer very promising results in comparison with
their competitors.
On the other hand, a new external clustering validation index based on
the chi-squared statistical test has been designed. This index enables the
quality of the clustering to be measured by basing the result on how the
clusters have been distributed with respect to a given label in the distribution.
The results of this index show a significant improvement compared to
other external indices in the literature when used with datasets of different
dimensions and characteristics.
In addition, these proposed indices have been applied in three projects with real data whose corresponding publications are included in this doctoral
thesis. For the first project, a methodology has been developed to analyse
the electrical consumption of buildings in a smart city. For this study, an
optimal clustering analysis has been carried out by applying the aforementioned
internal indices. In the second project, both internal and external
indices have been applied in order to perform a comparative analysis of the
Spanish labour market in two different economic periods. This analysis was
carried out using data from the Ministry of Labour, Migration, and Social
Security, and the results could be taken into account to help decision-making
for the improvement of employment policies. In the third project, data from
the customers of an electric company has been employed to characterise the
different types of existing consumers. In this study, consumption patterns
have been analysed so that electricity companies can offer new rates to consumers.
Conclusions show that consumers could adapt their usage to these
rates and hence the generation of energy could be optimised by eliminating
the consumption peaks that currently exist.
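The thesis abstract above describes an external validation index based on the chi-squared test, which measures how clusters are distributed with respect to a given label. The abstract does not define the index, so the sketch below substitutes a well-known quantity built from the same statistic: Cramér's V computed on the cluster-by-label contingency table. It is an illustration under that assumption and not the index proposed in the thesis. All function names are hypothetical.

```python
import numpy as np


def contingency(clusters, classes):
    """Count how many records fall in each (cluster, class) pair."""
    _, ci = np.unique(np.asarray(clusters).ravel(), return_inverse=True)
    _, li = np.unique(np.asarray(classes).ravel(), return_inverse=True)
    table = np.zeros((ci.max() + 1, li.max() + 1))
    np.add.at(table, (ci, li), 1)
    return table


def chi2_statistic(table):
    """Pearson chi-squared statistic against the independence hypothesis."""
    n = table.sum()
    # Every row and column total is at least 1, so no expected count is zero.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return float(((table - expected) ** 2 / expected).sum())


def cramers_v(clusters, classes):
    """Chi-squared rescaled to [0, 1]; 0 means clusters ignore the label."""
    table = contingency(clusters, classes)
    k = min(table.shape) - 1
    if k == 0:
        return 0.0
    return float(np.sqrt(chi2_statistic(table) / (table.sum() * k)))


# Toy check: one clustering mirrors the labels, the other is independent of them.
labels_demo = ["a", "a", "b", "b"]
v_aligned = cramers_v([0, 0, 1, 1], labels_demo)
v_independent = cramers_v([0, 1, 0, 1], labels_demo)
```

In this toy case the clustering that reproduces the labels yields the maximum value and the clustering that splits each label evenly yields zero. This is the behaviour one would expect from any index that compares clusters with a label through the chi-squared statistic.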
Mechanisms Underlying Freeze Tolerance in the Spring Field Cricket, Gryllus veletis
Freeze tolerance has evolved repeatedly across insects, facilitating survival in low temperature environments. Internal ice formation poses several challenges, but the mechanisms that mitigate these challenges in freeze-tolerant insects are not well understood. To better understand how insects survive freezing, I describe a novel laboratory model, the spring field cricket Gryllus veletis (Orthoptera: Gryllidae). Following acclimation to six weeks of decreasing temperature and photoperiod (mimicking autumn), G. veletis juveniles become moderately freeze-tolerant, surviving freezing at -8 °C for up to one week, and surviving temperatures as low as -12 °C. Acclimation is associated with increased control of the temperature and location of ice formation, accumulation of cryoprotectant molecules (myo-inositol, proline, and trehalose) in hemolymph and fat body tissue, metabolic rate suppression, and differential expression of more than 3,000 genes in fat body tissue. To test cryoprotectant function, I increase their concentration in G. veletis hemolymph (via injection) and freeze isolated fat body tissue with exogenous cryoprotectants. I show that cryoprotectants improve survival of freeze-tolerant G. veletis (proline), their fat body cells (myo-inositol), or both (trehalose) under otherwise lethal conditions, suggesting limited functional overlap of these cryoprotectants. However, no cryoprotectant (alone or in combination) can confer freeze tolerance on freeze-intolerant G. veletis or their cells. During acclimation, G. veletis upregulates genes encoding cryoprotectant transmembrane transporters, antioxidants, and molecular chaperones, which may protect cells during freezing and thawing. In addition, acclimated G. veletis upregulates genes encoding lipid metabolism enzymes, and cytoskeletal proteins and their regulators, which I hypothesize promote membrane and cytoskeletal remodelling.
To investigate the function of these genes in freeze tolerance, I develop a method to knock down gene expression in G. veletis using RNA interference. I knock down expression of three genes (encoding a cryoprotectant transporter, an antioxidant, and a cytoskeletal regulator), laying the groundwork for others to test whether and how these genes contribute to mechanisms underlying freeze tolerance. By using a combination of descriptive and manipulative experiments in an appropriate laboratory model, I improve our understanding of the factors that contribute to insect freeze tolerance.