External clustering validity index based on chi-squared statistical test
Clustering is one of the most commonly used techniques in data mining. Its main goal is
to group objects into clusters so that each group contains objects that are more similar to
each other than to objects in other clusters. The evaluation of a clustering solution is a task
carried out through the application of validity indices. These indices measure the quality
of the solution and can be classified as either internal indices, which calculate the quality of the
solution from the data of the clusters, or external indices, which measure the quality
by means of external information such as the class. Generally, indices from the literature
determine their optimal result through graphical representation, whose results could be
imprecisely interpreted. The aim of this paper is to present a new external validity index
based on the chi-squared statistical test named Chi Index, which presents accurate results
that require no further interpretation. Chi Index was analyzed using the clustering results
of 3 clustering methods in 47 public datasets. Results indicate a better hit rate and a lower
percentage of error against 15 external validity indices from the literature.
Funding: Ministerio de Economía y Competitividad TIN2014-55894-C2-R; Ministerio de Economía y Competitividad TIN2017-88209-C2-2-
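As a rough illustration of the statistic behind a chi-squared-based external index, the sketch below cross-tabulates cluster assignments against class labels and computes the Pearson chi-squared statistic of that contingency table. It is not the Chi Index formula from the paper, which the abstract does not give; the function name `contingency_chi2` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def contingency_chi2(cluster_labels, class_labels):
    """Pearson chi-squared statistic of the cluster-by-class contingency table.

    Hypothetical helper for illustration only; not the paper's Chi Index.
    Returns (chi2, degrees_of_freedom).
    """
    if len(cluster_labels) != len(class_labels):
        raise ValueError("label sequences must have the same length")
    if not cluster_labels:
        raise ValueError("label sequences must not be empty")

    n = len(cluster_labels)
    observed = Counter(zip(cluster_labels, class_labels))  # cell counts
    cluster_totals = Counter(cluster_labels)               # row totals
    class_totals = Counter(class_labels)                   # column totals

    chi2 = 0.0
    for k, n_k in cluster_totals.items():
        for c, n_c in class_totals.items():
            expected = n_k * n_c / n  # count expected under independence
            chi2 += (observed[(k, c)] - expected) ** 2 / expected

    dof = (len(cluster_totals) - 1) * (len(class_totals) - 1)
    return chi2, dof


# Clusters that match the classes exactly give a large statistic;
# clusters that are independent of the classes give zero.
print(contingency_chi2([0, 0, 1, 1], ["a", "a", "b", "b"]))
print(contingency_chi2([0, 0, 1, 1], ["a", "b", "a", "b"]))
```

A larger statistic means the clusters are more strongly associated with the class labels. Turning it into a p-value would require the chi-squared distribution with `dof` degrees of freedom, which this sketch leaves out.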
New internal and external validation indices for clustering in Big Data
This thesis, presented as a compendium of research articles, analyses
the concept of clustering validation indices and provides new measures of
goodness for datasets that could be considered Big Data. In addition, these
measures have been applied in real projects and their future application is
proposed for the improvement of clustering algorithms.
Clustering is one of the most popular unsupervised machine learning
techniques. This technique allows us to group data into clusters so that the
instances that belong to the same cluster have characteristics or attributes
with similar values, and are dissimilar to those that belong to the other
clusters. The similarity of the data is normally given by the proximity in
space, which is measured using a distance function. In the literature, there
are so-called clustering validation indices, which can be defined as measures
for the quantification of the quality of a clustering result. These indices are
divided into two types: internal validation indices, which measure the quality
of clustering based on the attributes with which the clusters have been built;
and external validation indices, which are those that quantify the quality of
clustering from attributes that have not intervened in the construction of
the clusters, and that are normally of nominal type or labels.
In this doctoral thesis, two internal validation indices are proposed for
clustering based on other indices existing in the literature, which enable
large amounts of data to be handled, and provide the results in a reasonable
time. The proposed indices have been tested with synthetic datasets and
compared with other indices in the literature. The conclusions of this work
indicate that these indices offer very promising results in comparison with
their competitors.
On the other hand, a new external clustering validation index based on
the chi-squared statistical test has been designed. This index enables the
quality of the clustering to be measured by basing the result on how the
clusters have been distributed with respect to a given label in the distribution.
The results of this index show a significant improvement compared to
other external indices in the literature when used with datasets of different
dimensions and characteristics.
In addition, these proposed indices have been applied in three projects with real data whose corresponding publications are included in this doctoral
thesis. For the first project, a methodology has been developed to analyse
the electrical consumption of buildings in a smart city. For this study, an
optimal clustering analysis has been carried out by applying the aforementioned
internal indices. In the second project, both internal and external
indices have been applied in order to perform a comparative analysis of the
Spanish labour market in two different economic periods. This analysis was
carried out using data from the Ministry of Labour, Migration, and Social
Security, and the results could be taken into account to help decision-making
for the improvement of employment policies. In the third project, data from
the customers of an electric company has been employed to characterise the
different types of existing consumers. In this study, consumption patterns
have been analysed so that electricity companies can offer new rates to consumers.
Conclusions show that consumers could adapt their usage to these
rates and hence the generation of energy could be optimised by eliminating
the consumption peaks that currently exist.
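The distinction between internal and external validation indices drawn in this abstract can be sketched with two textbook measures. They are not the indices proposed in the thesis: within-cluster sum of squared errors stands in for an internal index (it uses only the attributes the clusters were built from), and purity stands in for an external one (it uses a label that played no part in the clustering). Function names and toy data are assumptions for illustration.

```python
from collections import Counter, defaultdict


def within_cluster_sse(points, labels):
    """Internal measure: sum of squared distances to each cluster centroid."""
    groups = defaultdict(list)
    for point, k in zip(points, labels):
        groups[k].append(point)

    sse = 0.0
    for members in groups.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        sse += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return sse


def purity(labels, classes):
    """External measure: share of points in the majority class of their cluster."""
    groups = defaultdict(Counter)
    for k, c in zip(labels, classes):
        groups[k][c] += 1
    return sum(max(counts.values()) for counts in groups.values()) / len(labels)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]              # cluster assignment
classes = ["a", "a", "b", "a"]     # external label, unused during clustering

print(within_cluster_sse(points, labels))  # lower is better
print(purity(labels, classes))             # higher is better
```

The internal measure never looks at `classes`, and the external one never looks at `points`, which is the split the abstract describes.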
Chapter 19 Unsupervised Methods
The Handbook of Computational Social Science is a comprehensive reference source for scholars across multiple disciplines. It outlines key debates in the field, showcasing novel statistical modeling and machine learning methods, and draws from specific case studies to demonstrate the opportunities and challenges in CSS approaches. The Handbook is divided into two volumes written by outstanding, internationally renowned scholars in the field. This second volume focuses on foundations and advances in data science, statistical modeling, and machine learning. It covers a range of key issues, including the management of big data in terms of record linkage, streaming, and missing data. Further foci include machine learning, agent-based and statistical modeling, and data quality in relation to digital trace and textual data, as well as probability, non-probability, and crowdsourced samples. The volume not only makes major contributions to the consolidation of this growing research field, but also encourages growth into new directions. With its broad coverage of perspectives (theoretical, methodological, computational), international scope, and interdisciplinary approach, this important resource is integral reading for advanced undergraduates, postgraduates, and researchers engaging with computational methods across the social sciences, as well as those within the scientific and engineering sectors.
What determines adult cognitive skills?: Impacts of preschooling, schooling, and post-schooling experiences in Guatemala
Most investigations into the importance and determinants of adult cognitive skills assume that (1) they are produced primarily by schooling, and (2) schooling is statistically predetermined or exogenous. This study uses longitudinal data collected in Guatemala over 35 years to investigate production functions for adult cognitive skills—that is, reading-comprehension skills and nonverbal cognitive skills—as being dependent on behaviorally determined preschooling, schooling, and post-schooling experiences. We use an indicator of whether the child was stunted (child height-for-age Z-score
Keywords: Human capital, cognitive skills, Stunting, work experience, Development, Education, Gender, Health and nutrition,
Individual Assets, Market Structure And The Drivers Of Return
Much prior research on the structure and performance of UK real estate portfolios has relied on aggregated measures for sector and region. For these groupings to have validity, the performance of individual properties within each group should be similar. This paper analyses a sample of 1,200 properties using multiple discriminant analysis and cluster analysis techniques. It is shown that conventional property type and spatial classifications do not capture the variation in return behaviour at the individual building level. The major feature is heterogeneity - but there may be distinctions between growth and income properties and between single and multi-let properties that could help refine portfolio structures.
Keywords: Portfolio Structure, Return Generation Process, Real Estate
Does financing behavior of Tunisian firms follow the predictions of the market timing theory of capital structure?
In this paper, we show how capital structure decisions made by non-financial firms listed on the Tunis Stock Exchange are affected by the predictions of the so-called market timing theory. Using a set of relevant variables that reflect the market-timing signals, the firm fundamentals, and the performance of the local stock market, we mainly find that the leverage ratio of Tunisian firms is driven in the short term by their current market valuations. In the long run, the market timing effects are not present at all. Rather, Tunisian firms seem to behave according to the tradeoff theory of capital structure by attempting to adjust their leverage levels towards a target ratio.
Keywords: Market timing theory
Security Bug Report Classification using Feature Selection, Clustering, and Deep Learning
As the numbers of software vulnerabilities and cybersecurity threats increase, it is becoming more difficult and time-consuming to classify bug reports manually. This thesis is focused on exploring techniques that have the potential to improve the performance of automated classification of software bug reports as security or non-security related. In the supervised setting, feature selection was used to engineer new feature vectors for machine learning. Feature selection changes the vocabulary used by selecting the words with the greatest impact on classification. Feature selection was able to increase the F-Score across the datasets by increasing the precision. We also explored unsupervised classification based on clustering. A distribution of software issues was created using variational autoencoders, where the majority of security related issues were closely related. However, a portion of non-security issues also ended up in the distribution. Furthermore, we explored recent advances in text mining classification based on deep learning. Specifically, we used recurrent networks for supervised and semi-supervised classification. LSTM networks outperformed the Naive Bayes classifier in projects with a high ratio of security related issues. Sequence autoencoders were trained on unlabeled data and tuned with labeled data. The results showed that using unlabeled software issues different from the testing datasets degraded the results. Sequence autoencoders may be used on large datasets, where labeled data is scarce.
Asteroid lightcurves from the Palomar Transient Factory survey: Rotation periods and phase functions from sparse photometry
We fit 54,296 sparsely-sampled asteroid lightcurves in the Palomar Transient
Factory to a combined rotation plus phase-function model. Each lightcurve
consists of 20+ observations acquired in a single opposition. Using 805
asteroids in our sample that have reference periods in the literature, we find
the reliability of our fitted periods is a complicated function of the period,
amplitude, apparent magnitude and other attributes. Using the 805-asteroid
ground-truth sample, we train an automated classifier to estimate (along with
manual inspection) the validity of the remaining 53,000 fitted periods. By this
method we find 9,033 of our lightcurves (of 8,300 unique asteroids) have
reliable periods. Subsequent consideration of asteroids with multiple
lightcurve fits indicates 4% contamination in these reliable periods. For 3,902
lightcurves with sufficient phase-angle coverage and either a reliably-fit
period or low amplitude, we examine the distribution of several phase-function
parameters, none of which are bimodal though all correlate with the bond albedo
and with visible-band colors. Comparing the theoretical maximal spin rate of a
fluid body with our amplitude versus spin-rate distribution suggests that, if
held together only by self-gravity, most asteroids are in general less dense
than 2 g/cm³, while C types have a lower limit of between 1 and 2 g/cm³,
in agreement with previous density estimates. For 5-20km diameters, S types
rotate faster and have lower amplitudes than C types. If both populations share
the same angular momentum, this may indicate the two types' differing ability
to deform under rotational stress. Lastly, we compare our absolute magnitudes
and apparent-magnitude residuals to those of the Minor Planet Center's nominal,
rotation-neglecting model; our phase-function plus Fourier-series
fitting reduces asteroid photometric RMS scatter by a factor of 3.
Comment: 35 pages, 29 figures. Accepted 15-Apr-2015 to The Astronomical
Journal (AJ). Supplementary material including ASCII data tables will be
available through the publishing journal's website
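The link this abstract draws between maximal spin rate and density can be illustrated with the simplest textbook case, a strengthless sphere held together only by self-gravity. Setting the equatorial centrifugal acceleration equal to surface gravity gives a critical period P = sqrt(3π / (Gρ)). The paper's own analysis uses the amplitude versus spin-rate distribution, which this sketch does not reproduce; the spherical assumption and the function name are choices made for this example.

```python
import math

G = 6.674e-11  # gravitational constant in m^3 kg^-1 s^-2


def critical_spin_period_hours(density_g_cm3):
    """Shortest rotation period of a self-gravitating, strengthless sphere.

    Simplified spherical case (an assumption of this sketch): a body
    spinning faster than this would shed material at its equator.
    """
    rho = density_g_cm3 * 1000.0                      # g/cm^3 -> kg/m^3
    period_s = math.sqrt(3.0 * math.pi / (G * rho))   # from omega^2 = 4*pi*G*rho/3
    return period_s / 3600.0


for density in (1.0, 2.0, 3.0):
    print(density, round(critical_spin_period_hours(density), 2))
```

Under these assumptions a density of 2 g/cm³ corresponds to a limiting period of a little over two hours, so observed periods near that limit constrain the density from below.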