40,203 research outputs found

    Exploratory Cluster Analysis from Ubiquitous Data Streams using Self-Organizing Maps

    This thesis addresses the use of Self-Organizing Maps (SOM) for exploratory cluster analysis over ubiquitous data streams, where two complementary problems arise: first, to generate (local) SOM models over potentially unbounded, multi-dimensional, non-stationary data streams; second, to extend these capabilities to ubiquitous environments. To address these problems, original contributions are made in terms of algorithms and methodologies. Two different methods are proposed for the first problem. By focusing on visual knowledge discovery, these methods fill an existing gap in the panorama of current methods for cluster analysis over data streams. Moreover, the original SOM capabilities of clustering both observations and features are transposed to data streams, making these contributions more versatile than existing methods, which target a single clustering problem. For the second problem, additional methodologies that tackle the ubiquitous aspect of data streams are proposed, allowing distributed and collaborative learning strategies. Experimental evaluations attest to the effectiveness of the proposed methods, and real-world applications are exemplified, namely regarding electricity consumption data, air quality monitoring networks, and financial data, motivating their practical use. This research is the first to clearly address the use of the SOM for ubiquitous data streams, and it opens several research opportunities for the future.
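The abstract builds on the classic online SOM, which processes one observation at a time and is therefore a natural starting point for stream settings. The sketch below is a generic online SOM trainer in NumPy, not the thesis' actual stream algorithms (which the abstract does not detail); the grid size, decay schedules, and parameter names are illustrative assumptions.

```python
import numpy as np

def train_som(stream, grid_w, grid_h, n_epochs=1, lr0=0.5, sigma0=1.5, seed=0):
    """Online SOM training: for each observation, find the best-matching
    unit (BMU) and pull it and its grid neighbours toward the observation,
    with learning rate and neighbourhood radius decaying over time."""
    rng = np.random.default_rng(seed)
    dim = stream.shape[1]
    weights = rng.random((grid_w * grid_h, dim))
    # grid coordinates of each unit, used for neighbourhood distances
    coords = np.array([(i, j) for i in range(grid_w)
                       for j in range(grid_h)], dtype=float)
    t, total = 0, n_epochs * len(stream)
    for _ in range(n_epochs):
        for x in stream:
            frac = t / total
            lr = lr0 * (1 - frac)                       # linear decay
            sigma = sigma0 * (1 - frac) + 1e-3          # shrink neighbourhood
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            # Gaussian neighbourhood function on the map grid
            h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1)
                       / (2 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            t += 1
    return weights
```

After training on a stream drawn from two tight clusters, the map's units concentrate near the dense regions, so the quantization error (mean distance from each observation to its BMU) becomes small.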

    Strategies and algorithms for clustering large datasets: a review

    The exploratory nature of data analysis and data mining makes clustering one of the most common tasks in these kinds of projects. Increasingly, these projects come from many different application areas, such as biology, text analysis, and signal analysis, and involve larger and larger datasets in both the number of examples and the number of attributes. Classical clustering methods such as K-means or hierarchical clustering are beginning to reach their maximum capacity to cope with this increase in dataset size. The limitation of these algorithms comes either from the need to store all the data in memory or from their computational time complexity. These problems have opened an area of research into algorithms able to reduce this data overload. Some solutions come from data preprocessing, either by transforming the data to a lower-dimensional manifold that represents the structure of the data or by summarizing the dataset with a smaller subset of examples that carries equivalent information. A different perspective is to modify the classical clustering algorithms, or to derive new ones, so that they can cluster larger datasets. This perspective relies on many different strategies: techniques such as sampling, on-line processing, summarization, data distribution, and efficient data structures have been applied to the problem of scaling clustering algorithms. This paper presents a review of the different strategies and of clustering algorithms that apply these techniques. The aim is to cover the range of methodologies applied to clustering data and how they can be scaled.
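Two of the strategies this review names, sampling and on-line processing, are combined in mini-batch K-means (in the style of Sculley's algorithm), where each center is a running mean of the batch points assigned to it. A minimal NumPy sketch, with the batch size, iteration count, and 1/count learning rate as illustrative choices:

```python
import numpy as np

def mini_batch_kmeans(data, k, batch_size=100, n_iter=100, init=None, seed=0):
    """Mini-batch K-means: sample a small batch, assign each batch point to
    its nearest center, and move that center toward the point with a
    per-center step size 1/count (so each center is a running mean)."""
    rng = np.random.default_rng(seed)
    centers = (data[rng.choice(len(data), k, replace=False)].copy()
               if init is None else np.asarray(init, dtype=float).copy())
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = data[rng.choice(len(data), batch_size, replace=False)]
        # distance from every batch point to every center
        dists = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for x, c in zip(batch, nearest):
            counts[c] += 1
            centers[c] += (x - centers[c]) / counts[c]  # running-mean update
    return centers
```

Because each iteration touches only `batch_size` points, memory and per-step cost stay constant regardless of how large the full dataset grows, which is exactly the scaling trade-off the review surveys.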

    Data clustering procedures: a general review

    In the age of data science, the clustering of various types of objects (e.g., documents, genes, customers) has become a key activity, and many high-quality implementations are provided for this purpose by general software packages. Clustering consists of grouping a set of objects in such a way that objects which are similar to one another according to some metric belong to the same group, called a cluster. It is one of the most valuable and widely used tasks of exploratory data mining and can be applied to a wide variety of fields. Research on the clustering problem tends to be fragmented across the pattern recognition, database, data mining, and machine learning communities. This work discusses the common techniques used in cluster analysis. These methodologies are then applied to data analysis in the framework of polymer processing.

    A. Manuela Gonçalves was partially financed by Portuguese funds through FCT (Fundação para a CiĂȘncia e a Tecnologia) within projects UIDB/00013/2020 and UIDP/00013/2020 of CMAT-UM. This project has received funding from the European Union's Horizon 2020 research and innovation programme under Marie SkƂodowska-Curie grant agreement No. 734205 (H2020-MSCA-RISE-2016).
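The definition above ("objects similar according to some metric belong to the same cluster") can be made concrete with a toy single-linkage rule: points within a distance threshold of each other, directly or through a chain of such points, end up in the same cluster. A minimal sketch; the threshold and the union-find bookkeeping are illustrative, not from the paper:

```python
import numpy as np

def single_linkage(points, threshold):
    """Group points so that any two within `threshold` of each other
    (directly or via a chain) share a cluster label."""
    n = len(points)
    parent = list(range(n))            # each point starts in its own cluster

    def find(i):                       # union-find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= threshold:
                parent[find(i)] = find(j)   # merge the two clusters
    return [find(i) for i in range(n)]
```

On four points forming two well-separated pairs, the rule recovers exactly two clusters; real packages implement the same idea far more efficiently via hierarchical merging.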

    A robust approach to model-based classification based on trimming and constraints

    In a standard classification framework, a set of trustworthy learning data is employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Unreliable labelled observations, namely outliers and data with incorrect labels, can therefore strongly undermine classifier performance, especially if the training set is small. The present work introduces a robust modification of the model-based classification framework, employing impartial trimming and constraints on the ratio between the maximum and the minimum eigenvalues of the group scatter matrices. The proposed method effectively handles noise in both the response and the explanatory variables, providing reliable classification even on contaminated datasets. A robust information criterion is proposed for model selection. Experiments on real and simulated data, artificially adulterated, are provided to underline the benefits of the proposed method.
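The eigenvalue-ratio constraint can be illustrated on a single scatter matrix: truncate its spectrum so the ratio of largest to smallest eigenvalue does not exceed a bound c, keeping the eigenvectors fixed. The actual method (in the tradition of trimmed clustering with eigenvalue constraints) chooses the truncation level jointly across all groups inside the trimmed likelihood; this per-matrix clip is only a sketch of the restriction itself:

```python
import numpy as np

def constrain_scatter(S, c):
    """Enforce max(eig)/min(eig) <= c on a symmetric scatter matrix S by
    clipping its eigenvalues from below, leaving eigenvectors unchanged."""
    vals, vecs = np.linalg.eigh(S)
    hi = vals.max()
    vals = np.clip(vals, hi / c, hi)       # raise small eigenvalues to hi/c
    return vecs @ np.diag(vals) @ vecs.T
```

Bounding the eigenvalue ratio prevents the degenerate solutions (clusters collapsing onto near-singular scatter matrices) that plague unconstrained likelihood maximization.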

    On the relationships between spatial clustering, inequality, and economic growth in the United States: 1969–2000

    The literature on economic development has been divided as to the nature of the relationship between inequality and growth. Recent exploratory work in the field has provided evidence that the dynamic and spatial relationships between the two may be far more complicated than previously thought. This paper provides a spatial econometric specification for the analysis of economic growth that allows for simultaneity as it relates to inequality. Furthermore, attention is given to the possible impacts of local clustering on the performance of individual economies in a global setting. The new methodology is applied to the US states from 1969 to 2000, with counties used for the local inequality and clustering estimates.

    Keywords: economic growth, inequality, simultaneity, spatial clustering
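A standard building block for the kind of spatial-clustering diagnostics this literature relies on is Moran's I, which measures spatial autocorrelation of a variable under a spatial weight matrix: positive when similar values cluster in space, negative when they alternate. A minimal NumPy sketch (the paper's actual econometric specification is richer than this statistic alone):

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I for a variable observed over regions, where
    W[i, j] is the spatial weight linking region i to region j."""
    z = values - values.mean()             # deviations from the mean
    n = len(values)
    s0 = W.sum()                           # total weight
    return (n / s0) * (z @ W @ z) / (z @ z)
```

On four regions along a line with binary contiguity weights, the clustered pattern [1, 1, 0, 0] yields a positive I, while the alternating pattern [1, 0, 1, 0] yields a negative one.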

    Quantifying the consistency of scientific databases

    Science is a social process with far-reaching impact on our modern society. In recent years, for the first time, we are able to study science itself scientifically. This is enabled by the massive amounts of data on scientific publications that are increasingly becoming available. The data are contained in several databases, such as Web of Science or PubMed, maintained by various public and private entities. Unfortunately, these databases are not always consistent, which considerably hinders this study. Relying on the powerful framework of complex networks, we conduct a systematic analysis of the consistency among six major scientific databases. We find that identifying a single "best" database is far from easy. Nevertheless, our results indicate appreciable differences in the mutual consistency of the databases, which we interpret as recipes for future bibliometric studies.
    • 

    corecore