17 research outputs found

    Ranked centroid projection: A data visualization approach based on self-organizing maps

    The Self-Organizing Map (SOM) is an unsupervised neural network model that provides topology-preserving mapping from high-dimensional input spaces onto a commonly two-dimensional output space. In this study, the clustering and visualization capabilities of the SOM, especially in the analysis of textual data, i.e. document collections, are reviewed and further developed. A novel clustering and visualization approach based on the SOM is proposed for the task of text data mining. The proposed approach first transforms the document space into a multi-dimensional vector space by means of document encoding. Then a growing hierarchical SOM (GHSOM) is trained and used as a baseline framework, which automatically produces maps with various levels of detail. Following the training of the GHSOM, a novel projection method, namely the Ranked Centroid Projection (RCP), is applied to project the input vectors onto a hierarchy of two-dimensional output maps. The projection of the input vectors is treated as a vector interpolation into a two-dimensional regular map grid. A ranking scheme is introduced to select the nearest R units around the input vector in the original data space, the positions of which will be taken into account in computing the projection coordinates. The proposed approach can be used both as a data analysis tool and as a direct interface to the data. Its applicability has been demonstrated in this study using an illustrative data set and two real-world document clustering tasks, i.e. the SOM paper collection and the Anthrax paper collection. Based on the proposed approach, a software toolbox is designed for analyzing and visualizing document collections, which provides a user-friendly interface and several exploration and analysis functions. The presented SOM-based approach incorporates several unique features, such as the adaptive structure, the hierarchical training, the automatic parameter adjustment and the incremental clustering. Its advantages include the ability to convey a large amount of information in a limited space with comparatively low computation load, the potential to reveal conceptual relationships among documents, and the facilitation of perceptual inferences on both inter-cluster and within-cluster relationships.
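The projection step described in this abstract (rank the map units by distance to the input vector, keep the nearest R, and interpolate their grid positions) can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the linear rank weights R, R-1, ..., 1 used here are an assumption, as are the function name `ranked_centroid_projection` and its parameters; this is not the thesis's implementation.

```python
import math


def ranked_centroid_projection(x, weights, positions, R=3):
    """Project input vector x onto a 2-D map grid (illustrative sketch).

    weights   : list of unit weight vectors, same dimension as x
    positions : list of (row, col) grid coordinates, one per unit
    R         : number of nearest units taken into account
    """
    # Rank all units by Euclidean distance to x in the original data space.
    dists = [math.dist(x, w) for w in weights]
    ranked = sorted(range(len(weights)), key=lambda i: dists[i])[:R]

    # ASSUMPTION: linear rank weights R, R-1, ..., 1, so that the
    # closest unit contributes most. The abstract does not specify this.
    coeffs = [R - rank for rank in range(len(ranked))]
    total = sum(coeffs)

    # Weighted centroid of the selected units' grid positions.
    row = sum(c * positions[i][0] for c, i in zip(coeffs, ranked)) / total
    col = sum(c * positions[i][1] for c, i in zip(coeffs, ranked)) / total
    return (row, col)


# Hypothetical 2x2 map whose unit weights sit at the corners of the unit square.
unit_weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unit_positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ranked_centroid_projection([0.9, 0.2], unit_weights, unit_positions, R=2))
```

With R = 1 the sketch reduces to the usual best-matching-unit placement; larger R lets an input fall between grid nodes.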

    Facing-up Challenges of Multiobjective Clustering Based on Evolutionary Algorithms: Representations, Scalability and Retrieval Solutions

    This thesis focuses on multiobjective clustering algorithms, which optimize several objectives simultaneously and return a collection of potential solutions with different trade-offs among the objectives. The goal of the thesis is to design and implement a new multiobjective clustering technique based on evolutionary algorithms in order to address three current challenges related to these techniques. The first challenge is to define properly the region of possible solutions that is explored to find the best solution, which depends on the knowledge representation. The second challenge is to scale up the system by splitting the original data set into several subsets, so that the clustering process works with less data. The third challenge is to retrieve the most suitable solution, according to the quality and shape of the clusters, from the most interesting region of the collection of solutions returned by the algorithm.
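The "collection of potential solutions with different trade-offs" that a multiobjective method returns is usually the set of non-dominated solutions. The sketch below filters such a set from candidate clusterings scored on two objectives to be minimized. The scores are made-up illustrative values, and the names `dominates` and `pareto_front` are hypothetical; the thesis's actual objectives and evolutionary algorithm are not reproduced here.

```python
def dominates(a, b):
    """True if score tuple a is no worse than b on every objective
    and strictly better on at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scores):
    """Keep only the score tuples that no other tuple dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Illustrative (made-up) scores for four candidate clusterings,
# e.g. (cluster spread, connectivity penalty), both to be minimized.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
print(pareto_front(candidates))
```

Choosing one clustering from the surviving set is the retrieval problem that the abstract names as its third challenge.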

    Marc integrador de les capacitats de Soft-Computing i de Knowledge Discovery dels Mapes Autoorganitzatius en el Raonament Basat en Casos

    Case-Based Reasoning (CBR) is a machine learning approach that solves new problems by identifying analogies with previously solved problems. The organization, access and use of this prior knowledge are therefore crucial for achieving successful results. However, most real problems involve large volumes of complex, uncertain data with approximate knowledge, and CBR performance can degrade because of the complexity of managing this kind of knowledge. For this reason, a new research line, Soft-Computing and Intelligent Information Retrieval, has appeared in recent years to mitigate these effects. This is the context in which this thesis was born. Among the wide variety of Soft-Computing techniques for managing complex knowledge, Self-Organizing Maps (SOM) stand out for their capability to group data into patterns, which makes it possible to detect hidden relations in the data. This capability has been exploited in previous work, where the CBR case memory was organized with SOM to improve case retrieval. The goal of this thesis is to go a step beyond the simple combination of CBR and SOM: it introduces the Soft-Computing and Knowledge Discovery capabilities of SOM into all the phases of CBR, so that they benefit from the newly discovered knowledge. Furthermore, complexity measures appear in this context as a mechanism to model the performance of SOM according to the type of data. The achievement of this goal can be divided into four milestones: (1) the definition of a methodology for determining the best way of retrieving cases, taking into account the data complexity and the user requirements; (2) the improvement of the reliability of the proposed solutions through the relations between cases and clusters; (3) the enhancement of the explanatory capabilities through the generation of symbolic explanations; (4) the incremental and semi-supervised maintenance of the case memory organized by SOM. All these points are integrated in the SOMCBR framework, which is extensively evaluated on datasets from the UCI Repository and from medical and telematic domains. Additionally, the thesis addresses two secondary research lines arising from the requirements of the projects in which it was developed. First, the definition of domain-specific similarity functions for comparing a solved case with a new one is studied using a variant of Evolutionary Computation called Grammar Evolution (GE). Second, the definition of cooperation schemes between heterogeneous systems is studied in order to improve the reliability of their joint response, also by means of GE. Both lines are developed in two frameworks, BRAIN and MGE respectively, which are also evaluated on the datasets mentioned above.

    Microbial community functioning at hypoxic sediments revealed by targeted metagenomics and RNA stable isotope probing

    Microorganisms are instrumental to the structure and functioning of marine ecosystems and to the chemistry of the ocean owing to their essential role in the cycling of the elements and in the recycling of organic matter. Two of the most critical ocean biogeochemical cycles are those of nitrogen and sulfur, since they can influence the synthesis of nucleic acids and proteins, primary productivity and microbial community structure. Oxygen concentration in marine environments is one of the environmental variables that have been largely affected by anthropogenic activities; its decline induces hypoxic events which affect benthic organisms and fisheries. Hypoxia has been traditionally defined based on the level of oxygen below which most animal life cannot be sustained. Hypoxic conditions impact microbial composition and activity since anaerobic reactions and pathways are favoured at the expense of the aerobic ones. Naturally occurring hypoxia can be found in areas where water circulation is restricted, such as coastal lagoons, and in areas where oxygen-depleted water is driven onto the continental shelf, i.e. coastal upwelling regions. Coastal lagoons are highly dynamic aquatic systems, particularly vulnerable to human activities and susceptible to changes induced by natural events. For the purpose of this PhD project, the lagoonal complex of Amvrakikos Gulf, one of the largest semi-enclosed gulfs in the Mediterranean Sea, was chosen as a study site. Coastal upwelling regions are another type of oxygen-limited environment, where the formation of oxygen minimum zones (OMZs) has also been reported. Sediment in upwelling regions is rich in organic matter and bottom water is often depleted of oxygen because of intense heterotrophic respiration. For the purpose of this PhD project, the chosen coastal upwelling system was the Benguela system off Namibia, situated along the coast of southwestern Africa.
The aim of this PhD project was to study the microbial community assemblages of hypoxic ecosystems and to identify a potential link between their identity and function, with a particular emphasis on the microorganisms involved in the nitrogen and sulfur cycles. The methodology that was applied included targeted metagenomics and RNA stable isotope probing (SIP). It has been shown that the microbial community diversity pattern can be differentiated based on habitat type, i.e. between riverine, lagoonal and marine environments. Moreover, the studied habitats were functionally distinctive. Apart from salinity, which was the abiotic variable best correlated with the microbial community pattern, oxygen concentration was highly correlated with the predicted metabolic pattern of the microbial communities. In addition, when the total number of Operational Taxonomic Units (OTUs) was taken into consideration, a negative linear relationship with salinity was identified (see Chapter 2). Microbial community diversity patterns can also be differentiated based on the lagoon under study since each lagoon hosts a different sulfate-reducing microbial (SRM) community, again highly correlated with salinity. Moreover, the majority of environmental terms that characterized the SRM communities were classified to the marine biome, but terms belonging to the freshwater or brackish biomes were also found in stations where a freshwater effect was more evident (see Chapter 3). Taxonomic groups that were expected to be thriving in the sediments of the Benguela coastal upwelling system were either absent or present only in very low abundances. Epsilonproteobacteria dominated the anaerobic assimilation of acetate, as confirmed by their isotopic enrichment in the SIP experiments.
Enhancement of known sulfate-reducers was not achieved under sulfate addition, possibly due to competition for electron donors between nitrate-reducers and sulfate-reducers, to the inability of certain sulfate-reducing bacteria to use acetate as an electron donor, or to the short duration of the incubations (see Chapter 4). Future research should focus more on the community functioning of such habitats; an increased understanding of the biogeochemical cycles that characterize these hypoxic ecosystems will perhaps allow for predictions regarding the intensity and direction of the cycling of elements, especially of nitrogen and sulfur given their biological importance. Regulation of hypoxic episodes may help the end-users of these ecosystems to achieve higher productivity, in terms of fish catches, which is otherwise largely compromised by the elevated hydrogen sulfide concentrations.

    Nonsmooth optimization models and algorithms for data clustering and visualization

    Cluster analysis deals with the problem of organizing a collection of patterns into clusters based on a similarity measure. Various distance functions can be used to define this measure. Clustering problems with the similarity measure defined by the squared Euclidean distance have been studied extensively over the last five decades. However, problems with other Minkowski norms have attracted significantly less attention. The use of different similarity measures may help to identify different cluster structures of a data set. This in turn may help to significantly improve the decision making process. High dimensional data visualization is another important task in the field of data mining and pattern recognition. To date, the principal component analysis and the self-organizing maps techniques have been used to solve such problems. In this thesis we develop algorithms for solving clustering problems in large data sets using various similarity measures. Such similarity measures are based on the squared L... (Doctor of Philosophy)
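To make concrete the remark that different similarity measures can reveal different cluster structures, the sketch below assigns a point to its nearest cluster center under several Minkowski norms. The point and centers are made-up illustrative values, and the helper names `minkowski` and `assign` are hypothetical; this is plain nearest-center assignment, not the nonsmooth optimization algorithms developed in the thesis.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors.
    p = 1 is the Manhattan norm, p = 2 the Euclidean norm,
    p = float("inf") the Chebyshev (maximum) norm."""
    if p == float("inf"):
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def assign(points, centers, p):
    """Index of the nearest center for each point under the order-p norm."""
    return [
        min(range(len(centers)), key=lambda k: minkowski(pt, centers[k], p))
        for pt in points
    ]


# Made-up example: the same point changes cluster when the norm changes.
points = [(3.0, 3.0)]
centers = [(0.0, 0.0), (3.0, 7.0)]
for p in (1, 2, float("inf")):
    print(p, assign(points, centers, p))
```

In this example the point is closest to the second center under the L1 and L2 norms but to the first center under the maximum norm, which is the kind of structural difference the abstract alludes to.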

    ERP implementation methodologies and frameworks: a literature review

    Enterprise Resource Planning (ERP) implementation is a complex and dynamic process that involves a combination of technological and organizational interactions. An ERP implementation project is often the single largest IT project that an organization has ever launched, and it requires a mutual fit of system and organization. Moreover, an ERP implementation that supports business processes across many different departments is not a generic, rigid and uniform concept; it depends on a variety of factors. As a result, the issues surrounding the ERP implementation process have been one of the major concerns in industry. ERP implementation therefore receives attention from both practitioners and scholars, and the business and academic literature is abundant but not always conclusive or coherent. However, research on ERP systems has so far focused mainly on diffusion, use and impact issues. Less attention has been given to the methods used during the configuration and implementation of ERP systems; although they are commonly used in practice, they remain largely unexplored and undocumented in Information Systems research. The academic relevance of this research is thus its contribution to the existing body of scientific knowledge. A brief annotated literature review is carried out to evaluate the current state of the academic literature. The purpose is to present a systematic overview of relevant ERP implementation methodologies and frameworks, with the aim of achieving a better taxonomy of ERP implementation methodologies. This paper is useful to researchers who are interested in ERP implementation methodologies and frameworks. The results will serve as input for a classification of the existing ERP implementation methodologies and frameworks. The paper is also aimed at the professional ERP community involved in ERP implementation, promoting a better understanding of ERP implementation methodologies and frameworks, their variety and their history.