
    Multivariate Approaches to Classification in Extragalactic Astronomy

    Clustering objects into synthetic groups is a natural activity of any science. Astrophysics is no exception and is now facing a deluge of data. For galaxies, the century-old Hubble classification and the Hubble tuning fork are still largely in use, together with numerous mono- or bivariate classifications, most often made by eye. However, a classification must be driven by the data, and sophisticated multivariate statistical tools are used more and more often. In this paper we review these different approaches in order to situate them in the general context of unsupervised and supervised learning. We emphasize the astrophysical outcomes of these studies to show that multivariate analyses provide an obvious path toward a renewal of our classification of galaxies and are invaluable tools for investigating the physics and evolution of galaxies. Comment: Open Access paper. http://www.frontiersin.org/milky_way_and_galaxies/10.3389/fspas.2015.00003/abstract. DOI: 10.3389/fspas.2015.00003
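As a concrete illustration of the multivariate, data-driven grouping the review advocates, here is a minimal pure-Python k-means sketch on two hypothetical galaxy features (colour index and absolute magnitude). The features, values, and choice of k are illustrative assumptions, not taken from the paper:

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, clusters

# Hypothetical two-feature galaxy sample: (colour index, absolute magnitude).
galaxies = [(0.90, -21.5), (1.00, -21.0), (0.95, -22.0),  # a "red sequence"
            (0.30, -18.5), (0.40, -19.0), (0.35, -18.0)]  # a "blue cloud"
centroids, clusters = kmeans(galaxies, k=2)
```

Real analyses would use many more features (photometric, spectroscopic, kinematic) and more sophisticated methods, but this assignment/update loop is the core of most partitional clustering.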

    Tree species diversity estimation using airborne imaging spectroscopy

    With the ongoing global biodiversity loss, approaches to measuring and monitoring biodiversity are necessary for effective conservation planning, especially in tropical forests. Remote sensing is a promising tool for biodiversity mapping, and high spatial resolution imaging spectroscopy allows for direct estimation of tree species diversity based on spectral reflectance. The objective of this study is to test an approach for estimating tree species alpha diversity in a tropical montane forest in the Taita Hills, Kenya. Tree species diversity is estimated based on the spectral variation of high spatial resolution imaging spectroscopy data. The approach is an unsupervised classification, or clustering, applied to objects that represent tree crowns. Airborne imaging spectroscopy data and species data from 31 field plots were collected from the study area. After preprocessing of the spectroscopic imagery, a minimum noise fraction (MNF) transformation with a subsequent selection of 13 bands was applied to the data to reduce its noise and dimensionality. The imagery was then segmented to obtain objects that represent tree crowns. A clustering algorithm was applied to the segments, with the aim of grouping spectrally similar tree crowns. Experiments were conducted to find the optimal range for the number of clusters. Tree species richness and two diversity indices were calculated from the field data and from the clustering results. The clusters were assumed to represent species in the calculations. Correlation analysis and linear regression analysis were used to study the relationship between diversity measures derived from the field data and from the clustering results. The approach succeeded in revealing tree species diversity patterns with all three diversity measures. Despite some factors that added error to the relationship between field-derived and clustering-derived diversity measures, high correlations were observed.
Tree species richness in particular could be modelled well using the approach (standard error: 3 species). The size of the considered trees was found to be an important determinant of the relationships. Finally, a tree species richness map was created for the study area. With further development, the presented approach has potential for other interesting applications, such as the estimation of beta diversity, and tree species identification by linking the reflectance properties of individual crowns to their corresponding species.
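The comparison the study describes — a diversity measure computed in parallel from field species labels and from crown-cluster labels, then correlated — can be sketched as follows. The plot data are made up, and Shannon's H' stands in for the (unnamed) indices the thesis used:

```python
import math

def shannon_index(labels):
    """Shannon diversity H' = -sum p_i * ln(p_i) over group proportions."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical plots: field-observed species vs. spectral crown clusters.
field = [["a", "a", "b", "c"], ["a", "b", "c", "d"], ["a", "a", "a", "b"]]
crowns = [[1, 1, 2, 3], [1, 2, 3, 4], [1, 1, 1, 2]]
h_field = [shannon_index(p) for p in field]
h_crowns = [shannon_index(p) for p in crowns]
r = pearson(h_field, h_crowns)  # 1.0 here: the partitions match exactly
```

The index depends only on group proportions, not on which labels name the groups, which is exactly why unlabeled spectral clusters can substitute for species in a diversity calculation.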

    Methodology and Algorithms for Pedestrian Network Construction

    With the advanced capabilities of mobile devices and the success of car navigation systems, interest in pedestrian navigation systems is on the rise. A critical component of any navigation system is a map database which represents a network (e.g., road networks in car navigation systems) and supports key functionality such as map display, geocoding, and routing. Road networks, mainly due to the popularity of car navigation systems, are well defined and publicly available. However, in pedestrian navigation systems, as well as in other applications including urban planning and physical activity studies, road networks do not adequately represent the paths that pedestrians usually travel. Currently, there are no techniques to automatically construct pedestrian networks, impeding research and development of applications requiring pedestrian data. This, coupled with the increased demand for pedestrian networks, is the prime motivation for this dissertation, which focuses on the development of a methodology and algorithms that can construct pedestrian networks automatically. A methodology involving three independent approaches was developed: network buffering (using existing road networks), collaborative mapping (using GPS traces collected by volunteers), and image processing (using high-resolution satellite and laser imagery). Experiments were conducted to evaluate the pedestrian networks constructed by these approaches against a pedestrian network baseline as ground truth. The results of the experiments indicate that these three approaches, while differing in complexity and outcome, are viable for automatically constructing pedestrian networks.
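The "network buffering" idea — deriving pedestrian edges from an existing road network — can be illustrated geometrically: each road centerline segment is offset to both sides to yield candidate sidewalk segments. This is a hypothetical minimal sketch, not the dissertation's algorithm:

```python
import math

def offset_segment(p1, p2, d):
    """Offset the segment p1->p2 by distance d on either side, producing
    two parallel candidate sidewalk segments (one crude 'buffering' step)."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    # Unit normal to the segment direction.
    nx, ny = -dy / length, dx / length
    left = ((p1[0] + d * nx, p1[1] + d * ny), (p2[0] + d * nx, p2[1] + d * ny))
    right = ((p1[0] - d * nx, p1[1] - d * ny), (p2[0] - d * nx, p2[1] - d * ny))
    return left, right

# A 100 m east-west road centerline; sidewalks assumed 5 m to each side.
road = ((0.0, 0.0), (100.0, 0.0))
left, right = offset_segment(*road, d=5.0)
```

A full implementation would also trim offsets at intersections and merge overlapping segments; this sketch only shows the per-edge offset that seeds such a network.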

    Cluster Analysis of Time Series Data with Application to Hydrological Events and Serious Illness Conversations

    Cluster analysis explores the underlying structure of data and organizes it into groups (i.e., clusters) such that observations within the same group are more similar than those in different groups. Quantifying the "similarity" between observations, choosing the optimal number of clusters, and interpreting the results all require careful consideration of the research question at hand, the model parameters, and the amount of data and their attributes. In this dissertation, the first manuscript explores the impact of design choices and the variability in clustering performance on different datasets. This is demonstrated through a benchmark study consisting of 128 datasets from the University of California, Riverside time series classification archive. Next, a multivariate event time series clustering approach is applied to hydrological storm events in watershed science. Specifically, river discharge and suspended sediment data from six watersheds in Vermont are clustered, yielding four types of hydrological water quality events to help inform conservation and management efforts. In a second application, a novel and computationally efficient clustering algorithm called SOMTimeS (Self-organizing Map for Time Series) is designed for large time series analysis using dynamic time warping (DTW). The algorithm scales linearly with increasing data, making SOMTimeS, to the best of our knowledge, the fastest DTW-based clustering algorithm to date. For proof of concept, it is applied to conversational features from a Palliative Care Communication Research Initiative study with the goal of understanding and motivating high quality communication in serious illness health care settings
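SOMTimeS itself is not reproduced here, but the dynamic time warping distance at its core is standard; a minimal O(nm) dynamic-programming implementation:

```python
def dtw(a, b):
    """Dynamic time warping distance between two numeric sequences.
    Fills an (n+1) x (m+1) cumulative-cost table; each cell extends the
    cheapest of the three admissible alignment moves."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# A warped copy of a series is closer under DTW than under pointwise metrics.
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 1, 1, 2, 3, 2, 1, 0]   # same shape, one step stretched
print(dtw(x, y))  # 0.0: the warping path absorbs the stretch
```

SOMTimeS pairs this distance with a self-organizing map and pruning tricks to reach linear scaling; the quadratic table above is the textbook baseline it accelerates.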

    Texture and Colour in Image Analysis

    Research in colour and texture has experienced major changes in the last few years. This book presents some recent advances in the field, specifically in the theory and applications of colour texture analysis. This volume also features benchmarks, comparative evaluations and reviews

    Data Clustering and Partial Supervision with Some Parallel Developments

    Data Clustering and Partial Supervision with Some Parallel Developments by Sameh A. Salem. Clustering is an important and irreplaceable step in the search for structure in data. Many different clustering algorithms have been proposed. Yet, the sources of variability in most clustering algorithms affect the reliability of their results. Moreover, the majority tend to rely on the number of clusters as one of the input parameters; unfortunately, in many scenarios this knowledge is not available. In addition, clustering algorithms are computationally intensive, which poses a major challenge in scaling up to large datasets. This thesis offers possible solutions to these problems. First, new measures - called clustering performance measures (CPMs) - for assessing the reliability of a clustering algorithm are introduced. These CPMs can be used to evaluate: 1) clustering algorithms that have a structural bias toward certain types of data distribution as well as those that have no such biases; 2) clustering algorithms that have initialisation dependency as well as clustering algorithms that have a unique solution for a given set of parameter values with no initialisation dependency. Then, a novel clustering algorithm, the RAdius based Clustering ALgorithm (RACAL), is proposed. RACAL uses a distance-based principle to map the distributions of the data, assuming that clusters are determined by a distance parameter, without having to specify the number of clusters. Furthermore, RACAL is enhanced by a validity index to choose the best clustering result, i.e. the result with compact clusters and wide cluster separations, for a given input parameter. Comparisons with other clustering algorithms indicate the applicability and reliability of the proposed clustering algorithm. Additionally, an adaptive partial supervision strategy is proposed for use in conjunction with RACAL to make it act as a classifier. Results from RACAL with partial supervision, RACAL-PS, indicate its robustness in classification. Additionally, a parallel version of RACAL (P-RACAL) is proposed. The parallel evaluations of P-RACAL indicate that it is scalable in terms of speedup and scaleup, giving it the ability to handle large, high-dimensional datasets in reasonable time. Next, a novel clustering algorithm that achieves clustering without any control of cluster sizes is introduced. This algorithm, called the Nearest Neighbour Clustering Algorithm (NNCA), uses the same concept as the K-Nearest Neighbour (KNN) classifier, with the advantage that it needs no training set and is completely unsupervised. Additionally, NNCA is augmented with a partial supervision strategy, NNCA-PS, to act as a classifier. Comparisons with other methods indicate the robustness of the proposed method in classification. Additionally, experiments in a parallel environment indicate the suitability and scalability of the parallel NNCA, P-NNCA, in handling large datasets. Further investigations on more challenging data are carried out. In this context, microarray data is considered. In such data, the number of clusters is not clearly defined, which points directly towards clustering algorithms that do not require knowledge of the number of clusters. Therefore, the efficacy of one of these algorithms is examined. Finally, a novel integrated clustering performance measure (ICPM) is proposed as a guideline for choosing the proper clustering algorithm, one with the ability to extract useful biological information from a particular dataset. Supplied by The British Library - 'The world's knowledge'
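RACAL's defining property — clusters determined by a distance parameter rather than a pre-specified cluster count — can be illustrated with a much simpler leader-style sketch (this is not the RACAL algorithm, only the radius principle it builds on; data and radius are made-up values):

```python
import math

def radius_cluster(points, radius):
    """Leader-style radius clustering: a point joins the first existing
    cluster whose leader lies within `radius`, otherwise it starts a new
    cluster. The number of clusters is an outcome, not an input."""
    leaders, clusters = [], []
    for p in points:
        for i, q in enumerate(leaders):
            if math.dist(p, q) <= radius:
                clusters[i].append(p)
                break
        else:
            leaders.append(p)
            clusters.append([p])
    return leaders, clusters

pts = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (10.3, 9.8)]
leaders, clusters = radius_cluster(pts, radius=2.0)
# Two clusters emerge without k ever being specified.
```

A validity index, as the thesis describes, would then score results for several radius values and keep the one with the most compact, well-separated clusters.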

    A clustering approach based on divide-and-conquer and optimum-path forest

    Advisor: Alexandre Xavier Falcão. Dissertation (Master's) - Universidade Estadual de Campinas, Instituto de Computação. Abstract: Data clustering is one of the main challenges in solving Data Science problems. Despite its progress over almost a century of research, clustering algorithms still fail to identify groups naturally related to the semantics of the problem. Moreover, advances in data acquisition, communication, and storage technologies add crucial challenges with a considerable increase in data, which most techniques do not handle. We address these issues by proposing a divide-and-conquer approach to a clustering technique that is unique in finding one group per dome of the probability density function of the data --- the Optimum-Path Forest (OPF) clustering algorithm. In the OPF clustering technique, samples are taken as nodes of a graph whose arcs connect the k-nearest neighbors in the feature space. The nodes are weighted by their probability density values and a connectivity map is maximized such that each maximum of the probability density function becomes the root of an optimum-path tree (cluster). The best value of k is estimated by optimization within an application-specific interval of values. The problem with this method is that a high number of samples makes the algorithm prohibitive, due to the memory required to store the graph and the computational time to obtain the clusters for the best value of k. Since the existing solutions lead to ineffective results, we revisit the problem by proposing a two-level divide-and-conquer approach. At the first level, the dataset is divided into smaller subsets (blocks) and the samples belonging to each block are grouped by the OPF algorithm. Then, the representative samples (more specifically, the roots of the optimum-path forest) are taken to a second level, where they are clustered again. Finally, the group labels obtained at the second level are transferred to all samples of the dataset through their first-level representatives. With this approach, we can use all samples, or at least many of them, in the unsupervised learning process without affecting the grouping performance, and the procedure is therefore less likely to lose relevant grouping information. We show that our proposal obtains satisfactory results in two scenarios, image segmentation and the general data clustering problem, in comparison with popular baselines. In the first scenario, our technique achieves better results than the others on all tested image databases. In the second scenario, it obtains outcomes similar to an optimized version of the traditional OPF clustering algorithm.
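The two-level divide-and-conquer scheme generalizes beyond OPF itself; here is a sketch with a simple leader-style clusterer standing in for the OPF step at each level (the data, block split, and radius are illustrative assumptions):

```python
import math

def leader_cluster(points, radius):
    """Stand-in for OPF clustering at each level: a point joins the first
    cluster whose leader is within `radius`, else it starts a new one."""
    leaders, clusters = [], []
    for p in points:
        for i, q in enumerate(leaders):
            if math.dist(p, q) <= radius:
                clusters[i].append(p)
                break
        else:
            leaders.append(p)
            clusters.append([p])
    return leaders, clusters

def two_level_cluster(data, n_blocks, radius):
    """Two-level divide-and-conquer: cluster each block, re-cluster the
    block-level representatives ('roots'), then propagate the second-level
    labels back to every sample through its representative."""
    rep_of, reps = {}, []
    for b in range(n_blocks):                              # level 1: per block
        block = data[b::n_blocks]
        leaders, clusters = leader_cluster(block, radius)
        for leader, members in zip(leaders, clusters):
            reps.append(leader)
            for p in members:
                rep_of[p] = leader
    _, clusters2 = leader_cluster(reps, radius)            # level 2: roots only
    label_of_rep = {r: lab for lab, ms in enumerate(clusters2) for r in ms}
    return {p: label_of_rep[rep_of[p]] for p in data}      # propagate labels

data = [(0.0, 0.0), (0.4, 0.1), (0.2, 0.3),
        (10.0, 10.0), (10.2, 9.9), (9.8, 10.1)]
labels = two_level_cluster(data, n_blocks=2, radius=2.0)
```

Only the representatives ever meet in one clustering call, which is what keeps the memory and time of the expensive step bounded regardless of the dataset size.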

    A Novel Cooperative Algorithm for Clustering Large Databases With Sampling.

    Data clustering is a recurring task in data mining. Over time, clustering ever-larger databases has become increasingly important. However, applying traditional clustering heuristics to large databases is not an easy task. These techniques generally have complexity at least quadratic in the number of points in the database, making their use infeasible due to high response times or the low quality of the final solution. The most common solution to the problem of clustering large databases is to use special algorithms that are weaker from a quality standpoint. This work proposes a different approach to the problem: the use of stronger, traditional algorithms on a subset of the original data. This subset of the original data is obtained using a coevolutionary algorithm that selects a subset of points that is hard to cluster
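The sampling idea can be sketched as follows, with a plain random sample standing in for the coevolutionary selection of hard-to-cluster points described in the abstract (all names and values here are illustrative):

```python
import math
import random

def cluster_via_sample(points, sample_size, radius, seed=0):
    """Run a (potentially quadratic-cost) clustering only on a small sample,
    then assign every point to the cluster of its nearest sample leader.
    The expensive step never touches the full dataset."""
    sample = random.Random(seed).sample(points, sample_size)
    leaders = []                 # quadratic step, confined to the sample
    for p in sample:
        if all(math.dist(p, q) > radius for q in leaders):
            leaders.append(p)
    # Cheap linear pass: label each point by its nearest leader.
    return {p: min(range(len(leaders)),
                   key=lambda i: math.dist(p, leaders[i]))
            for p in points}

points = [(0.0, 0.0), (0.1, 0.2), (0.3, 0.1),
          (5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
labels = cluster_via_sample(points, sample_size=4, radius=1.0)
```

The work's contribution is precisely in replacing the random sample with a coevolutionary search for points that are hard to cluster, so that the strong algorithm spends its quadratic budget where it matters most.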