5,454 research outputs found

    STATS - A Point Access Method for Multidimensional Clusters.

    The ubiquity of high-dimensional data in machine learning and data mining applications makes its efficient indexing and retrieval from main memory crucial. Frequently, these machine learning algorithms need to query specific characteristics of single multidimensional points. For example, given a clustered dataset, the cluster membership (CM) query retrieves the cluster to which an object belongs. To answer this type of query efficiently, we have developed STATS, a novel main-memory index that scales to answer CM queries on increasingly large datasets. Current indexing methods are oblivious to the structure of clusters in the data, so we develop STATS around the key insight that exploiting the cluster information when indexing, and preserving it in the index, will accelerate lookup. We show experimentally that STATS outperforms known methods with regard to retrieval time and scales well with dataset size for any number of dimensions.
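
To make the cluster membership (CM) query concrete, here is a minimal sketch of a naive hash-based baseline. It is not the STATS index, whose internals the abstract does not describe; the names `build_cm_index` and `cm_query` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def build_cm_index(points: Sequence[Point],
                   labels: Sequence[Hashable]) -> Dict[Point, Hashable]:
    """Map each point to its cluster label (naive baseline, not STATS)."""
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    return dict(zip(points, labels))


def cm_query(index: Dict[Point, Hashable],
             point: Sequence[float]) -> Optional[Hashable]:
    """Return the cluster of an exactly matching point, or None if absent."""
    return index.get(tuple(point))


# Hypothetical toy data: two 2-D points in two clusters.
index = build_cm_index([(0.0, 1.0), (5.0, 5.0)], ["a", "b"])
label = cm_query(index, (5.0, 5.0))
```

A plain hash map like this ignores cluster structure entirely, which is the limitation of existing indexes that the abstract says STATS is designed to address.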

    High throughput powder diffraction: II Applications of clustering methods and multivariate data analysis

    In high-throughput crystallography it is possible to accumulate over 1000 powder diffraction patterns on a series of related compounds, often polymorphs. We present a method that can analyse such data, automatically sort the patterns into related clusters or classes, characterise each cluster, and identify any unusual samples containing, for example, unknown or unexpected polymorphs. Mixtures may be analysed quantitatively if a database of pure phases is available. A key component of the method is a set of visualisation tools based on dendrograms, cluster analysis, pie charts, principal-component-based score plots, and metric multidimensional scaling. Applications to pharmaceutical data and inorganic compounds are presented. The procedures have been incorporated into the PolySNAP commercial computer software.

    AMADA-Analysis of Multidimensional Astronomical Datasets

    We present AMADA, an interactive web application to analyse multidimensional datasets. The user uploads a simple ASCII file and AMADA performs a number of exploratory analyses together with contemporary visualization diagnostics. The package performs a hierarchical clustering in the parameter space, and the user can choose among linear, monotonic, or non-linear correlation analysis. AMADA provides a number of clustering visualization diagnostics such as heatmaps, dendrograms, chord diagrams, and graphs. In addition, AMADA has the option to run a standard or robust principal components analysis, displaying the results as polar bar plots. The code is written in R and the web interface was created using the Shiny framework. The AMADA source code is freely available at https://goo.gl/KeSPue, and the Shiny app at http://goo.gl/UTnU7I. Comment: Accepted for publication in Astronomy & Computing.

    Similarity Structure Analysis and Structural Equation Modeling in Studying Latent Structures: An Application to the Attitudes towards Portuguese Language Questionnaire

    Several international studies, such as PISA and PIRLS (Progress in International Reading Literacy Study), have stressed the importance of positive attitudes and behaviours as facilitators of individuals' reading literacy during the school years and throughout their lives. Because no instruments were available for assessing attitudes towards the Portuguese language, the Attitudes towards Portuguese Language Questionnaire (ATPLQ; Questionário de Atitudes Face à Língua Portuguesa, QAFLP; Neto et al., 2011; Rebelo, 2012) was developed. The questionnaire has 22 Likert-type items with four response levels (Strongly Disagree, Disagree, Agree, Strongly Agree), which exploratory factor analysis (EFA) distributed over three attitudinal dimensions: Behavioural, Affective, and Motivational. In this study we aimed to analyse the ATPLQ's latent structure with pooled sample data from 1441 participants, applying similarity structure analysis (SSA) and confirmatory factor analysis of ordinal data (CFA). The SSA was carried out with Hudap in order to identify the structural properties of the questionnaire and to assess its adequacy in a Portuguese population. The CFA was carried out with LISREL in order to assure structural validity, i.e., accounting for factorial validity, but also for the factors' convergent and discriminant validity and composite reliability. These psychometric features allowed the comparison of the EFA-derived model and the SSA-derived model. We justify the selection of the SSA model, and we discuss the similarities between the results generated by the SSA and LISREL procedures, highlighting their use in modeling constructs with ordinal indicators.

    Women and Stability: A Topological View of the Relationship between Women and Armed Conflict in West Africa

    The relationship between women and stability, if any, is a topic of much debate and research. Several large and influential organizations have researched women's effect on stability. Furthermore, several of these world organizations, the United Nations in particular, have declared gender equality to be a driving force in promoting stability and conflict prevention. Due to the United States' active involvement in conflict prevention in regions such as West Africa, research concerning the relationship between women and stability is of particular interest to the United States Africa Command. As such, this research applied Topological Data Analysis, combined with other machine learning algorithms, to Demographic and Health Survey Program data combined with Armed Conflict Location and Event Data so as to observe the relationship between women's status and armed conflicts in the West African region. While this team did not observe any direct correlation between women's well-being and stability (defined as a lack of armed conflict events), the chosen methodologies and data usage have potential implications for future research concerning stability and conflict.

    Visualizing Profiles of Large Datasets of Weighted and Mixed Data

    This work provides a procedure with which to construct and visualize profiles, i.e., groups of individuals with similar characteristics, for weighted and mixed data by combining two classical multivariate techniques, multidimensional scaling (MDS) and the k-prototypes clustering algorithm. The well-known drawback of classical MDS in large datasets is circumvented by selecting a small random sample of the dataset, whose individuals are clustered by means of an adapted version of the k-prototypes algorithm and mapped via classical MDS. Gower's interpolation formula is used to project the remaining individuals onto the previous configuration. Throughout the process, Gower's distance is used to measure the proximity between individuals. The methodology is illustrated on a real dataset, obtained from the Survey of Health, Ageing and Retirement in Europe (SHARE), which was carried out in 19 countries and represents over 124 million aged individuals in Europe. The performance of the method was evaluated through a simulation study, whose results indicate that the new proposal solves the high computational cost of classical MDS with low error. This research was funded by the Spanish Ministry of Economy and Competitiveness, grant number MTM2014-56535-R; and the V Regional Plan for Scientific Research and Technological Innovation 2016-2020 of the Community of Madrid, an agreement with Universidad Carlos III de Madrid in the action of "Excellence for University Professors."
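
As an illustration of the proximity measure named in this abstract, the following is a minimal sketch of a Gower-type dissimilarity for mixed numeric and categorical records. It assumes the common textbook form (range-normalised absolute difference for numeric variables, simple matching for categorical ones, averaged over variables). The paper's exact variant, its handling of sampling weights, and any square-root transform applied before MDS are not given in the abstract, so the function name, the optional per-variable `weights`, and the toy records are all hypothetical.

```python
def gower_distance(a, b, is_numeric, ranges, weights=None):
    """Gower-type dissimilarity between two mixed-type records, in [0, 1]."""
    if weights is None:
        weights = [1.0] * len(a)  # assumption: equal variable weights
    total = 0.0
    for x, y, numeric, rng, w in zip(a, b, is_numeric, ranges, weights):
        if numeric:
            # Range-normalised absolute difference for numeric variables.
            d = abs(x - y) / rng if rng else 0.0
        else:
            # Simple matching for categorical variables.
            d = 0.0 if x == y else 1.0
        total += w * d
    return total / sum(weights)


# Hypothetical toy records: (age, country, retired)
records = [
    (52, "ES", "no"),
    (67, "ES", "yes"),
    (82, "DE", "yes"),
]
is_numeric = [True, False, False]
ranges = [82 - 52, None, None]  # a range is only needed for numeric columns

d01 = gower_distance(records[0], records[1], is_numeric, ranges)
d02 = gower_distance(records[0], records[2], is_numeric, ranges)
```

Here the first two records differ by half the age range and in one of two categorical fields, while the first and third differ maximally on every variable. A matrix of such pairwise values over the random sample is the kind of input that classical MDS would then map to a low-dimensional configuration.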

    Análisis de la similaridad para la toma de decisiones en el Draft de la NBA

    Bachelor's thesis (Trabajo de Fin de Grado) in Computer Engineering, Facultad de Informática UCM, Departamento de Sistemas Informáticos y Computación, academic year 2019/2020. This work is based on the statistical studies published by NBA mock draft websites and on websites that store the official statistics of NBA players. The data associated with NBA players and teams are currently very valuable, since exploiting them correctly can yield large economic benefits. The objective of this work is to show how data mining can help scouts, coaches, and team executives select new players. Scouts participating in the Draft could use the information provided by the models to make a better decision that complements their personal experience. This would save time and money because, simply by analyzing the results of the models, teams would not have to travel around the world to find players who could be ruled out of the selection. In this work, unsupervised clustering techniques are studied to analyze the similarity between players. Databases with statistics of both current and past players are used. In addition, three different clustering techniques are implemented so that their results can be compared, adding value to the information and facilitating decision-making. The most relevant result emerges when shooting in the NCAA is analyzed using clustering techniques.