    Data visualization plays a crucial role in identifying interesting patterns in exploratory data analysis. Its use is, however, made difficult by the large number of possible data projections, each showing a different attribute subset, that the data analyst must evaluate. In this paper, we introduce a method called VizRank, which is applied to classified data to automatically select the most useful data projections. VizRank can be used with any visualization method that maps attribute values to points in a two-dimensional visualization space. It assesses possible data projections and ranks them by their ability to visually discriminate between classes. The quality of class separation is estimated by computing the predictive accuracy of a k-nearest neighbor classifier on the data set consisting of the x and y positions of the projected data points and their class information. The paper introduces the method and presents experimental results which show that VizRank's ranking of projections agrees closely with subjective rankings by data analysts. The practical use of VizRank is also demonstrated by an application in the field of functional genomics.
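
The ranking idea described in this abstract can be sketched in a few lines. The example below is only an illustration under stated assumptions: it treats every attribute pair as a scatterplot projection and scores it by the leave-one-out accuracy of a k-nearest neighbor classifier on the projected points. The function names (`knn_loo_accuracy`, `rank_scatterplots`), the choice of scatterplots as the visualization method, and the leave-one-out protocol are assumptions made here, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations


def knn_loo_accuracy(points, labels, k=3):
    """Leave-one-out accuracy of a k-NN classifier on 2-D points."""
    correct = 0
    for i, (xi, yi) in enumerate(points):
        # Squared Euclidean distance from point i to every other point.
        dists = sorted(
            ((xi - xj) ** 2 + (yi - yj) ** 2, j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        # Majority vote among the k nearest neighbours.
        votes = Counter(labels[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(points)


def rank_scatterplots(rows, labels, k=3):
    """Score every attribute pair (one scatterplot each), best first."""
    scored = []
    for a, b in combinations(range(len(rows[0])), 2):
        points = [(row[a], row[b]) for row in rows]
        scored.append((knn_loo_accuracy(points, labels, k), (a, b)))
    return sorted(scored, reverse=True)


# Hypothetical toy data: attributes 0 and 1 separate the classes,
# attribute 2 does not.
rows = [
    (0.0, 0.1, 5.0), (0.1, 0.0, 1.0), (0.2, 0.2, 4.0), (0.1, 0.2, 0.0),
    (1.0, 1.1, 5.1), (1.1, 1.0, 1.1), (1.2, 1.2, 4.1), (1.1, 1.2, 0.1),
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
ranking = rank_scatterplots(rows, labels)
```

In this toy data the projection onto attributes 0 and 1 comes out on top, because it is the only pair in which every point's nearest neighbours share its class.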

    The new full-metal ITER-like wall at JET was found to have a deep impact on the physics of disruptions at JET. In order to develop disruption classification, the 10D operational space of JET with the new ITER-like wall has been explored using the generative topographic mapping method. The resulting 2D map has been exploited to develop an automatic classifier for several manually identified disruption classes. In particular, all the non-intentional disruptions that occurred in JET from 2011 to 2013 with the new wall have been considered. A statistical analysis of the plasma parameters describing the operational spaces of JET with the carbon wall and JET with the ITER-like wall has been performed, and some physical considerations have been made on the difference between these two operational spaces and the disruption classes which can be identified. The performance of the JET ITER-like wall classifier is tested in real time in conjunction with a disruption predictor presently operating at JET, with good results. Moreover, to validate and analyse the results, another reference classifier has been developed, based on the k-nearest neighbour technique. Finally, in order to verify the reliability of the performed classification, a conformal predictor based on non-conformity measures has been developed.

    The possibility of implementing an algorithm capable of giving intelligence to a robotic system that can move and perceive its environment (referred to from here on as an intelligent agent) becomes a valuable resource for mobile robotics and for society when it is applied to carry out tasks autonomously, some of which may be too complex or dangerous to be performed by a living being. In the area of artificial intelligence applied to route planning, and especially when heuristic functions are used, finding a route that joins two points known as the start and the destination does not directly guarantee that the best route is found; nevertheless, finding a path that connects the two points becomes a valuable solution, depending on the situation in which the planning is carried out.

    Optimal performance of a Case-Based Reasoning (CBR) system means that the system must be efficient both in time and in size, and must be optimally competent. Efficiency in time is closely related to an efficient and optimal retrieval process over the Case Base of the CBR system. Efficiency in size means that the Case Library (CL) size should be minimal. Therefore, efficiency in size is closely related to optimal case learning policies, optimal meta-case learning policies, optimal case forgetting policies, etc. On the other hand, optimal competence of a CBR system means that the number of problems that the system can satisfactorily solve must be maximal. Improving or optimizing all three dimensions of a CBR system at the same time is a difficult challenge because they are interrelated, and it becomes even more difficult when the CBR system is applied to a dynamic or continuous domain (data stream). In this thesis, a Dynamic Adaptive Case Library framework (DACL) is proposed to improve CBR system performance, in particular by reducing the retrieval time, increasing the competence of the CBR system, and maintaining and adapting the CL so that it stays efficient in size, especially in continuous domains. DACL learns cases and organizes them into dynamic cluster structures. DACL is able to adapt itself to a dynamic environment, where new clusters, meta-cases or prototypes of cases, and associated indexing structures (discriminant trees, k-d trees, etc.) can be formed, updated, or even removed. DACL offers a possible solution to the management of the large amount of data generated in an unsupervised continuous domain (data stream). In addition, we propose the use of a Multiple Case Library (MCL), which is a static version of a DACL, with the same structure but defined statically for use in supervised domains. The thesis proposes some techniques for improving the indexing and retrieval tasks. 
The most important indexing method is the NIAR k-d tree algorithm, which improves retrieval time and competence compared with the baseline approach (a flat CL) and with well-known techniques based on standard k-d tree strategies. The proposed Partial Matching Exploration (PME) technique explores a hierarchical case library with a tree indexing structure, aiming not to lose the cases most similar to a query case. This technique explores not only the best matching path but also several alternative partial matching paths. Experiments on a set of well-known supervised benchmark databases show an improvement in competence and in the retrieval time of similar cases. The dynamic building of prototypes in DACL has been tested in an unsupervised domain (an environmental domain) where air pollution is evaluated. The core task of building prototypes in a DACL is the implementation of a stochastic method for the learning of new cases and the management of prototypes. Finally, the whole dynamic framework, integrating all the main approaches proposed in the research work, has been tested in simulated unsupervised domains with several well-known databases processed incrementally, as data streams are processed in real life. The experimental results indicate that the proposed dynamic adaptive framework (DACL/MCL), together with the contributed indexing strategies and exploration techniques and the proposed stochastic case learning and meta-case learning policies, improves the performance of standard CBR systems both in supervised domains (MCL) and in unsupervised continuous domains (DACL). Postprint (published version
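
To make the contrast between a flat case library and a tree-indexed one concrete, the following is a minimal sketch of case retrieval with a standard k-d tree. It is not the NIAR k-d tree or the PME technique, whose details the abstract does not give; the case layout `(features, solution)` and the names `build_kdtree` and `retrieve` are assumptions chosen for illustration.

```python
import math


def build_kdtree(cases, depth=0):
    """Build a k-d tree over cases of the form (features, solution).

    Each node is a tuple (case, left_subtree, right_subtree).
    """
    if not cases:
        return None
    axis = depth % len(cases[0][0])          # cycle through the features
    cases = sorted(cases, key=lambda c: c[0][axis])
    mid = len(cases) // 2                    # median case becomes the node
    return (
        cases[mid],
        build_kdtree(cases[:mid], depth + 1),
        build_kdtree(cases[mid + 1:], depth + 1),
    )


def retrieve(node, query, depth=0, best=None):
    """Return (distance, case) for the stored case nearest to `query`."""
    if node is None:
        return best
    case, left, right = node
    d = math.dist(case[0], query)
    if best is None or d < best[0]:
        best = (d, case)
    axis = depth % len(query)
    diff = query[axis] - case[0][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = retrieve(near, query, depth + 1, best)
    # Visit the other side only if the splitting plane is closer
    # than the best match found so far.
    if abs(diff) < best[0]:
        best = retrieve(far, query, depth + 1, best)
    return best


# Hypothetical case library: 2-D feature vectors with integer "solutions".
cases = [(((i * 37) % 101 / 101, (i * 53) % 97 / 97), i) for i in range(60)]
tree = build_kdtree(cases)
distance, nearest_case = retrieve(tree, (0.5, 0.5))
```

The pruning test in `retrieve` is what saves time relative to a flat scan: whole subtrees are skipped when they cannot contain a closer case, while the result stays identical to exhaustive search.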

    Folksonomies have become a powerful tool to describe, discover, search, and navigate online resources (e.g., pictures, videos, blogs) on the Social Web. Unlike taxonomies and ontologies, which impose a hierarchical categorisation on content, folksonomies directly allow end users to freely create and choose the categories (in this case, tags) that best describe a piece of information. However, the freedom afforded to users comes at a cost: as tags are defined informally, the retrieval of information becomes more challenging. Different solutions have been proposed to help users discover content in this highly dynamic setting. However, they have proved to be effective only for users who have already heavily used the system (active users) and who are interested in popular items (i.e., items tagged by many other users). In this thesis we explore principles to help both active users and, more importantly, new or inactive users (cold starters) to find content they are interested in, even when this content falls into the long tail of medium-to-low popularity items (cold start items). We investigate the tagging behaviour of users on content and show how the similarities between users and tags can be used to produce better recommendations. We then analyse how users create new content on social tagging websites and show how the preferences of only a small portion of active users (leaders), responsible for the vast majority of the tagged content, can be used to improve the recommender system's scalability. We also investigate the growth of the number of users, items and tags in the system over time. We then show how this information can be used to decide whether the benefits of an update of the data structures modelling the system outweigh the corresponding cost. In this work we formalize the ideas introduced above and we describe their implementation. 
To demonstrate the improvements of our proposal in recommendation efficacy and efficiency, we report the results of an extensive evaluation conducted on three different social tagging websites: CiteULike, Bibsonomy and MovieLens. Our results demonstrate that our approach achieves higher accuracy than state-of-the-art systems for cold start users and for users searching for cold start items. Moreover, while the accuracy of our technique is comparable to that of other techniques for active users, the computational cost that it requires is much smaller. In other words, our approach is more scalable and thus more suitable for large and quickly growing settings.
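
The idea of using tag-based similarity between users to produce recommendations can be illustrated with a generic neighbourhood-style sketch. This is not the thesis's algorithm: the Jaccard measure, the additive scoring, and the names `jaccard` and `recommend` are assumptions made here purely to show the general shape of such a recommender.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two tag sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, user_tags, user_items, top_n=3):
    """Recommend items tagged by users whose tag vocabulary resembles the target's."""
    scores = defaultdict(float)
    for other, tags in user_tags.items():
        if other == target:
            continue
        sim = jaccard(user_tags[target], tags)
        # Credit every item the other user has that the target lacks.
        for item in user_items[other] - user_items[target]:
            scores[item] += sim
    # Highest score first; item name breaks ties deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical users, tags and items.
user_tags = {
    "ann": {"python", "ml"},
    "bob": {"python", "ml", "stats"},
    "cat": {"cooking"},
}
user_items = {"ann": {"p1"}, "bob": {"p1", "p2"}, "cat": {"p9"}}
suggestions = recommend("ann", user_tags, user_items)
```

Here only `bob` shares tags with `ann`, so only his unseen item is suggested; the item from `cat` scores zero and is filtered out.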

    Because of the ongoing digital data explosion, search paradigms more advanced than traditional exact match are needed for content-based retrieval in huge and ever-growing collections of data produced in application areas such as multimedia, molecular biology, marketing, computer-aided design and purchasing assistance. As the variety of data types grows rapidly and databases are increasingly used directly by people, computer systems must be able to model fundamental human reasoning paradigms, which are naturally based on similarity. The ability to perceive similarities is crucial for recognition, classification, and learning, and it plays an important role in scientific discovery and creativity. Recently, the mathematical notion of metric space has become a useful abstraction of similarity, and many similarity search indexes have been developed. In this thesis, we accept the metric space similarity paradigm and concentrate on the scalability issues. By exploiting computer networks and applying Peer-to-Peer communication paradigms, we build a structured network of computers able to process similarity queries in parallel. Since no centralized entities are used, such architectures are fully scalable. Specifically, we propose a Peer-to-Peer system for similarity search in metric spaces called Metric Content-Addressable Network (MCAN), which is an extension of the well-known Content-Addressable Network (CAN) used for hash lookup. A prototype implementation of MCAN was tested on real-life datasets of image features, protein symbols, and text; the observed results are reported. We also compared the performance of MCAN with three other recently proposed distributed data structures for similarity search in metric spaces.
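
Why the metric space abstraction is useful for indexing can be shown with a small, centralized sketch of pivot-based filtering, a general metric-space technique. It is not the MCAN protocol itself, and nothing here is distributed: the names `build_index` and `range_query`, the choice of pivots, and the Euclidean metric are all assumptions for illustration. The only property used is the triangle inequality, which gives the lower bound |d(q,p) - d(o,p)| <= d(q,o) for any pivot p.

```python
import math


def build_index(objects, pivots, metric):
    """Precompute the distance from every object to every pivot."""
    return [[metric(obj, p) for p in pivots] for obj in objects]


def range_query(query, radius, objects, pivots, table, metric):
    """Return all objects within `radius` of `query`.

    Objects are discarded without computing d(query, obj) whenever some
    pivot proves, via the triangle inequality, that they are too far away.
    """
    q_to_pivot = [metric(query, p) for p in pivots]
    hits = []
    for obj, row in zip(objects, table):
        if any(abs(qp - op) > radius for qp, op in zip(q_to_pivot, row)):
            continue                      # pruned by a pivot lower bound
        if metric(query, obj) <= radius:  # verify the surviving candidates
            hits.append(obj)
    return hits


# Hypothetical data: points on a 6 x 6 grid with two corner pivots.
objects = [(x, y) for x in range(6) for y in range(6)]
pivots = [(0, 0), (5, 0)]
table = build_index(objects, pivots, math.dist)
hits = range_query((2.3, 3.1), 1.5, objects, pivots, table, math.dist)
```

Because the filter relies only on the metric postulates, the same code works with any distance function that satisfies them, for example an edit distance on strings.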

    Entity resolution (ER) identifies and links records that belong to the same real-world entities, where an entity refers to any real-world object. It is a primary task in data integration. Accurate and efficient ER substantially impacts various commercial, security, and scientific applications. Often, there are no unique identifiers for entities in datasets/databases that would make the ER task easy. Therefore, record matching depends on entity-identifying attributes and approximate matching techniques. Efficiently handling large-scale data remains an open research problem given the increasing volumes and velocities of modern data collections. Fast, scalable, real-time and approximate entity matching techniques that provide high-quality results are in high demand. This thesis proposes solutions to address the lack of test datasets and the demand for fast indexing algorithms in large-scale ER. The shortage of large-scale, real-world datasets with ground truth is a primary concern in developing and testing new ER algorithms. Usually, for many datasets, there is no information on the ground truth or ‘gold standard’ data that specifies whether two records correspond to the same entity or not. Moreover, obtaining test data for ER algorithms that use personal identifying keys (e.g., names, addresses) is difficult due to privacy and confidentiality issues. To address this challenge, we proposed a numerical simulation model that produces realistic large-scale data to test new methods when suitable public datasets are unavailable. One of the important findings of this work is the approximation of vectors that represent entity identification keys and their relationships, e.g., dissimilarities and errors. Indexing techniques reduce the search space and execution time in the ER process. 
Based on the idea of approximate vectors for entity identification keys, we proposed a fast indexing technique (Em-K indexing) suitable for real-time, approximate entity matching in large-scale ER. Our Em-K indexing method provides a quick and accurate block of candidate matches for a querying record by searching an existing reference database. All our solutions are metric-based. We transform metric or non-metric spaces to a lower-dimensional Euclidean space, known as configuration space, using multidimensional scaling (MDS). This thesis discusses how to modify MDS algorithms to solve various ER problems efficiently. We proposed highly efficient and scalable approximation methods that extend the MDS algorithm for large-scale datasets. We empirically demonstrate the improvements of our proposed approaches on several datasets with various parameter settings. The outcomes show that our methods can generate large-scale testing data, perform fast real-time and approximate entity matching, and effectively scale up the mapping capacity of MDS. Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 202
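
The mapping into a low-dimensional Euclidean configuration space can be illustrated with textbook classical MDS. The sketch below is not the thesis's modified or scaled-up MDS, and it is not Em-K indexing; it only shows the basic step of turning a symmetric dissimilarity matrix into coordinates. The function name `classical_mds`, the use of power iteration with deflation instead of a linear-algebra library, and the iteration count are choices made here for a dependency-free example.

```python
import math
import random


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed n objects in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of B.
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:              # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(x * y for x, y in zip(v, bv))   # Rayleigh quotient
        # Negative eigenvalues (non-Euclidean input) contribute nothing.
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Hypothetical input: pairwise distances between the corners of a 3 x 4 rectangle.
points = [(0, 0), (3, 0), (0, 4), (3, 4)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(dist)
```

For distances that really are Euclidean in two dimensions, as in this toy input, the recovered configuration reproduces the original distances up to rotation and reflection. The quadratic memory and the repeated matrix products are exactly what make plain MDS hard to apply to large datasets.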

    In this thesis, a novel sequential gene selection and classification (k-SS) method is proposed. The method is analogous to the classical non-linear stepwise variable selection (SVS) methods but, unlike any of the SVS methods, this new method uses the misclassification error rate (MER) as its search criterion for informative marker genes in any given microarray data. Here, the importance of any selected gene is determined by its marginal contribution to improving the prediction accuracy of the classification rule. The method continues to select more genes as long as the improvements brought into the decision models by the selected genes are considered significant by some established test criteria. Gene selection terminates when none of the remaining genes is capable of improving the prediction accuracy (lowering the MER) of the current model. Therefore, our approach seeks to select only the best combination of k marker genes that are most predictive of the biological samples in any given microarray data set. An important feature of our new k-SS method is that the size α used by its test is not arbitrarily fixed by the user, as is common in some of the classical SVS methods. Rather, the value of α at which the best prediction accuracy is achieved (or the best combination of genes is selected) is determined by cross-validation. The new k-SS classifier competes favourably with eight selected existing classification methods on eleven published microarray data sets. The k-SS classifier is very simple to apply and does not require any rigid assumption for its implementation. Another merit of this method lies in its ability to select only those genes that are of biological relevance to the existing cancer sub-groups in microarray data sets. Lastly, we proposed a new preliminary feature selection procedure that employs the cross-validated area under the ROC curve (CVAUC) for gene selection. 
This method is capable of removing all the irrelevant genes at the preliminary selection stage, before any standard classifier such as the k-SS method is employed on the remaining data set for final optimum gene selection and classification of mRNA samples. Unlike some other data pruning methods, the new method employs the sub-sampling technique of v-fold cross-validation to ensure consistency and efficiency of the selections made at the preliminary selection stage.
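
The stepwise search driven by the misclassification error rate can be sketched as a greedy forward selection. The example is a simplified stand-in rather than the k-SS method: it uses a nearest-centroid rule as the classifier, leave-one-out cross-validation to estimate the MER, and a plain "stop when the MER no longer decreases" rule in place of the significance test of size α described in the abstract. The names `loo_error` and `forward_select` are hypothetical.

```python
def loo_error(rows, labels, genes):
    """Leave-one-out misclassification error rate of a nearest-centroid rule."""
    errors = 0
    for i in range(len(rows)):
        sums, counts = {}, {}
        # Class centroids computed without the held-out sample i.
        for j, (row, lab) in enumerate(zip(rows, labels)):
            if j == i:
                continue
            acc = sums.setdefault(lab, [0.0] * len(genes))
            for k, g in enumerate(genes):
                acc[k] += row[g]
            counts[lab] = counts.get(lab, 0) + 1

        def sq_dist(lab):
            return sum((rows[i][g] - sums[lab][k] / counts[lab]) ** 2
                       for k, g in enumerate(genes))

        predicted = min(sorted(sums), key=sq_dist)
        errors += predicted != labels[i]
    return errors / len(rows)


def forward_select(rows, labels):
    """Greedily add the gene that lowers the MER most; stop when none does."""
    selected, best_err = [], 1.0
    remaining = set(range(len(rows[0])))
    while remaining:
        err, gene = min((loo_error(rows, labels, selected + [g]), g)
                        for g in remaining)
        if err >= best_err:               # no remaining gene improves the model
            break
        selected.append(gene)
        remaining.remove(gene)
        best_err = err
    return selected, best_err


# Hypothetical expression data: gene 0 separates the classes, gene 1 is noise.
rows = [[0.0, 5.0], [0.2, 1.0], [0.1, 3.0],
        [1.0, 5.0], [1.2, 1.0], [1.1, 3.0]]
labels = ["A", "A", "A", "B", "B", "B"]
selected, mer = forward_select(rows, labels)
```

On this toy data the search keeps only gene 0, since adding the noisy gene cannot lower an error rate that is already zero.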

    Spatial networks (e.g., road networks) are general graphs with spatial information (e.g., latitude/longitude) associated with the vertices and/or the edges of the graph. Techniques are presented for query processing on spatial networks that are based on the observed coherence between the spatial positions of the vertices and the shortest paths between them. This facilitates aggregation of the vertices into coherent regions that share vertices on the shortest paths between them. Using this observation, a framework, termed SILC, is introduced that precomputes and compactly encodes the N^2 shortest paths and network distances between every pair of vertices on a spatial network containing N vertices. The compactness of the shortest paths from a source vertex V is achieved by partitioning the destination vertices into subsets based on the identity of the first edge to them from V. The spatial coherence of these subsets is captured by using a quadtree representation whose dimension-reducing property enables the storage requirements of each subset to be reduced to be proportional to the perimeter of the spatially coherent regions, instead of to the number of vertices in the spatial network. In particular, experiments on a number of large road networks as well as a theoretical analysis have shown that the total storage for the shortest paths has been reduced from O(N^3) to O(N^1.5). In addition to SILC, another framework, termed PCP, is proposed that also takes advantage of the spatial coherence of the source vertices and makes use of the Well Separated Pair decomposition to further reduce the storage, under suitably defined conditions, to O(N). Using these frameworks, scalable algorithms are presented to implement a wide variety of operations such as nearest neighbor finding and distance joins on large datasets of locations residing on a spatial network. 
These frameworks essentially decouple the process of computing shortest paths from that of spatial query processing, and also decouple the domain of the participating objects from the domain of the vertices of the spatial network. This means that as long as the spatial network is unchanged, the algorithm and underlying representation of the shortest paths in the spatial network can be used with different sets of objects.
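
The partitioning step that SILC builds on, grouping destination vertices by the first edge on their shortest path from a source V, can be sketched with a single run of Dijkstra's algorithm. The sketch stops at the partition: the quadtree encoding of the resulting regions, which is where the storage savings come from, is not shown. The adjacency-list layout and the name `first_hop_partition` are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def first_hop_partition(graph, source):
    """Group destinations by the first vertex visited after `source`.

    `graph` maps each vertex to a list of (neighbour, edge_weight) pairs.
    Returns (shortest distances, {first_hop: [destinations]}).
    """
    dist = {}
    first_hop = {}
    heap = [(0.0, source, source)]        # (distance, vertex, first hop)
    while heap:
        d, u, hop = heapq.heappop(heap)
        if u in dist:
            continue                      # already settled via a shorter path
        dist[u] = d
        if u != source:
            first_hop[u] = hop
        for v, w in graph[u]:
            if v not in dist:
                # Leaving the source fixes the first hop; afterwards it is inherited.
                heapq.heappush(heap, (d + w, v, v if u == source else hop))
    groups = defaultdict(list)
    for dest, hop in first_hop.items():
        groups[hop].append(dest)
    return dist, dict(groups)


# Hypothetical road network with five vertices.
graph = {
    "s": [("a", 1.0), ("b", 1.0)],
    "a": [("s", 1.0), ("c", 1.0)],
    "b": [("s", 1.0), ("d", 1.0)],
    "c": [("a", 1.0), ("d", 5.0)],
    "d": [("b", 1.0), ("c", 5.0)],
}
dist, groups = first_hop_partition(graph, "s")
```

Once such a partition is stored for every source, a shortest path can be unrolled one edge at a time by looking up which group the destination falls in, moving to that neighbour, and repeating from there.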