
    Fisher networks: A principled approach to retrieval-based classification

    Due to the technological advances in the acquisition and processing of information, current data mining applications involve databases of sizes that would have been unthinkable just two decades ago. However, real-world datasets are often riddled with irrelevant variables that not only fail to generate any meaningful information about the process of interest, but may also obstruct the contribution of the truly informative features. Taking into account the relevance of the different measures available can make the difference between reaching an accurate reflection of the underlying truth and obtaining misleading results that lead to erroneous conclusions. Another important consideration in data analysis is the interpretability of the models used to fit the data. Performance must clearly be a key aspect in deciding which methodology to use, but it should not be the only one. Models with an obscure internal operation see their practical usefulness effectively diminished by the difficulty of understanding the reasoning behind their inferences, which makes them less appealing to users who are not familiar with their theoretical basis. This thesis proposes a novel framework for the visualisation and categorisation of data in classification contexts that tackles the two issues discussed above and provides an informative output of intuitive interpretation. The system is based on a Fisher information metric that automatically filters the contribution of variables depending on their relevance to the classification problem at hand, measured by their influence on the posterior class probabilities. Fisher distances can then be used to calculate rigorous problem-specific similarity measures, which can be grouped into a pairwise adjacency matrix, thus defining a network. 
Following this novel construction process results in a principled visualisation of the data, organised in communities that highlight the structure of the underlying class membership probabilities. Furthermore, the relational nature of the network can be used to reproduce the probabilistic predictions of the original estimates in a case-based approach, making them explainable by means of known cases in the dataset. The potential applications and usefulness of the framework are illustrated using several real-world datasets, giving examples of the typical output that the end user receives and how they can use it to learn more about the cases of interest as well as about the dataset as a whole.
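
    The construction described above can be sketched in a few lines. The following is an illustrative simplification, not the thesis's actual method: it uses the closed-form Fisher-Rao distance between posterior class-probability vectors (the geodesic distance on the probability simplex) as a stand-in for the problem-specific similarity, and a simple k-nearest-neighbour rule to form the adjacency matrix; the posterior estimates themselves are assumed to come from some upstream classifier.

```python
import numpy as np

def fisher_rao_distance(p, q):
    # Closed-form Fisher-Rao (geodesic) distance between two categorical
    # distributions on the probability simplex.
    inner = np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0)
    return 2.0 * np.arccos(inner)

def build_network(posteriors, k=1):
    # Pairwise Fisher-Rao distances between the samples' posterior
    # class-probability vectors, turned into a symmetric kNN adjacency.
    n = len(posteriors)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = fisher_rao_distance(posteriors[i], posteriors[j])
    adj = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(d[i])[1:k + 1]    # index 0 is the sample itself
        adj[i, nearest] = 1.0
    return np.maximum(adj, adj.T)              # symmetrise

# Posterior class probabilities, assumed to come from some classifier.
posteriors = np.array([
    [0.9, 0.1], [0.8, 0.2],    # two cases leaning towards class 0
    [0.2, 0.8], [0.1, 0.9],    # two cases leaning towards class 1
])
adj = build_network(posteriors, k=1)
```

    Cases with similar posterior profiles end up linked, so communities in the resulting network reflect class-membership structure, and a new case can be explained by the known cases it attaches to.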

    Feature selection and extraction in spatiotemporal traffic forecasting: a systematic literature review

    A spatiotemporal approach that simultaneously utilises both spatial and temporal relationships is gaining scientific interest in the field of traffic flow forecasting. Accurate identification of the spatiotemporal structure (dependencies amongst traffic flows in space and time) plays a critical role in modern traffic forecasting methodologies, and recent developments of data-driven feature selection and extraction methods allow the identification of complex relationships. This paper systematically reviews studies that apply feature selection and extraction methods for spatiotemporal traffic forecasting. The reviewed bibliographic database includes 211 publications and covers the period from early 1984 to March 2018. A synthesis of bibliographic sources clarifies the advantages and disadvantages of different feature selection and extraction methods for learning the spatiotemporal structure and discovers trends in their applications. We conclude that there is a clear need for development of comprehensive guidelines for selecting appropriate spatiotemporal feature selection and extraction methods for urban traffic forecasting. Document type: Article.
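
    As a minimal illustration of the kind of data-driven feature selection the reviewed studies apply, the sketch below builds lagged temporal features from a synthetic "upstream detector" and ranks them by absolute Pearson correlation with the target flow. The detectors, lag range, and correlation criterion are illustrative assumptions, not taken from any reviewed study.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
upstream = rng.normal(size=T)                  # flow at an upstream detector
target = np.roll(upstream, 3) + 0.1 * rng.normal(size=T)  # arrives 3 steps later
noise = rng.normal(size=T)                     # an unrelated detector

def lagged_correlations(series, target, max_lag=6):
    # Score each candidate lag feature series[t - lag] by its absolute
    # Pearson correlation with target[t].
    scores = {}
    for lag in range(1, max_lag + 1):
        scores[lag] = abs(np.corrcoef(series[:-lag], target[lag:])[0, 1])
    return scores

up_scores = lagged_correlations(upstream, target)
noise_scores = lagged_correlations(noise, target)
best_lag = max(up_scores, key=up_scores.get)   # recovers the 3-step travel time
```

    The selected lag identifies a genuine spatiotemporal dependency (travel time from the upstream detector), while all lags of the unrelated detector score near zero and would be filtered out.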

    Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities

    Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases using a ranking scheme that combines multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL is demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.
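
    The staged pipeline above can be caricatured in a few lines of NumPy. The sketch below is a loose toy analogue of CEDL, not its implementation: truncated SVD of a term-document matrix stands in for the co-embedding, nearest-seed assignment for the clustering stage, and a mean-difference score for the discriminant-based label selection; the documents and vocabulary are invented.

```python
import numpy as np

docs = [
    "network graph node edge network",
    "graph edge network node",
    "gene cell sequencing gene",
    "cell gene sequencing cell",
]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# Stage 1 (analogue): co-embed documents and terms in one shared space
# via truncated SVD of the term-document count matrix.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
doc_emb = U[:, :2] * S[:2]          # document coordinates
term_emb = Vt[:2].T * S[:2]         # term coordinates in the same space

# Stage 2 (analogue): assign each document to its nearest seed document.
seeds = doc_emb[[0, 2]]
labels = np.argmin(
    np.linalg.norm(doc_emb[:, None] - seeds[None], axis=2), axis=1)

# Stages 3-4 (analogue): label each cluster with its most discriminative
# term (largest gap between in-cluster and out-of-cluster mean counts).
def cluster_label(c):
    inside = X[labels == c].mean(axis=0)
    outside = X[labels != c].mean(axis=0)
    return vocab[int(np.argmax(inside - outside))]
```

    The point of the co-embedding is that documents and their candidate labels live in one space, so cluster structure and label relevance can be scored with the same geometry.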

    Mining Time-aware Actor-level Evolution Similarity for Link Prediction in Dynamic Network

    Topological evolution over time in a dynamic network triggers both the addition and deletion of actors and the links among them. A dynamic network can be represented as a time series of network snapshots where each snapshot represents the state of the network over an interval of time (for example, a minute, hour or day). The duration of each snapshot denotes the temporal scale/sliding window of the dynamic network and all the links within the duration of the window are aggregated together irrespective of their order in time. The inherent trade-off in selecting the timescale in analysing dynamic networks is that choosing a short temporal window may lead to chaotic changes in network topology and measures (for example, the actors' centrality measures and the average path length); however, choosing a long window may compromise the study and the investigation of network dynamics. Therefore, to facilitate the analysis and understand different patterns of actor-oriented evolutionary aspects, it is necessary to define an optimal window length (temporal duration) with which to sample a dynamic network. In addition to determining the optimal temporal duration, another key task for understanding the dynamics of evolving networks is being able to predict the likelihood of future links among pairs of actors given the existing states of link structure at present time. This phenomenon is known as the link prediction problem in network science. Instead of considering a static state of a network where the associated topology does not change, dynamic link prediction attempts to predict emerging links by considering different types of historical/temporal information, for example the different types of temporal evolutions experienced by the actors in a dynamic network due to the topological evolution over time, known as actor dynamicities. 
Although there has been some success in developing various methodologies and metrics for the purpose of dynamic link prediction, mining actor-oriented evolutions to address this problem has received little attention from the research community. In addition to this, the existing methodologies were developed without considering the sampling window size of the dynamic network, even though the sampling duration has a large impact on mining the network dynamics of an evolutionary network. Therefore, although the principal focus of this thesis is link prediction in dynamic networks, the optimal sampling window determination was also considered.
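
    The snapshot construction described above (aggregating all links inside a window, irrespective of their order in time) and a basic static link-prediction score can be sketched as follows; the event list, window lengths, and common-neighbours score are illustrative choices, not the thesis's methodology.

```python
from collections import defaultdict

# Timestamped links: (time, actor_u, actor_v).
events = [
    (0, "a", "b"), (1, "b", "c"), (2, "a", "c"),
    (5, "c", "d"), (6, "a", "d"),
]

def snapshots(events, window):
    # Aggregate every link falling inside a window of the given duration,
    # ignoring the order of links within each window.
    snaps = defaultdict(set)
    for t, u, v in events:
        snaps[t // window].add(frozenset((u, v)))
    return dict(snaps)

short = snapshots(events, window=2)    # fine-grained: sparse, choppy snapshots
long_ = snapshots(events, window=10)   # coarse: one aggregated snapshot

def common_neighbours(edges, u, v):
    # A basic static link-prediction score on an aggregated snapshot.
    nbrs = defaultdict(set)
    for e in edges:
        x, y = tuple(e)
        nbrs[x].add(y)
        nbrs[y].add(x)
    return len(nbrs[u] & nbrs[v])

score_bd = common_neighbours(long_[0], "b", "d")   # candidate future link
```

    The window length directly changes what the predictor sees: the 2-step windows each contain only one or two links, while the 10-step window aggregates all five, which is the trade-off between choppy and over-smoothed dynamics described above.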

    Saliency for Image Description and Retrieval

    We live in a world where we are surrounded by ever-increasing numbers of images. More often than not, these images have very little metadata by which they can be indexed and searched. In order to avoid information overload, techniques need to be developed to enable these image collections to be searched by their content. Much of the previous work on image retrieval has used global features such as colour and texture to describe the content of the image. However, these global features are insufficient to accurately describe the image content when different parts of the image have different characteristics. This thesis initially discusses how this problem can be circumvented by using salient interest regions to select the areas of the image that are most interesting and generate local descriptors to describe the image characteristics in that region. The thesis discusses a number of different saliency detectors that are suitable for robust retrieval purposes and performs a comparison between a number of these region detectors. The thesis then discusses how salient regions can be used for image retrieval using a number of techniques, but most importantly, two techniques inspired from the field of textual information retrieval. Using these robust retrieval techniques, a new paradigm in image retrieval is discussed, whereby the retrieval takes place on a mobile device using a query image captured by a built-in camera. This paradigm is demonstrated in the context of an art gallery, in which the device can be used to find more information about particular images. The final chapter of the thesis discusses some approaches to bridging the semantic gap in image retrieval. The chapter explores ways in which un-annotated image collections can be searched by keyword. Two techniques are discussed; the first explicitly attempts to automatically annotate the un-annotated images so that the automatically applied annotations can be used for searching. 
The second approach does not try to explicitly annotate images, but rather, through the use of linear algebra, it attempts to create a semantic space in which images and keywords are positioned such that images are close to the keywords that represent them within the space.
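
    The text-IR-inspired retrieval mentioned above can be illustrated with a bag-of-visual-words sketch: each image is reduced to a histogram of quantised local descriptors, weighted with tf-idf exactly as in textual retrieval, and ranked by cosine similarity. The histograms below are invented, and the saliency detection and descriptor quantisation steps are assumed to have happened upstream; the weighting formula is one common smoothed variant, not necessarily the one used in the thesis.

```python
import numpy as np

# Visual-word histograms per database image (assumed precomputed by a
# saliency detector plus descriptor quantiser upstream).
db = np.array([
    [5.0, 0.0, 1.0, 0.0],   # image 0
    [4.0, 1.0, 0.0, 0.0],   # image 1, similar content to the query
    [0.0, 0.0, 6.0, 5.0],   # image 2, different content
])
query = np.array([5.0, 1.0, 0.0, 0.0])

def tfidf_rank(db, query):
    # tf-idf weighting borrowed from textual IR, then cosine ranking.
    df = (db > 0).sum(axis=0)                     # document frequency per word
    idf = np.log((1 + len(db)) / (1 + df)) + 1.0  # smoothed idf
    def weight(v):
        v = v * idf
        return v / np.linalg.norm(v)
    q = weight(query)
    scores = np.array([weight(d) @ q for d in db])
    return np.argsort(-scores)                    # best match first

ranking = tfidf_rank(db, query)
```

    In the mobile-gallery scenario this is the whole retrieval step: the camera image becomes a query histogram, and the database images are returned in ranked order.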

    New approaches for unsupervised transcriptomic data analysis based on Dictionary learning

    The era of high-throughput data generation enables new access to biomolecular profiles and exploitation thereof. However, the analysis of such biomolecular data, for example, transcriptomic data, suffers from the so-called "curse of dimensionality". This occurs in the analysis of datasets with a significantly larger number of variables than data points. As a consequence, overfitting and unintentional learning of process-independent patterns can appear. This can lead to insignificant results in application. A common way of counteracting this problem is the application of dimension reduction methods and subsequent analysis of the resulting low-dimensional representation that has a smaller number of variables. In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concepts of Dictionary learning, which is an unsupervised dimension reduction approach. Unlike many dimension reduction approaches that are widely applied for transcriptomic data analysis, Dictionary learning does not impose constraints on the components that are to be derived. This allows for great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods. The result of sparse methods is a model with few non-zero coefficients, which is often preferred for its simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured. Indeed, transcriptomic data are particularly structured, owing, for example, to the connections between genes and pathways. Nonetheless, the application of Dictionary learning in medical data analysis has mainly been restricted to image analysis. Another advantage of Dictionary learning is that it is an interpretable approach. Interpretability is a necessity in biomolecular data analysis to gain a holistic understanding of the investigated processes. 
Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups for samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods that are widely applied in transcriptomic data analysis. Our methods achieve high performance and overall outperform the comparison methods.
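
    To make the Dictionary learning idea concrete, the sketch below alternates between an extreme 1-sparse coding step (each sample is explained by exactly one atom) and a dictionary update. This is a deliberately minimal stand-in for the methods developed in the thesis, and the toy "expression programmes" are invented; real Dictionary learning solvers allow several non-zero coefficients per sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "expression" data: each sample is a noisy, scaled copy of one of
# two hidden programmes (atoms) -- a stand-in for structured
# transcriptomic data with far fewer samples than a real dataset.
atom0 = np.array([1.0, 1.0, 0.0, 0.0])
atom1 = np.array([0.0, 0.0, 1.0, 1.0])
X = np.vstack([a * rng.uniform(1, 3) + 0.05 * rng.normal(size=4)
               for a in [atom0] * 10 + [atom1] * 10])

def dictionary_learning(X, n_iter=10):
    # Minimal alternating scheme with 1-sparse codes: each sample uses
    # exactly one atom (an extreme form of the sparsity constraint).
    D = X[[0, len(X) - 1]].copy()            # initialise atoms from two samples
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(n_iter):
        codes = X @ D.T                      # projection onto each atom
        assign = np.argmax(np.abs(codes), axis=1)
        for k in range(len(D)):              # dictionary update step
            members = X[assign == k]
            if len(members):
                D[k] = members.mean(axis=0)
                D[k] /= np.linalg.norm(D[k])
    return D, assign

D, assign = dictionary_learning(X)
```

    With 1-sparse codes the atom assignments directly induce sample subgroups, which is the link between the sparse representation and the subgroup-identification task described above.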

    Analysis of cellular heterogeneity in breast cancer by single cell sequencing

    Breast cancer is a complex heterogeneous 3D ecosystem. The heterogeneous composition of breast cancer determines disease progression and treatment responses. Triple receptor negative breast cancer (TNBC) is a distinct subtype with poor clinical outcomes. Deconvolution of spatially-regulated transcriptomic and microenvironmental drivers unique to TNBC offers the potential to reveal new therapeutic vulnerabilities. Single cell RNA sequencing (scRNA-seq) and spatial transcriptomic technologies were applied to three treatment-naive patient-derived breast cancer samples. New spatial transcriptomic and scRNA-seq experimental pipelines were established. The new technologies were successfully applied to clinical grade biopsy samples. Cellular heterogeneity within the epithelial and non-epithelial compartments was identified across the three samples. The heterogeneity identified is consistent with the published literature. Knowledge of the theoretical underpinnings of scRNA-seq analysis, along with the skills required for data analysis in a small patient cohort, was acquired during the DPhil. The application of algebraic topology, manifold learning and graph theory in evaluating and interpreting scRNA-seq data has been studied. The computational tools available for integrating spatial transcriptomics and scRNA-seq data were critically appraised. Future perspectives on approaches for multimodal integration were explored.