156 research outputs found

    Unsupervised discovery of relations for analysis of textual data in digital forensics

    This dissertation addresses the problem of analysing digital data in digital forensics. It will be shown that text mining methods can be adapted and applied to digital forensics to aid analysts to more quickly, efficiently and accurately analyse data to reveal truly useful information. Investigators who wish to utilise digital evidence must examine and organise the data to piece together events and facts of a crime. The difficulty with finding relevant information quickly using the current tools and methods is that these tools rely very heavily on background knowledge for query terms and do not fully utilise the content of the data. A novel framework in which to perform evidence discovery is proposed in order to reduce the quantity of data to be analysed, aid the analysts' exploration of the data and enhance the intelligibility of the presentation of the data. The framework combines information extraction techniques with visual exploration techniques to provide a novel approach to performing evidence discovery, in the form of an evidence discovery system. By utilising unrestricted, unsupervised information extraction techniques, the investigator does not require input queries or keywords for searching, thus enabling the investigator to analyse portions of the data that may not have been identified by keyword searches. The evidence discovery system produces text graphs of the most important concepts and associations extracted from the full text to establish ties between the concepts and provide an overview and general representation of the text. Through an interactive visual interface the investigator can explore the data to identify suspects, events and the relations between suspects. Two models are proposed for performing the relation extraction process of the evidence discovery framework. The first model takes a statistical approach to discovering relations based on co-occurrences of complex concepts. The second model utilises a linguistic approach using named entity extraction and information extraction patterns. A preliminary study was performed to assess the usefulness of a text mining approach to digital forensics compared with the traditional information retrieval approach. It was concluded that the novel approach to text analysis for evidence discovery presented in this dissertation is a viable and promising approach. The preliminary experiment showed that the results obtained from the evidence discovery system, using either of the relation extraction models, are sensible and useful. The approach advocated in this dissertation can therefore be successfully applied to the analysis of textual data for digital forensics. Dissertation (MSc), University of Pretoria, 2010.
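
    The first relation-extraction model above is statistical, based on co-occurrences of extracted concepts. As a rough illustration of that idea (not the dissertation's actual implementation), the following Python sketch scores candidate relations by how often two concepts appear in the same document; the concept lists and the threshold are hypothetical:

# Minimal sketch of co-occurrence-based relation discovery, assuming
# documents have already been reduced to lists of extracted concepts.
from collections import Counter
from itertools import combinations

def co_occurrence_relations(concept_lists, min_count=2):
    """Score candidate relations by how often two concepts co-occur."""
    pair_counts = Counter()
    for concepts in concept_lists:
        # Count each unordered concept pair once per document.
        for a, b in combinations(sorted(set(concepts)), 2):
            pair_counts[(a, b)] += 1
    return [(a, b, n) for (a, b), n in pair_counts.most_common() if n >= min_count]

docs = [["suspect", "meeting", "warehouse"],
        ["suspect", "warehouse", "payment"],
        ["payment", "account"]]
for a, b, n in co_occurrence_relations(docs):
    print(f"{a} -- {b} (co-occurs {n}x)")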

    Semantic Interaction in Web-based Retrieval Systems : Adopting Semantic Web Technologies and Social Networking Paradigms for Interacting with Semi-structured Web Data

    Existing web retrieval models for exploration of and interaction with web data do not take semantic information into account, nor do they allow for new forms of interaction that employ meaningful interaction and navigation metaphors in 2D/3D. This thesis researches means for introducing a semantic dimension into the search and exploration of web content to enable a significantly more positive user experience. To this end, an inherently dynamic view beyond single concepts and models from semantic information processing, information extraction and human-machine interaction is adopted. Essential tasks for semantic interaction such as semantic annotation, semantic mediation and semantic human-computer interaction were identified and elaborated for two general application scenarios in web retrieval: web-based question answering in a knowledge-based dialogue system, and semantic exploration of information spaces in 2D/3D.

    Spondyloarthritis mass cytometry immuno-monitoring: a proof of concept study in the tight-control and treat-to-target TiCoSpA trial

    Objective: Mass cytometry (MC) immunoprofiling allows high-parameter phenotyping of immune cells. We set out to investigate the potential of MC immuno-monitoring of axial spondyloarthritis (axSpA) patients enrolled in the Tight Control SpondyloArthritis (TiCoSpA) trial. Methods: Fresh, longitudinal PBMC samples (baseline, 24, and 48 weeks) from 9 early, untreated axSpA patients and 7 HLA-B27+ controls were analyzed using a 35-marker panel. Data were subjected to HSNE dimension reduction and Gaussian mean shift clustering (Cytosplore), followed by Cytofast analysis. A linear discriminant analysis (LDA) classifier, trained on the initial HSNE clustering, was applied to the week 24 and 48 samples. Results: Unsupervised analysis yielded a clear separation of baseline patients and controls, including a significant difference in 9 T cell, B cell, and monocyte clusters (cl), indicating disrupted immune homeostasis. The decrease in disease activity (ASDAS score; median 1.7, range 0.6-3.2) from baseline to week 48 matched significant changes over time in five clusters: cl10 CD4 T naive cells (median 4.7 to 0.02%), cl37 CD4 Tem cells (median 0.13 to 8.28%), cl8 CD4 Tcm cells (median 3.2 to 0.02%), cl39 B cells (median 0.12 to 2.56%), and cl5 CD38+ B cells (median 2.52 to 0.64%), all statistically significant.
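
    The LDA step can be pictured with a generic scikit-learn sketch: a classifier trained on the baseline cluster labels assigns cells from later visits to the same clusters. This is only an illustration of the technique named in the abstract, not the trial's actual Cytosplore/Cytofast pipeline, and the arrays below are synthetic stand-ins:

# Illustrative sketch: train a linear discriminant classifier on baseline
# cluster labels, then map week-24/48 cells onto the same clusters.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
baseline_markers = rng.normal(size=(1000, 35))     # cells x 35 markers
baseline_clusters = rng.integers(0, 9, size=1000)  # labels from HSNE clustering

lda = LinearDiscriminantAnalysis()
lda.fit(baseline_markers, baseline_clusters)

week48_markers = rng.normal(size=(800, 35))
week48_clusters = lda.predict(week48_markers)  # assign new cells to baseline clusters
print(np.bincount(week48_clusters))            # cluster sizes at week 48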

    Collaborative Knowledge Visualisation for Cross-Community Knowledge Exchange

    The notion of communities as informal social networks based on shared interests or common practices has been increasingly used as an important unit of analysis when considering the processes of cooperative creation and sharing of knowledge. While knowledge exchange within communities has been extensively researched, several studies have observed the importance of cross-community knowledge exchange for the creation of new knowledge and innovation in knowledge-intensive organizations. In knowledge management especially, a critical problem has become the need to support cooperation and the exchange of knowledge between different communities with highly specialized expertise and activities. Though several studies discuss the importance and difficulties of knowledge sharing across community boundaries, the development of technological support incorporating these findings has been little addressed. This work presents an approach to supporting cross-community knowledge exchange based on using knowledge visualisation for facilitating information access in unfamiliar community domains. The theoretical grounding and practical relevance of the proposed approach are ensured by defining a requirements model that integrates theoretical frameworks for cross-community knowledge exchange with practical needs of typical knowledge management processes and sensemaking tasks in information access in unfamiliar domains. This synthesis suggests that visualising knowledge structures of communities and supporting the discovery of relationships between them during access to community spaces could provide valuable support for cross-community discovery and sharing of knowledge. This is the main hypothesis investigated in this thesis. Accordingly, a novel method is developed for eliciting and visualising implicit knowledge structures of individuals and communities in the form of dynamic knowledge maps that make the elicited knowledge usable for semantic exploration and navigation of community spaces. The method allows unobtrusive construction of personal and community knowledge maps based on user interaction with information and their use for dynamic classification of information from a specific point of view. The visualisation model combines Document Maps presenting main topics, document clusters and relationships between knowledge reflected in community spaces with Concept Maps visualising personal and shared conceptual structures of community members. The technical realization integrates Kohonen’s self-organizing maps with extraction of word categories from texts, collaborative indexing and personalised classification based on user-induced templates. This is accompanied by intuitive visualisation and interaction with complex information spaces based on multi-view navigation of document landscapes and concept networks. The developed method is prototypically implemented in the form of an application framework, a concrete system and a visual information interface for multi-perspective access to community information spaces, the Knowledge Explorer. The application framework implements services for generating and using personal and community knowledge maps to support explicit and implicit knowledge exchange between members of different communities. The Knowledge Explorer allows simultaneous visualisation of different personal and community knowledge structures and enables their use for structuring, exploring and navigating community information spaces from different points of view.
The empirical evaluation in a comparative laboratory study confirms the adequacy of the developed solutions with respect to specific requirements of the cross-community problem and demonstrates much better quality of knowledge access compared to a standard information-seeking reference system. The developed evaluation framework and operative measures for quality of knowledge access in cross-community contexts also provide a theoretically grounded and practically feasible method for further developing and evaluating new solutions addressing this important but little-investigated problem.
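
    The document-map component rests on Kohonen's self-organizing maps. A minimal sketch of that building block (not the Knowledge Explorer itself), assuming the minisom package and TF-IDF document vectors, trains a small SOM and reports the grid cell each document lands in:

# Hedged sketch of a SOM-based "document map": nearby grid cells hold
# topically similar documents.
from minisom import MiniSom
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["knowledge sharing across communities",
        "visualising community knowledge structures",
        "self-organizing maps for document landscapes"]
X = TfidfVectorizer().fit_transform(docs).toarray()

som = MiniSom(4, 4, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=1)
som.train_random(X, 500)
for doc, vec in zip(docs, X):
    print(som.winner(vec), "<-", doc)  # grid cell for each document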

    Analysis of large-scale molecular biological data using self-organizing maps

    Modern high-throughput technologies such as microarrays, next-generation sequencing and mass spectrometry provide huge amounts of data per measurement and challenge traditional analyses. New strategies for data processing, visualization and functional analysis are indispensable. This thesis presents an approach which applies a machine learning technique known as self-organizing maps (SOMs). SOMs enable a parallel sample- and feature-centered view of molecular phenotypes combined with strong visualization and second-level analysis capabilities. We developed a comprehensive analysis and visualization pipeline based on SOMs. The unsupervised SOM mapping projects the initially high number of features, such as gene expression profiles, to meta-feature clusters of similar and hence potentially co-regulated single features. This reduction of dimensionality is attained by re-weighting the primary information and, in contrast to simple filtering approaches, does not entail a loss of primary information. The meta-data provided by the SOM algorithm is visualized in terms of intuitive mosaic portraits. Sample-specific and common properties shared between samples emerge as a handful of localized spots in the portraits collecting groups of co-regulated and co-expressed meta-features. These characteristic color patterns reflect the data landscape of each sample and promote immediate identification of (meta-)features of interest. It will be demonstrated that SOM portraits transform large and heterogeneous sets of molecular biological data into an atlas of sample-specific texture maps which can be directly compared in terms of similarities and dissimilarities. Spot-clusters of correlated meta-features can be extracted from the SOM portraits in a subsequent step of aggregation. This spot-clustering effectively enables reduction of the dimensionality of the data in two consecutive steps, towards a handful of signature modules, in an unsupervised fashion. Furthermore, we demonstrate that analysis techniques provide enhanced resolution if applied to the meta-features. The improved discrimination power of meta-features in downstream analyses such as hierarchical clustering, independent component analysis or pairwise correlation analysis is ascribed to essentially two facts: firstly, the set of meta-features better represents the diversity of patterns and modes inherent in the data; secondly, it possesses better signal-to-noise characteristics than a comparable collection of single features. In addition to the pattern-driven feature selection in the SOM portraits, we apply statistical measures to detect significantly differential features between sample classes. The implementation of scoring measures supplements the basic SOM algorithm. Further, two variants of functional enrichment analyses are introduced which link sample-specific patterns of the meta-feature landscape with biological knowledge and support functional interpretation of the data based on the ‘guilt by association’ principle. Finally, case studies selected from different ‘OMIC’ realms are presented in this thesis. In particular, molecular phenotype data derived from expression microarrays (mRNA, miRNA), sequencing (DNA methylation, histone modification patterns) or mass spectrometry (proteome), and also genotype data (SNP-microarrays), are analyzed.
It is shown that the SOM analysis pipeline offers strong application capabilities and covers a broad range of potential purposes, ranging from time-series and treatment-vs.-control experiments to the discrimination of samples according to genotypic, phenotypic or taxonomic classifications.
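
    A minimal sketch of the portraiting idea, under simplifying assumptions (random data standing in for expression profiles, the minisom package in place of the thesis's own pipeline): training a SOM on genes, with samples as feature dimensions, turns each SOM unit into a meta-gene, and one sample's portrait is its slice of the unit weights:

# Minimal SOM-portrait sketch: each of the 20x20 units is a "meta-gene";
# one sample's mosaic portrait is that sample's component of the weights.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
expression = rng.normal(size=(2000, 12))  # 2000 genes x 12 samples

grid = 20  # 20x20 map -> 400 meta-genes
som = MiniSom(grid, grid, expression.shape[1], sigma=2.0,
              learning_rate=0.5, random_seed=1)
som.train_random(expression, 5000)

weights = som.get_weights()           # shape (20, 20, 12)
portrait_sample0 = weights[:, :, 0]   # mosaic portrait of the first sample
print(portrait_sample0.shape)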

    Challenges and prospects of spatial machine learning

    The main objective of this thesis is to improve the usefulness of spatial machine learning for the spatial sciences and to allow its unused potential to be exploited. To achieve this objective, this thesis addresses several important but distinct challenges which spatial machine learning is facing. These are the modeling of spatial autocorrelation and spatial heterogeneity, the selection of an appropriate model for a given spatial problem, and the understanding of complex spatial machine learning models.
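
    As background to the first challenge, spatial autocorrelation is conventionally quantified with statistics such as Moran's I; the sketch below computes the global statistic for a toy arrangement of locations (an illustrative measure, not necessarily the diagnostic used in the thesis):

# Global Moran's I: positive values mean similar values cluster in space.
import numpy as np

def morans_i(values, W):
    """Moran's I for observations `values` and spatial weight matrix W."""
    z = values - values.mean()
    s0 = W.sum()
    n = len(values)
    return (n / s0) * (z @ W @ z) / (z @ z)

# 4 locations on a line; weight 1 for adjacent neighbours, 0 otherwise
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0, 4.0])  # strongly autocorrelated trend
print(morans_i(x, W))               # positive -> spatial autocorrelation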

    Big-Data Science in Porous Materials: Materials Genomics and Machine Learning

    By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal-organic frameworks (MOFs). At present, we have libraries of over ten thousand synthesized materials and millions of in-silico predicted materials. The fact that we have so many materials opens many exciting avenues to tailor-make a material that is optimal for a given application. However, from an experimental and computational point of view we simply have too many materials to screen using brute-force techniques. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We emphasize the importance of data collection, methods to augment small data sets, and how to select appropriate training sets. An important part of this review is the different approaches used to represent these materials in feature space. The review also includes a general overview of the different ML techniques, but as most applications in porous materials use supervised ML, our review focuses on the different approaches for supervised ML. In particular, we review the different methods to optimize the ML process and how to quantify the performance of the different methods. In the second part, we review how the different approaches of ML have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. This range illustrates the large variety of topics that can be studied with big-data science. Given the increasing interest of the scientific community in ML, we expect this list to rapidly expand in the coming years.
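
    A generic sketch of the supervised-ML workflow such screening studies follow (featurize, split, cross-validate, report a held-out metric) is shown below; the descriptors and target are random stand-ins for real MOF features and adsorption properties:

# Hedged sketch of a supervised screening workflow, not any specific
# study's pipeline: synthetic descriptors and target values throughout.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))           # 8 structural descriptors per MOF
y = X[:, 0] * 2 + rng.normal(size=500)  # synthetic target property

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2").mean())  # model selection
model.fit(X_tr, y_tr)
print(mean_absolute_error(y_te, model.predict(X_te)))  # held-out performance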

    Text Similarity Between Concepts Extracted from Source Code and Documentation

    Context: Constant evolution in software systems often results in their documentation losing sync with the content of the source code. The traceability research field has long aimed to recover links between code and documentation when the two fall out of sync. Objective: The aim of this paper is to compare the concepts contained within the source code of a system with those extracted from its documentation, in order to detect how similar these two sets are. If vastly different, the difference between the two sets might indicate a considerable ageing of the documentation, and a need to update it. Methods: In this paper we reduce the source code of 50 software systems to a set of key terms, each containing the concepts of one of the systems sampled. At the same time, we reduce the documentation of each system to another set of key terms. We then use four different approaches for set comparison to detect how similar the sets are. Results: Using the well-known Jaccard index as the benchmark for the comparisons, we discovered that the cosine distance has excellent comparative power, although results depend on the pre-training of the machine learning model. In particular, the SpaCy and FastText embeddings offer similarity scores of up to 80% and 90%, respectively. Conclusion: For most of the sampled systems, the source code and the documentation tend to contain very similar concepts. Given the accuracy of one pre-trained model (e.g., FastText), it also becomes evident that a few systems show a measurable drift between the concepts contained in the documentation and in the source code.
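
    The two benchmark comparisons can be sketched directly: the Jaccard index on raw term sets, and cosine similarity on averaged word vectors. The embedding table below is a random stand-in for a pre-trained model such as FastText or spaCy vectors, so only the mechanics, not the reported scores, are reproduced:

# Jaccard on term sets vs. cosine on averaged embeddings (stand-in vectors).
import numpy as np

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def cosine(u, v) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

code_terms = {"parser", "token", "buffer"}
doc_terms = {"parser", "token", "stream"}
print(jaccard(code_terms, doc_terms))  # 0.5: 2 shared of 4 total terms

rng = np.random.default_rng(0)
embed = {t: rng.normal(size=50) for t in sorted(code_terms | doc_terms)}  # stand-in
u = np.mean([embed[t] for t in code_terms], axis=0)
v = np.mean([embed[t] for t in doc_terms], axis=0)
print(cosine(u, v))  # embedding-based similarity of the two sets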

    Deep Clustering and Deep Network Compression

    The use of deep learning has grown increasingly in recent years, thereby becoming a much-discussed topic across a diverse range of fields, especially in computer vision, text mining, and speech recognition. Deep learning methods have proven to be robust in representation learning and have attained extraordinary achievements. Their success is primarily due to the ability of deep learning to discover and automatically learn feature representations by mapping input data into abstract and composite representations in a latent space. Deep learning’s ability to deal with high-level representations from data has inspired us to make use of learned representations, aiming to enhance unsupervised clustering and evaluate the characteristic strength of internal representations to compress and accelerate deep neural networks.
Traditional clustering algorithms attain a limited performance as the dimensionality increases. Therefore, the ability to extract high-level representations provides beneficial components that can support such clustering algorithms. In this work, we first present DeepCluster, a clustering approach embedded in a deep convolutional auto-encoder. We introduce two clustering methods, namely DCAE-Kmeans and DCAE-GMM. DeepCluster groups data points into their respective clusters in the latent space by simultaneously optimizing the clustering objective and the DCAE objective in a joint cost function, producing stable representations appropriate for the clustering process. Both qualitative and quantitative evaluations of the proposed methods are reported, showing the efficiency of deep clustering on several public datasets in comparison to the previous state-of-the-art methods.
Following this, we propose a new version of the DeepCluster model to include varying degrees of discriminative power. This introduces a mechanism which enables the imposition of regularization techniques and the involvement of a supervision component. The key idea of our approach is to distinguish the discriminatory power of numerous structures when searching for a compact structure to form robust clusters. The effectiveness of injecting various levels of discriminatory power into the learning process is investigated alongside the exploration and analytical study of the discriminatory power obtained through the use of two discriminative attributes: data-driven discriminative attributes with the support of regularization techniques, and supervision discriminative attributes with the support of the supervision component. An evaluation is provided on four different datasets.
The use of neural networks in various applications is accompanied by a dramatic increase in computational costs and memory requirements. Making use of the characteristic strength of learned representations, we propose an iterative pruning method that simultaneously identifies the critical neurons and prunes the model during training without involving any pre-training or fine-tuning procedures. We introduce a majority voting technique to compare the activation values among neurons and assign a voting score to evaluate their importance quantitatively. This mechanism effectively reduces model complexity by eliminating the less influential neurons and aims to determine a subset of the whole model that can represent the reference model with much fewer parameters within the training process. Empirically, we demonstrate that our pruning method is robust across various scenarios, including fully-connected networks (FCNs), sparsely-connected networks (SCNs), and convolutional neural networks (CNNs), using two public datasets.
Moreover, we also propose a novel framework to measure the importance of individual hidden units by computing a measure of relevance to identify the most critical filters and prune them to compress and accelerate CNNs. Unlike existing methods, we introduce the use of the activation of feature maps to detect valuable information and the essential semantic parts, with the aim of evaluating the importance of feature maps, inspired by novel neural network interpretability. A majority voting technique based on the degree of alignment between a semantic concept and individual hidden unit representations is utilized to evaluate feature maps’ importance quantitatively. We also propose a simple yet effective method to estimate new convolution kernels based on the remaining crucial channels to accomplish effective CNN compression. Experimental results show the effectiveness of our filter selection criteria, which outperforms the state-of-the-art baselines.
To conclude, we present a comprehensive, detailed review of time-series data analysis, with emphasis on deep time-series clustering (DTSC), and a founding contribution to the area of applying deep clustering to time-series data by presenting the first case study in the context of movement behavior clustering utilizing the DeepCluster method. The results are promising, showing that the latent space encodes sufficient patterns to facilitate accurate clustering of movement behaviors. Finally, we identify the state of the art and present an outlook on this important field of DTSC from five important perspectives.
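
    The activation-based voting idea can be sketched as follows, with illustrative thresholds rather than the thesis's exact scoring rule: each unit receives one vote per sample whose activation exceeds that sample's median, and the lowest-voted units are pruned:

# Sketch of activation-based majority voting for pruning hidden units.
import numpy as np

def voting_scores(activations):
    """activations: (batch, units). One vote per sample for each unit
    whose activation exceeds that sample's median activation."""
    medians = np.median(activations, axis=1, keepdims=True)
    return (activations > medians).sum(axis=0)

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=(256, 64)))   # batch of 256, 64 hidden units
scores = voting_scores(acts)
keep = scores >= np.quantile(scores, 0.25)  # prune the bottom 25% of units
print(f"keeping {keep.sum()} of {len(keep)} units")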

    Neuroinformatics in Functional Neuroimaging

    This Ph.D. thesis proposes methods for information retrieval in functional neuroimaging through automatic computerized authority identification, and searching and cleaning in a neuroscience database. Authorities are found through cocitation analysis of the citation pattern among scientific articles. Based on data from a single scientific journal it is shown that multivariate analyses are able to determine group structure that is interpretable as particular “known” subgroups in functional neuroimaging. Methods for text analysis are suggested that use a combination of content and links, in the form of the terms in scientific documents and scientific citations, respectively. These include context-sensitive author ranking and automatic labeling of axes and groups in connection with multivariate analyses of link data. Talairach foci from the BrainMapℱ database are modeled with conditional probability density models useful for exploratory functional volume modeling. A further application is shown with conditional outlier detection, where abnormal entries in the BrainMapℱ database are spotted using kernel density modeling and the redundancy between anatomical labels and spatial Talairach coordinates. This represents a combination of simple term and spatial modeling. The specific outliers found in the BrainMapℱ database included, among others, entry errors, errors in the articles and unusual terminology.
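
    The outlier-detection step can be illustrated with a generic kernel-density sketch (synthetic coordinates, and scikit-learn's KernelDensity rather than the thesis's own modeling): fit a density to the foci reported under one anatomical label and flag low-density entries as possible errors:

# KDE-based flagging of suspicious database entries (synthetic data).
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
coords = rng.normal(loc=[40, -20, 50], scale=5, size=(100, 3))  # one label's foci
coords[0] = [-40, 60, -30]  # plant a suspicious entry

kde = KernelDensity(bandwidth=5.0).fit(coords)
log_dens = kde.score_samples(coords)
outliers = np.where(log_dens < np.quantile(log_dens, 0.02))[0]
print("possible entry errors at rows:", outliers)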
    • 
