    Graph-based Representation for Sentence Similarity Measure : A Comparative Analysis

    Textual data are a rich source of knowledge; hence, sentence comparison has become one of the important tasks in text mining. Most previous work on text comparison is performed at the document level, and research suggests that comparing text at the sentence level is a non-trivial problem. One reason is that two sentences can convey the same meaning with totally dissimilar words. This paper presents the results of a comparative analysis of three representation schemes, i.e. term frequency-inverse document frequency, Latent Semantic Analysis, and graph-based representation, using three similarity measures, i.e. cosine, Dice coefficient, and Jaccard similarity, to compare the similarity of sentences. Results reveal that the graph-based representation and the Jaccard similarity measure outperform the others in terms of precision, recall, and F-measure.
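    For orientation, here is a minimal sketch (not the paper's code) of the three similarity measures named above, applied to plain token sets and term-frequency vectors rather than the paper's TF-IDF, LSA, or graph representations:

    # Minimal sketch: the three similarity measures over simple tokenizations.
    from collections import Counter
    import math

    def tokens(sentence):
        return sentence.lower().split()

    def jaccard(a, b):
        a, b = set(tokens(a)), set(tokens(b))
        return len(a & b) / len(a | b) if a | b else 0.0

    def dice(a, b):
        a, b = set(tokens(a)), set(tokens(b))
        return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

    def cosine(a, b):
        # Cosine over raw term-frequency vectors.
        va, vb = Counter(tokens(a)), Counter(tokens(b))
        dot = sum(va[t] * vb[t] for t in va)
        norm = (math.sqrt(sum(v * v for v in va.values()))
                * math.sqrt(sum(v * v for v in vb.values())))
        return dot / norm if norm else 0.0

    s1 = "The cat sat on the mat"
    s2 = "A cat was sitting on the mat"
    print(jaccard(s1, s2), dice(s1, s2), cosine(s1, s2))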

    Encapsulating and representing the knowledge on the evaluation of an engineering system

    This paper proposes a cross-disciplinary methodology for a fundamental question in product development: how can the innovation patterns during the evolution of an engineering system (ES) be encapsulated, so that they can later be mined through data mining analysis methods? Reverse engineering answers the question of which components a developed engineering system consists of, and how the components interact to make the working product. TRIZ answers the question of which problem-solving principles can be, or have been, employed in developing that system, in comparison to its earlier versions or with respect to similar systems. While these two methodologies have been very popular, to the best of our knowledge there does not yet exist a methodology that reverse-engineers, encapsulates, and represents the information regarding the complete product development process in abstract terms. This paper suggests such a methodology, which consists of mathematical formalism, graph visualization, and database representation. The proposed approach is demonstrated by analyzing the design and development process for a prototype wrist-rehabilitation robot.
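    As a hypothetical illustration of the graph representation the paper proposes, the snippet below encodes a few components of an engineering system and their interactions as a directed graph; the component names and attributes are invented for the example, not taken from the paper:

    # Invented sketch: an engineering system as a graph of components and interactions.
    import networkx as nx

    es = nx.DiGraph(name="wrist-rehabilitation robot (v1)")
    es.add_node("motor", kind="actuator")
    es.add_node("gearbox", kind="transmission")
    es.add_node("wrist_cuff", kind="interface")
    es.add_edge("motor", "gearbox", interaction="torque transfer")
    es.add_edge("gearbox", "wrist_cuff", interaction="rotation")

    # A later design revision can be stored as another graph; diffing the
    # node and edge sets exposes the innovation pattern between versions.
    for u, v, d in es.edges(data=True):
        print(f"{u} -> {v}: {d['interaction']}")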

    Generic Architecture for Predictive Computational Modelling with Application to Financial Data Analysis: Integration of Semantic Approach and Machine Learning

    The PhD thesis introduces a Generic Architecture for Predictive Computational Modelling (PCM) capable of automating analytical conclusions about quantitative data structured as a data frame. The model involves heterogeneous data mining based on a semantic approach, graph-based methods (ontology, knowledge graphs, graph databases), and advanced machine learning methods. The main focus of my research is data pre-processing aimed at a more efficient selection of input features for the computational model. Since the model I propose is generic, it can be applied to data mining of any quantitative dataset (containing two-dimensional, size-mutable, heterogeneous tabular data); however, it is best suited to highly interconnected data. To adapt this generic model to a specific use case, an ontology is needed as the formal conceptual representation of the relevant domain knowledge. I chose financial/market data for my use cases. In practical experiments, the effectiveness of the PCM model was evaluated on UK companies' financial risk analysis and FTSE100 market index forecasting. The tests confirmed that the PCM model produces more accurate outcomes than stand-alone traditional machine learning methods. By critically evaluating this architecture, I demonstrated its validity and suggested directions for future research.
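    A loose, invented sketch of the central idea, enriching a tabular data frame with features derived from a knowledge graph before fitting a standard model; the columns, toy graph, and model choice below are placeholders, not the PCM architecture itself:

    # Invented sketch: graph-derived features added to a data frame before modelling.
    import networkx as nx
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.DataFrame({
        "company": ["A", "B", "C"],
        "revenue": [1.2, 0.4, 2.1],
        "risk":    [0, 1, 0],          # target variable
    })

    kg = nx.Graph()                     # toy knowledge graph of inter-company links
    kg.add_edges_from([("A", "B"), ("B", "C")])

    # Graph-derived feature: how interconnected each company is.
    df["degree"] = df["company"].map(dict(kg.degree()))

    X, y = df[["revenue", "degree"]], df["risk"]
    model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
    print(model.predict(X))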

    Real-time analytics for complex structure data

    University of Technology Sydney. Faculty of Engineering and Information Technology.

    The advancement of data acquisition and analysis technology has resulted in many real-world data being dynamic and containing rich content and structured information. With the fast development of information technology, many current real-world data constantly undergo dynamic changes, such as new instances, new nodes and edges, and modifications to node content. Unlike traditional data, which are represented as feature vectors, data with complex relationships are often represented as graphs that denote the content of the data entries and their structural relationships, where instances (nodes) are not only characterized by their content but are also subject to dependency relationships. In addition, real-time availability is one of the outstanding features of today's data. Real-time analytics is dynamic analysis and reporting based on data entered into a system before the actual time of use; it emphasizes deriving immediate knowledge from dynamic data sources, such as data streams, so knowledge discovery and pattern mining now face complex, dynamic data sources. However, how to combine structure information and node content information for accurate and real-time data mining remains a major challenge. Accordingly, this thesis focuses on real-time analytics for complex structure data. We explore instance correlation in complex structure data and utilize it to make mining tasks more accurate and applicable. Specifically, our objective is to combine node correlation with node content for three tasks: (1) graph stream classification, (2) super-graph classification and clustering, and (3) streaming network node classification.

    Understanding the role of structured patterns for graph classification: The thesis first reviews existing work on data mining from a complex-structure perspective. We then propose a graph factorization-based fine-grained representation model, whose main objective is to use linear combinations of a set of discriminative cliques to represent graphs for learning. The optimization-oriented factorization approach ensures minimum information loss for graph representation and also avoids the expensive sub-graph isomorphism validation process. Based on this idea, we propose a novel framework for fast graph stream classification.

    A new structure data classification algorithm: The second method introduces a new super-graph classification and clustering problem. Due to the inherent complexity of the structure representation, existing graph classification methods cannot be applied to super-graph classification. In the thesis, we propose a weighted random walk kernel that calculates the similarity between two super-graphs by assessing (a) the similarity between super-nodes of the super-graphs and (b) the common walks of the super-graphs. Our key contributions are: (1) a new super-node and super-graph structure that enriches existing graph representations for real-world applications; (2) a weighted random walk kernel considering node and structure similarities between graphs; (3) a mixed similarity considering the structured content inside super-nodes and the structural dependency between super-nodes; and (4) an effective kernel-based super-graph classification method with a sound theoretical basis. Empirical studies show that the proposed methods significantly outperform the state-of-the-art methods.
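    For concreteness, below is a simplified, generic random-walk kernel that counts common walks via the direct product of two graphs, truncated at walk length K. It is not the thesis's weighted variant: the 0/1 node-compatibility implicit in the product construction would be replaced there by a super-node similarity weighting:

    # Generic truncated random-walk kernel over the direct product graph.
    import numpy as np
    import networkx as nx

    def product_adjacency(g1, g2):
        nodes = [(u, v) for u in g1 for v in g2]
        idx = {p: i for i, p in enumerate(nodes)}
        A = np.zeros((len(nodes), len(nodes)))
        for u1, v1 in nodes:
            for u2, v2 in nodes:
                # An edge in the product graph means both graphs step together.
                if g1.has_edge(u1, u2) and g2.has_edge(v1, v2):
                    A[idx[(u1, v1)], idx[(u2, v2)]] = 1.0
        return A

    def walk_kernel(g1, g2, lam=0.1, K=4):
        A = product_adjacency(g1, g2)
        total, Ak = 0.0, np.eye(len(A))
        for k in range(1, K + 1):
            Ak = Ak @ A
            total += (lam ** k) * Ak.sum()   # decayed count of common walks of length k
        return total

    g1, g2 = nx.cycle_graph(4), nx.path_graph(4)
    print(walk_kernel(g1, g2))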
    Real-time analytics framework for dynamic complex structure data: For streaming networks, the essential challenge is to properly capture the dynamic evolution of the node content and node interactions in order to support node classification. While streaming networks evolve dynamically, over a short temporal period a subset of salient features is essentially tied to the network content and structures, and can therefore be used to characterize the network for classification. To achieve this goal, we propose to carry out streaming network feature selection (SNF) on the network and use the selected features as a gauge to classify unlabeled nodes. A Laplacian-based quality criterion is proposed to guide the node classification, where the Laplacian matrix is generated from node labels and network topology structures. Node classification is achieved by finding the class label that yields the minimal gauging value with respect to the selected features. By frequently updating the features selected from the network, node classification can quickly adapt to changes in the network for maximal performance gain. Experiments and comparisons on real-world networks demonstrate that SNOC is able to capture dynamics in the network structures and node content, and outperforms baseline approaches with significant performance gain.
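    A small sketch of a Laplacian smoothness criterion of the kind described above: candidate labels for an unlabeled node are scored by y'Ly, and the label giving the smallest value (the smoothest assignment over the topology) wins. This illustrates only the general criterion, not the thesis's SNOC method:

    # Label an unknown node by minimizing the Laplacian smoothness y'Ly.
    import numpy as np
    import networkx as nx

    g = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)])
    labels = {0: +1, 1: +1, 2: -1}       # node 3 is unlabeled
    L = nx.laplacian_matrix(g).toarray()

    def smoothness(candidate):
        y = np.array([labels.get(n, candidate if n == 3 else 0) for n in g.nodes()])
        return y @ L @ y                 # low value = labels vary little across edges

    best = min((+1, -1), key=smoothness)
    print("predicted label for node 3:", best)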

    Representing Semantified Biological Assays in the Open Research Knowledge Graph

    In the biotechnology and biomedical domains, recent text mining efforts advocate machine-interpretable, and preferably semantified, documentation formats for laboratory processes. These include wet-lab protocols, (in)organic materials synthesis reactions, genetic manipulations, and procedures for faster computer-mediated analysis and prediction. Herein, we present our work on the representation of semantified bioassays in the Open Research Knowledge Graph (ORKG). In particular, we describe a work-in-progress semantification system to generate, automatically and quickly, the critical mass of semantified bioassay data needed to foster a consistent user audience that adopts the ORKG for recording their bioassays and to facilitate the organisation of research according to FAIR principles. (In Proceedings of The 22nd International Conference on Asia-Pacific Digital Libraries.)
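    As a hypothetical illustration of what a semantified bioassay statement can look like, the snippet below records a few assay properties as RDF triples with rdflib; the URIs and vocabulary are invented placeholders, not the ORKG's actual schema or API:

    # Invented sketch: a bioassay expressed as machine-interpretable RDF triples.
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/bioassay/")
    g = Graph()

    assay = EX["assay-001"]
    g.add((assay, RDF.type, EX.BioAssay))
    g.add((assay, EX.targetProtein, Literal("EGFR")))
    g.add((assay, EX.assayFormat, Literal("cell-based")))

    print(g.serialize(format="turtle"))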