Graph-based Representation for Sentence Similarity Measure : A Comparative Analysis
Textual data are a rich source of knowledge; hence, sentence comparison has become one of the important tasks in text mining. Most previous work on text comparison is performed at the document level, and research suggests that comparing text at the sentence level is a non-trivial problem. One reason is that two sentences can convey the same meaning with entirely dissimilar words. This paper presents the results of a comparative analysis of three representation schemes, i.e. term frequency-inverse document frequency, Latent Semantic Analysis and graph-based representation, using three similarity measures, i.e. Cosine, Dice coefficient and Jaccard similarity, to compare the similarity of sentences. Results reveal that the graph-based representation and the Jaccard similarity measure outperform the others in terms of precision, recall and F-measure.
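The three set-based measures named in the abstract can be sketched over bag-of-words sentence representations. The tokenizer, example sentences, and binary (presence/absence) weighting below are illustrative assumptions, not the authors' exact setup:

```python
# Minimal sketch of Jaccard, Dice and Cosine similarity over token sets.
# Tokenization (lowercased whitespace split) is an illustrative choice.

def tokens(sentence):
    """Lowercase whitespace tokenization (illustrative)."""
    return set(sentence.lower().split())

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def dice(a, b):
    # 2|A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(a, b):
    # For binary (set) vectors, cosine reduces to |A ∩ B| / sqrt(|A| * |B|).
    return len(a & b) / (len(a) * len(b)) ** 0.5

s1 = tokens("the cat sat on the mat")
s2 = tokens("a cat lay on the mat")
print(jaccard(s1, s2), dice(s1, s2), cosine(s1, s2))
```

On this toy pair the overlap is four tokens, so Jaccard gives 4/7, Dice 8/11, and binary cosine 4/√30; the paper's finding is that with a graph-based representation, Jaccard ranks best overall.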
Encapsulating and representing the knowledge on the evaluation of an engineering system
This paper proposes a cross-disciplinary methodology for a fundamental question in product development: how can the innovation patterns during the evolution of an engineering system (ES) be encapsulated, so that they can later be mined through data mining analysis methods? Reverse engineering answers the question of which components a developed engineering system consists of, and how the components interact to make the working product. TRIZ answers the question of which problem-solving principles can be, or have been, employed in developing that system, in comparison to its earlier versions or with respect to similar systems. While these two methodologies have been very popular, to the best of our knowledge there does not yet exist a methodology that reverse-engineers, encapsulates and represents the information regarding the complete product development process in abstract terms. This paper suggests such a methodology, consisting of mathematical formalism, graph visualization, and database representation. The proposed approach is demonstrated by analyzing the design and development process for a prototype wrist-rehabilitation robot.
Generic Architecture for Predictive Computational Modelling with Application to Financial Data Analysis: Integration of Semantic Approach and Machine Learning
The PhD thesis introduces a Generic Architecture for Predictive Computational Modelling capable of automating analytical conclusions regarding quantitative data structured as a data frame. The model involves heterogeneous data mining based on a semantic approach, graph-based methods (ontology, knowledge graphs, graph databases) and advanced machine learning methods. The main focus of my research is data pre-processing aimed at a more efficient selection of input features to the computational model.
Since the model I propose is generic, it can be applied to data mining of any quantitative dataset (containing two-dimensional, size-mutable, heterogeneous tabular data); however, it is best suited to highly interconnected data.
To adapt this generic model to a specific use case, an Ontology as the formal conceptual representation for the relevant domain knowledge is needed.
I chose financial/market data for my use cases. In the course of practical experiments, the effectiveness of the PCM model was evaluated on financial risk analysis for UK companies and on forecasting of the FTSE100 market index. The tests confirmed that the PCM model produces more accurate outcomes than stand-alone traditional machine learning methods.
By critically evaluating this architecture, I demonstrated its validity and suggested directions for future research.
Real-time analytics for complex structure data
University of Technology Sydney, Faculty of Engineering and Information Technology. The advancement of data acquisition and analysis technology has resulted in many real-world data being dynamic and containing rich content and structured information. More specifically, with the fast development of information technology, many current real-world data are characterized by dynamic changes, such as new instances, new nodes and edges, and modifications to node content. Unlike traditional data, which are represented as feature vectors, data with complex relationships are often represented as graphs to denote the content of the data entries and their structural relationships, where instances (nodes) are not only characterized by their content but are also subject to dependency relationships. In addition, real-time availability is one of the outstanding features of today's data. Real-time analytics is dynamic analysis and reporting based on data entered into a system before the actual time of use; it emphasizes deriving immediate knowledge from dynamic data sources, such as data streams, so knowledge discovery and pattern mining now face complex, dynamic data sources. However, how to combine structure information and node content information for accurate and real-time data mining is still a big challenge. Accordingly, this thesis focuses on real-time analytics for complex structure data. We explore instance correlation in complex structure data and utilize it to make mining tasks more accurate and applicable. To be specific, our objective is to combine node correlation with node content and utilize them for three different tasks: (1) graph stream classification, (2) super-graph classification and clustering, and (3) streaming network node classification.
Understanding the role of structured patterns in graph classification: the thesis first reviews existing work on data mining from a complex-structure perspective. We then propose a graph factorization-based fine-grained representation model, whose main objective is to use linear combinations of a set of discriminative cliques to represent graphs for learning. The optimization-oriented factorization approach ensures minimum information loss in the graph representation and also avoids the expensive sub-graph isomorphism validation process. Based on this idea, we propose a novel framework for fast graph stream classification.
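The clique-based representation idea, mapping each graph onto counts over a fixed dictionary of discriminative substructures so that learning operates on vectors instead of running sub-graph isomorphism tests, can be illustrated with a minimal sketch restricted to labeled triangles. The graph encoding, label scheme, and triangle dictionary below are illustrative assumptions, not the thesis's actual factorization model:

```python
# Sketch: represent a labeled graph as a vector of counts of dictionary
# triangles (3-cliques keyed by their sorted node labels). All names and
# the dictionary itself are illustrative assumptions.
from itertools import combinations

def triangle_features(adj, labels, dictionary):
    """Count occurrences of each dictionary triangle in the graph.

    adj: list of neighbor sets; labels: per-node labels;
    dictionary: list of sorted label triples defining the feature space.
    """
    counts = {t: 0 for t in dictionary}
    for a, b, c in combinations(range(len(adj)), 3):
        # (a, b, c) form a triangle if all three edges are present.
        if b in adj[a] and c in adj[a] and c in adj[b]:
            key = tuple(sorted((labels[a], labels[b], labels[c])))
            if key in counts:
                counts[key] += 1
    return [counts[t] for t in dictionary]

# Toy graph: a triangle on nodes 0-1-2 plus a pendant node 3.
adj = [{1, 2}, {0, 2}, {0, 1, 3}, {2}]
labels = ["A", "B", "B", "A"]
dictionary = [("A", "B", "B"), ("B", "B", "B")]
print(triangle_features(adj, labels, dictionary))  # → [1, 0]
```

The resulting fixed-length vectors can then feed any standard stream classifier, which is the practical payoff the abstract describes.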
A new structure data classification algorithm: the second method introduces a new super-graph classification and clustering problem. Due to the inherent complexity of the structure representation, existing graph classification methods cannot be applied to super-graph classification. In the thesis, we propose a weighted random walk kernel which calculates the similarity between two super-graphs by assessing (a) the similarity between super-nodes of the super-graphs, and (b) the common walks of the super-graphs. Our key contributions are: (1) a new super-node and super-graph structure that enriches existing graph representations for real-world applications; (2) a weighted random walk kernel considering node and structure similarities between graphs; (3) a mixed similarity considering the structured content inside super-nodes and the structural dependency between super-nodes; and (4) an effective kernel-based super-graph classification method with a sound theoretical basis. Empirical studies show that the proposed methods significantly outperform the state-of-the-art methods.
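The flavor of a weighted random walk kernel can be sketched as follows: walks are counted in the direct product of two graphs, with each step weighted by the similarity of the matched node pair and a decay factor. The graph encoding, node-similarity function, and parameters below are illustrative assumptions, not the thesis's exact kernel:

```python
# Sketch of a node-similarity-weighted walk kernel. k(G1, G2) sums, over
# walks of length 0..length in the direct product graph, the product of
# node-pair similarities along the walk, damped by a decay factor.

def walk_kernel(adj1, labels1, adj2, labels2, node_sim, length=3, decay=0.5):
    """adj*: adjacency lists; labels*: per-node labels;
    node_sim: symmetric similarity between node labels."""
    pairs = [(i, j) for i in range(len(adj1)) for j in range(len(adj2))]
    weight = {p: node_sim(labels1[p[0]], labels2[p[1]]) for p in pairs}
    # current[p] = weighted count of walks of current length ending at p
    current = dict(weight)
    total = sum(current.values())
    for _ in range(length):
        nxt = {}
        for (i, j), w in current.items():
            # Extend each walk by one simultaneous step in both graphs.
            for i2 in adj1[i]:
                for j2 in adj2[j]:
                    p = (i2, j2)
                    nxt[p] = nxt.get(p, 0.0) + w * decay * weight[p]
        current = nxt
        total += sum(current.values())
    return total

# Two toy graphs: a 3-node path and a 2-node edge, with scalar labels.
adj_a, lab_a = [[1], [0, 2], [1]], [1.0, 2.0, 1.0]
adj_b, lab_b = [[1], [0]], [1.0, 2.0]
sim = lambda x, y: 1.0 if x == y else 0.5
print(walk_kernel(adj_a, lab_a, adj_b, lab_b, sim))
```

In the thesis's setting the per-pair weight would come from the mixed super-node similarity (content plus structure) rather than a simple label match, but the walk-counting skeleton is the same.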
Real-time analytics framework for dynamic complex structure data: for streaming networks, the essential challenge is to properly capture the dynamic evolution of the node content and node interactions in order to support node classification. While streaming networks are dynamically evolving, over a short temporal period a subset of salient features is essentially tied to the network content and structures, and can therefore be used to characterize the network for classification. To achieve this goal, we propose to carry out streaming network feature selection (SNF) on the network and use the selected features as a gauge to classify unlabeled nodes. A Laplacian-based quality criterion is proposed to guide the node classification, where the Laplacian matrix is generated from node labels and network topology structures. Node classification is achieved by finding the class label that results in the minimal gauging value with respect to the selected features. By frequently updating the features selected from the network, node classification can quickly adapt to changes in the network for maximal performance gain. Experiments and comparisons on real-world networks demonstrate that the proposed approach, SNOC, is able to capture dynamics in the network structures and node content, and outperforms baseline approaches with significant performance gain.
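The Laplacian-based gauging idea can be sketched in miniature: a candidate label for an unlabeled node is scored by the smoothness of the resulting assignment over the network (the quadratic form x^T L x), and the label with the minimal score wins. The feature-selection step is omitted here, and all names and the scoring setup are illustrative assumptions, not the thesis's exact criterion:

```python
# Sketch: pick the label that minimizes the graph Laplacian quadratic form
# x^T L x, which for an unweighted graph equals the sum over edges of
# (x_u - x_v)^2 — i.e. the smoothest labeling over the topology.

def laplacian_score(adj, values):
    """x^T L x with each undirected edge counted once."""
    s = 0.0
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            if u < v:
                s += (values[u] - values[v]) ** 2
    return s

def classify_node(adj, labels, node, classes):
    """Try each candidate class for `node`; keep the smoothest assignment."""
    best = None
    for c in classes:
        trial = {**labels, node: c}
        vals = [trial.get(i, 0.0) for i in range(len(adj))]
        score = laplacian_score(adj, vals)
        if best is None or score < best[0]:
            best = (score, c)
    return best[1]

# Path graph 0-1-2; nodes 0 and 2 carry class 1.0, node 1 is unlabeled.
adj = [[1], [0, 2], [1]]
labels = {0: 1.0, 2: 1.0}
print(classify_node(adj, labels, 1, classes=[0.0, 1.0]))  # → 1.0
```

Re-running the selection and scoring as the stream evolves is what lets the classifier track changes in both structure and content, as the abstract describes.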
Neural Relational Learning Through Semi-Propositionalization of Bottom Clauses
Relational learning can be described as the task of learning first-order logic rules from examples. It has enabled a number of new machine learning applications, e.g. graph mining and link analysis in social networks. The CILP++ system is a neural-symbolic system which can perform efficient relational learning by processing first-order logic knowledge into a neural network. CILP++ relies on BCP, a recently introduced propositionalization algorithm, to perform relational learning. However, efficient knowledge extraction from such networks is an open issue, and the features generated by BCP do not have an independent relational description, which prevents sound knowledge extraction from these networks. We present a methodology for generating independent propositional features for BCP by using semi-propositionalization of bottom clauses. Empirical results obtained in comparison with the original version of BCP show that this approach has comparable accuracy and runtimes, while allowing a proper relational knowledge representation of features for knowledge extraction from CILP++ networks.
Representing Semantified Biological Assays in the Open Research Knowledge Graph
In the biotechnology and biomedical domains, recent text mining efforts advocate for machine-interpretable, and preferably semantified, documentation formats of laboratory processes. These include wet-lab protocols, (in)organic materials synthesis reactions, genetic manipulations, and procedures for faster computer-mediated analysis and predictions. Herein, we present our work on the representation of semantified bioassays in the Open Research Knowledge Graph (ORKG). In particular, we describe a work-in-progress semantification system to generate, automatically and quickly, the critical mass of semantified bioassay data needed to foster a consistent user audience to adopt the ORKG for recording their bioassays, and to facilitate the organisation of research according to FAIR principles.

Comment: In Proceedings of 'The 22nd International Conference on Asia-Pacific Digital Libraries'.