
    Transforming Graph Representations for Statistical Relational Learning

    Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of statistical relational learning (SRL) algorithms to these domains. In this article, we examine a range of representation issues for graph-based relational data. Since the choice of relational data representation for the nodes, links, and features can dramatically affect the capabilities of SRL algorithms, we survey approaches and opportunities for relational representation transformation designed to improve the performance of these algorithms. This leads us to introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. In particular, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey and compare competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed
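
    As a rough illustration of task (i) for links, predicting their existence, the sketch below scores each unlinked node pair by its number of shared neighbours. This common-neighbours heuristic is a standard baseline chosen here for illustration, not a method attributed to the article; the function name and the toy edge list are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(edges):
    """Score every unlinked node pair by its number of shared neighbours."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v not in neighbors[u]:  # only score pairs that are not yet linked
            scores[(u, v)] = len(neighbors[u] & neighbors[v])
    return scores


# Toy graph (invented for this example).
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
scores = common_neighbor_scores(edges)
```

    Pairs with higher scores would be the stronger candidates for a missing link; a real SRL system would typically combine such a score with node and link features.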

    EFFECTIVE METHODS AND TOOLS FOR MINING APP STORE REVIEWS

    Research on mining user reviews in mobile application (app) stores has noticeably advanced in the past few years. The main objective is to extract useful information that app developers can use to build more sustainable apps. In general, existing research on app store mining can be classified into three genres: classification of user feedback into different types of software maintenance requests (e.g., bug reports and feature requests), building practical tools that are readily available for developers to use, and proposing visions for enhanced mobile app stores that integrate multiple sources of user feedback to ensure app survivability. Despite these major advances, existing tools and techniques still suffer from several drawbacks. Specifically, the majority of techniques rely on the textual content of user reviews for classification. However, due to the inherently diverse and unstructured nature of user-generated online textual reviews, text-based review mining techniques often produce excessively complicated models that are prone to over-fitting. Furthermore, the majority of proposed techniques focus on extracting and classifying the functional requirements in mobile app reviews, providing little or no support for extracting and synthesizing the non-functional requirements (NFRs) raised in user feedback (e.g., security, reliability, and usability). In terms of tool support, existing tools are still far from being adequate for practical applications. In general, there is a lack of off-the-shelf tools that can be used by researchers and practitioners to accurately mine user reviews. Motivated by these observations, in this dissertation, we explore several research directions aimed at addressing the current issues and shortcomings in app store review mining research. In particular, we introduce a novel semantically aware approach for mining and classifying functional requirements from app store reviews.
This approach reduces the dimensionality of the data and enhances the predictive capabilities of the classifier. We then present a two-phase study aimed at automatically capturing the NFRs in user reviews. We also introduce MARC, a tool that enables developers to extract, classify, and summarize user reviews

    Two-Level Text Classification Using Hybrid Machine Learning Techniques

    Nowadays, documents are increasingly being associated with multi-level category hierarchies rather than a flat category scheme. To access these documents in real time, we need fast automatic methods to navigate these hierarchies. Today’s vast data repositories such as the web also contain many broad domains of data which are quite distinct from each other e.g. medicine, education, sports and politics. Each domain constitutes a subspace of the data within which the documents are similar to each other but quite distinct from the documents in another subspace. The data within these domains is frequently further divided into many subcategories. Subspace Learning is a technique popular with non-text domains such as image recognition to increase speed and accuracy. Subspace analysis lends itself naturally to the idea of hybrid classifiers. Each subspace can be processed by a classifier best suited to the characteristics of that particular subspace. Instead of using the complete set of full space feature dimensions, classifier performances can be boosted by using only a subset of the dimensions. This thesis presents a novel hybrid parallel architecture using separate classifiers trained on separate subspaces to improve two-level text classification. The classifier to be used on a particular input and the relevant feature subset to be extracted is determined dynamically by using a novel method based on the maximum significance value. A novel vector representation which enhances the distinction between classes within the subspace is also developed. This novel system, the Hybrid Parallel Classifier, was compared against the baselines of several single classifiers such as the Multilayer Perceptron and was found to be faster and have higher two-level classification accuracies. The improvement in performance achieved was even higher when dealing with more complex category hierarchies

    From text mining to knowledge mining: An integrated framework of concept extraction and categorization for domain ontology

    Organizations struggle with challenges arising from a regulatory, social, and economic environment that is complex and continuously changing. These challenges increase the demand for managing organizational knowledge, for example, how to provide employees with the necessary job-specific knowledge at the right time and in the right format. Employees have to update their knowledge and improve their competencies continuously. Knowledge repositories play a key role in knowledge management because they primarily contain the organization's intellectual assets (explicit knowledge), while employees hold tacit knowledge, which is difficult to extract and codify. Business processes are also important for managing organizational knowledge, since they contain both explicit and tacit knowledge elements. One of the key questions is how to handle this hidden knowledge in order to improve organizational knowledge, especially employees' knowledge, by providing the most appropriate learning and/or training materials, and how to ensure that the knowledge in business processes is the same as that in knowledge repositories and in employees' heads. These are the major themes of this thesis

    Context Aware Textual Entailment

    In conversations, stories, news reporting, and other forms of natural language, understanding requires participants to make assumptions (hypotheses) based on background knowledge, a process called entailment. These assumptions may then be supported, contradicted, or refined as a conversation or story progresses and additional facts become known and context changes. It is often the case that we do not know an aspect of the story with certainty but rather believe it to be the case; i.e., what we know is associated with uncertainty or ambiguity. In this research a method has been developed to identify different contexts of the input raw text along with specific features of the contexts such as time, location, and objects. The method includes a two-phase SVM classifier along with a voting mechanism in the second phase to identify the contexts. Rule-based algorithms were utilized to extract the context elements. This research also develops a new context-aware text representation. This representation maintains semantic aspects of sentences, as well as textual contexts and context elements. The method can offer both graph representation and First-Order Logic representation of the text. This research also extracts a First-Order Logic (FOL) and XML representation of a text or series of texts. The method includes entailment using background knowledge from sources (VerbOcean and WordNet), with resolution of conflicts between extracted clauses, and handling the role of context in resolving uncertain truth

    Adaptive constrained clustering with application to dynamic image database categorization and visualization.

    The advent of larger storage spaces, affordable digital capturing devices, and an ever growing online community dedicated to sharing images has created a great need for efficient analysis methods. In fact, analyzing images for the purpose of automatic categorization and retrieval is quickly becoming an overwhelming task even for the casual user. Initially, systems designed for these applications relied on contextual information associated with images. However, it was realized that this approach does not scale to very large data sets and can be subjective. Then researchers proposed methods relying on the content of the images. This approach has also proved to be limited due to the semantic gap between the low-level representation of the image and the high-level user perception. In this dissertation, we introduce a novel clustering technique that is designed to combine multiple forms of information in order to overcome the disadvantages observed while using a single information domain. Our proposed approach, called Adaptive Constrained Clustering (ACC), is a robust, dynamic, and semi-supervised algorithm. It is based on minimizing a single objective function incorporating the abilities to: (i) use multiple feature subsets while learning cluster independent feature relevance weights; (ii) search for the optimal number of clusters; and (iii) incorporate partial supervision in the form of pairwise constraints. The content of the images is used to extract the features used in the clustering process. The context information is used in constructing a set of appropriate constraints. These constraints are used as partial supervision information to guide the clustering process. The ACC algorithm is dynamic in the sense that the number of categories is allowed to expand and contract depending on the distribution of the data and the available set of constraints.
We show that the proposed ACC algorithm is able to partition a given data set into meaningful clusters using an adaptive, soft constraint satisfaction methodology for the purpose of automatically categorizing and summarizing an image database. We show that the ACC algorithm has the ability to incorporate various types of contextual information. This contextual information includes: spatial information provided by geo-referenced images that include GPS coordinates pinpointing their location, temporal information provided by each image's time stamp indicating the capture time, and textual information provided by a set of keywords describing the semantics of the associated images
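
    To make the idea of soft pairwise constraints in item (iii) concrete, the sketch below evaluates a generic penalized clustering cost: within-cluster squared distance plus a fixed penalty for every violated must-link or cannot-link pair. This is an assumed, simplified objective in the style of penalized k-means, not the ACC objective itself, which the abstract does not spell out; it omits the feature relevance weights and the search over the number of clusters. All names and the toy data are hypothetical.

```python
def constrained_cost(points, labels, centers, must_link, cannot_link, penalty=1.0):
    """Within-cluster squared distance plus a penalty per violated constraint."""
    # Distortion term: squared Euclidean distance of each point to its center.
    cost = 0.0
    for point, k in zip(points, labels):
        cost += sum((a - b) ** 2 for a, b in zip(point, centers[k]))

    # Soft constraints: a violation adds a penalty instead of being forbidden.
    for i, j in must_link:
        if labels[i] != labels[j]:
            cost += penalty
    for i, j in cannot_link:
        if labels[i] == labels[j]:
            cost += penalty
    return cost


# Toy data (invented): two well-separated groups of two points each.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers = [(0, 1), (10, 1)]
must_link = [(0, 1)]      # points 0 and 1 should share a cluster
cannot_link = [(1, 2)]    # points 1 and 2 should not share a cluster

good = constrained_cost(points, [0, 0, 1, 1], centers, must_link, cannot_link, penalty=5.0)
bad = constrained_cost(points, [0, 1, 1, 1], centers, must_link, cannot_link, penalty=5.0)
```

    Because the constraints are soft, an assignment that violates them is still admissible but costs more, so a minimizer can trade a violation against a large reduction in distortion.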

    The Generation of Compound Nominals to Represent the Essence of Text The COMMIX System

    This thesis concerns the COMMIX system, which automatically extracts information on what a text is about, and generates that information in the highly compacted form of compound nominal expressions. The expressions generated are complex and may include novel terms which do not appear themselves in the input text. From the practical point of view, the work is driven by the need for better representations of content: for representations which are shorter and more concise than would appear in an abstract, yet more informative and representative of the actual aboutness than commonly occurs in indexing expressions and key terms. This additional layer of representation is referred to in this work as pertaining to the essence of a particular text. From a theoretical standpoint, the thesis shows how the compound nominal as a construct can be successfully employed in these highly informative representations. It involves an exploration of the claim that there is sufficient semantic information contained within the standard dictionary glosses for individual words to enable the construction of useful and highly representative novel compound nominal expressions, without recourse to standard syntactic and statistical methods. It shows how a shallow semantic approach to content identification which is based on lexical overlap can produce some very encouraging results. The methodology employed, and described herein, is domain-independent, and does not require the specification of templates with which the input text must comply. In these two respects, the methodology developed in this work avoids two of the most common problems associated with information extraction. As regards the evaluation of this type of work, the thesis introduces and utilises the notion of percentage attainment value, which is used in conjunction with subjects' opinions about the degree to which the aboutness terms succeed in indicating the subject matter of the texts for which they were generated
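
    The lexical-overlap idea can be illustrated with a minimal sketch that compares the content words of two dictionary glosses. This is only an assumed reading of "lexical overlap"; the abstract does not describe COMMIX's actual procedure. The function name, the stopword list, and both glosses are invented for this example and are not taken from any real dictionary.

```python
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "or", "and", "to", "in", "for", "that", "is", "by"}
)


def gloss_overlap(gloss_a, gloss_b, stopwords=STOPWORDS):
    """Return the content words shared by two glosses (no stemming)."""

    def content_words(gloss):
        # Lowercase, strip surrounding punctuation, drop stopwords and blanks.
        words = {w.strip(".,;:()").lower() for w in gloss.split()}
        words.discard("")
        return words - stopwords

    return content_words(gloss_a) & content_words(gloss_b)


# Hypothetical glosses, written for illustration only.
bank_gloss = "an institution that accepts deposits of money and makes loans"
loan_gloss = "a sum of money lent to a borrower by an institution"
shared = gloss_overlap(bank_gloss, loan_gloss)
```

    Here the two glosses share "institution" and "money"; "loans" and "loan" do not match because the sketch does no stemming, which is one reason a practical system would need more than raw token overlap.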