
    Feature selection, optimization and clustering strategies of text documents

    Clustering is one of the most researched areas of data mining in the contemporary literature. The need for efficient clustering arises across many sectors, including consumer segmentation, categorization, shared filtering, document management, and indexing. The clustering task must be studied before it is adapted to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numeric. Efforts have also been made to achieve efficient clustering of categorical information, where the selected features take nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. It also details prominent clustering models along with the pros and cons of each, and covers recent developments in clustering for social networks and associated environments.

    From Frequency to Meaning: Vector Space Models of Semantics

    Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
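
To make the term-document class of VSMs concrete, here is a minimal sketch in plain Python. It is not taken from the survey: the function names, the whitespace tokenizer, the raw-count weighting, and the three sample documents are all assumptions chosen for illustration. Each column of the matrix is a document vector, and cosine similarity between columns approximates document similarity.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    # Assumption: lowercase whitespace tokenization, raw counts (no tf-idf).
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract the vector for document j (one column of the matrix)."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stocks fell sharply today",
]
vocab, tdm = term_document_matrix(docs)
sim_01 = cosine(column(tdm, 0), column(tdm, 1))  # documents sharing several terms
sim_02 = cosine(column(tdm, 0), column(tdm, 2))  # documents sharing no terms
```

In this toy corpus the first two documents share "the", "cat", and "on", so their similarity is well above that of the first and third documents, which share no terms. Word-context and pair-pattern matrices follow the same pattern with different row and column definitions.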

    Semantic enrichment of knowledge sources supported by domain ontologies

    This thesis introduces a novel conceptual framework to support the creation of knowledge representations based on enriched Semantic Vectors, using the classical vector space model approach extended with ontological support. One of the primary research challenges addressed here relates to the formalization and representation of document contents, where most existing approaches are limited and only take into account the explicit, word-based information in the document. This research explores how traditional knowledge representations can be enriched by incorporating implicit information derived from the complex relationships (semantic associations) modelled by domain ontologies, in addition to the information presented in documents. The relevant achievements pursued by this thesis are the following: (i) conceptualization of a model that enables the semantic enrichment of knowledge sources supported by domain experts; (ii) development of a method for extending the traditional vector space, using domain ontologies; (iii) development of a method to support ontology learning, based on the discovery of new ontological relations expressed in non-structured information sources; (iv) development of a process to evaluate the semantic enrichment; (v) implementation of a proof-of-concept, named SENSE (Semantic Enrichment kNowledge SourcEs), which makes it possible to validate the ideas established under the scope of this thesis; (vi) publication of several scientific articles and support for four master's dissertations carried out in the Department of Electrical and Computer Engineering at FCT/UNL.
It is worth mentioning that the work developed under the semantic referential covered by this thesis has reused relevant achievements from European research projects, in order to build on approaches that are considered scientifically sound and coherent and to avoid “reinventing the wheel”. European research projects: CoSpaces (IST-5-034245), CRESCENDO (FP7-234344) and MobiS (FP7-318452)

    Discovering topics in Slack message streams

    Slack is an instant messaging platform intended for the internal communications of companies and other organizations. For organizations that use Slack extensively it may provide an interesting source of insight, but as such the data is difficult to analyze. Topic modeling, primarily latent Dirichlet allocation (LDA), is commonly used to summarize textual data in a meaningful way. Instant messages tend to be very short, which causes problems for conventional topic modeling methods such as LDA. The data sparsity problem can be tackled with data expansion and data combination techniques. For instant messages, data combination is particularly attractive because the messages are not independent of each other, but form implicit, and sometimes explicit, threads as the participants reply to each other. Most of the threads in the Slack data are not explicit, and must be ’untangled’ from the message stream if they are to be used as a basis for a data combination scheme. In this thesis we study the possibility of detecting implicit threads in a Slack message stream and leveraging the threads as a data combination scheme in topic modeling. The threads are detected using a hierarchical clustering algorithm which uses word mover’s distance, latent semantic analysis, and metadata to compute the distances between messages. The messages in each cluster are then concatenated and used as the input for LDA. It is shown that on a dataset gathered from the Gofore Oyj Slack workspace, the cluster-based model improves on the message-based model, but falls short of being practical.
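
The data combination idea described above can be sketched roughly as follows. This is not the thesis's implementation: the distance here swaps in a simple Jaccard word distance plus a timestamp penalty in place of word mover's distance and latent semantic analysis, and the weights, the `time_scale` and `threshold` parameters, the message fields, and the sample messages are all hypothetical. The output is one concatenated pseudo-document per detected thread, which could then be handed to a topic model such as LDA.

```python
from itertools import combinations


def distance(a, b, time_scale=600.0):
    """Hypothetical message distance: Jaccard word distance plus a time-gap penalty."""
    wa, wb = set(a["text"].lower().split()), set(b["text"].lower().split())
    union = wa | wb
    jaccard = 1.0 - len(wa & wb) / len(union) if union else 0.0
    gap = min(abs(a["ts"] - b["ts"]) / time_scale, 1.0)  # capped at 1.0
    return 0.7 * jaccard + 0.3 * gap  # assumed weights, not from the thesis


def cluster_threads(messages, threshold=0.6):
    """Single-link agglomerative clustering, stopped at a distance threshold."""
    clusters = [[i] for i in range(len(messages))]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            # Single link: cluster distance is the closest pair of members.
            d = min(distance(messages[i], messages[j])
                    for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        if best[0] > threshold:
            break  # no remaining pair of clusters is close enough to merge
        _, x, y = best
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


def concatenate(messages, clusters):
    """Join the messages of each cluster into one pseudo-document."""
    return [" ".join(messages[i]["text"] for i in c) for c in clusters]


# Hypothetical interleaved message stream (ts in seconds).
messages = [
    {"ts": 0, "text": "deploy failed on staging"},
    {"ts": 30, "text": "lunch at noon anyone"},
    {"ts": 60, "text": "staging deploy failed again"},
    {"ts": 90, "text": "yes lunch at noon"},
]
threads = cluster_threads(messages)
pseudo_docs = concatenate(messages, threads)
```

With these toy messages the two interleaved conversations are separated into two threads, so the topic model would see two longer documents instead of four very short ones.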

    XML Matchers: approaches and challenges

    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was investigated mainly for classical database models (e.g., E/R schemas, relational databases, etc.). However, in recent years, the widespread adoption of XML in the most disparate application fields has pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them to DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs affect the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers. Comment: 34 pages, 8 tables, 7 figures

    Document Clustering as an approach to template extraction

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. A great part of customer support is done via the exchange of emails. As the number of emails exchanged daily is constantly increasing, companies need to find approaches to keep this support efficient. One common strategy is to answer with template emails. These answer templates are usually found by a human agent through the repeated use of the same answer. In this work, we use a clustering approach to find these answer templates. Several clustering algorithms are researched in this work, with a focus on the k-means methodology, as well as other clustering components such as similarity measures and pre-processing steps. As we are dealing with text data, several text representation methods are also compared. Due to the peculiarity of the provided data, we are able to design methodologies to ensure the feasibility of this task and develop strategies to extract the answer templates from the clustering results.
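
As a rough illustration of the k-means approach mentioned above, the sketch below clusters a handful of made-up support answers and picks, for each cluster, the answer closest to the centroid as a candidate template. Everything here is an assumption for illustration rather than the dissertation's method: the normalized bag-of-words representation, the fixed initial centroids passed as `init_indices`, the helper names, the sample emails, and the closest-to-centroid rule for choosing a template.

```python
import math
from collections import Counter


def vectorize(texts):
    """Unit-length bag-of-words vectors over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        c = Counter(t.lower().split())
        v = [float(c[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, init_indices, iterations=10):
    """Plain Lloyd iterations with caller-supplied initial centroids."""
    centroids = [vectors[i][:] for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representatives(vectors, labels, centroids):
    """Index of the member closest to each centroid (None for an empty cluster)."""
    reps = []
    for k, c in enumerate(centroids):
        members = [i for i, l in enumerate(labels) if l == k]
        reps.append(min(members, key=lambda i: sq_dist(vectors[i], c)) if members else None)
    return reps


# Hypothetical support answers.
emails = [
    "your refund has been processed",
    "your refund has been processed today",
    "please reset your password using the link",
    "please reset your password using this link",
    "your refund was processed",
]
vectors = vectorize(emails)
labels, centroids = kmeans(vectors, init_indices=[0, 2])
reps = representatives(vectors, labels, centroids)
```

Here the refund answers and the password answers end up in separate clusters, and `emails[reps[k]]` gives one candidate template per cluster. On real data the choice of text representation, similarity measure, and number of clusters would need the kind of comparison the dissertation describes.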