
    Cross Validation Of Neural Network Applications For Automatic New Topic Identification

    Recent studies in the literature address automatic topic-shift identification in Web search engine user sessions; however, most of this work applied topic-shift identification algorithms to data logs from a single search engine. The purpose of this study is to cross-validate an artificial neural network application that automatically identifies topic changes in a Web search engine user session, using data logs from different search engines to train and test the neural network. Sample data logs from the Norwegian search engine FAST (currently owned by Overture) and from Excite are used in this study. The findings suggest that topic shifts and continuations in user sessions on one search engine can be identified successfully using neural networks trained on a data log from a different search engine.

    Kindles, card catalogs, and the future of libraries: A collaborative digital humanities project

    Librarianship is a profession that often inspires commentary both from practitioners within the profession and from the public who use libraries. For librarians keeping up with the field, the literature is often engaged with predicting the effects of culture, policy or technology on libraries, sometimes with a great deal of hyperbole. For this article, two librarians and a digital humanities researcher formed a research team to determine whether the digital humanities technique of distant reading through topic modeling would reveal interesting patterns in a digital humanities project corpus of library-themed literature engaged in predicting the future and/or demise of libraries.

    Discovering latent topical structure by second-order similarity analysis

    This is the post-print of the article. Copyright © 2011 ASIS&T. Document similarity models are typically derived from a term-document vector space representation by comparing all vector pairs using some similarity measure. Computing similarity directly from a ‘bag of words’ model can be problematic because the assumption of term independence means that two things are ignored: the relationships between synonymous and related terms, and the contextual influences that determine the ‘sense’ of polysemous terms. This paper compares two methods that potentially address these problems by modelling the higher-order relationships that lie latent within the original vector space. The first is latent semantic analysis (LSA), a dimension reduction method that is a well-known means of addressing the vocabulary mismatch problem in information retrieval systems. The second is the lesser-known, yet conceptually simple, approach of second-order similarity (SOS) analysis, where similarity is measured in terms of profiles of first-order similarities as computed directly from the term-document space. Nearest neighbour tests show that SOS analysis produces similarity models that are consistently better than both first-order and LSA-derived models at resolving both coarse and fine level semantic clusters. SOS analysis has been criticised for its cubic complexity. A second contribution is the novel application of vector truncation to reduce the run-time by a constant factor. Speed-ups of four to ten times are found to be easily achievable without losing the structural benefits associated with SOS analysis.
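
The idea of second-order similarity described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of cosine as the similarity measure, the function names, and the four-document toy corpus are all assumptions made for illustration. Each document is first compared with every other document directly from its term vector (first order); the resulting rows of similarities are then treated as profiles and compared with each other (second order).

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def first_order(docs):
    """Pairwise similarities computed directly from the term-document vectors."""
    return [[cosine(a, b) for b in docs] for a in docs]

def second_order(docs):
    """Pairwise similarities between profiles (rows) of first-order similarities."""
    profiles = first_order(docs)
    return [[cosine(p, q) for q in profiles] for p in profiles]

# Hypothetical toy corpus: rows are documents, columns are counts of the
# terms (car, automobile, engine, banana).
docs = [
    [1, 0, 1, 0],  # d0: car, engine
    [0, 1, 1, 0],  # d1: automobile, engine
    [0, 1, 0, 0],  # d2: automobile
    [0, 0, 0, 1],  # d3: banana
]

fo = first_order(docs)
so = second_order(docs)

# d0 and d2 share no terms, so their first-order similarity is zero, but both
# are similar to d1, so their second-order similarity is positive.
print(fo[0][2], so[0][2] > 0)
```

In this sketch every one of the n² document pairs requires a comparison of two length-n profiles, which is where the cubic cost mentioned in the abstract comes from.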

    Mining Chat Logs to Extract Information about Authors and Topics for Crime Investigation

    Cybercriminals have been using the Internet to carry out illegitimate activities and to execute catastrophic attacks. Computer-mediated communication, such as online chat, provides an anonymous channel for predators to exploit victims. In order to prosecute criminals in a court of law, an investigator often needs to extract evidence from a large volume of chat messages. Most of the existing search tools are keyword-based, and the search terms are provided by an investigator, so the quality of the retrieved results depends on the search terms provided. Due to the large volume of chat messages and the large number of participants in public chat rooms, the process is usually time-consuming and error-prone. This thesis presents a topic search model to analyze archives of chat logs for segregating crime-relevant logs from others. Specifically, we propose an extension of the Latent Dirichlet Allocation (LDA)-based model to extract topics, compute the contribution of authors to these topics, and study the transitions of these topics over time. In addition, we present another unique model for characterizing authors-topics over time. This is crucial for investigation because it provides a view of the activities in which authors are involved for certain topics. Experiments on two real-life datasets suggest that the proposed approach can discover hidden criminal topics and the distribution of authors over these topics.

    Semantic enrichment of knowledge sources supported by domain ontologies

    This thesis introduces a novel conceptual framework to support the creation of knowledge representations based on enriched Semantic Vectors, using the classical vector space model approach extended with ontological support. One of the primary research challenges addressed here relates to the process of formalization and representation of document contents, where most existing approaches are limited and only take into account the explicit, word-based information in the document. This research explores how traditional knowledge representations can be enriched by incorporating implicit information derived from the complex relationships (semantic associations) modelled by domain ontologies, in addition to the information presented in documents. The relevant achievements pursued by this thesis are the following: (i) conceptualization of a model that enables the semantic enrichment of knowledge sources supported by domain experts; (ii) development of a method for extending the traditional vector space, using domain ontologies; (iii) development of a method to support ontology learning, based on the discovery of new ontological relations expressed in non-structured information sources; (iv) development of a process to evaluate the semantic enrichment; (v) implementation of a proof-of-concept, named SENSE (Semantic Enrichment kNowledge SourcEs), which enables validation of the ideas established under the scope of this thesis; (vi) publication of several scientific articles and support for four master's dissertations carried out in the Department of Electrical and Computer Engineering at FCT/UNL.
It is worth mentioning that the work developed under the semantic referential covered by this thesis has reused relevant achievements from European research projects, in order to adopt approaches that are considered scientifically sound and coherent and to avoid “reinventing the wheel”. European research projects: CoSpaces (IST-5-034245), CRESCENDO (FP7-234344) and MobiS (FP7-318452).

    Classifying complex topics using spatial-semantic document visualization : an evaluation of an interaction model to support open-ended search tasks

    In this dissertation we propose, test and develop a novel search interaction model to address two key problems associated with conducting an open-ended search task within a classical information retrieval system: (i) the need to reformulate the query within the context of a shifting conception of the problem and (ii) the need to integrate relevant results across a number of separate result sets. In our model the user issues just one high-recall query and then performs a sequence of more focused, distinct aspect searches by browsing the static structured context of a spatial-semantic visualization of this retrieved document set. Our thesis is that unsupervised spatial-semantic visualization can automatically classify retrieved documents into a two-level hierarchy of relevance. In particular we hypothesise that the locality of any given aspect exemplar will tend to comprise a sufficient proportion of same-aspect documents to support a visually guided strategy for focused, same-aspect searching that we term the aspect cluster growing strategy. We examine spatial-semantic classification and potential aspect cluster growing performance across three scenarios derived from topics and relevance judgements from the TREC test collection. Our analyses show that the expected classification can be represented in spatial-semantic structures created from document similarities computed by a simple vector space text analysis procedure. We compare two diametrically opposed approaches to layout optimisation: a global approach that focuses on preserving all the similarities and a local approach that focuses only on the strongest similarities. We find that the local approach, based on a minimum spanning tree of similarities, produces a better classification and, as observed from strategy simulation, more efficient aspect cluster growing performance in most situations, compared to the global approach of multidimensional scaling.
We show that a small but significant proportion of aspect cluster growing cases can be problematic, regardless of the layout algorithm used. We identify the characteristics of these cases and, on this basis, demonstrate a set of novel interactive tools that provide additional semantic cues to aid the user in locating same-aspect documents. EThOS - Electronic Theses Online Service, United Kingdom.
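
The local approach mentioned in this abstract keeps only the strongest similarities by building a spanning tree over the documents. The sketch below shows one way such a tree could be extracted. It is not the dissertation's procedure: it assumes Prim's algorithm, a symmetric similarity matrix, and a hypothetical four-document example, and it covers only the tree construction, not the layout step that would position the documents on screen.

```python
def similarity_spanning_tree(sim):
    """Prim's algorithm that repeatedly adds the strongest link to the tree.

    Maximising similarity is equivalent to building a minimum spanning
    tree on the distances 1 - similarity. Returns the tree edges (i, j).
    """
    n = len(sim)
    in_tree = {0}  # start from document 0 (an arbitrary choice)
    edges = []
    while len(in_tree) < n:
        # Strongest similarity joining a tree document to an outside one.
        i, j = max(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: sim[e[0]][e[1]],
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Hypothetical symmetric document-similarity matrix with two tight pairs:
# documents 0 and 1, and documents 2 and 3.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

tree = similarity_spanning_tree(sim)
print(tree)
```

A tree over n documents has n - 1 edges, so most of the weaker pairwise similarities are discarded, in contrast to a global method such as multidimensional scaling, which tries to preserve all of them.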
