    The Use of Latent Semantic Indexing to Cluster Documents into Their Subject Areas

    Keyword-matching information retrieval systems are plagued by noise in the document collection, arising from synonymy and polysemy. This noise tends to hide the latent structure of the documents, reducing the accuracy of information retrieval systems and making it difficult for clustering algorithms to pick up on shared concepts and effectively cluster similar documents. Latent Semantic Analysis (LSA), through its use of Singular Value Decomposition, reduces the dimension of the document space, mapping it onto a smaller concept space devoid of this noise and making it easier to group similar documents together. This work is an exploratory report on the use of LSA to cluster a small dataset of documents according to their topic areas, to see how LSA would fare in comparison to clustering with a clustering package without LSA.
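
    As a concrete illustration of the pipeline sketched above (not the paper's own code), the following Python snippet projects TF-IDF document vectors into a low-dimensional concept space with truncated SVD before clustering; the toy corpus and the choice of two concepts and two clusters are assumptions for the example.

```python
# A minimal LSA-clustering sketch: TF-IDF -> truncated SVD -> k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [                                    # toy corpus, standing in for the
    "the cat sat on the mat",               # small dataset described above
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors traded shares on the market",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
# Map the term space onto k latent concepts; k is chosen by the modeller.
concepts = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(concepts)
print(labels)  # documents sharing a topic should share a cluster label
```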

    A Model for Automatic Extraction of Slowdowns From Traffic Sensor Data

    The ability to identify slowdowns from a stream of traffic sensor readings in an automatic fashion is a core building block for any application that incorporates traffic behaviour into its analysis process. The methods proposed in this paper treat slowdowns as valley-shaped data sequences that fall below a normal distribution interval. This paper proposes a model for slowdown identification and partitioning across multiple periods of time, which aims to serve as a first layer of knowledge about the traffic environment. The model can be used to extract the regularities from a set of events of interest with recurring behaviour and to assert the consistency of the extracted patterns. The proposed methods are evaluated using real data collected from highway traffic sensors.
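
    A hedged sketch of the valley idea (the function name and threshold rule are our assumptions, not the paper's model): readings falling below a normal interval, here the mean minus k standard deviations, are grouped into contiguous runs, each run being one candidate slowdown.

```python
import numpy as np

def find_slowdowns(speeds, k=1.0):
    """Return (start, end) index ranges of readings below mean - k*std."""
    speeds = np.asarray(speeds, dtype=float)
    threshold = speeds.mean() - k * speeds.std()
    below = speeds < threshold
    slowdowns, start = [], None
    for i, flag in enumerate(below):
        if flag and start is None:
            start = i                         # a valley begins
        elif not flag and start is not None:
            slowdowns.append((start, i - 1))  # a valley ends (inclusive)
            start = None
    if start is not None:
        slowdowns.append((start, len(speeds) - 1))
    return slowdowns

# Free flow around 100 km/h with a dip at indices 5-7 -> [(5, 7)]
print(find_slowdowns([100, 98, 101, 99, 97, 40, 35, 45, 99, 100]))
```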

    AraNLP: A Java-based library for the processing of Arabic text

    We present a free, Java-based library named "AraNLP" that covers various Arabic text preprocessing tools. Although a good number of tools for processing Arabic text already exist, integration and compatibility problems continually occur. AraNLP is an attempt to gather most of the vital Arabic text preprocessing tools into one library that can be accessed easily, by integrating or accurately adapting existing tools and by developing new ones when required. The library includes a sentence detector, tokenizer, light stemmer, root stemmer, part-of-speech tagger (POS-tagger), word segmenter, normalizer, and a punctuation and diacritic remover.
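
    The snippet below is not AraNLP's actual API; it is a minimal, self-contained illustration of what two of the listed tools do: a diacritic remover (Arabic harakat occupy U+064B through U+0652) and a simple orthographic normalizer.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat plus the tatweel

def remove_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)

def normalize(text: str) -> str:
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    return text

print(normalize(remove_diacritics("العَرَبِيَّة")))  # -> العربيه
```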

    A semi-supervised learning approach to Arabic named entity recognition

    We present ASemiNER, a semi-supervised algorithm for identifying Named Entities (NEs) in Arabic text. ASemiNER requires neither annotated training data nor gazetteers, and it can easily be adapted to handle more than the three standard NE types (Person, Location, and Organisation). To our knowledge, ours is the first study that intensively investigates the semi-supervised, pattern-based learning approach to Arabic Named Entity Recognition (NER). We describe ASemiNER and compare its performance with different supervised systems, evaluating it through experiments that extract the three standard named-entity types. Ultimately, our algorithm outperforms simple supervised systems, and it also performs well when evaluated on extracting three new, specialised types of NEs (Politicians, Sportspersons, and Artists).
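
    To make the pattern-based idea concrete, here is a generic bootstrapping sketch in the spirit of such algorithms, not ASemiNER itself (the function name and the unscored pattern matching are our simplifications; real systems rank and filter patterns to control noise):

```python
def bootstrap(corpus, seeds, iterations=2, window=2):
    """corpus: list of token lists; seeds: set of known entity tokens."""
    entities = set(seeds)
    for _ in range(iterations):
        # 1. Induce context patterns around current entity occurrences.
        patterns = set()
        for tokens in corpus:
            for i, tok in enumerate(tokens):
                if tok in entities:
                    patterns.add((tuple(tokens[max(0, i - window):i]),
                                  tuple(tokens[i + 1:i + 1 + window])))
        # 2. Any token appearing in a known context becomes a new entity.
        for tokens in corpus:
            for i, tok in enumerate(tokens):
                context = (tuple(tokens[max(0, i - window):i]),
                           tuple(tokens[i + 1:i + 1 + window]))
                if context in patterns:
                    entities.add(tok)
    return entities - set(seeds)  # newly learned entities
```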

    Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia

    In this paper we propose a new methodology that exploits Wikipedia's features and structure to automatically develop an Arabic NE annotated corpus. Each Wikipedia link is labelled with the NE type of its target article in order to produce the NE annotation. Other Wikipedia features - namely redirects, anchor texts, and inter-language links - are used to tag additional NEs that appear without links in Wikipedia texts. Furthermore, we have developed a filtering algorithm to eliminate ambiguity when tagging candidate NEs. We also introduce a mechanism based on the high coverage of Wikipedia to address two challenges particular to tagging NEs in Arabic text: rich morphology and the absence of capitalisation. The corpus created with our new method (WDC) has been used to train an NE tagger, which has been tested on different domains. Judging by the results, an NE tagger trained on WDC can compete with those trained on manually annotated corpora.
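
    A simplified sketch of the link-to-annotation idea (the regex and the title-to-type map are assumptions; the paper's pipeline also exploits redirects, anchor texts, and inter-language links):

```python
import re

LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")  # [[target|anchor]]

def annotate(wikitext, title_to_type):
    """title_to_type maps article titles to NE types, e.g. 'القاهرة' -> 'LOC'."""
    spans = []
    def repl(match):
        target = match.group(1)
        anchor = match.group(2) or target
        spans.append((anchor, title_to_type.get(target, "O")))
        return anchor  # strip the markup, keep the surface form
    return LINK.sub(repl, wikitext), spans

text = "ولد في [[القاهرة]] عام 1950"  # "born in [[Cairo]] in 1950"
print(annotate(text, {"القاهرة": "LOC"}))
```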

    Moving towards Adaptive Search

    Information retrieval has become very popular over the last decade with the advent of the Web. Nevertheless, searching the Web is very different from searching smaller, often more structured collections such as intranets and digital libraries. Such collections are the focus of the recently started AutoAdapt project. The project seeks to aid user search by providing well-structured domain knowledge to assist query modification and navigation. There are two challenges: acquiring the domain knowledge and adapting it automatically to the specific interests of the user community. At the workshop we will demonstrate an implemented prototype that serves as a starting point on the way to truly adaptive search.

    Using domain models for context-rich user logging

    This paper describes the prototype interactive search system being developed within the AutoAdapt project. The AutoAdapt project seeks to enhance the user experience in searching for information and navigating within selected domain collections by providing structured representations of domain knowledge to be directly explored, logged, adapted and updated to reflect user needs. We propose that this structure is a valuable stepping stone towards context-rich logging of user activities within the information-seeking environment. Here we describe the primary components that have been implemented and the user interactions the system will support.
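
    As an illustration only (the field names are our assumption, not the project's actual schema), one context-rich log event might record the query together with the domain-model concept the user touched:

```python
from dataclasses import dataclass, asdict
import json, time

@dataclass
class SearchLogEvent:
    session_id: str
    query: str
    action: str             # e.g. "query_issued", "model_node_clicked"
    domain_model_node: str  # the domain-model concept the user interacted with
    timestamp: float

event = SearchLogEvent("s42", "latent semantic indexing",
                       "model_node_clicked", "information retrieval",
                       time.time())
print(json.dumps(asdict(event)))  # one structured, context-rich log line
```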

    A Methodology for Simulated Experiments in Interactive Search

    Interactive information retrieval has received much attention in recent years, e.g. [7]. Furthermore, increased activity in developing interactive features across existing popular Web search engines suggests that interactive systems are being recognised as a promising next step in assisting information search. One of the most challenging problems with interactive systems, however, remains evaluation. We describe the general specifications of a methodology for conducting controlled and reproducible experiments in the context of interactive search. It was developed in the AutoAdapt project with a focus on search in intranets, but the methodology is more generic than that and can be applied to interactive Web search as well. Its goal is to evaluate the ability of different algorithms to produce domain models that provide accurate suggestions for query modifications. The AutoAdapt project investigates automatically constructed, adaptive domain models for suggesting query modifications to the users of an intranet search engine. This goes beyond static models, such as the one that guides users searching the Web site of the University of Essex, which is based on a domain model built in advance from the documents' markup structure.
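
    A minimal sketch of how such a simulated experiment could score a model (the replay protocol, the suggest interface, and the success@k measure are our assumptions, not the methodology's exact specification): logged query sessions are replayed, and a suggestion list counts as a hit if it contains the reformulation the user actually issued next.

```python
def evaluate(model, sessions, k=5):
    """sessions: (query, next_query) pairs replayed from a search log.
    model.suggest(query) must return a ranked list of query modifications."""
    hits = sum(1 for query, next_query in sessions
               if next_query in model.suggest(query)[:k])
    return hits / len(sessions)  # success@k over the replayed sessions
```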

    University of Essex at the TAC 2011 Multilingual Summarisation Pilot

    We present the results of our Arabic and English runs at the TAC 2011 Multilingual Summarisation (MultiLing) task. We participated with centroid-based clustering for multi-document summarisation. The automatically generated Arabic and English summaries were evaluated by human participants and by two automatic evaluation metrics, ROUGE and AutoSummENG. The results are compared with those of the other systems that participated in the same track for both Arabic and English. Our Arabic summariser performed particularly well in the human evaluation.
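
    For illustration, here is a minimal centroid-based extractive step (a sketch, not the submitted system): sentences closest to the TF-IDF centroid of the document cluster are selected for the summary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def summarise(sentences, n=2):
    vecs = TfidfVectorizer().fit_transform(sentences).toarray()
    centroid = vecs.mean(axis=0)
    # Cosine similarity of each sentence to the cluster centroid.
    norms = np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid)
    scores = vecs @ centroid / np.where(norms == 0.0, 1.0, norms)
    top = np.argsort(scores)[::-1][:n]
    return [sentences[i] for i in sorted(top)]  # keep original order
```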

    Experiment-driven development of a GWAP for marking segments in text

    This paper describes TileAttack, an innovative, highly configurable game-with-a-purpose (GWAP) designed to gather annotations for text segmentation tasks while exploring the effects of different game mechanics on GWAPs for NLP (Natural Language Processing) problems, with a view to improving both the quality of player contributions and player uptake. In this work we present a pilot experiment that shows TileAttack labelling "mentions" and being used to test the effects of in-game time constraints on accuracy and player engagement. We present the results of this experiment using a set of metrics derived from those used for evaluating Free-To-Play (F2P) games.