
    Unsupervised Keyword Extraction from Polish Legal Texts

    In this work, we present an application of the recently proposed unsupervised keyword extraction algorithm RAKE to a corpus of Polish legal texts from the field of public procurement. RAKE is essentially a language- and domain-independent method. Its only language-specific input is a stoplist containing a set of non-content words. The performance of the method depends heavily on the choice of such a stoplist, which should be domain-adapted. Therefore, we complement the RAKE algorithm with an automatic approach to selecting non-content words, based on the statistical properties of term distributions.
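    As a hedged illustration of the RAKE scoring idea described above (a toy English stoplist and sentence, not the Polish, domain-adapted stoplist the paper derives from term-distribution statistics), a minimal sketch might look like this:

        import re
        from collections import defaultdict

        def rake(text, stoplist):
            # Candidate phrases: runs of content words delimited by stopwords
            # or punctuation (the core RAKE heuristic).
            phrases = []
            for fragment in re.split(r"[^\w\s]", text.lower()):
                current = []
                for word in fragment.split():
                    if word in stoplist:
                        if current:
                            phrases.append(current)
                        current = []
                    else:
                        current.append(word)
                if current:
                    phrases.append(current)
            # Word score = degree / frequency; a phrase scores the sum of its words.
            freq, degree = defaultdict(int), defaultdict(int)
            for phrase in phrases:
                for word in phrase:
                    freq[word] += 1
                    degree[word] += len(phrase)
            word_score = {w: degree[w] / freq[w] for w in freq}
            ranked = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
            return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)

        stoplist = {"the", "of", "a", "an", "and", "is", "in", "for", "to", "on"}
        print(rake("Award of a public procurement contract in the open tender procedure.", stoplist))

    Swapping in a different stoplist is the only change needed to move the sketch to another language or domain, which is the sensitivity the paper addresses.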

    A Lexical Database of Portuguese Multiword Expressions

    This presentation focuses on an ongoing project which aims at the creation of a large lexical database of Portuguese multiword (MW) units, automatically extracted through the analysis of a balanced 50-million-word corpus, statistically interpreted with lexical association measures and validated by hand. This database covers different types of MW units, such as named entities, and lexical associations ranging from sets of favoured co-occurring forms with high corpus frequency and low cohesion to strongly lexicalized expressions with no, or minimal, variation. This new resource has a two-fold objective: to be an important research tool which supports the development of collocation typologies and their integration in a larger theory of MW units, and to be of major help in developing and evaluating language processing tools able to deal with MW expressions.
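    The abstract mentions lexical association measures; as a hedged illustration (a toy token list and pointwise mutual information, not the project's actual measures or corpus), adjacent word pairs could be scored like this:

        import math
        from collections import Counter

        def pmi_bigrams(tokens, min_count=2):
            # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ) over adjacent pairs.
            unigrams = Counter(tokens)
            bigrams = Counter(zip(tokens, tokens[1:]))
            n_uni = sum(unigrams.values())
            n_bi = sum(bigrams.values())
            scores = {}
            for (x, y), c in bigrams.items():
                if c < min_count:
                    continue  # rare pairs give unreliable estimates
                p_xy = c / n_bi
                p_x = unigrams[x] / n_uni
                p_y = unigrams[y] / n_uni
                scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        tokens = ("named entity recognition is hard ; named entity linking is harder ; "
                  "hard problems need named entity tools").split()
        print(pmi_bigrams(tokens))

    High-frequency, low-cohesion combinations and strongly lexicalized expressions fall at opposite ends of such a ranking, which is the continuum the database is meant to cover.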

    Conquering Language: Using NLP on a Massive Scale to Build High Dimensional Language Models from the Web

    Dictionaries contain only some of the information we need to know about a language. The growth of the Web, the maturation of linguistic processing tools, and the decline in the price of memory storage allow us to envision descriptions of languages that are much larger than before. We can conceive of building a complete language model for a language using all the text that is found on the Web for this language. This article describes our current project to do just that.

    Learning from text-based close call data

    A key feature of big data is the variety of data sources that are available, which includes not just numerical data but also image or video data, and even free text. The GB railway collects a large volume of free-text data daily from railway workers describing close call hazard reports: instances where an accident could have, but did not, occur. These close call reports contain valuable safety information which could be useful in managing safety on the railway, but which can be lost in the very large volume of data, far more than is viable for a human analyst to read. This paper describes the application of rudimentary natural language processing (NLP) techniques to uncover safety information from close calls. The analysis has shown that basic information extraction is possible using these rudimentary techniques, but it has also identified limitations that arise from using only basic techniques. Building on these findings, further research will look at how the techniques proven to date can be improved with more advanced NLP techniques coupled with machine learning.
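    As a hedged illustration of the kind of rudimentary extraction the paper refers to (the hazard categories, keywords, and report texts below are invented for the example, not the paper's data), free-text reports could be tagged against a small keyword lexicon:

        import re

        # Hypothetical hazard lexicon; real categories and terms would come
        # from railway safety analysts, not from this example.
        HAZARDS = {
            "trip": ["trip", "tripped", "uneven", "cable"],
            "electrical": ["live wire", "exposed wire", "shock"],
            "ppe": ["no helmet", "missing gloves", "hi-vis"],
        }

        def tag_report(text):
            # Return the hazard categories whose keywords occur in the report.
            text = text.lower()
            return sorted(cat for cat, terms in HAZARDS.items()
                          if any(re.search(r"\b" + re.escape(t) + r"\b", text)
                                 for t in terms))

        reports = [
            "Worker nearly tripped over an unsecured cable near the platform edge.",
            "Contractor observed without hi-vis close to a live wire.",
        ]
        for report in reports:
            print(tag_report(report), "-", report)

    The limitations noted in the paper follow directly from this style of matching: spelling variants, paraphrases, and negation all slip past a fixed keyword list, which motivates the move to more advanced NLP and machine learning.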

    State-of-the-Art in Weighted Finite-State Spell-Checking

    The following claims can be made about finite-state methods for spell-checking: 1) finite-state language models provide support for morphologically complex languages that word lists, affix stripping, and similar approaches do not provide; 2) weighted finite-state models have expressive power equal to other state-of-the-art string algorithms used by contemporary spell-checkers; and 3) finite-state models are at least as fast as other string algorithms for lookup and error correction. In this article, we use some contemporary non-finite-state spell-checking methods as a baseline and perform tests in light of these claims, to evaluate state-of-the-art finite-state spell-checking methods. We verify that finite-state spell-checking systems outperform the traditional approaches for English. We also show that the models for morphologically complex languages can be made to perform on par with English systems.
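    A weighted finite-state speller composes the misspelled input with an error-model transducer and a lexicon transducer; as a rough, non-finite-state stand-in for illustration only (toy vocabulary, uniform weights), candidate corrections could be ranked with a weighted edit distance:

        def weighted_edit_distance(a, b, w_sub=1.0, w_ins=1.0, w_del=1.0):
            # Dynamic-programming edit distance with per-operation weights,
            # standing in for the weights an error-model transducer would carry.
            prev = [j * w_ins for j in range(len(b) + 1)]
            for i, ca in enumerate(a, 1):
                cur = [i * w_del]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + w_del,
                                   cur[j - 1] + w_ins,
                                   prev[j - 1] + (0.0 if ca == cb else w_sub)))
                prev = cur
            return prev[-1]

        def suggest(word, vocabulary, k=3):
            # Rank vocabulary entries by their weighted distance to the input.
            return sorted(vocabulary, key=lambda v: weighted_edit_distance(word, v))[:k]

        vocab = ["finite", "state", "spell", "checker", "checking", "weighted"]
        print(suggest("cheker", vocab))

    The finite-state formulation in the article replaces the explicit word list with a lexicon automaton, which is what makes the approach viable for morphologically complex languages where the word list is effectively infinite.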

    Statistical Language Modelling

    Grammar-based natural language processing has reached a level where it can 'understand' language to a limited degree in restricted domains. For example, it is possible to parse textual material very accurately and assign semantic relations to parts of sentences. An alternative approach originates from the work of Shannon over half a century ago [41], [42]. This approach assigns probabilities to linguistic events, where mathematical models are used to represent statistical knowledge. Once models are built, we decide which event is more likely than the others according to their probabilities. Although statistical methods currently use a very impoverished representation of speech and language (typically finite state), it is possible to train the underlying models from large amounts of data. Importantly, such statistical approaches often produce useful results. Statistical approaches seem especially well-suited to spoken language which is often spontaneous or conversational and not readily amenable to standard grammar-based approaches
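    As a hedged illustration of assigning probabilities to linguistic events (a toy corpus, a bigram model, and add-one smoothing, not the models the chapter evaluates), the sketch below estimates conditional word probabilities from counts and scores alternative word sequences:

        import math
        from collections import Counter

        def train_bigram_lm(sentences):
            # Estimate P(w_i | w_{i-1}) with add-one (Laplace) smoothing.
            unigrams, bigrams = Counter(), Counter()
            for s in sentences:
                tokens = ["<s>"] + s.split() + ["</s>"]
                unigrams.update(tokens[:-1])
                bigrams.update(zip(tokens, tokens[1:]))
            vocab_size = len(set(w for s in sentences for w in s.split()) | {"</s>"})

            def logprob(sentence):
                tokens = ["<s>"] + sentence.split() + ["</s>"]
                return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
                           for a, b in zip(tokens, tokens[1:]))

            return logprob

        lm = train_bigram_lm(["recognise speech", "wreck a nice beach", "recognise the speech"])
        # The word sequence that better matches the training data gets the
        # higher (less negative) log-probability.
        print(lm("recognise speech"), lm("wreck a nice speech"))

    The same principle, with far richer training material and smoothing, underlies the speech and language models the chapter goes on to discuss.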

    Highly-parallelized simulation of a pixelated LArTPC on a GPU

    The rapid development of general-purpose computing on graphics processing units (GPGPU) is allowing the implementation of highly-parallelized Monte Carlo simulation chains for particle physics experiments. This technique is particularly suitable for the simulation of a pixelated charge readout for time projection chambers, given the large number of channels that this technology employs. Here we present the first implementation of a full microphysical simulator of a liquid argon time projection chamber (LArTPC) equipped with light readout and pixelated charge readout, developed for the DUNE Near Detector. The software is implemented with an end-to-end set of GPU-optimized algorithms. The algorithms have been written in Python and translated into CUDA kernels using Numba, a just-in-time compiler for a subset of Python and NumPy instructions. The GPU implementation achieves a speed-up of four orders of magnitude compared with the equivalent CPU version. The simulation of the current induced on 10^3 pixels takes around 1 ms on the GPU, compared with approximately 10 s on the CPU. The results of the simulation are compared against data from a pixel-readout LArTPC prototype.
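    As a hedged, much-simplified illustration of the Numba/CUDA pattern described above (not the paper's simulator: the geometry, pitch, and "deposit charge on the nearest pixel" step are placeholders with no drift, diffusion, or induced-current model, and running it requires a CUDA-capable GPU), a kernel written in Python can be compiled and launched like this:

        import numpy as np
        from numba import cuda

        @cuda.jit
        def deposit_charge(x, y, q, pixel_grid, pitch):
            # One thread per energy deposit: add its charge to the pixel it lands on.
            i = cuda.grid(1)
            if i < x.size:
                px = int(x[i] / pitch)
                py = int(y[i] / pitch)
                if 0 <= px < pixel_grid.shape[0] and 0 <= py < pixel_grid.shape[1]:
                    cuda.atomic.add(pixel_grid, (px, py), q[i])

        n = 100_000
        rng = np.random.default_rng(0)
        x = rng.uniform(0, 32.0, n).astype(np.float32)   # deposit positions [cm]
        y = rng.uniform(0, 32.0, n).astype(np.float32)
        q = rng.uniform(0, 1.0, n).astype(np.float32)    # deposited charge [arb.]
        pixels = np.zeros((8, 8), dtype=np.float32)      # 8x8 pixel plane, 4 cm pitch

        threads = 256
        blocks = (n + threads - 1) // threads
        deposit_charge[blocks, threads](x, y, q, pixels, np.float32(4.0))
        # All deposits fall inside the plane, so the collected charge should
        # match the generated charge.
        print(pixels.sum(), q.sum())

    The appeal of this pattern, as the abstract notes, is that the physics stays in (restricted) Python while Numba compiles it to CUDA kernels that run one thread per deposit, which is where the orders-of-magnitude speed-up comes from.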

    Analysis and Evaluation of Techniques for the Extraction of Classes in the Ontology Learning Process

    This paper analyzes and evaluates, in the context of ontology learning, some techniques for identifying and extracting candidate terms for the classes of a taxonomy. In addition, this work points out some inconsistencies that may occur in the preprocessing of the text corpus, and proposes techniques for obtaining good candidate terms for the classes of a taxonomy.
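    As a hedged illustration of one simple class-candidate technique (stopword-filtered term frequency over unigrams and bigrams on an invented corpus, not necessarily the techniques the paper evaluates), candidate terms could be extracted like this:

        import re
        from collections import Counter

        STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
                     "for", "with", "can", "at"}

        def candidate_terms(corpus, top_k=10):
            # Rank stopword-free unigrams and bigrams by frequency as crude
            # candidates for taxonomy classes.
            counts = Counter()
            for doc in corpus:
                tokens = re.findall(r"[a-z]+", doc.lower())
                counts.update(t for t in tokens if t not in STOPWORDS)
                counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:])
                              if not (set(pair) & STOPWORDS))
            return counts.most_common(top_k)

        corpus = [
            "The hotel offers rooms and suites with a restaurant and a spa.",
            "Guests can book rooms, suites and conference rooms at the hotel.",
        ]
        print(candidate_terms(corpus))

    The preprocessing steps the paper scrutinizes (tokenization, lowercasing, stopword filtering) sit exactly in the first lines of such a pipeline, which is why inconsistencies there propagate into the extracted class candidates.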