Topian 0.1 Reference Manual
This document describes Topian ("Topic-based Model layer for Xapian"), a software layer intended to add support for topical models to Xapian.
Free Software for research in Information Retrieval and Textual Clustering
This document provides an overview of the main Free ("Open Source") software of interest for research in Information Retrieval, as well as some background on the context. It also provides guidelines for choosing appropriate tools.
Inclusion de sens dans la représentation de documents textuels : état de l'art
This document gives an overview of the state of the art in representing meaning in text documents.
Large-scale extraction of brain connectivity from the neuroscientific literature
Motivation: In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles. One challenge for modern neuroinformatics is finding methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and the integration of such data into computational models. A key example of this is metascale brain connectivity, where results are not reported in a normalized repository. Instead, these experimental results are published in natural language, scattered among individual scientific publications. This lack of normalization and centralization hinders the large-scale integration of brain connectivity results. In this article, we present text-mining models to extract and aggregate brain connectivity results from 13.2 million PubMed abstracts and 630 216 full-text publications related to neuroscience. The brain regions are identified with three different named entity recognizers (NERs) and then normalized against two atlases: the Allen Brain Atlas (ABA) and the atlas from the Brain Architecture Management System (BAMS). We then use three different extractors to assess inter-region connectivity. Results: NERs and connectivity extractors are evaluated against a manually annotated corpus. The complete in litero extraction models are also evaluated against in vivo connectivity data from ABA with an estimated precision of 78%. The resulting database contains over 4 million brain region mentions and over 100 000 (ABA) and 122 000 (BAMS) potential brain region connections. This database drastically accelerates connectivity literature review, by providing a centralized repository of connectivity data to neuroscientists. Availability and implementation: The resulting models are publicly available at github.com/BlueBrain/bluima. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
Tool for robust stochastic parsing using optimal maximum coverage
This report presents a robust syntactic parser that is able to return a "correct" derivation tree even if the grammar cannot generate the input sentence. The following two-step solution is proposed: the finest corresponding most probable optimal maximum coverage is generated first, then the trees from this coverage are glued into one resulting tree. We discuss the implementation of this method with the SLP toolkit and libkp library.
Large-scale extraction of brain connectivity from the neuroscientific literature
In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles. One challenge for modern neuroinformatics is finding methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and the integration of such data into computational models. A key example of this is metascale brain connectivity, where results are not reported in a normalized repository. Instead, these experimental results are published in natural language, scattered among individual scientific publications. This lack of normalization and centralization hinders the large-scale integration of brain connectivity results. In this article, we present text-mining models to extract and aggregate brain connectivity results from 13.2 million PubMed abstracts and 630 216 full-text publications related to neuroscience. The brain regions are identified with three different named entity recognizers (NERs) and then normalized against two atlases: the Allen Brain Atlas (ABA) and the atlas from the Brain Architecture Management System (BAMS). We then use three different extractors to assess inter-region connectivity.
INtegrating SPEech acoustic and linguistic Constraints: Baseline System Development
In this report, we discuss the initial issues addressed in a research project aiming at the development of an advanced natural speech recognition system for the automatic processing of telephone directory requests. This multi-faceted project involves (1) text processing (labeling and tagging) of a large database of telephone-based natural voice requests (including all kinds of peculiarities), (2) development of robust acoustic models, (3) integrating advanced natural language (syntactic and semantic) constraints, (4) detecting and dealing with a large number of out-of-vocabulary words (proper names), and (5) testing of the resulting system on natural queries. All this work will be performed on the basis of a database containing prompted (read) speech and (simulated) natural requests to an information service. This report describes the initial steps that were required to set up a reasonable baseline system and a good research and evaluation framework. More specifically, a significant amount of time was devoted to proper text processing of speaker request transcriptions, in order to create the basis necessary for the lexical and linguistic modeling, as well as for the evaluation of recognition results.
Finding instabilities in the community structure of complex networks
The problem of finding clusters in complex networks has been extensively studied by mathematicians, computer scientists and, more recently, by physicists. Many of the existing algorithms partition a network into clear clusters, without overlap. We here introduce a method to identify the nodes lying "between clusters" and that allows for a general measure of the stability of the clusters. This is done by adding noise over the weights of the edges of the network. Our method can in principle be applied with any clustering algorithm, provided that it works on weighted networks. We present several applications on real-world networks using the Markov Clustering Algorithm (MCL). Comment: 4 pages, 5 figures
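The noise-based stability idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the noise model or the stability measure, so the multiplicative uniform noise, the per-edge co-clustering frequency, and the parameter names (threshold, sigma, runs) are all assumptions. MCL is replaced by a deliberately simple stand-in (connected components over edges above a weight threshold) to keep the example self-contained.

```python
import random
from collections import defaultdict


def cluster(edges, threshold):
    """Stand-in clustering (NOT MCL): connected components over
    edges whose weight is at least `threshold`, via union-find."""
    parent = {}

    def find(n):
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for (u, v), w in edges.items():
        ru, rv = find(u), find(v)  # registers both endpoints
        if w >= threshold:
            parent[ru] = rv
    return {n: find(n) for n in list(parent)}


def edge_stability(edges, threshold=0.5, sigma=0.3, runs=200, seed=0):
    """For each edge, return the fraction of noisy runs in which its two
    endpoints fall in the same cluster. The noise model (multiplicative,
    uniform in [1 - sigma, 1 + sigma]) is an assumption for illustration."""
    rng = random.Random(seed)
    together = defaultdict(int)
    for _ in range(runs):
        noisy = {e: w * rng.uniform(1 - sigma, 1 + sigma)
                 for e, w in edges.items()}
        label = cluster(noisy, threshold)
        for (u, v) in edges:
            together[(u, v)] += label[u] == label[v]
    return {e: together[e] / runs for e in edges}


# Hypothetical toy network: two tight triangles and a node "x"
# weakly attached to both of them.
edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("a", "x"): 0.5, ("d", "x"): 0.5,
}
stability = edge_stability(edges)
```

In this toy network, edges inside a triangle keep their endpoints together in every run, while the edges touching "x" do so only in some runs. Under these assumptions, that marks "x" as a node lying between clusters.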
Offline grammar-based recognition of handwritten sentences
This paper proposes a sequential coupling of a Hidden Markov Model (HMM) recognizer for offline handwritten English sentences with a probabilistic bottom-up chart parser using Stochastic Context-Free Grammars (SCFG) extracted from a text corpus. Based on extensive experiments, we conclude that syntax analysis helps to improve recognition rates significantly.