
    Construction des langages d'indexation: aspects théoriques

    This contribution is the culmination of a research project surveying the studies devoted, in the specialized literature, to the theoretical foundations of indexing languages. More than a hundred references have been gathered, all of which are listed at the end of this article. After presenting his method of investigation, the author offers a synthesis of the surveyed works, tracing the evolution over the past forty years of the theoretical principles governing the design of classifications, chain indexing languages and thesauri, as well as the relationship of linguistic and mathematical theories to indexing languages.

    Using Search Term Positions for Determining Document Relevance

    The technological advancements in computer networks and the substantial reduction of their production costs have caused a massive explosion of digitally stored information. In particular, textual information is becoming increasingly available in electronic form. Finding text documents dealing with a certain topic is not a simple task. Users need tools to sift through non-relevant information and retrieve only pieces of information relevant to their needs. The traditional methods of information retrieval (IR) based on search term frequency have reached their limitations, and novel ranking methods based on hyperlink information are not applicable to unlinked documents. Retrieving documents based on the positions of search terms in a document has the potential to yield improvements, because the other terms in the environment where a search term appears (i.e. its neighborhood) are considered. That is to say, the grammatical type, position and frequency of other words help to clarify and specify the meaning of a given search term. However, the required additional analysis makes position-based methods slower than methods based on term frequency, and they require more storage to save the positions of terms. These drawbacks directly affect the performance of the most user-critical phase of the retrieval process, namely query evaluation time, which explains the scarce use of positional information in contemporary retrieval systems. This thesis explores the possibility of extending traditional information retrieval systems with positional information in an efficient manner that permits us to optimize retrieval performance by handling term positions at query evaluation time. To achieve this, several abstract representations of term positions that efficiently store and operate on positional data are investigated. In the Gauss model, descriptive statistics are used to estimate term positional information, because they minimize the influence of outliers and irregularities in the data. The Fourier model is based on Fourier series to represent positional information. In the Hilbert model, functional analysis methods are used to provide reliable term position estimates and simple mathematical operators for handling positional data. The proposed models are experimentally evaluated using standard resources of the IR research community (Text REtrieval Conference, TREC). All experiments demonstrate that the use of positional information can enhance the quality of search results, and the suggested models outperform state-of-the-art retrieval utilities. The term position models open new possibilities for analyzing and handling textual data; for instance, document clustering and compression of positional data based on these models could be interesting topics for future research.
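    The following is a minimal Python sketch of the intuition behind the Gauss model described above: each term's occurrences in a document are summarized by the mean and standard deviation of their positions, and two query terms score higher when their positional distributions overlap. The function names and the scoring formula are illustrative assumptions, not the thesis's actual definitions.

        import math

        def positional_stats(positions):
            """Summarize a term's occurrences by the mean and standard
            deviation of its token positions: a compact description of
            where the term tends to appear, instead of the full list."""
            n = len(positions)
            mean = sum(positions) / n
            var = sum((p - mean) ** 2 for p in positions) / n
            return mean, math.sqrt(var)

        def proximity_score(pos_a, pos_b):
            """Hypothetical proximity score: the closer the positional
            distributions of two query terms, the higher the score."""
            mean_a, sd_a = positional_stats(pos_a)
            mean_b, sd_b = positional_stats(pos_b)
            spread = sd_a + sd_b + 1.0   # +1 avoids division by zero
            return 1.0 / (1.0 + abs(mean_a - mean_b) / spread)

        # Positions of two query terms in one document.
        print(proximity_score([4, 87, 91], [5, 88, 92]))  # co-occurring: high
        print(proximity_score([4], [950]))                # far apart: low

    Compared with storing every position, two floats per term and document suffice here, which is the kind of compact abstraction that can make position handling affordable at query evaluation time.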

    Novel database design for extreme scale corpus analysis

    This thesis presents the patterns and methods uncovered in the development of a new scalable corpus database management system, LexiDB, which can handle the ever-growing size of modern corpus datasets. Initially, an exploration of existing corpus data systems is conducted, examining their usage in corpus linguistics as well as their underlying architectures. From this survey, it is identified that existing systems are designed primarily to be vertically scalable (i.e. scalable through the use of bigger, better and faster hardware). This motivates a wider examination of modern distributable database management systems and of the information retrieval techniques used for indexing and retrieval. These techniques are modified and adapted into an architecture that can be horizontally scaled to handle ever bigger corpora. Based on this architecture, several new methods for querying and retrieval that improve upon existing techniques are proposed as modern approaches to querying extremely large annotated text collections for corpus analysis. The effectiveness of these techniques and the scalability of the architecture are evaluated, and it is demonstrated that the architecture is comparably scalable to two modern NoSQL database management systems and outperforms existing corpus data systems in token-level pattern querying whilst still supporting character-level pattern matching.
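    As a minimal sketch of the general technique such systems build on, the Python fragment below answers a token-level pattern query (e.g. "a determiner immediately followed by a noun") using inverted indexes from tokens and part-of-speech tags to corpus positions. It is an assumption-laden toy, not LexiDB's actual storage layout or query language.

        from collections import defaultdict

        # Toy annotated corpus: one (token, part-of-speech) pair per position.
        corpus = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
                  ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]

        # Inverted indexes from token / POS tag to occurrence positions.
        tok_index, pos_index = defaultdict(set), defaultdict(set)
        for i, (tok, pos) in enumerate(corpus):
            tok_index[tok].add(i)
            pos_index[pos].add(i)

        def match(pattern):
            """Start positions where a sequence of constraints matches.
            Each constraint is ('token', value) or ('pos', value)."""
            indexes = {"token": tok_index, "pos": pos_index}
            kind, value = pattern[0]
            starts = set(indexes[kind][value])
            # Shift each later constraint's postings back and intersect.
            for offset, (kind, value) in enumerate(pattern[1:], start=1):
                starts &= {p - offset for p in indexes[kind][value]}
            return sorted(starts)

        # Every DET immediately followed by a NOUN: positions 0 and 4.
        print(match([("pos", "DET"), ("pos", "NOUN")]))

    Because the postings for each token or tag are independent sets, they can be partitioned across machines and intersected locally, which is one way an index of this shape lends itself to horizontal scaling.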

    DEXTER: A workbench for automatic term extraction with specialized corpora

    Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distributions of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded in the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms. Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P. Periñán-Pascual, C. (2018). DEXTER: A workbench for automatic term extraction with specialized corpora. Natural Language Engineering, 24(2), 163-198. https://doi.org/10.1017/S1351324917000365
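    A minimal sketch of the kind of corpus-comparison ranking the abstract alludes to: candidates that pass a stopword filter are scored by how much more frequent they are in the domain corpus than in a general reference corpus. This "weirdness"-style ratio is only analogous in spirit to DEXTER's salience/relevance/cohesion metric; the function and its smoothing are assumptions for illustration.

        from collections import Counter

        def rank_term_candidates(domain_tokens, general_tokens, stopwords):
            """Rank single-word candidates by their relative frequency in
            the domain corpus versus a general corpus (a 'weirdness' ratio).
            Illustrative only; not DEXTER's actual term-ranking metric."""
            dom, gen = Counter(domain_tokens), Counter(general_tokens)
            dom_n, gen_n = len(domain_tokens), len(general_tokens)
            scores = {}
            for term, freq in dom.items():
                if term in stopwords:                    # shallow lexical filter
                    continue
                rel_dom = freq / dom_n
                rel_gen = (gen[term] + 1) / (gen_n + 1)  # add-one smoothing
                scores[term] = rel_dom / rel_gen
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        domain = "the lexeme aligns with the target lexeme in the corpus".split()
        general = "the cat sat on the mat and the dog ran away".split()
        print(rank_term_candidates(domain, general, {"the", "with", "in"})[:3])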

    CTX - ein Verfahren zur computergestützten Texterschließung

    Together with Edith Kroupa and Gerald Keil, Zimmermann edited this research report for the BMFT (the German Federal Ministry for Research and Technology), which centers on the development of the computer-assisted text indexing system CTX. The report first discusses methods and problems of information retrieval in detail. This is followed by a detailed description of the foundations, functions and tasks of CTX. The application-oriented part presents a laboratory application in the field of "data protection", with emphasis on the topics of text type, dictionary work and descriptor extraction, as well as a comparison with the PASSAT system.

    Approximately Optimum Search Trees in External Memory Models

    We examine optimal and near-optimal solutions to the classic binary search tree problem of Knuth. We are given a set of n keys (originally known as words), B_1, B_2, ..., B_n, and 2n+1 frequencies: {p_1, p_2, ..., p_n} represent the probabilities of searching for each given key, and {q_0, q_1, ..., q_n} represent the probabilities of searching in the gaps between and outside of these keys. We have that Σ_{i=0}^{n} q_i + Σ_{i=1}^{n} p_i = 1. We also assume, without loss of generality, that q_{i-1} + p_i + q_i != 0 for any i ∈ {1,...,n}. The keys must make up the internal nodes of the tree while the gaps make up the leaves. Our goal is to construct a binary search tree such that the expected cost of a search is minimized. First, we re-examine an approximate solution of Güttler, Mehlhorn and Schneider which was shown to have a worst-case bound of c * H + 2, where c >= 1/H(1/3,2/3) ~ 1.08 and H = Σ_{i=1}^{n} p_i * log_2(1/p_i) + Σ_{j=0}^{n} q_j * log_2(1/q_j) is the entropy of the distribution. We give an improved worst-case bound on the heuristic of H + 4. Next, we examine the optimum binary search tree problem under a model of external memory. We use the Hierarchical Memory Model of Aggarwal et al. The model has an unlimited number of registers, R_1, R_2, ..., each with its own location in memory (a positive integer). We have a set of memory sizes m_1, m_2, ..., m_l which are monotonically increasing. Each memory level has a finite size except m_l, which we assume has infinite size. Each memory level has an associated cost of access c_1, c_2, ..., c_l, and we assume that c_1 < c_2 < ... < c_l. We propose two approximate solutions which run in O(n) time, where n is the number of words in our data set. Using these methods, we improve upon a bound given in Thite's 2001 thesis under the related HMM_2 model in the approximate setting. We also examine the related problem of binary trees on multisets of probabilities, where keys are unordered and we do not differentiate between which probabilities must be leaves and which must be internal nodes. We provide a simple O(n log_2(n)) algorithm that is within an additive (n+1)(2n) of optimal on a multiset of n keys.
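    For reference, the exact problem being approximated admits the classic dynamic program sketched below in Python (standard textbook material, not the thesis's approximation algorithms): e[i][j] is the minimum expected cost of a subtree over keys B_i..B_j, and w[i][j] is its total probability weight. This simple form runs in O(n^3) time; Knuth's speedup reduces it to O(n^2).

        def optimal_bst_cost(p, q):
            """Expected search cost of an optimum BST (Knuth's problem).
            p[1..n]: key probabilities (p[0] unused); q[0..n]: gap probabilities.
            Recurrence: e[i][j] = w[i][j] + min over roots r of
            e[i][r-1] + e[r+1][j], with an empty subtree costing q[i-1]."""
            n = len(p) - 1
            e = [[0.0] * (n + 1) for _ in range(n + 2)]
            w = [[0.0] * (n + 1) for _ in range(n + 2)]
            for i in range(1, n + 2):       # empty subtree = a single gap leaf
                e[i][i - 1] = w[i][i - 1] = q[i - 1]
            for length in range(1, n + 1):
                for i in range(1, n - length + 2):
                    j = i + length - 1
                    w[i][j] = w[i][j - 1] + p[j] + q[j]
                    e[i][j] = w[i][j] + min(e[i][r - 1] + e[r + 1][j]
                                            for r in range(i, j + 1))
            return e[1][n]

        # The worked example from CLRS: optimum expected cost 2.75.
        p = [0, 0.15, 0.10, 0.05, 0.10, 0.20]
        q = [0.05, 0.10, 0.05, 0.05, 0.05, 0.10]
        print(optimal_bst_cost(p, q))       # prints ~2.75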

    An informational approach to document and intelligent retrieval systems : problems and alternatives for representing subjects in the Qur'anic text

    This study is a response to the widespread dissatisfaction of Information Scientists in the Muslim World, who feel that Islamic literature deserves and requires an Islamic classification scheme. Conceptually, Muslim Information Scientists have attempted to establish an analytical subject bibliography which could help to develop a particular classification scheme. This has involved providing abstracts, directories and indexes to record the subject contents of all the materials relating to Islamic Studies. However, the classical Qur'anic exegeses and Hadith collections represent a particular problem for Information Scientists: indexing them requires an initial operational list of subject headings for both the Qur'anic and Hadith texts. This study is based on an investigation of the terminology in the Qur'anic text for the purpose of designing a Qur'anic retrieval system. The study makes use of conceptual verses and words as partial examples for the required task. These examples are used to test the factors affecting the design at both the manual and automatic levels. At the manual level - the stage of presenting the Qur'anic text in printed form - the examples are used to examine the effects of Qur'anic terminology on the commentators and to see how it affects the performance of the retrieval system. Also, the characteristics of the Arabic language, as represented in Qur'anic vocabulary, are examined against the problems known to be encountered in constructing an efficient information retrieval system. At the automatic level - the stage of presenting the Qur'anic text on a screen - the examples are used to examine the possibility of the Qur'an, in its stylistic form, being processed by computer. The study is subdivided into six chapters. The first chapter outlines the demands that are placed upon Muslim Information Scientists; it also gives a brief overview of the background to current research on Islamic literature and shows the methodological framework used in the present study. The second chapter highlights the major philological, historical and theological aspects as indicated by various interpretations and tests the effect of these opinions on the performance of the retrieval system. The third chapter analyzes the function of vocabulary control as applied to Qur'anic terminology and examines such control in relation to features of the Arabic (Qur'anic) language. The fourth chapter examines the various treatments in the computational analysis area in relation to the Qur'anic style of calligraphy and structure. The fifth chapter presents guidelines and recommendations for establishing the Qur'anic retrieval system. Finally, the sixth chapter offers two examples of the Qur'anic retrieval system as applied to the natural and social sciences.