
    Representação de conteúdo via indexação automática em textos integrais em língua portuguesa

    The possibility of automatic derived indexing of texts in Portuguese, from their full text, is verified. Goffman's Transition Formula is applied to 10 articles in the field of Bibliometrics, and a probabilistic indexing algorithm is formulated. Goffman's Transition Formula proves perfectly applicable to Portuguese, pointing to a word-frequency region in which the words indicative of the content of the analysed articles are concentrated.
    Keywords: Information retrieval. Automatic derived indexing. Goffman's Transition Formula.
    Representation of contents by the automatic indexing process of full texts in Portuguese language
    Abstract: The possibility of automatic derived indexing of full texts in Portuguese is verified. Ten papers in Bibliometrics were indexed and their different parts considered for quantitative and qualitative analysis. Structure and distribution patterns of words were studied. Goffman's transition formula proved to be adequate as a starting point for the indexing algorithm, which yielded, in all papers, a concentration zone for semantically loaded terms. The algorithm worked as an uncertainty reducer, leading to the semantically important words.
    Keywords: Information retrieval. Automatic derived indexing. Goffman's transition formula
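
    The formula named in the abstract has a standard form derived from Zipf's second law: if I1 is the number of words occurring exactly once in the text, the transition point is n = (-1 + sqrt(1 + 8*I1)) / 2, and content-bearing words are sought in a frequency band around n. Below is a minimal sketch of that computation, assuming this standard form; the band width is an illustrative choice, not the paper's algorithm:

```python
from collections import Counter
from math import sqrt

def goffman_transition_point(tokens):
    """Goffman's transition point, derived from Zipf's second law:
    with I1 the number of words occurring exactly once, the transition
    frequency n satisfies n(n + 1)/2 = I1."""
    i1 = sum(1 for f in Counter(tokens).values() if f == 1)
    return (-1 + sqrt(1 + 8 * i1)) / 2

def candidate_index_terms(tokens, width=0.2):
    """Words whose frequency falls in a band around the transition
    point; the +/-20% band is an arbitrary illustrative choice."""
    freq = Counter(tokens)
    n = goffman_transition_point(tokens)
    lo, hi = n * (1 - width), n * (1 + width)
    return sorted(w for w, f in freq.items() if lo <= f <= hi)
```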

    Indexação automática e manual: revisão de literatura

    The various national and foreign studies that evaluate the quality of manual and automatic indexing are reviewed, with regard to the techniques and sources employed for extracting significant terms and to the retrieval capability of the indexing language in databases.
    Automatic and manual indexing: a literature review

    A Hybrid Model for Document Retrieval Systems.

    A methodology for the design of document retrieval systems is presented. First, a composite index-term weighting model is developed, based on term-frequency statistics including document frequency, relative frequency within the document, and relative frequency within the collection; its coefficients can be adjusted to fit different indexing environments. Then, a composite retrieval model is proposed that processes a user's information request, expressed as a weighted Phrase-Oriented Fixed-Level Expression (POFLE), which may apply more than Boolean operators, in two phases. First, documents topically relevant to the request are found by a descriptor-matching mechanism that incorporates a partial-matching facility based on a structurally restricted relationship imposed by the indexing model, and is more general than the matching functions of the traditional Boolean and vector space models. Second, these topically relevant documents are ranked, by means of two types of heuristic-based selection rules and a knowledge-based evaluation function, in descending order of a preference score that predicts the combined effect of user preference for quality, recency, fitness, and reachability of documents.
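
    As a rough illustration of the composite weighting idea, the sketch below combines the three frequency statistics the abstract names under tunable coefficients. The particular linear combination and the coefficient names are assumptions made for illustration, not the POFLE weighting model itself:

```python
import math

def composite_weight(tf, doc_len, df, cf, n_docs, coll_len,
                     alpha=1.0, beta=1.0, gamma=1.0):
    """Illustrative composite index-term weight (hypothetical mix).

    tf: term occurrences in the document; doc_len: document length;
    df: documents containing the term; cf: term occurrences in the
    collection; n_docs: collection size; coll_len: collection length.
    """
    rf_doc = tf / doc_len                    # relative frequency within document
    idf = math.log((n_docs + 1) / (df + 1))  # document-frequency evidence
    rf_coll = cf / coll_len                  # relative frequency within collection
    # Coefficients let the mix be tuned to a given indexing environment.
    return alpha * rf_doc + beta * idf - gamma * rf_coll
```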

    Information Retrieval Performance Enhancement Using The Average Standard Estimator And The Multi-criteria Decision Weighted Set

    Information retrieval is much more challenging than traditional retrieval from small document collections. The main difference is the importance of correlations between related concepts in complex data structures, which have been studied by several information retrieval systems. This research began with a comprehensive review and comparison of several techniques of matrix dimensionality estimation and their respective effects on retrieval performance using singular value decomposition and latent semantic analysis. Two novel techniques are introduced to enhance intrinsic dimensionality estimation: the Multi-criteria Decision Weighted model, which estimates matrix intrinsic dimensionality for large document collections, and the Average Standard Estimator (ASE), which estimates data intrinsic dimensionality from the singular value decomposition (SVD). ASE estimates the level of significance of the singular values resulting from the decomposition, on the assumption that variables with deep relations have sufficient correlation and that only relationships with high singular values are significant and should be retained. Experimental results over all possible dimensions indicated that ASE improved matrix intrinsic dimensionality estimation by including the effect both of the rate of decrease of the singular values and of random noise distracters. Analysis based on selected performance measures indicates that for each document collection there is a region of lower dimensionalities associated with improved retrieval performance; however, the various performance measures clearly disagreed on which model performed best. The introduction of the weighted model and Analytical Hierarchy Processing (AHP) analysis helped rank the dimensionality estimation techniques and facilitated satisfying the overall model goals by balancing conflicting constraints and information retrieval priorities. ASE provided the best estimate of MEDLINE intrinsic dimensionality among all the dimensionality estimation techniques tested, improving precision and relative relevance by 10.2% and 7.4% respectively. AHP analysis indicates that ASE and the weighted model ranked best among the methods, satisfying 30.3% and 20.3% of the overall model goals on MEDLINE and 22.6% and 25.1% on CRANFIELD. The weighted model improved MEDLINE relative relevance by 4.4%, while the scree plot, the weighted model, and ASE provided better estimates of intrinsic dimensionality for the CRANFIELD collection than Kaiser-Guttman and the percentage-of-variance method. ASE also provided a better estimate of CISI intrinsic dimensionality than all the other tested methods, since all methods except ASE tend to underestimate the intrinsic dimensionality of the CISI collection; ASE improved CISI average relative relevance and average search length by 28.4% and 22.0% respectively. This research provided evidence that a system using a weighted multi-criteria performance evaluation technique achieves better overall performance than a single-criterion ranking model. Thus, the weighted multi-criteria model with dimensionality reduction provides a more efficient implementation for information retrieval than a full-rank model.
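
    The abstract does not give ASE's formula, so the sketch below uses a deliberately crude stand-in for it (truncate where singular values fall below their own mean) just to show where such an estimator sits in the SVD/LSA pipeline:

```python
import numpy as np

def lsa_truncate(term_doc, estimator=None):
    """Reduce a term-document matrix by truncated SVD.

    `estimator` maps the singular-value spectrum to a rank k. The
    default keeps values above the spectrum mean -- a crude stand-in
    for a significance estimator such as ASE, whose exact form is
    not given in the abstract.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    if estimator is None:
        estimator = lambda sv: max(1, int(np.sum(sv > sv.mean())))
    k = estimator(s)
    return u[:, :k], s[:k], vt[:k, :], k

# Toy usage: a random 100-term x 30-document matrix.
rng = np.random.default_rng(0)
a = rng.random((100, 30))
u, s, vt, k = lsa_truncate(a)
print(f"retained {k} of {min(a.shape)} dimensions")
```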

    Language Models and Smoothing Methods for Information Retrieval

    Language Models and Smoothing Methods for Information Retrieval (Sprachmodelle und Glättungsmethoden für Information Retrieval)
    Najeeb A. Abdulmutalib
    Summary of the Dissertation: Retrieval models form the theoretical foundation of effective information retrieval methods. Statistical language models are a new type of retrieval model that has been studied for about ten years. Compared with other models, they can be adapted more easily to specific tasks and often deliver better retrieval results. This dissertation first presents a new statistical language model that explicitly takes document length into account. Because observation data are sparse, smoothing methods play an important role in language models; here, too, we introduce a new method called 'exponential smoothing'. An experimental comparison with competing approaches shows that our new methods deliver superior results, particularly on collections with strongly varying document lengths. In a second step we extend our approach to XML retrieval, which deals with hierarchically structured documents and where focused retrieval should find the smallest document parts that completely answer a query. Here, too, an experimental comparison with other approaches demonstrates the quality of our newly developed methods. The third part of the thesis compares language models with classical tf*idf weighting. Besides yielding a better understanding of existing smoothing methods, this comparison leads us to develop 'empirical smoothing'. The retrieval experiments conducted with it show improvements over other smoothing methods.
    Abstract of the Dissertation: Designing an effective retrieval model that can rank documents accurately for a given query has been a central problem in information retrieval for several decades. An optimal retrieval model that is both effective and efficient, and that can learn from feedback information over time, is needed. Language models are a new generation of retrieval models and have been applied over the last ten years to many different information retrieval problems. Compared with traditional models such as the vector space model, they can be more easily adapted to model non-traditional and complex retrieval problems, and empirically they tend to achieve comparable or better performance. Developing new language models is currently an active research area in information retrieval. In the first stage of this thesis we present a new language model based on an odds formula, which explicitly incorporates document length as a parameter. To address the problem of data sparsity, where there is rarely enough data to accurately estimate the parameters of a language model, smoothing gives a way to combine less specific but more accurate information with more specific but noisier data. We introduce a new smoothing method called exponential smoothing, which can be combined with most language models. We present experimental results for various language models and smoothing methods on a collection with large document-length variation, and show that our new methods compare favourably with the best approaches known so far. We also discuss the effect of the collection on the retrieval function, investigating the performance of well-known models and comparing results obtained on two variant collections. In the second stage we extend the model from flat text retrieval to XML retrieval, since there is a need for content-oriented XML retrieval systems that can efficiently and effectively store, search, and retrieve information from XML document collections. Compared to traditional information retrieval, where whole documents are usually indexed and retrieved as single complete units, information retrieval from XML documents creates additional challenges: by exploiting the logical document structure, XML allows for more focused retrieval that identifies elements, rather than documents, as answers to user queries. Finally, we show how smoothing plays a role very similar to that of the idf function: besides its obvious role, smoothing also improves the accuracy of the estimated language model. The within-document frequency and the collection frequency of a term actually influence the probability of relevance, which led us to a new class of smoothing functions based on numeric prediction, which we call empirical smoothing. Its retrieval quality outperforms that of other smoothing methods.
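
    The odds-based model and exponential smoothing are defined in the dissertation itself; as a reference point, the sketch below shows the generic query-likelihood scheme such methods plug into, with Jelinek-Mercer interpolation standing in as the smoothing step (document length enters through the maximum-likelihood estimate):

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_tokens, coll_counts, coll_len, lam=0.5):
    """Score a document by log P(query | document model).

    The document's maximum-likelihood term probability is interpolated
    with the collection model; Jelinek-Mercer interpolation is shown as
    one standard choice, where the dissertation's exponential smoothing
    would be substituted.
    """
    doc_counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0
        p_coll = coll_counts.get(t, 0) / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen even in the collection
        score += math.log(p)
    return score
```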

    IfD - information for discrimination

    The problem of term mismatch and ambiguity has long been a serious, outstanding one in IR. It can lead the system to formulate an incomplete and imprecise query representation, resulting in retrieval failure. Many query reformulation methods have been proposed to address the problem. These methods employ term classes that are considered related to individual query terms. They are hindered by the computational cost of term classification, and by the fact that the terms in a class are generally related to one specific query term belonging to that class rather than relevant to the context of the query as a whole. In this thesis we propose a series of methods for automatic query reformulation (AQR). The methods constitute a formal model called IfD, standing for Information for Discrimination. In IfD, each discrimination measure is modelled as information contained in terms supporting one of two opposite hypotheses. The extent of association of terms with the query can thus be defined directly on the basis of this discrimination. The strength of association of candidate terms with the query can then be computed, and good terms selected to enhance the query. Justifications for IfD are presented from several aspects: formal interpretations of information for discrimination are introduced to show its soundness; criteria are put forward to show its rationality; properties of discrimination measures are analysed to show its appropriateness; examples are examined to show its usability; extension is discussed to show its potential; implementation is described to show its feasibility; comparisons with other methods are made to show its flexibility; and improvements in retrieval performance are exhibited to show its power. Our conclusion is that the advantages and promise of IfD should make it an indispensable methodology for AQR, which we believe can be an effective technique for improving retrieval performance.
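
    IfD's central move, measuring the information a candidate term contributes toward one of two opposing hypotheses, is reminiscent of a log-likelihood-ratio score between query-related text and the collection at large. The sketch below shows that generic pattern only; it is not the thesis's actual IfD measure, and the query-related/background split is an assumption:

```python
import math
from collections import Counter

def discrimination_score(term, rel_tokens, coll_tokens):
    """Log-likelihood ratio under two opposing hypotheses: the term
    behaves as in query-related text vs. as in the collection at
    large. Add-one smoothing over a shared vocabulary avoids zeros."""
    rel, coll = Counter(rel_tokens), Counter(coll_tokens)
    vocab = len(set(rel) | set(coll))
    p_rel = (rel[term] + 1) / (len(rel_tokens) + vocab)
    p_coll = (coll[term] + 1) / (len(coll_tokens) + vocab)
    return math.log(p_rel / p_coll)

def expand_query(query, rel_tokens, coll_tokens, k=5):
    """Append the k candidate terms that discriminate best."""
    candidates = set(rel_tokens) - set(query)
    ranked = sorted(candidates,
                    key=lambda t: discrimination_score(t, rel_tokens, coll_tokens),
                    reverse=True)
    return list(query) + ranked[:k]
```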