82 research outputs found

    Concepts and Effectiveness of the Cover Coefficient Based Clustering Methodology for Text Databases

    An algorithm for document clustering is introduced. The algorithm's base concept, the Cover Coefficient (CC) concept, provides a means of estimating the number of clusters within a document database. The CC concept is also used to identify cluster seeds, to form clusters around those seeds, and to calculate Term Discrimination and Document Significance Values (TDVs, DSVs), which are used to optimize document descriptions. The CC concept also relates indexing and clustering analytically. Experimental results indicate that clustering performance, in terms of the percentage of useful information accessed (precision), is forty percent higher than that of random assignment of documents to clusters, with an accompanying reduction in search space. The experiments have validated the indexing-clustering relationships and shown improvements in retrieval precision when the TDV and DSV optimizations are used.
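
    The CC computation can be sketched compactly. The following is a minimal illustration, assuming a binary document-by-term matrix; the toy data and variable names are ours, not the paper's, though the construction follows the usual cover-coefficient normalisation by inverse row and column sums.

```python
# Sketch of the Cover Coefficient (CC) matrix for a toy document-term matrix.
# c[i, j] estimates how much document i is "covered" by document j; the sum
# of the diagonal (decoupling coefficients) estimates the number of clusters.
import numpy as np

D = np.array([            # 4 documents x 5 terms, binary toy data
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

alpha = 1.0 / D.sum(axis=1)                    # inverse row sums
beta = 1.0 / D.sum(axis=0)                     # inverse column sums
C = (alpha[:, None] * D) @ (beta[:, None] * D.T)

n_clusters = np.trace(C)                       # estimated number of clusters
print(C.round(2))
print(f"estimated number of clusters: {n_clusters:.2f}")
```

    Each row of C sums to one, so the diagonal (decoupling) entries measure how much each document covers itself relative to the rest of the collection; the cluster-seed selection described in the paper builds on these quantities.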

    Information retrieval in XML documents: taking links into account to select relevant elements

    Our work falls within the field of information retrieval (IR), more specifically information retrieval in semi-structured XML documents. Exploiting the available XML documents effectively requires taking their structural dimension into account, and this dimension has led to new challenges in the IR field. Unlike classical IR approaches, which focus on retrieval over unstructured content, XML IR combines textual and structural information to carry out various retrieval tasks. Several approaches exploiting these types of evidence have been proposed; they are mainly based on classical IR models adapted to XML documents. The XML structure has been used to provide focused access to documents, returning document components (for example, sections or paragraphs) instead of whole documents in response to a user query. In traditional IR, the similarity measure is generally based on textual information: documents are ranked by their degree of relevance using measures such as term similarity or term probability. However, other sources of evidence can be considered when searching for relevant information in documents. For example, hyperlinks have been widely exploited in Web IR, yet despite their popularity on the Web, few approaches exploiting this source of evidence have been proposed in the context of XML IR. The goal of our work is to propose approaches for using links as a source of evidence in XML information retrieval. This thesis aims to answer the following research questions: 1. Can links be considered a source of evidence in the context of XML IR? 2. Does using link-analysis algorithms in XML IR improve result quality, in particular on the Wikipedia collection? 3. Which types of links can best improve the relevance of search results? 4. How should the link score of the returned elements be computed? Should we consider document-to-document links or, more precisely, element-to-element links? What weight should navigational links carry relative to hierarchical links? 5. What is the impact of using links in a global versus a local context? 6. How should the link score be integrated into the final score of the returned XML elements? 7. What is the impact of the quality of the top-ranked results on the behaviour of the proposed formulas? To answer these questions, we conducted a statistical study of the search results returned by the "DALIAN" information retrieval system on the test collection provided by INEX; it clearly showed that links are a signal of element relevance in the context of XML IR. We also implemented three link-analysis algorithms (PageRank, HITS and SALSA), which allowed us to carry out a comparative study showing that query-dependent approaches outperform global-context approaches.
In this thesis we proposed three formulas for computing the link score: the first is called "Topical PageRank"; the second is distance-based; and the third is based on weighted links. We also proposed three combination formulas: a linear formula, a Dempster-Shafer formula, and a fuzzy-based formula. Finally, we carried out a series of experiments, all of which showed that the proposed approaches improve the relevance of the results across the tested configurations; that query-dependent approaches outperform global-context approaches; that approaches exploiting element-to-element links obtain good results; and that combination formulas based on uncertainty for computing the final scores of XML elements achieve good performance.
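
    Of the three combination formulas, the linear one is the simplest to illustrate. Below is a minimal sketch, assuming precomputed content and link scores per XML element; the interpolation weight and the element paths are hypothetical, and the thesis's exact formula may differ.

```python
# Sketch of linear combination of a content score with a link-analysis score
# for ranked XML elements; lam is a hypothetical interpolation weight.
def combine_linear(content_score, link_score, lam=0.8):
    return lam * content_score + (1.0 - lam) * link_score

# Hypothetical ranked list: (element path, content score, link score).
results = [("article[1]/sec[2]", 0.65, 0.55), ("article[1]/sec[1]", 0.72, 0.30)]
reranked = sorted(results, key=lambda r: combine_linear(r[1], r[2]), reverse=True)
for path, c, l in reranked:
    print(path, round(combine_linear(c, l), 3))
```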

    The Effectiveness of Query-Based Hierarchic Clustering of Documents for Information Retrieval

    Hierarchic document clustering has been applied to Information Retrieval (IR) for over three decades. Its introduction to IR was motivated by its potential to improve the effectiveness of IR systems. Central to the issue of improved effectiveness is the Cluster Hypothesis, which states that relevant documents tend to be highly similar to each other, and therefore tend to appear in the same clusters. However, previous research has been inconclusive as to whether document clustering does bring improvements. The main motivation for this work has been to investigate methods for improving the effectiveness of document clustering by challenging some assumptions that implicitly characterise its application. These assumptions relate to the static manner in which document clustering is typically performed, and include the static application of document clustering prior to querying and the static calculation of interdocument associations. The type of clustering investigated in this thesis is query-based; that is, it incorporates information from the query into the process of generating clusters of documents. Two approaches for incorporating query information into the clustering process are examined: clustering documents which are returned from an IR system in response to a user query (post-retrieval clustering), and clustering documents by using query-sensitive similarity measures. For the first approach, post-retrieval clustering, an analytical investigation into a number of issues that relate to its retrieval effectiveness is presented in this thesis. This is in contrast to most past research employing post-retrieval clustering, where it is mainly viewed as a convenient and efficient means of presenting documents to users. In this thesis, post-retrieval clustering is employed based on its potential to introduce effectiveness improvements compared both to static clustering and to best-match IR systems. The motivation for the second approach, the use of query-sensitive measures, stems from the role of interdocument similarities in the validity of the cluster hypothesis. In this thesis, an axiomatic view of the hypothesis is proposed, suggesting that documents relevant to the same query (co-relevant documents) display an inherent similarity to each other which is dictated by the query itself. Because of this inherent similarity, the cluster hypothesis should be valid for any document collection. Past research has attributed failure to validate the hypothesis for a document collection to characteristics of the collection. Contrary to this, the view proposed in this thesis suggests that the failure of a document set to adhere to the hypothesis is attributable to the assumptions made about interdocument similarity. This thesis argues that the query determines the context and the purpose for which the similarity between documents is judged, and it should therefore be incorporated in the similarity calculations. By taking the query into account when calculating interdocument similarities, co-relevant documents can be "forced" to be more similar to each other. This view challenges the typically static nature of interdocument relationships in IR. Specific formulas for the calculation of query-sensitive similarity are proposed in this thesis. Four hierarchic clustering methods and six document collections are used in the experiments.
Three main issues are investigated: the effectiveness of hierarchic post-retrieval clustering using static similarity measures, the effectiveness of query-sensitive measures at increasing the similarity of pairs of co-relevant documents, and the effectiveness of hierarchic clustering using query-sensitive similarity measures. The results demonstrate the effectiveness improvements introduced by both approaches to query-based clustering, compared both to static clustering and to best-match IR systems. Query-sensitive similarity measures, in particular, introduce significant improvements over static similarity measures for document clustering, and they also significantly improve the structure of the document space in terms of the similarity of pairs of co-relevant documents. The results provide evidence for the effectiveness of hierarchic query-based clustering of documents, and challenge the findings of previous research which had dismissed the potential of hierarchic document clustering as an effective method for information retrieval.
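
    One way to make a similarity measure query-sensitive can be sketched briefly. The measure below weights the static cosine between two documents by the similarity of their merged vector to the query, so co-relevant pairs score higher; this particular combination is an illustrative variant, not necessarily one of the exact formulas proposed in the thesis.

```python
# Sketch of a query-sensitive similarity: the document-document cosine is
# modulated by how similar the merged pair is to the query vector.
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def query_sensitive_sim(d1, d2, q):
    return cosine(d1, d2) * cosine(d1 + d2, q)

d1 = np.array([1.0, 1.0, 0.0])   # toy term-weight vectors
d2 = np.array([1.0, 0.0, 1.0])
q  = np.array([1.0, 0.0, 0.0])   # query emphasising the shared term
print(query_sensitive_sim(d1, d2, q))   # > 0: pair pulled together by the query
```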

    Compound key word generation from document databases using a hierarchical clustering ART model

    The growing availability of databases on the information highways motivates the development of new processing tools able to deal with a heterogeneous and changing information environment. A highly desirable feature of data processing systems handling this type of information is the ability to automatically extract their own key words. In this paper we address the specific problem of creating semantic term associations from a text database. The proposed method uses a hierarchical model made up of Fuzzy Adaptive Resonance Theory (ART) neural networks. First, the system uses several Fuzzy ART modules to cluster isolated words into semantic classes, starting from the raw text of the database. Next, this knowledge is used together with co-occurrence information to extract semantically meaningful term associations. The associations are asymmetric and one-to-many due to polysemy. The strength of the association between words can be measured numerically, and the associations implicitly define a hierarchy between descriptors. The underlying algorithm is suitable for use on large databases. The operation of the system is illustrated on several real databases.
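
    The asymmetry of the extracted associations can be illustrated with plain co-occurrence counts, leaving the Fuzzy ART clustering aside. In the sketch below, the strength of the association from w1 to w2 is the conditional document frequency of w2 given w1; this scoring rule is an illustrative stand-in, not the paper's exact measure.

```python
# Sketch of asymmetric term associations from document co-occurrence counts.
from collections import Counter
from itertools import combinations

docs = [["neural", "network", "model"],
        ["network", "traffic", "model"],
        ["neural", "network", "learning"]]

word_count, pair_count = Counter(), Counter()
for doc in docs:
    vocab = set(doc)
    word_count.update(vocab)
    pair_count.update(combinations(sorted(vocab), 2))

def assoc(w1, w2):
    """Strength of the directed association w1 -> w2."""
    return pair_count[tuple(sorted((w1, w2)))] / word_count[w1]

# Asymmetric: "model" implies "network" more strongly than the reverse.
print(assoc("model", "network"), assoc("network", "model"))   # 1.0 vs ~0.67
```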

    Automatic keyword extraction for a partial search engine index

    Full-text search engines play a critical role in many enterprise applications, where the quantity and complexity of the information are overwhelming. Promptly finding documents that contain relevant information for pressing questions is a necessity for efficient operation. This is especially the case for financial and legal teams executing Mergers and Acquisitions deals. The goal of the thesis is to provide search services for such teams without storing the sensitive documents involved, minimising the risk of potential data leaks. A literature review of related methods and concepts is presented. As search engine technologies that use encrypted indices for commercial applications are still in their early stages, the solution proposed in the thesis is partial indexing by keyword extraction. A cosine similarity-based evaluation was used to measure the performance difference between the keyword-based partial index and the complete index. The partial indices were constructed using unsupervised keyword extraction methods based on term frequency, document graphs, and topic modelling. The frequency-based methods were term frequency, TF-IDF, and YAKE!; the graph-based method was TextRank; the topic modelling-based methods were NMF, LDA, and LSI. The methods were evaluated by running 51 reference queries on the LEDGAR data set, which contains 60,540 contracts. The results show that, using only five keywords per document from the TF-IDF or YAKE! methods, the best-matching documents in the result lists have an average cosine similarity of 0.7. This value is reasonably high, particularly considering the small number of keywords. The topic modelling-based methods performed poorly because their keywords were too general, while the term frequency and TextRank methods were mediocre.
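
    The TF-IDF variant of the partial index can be sketched in a few lines, assuming the documents are available as plain text; the toy contract snippets and the use of scikit-learn's TfidfVectorizer are illustrative assumptions rather than the thesis implementation.

```python
# Sketch of partial indexing: keep only the top five TF-IDF keywords
# of each document, as in the thesis experiments.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The seller shall indemnify the buyer against all losses.",
    "This agreement is governed by the laws of the State of Delaware.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs).toarray()   # (n_docs, n_terms)
terms = vectorizer.get_feature_names_out()

for row in tfidf:
    top = row.argsort()[::-1][:5]                  # highest-weighted terms
    print([terms[i] for i in top if row[i] > 0])
```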

    Advanced Document Description, a Sequential Approach

    To perform efficient document processing, information systems need simple models of documents that can be processed in a small number of operations. This problem of document representation is not trivial, and for decades researchers have tried to combine relevant document representations with efficient processing. Documents are commonly represented by vectors in which each dimension corresponds to a word of the document. This approach is termed "bag of words", as it entirely ignores the relative positions of words. One natural improvement over this representation is the extraction and use of cohesive word sequences. In this dissertation, we consider the problem of the extraction, selection and exploitation of word sequences, with a particular focus on the applicability of our work to domain-independent document collections written in any language.
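
    A language-independent first step towards such sequences is collecting frequent n-grams. The sketch below gathers bigrams that recur across a toy collection; the frequency threshold is a hypothetical parameter, and the dissertation's actual extraction and selection criteria may differ.

```python
# Sketch of extracting candidate word sequences (frequent bigrams) to add
# to a bag-of-words representation; min_count is a hypothetical threshold.
from collections import Counter

def frequent_bigrams(tokenized_docs, min_count=2):
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(zip(tokens, tokens[1:]))
    return {bg: n for bg, n in counts.items() if n >= min_count}

docs = ["information retrieval systems",
        "modern information retrieval",
        "retrieval of documents"]
print(frequent_bigrams([d.split() for d in docs]))
# {('information', 'retrieval'): 2}
```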

    Statistical analysis of grouped text documents

    The topic of this thesis is statistical models for the analysis of textual data, emphasizing contexts in which text samples are grouped. When dealing with text data, the first issue is to process it, making it computationally and methodologically compatible with the existing mathematical and statistical methods produced and continually developed by the scientific community.
Therefore, the thesis first reviews existing methods for analytically representing and processing textual datasets, including Vector Space Models, distributed representations of words and documents, and contextualized embeddings. The review standardizes a notation that, even within a single representation approach, is highly heterogeneous in the literature. Two domains of application are then explored: social media and cultural tourism. For the former, a study of self-presentation among diverse groups of individuals is presented for the StockTwits platform, where finance and stock markets are the dominant topics. The proposed methodology integrates various types of data, both textual and categorical. This study revealed insights into how people present themselves online and found recurring behavioural patterns within groups of users. For the latter, the thesis delves into a study conducted as part of the "Data Science for Brescia - Arts and Cultural Places" Project, in which a language model was trained to classify Italian-language online reviews into four distinct semantic areas related to cultural attractions in the Italian city of Brescia. The proposed model identifies attractions in text documents even when they are not explicitly mentioned in document metadata, opening possibilities for expanding the database of these cultural attractions with new sources, such as social media platforms, forums, and other online spaces. Lastly, the thesis presents a methodological study examining the group-specificity of words, analyzing various group-specificity estimators proposed in the literature. The study considered grouped text documents with both outcome and group variables. Its contribution is the proposal to model the corpus of documents as a multivariate distribution, enabling the simulation of corpora of text documents with predefined characteristics. The simulation provided valuable insights into the relationship between groups of documents and words. Furthermore, all its results can be freely explored through a web application, whose components are also described in this manuscript. In conclusion, this thesis has been conceived as a collection of papers; it aims to contribute to the field with both applications and methodological proposals, and each study presented here suggests paths for future research to address the challenges in the analysis of grouped textual data.
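
    As an example of the kind of group-specificity estimator compared in the final study, the sketch below computes a smoothed log-odds ratio of word frequencies between two groups of documents; the add-one smoothing and the two-group toy setting are simplifying assumptions, not the estimators analysed in the thesis.

```python
# Sketch of one group-specificity estimator: smoothed log-odds ratio of a
# word's relative frequency in group A versus group B.
import math
from collections import Counter

group_a = "the market is up and the market rallies".split()
group_b = "the museum hosts a new art exhibition".split()

ca, cb = Counter(group_a), Counter(group_b)
na, nb = sum(ca.values()), sum(cb.values())

def log_odds(word):
    pa = (ca[word] + 1) / (na + 2)   # smoothed frequency in group A
    pb = (cb[word] + 1) / (nb + 2)   # smoothed frequency in group B
    return math.log(pa / (1 - pa)) - math.log(pb / (1 - pb))

for w in ["market", "museum", "the"]:
    print(w, round(log_odds(w), 2))  # positive: A-specific; negative: B-specific
```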

    Formal concept matching and reinforcement learning in adaptive information retrieval

    The superiority of the human brain in information retrieval (IR) tasks seems to come firstly from its ability to read and understand the concepts, ideas or meanings central to documents, in order to reason out the usefulness of documents to information needs, and secondly from its ability to learn from experience and adapt to its environment. In this work we attempt to incorporate these properties into the development of an IR model to improve document retrieval. We investigate the applicability of concept lattices, which are based on the theory of Formal Concept Analysis (FCA), to the representation of documents. This allows the use of more elegant representation units, as opposed to keywords, in order to better capture the concepts and ideas expressed in natural language text. We also investigate the use of a reinforcement learning strategy to learn and improve document representations, based on the information present in query statements and user relevance feedback. Features or concepts of each document/query, formulated using FCA, are weighted separately with respect to the documents they appear in, and organised into separate concept lattices according to a subsumption relation. Furthermore, each concept lattice is encoded in a two-layer neural network structure known as a Bidirectional Associative Memory (BAM), for efficient manipulation of the concepts in the lattice representation. This avoids implementation drawbacks faced by other FCA-based approaches. Retrieval of a document for an information need is based on concept matching between the concept lattice representations of a document and a query. The learning strategy works by strengthening the similarity of relevant documents and weakening that of non-relevant documents for each query, depending on the users' relevance judgements on retrieved documents. Our approach differs radically from existing FCA-based approaches in the following respects: concept formulation; weight assignment to object-attribute pairs; the representation of each document in a separate concept lattice; and the encoding of concept lattices in BAM structures. Furthermore, in contrast to the traditional relevance feedback mechanism, our learning strategy makes use of relevance feedback information to enhance document representations, making them dynamic and adaptive to user interactions. The results obtained on the CISI, CACM and ASLIB Cranfield collections are presented and compared with published results. In particular, the performance of the system is shown to improve significantly as the system learns from experience.
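
    The representational core, the formal concepts of a document-term context, can be illustrated compactly (the BAM encoding and the learning strategy are beyond a short sketch). The code below enumerates all formal concepts of a toy context by closing the object intents under intersection; the toy context is an illustrative assumption.

```python
# Sketch of Formal Concept Analysis on a toy documents-x-terms context.
# Every closed term set (intent) is an intersection of object intents;
# pairing each intent with its extent yields the formal concepts.
context = {
    "doc1": {"retrieval", "indexing"},
    "doc2": {"retrieval", "learning"},
    "doc3": {"retrieval", "indexing", "learning"},
}

all_terms = frozenset().union(*context.values())

intents = {all_terms}
for terms in context.values():
    intents |= {frozenset(terms) & intent for intent in intents}

for intent in sorted(intents, key=len):
    extent = {doc for doc, terms in context.items() if intent <= terms}
    print(sorted(extent), "<->", sorted(intent))
```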