378 research outputs found

    A lightweight, graph-theoretic model of class-based similarity to support object-oriented code reuse.

    Get PDF
    The work presented in this thesis is principally concerned with the development of a method and set of tools designed to support the identification of class-based similarity in collections of object-oriented code. Attention is focused on enhancing the potential for software reuse in situations where a reuse process is either absent or informal, and the characteristics of the organisation are unsuitable, or resources unavailable, to promote and sustain a systematic approach to reuse. The approach builds on the definition of a formal, attributed, relational model that captures the inherent structure of class-based, object-oriented code. Based on code-level analysis, it relies solely on the structural characteristics of the code and the peculiarly object-oriented features of the class as an organising principle: classes, those entities comprising a class, and the intra and inter-class relationships existing between them, are significant factors in defining a two-phase similarity measure as a basis for the comparison process. Established graph-theoretic techniques are adapted and applied via this model to the problem of determining similarity between classes. This thesis illustrates a successful transfer of techniques from the domains of molecular chemistry and computer vision. Both domains provide an existing template for the analysis and comparison of structures as graphs. The inspiration for representing classes as attributed relational graphs, and the application of graph-theoretic techniques and algorithms to their comparison, arose out of a well-founded intuition that a common basis in graph-theory was sufficient to enable a reasonable transfer of these techniques to the problem of determining similarity in object-oriented code. The practical application of this work relates to the identification and indexing of instances of recurring, class-based, common structure present in established and evolving collections of object-oriented code. A classification so generated additionally provides a framework for class-based matching over an existing code-base, both from the perspective of newly introduced classes, and search "templates" provided by those incomplete, iteratively constructed and refined classes associated with current and on-going development. The tools and techniques developed here provide support for enabling and improving shared awareness of reuse opportunity, based on analysing structural similarity in past and ongoing development, tools and techniques that can in turn be seen as part of a process of domain analysis, capable of stimulating the evolution of a systematic reuse ethic

    RÎle de la matrice d'information et pondération des composantes dans les noyaux de Fisher pour PLSI

    Get PDF
    ABSTRACT. An information-geometric approach for document similarities in the framework of “Probabilistic Latent Semantic Indexing” was ïŹrst proposed by T. Hofmann (2000) and later extended (“revisited”) by Nyffenegger et al. (2006). This paper presents an in-depth study and revision of these models by (1) providing a simpler uniïŹed description framework, (2) investigating the role of the Fisher Information Matrix G(Ξ), and (3) analyzing the impact of latent “topic” parameters in such models. It furthermore provides new experimental results on larger collections coming from the TREC–AP evaluation corpus

    Exploring the reuse of past search results in information retrieval

    Get PDF
    Les recherches passĂ©es constituent pourtant une source d'information utile pour les nouveaux utilisateurs (nouvelles requĂȘtes). En raison de l'absence de collections ad-hoc de RI, Ă  ce jour il y a un faible intĂ©rĂȘt de la communautĂ© RI autour de l'utilisation des recherches passĂ©es. En effet, la plupart des collections de RI existantes sont composĂ©es de requĂȘtes indĂ©pendantes. Ces collections ne sont pas appropriĂ©es pour Ă©valuer les approches fondĂ©es sur les requĂȘtes passĂ©es parce qu'elles ne comportent pas de requĂȘtes similaires ou qu'elles ne fournissent pas de jugements de pertinence. Par consĂ©quent, il n'est pas facile d'Ă©valuer ce type d'approches. En outre, l'Ă©laboration de ces collections est difficile en raison du coĂ»t et du temps Ă©levĂ©s nĂ©cessaires. Une alternative consiste Ă  simuler les collections. Par ailleurs, les documents pertinents de requĂȘtes passĂ©es similaires peuvent ĂȘtre utilisĂ©es pour rĂ©pondre Ă  une nouvelle requĂȘte. De nombreuses contributions ont Ă©tĂ© proposĂ©es portant sur l'utilisation de techniques probabilistes pour amĂ©liorer les rĂ©sultats de recherche. Des solutions simples Ă  mettre en Ɠuvre pour la rĂ©utilisation de rĂ©sultats de recherches peuvent ĂȘtre proposĂ©es au travers d'algorithmes probabilistes. De plus, ce principe peut Ă©galement bĂ©nĂ©ficier d'un clustering des recherches antĂ©rieures selon leurs similaritĂ©s. Ainsi, dans cette thĂšse un cadre pour simuler des collections pour des approches basĂ©es sur les rĂ©sultats de recherche passĂ©es est mis en Ɠuvre et Ă©valuĂ©. Quatre algorithmes probabilistes pour la rĂ©utilisation des rĂ©sultats de recherches passĂ©es sont ensuite proposĂ©s et Ă©valuĂ©s. Enfin, une nouvelle mesure dans un contexte de clustering est proposĂ©e.Past searches provide a useful source of information for new users (new queries). Due to the lack of ad-hoc IR collections, to this date there is a weak interest of the IR community on the use of past search results. Indeed, most of the existing IR collections are composed of independent queries. These collections are not appropriate to evaluate approaches rooted in past queries because they do not gather similar queries due to the lack of relevance judgments. Therefore, there is no easy way to evaluate the convenience of these approaches. In addition, elaborating such collections is difficult due to the cost and time needed. Thus a feasible alternative is to simulate such collections. Besides, relevant documents from similar past queries could be used to answer the new query. This principle could benefit from clustering of past searches according to their similarities. Thus, in this thesis a framework to simulate ad-hoc approaches based on past search results is implemented and evaluated. Four randomized algorithms to improve precision are proposed and evaluated, finally a new measure in the clustering context is proposed

    Utilisation de PLSI en recherche d'information

    Get PDF
    The PLSI model (“Probabilistic Latent Semantic Indexing”) offers a document indexing scheme based on probabilistic latent category models. It entailed applications in diverse ïŹelds, notably in information retrieval (IR). Nevertheless, PLSI cannot process documents not seen during parameter inference, a major liability for queries in IR. A method known as “folding-in” allows to circumvent this problem up to a point, but has its own weaknesses. The present paper introduces a new document-query similarity measure for PLSI based on language models that entirely avoids the problem a query projection. We compare this similarity to Fisher kernels, the state of the art similarities for PLSI. Moreover, we present an evaluation of PLSI on a particularly large training set of almost 7500 document and over one million term occurrence large, created from the TREC–AP collection

    The Effectiveness of Query-Based Hierarchic Clustering of Documents for Information Retrieval

    Get PDF
    Hierarchic document clustering has been applied to Information Retrieval (IR) for over three decades. Its introduction to IR was based on the grounds of its potential to improve the effectiveness of IR systems. Central to the issue of improved effectiveness is the Cluster Hypothesis. The hypothesis states that relevant documents tend to be highly similar to each other, and therefore tend to appear in the same clusters. However, previous research has been inconclusive as to whether document clustering does bring improvements. The main motivation for this work has been to investigate methods for the improvement of the effectiveness of document clustering, by challenging some assumptions that implicitly characterise its application. Such assumptions relate to the static manner in which document clustering is typically performed, and include the static application of document clustering prior to querying, and the static calculation of interdocument associations. The type of clustering that is investigated in this thesis is query-based, that is, it incorporates information from the query into the process of generating clusters of documents. Two approaches for incorporating query information into the clustering process are examined: clustering documents which are returned from an IR system in response to a user query (post-retrieval clustering), and clustering documents by using query-sensitive similarity measures. For the first approach, post-retrieval clustering, an analytical investigation into a number of issues that relate to its retrieval effectiveness is presented in this thesis. This is in contrast to most of the research which has employed post-retrieval clustering in the past, where it is mainly viewed as a convenient and efficient means of presenting documents to users. In this thesis, post-retrieval clustering is employed based on its potential to introduce effectiveness improvements compared both to static clustering and best-match IR systems. The motivation for the second approach, the use of query-sensitive measures, stems from the role of interdocument similarities for the validity of the cluster hypothesis. In this thesis, an axiomatic view of the hypothesis is proposed, by suggesting that documents relevant to the same query (co-relevant documents) display an inherent similarity to each other which is dictated by the query itself. Because of this inherent similarity, the cluster hypothesis should be valid for any document collection. Past research has attributed failure to validate the hypothesis for a document collection to characteristics of the collection. Contrary to this, the view proposed in this thesis suggests that failure of a document set to adhere to the hypothesis is attributed to the assumptions made about interdocument similarity. This thesis argues that the query determines the context and the purpose for which the similarity between documents is judged, and it should therefore be incorporated in the similarity calculations. By taking the query into account when calculating interdocument similarities, co-relevant documents can be "forced" to be more similar to each other. This view challenges the typically static nature of interdocument relationships in IR. Specific formulas for the calculation of query-sensitive similarity are proposed in this thesis. Four hierarchic clustering methods and six document collections are used in the experiments. Three main issues are investigated: the effectiveness of hierarchic post-retrieval clustering which uses static similarity measures, the effectiveness of query-sensitive measures at increasing the similarity of pairs of co-relevant documents, and the effectiveness of hierarchic clustering which uses query-sensitive similarity measures. The results demonstrate the effectiveness improvements that are introduced by the use of both approaches of query-based clustering, compared both to the effectiveness of static clustering and to the effectiveness of best-match IR systems. Query-sensitive similarity measures, in particular, introduce significant improvements over the use of static similarity measures for document clustering, and they also significantly improve the structure of the document space in terms of the similarity of pairs of co-relevant documents. The results provide evidence for the effectiveness of hierarchic query-based clustering of documents, and also challenge findings of previous research which had dismissed the potential of hierarchic document clustering as an effective method for information retrieval

    Approaches for the clustering of geographic metadata and the automatic detection of quasi-spatial dataset series

    Get PDF
    The discrete representation of resources in geospatial catalogues affects their information retrieval performance. The performance could be improved by using automatically generated clusters of related resources, which we name quasi-spatial dataset series. This work evaluates whether a clustering process can create quasi-spatial dataset series using only textual information from metadata elements. We assess the combination of different kinds of text cleaning approaches, word and sentence-embeddings representations (Word2Vec, GloVe, FastText, ELMo, Sentence BERT, and Universal Sentence Encoder), and clustering techniques (K-Means, DBSCAN, OPTICS, and agglomerative clustering) for the task. The results demonstrate that combining word-embeddings representations with an agglomerative-based clustering creates better quasi-spatial dataset series than the other approaches. In addition, we have found that the ELMo representation with agglomerative clustering produces good results without any preprocessing step for text cleaning

    AFRANCI : multi-layer architecture for cognitive agents

    Get PDF
    Tese de doutoramento. Engenharia Electrotécnica e de Computadores. Faculdade de Engenharia. Universidade do Porto. 201
    • 

    corecore