1,187 research outputs found

    Representing aggregate works in the digital library

    This paper studies the challenge of representing aggregate works such as encyclopedias, collected poems and journals in heterogeneous digital library collections. Reflecting on the materials used by humanities academics, we demonstrate the varied range of aggregate types and the problems of faithfully representing them in the DL interface. Aggregates are complex and pervasive; they challenge common assumptions and blur boundaries within organisational structures. Existing DL systems can provide only an imperfect representation of aggregates, and alterations to document encoding are insufficient to create a faithful reproduction of the physical library. The challenge is illustrated through concrete examples, and solutions are demonstrated in a well-known DL system and related to standard DL architecture.

    Information Extraction from Heterogeneous WWW Resources

    The information available on the WWW is growing very fast. However, a fundamental problem with this information is its lack of structure, which makes its exploitation very difficult. As a result, the desired information is becoming more difficult to retrieve and extract. To overcome this problem, many tools and techniques are being developed for locating the web pages of interest and extracting the desired information from those pages. In this paper we present the first prototype of an Information Extraction (IE) system that attempts to extract information on different Computer Science related courses offered by British universities.
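The abstract above does not describe the system's internals, but early web IE tools of this kind typically relied on hand-written wrapper rules. A minimal sketch of that technique, with an entirely made-up page fragment and CSS class name:

```python
import re

# Hypothetical example page fragment (not from the paper): a course listing
# with a consistent markup pattern that a wrapper rule can exploit.
page = """
<ul>
  <li class="course">BSc Computer Science</li>
  <li class="course">MSc Artificial Intelligence</li>
</ul>
"""

# A hand-written extraction rule: capture the text of each course item.
courses = re.findall(r'<li class="course">([^<]+)</li>', page)
print(courses)  # ['BSc Computer Science', 'MSc Artificial Intelligence']
```

Such rules are brittle: they break whenever the page layout changes, which is one reason IE research moved toward learned extractors.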

    Optimization of the search engine ElasticSearch

    This thesis presents the work done in the Search on Demand team at Orange: the optimization of the Elasticsearch search engine, the ways to ingest data into it by means of an ETL, and how relevance can be tuned using Lucene's inverted indices.
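The thesis itself is not quoted here, but the kind of relevance tuning it mentions is commonly expressed through Elasticsearch's query DSL. A hedged sketch, with illustrative field names ("title", "body") that are not taken from the thesis:

```python
# A multi_match query body that boosts matches in the title field over the
# body field; "^3" is Elasticsearch's field-boost syntax. This dict would be
# sent as the JSON body of a search request.
query = {
    "query": {
        "multi_match": {
            "query": "search engine optimization",
            "fields": ["title^3", "body"],  # title matches score 3x higher
        }
    }
}
```

Boosting at query time like this changes ranking without reindexing, which is why it is a typical first step when tuning relevance.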

    Visualizing and Interacting with Concept Hierarchies

    Concept Hierarchies and Formal Concept Analysis are theoretically well-grounded and extensively tested methods. They rely on line diagrams called Galois lattices for visualizing and analysing object-attribute sets. Galois lattices are visually appealing and conceptually rich for experts. However, they have important drawbacks due to their concept-oriented overall structure: analysing what they show is difficult for non-experts, navigation is cumbersome, interaction is poor, and scalability is a deep bottleneck for visual interpretation even for experts. In this paper we introduce semantic probes as a means to overcome many of these problems and extend the usability and application possibilities of traditional FCA visualization methods. Semantic probes are visual, user-centred objects which extract and organize reduced Galois sub-hierarchies. They are simpler and clearer, and they provide better navigation support through a rich set of interaction possibilities. Since probe-driven sub-hierarchies are limited to the user's focus, scalability is under control and interpretation is facilitated. After some successful experiments, several applications are being developed; the remaining problem is finding a compromise between simplicity and conceptual expressivity.
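To make the underlying FCA machinery concrete (this is standard FCA, not the paper's probe mechanism), a brute-force enumeration of the formal concepts of a tiny, made-up object-attribute context:

```python
from itertools import combinations

# A toy object-attribute context (entirely illustrative).
context = {
    "duck":  {"flies", "swims"},
    "eagle": {"flies", "hunts"},
    "shark": {"swims", "hunts"},
}
attributes = set().union(*context.values())

def extent(attrs):
    """Objects that have every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}

def intent(objs):
    """Attributes shared by every object in objs."""
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

# Each formal concept is a pair (extent, intent) closed under derivation;
# these pairs, ordered by extent inclusion, form the Galois lattice.
concepts = set()
for r in range(len(attributes) + 1):
    for attrs in combinations(sorted(attributes), r):
        objs = extent(set(attrs))
        concepts.add((frozenset(objs), frozenset(intent(objs))))
print(len(concepts))  # 8 concepts in this toy lattice
```

Even this three-object context yields eight concepts, which illustrates the scalability problem the paper addresses: lattice size grows quickly with the context, so visualizations restricted to a user's focus (sub-hierarchies) become necessary.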

    Ontology lexicalization: the relationship between content and meaning in the context of Information Retrieval

    This proposal seeks to represent natural language in a form suitable for ontologies and vice versa. To that end, it proposes the semi-automatic creation of a lexical database in Brazilian Portuguese containing morphological, syntactic, and semantic information that can be read by machines, allowing structured and unstructured data to be linked and integrated into an information retrieval model to improve precision. The results obtained demonstrate that the methodology can be used in the financial risk domain in Portuguese for the construction of an ontology, a lexical-semantic database, and a proposed semantic information retrieval model. To evaluate the performance of the proposed model, documents containing the main definitions of the financial risk domain were selected and indexed both with and without semantic annotation. To enable a comparison between the approaches, two indices were created: the first represents the traditional search, and the second was built from the texts with semantic annotations to represent the semantic search. The evaluation of the proposal is based on recall and precision. The queries submitted to the model show that the semantic search outperforms the traditional search and validate the methodology used. Although more complex to build, the procedure can be reproduced in any other domain.
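The recall/precision comparison described above can be sketched in a few lines. The document IDs and result sets below are invented purely to show the metric computation, not taken from the study:

```python
# Hypothetical relevance judgements and two retrieval runs.
relevant = {"d1", "d2", "d3", "d4"}          # documents judged relevant
traditional = {"d1", "d5", "d6"}             # returned by the plain index
semantic = {"d1", "d2", "d3", "d7"}          # returned by the annotated index

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

print(precision_recall(traditional, relevant))  # (0.3333333333333333, 0.25)
print(precision_recall(semantic, relevant))     # (0.75, 0.75)
```

In this toy setup the semantic run dominates on both metrics, mirroring the kind of outcome the abstract reports.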

    Development of a web-based platform for Biomedical Text Mining

    Master's dissertation in Informatics Engineering. Biomedical Text Mining (BTM) seeks to derive high-quality information from literature in the biomedical domain by creating tools and methodologies that can automate time-consuming tasks when searching for new information. This encompasses both Information Retrieval, the discovery and recovery of relevant documents, and Information Extraction, the capability to extract knowledge from text. In recent years, SilicoLife, with the collaboration of the University of Minho, has been developing @Note2, an open-source Java-based multiplatform BTM workbench that includes libraries to perform the main BTM tasks and provides user-friendly interfaces through a stand-alone application. This work addressed the development of a web-based software platform able to address some of the main tasks within BTM, supported by the existing core libraries from the @Note project. This included improving the available RESTful server, providing some new methods and APIs and improving others, while also developing a web-based application that communicates with the server through calls to its API and provides a functional, user-friendly web-based interface. The work focused on tasks related to Information Retrieval, addressing the efficient search of relevant documents through an integrated interface. At this stage, the aim was also to have interfaces to visualize and explore the main entities involved in BTM: queries, documents, corpora, annotation-process entities, and resources.
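The abstract does not document the @Note2 API itself, so the host, path, and parameter names below are hypothetical placeholders showing only the general shape of the RESTful document-search call a web client like the one described would build:

```python
from urllib.parse import urlencode

# Purely illustrative endpoint and parameters (not the real @Note2 API).
base = "https://example.org/atnote/api/corpora/search"
params = {"query": "glucose metabolism", "source": "PubMed", "limit": 20}
url = base + "?" + urlencode(params)
print(url)
```

A browser front end would issue such requests asynchronously and render the returned document list, keeping all BTM logic on the server side.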

    A comprehensive analysis of acknowledgement texts in Web of Science: a case study on four scientific domains

    The analysis of acknowledgements is particularly interesting because acknowledgements may give information not only about funding: they can also reveal hidden contributions to authorship, researchers' collaboration patterns, the context in which the research was conducted, and specific aspects of the academic work. The focus of the present research is the analysis of a large sample of acknowledgement texts indexed in the Web of Science (WoS) Core Collection. Records of type "article" and "review" from four scientific domains, namely social sciences, economics, oceanography and computer science, published from 2014 to 2019 in English-language scientific journals, were considered. Six types of acknowledged entities, i.e. funding agency, grant number, individuals, university, corporation and miscellaneous, were extracted from the acknowledgement texts using a named entity recognition tagger and subsequently examined. A general analysis of the acknowledgement texts showed that the indexing of funding information in WoS is incomplete. The analysis of the automatically extracted entities revealed differences and distinct patterns in the distribution of acknowledged entity types between scientific domains. A strong association was found between acknowledged entity and scientific domain, and between acknowledged entity and entity type. Only a negligible correlation was found between the number of citations and the number of acknowledged entities. Generally, the number of words in an acknowledgement text correlates positively with the number of acknowledged funding organizations, universities, individuals and miscellaneous entities. At the same time, acknowledgement texts with a larger number of sentences have more acknowledged individuals and miscellaneous entities.
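The study used a trained NER tagger; as a simplified, rule-based stand-in, the sketch below classifies two of the six entity types by surface patterns. The sentence, patterns, and grant-number format are all invented for illustration:

```python
import re

# A made-up acknowledgement sentence.
text = ("This work was supported by the Deutsche Forschungsgemeinschaft "
        "under grant DFG-12345.")

# Toy surface patterns standing in for a real NER model.
patterns = {
    "GRANT_NUMBER": r"\b[A-Z]{2,}-\d+\b",
    "FUNDING_AGENCY": r"Deutsche Forschungsgemeinschaft",
}
entities = [(label, m)
            for label, pat in patterns.items()
            for m in re.findall(pat, text)]
print(entities)
```

A real tagger generalizes far beyond such patterns (individuals, universities, corporations, miscellaneous), but the output shape, a list of (type, surface form) pairs, is the same kind of data the study aggregates per domain.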