
    The Closer the Better: Similarity of Publication Pairs at Different Co-Citation Levels

    We investigate the similarity of pairs of articles that are co-cited at different co-citation levels: journal, article, section, paragraph, sentence, and bracket. Our results indicate that textual similarity, intellectual overlap (shared references), author overlap (shared authors), and proximity in publication time all rise monotonically as the co-citation level gets lower (from journal to bracket). While the main gain in similarity happens when moving from journal to article co-citation, every level change entails an increase in similarity, especially from section to paragraph and from paragraph to the sentence/bracket levels. We compare results from four journals over the years 2010-2015: Cell, the European Journal of Operational Research, Physics Letters B, and Research Policy, with consistent general outcomes and some interesting differences. Our findings motivate the use of granular co-citation information as defined by meaningful units of text, with implications for, among other things, the elaboration of maps of science and the retrieval of scholarly literature.
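
    The abstract names several pairwise signals but gives no formulas; as a minimal illustration, the shared-reference and shared-author signals can be computed as Jaccard overlaps between two co-cited papers. The metadata below is hypothetical and the measure is an assumption, not necessarily the one used in the paper.

    ```python
    # Illustrative only: computes two of the named signals -- intellectual
    # overlap (shared references) and author overlap (shared authors) -- as
    # Jaccard coefficients for a pair of co-cited papers.

    def jaccard(a: set, b: set) -> float:
        """|a & b| / |a | b|, defined as 0.0 when both sets are empty."""
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # Hypothetical metadata for two co-cited papers.
    paper_x = {"references": {"r1", "r2", "r3"}, "authors": {"Smith", "Lee"}}
    paper_y = {"references": {"r2", "r3", "r4"}, "authors": {"Lee", "Chen"}}

    intellectual_overlap = jaccard(paper_x["references"], paper_y["references"])
    author_overlap = jaccard(paper_x["authors"], paper_y["authors"])
    print(f"shared references: {intellectual_overlap:.2f}, "
          f"shared authors: {author_overlap:.2f}")
    ```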

    Measuring academic influence: Not all citations are equal

    The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation. By asking authors to identify the key references in their own work, we created a data set in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this data set using only four features. The best features, among those we evaluated, were those based on the number of times a reference is mentioned in the body of the citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.
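
    As a rough sketch of the hip-index idea, the conventional h-index can be computed over weighted citation counts, where each citation's weight reflects how many times the reference is mentioned in the citing paper. The exact weighting function used in the paper is not given in the abstract, so the raw mention sums below are an assumption, as is the toy data.

    ```python
    # Minimal sketch: h-index over weighted citation counts. Using raw
    # per-citing-paper mention sums as weights is an assumption; the paper's
    # actual weighting may differ.

    def h_index(citation_counts: list[float]) -> int:
        """Largest h such that at least h papers have a (weighted) count >= h."""
        counts = sorted(citation_counts, reverse=True)
        h = 0
        for i, c in enumerate(counts, start=1):
            if c >= i:
                h = i
            else:
                break
        return h

    # Each inner list holds per-citing-paper mention counts for one of the
    # author's papers (hypothetical data).
    mentions_per_paper = [[3, 1, 1], [5, 2], [1], [1, 1, 1, 1]]

    conventional = h_index([len(m) for m in mentions_per_paper])      # 1 citation = 1
    influence_primed = h_index([sum(m) for m in mentions_per_paper])  # weight by mentions
    print(conventional, influence_primed)
    ```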

    CiteFinder: a System to Find and Rank Medical Citations

    This thesis presents CiteFinder, a system to find relevant citations for clinicians' written content. Including citations in clinical information content makes the content more reliable by providing scientific articles as references, and enables clinicians to easily update their written content with new information. The proposed approach splits the content into sentences, identifies the sentences that need to be supported with citations by applying classification algorithms, and uses information retrieval and ranking techniques to extract and rank relevant citations from MEDLINE for any given sentence. Additionally, the system extracts snippets from the retrieved articles. We assessed our approach on 3,699 MEDLINE papers on the subject of heart failure. We implemented multi-level and weighted ranking algorithms to rank the citations. This study shows that using journal priority and study design type significantly improves results, by approximately 63%, over the traditional approach of using only the text of articles. We also show that using the full text, rather than just the abstract, leads to the extraction of higher-quality snippets.
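
    The abstract does not specify how journal priority and study design type enter the ranking; below is a hypothetical sketch of one such weighted blend. All field names, weight tables, and coefficients are assumptions for illustration only, not CiteFinder's actual scoring.

    ```python
    # Hypothetical weighted ranking: blend a text-retrieval score with
    # journal-priority and study-design weights. All values are illustrative.

    STUDY_DESIGN_WEIGHT = {"meta-analysis": 1.0, "rct": 0.9,
                           "cohort": 0.7, "case report": 0.4}
    JOURNAL_PRIORITY = {"NEJM": 1.0, "Lancet": 0.95}  # unknown journals get 0.5

    def rank_score(text_score: float, journal: str, design: str) -> float:
        """Blend the three signals; the coefficients are placeholders."""
        return (0.6 * text_score
                + 0.2 * JOURNAL_PRIORITY.get(journal, 0.5)
                + 0.2 * STUDY_DESIGN_WEIGHT.get(design, 0.5))

    candidates = [
        {"pmid": "123", "text_score": 0.82, "journal": "NEJM", "design": "rct"},
        {"pmid": "456", "text_score": 0.90, "journal": "Other J", "design": "case report"},
    ]
    ranked = sorted(candidates,
                    key=lambda c: rank_score(c["text_score"], c["journal"], c["design"]),
                    reverse=True)
    print([c["pmid"] for c in ranked])
    ```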

    Selection of Relevant and Non-Relevant Research Articles Based on Scopus Results, with Visualization by Document Groups

    This article presents a web application intended to ease the selection of research articles that are relevant, or not, to a given topic. The process starts when a researcher writes a search string, which is sent to the Scopus API. The results are then clustered to produce a visualization by groups or topics instead of the classic ranked result lists, making it easier for the user to discard groups of articles irrelevant to the query. The proposal uses five clustering algorithms, among which Spectral and K-means achieved the best performance on classic information retrieval metrics over four state-of-the-art datasets. The application was evaluated in two rounds by researchers at the Universidad del Cauca, who judged in the final round that 71.4% of the groups had a good title, 92.9% of the groups ordered their documents well, and 65.8% of the articles were well grouped. A notable feature is the implementation of overlapping clustering, which allows articles to belong to several topics. The results are promising, and the application is a valuable contribution for researchers developing their projects. However, the results are not generalizable, and there is a clear need for better labeling algorithms that generate more descriptive titles, as well as for tools that assist the user in constructing queries.
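
    As a minimal sketch of the clustering step, the snippet below groups toy documents using TF-IDF features and K-means, one of the five algorithms the paper evaluated (scikit-learn's SpectralClustering offers the same fit_predict interface). The documents and the number of clusters are placeholders.

    ```python
    # Minimal sketch: cluster search results by topic with TF-IDF + K-means.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "deep learning for image segmentation",
        "convolutional networks segment medical images",
        "market volatility and asset pricing",
        "stochastic models of financial markets",
    ]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    for doc, label in zip(docs, labels):
        print(label, doc)
    ```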

    Multi-Agent Modeling of Risk-Aware and Privacy-Preserving Recommender Systems

    Recent progress in the field of recommender systems has led to increases in the accuracy and significant improvements in the personalization of recommendations. These results are being achieved in general by gathering more user data and generating relevant insights from it. However, user privacy concerns are often underestimated and recommendation risks are not usually addressed. In fact, many users are not sufficiently aware of what data is collected about them and how the data is collected (e.g., whether third parties are collecting and selling their personal information). Research in the area of recommender systems should strive not only to achieve high accuracy of the generated recommendations but also to protect the user's privacy and make recommender systems aware of the user's context, which involves the user's intentions and current situation. Research has established that a tradeoff is required between the accuracy, the privacy, and the risks in a recommender system, and that recommender systems are highly unlikely to satisfy all the context-aware and privacy-preserving requirements completely. Nonetheless, a significant attempt can be made to describe a novel modeling approach that supports designing a recommender system encompassing some of these requirements. This thesis focuses on a multi-agent based system model of recommender systems, introducing both privacy and risk-related abstractions into traditional recommender systems and breaking down the system into three different subsystems. Such a description of the system can represent a subset of recommender systems that are both risk-aware and privacy-preserving. The applicability of the approach is illustrated by a case study involving a job recommender system in which the general design model is instantiated to represent the required domain-specific abstractions.

    Development of a Recommender System for National Open Access Scientific Publications for the Qualified Researchers of SINACYT

    Scientific output is growing steadily worldwide. This output is preserved in digital open access repositories, created as support tools for the development of scientific production. However, these repositories fall short as tools for increasing the visibility, use, and impact of the scientific output they hold. Peru is no stranger to this worldwide growth; as output increased, new platforms (ALICIA and DINA) were implemented to disseminate and promote the exchange of information among local institutions and universities. Nevertheless, these platforms remain isolated within the research ecosystem, as they are not integrated with researchers' tools and processes. The goal of this project is to present a solution to the lack of adequate mechanisms for showcasing Peruvian scientific output: the implementation of a Recommender System for National Open Access Scientific Publications for the qualified researchers of SINACYT. The solution generates personalized recommendations of publications in ALICIA through content-based filtering over a researcher profile. This profile is built from relevant information about each researcher's scientific output published in Scopus and Orcid. Recommendation generation relies on LSA (Latent Semantic Analysis), to discover hidden semantic structure in a set of scientific publications, and on cosine similarity, to find the scientific publications with the highest degree of similarity. The project implements four modules: extraction, which collects data on publications in ALICIA and on the Scopus and Orcid publications of each researcher registered in DINA, using web scraping; preprocessing, which improves the quality of the extracted data for later use in the analytic model within a text mining framework; recommendation, which trains an LSA model and generates recommendations of scientific publications likely to interest users based on their Scopus and Orcid publications; and service, which lets other applications consume the recommendations generated by the system.
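
    A minimal sketch of the LSA-plus-cosine-similarity pipeline described above, assuming TF-IDF features, truncated SVD for the latent semantic space, and a single researcher-profile document; the corpus, profile text, and number of components are placeholders.

    ```python
    # Minimal sketch: LSA (TF-IDF + truncated SVD) and cosine similarity
    # between a researcher profile and candidate publications.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "open access repository of Peruvian scientific publications",
        "recommender systems based on latent semantic analysis",
        "soil chemistry of Andean agriculture",
    ]
    profile = ["content-based recommendation of scholarly publications"]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(corpus + profile)

    lsa = TruncatedSVD(n_components=2, random_state=0)
    Z = lsa.fit_transform(X)

    scores = cosine_similarity(Z[-1:], Z[:-1])[0]  # profile vs. each publication
    ranking = sorted(zip(scores, corpus), reverse=True)
    print(ranking[0])  # top recommendation
    ```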

    Intelligent methods for information filtering of research resources

    This thesis presents several content-based methods to address the task of filtering research resources. The explosive growth of the Web in the last decades has led to an important increase in available scientific information. This has contributed to the need for tools which help researchers to deal with huge amounts of data. Examples of such tools are digital libraries, dedicated search engines, and personalized information filters. The latter, also known as recommenders, have proved useful for non-academic purposes and in the last years have started to be considered for recommendation of scholarly resources. This thesis explores new developments in this context. In particular, we focus on two different tasks. First we explore how to make maximal use of the semi-structured information typically available for research papers, such as keywords, authors, or journal, to assess research paper similarity. This is important since in many cases the full text of the articles is not available and the information used for tasks such as article recommendation is often limited to the abstracts. To exploit all the available information, we propose several methods based on both the vector space model and language modeling. In the first case, we study how the popular combination of tf-idf and cosine similarity can be used not only with the abstract, but also with the keywords and the authors. We also combine the abstract and these extra features by using Explicit Semantic Analysis. In the second case, we estimate separate language models based on each of the features to subsequently interpolate them. Moreover, we employ Latent Dirichlet Allocation (LDA) to discover latent topics which can enrich the models, and we explore how to use the keywords and the authors to improve the performance of the standard LDA algorithm. Next, we study the information available in call for papers (CFPs) of conferences to exploit it in content-based methods to match users with CFPs. Specifically, we distinguish between textual content such as the introductory text and topics in the scope of the conference, and names of the program committee. This second type of information can be used to retrieve the research papers written by these people, which provides the system with new data about the conference. Moreover, the research papers written by the users are employed to represent their interests. Again, we explore methods based on both the vector space model and language modeling to combine the different types of information. The experimental results indicate that the use of these extra features can lead to significant improvements. In particular, our methods based on interpolation of language models perform well for the task of assessing the similarity between research papers. In contrast, when addressing the problem of filtering CFPs, the methods based on the vector space model prove more robust.
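
    As a minimal sketch of the language-model interpolation idea, the snippet below estimates a unigram model per field (abstract, keywords, authors) and mixes them with fixed interpolation weights. Smoothing is omitted and the weights are placeholders; the thesis's actual estimation details are not given in the abstract.

    ```python
    # Minimal sketch: per-field unigram language models combined by linear
    # interpolation, P(w) = sum_i weight_i * P_i(w).
    from collections import Counter

    def unigram_lm(text: str) -> Counter:
        tokens = text.lower().split()
        counts = Counter(tokens)
        total = sum(counts.values())
        return Counter({w: c / total for w, c in counts.items()})

    def interpolate(models, weights):
        """Mixture of unigram models with the given interpolation weights."""
        mixed = Counter()
        for model, w in zip(models, weights):
            for token, p in model.items():
                mixed[token] += w * p
        return mixed

    paper = {  # hypothetical semi-structured record
        "abstract": "filtering research papers with language models",
        "keywords": "information filtering language models",
        "authors": "j smith a jones",
    }
    lm = interpolate([unigram_lm(paper[f]) for f in ("abstract", "keywords", "authors")],
                     weights=[0.6, 0.3, 0.1])
    print(lm["language"])
    ```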

    Finding Relevant Papers Based on Citation Relations

    With the tremendous volume of research publications, recommending relevant papers that fulfill researchers' information needs has become a significant problem. The major challenge tackled by our work is, given a target paper, how to effectively recommend a set of relevant papers from an existing citation network. In this paper, we propose a novel method that addresses the problem by incorporating various citation relations to select a set of papers that is more relevant yet very limited in size. The method has two unique properties. First, a metric called Local Relation Strength is defined to measure the dependency between cited and citing papers. Second, a model called Global Relation Strength is proposed to capture the relevance between two papers in the whole citation graph. We evaluate our proposed model on a real-world publication dataset and conduct an extensive comparison with five baseline approaches. The experimental results demonstrate that our method achieves a promising improvement over these well-known techniques.
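
    The abstract does not define Local or Global Relation Strength, so the snippet below is not the paper's method; it is a generic illustration of ranking papers by propagating relevance from a target paper through a citation graph, using personalized PageRank as a stand-in. The toy graph is hypothetical.

    ```python
    # Generic stand-in: score papers by a personalized-PageRank walk that
    # restarts at the target paper, then rank by the resulting relevance.
    import networkx as nx

    G = nx.DiGraph()  # edge u -> v means paper u cites paper v (toy graph)
    G.add_edges_from([("target", "a"), ("target", "b"),
                      ("a", "c"), ("b", "c"), ("d", "target")])

    # Restart mass concentrated on the target paper.
    scores = nx.pagerank(G.to_undirected(), alpha=0.85,
                         personalization={"target": 1.0})

    recommended = sorted((p for p in scores if p != "target"),
                         key=scores.get, reverse=True)
    print(recommended)
    ```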