
    How reliable are unsupervised author disambiguation algorithms in the assessment of research organization performance?

    The paper examines the extent of bias in the performance rankings of research organisations when the assessments are based on unsupervised author-name disambiguation algorithms. It compares the outcomes of a research performance evaluation exercise of Italian universities that uses the unsupervised approach of Caron and van Eck (2014) to derive the universities' research staff with those of a benchmark using the supervised algorithm of D'Angelo, Giuffrida, and Abramo (2011), which makes use of input data. The methodology could be replicated for comparative analyses in other national or international frameworks, giving practitioners a precise measure of the distortions inherent in any evaluation exercise that relies on unsupervised algorithms. This could in turn inform policy-makers' decisions on whether to invest in building national research staff databases rather than settling for unsupervised approaches with their measurement biases.
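
    As a rough, minimal sketch of this kind of comparison (the universities and scores below are hypothetical, and the paper's actual bibliometric indicators and staff-derivation steps are far more involved), one can rank the organisations under both approaches and measure how far the unsupervised ranking deviates from the supervised benchmark:

```python
# Minimal sketch: compare a ranking built on an unsupervised disambiguation
# algorithm against a benchmark ranking built on a supervised one.
# All universities and scores are hypothetical, for illustration only.
from scipy.stats import spearmanr

# Hypothetical per-university performance scores under each approach.
supervised   = {"Univ A": 1.32, "Univ B": 1.10, "Univ C": 0.95, "Univ D": 0.80}
unsupervised = {"Univ A": 1.25, "Univ B": 0.98, "Univ C": 1.02, "Univ D": 0.78}

universities = sorted(supervised)
bench = [supervised[u] for u in universities]
test  = [unsupervised[u] for u in universities]

# Rank correlation between the two performance rankings: values well below 1
# indicate that the unsupervised approach distorts the benchmark ranking.
rho, _ = spearmanr(bench, test)
print(f"Spearman rank correlation: {rho:.3f}")

# Per-university rank shift, a simple measure of individual distortion.
def ranks(scores):
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {u: r for r, u in enumerate(ordered, start=1)}

rb, rt = ranks(supervised), ranks(unsupervised)
for u in universities:
    print(u, "benchmark rank:", rb[u], "unsupervised rank:", rt[u])
```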

    Effect of forename string on author name disambiguation

    In author name disambiguation, author forenames are used to decide which name instances are disambiguated together and how likely they are to refer to the same author. Despite this crucial role, the effect of forenames on the performance of heuristic (string matching) and algorithmic disambiguation is not well understood. This study assesses the contribution of forenames to author name disambiguation using multiple labeled data sets under varying ratios and lengths of full forenames, reflecting real-world scenarios in which an author is represented by forename variants (synonyms) and some authors share the same forename (homonyms). The results show that increasing the ratio of full forenames substantially improves both heuristic and machine-learning-based disambiguation. Performance gains from algorithmic disambiguation are pronounced when many forenames are initialized or homonyms are prevalent; as the ratio of full forenames increases, however, these gains become marginal compared to those from string matching. Using only a small portion of each forename string does not greatly reduce the performance of either heuristic or algorithmic disambiguation compared to using full-length strings. These findings suggest practical measures, such as restoring initialized forenames to full-string form via record linkage, for improving disambiguation performance.
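
    As an illustration of the heuristic (string matching) side discussed above, the sketch below merges two name instances only when their surnames match and their forename strings are compatible; the names and the matching rule are simplified assumptions, not the study's exact procedure. It also shows why initialized forenames invite homonym errors.

```python
# Minimal sketch of heuristic name matching for author disambiguation.
# The rule and all names are hypothetical simplifications.
def forename_compatible(f1: str, f2: str) -> bool:
    """Full forenames must match exactly; an initial matches any forename
    starting with the same letter. This is where initialized forenames
    cause homonym errors: 'J.' matches both 'John' and 'Jane'."""
    f1, f2 = f1.rstrip(".").lower(), f2.rstrip(".").lower()
    if len(f1) == 1 or len(f2) == 1:   # at least one side is an initial
        return f1[0] == f2[0]
    return f1 == f2                    # both sides are full strings

def same_author(name1: tuple[str, str], name2: tuple[str, str]) -> bool:
    # A name instance is (surname, forename); merge only on surname match
    # plus forename compatibility.
    (s1, f1), (s2, f2) = name1, name2
    return s1.lower() == s2.lower() and forename_compatible(f1, f2)

print(same_author(("Smith", "John"), ("Smith", "J.")))    # True (possible homonym)
print(same_author(("Smith", "John"), ("Smith", "Jane")))  # False
print(same_author(("Smith", "J."),   ("Smith", "Jane")))  # True: why full strings help
```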

    Information retrieval in XML documents: taking links into account to select relevant elements

    Our work is situated in the field of information retrieval (IR), more specifically information retrieval in semi-structured XML documents. Exploiting the available XML documents effectively requires taking their structural dimension into account, and this dimension has raised new challenges for IR. Unlike classical IR approaches, which focus on retrieving unstructured content, XML IR combines textual and structural information to carry out various retrieval tasks. Several approaches exploiting these types of evidence have been proposed; they are mainly based on classical IR models adapted to XML documents. The XML structure has been used to provide focused access to documents, returning document components (for example sections or paragraphs) instead of a whole document in response to a user query. In traditional IR, the similarity measure is generally based on textual information: documents are ranked by degree of relevance using measures such as "term similarity" or "term probability". Other sources of evidence can, however, be considered when searching for relevant information in documents. Hyperlinks, for example, have been widely exploited in Web IR. Despite their popularity in the Web context, few approaches exploiting this source of evidence have been proposed for XML IR. The goal of our work is to propose approaches that use links as a source of evidence in XML information retrieval. This thesis addresses the following research questions: (1) Can links be considered a source of evidence in the context of XML IR? (2) Does using certain link-analysis algorithms in XML IR improve the quality of the results, in particular on the Wikipedia collection? (3) Which types of links can best improve the relevance of search results? (4) How should the link score of the elements returned as search results be computed? Should "document-document" links be considered, or more precisely "element-element" links? What weight should navigational links receive relative to hierarchical links? (5) What is the impact of using links in a global versus a local context? (6) How should the link score be integrated into the final score of the returned XML elements? (7) What is the impact of the quality of the top-ranked results on the behaviour of the proposed formulas? To answer these questions, we conducted a statistical study of the search results returned by the "DALIAN" information retrieval system, using the test collection provided by INEX, which clearly showed that links are a relevance signal for elements in the context of XML IR. We also implemented three link-analysis algorithms (PageRank, HITS and SALSA), which enabled a comparative study showing that query-dependent approaches outperform global-context approaches.
    In this thesis we proposed three formulas for computing the link score: the first is called "Topical PageRank", the second is distance-based, and the third is based on weighted links. We also proposed three combination formulas: a linear formula, a Dempster-Shafer formula and a fuzzy-based formula. Finally, we carried out a series of experiments, all of which showed that the proposed approaches improve the relevance of the results across the tested configurations; that query-dependent approaches outperform global-context approaches; that approaches exploiting "element-element" links obtain good results; and that combination formulas based on uncertainty for computing the final scores of XML elements achieve good performance.
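
    As a rough sketch of the scoring pipeline described above (the element graph, content scores and mixing weight are hypothetical, and the thesis's formulas differ in detail), a link score produced by a link-analysis algorithm such as PageRank can be combined linearly with a content score to rank XML elements:

```python
# Minimal sketch: linear combination of content evidence and link evidence
# for ranking XML elements. Graph and scores are hypothetical.
import networkx as nx

# Element-to-element links (navigational and hierarchical) as a directed graph.
links = [("e1", "e2"), ("e2", "e3"), ("e3", "e1"), ("e4", "e1")]
graph = nx.DiGraph(links)
link_score = nx.pagerank(graph, alpha=0.85)   # query-independent link evidence

# Content scores as they might come from a classical IR model.
content_score = {"e1": 0.72, "e2": 0.55, "e3": 0.40, "e4": 0.10}

def final_score(element: str, lam: float = 0.6) -> float:
    # Linear combination: lam weights content evidence against link evidence.
    return lam * content_score[element] + (1 - lam) * link_score[element]

for e in sorted(content_score, key=final_score, reverse=True):
    print(e, round(final_score(e), 3))
```

    Varying the mixing weight between 0 and 1 shifts the ranking between pure content evidence and pure link evidence, which is the trade-off the combination formulas above are designed to control.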

    Semantics-Driven Aspect-Based Sentiment Analysis

    People using the Web are constantly invited to share their opinions and preferences with the rest of the world, which has led to an explosion of opinionated blogs, reviews of products and services, and comments on virtually everything. This type of web-based content is increasingly recognized as a source of data with added value for multiple application domains. While the large number of available reviews almost ensures that all relevant aspects of the entity under review are covered, manually reading each and every review is not feasible. Aspect-based sentiment analysis aims to solve this issue: it is concerned with developing algorithms that can automatically extract fine-grained sentiment information from a set of reviews, computing a separate sentiment value for each aspect of the product or service being reviewed. This dissertation focuses on which discriminants are useful when performing aspect-based sentiment analysis: what signals for sentiment can be extracted from the text itself, and what is the effect of using extra-textual discriminants? We find that using semantic lexicons or ontologies can greatly improve the quality of aspect-based sentiment analysis, especially when training data is limited. Additionally, because semantics drives the analysis, the algorithm is less of a black box and its results are easier to explain.
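
    A minimal, lexicon-driven sketch of the idea described above, with a toy aspect lexicon and sentiment lexicon (both hypothetical; the dissertation's approach uses richer semantic lexicons and ontologies): each opinion word is attached to the nearest aspect term, yielding a separate sentiment score per aspect.

```python
# Minimal sketch of lexicon-based aspect-level sentiment scoring.
# Both lexicons and the review text are toy assumptions.
ASPECT_LEXICON = {"food": "food", "pizza": "food", "service": "service",
                  "waiter": "service", "price": "price"}
SENTIMENT_LEXICON = {"great": 1, "delicious": 1, "friendly": 1,
                     "slow": -1, "rude": -1, "overpriced": -1}

def aspect_sentiment(review: str) -> dict[str, int]:
    tokens = review.lower().replace(",", " ").replace(".", " ").split()
    scores: dict[str, int] = {}
    for i, tok in enumerate(tokens):
        if tok in SENTIMENT_LEXICON:
            # Attach the opinion word to the closest aspect term in the text.
            nearest = min((j for j, t in enumerate(tokens) if t in ASPECT_LEXICON),
                          key=lambda j: abs(j - i), default=None)
            if nearest is not None:
                aspect = ASPECT_LEXICON[tokens[nearest]]
                scores[aspect] = scores.get(aspect, 0) + SENTIMENT_LEXICON[tok]
    return scores

print(aspect_sentiment("The pizza was delicious but the waiter was slow and rude."))
# {'food': 1, 'service': -2}
```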

    Exploring the value of big data analysis of Twitter tweets and share prices

    Over the past decade, the use of social media (SM) such as Facebook, Twitter, Pinterest and Tumblr has increased dramatically. Using SM, millions of users create large amounts of data every day; according to some estimates, ninety per cent of the content on the Internet is now user generated. SM can be seen as a distributed content creation and sharing platform based on Web 2.0 technologies. SM sites make it very easy for their users to publish text, pictures, links, messages or videos without needing to know how to program. Users post reviews of products and services they have bought, write about their interests and intentions, or give their opinions and views on political subjects. SM has also been a key factor in mass movements such as the Arab Spring and the Occupy Wall Street protests, and is used for humanitarian aid and disaster relief (HADR). There is growing interest in SM analysis from organisations seeking to detect new trends, gather user opinions on their products and services, or learn about their online reputation. Companies such as Amazon and eBay use SM data for their recommendation engines and to generate more business. TV stations buy data about opinions on their programmes from Facebook to gauge the popularity of a given TV show. Companies such as Topsy, Gnip, DataSift and Zoomph have built their entire business models around SM analysis.
    The purpose of this thesis is to explore the economic value of Twitter tweets. The economic value is assessed by trying to predict the share price of a company: if the share price can be predicted using SM data, a monetary value can be deduced. There is limited research on determining the economic value of SM data for "nowcasting", predicting the present, and for forecasting. This study aims to determine the monetary value of Twitter by correlating the daily frequencies of positive and negative tweets about Apple and some of its most popular products with the development of the Apple Inc. share price. If the number of positive tweets about Apple increases and the share price follows this development, the tweets carry predictive information about the share price.
    A literature review found growing interest in analysing SM data across industries, with much research studying SM from various perspectives: many studies try to determine the impact of online marketing campaigns or to quantify the value of social capital; others, in the area of behavioural economics, focus on the influence of SM on decision-making; still others try to predict financial indicators such as the Dow Jones Industrial Average (DJIA). However, the review indicated that no study has correlated sentiment polarity towards products and companies in tweets with the share price of the company.
    The theoretical framework used in this study is based on Computational Social Science (CSS) and Big Data. Supporting theories of CSS are Social Media Mining (SMM) and sentiment analysis; supporting theories of Big Data are Data Mining (DM) and Predictive Analytics (PA). Machine learning (ML) techniques were adopted to analyse and classify the tweets. In the first stage of the study, a body of tweets was collected, pre-processed, and analysed for sentiment polarity towards Apple Inc., the iPad and the iPhone. Several datasets were created using different pre-processing and analysis methods.
    The tweet frequencies were then represented as time series and analysed against the share price time series using the Granger causality test, to determine whether one time series has predictive information about the other over the same period. Several PA techniques on tweets were evaluated to predict the Apple share price. To collect and analyse the data, a framework was developed based on the LingPipe (LingPipe 2015) Natural Language Processing (NLP) toolkit for sentiment analysis, with R, the language and environment for statistical computing, used for the correlation analysis. Twitter provides an API (Application Programming Interface) for accessing and collecting its data programmatically. While no clear correlation could be established, at least one dataset was shown to carry some predictive information about the development of the Apple share price; the other datasets showed no predictive capability. The techniques applied in this study did not indicate a direct correlation, and some results suggest that this is due to noise or asymmetric distributions in the datasets.
    The study contributes to the literature by providing a quantitative analysis of SM data, in this case tweets about Apple and its most popular products, the iPad and iPhone, and by showing how SM data can be used for PA. It contributes to the literature on Big Data and SMM by showing how SM data can be collected, analysed and classified, and by exploring whether the share price of a company can be determined from sentiment time series. This may ultimately lead to better decision-making, for instance for investments or share buybacks.
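
    As a rough sketch of the correlation step described above: the study represented daily tweet sentiment and the share price as time series and applied the Granger causality test. The study used R for this; the sketch below uses Python's statsmodels instead, on synthetic series, so the data and lag structure are assumptions for illustration only.

```python
# Minimal sketch of a Granger causality check between a daily sentiment
# series and a share-price series. Both series are synthetic.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
days = 120
sentiment = rng.normal(size=days).cumsum()   # net positive-minus-negative tweet counts
noise = rng.normal(scale=0.5, size=days)
price = np.roll(sentiment, 2) + noise        # price loosely follows sentiment, lagged 2 days

data = pd.DataFrame({"price": price, "sentiment": sentiment})

# Null hypothesis: 'sentiment' does NOT Granger-cause 'price'. Small p-values
# at some lag suggest the sentiment series carries predictive information
# about the price series.
results = grangercausalitytests(data[["price", "sentiment"]], maxlag=5)
```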