9 research outputs found

    Characterizing Interdisciplinarity of Researchers and Research Topics Using Web Search Engines

    Researchers' networks have been subject to active modeling and analysis. Earlier literature mostly focused on citation or co-authorship networks reconstructed from annotated scientific publication databases, which have several limitations. Recently, general-purpose web search engines have also been utilized to collect information about social networks. Here we reconstructed, using web search engines, a network representing the relatedness of researchers to their peers as well as to various research topics. Relatedness between researchers and research topics was characterized by visibility boost, that is, the increase in a researcher's visibility obtained by focusing on a particular topic. It was observed that researchers who had high visibility boosts by the same research topic tended to be close to each other in their network. We calculated correlations between visibility boosts by research topics and researchers' interdisciplinarity at the individual level (diversity of topics related to the researcher) and at the social level (his/her centrality in the researchers' network). We found that visibility boosts by certain research topics were positively correlated with researchers' individual-level interdisciplinarity despite their negative correlations with the general popularity of researchers. It was also found that visibility boosts by network-related topics had positive correlations with researchers' social-level interdisciplinarity. Research topics' correlations with researchers' individual- and social-level interdisciplinarities were found to be nearly independent from each other. These findings suggest that the notion of "interdisciplinarity" of a researcher should be understood as a multi-dimensional concept that should be evaluated using multiple assessment means.
    Comment: 20 pages, 7 figures. Accepted for publication in PLoS ONE

    Evaluating human versus machine learning performance in classifying research abstracts

    We study whether humans or machine learning (ML) classification models are better at classifying scientific research abstracts according to a fixed set of discipline groups. We recruit both undergraduate and postgraduate assistants for this task in separate stages, and compare their performance against the support vector machine (SVM) algorithm at classifying European Research Council Starting Grant project abstracts to their actual evaluation panels, which are organised by discipline groups. On average, ML is more accurate than human classifiers, across a variety of training and test datasets, and across evaluation panels. ML classifiers trained on different training sets are also more reliable than human classifiers, meaning that different ML classifiers are more consistent in assigning the same classifications to any given abstract, compared to different human classifiers. While the top five percentile of human classifiers can outperform ML in limited cases, selection and training of such classifiers is likely costly and difficult compared to training ML models. Our results suggest ML models are a cost-effective and highly accurate method for addressing problems in comparative bibliometric analysis, such as harmonising the discipline classifications of research from different funding agencies or countries.
    National Research Foundation (NRF). Published version. The study was partially funded by the Singapore National Research Foundation, Grant No. NRF2014-NRF-SRIE001-02

    Measuring Author Research Relatedness: A Comparison of Word-based, Topic-based and Author Cocitation Approaches

    Relationships between authors based on characteristics of published literature have been studied for decades. Author cocitation analysis using mapping techniques has been most frequently used to study how close two authors are thought to be in intellectual space based on how members of the research community co-cite their works. Other approaches exist to study author relatedness based more directly on the text of their published works. In this study we present static and dynamic word-based approaches using vector space modeling, as well as a topic-based approach based on Latent Dirichlet Allocation for mapping author research relatedness. Vector space modeling is used to define an author space consisting of works by a given author. Outcomes for the two word-based approaches and a topic-based approach for 50 prolific authors in library and information science are compared with more traditional author cocitation analysis using multidimensional scaling and hierarchical cluster analysis. The two word-based approaches produced similar outcomes except where two authors were frequent co-authors for the majority of their articles. The topic-based approach produced the most distinctive map.
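
    As a rough illustration of the word-based idea in this abstract, the sketch below represents each author as a bag-of-words count vector over their works and compares authors by cosine similarity. The author names, texts, whitespace tokenization, and function names are all hypothetical; the study's actual weighting and preprocessing are not given in the abstract.

```python
import math
from collections import Counter


def author_vector(texts):
    """Build a term-count vector from all texts attributed to one author."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus (not data from the study).
works = {
    "author_a": ["citation analysis of journals", "author cocitation mapping"],
    "author_b": ["cocitation mapping of authors", "citation networks"],
    "author_c": ["digital library usability study"],
}

vectors = {name: author_vector(texts) for name, texts in works.items()}
sim_ab = cosine_similarity(vectors["author_a"], vectors["author_b"])
sim_ac = cosine_similarity(vectors["author_a"], vectors["author_c"])
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

    In this toy data, authors a and b share several terms and so score higher than a and c, who share none.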

    CITREC: An Evaluation Framework for Citation-Based Similarity Measures based on TREC Genomics and PubMed Central

    Citation-based similarity measures such as Bibliographic Coupling and Co-Citation are an integral component of many information retrieval systems. However, comparisons of the strengths and weaknesses of measures are challenging due to the lack of suitable test collections. This paper presents CITREC, an open evaluation framework for citation-based and text-based similarity measures. CITREC prepares the data from the PubMed Central Open Access Subset and the TREC Genomics collection for a citation-based analysis and provides tools necessary for performing evaluations of similarity measures. To account for different evaluation purposes, CITREC implements 35 citation-based and text-based similarity measures, and features two gold standards. The first gold standard uses the Medical Subject Headings (MeSH) thesaurus and the second uses the expert relevance feedback that is part of the TREC Genomics collection to gauge similarity. CITREC additionally offers a system that allows creating user-defined gold standards to adapt the evaluation framework to individual information needs and evaluation purposes.
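
    The two measures named in this abstract can be sketched in a few lines. Bibliographic coupling counts the references two documents share; co-citation counts how many documents cite both members of a pair. The citation data and function names below are hypothetical and are not part of CITREC.

```python
from collections import Counter
from itertools import combinations

# Hypothetical citation data: citing document -> set of cited documents.
citations = {
    "A": {"X", "Y", "Z"},
    "B": {"Y", "Z"},
    "C": {"X", "W"},
}


def bibliographic_coupling(citations, doc1, doc2):
    """Number of references that doc1 and doc2 have in common."""
    return len(citations.get(doc1, set()) & citations.get(doc2, set()))


def co_citation_counts(citations):
    """For every pair of cited documents, count the documents citing both."""
    counts = Counter()
    for refs in citations.values():
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return counts


coupling_ab = bibliographic_coupling(citations, "A", "B")
cocitations = co_citation_counts(citations)
print("coupling(A, B):", coupling_ab)
print("co-citation(Y, Z):", cocitations[("Y", "Z")])
```

    Pairs are stored as sorted tuples so that ("Y", "Z") and ("Z", "Y") map to the same key.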

    An academic perspective on the entrepreneurship policy agenda: themes, geographies and evolution

    Text mining is being increasingly used for the automatic analysis of different corpora of documents, either standalone or as a complement to other bibliometric techniques. The case of academic research into entrepreneurship policy is particularly interesting due to the increasing relevance of the topic and since the knowledge about the evolution of themes in this field is still rather limited. Consequently, this paper analyses the key topics, trends and shifts that have shaped the entrepreneurship policy research agenda to date using text mining techniques, cluster analysis and complementary bibliographic data to examine the evolution of a corpus of 1,048 academic papers focused on entrepreneurial-related policies and published during the period 1990-2016 in ten of the most relevant entrepreneurship journals. The results of the analysis show that inclusion, employment and regulation-related papers have largely dominated the research in the field, evolving from an initial classical approach about the relationship between entrepreneurship and employment to a wider and multidisciplinary perspective, including the relevance of management, geographies, and narrower topics such as agglomeration economics or internationalization instead of previous generic sectorial approaches. Overall, the text mining analysis reveals how entrepreneurship policy research has gained increasing attention and has become both more open, with a growing cooperation among researchers from different affiliations; and more sophisticated, with concepts and themes that moved the research agenda closer to the priorities of policy implementation.

    Rare Feature Selection in High Dimensions

    It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.
    Comment: 42 pages, 10 figures
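
    The aggregation idea in this abstract can be illustrated with a toy sketch: counts of rare words are summed into the tree node they belong to, yielding a denser feature. This is not the estimator implemented in the rare package; the one-level tree, the word groupings, and the review counts are hypothetical, and the aggregation level is fixed by hand here.

```python
from collections import Counter

# Hypothetical one-level tree: each rare word (leaf) maps to a parent node.
parent = {
    "superb": "positive",
    "stellar": "positive",
    "exquisite": "positive",
    "dingy": "negative",
    "grimy": "negative",
}


def aggregate_counts(doc_counts, parent):
    """Sum the counts of leaves into their parent node; keep other words as-is."""
    aggregated = Counter()
    for word, count in doc_counts.items():
        aggregated[parent.get(word, word)] += count
    return aggregated


# Hypothetical word counts for three reviews.
reviews = [
    Counter({"superb": 1, "room": 2}),
    Counter({"stellar": 1, "grimy": 1}),
    Counter({"exquisite": 2}),
]

aggregated_reviews = [aggregate_counts(r, parent) for r in reviews]

# Each rare word occurs in one review, but the aggregated feature occurs in all three.
docs_with_positive = sum(1 for r in aggregated_reviews if r["positive"] > 0)
print("reviews containing the aggregated 'positive' feature:", docs_with_positive)
```

    The resulting "positive" column is nonzero in every review, whereas each original word column is nonzero in only one.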