599 research outputs found

    A hybrid similarity measure method for patent portfolio analysis

    © 2016 Elsevier Ltd. Similarity measures are fundamental tools for identifying relationships within or across patent portfolios. Many bibliometric indicators are used to determine similarity measures, for example bibliographic coupling, citation and co-citation, and co-word distribution. This paper aims to construct a hybrid similarity measure method based on multiple indicators to analyze patent portfolios. Two models are proposed: categorical similarity and semantic similarity. The categorical similarity model emphasizes international patent classifications (IPCs), while the semantic similarity model emphasizes textual elements. We introduce fuzzy set routines to translate the rough technical (sub-)categories of IPCs into defined numeric values, and we calculate the categorical similarities between patent portfolios using membership grade vectors. In parallel, we identify and highlight core terms in a 3-level tree structure and compute the semantic similarities by comparing the tree-based structures. A weighting model is designed to account for: 1) the bias that exists between the categorical and semantic similarities, and 2) the weighting or integrating strategy for a hybrid method. A case study measuring the technological similarities between selected firms in China's medical device industry demonstrates the reliability of our method, and the results indicate its practical value in a broad range of informetric applications.
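    The weighting model described above can be sketched as a simple blend of a categorical score over fuzzy IPC membership-grade vectors and a semantic score over core-term sets. The vectors, term sets, and the weight alpha below are hypothetical stand-ins for illustration, not the paper's actual data or calibration:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length numeric vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(terms_a, terms_b):
    # Semantic overlap between two sets of core terms.
    union = len(terms_a | terms_b)
    return len(terms_a & terms_b) / union if union else 0.0

def hybrid_similarity(ipc_a, ipc_b, terms_a, terms_b, alpha=0.5):
    # Weighted blend of categorical (fuzzy IPC membership) and
    # semantic (core-term) similarity; alpha is a tunable weight.
    return alpha * cosine(ipc_a, ipc_b) + (1 - alpha) * jaccard(terms_a, terms_b)

# Two hypothetical patent portfolios: fuzzy IPC membership grades + core terms.
a = ([0.9, 0.1, 0.0], {"stent", "catheter", "imaging"})
b = ([0.8, 0.2, 0.0], {"stent", "imaging", "sensor"})
score = hybrid_similarity(a[0], b[0], a[1], b[1], alpha=0.6)
```

    In practice the term comparison in the paper works over 3-level term trees rather than flat sets; the flat Jaccard overlap here is a simplification.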

    Evaluating human versus machine learning performance in classifying research abstracts

    We study whether humans or machine learning (ML) classification models are better at classifying scientific research abstracts according to a fixed set of discipline groups. We recruit both undergraduate and postgraduate assistants for this task in separate stages, and compare their performance against the support vector machine (SVM) ML algorithm at classifying European Research Council Starting Grant project abstracts to their actual evaluation panels, which are organised by discipline groups. On average, ML is more accurate than human classifiers, across a variety of training and test datasets, and across evaluation panels. ML classifiers trained on different training sets are also more reliable than human classifiers, meaning that different ML classifiers are more consistent in assigning the same classifications to any given abstract than different human classifiers are. While the top five percentile of human classifiers can outperform ML in limited cases, selecting and training such classifiers is likely costly and difficult compared to training ML models. Our results suggest ML models are a cost-effective and highly accurate method for addressing problems in comparative bibliometric analysis, such as harmonising the discipline classifications of research from different funding agencies or countries. The study was partially funded by the Singapore National Research Foundation, Grant No. NRF2014-NRF-SRIE001-02.
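    A minimal sketch of the SVM classification setup, using scikit-learn's TF-IDF features and a linear SVM. The toy abstracts and panel labels ("LS" for life sciences, "SH" for social sciences and humanities) are invented for illustration and are not the ERC data used in the study:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical abstracts labeled with invented discipline-group panels.
abstracts = [
    "gene expression in cancer cells",
    "protein folding and molecular dynamics",
    "monetary policy and inflation expectations",
    "labor markets and unemployment dynamics",
]
panels = ["LS", "LS", "SH", "SH"]

# TF-IDF features feeding a linear SVM, trained on the labeled abstracts.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(abstracts, panels)
pred = clf.predict(["fiscal policy and inflation"])[0]
```

    A real replication would need thousands of labeled abstracts and held-out test sets; with four documents this only demonstrates the pipeline shape.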

    Human resources mining for examination of R&D progress and requirements


    Semantic Distances for Technology Landscape Visualization

    This paper presents a novel approach to the visualization and subsequent elucidation of research domains in science and technology. The proposed methodology is based on bibliometrics; i.e., analysis is conducted using information regarding trends and patterns of publication rather than the contents of the publications themselves. In particular, we explore the use of term co-occurrence frequencies as an indicator of the semantic closeness between pairs of words or phrases. To demonstrate the utility of this approach, a case study on renewable energy technologies is conducted, in which the above techniques are used to visualize the interrelationships within a collection of energy-related keywords. As these are regarded as manifestations of the underlying research topics, we contend that the proposed visualizations can be interpreted as representations of the underlying technology landscape. These techniques have many potential applications, but one challenge in which we are particularly interested is the mapping and subsequent prediction of future developments in the technological fields being studied. The research described in this paper was funded by the Masdar Institute of Science and Technology (MIST).
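    Term co-occurrence frequency as a proxy for semantic closeness can be sketched as follows. The keyword sets are hypothetical publication records, and the Jaccard-style normalization is one common association measure among several; the paper does not prescribe this exact formula:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical keyword sets, one per publication record.
records = [
    {"solar", "photovoltaic", "grid"},
    {"solar", "photovoltaic"},
    {"wind", "turbine", "grid"},
    {"wind", "grid"},
]

# Count single-term occurrences and pairwise co-occurrences across records.
occ = defaultdict(int)
co = defaultdict(int)
for terms in records:
    for t in terms:
        occ[t] += 1
    for a, b in combinations(sorted(terms), 2):
        co[(a, b)] += 1

def closeness(a, b):
    # Jaccard-style association: shared records / records with either term.
    c = co[tuple(sorted((a, b)))]
    denom = occ[a] + occ[b] - c
    return c / denom if denom else 0.0

def distance(a, b):
    # Semantic distance for landscape layout: 1 minus association.
    return 1 - closeness(a, b)
```

    Distances like these can then be fed to multidimensional scaling or a force-directed layout to draw the landscape map.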

    Does deep learning help topic extraction? A kernel k-means clustering method with word embedding

    © 2018 All rights reserved. Topic extraction presents challenges for the bibliometric community, and its performance still depends on human intervention and the application area. This paper proposes a novel kernel k-means clustering method incorporating a word embedding model to effectively extract topics from bibliometric data. Experimental comparisons of this method with four clustering baselines (i.e., k-means, fuzzy c-means, principal component analysis, and topic models) on two bibliometric datasets demonstrate its effectiveness both across a relatively broad range of disciplines and within a given domain. An empirical study on bibliometric topic extraction from articles published by three top-tier bibliometric journals between 2000 and 2017, supported by expert knowledge-based evaluations, provides supplemental evidence of the method's ability to extract topics. Additionally, this empirical analysis reveals insights into both overlapping and diverse research interests among the three journals that would benefit journal publishers, editorial boards, and research communities.
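    A bare-bones kernel k-means over document vectors can be sketched as below. The two-dimensional vectors stand in for averaged word-embedding representations of documents, and the RBF kernel and farthest-first seeding are illustrative choices, not necessarily the configuration used in the paper:

```python
import math

def rbf_kernel(X, gamma=1.0):
    # Gram matrix of the RBF kernel over the row vectors in X.
    n = len(X)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]

def kernel_kmeans(K, k, iters=20):
    # Lloyd-style kernel k-means: centroids live in feature space, so only
    # assignments are stored and distances use the kernel trick.
    n = len(K)
    seeds = [0]
    while len(seeds) < k:  # farthest-first seeding in kernel space
        seeds.append(min(range(n), key=lambda i: max(K[i][s] for s in seeds)))
    labels = [max(range(k), key=lambda c: K[i][seeds[c]]) for i in range(n)]
    for _ in range(iters):
        members = [[j for j in range(n) if labels[j] == c] for c in range(k)]
        new = []
        for i in range(n):
            best, best_d = labels[i], float("inf")
            for c, mem in enumerate(members):
                if not mem:
                    continue
                m = len(mem)
                # ||phi(x_i) - mu_c||^2 expanded via kernel evaluations.
                d = (K[i][i]
                     - 2.0 * sum(K[i][j] for j in mem) / m
                     + sum(K[j][l] for j in mem for l in mem) / (m * m))
                if d < best_d:
                    best, best_d = c, d
            new.append(best)
        if new == labels:
            break
        labels = new
    return labels

# Toy stand-ins for averaged word-embedding vectors of documents.
docs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = kernel_kmeans(rbf_kernel(docs), k=2)
```

    The kernel trick is what lets the embedding-induced similarity shape the clusters without ever computing explicit centroids.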

    Clustering of scientific fields by integrating text mining and bibliometrics.

    The increasing dissemination of scientific and technological publications via the Internet, and their availability in large-scale bibliographic databases, create enormous opportunities for mapping science and technology. The continuing growth of available computing power and the development of new algorithms also contribute to this. Important challenges remain, however. This dissertation confirms the hypothesis that the accuracy of both the clustering of scientific fields and the classification of publications can still be improved by integrating text mining and bibliometrics. Both the textual and the bibliometric approach have advantages and disadvantages, and each offers a different view of a corpus of scientific publications or patents. On the one hand, a wealth of textual information is present in such documents; on the other hand, the citations among them form large networks that provide additional information. We integrate both perspectives and show how existing textual and bibliometric methods can be improved. The dissertation consists of three parts. First, we discuss the use of text mining techniques for information retrieval and for mapping the knowledge contained in texts. We introduce and demonstrate a text mining framework, as well as the use of agglomerative hierarchical clustering. We also investigate the relationship between clustering performance on the one hand and the desired number of clusters and the number of factors in latent semantic indexing on the other. In addition, we describe a composite, semi-automatic strategy for determining the number of clusters in a collection of documents. Second, we address networks consisting of citations between scientific documents and networks arising from collaboration between authors. Such networks can be analyzed with techniques from bibliometrics and graph theory, with the aim of ranking relevant entities, clustering, and community detection. Third, we demonstrate the complementarity of text mining and bibliometrics and propose ways to properly integrate both worlds. The performance of unsupervised clustering and of classification improves significantly when the textual content of scientific publications is combined with the structure of citation networks. A method based on statistical meta-analysis achieves the best results and outperforms methods based solely on text or on citations. Our integrated or hybrid strategies for information retrieval and clustering are demonstrated in two domain studies. The aim of the first study is to unravel and visualize the concept structure of the information sciences and to test the added value of the hybrid method. The second study covers the cognitive structure, bibliometric properties, and dynamics of bioinformatics. We develop a method for dynamic and integrated clustering of evolving bibliographic corpora, which compares and tracks clusters over time. In summary, for the complementary text and network worlds we design a hybrid clustering method that takes both paradigms into account simultaneously, and we show that the integrated view yields a better understanding of the structure and evolution of scientific fields.
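    The hybrid integration of textual and citation views can be caricatured as combining two document-by-document similarity matrices and clustering the result. The matrices below are invented, and the weighted sum is a deliberately simple stand-in for the statistical meta-analysis combination the dissertation found to work best:

```python
# Hypothetical 4-document corpus: text cosine similarities and
# symmetric bibliographic-coupling-style citation link strengths.
text_sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
cites = [
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.2, 0.0],
    [0.0, 0.2, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]

def hybrid_matrix(A, B, w=0.5):
    # Linear integration of the two views; w balances text vs. citations.
    n = len(A)
    return [[w * A[i][j] + (1 - w) * B[i][j] for j in range(n)] for i in range(n)]

def cluster(S, threshold=0.4):
    # Single-linkage-style grouping: connected components above threshold,
    # via a small union-find with path compression.
    n = len(S)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if S[i][j] >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

groups = cluster(hybrid_matrix(text_sim, cites, w=0.5))
```

    Documents 0-1 and 2-3 end up in separate groups because both views agree on the split; disagreement between views is exactly where the weighting choice matters.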

    A critical cluster analysis of 44 indicators of author-level performance

    This paper explores the relationship between author-level bibliometric indicators and the researchers they "measure", exemplified across five academic seniorities and four disciplines. Using cluster methodology, the disciplinary and seniority appropriateness of author-level indicators is examined. Publication and citation data for 741 researchers across Astronomy, Environmental Science, Philosophy and Public Health were collected in Web of Science (WoS). Forty-four indicators of individual performance were computed using the data. A two-step cluster analysis using IBM SPSS version 22 was performed, followed by a risk analysis and ordinal logistic regression to explore cluster membership. Indicator scores were contextualized using the individual researcher's curriculum vitae. Four different clusters based on indicator scores ranked researchers as low, middle, high and extremely high performers. The results show that different indicators were appropriate for demarcating ranked performance in different disciplines: the h2 indicator in Astronomy, sum pp top prop in Environmental Science, Q2 in Philosophy, and the e-index in Public Health. The regression and odds analysis showed that individual-level indicator scores depended primarily on the number of years since the researcher's first publication registered in WoS, the number of publications, and the number of citations. Seniority classification was secondary; therefore, no seniority-appropriate indicators were confidently identified. Cluster methodology proved useful in identifying disciplinarily appropriate indicators, provided the preliminary data preparation was thorough, but it needed to be supplemented by other analyses to validate the results. A general disconnection was observed between the performance of the researcher on their curriculum vitae and their performance based on bibliometric indicators. Comment: 28 pages, 7 tables, 2 figures, 2 appendices
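    Two of the indicator families mentioned (the h-index, which the h2 indicator extends, and Zhang's e-index) are straightforward to compute from a researcher's citation counts. The citation list below is hypothetical:

```python
def h_index(citations):
    # Largest h such that h papers have at least h citations each.
    cs = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cs, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

def e_index(citations):
    # Zhang's e-index: sqrt of the excess citations in the h-core,
    # i.e. e^2 = (citations of the top h papers) - h^2.
    cs = sorted(citations, reverse=True)
    h = h_index(citations)
    return (sum(cs[:h]) - h * h) ** 0.5

# Hypothetical citation counts for one researcher's five papers.
profile = [10, 8, 5, 4, 3]
```

    The paper's point is precisely that no single such number works across disciplines; these functions only show how cheaply the raw scores are obtained.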

    Characterizing the potential of being emerging generic technologies: A Bi-Layer Network Analytics-based Prediction Method

    © 2019 17th International Conference on Scientometrics and Informetrics, ISSI 2019 - Proceedings. All rights reserved. Despite the extensive involvement of bibliometrics in profiling technological landscapes and identifying emerging topics, how to predict potential technological change remains unclear. This paper proposes a bi-layer network analytics-based prediction method to characterize the potential of emerging generic technologies. First, based on the innovation literature, three technological characteristics are defined and quantified by topological indicators in network analytics. A link prediction approach is then applied to reconstruct the network with weighted missing links; this reconstruction also changes the related technological characteristics. Finally, comparing the two resulting ranking lists of terms helps identify potential emerging generic technologies. A case study on predicting emerging generic technologies in information science demonstrates the feasibility and reliability of the proposed method.
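    The link prediction step can be illustrated with a classic topological score such as Adamic-Adar over a term network. The co-occurrence graph below is invented, and Adamic-Adar is one standard predictor rather than the specific approach used in the paper:

```python
import math

# Hypothetical term co-occurrence network: term -> set of neighboring terms.
graph = {
    "deep learning": {"neural network", "embedding", "classification"},
    "neural network": {"deep learning", "embedding"},
    "embedding": {"deep learning", "neural network", "retrieval"},
    "classification": {"deep learning"},
    "retrieval": {"embedding"},
}

def adamic_adar(g, u, v):
    # Score a candidate missing link by counting shared neighbors,
    # down-weighting high-degree hubs via 1/log(degree).
    shared = g[u] & g[v]
    return sum(1 / math.log(len(g[w])) for w in shared if len(g[w]) > 1)

score = adamic_adar(graph, "neural network", "classification")
```

    High-scoring non-edges would be added back as weighted missing links, and the topological indicators recomputed on the reconstructed network to produce the second ranking list.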