7 research outputs found
Deep context of citations using machine‑learning models in scholarly full‑text articles
Information retrieval systems for scholarly literature rely not only on text matching but also on semantic and context-based features. Readers are increasingly interested in how important an article is, what its purpose is, and how influential it is in follow-up research. Numerous techniques that tap the power of machine learning and artificial intelligence have been developed to enhance retrieval of the most influential scientific literature. In this paper, we compare and improve on four existing state-of-the-art techniques designed to identify influential citations. We consider 450 citations from the Association for Computational Linguistics corpus, classified by experts as either important or unimportant, and extract 64 features based on the methodology of the four state-of-the-art techniques. We apply the Extra-Trees classifier to select the 29 best features and apply the Random Forest and Support Vector Machine classifiers to all selected techniques. Using the Random Forest classifier, our supervised model improves on the state-of-the-art method by 11.25%, with an 89% precision-recall area under the curve. Finally, we present our deep-learning model, a Long Short-Term Memory network, that uses all 64 features to distinguish important from unimportant citations with 92.57% accuracy.
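The workflow described above, tree-based feature selection followed by a supervised classifier evaluated with precision-recall AUC, can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' pipeline: the synthetic data stands in for the 450-citation, 64-feature matrix, and all hyperparameters are assumptions.

```python
# Sketch: Extra-Trees feature importances pick a 29-feature subset, then a
# Random Forest is trained on that subset and scored with precision-recall AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

# Synthetic stand-in for the 450 citations x 64 features matrix
X, y = make_classification(n_samples=450, n_features=64, n_informative=20,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep the 29 most important features according to Extra-Trees importances
# (threshold=-inf makes max_features the only selection criterion)
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0),
                           max_features=29, threshold=-np.inf).fit(X_train, y_train)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(selector.transform(X_train), y_train)
scores = clf.predict_proba(selector.transform(X_test))[:, 1]
pr_auc = average_precision_score(y_test, scores)  # precision-recall AUC
print(f"selected features: {selector.transform(X_train).shape[1]}, PR-AUC: {pr_auc:.2f}")
```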
Beyond S-curves: Recurrent Neural Networks for Technology Forecasting
Because of the considerable heterogeneity and complexity of the technological landscape, building accurate forecasting models is a challenging endeavor. Owing to their high prevalence in many complex systems, S-curves are a popular forecasting approach in previous work. However, their forecasting performance has not been directly compared to that of other technology forecasting approaches. Additionally, recent developments in time series forecasting that claim to improve forecasting accuracy have yet to be applied to technological development data. This work addresses both research gaps by comparing the forecasting performance of S-curves to a baseline and by developing an autoencoder approach that employs recent advances in machine learning and time series forecasting. S-curve forecasts largely exhibit a mean absolute percentage error (MAPE) comparable to a simple ARIMA baseline; however, for a minority of emerging technologies, the MAPE increases by two orders of magnitude. Our autoencoder approach improves the MAPE by 13.5% on average over the second-best result. It forecasts established technologies with the same accuracy as the other approaches, and it is especially strong at forecasting emerging technologies, with a mean MAPE 18% lower than the next best result. Our results imply that a simple ARIMA model is preferable to the S-curve for technology forecasting. Practitioners looking for more accurate forecasts should opt for the presented autoencoder approach.
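The MAPE metric on which the comparisons above rest is straightforward to compute. A minimal pure-Python sketch; the series values are made up for illustration, not taken from the paper:

```python
# Mean absolute percentage error (MAPE): the average of |actual - forecast| / |actual|,
# expressed as a percentage. Sensitive to near-zero actual values.
def mape(actual, forecast):
    if len(actual) != len(forecast):
        raise ValueError("series must have equal length")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Example: a technology indicator vs. an illustrative forecast
actual   = [100.0, 120.0, 150.0, 200.0]
forecast = [ 98.0, 118.0, 155.0, 210.0]
print(f"MAPE: {mape(actual, forecast):.2f}%")  # → MAPE: 3.00%
```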
Parsing AUC Result-Figures in Machine Learning Specific Scholarly Documents for Semantically-enriched Summarization
Machine learning specific scholarly full-text documents contain a number of result-figures expressing valuable data, including experimental results, evaluations, and cross-model comparisons. Scholarly search systems often overlook this vital information while indexing important terms using conventional text-based content extraction approaches. In this paper, we propose creating semantically enriched document summaries by extracting meaningful data from the result-figures specific to the evaluation metric of the area under the curve (AUC), and their associated captions, from full-text documents. We first classify the extracted figures and analyze them by parsing the figure text, legends, and data plots, using a convolutional neural network classification model with a ResNet-50 pre-trained on 1.2 million images from ImageNet. Next, we extract information from the result-figures specific to AUC by approximating the region under the function's graph as a trapezoid and calculating its area, i.e., the trapezoidal rule. Using over 12,000 figures extracted from 1,000 scholarly documents, we show that figure-specialized summaries contain more enriched terms about figure semantics. Furthermore, we empirically show that the trapezoidal rule can calculate the area under the curve by dividing the curve into multiple intervals. Finally, we measure the quality of the specialized summaries using ROUGE, edit distance, and Jaccard similarity metrics. Overall, we observed that figure-specialized summaries are more comprehensive and semantically enriched. The applications of our research are numerous, including improved document searching, figure searching, and figure-focused plagiarism detection. The data and code used in this paper can be accessed at the following URL: https://github.com/slab-itu/fig-ir/
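The trapezoidal rule mentioned above approximates the area under a curve by splitting it into intervals and summing trapezoid areas. A minimal pure-Python sketch; the sample points are illustrative, not values parsed from any figure:

```python
# Trapezoidal rule: AUC ≈ sum over intervals of (x[i+1] - x[i]) * (y[i] + y[i+1]) / 2
def trapezoidal_auc(xs, ys):
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))

# An ROC-style curve from (0, 0) to (1, 1)
fpr = [0.0, 0.2, 0.5, 1.0]
tpr = [0.0, 0.6, 0.8, 1.0]
print(f"AUC ≈ {trapezoidal_auc(fpr, tpr):.3f}")  # → AUC ≈ 0.720
```

Finer intervals (more sampled points along the curve) yield a closer approximation, which is the "dividing the curve into multiple intervals" idea in the abstract.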
Natural Language Processing in Information Metric Studies: an analysis of the articles indexed by the Web of Science (2000-2019)
Objective: To identify the international scientific structure of research linking the use of natural language processing (NLP) to the field of information metric studies. Methods: The research follows the qualitative and quantitative approaches of information metric studies within the knowledge organization domain. Data were retrieved on 02/02/2020 from the Web of Science Core Collection using the expression "natural language processing", limited to the document types article and review, refined by the Web of Science category Information Science & Library Science, and restricted to the timespan of the last 20 complete years (2000 to 2019). Social Network Analysis was used to examine and visualize the scientific collaboration, co-citation, and keyword co-occurrence networks. Results: Of the 552 documents retrieved, analysis of the abstracts showed that 31 fell within the field of information metric studies. Bibliometric indicators of production, relationship, and impact were considered and showed a growing number of publications over the last three years, with 2018 being the most productive year. Conclusions: Considering that the set of NLP techniques (e.g., bag of words, tokenization, word stemming, part-of-speech tagging, and SVM) allows researchers to go beyond traditional citation analysis toward an analysis focused on the content and context of citations, the international scientific literature on the application of NLP in information metric studies is emerging. The journal Scientometrics was identified as the dissemination venue of the works that achieved the greatest impact. Finally, the k-core co-citation analysis shows the existence of an important theoretical core that is frequently cited in the international academic community. This work was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, through a Proex/CAPES doctoral scholarship, process no. 88887.504100/2020-0.
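As a concrete illustration of the most basic technique in the NLP list above, here is a minimal pure-Python bag-of-words sketch; the tokenization is deliberately naive and the sample sentence is made up for illustration:

```python
# Bag of words: tokenize a document and count term occurrences, ignoring order.
from collections import Counter

def bag_of_words(text):
    # naive lowercase/whitespace tokenization; real pipelines typically also
    # remove stop words and apply stemming or part-of-speech tagging
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return Counter(t for t in tokens if t)

doc = "Citation analysis, citation context, and citation sentiment."
counts = bag_of_words(doc)
print(counts["citation"])  # → 3
```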
Neural Networks based Shunt Hybrid Active Power Filter for Harmonic Elimination
The growing use of nonlinear devices introduces harmonics into power system networks, distorting current and voltage signals and damaging the power distribution system. Therefore, the elimination of harmonics in power systems is of great concern. This paper presents an efficient, techno-economical approach to suppressing harmonics and improving the power factor in the power distribution network using a Shunt Hybrid Active Power Filter (SHAPF) based on neural network algorithms: the Artificial Neural Network (ANN), the Adaptive Neuro-Fuzzy Inference System (ANFIS), and the Recurrent Neural Network (RNN). The objective of the proposed algorithms for the SHAPF is to reduce Total Harmonic Distortion (THD) to within an acceptable range and thereby improve system quality. In our filter design approach, we tested and compared conventional pq0 theory and neural networks for detecting the harmonics present in the power system. Moreover, for regulating the DC supply to the inverter of the SHAPF, a conventional PI controller and neural-network-based controllers are used and compared. The applicability of the proposed filter is tested for three different nonlinear load cases. The simulation results show that the neural-network-based filter control techniques satisfy all international standards with minimum current THD, neutral-wire current elimination, and small DC voltage fluctuations for voltage regulation. Furthermore, all three neural network architectures are tested and compared based on accuracy and computational complexity, with the RNN outperforming the rest.
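The THD figure that the filters above aim to minimize relates the harmonic content of a waveform to its fundamental component. A minimal sketch of the standard definition; the harmonic amplitudes are illustrative, not simulation results from the paper:

```python
# Total harmonic distortion: THD = sqrt(V2^2 + V3^2 + ...) / V1, i.e. the RMS
# of the harmonic components relative to the fundamental amplitude.
import math

def thd(fundamental, harmonics):
    if fundamental <= 0:
        raise ValueError("fundamental amplitude must be positive")
    return math.sqrt(sum(h * h for h in harmonics)) / fundamental

# e.g. a current with 3rd and 5th harmonics at 4% and 3% of the fundamental
ratio = thd(1.0, [0.04, 0.03])
print(f"THD = {ratio * 100:.1f}%")  # sqrt(0.0016 + 0.0009) = 0.05 → THD = 5.0%
```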
A decade of in-text citation analysis based on natural language processing and machine learning techniques: an overview of empirical studies
In-text citation analysis is one of the most frequently used methods in research evaluation. We are seeing significant growth in citation analysis through bibliometric metadata, primarily due to the availability of citation databases such as the Web of Science, Scopus, Google Scholar, Microsoft Academic, and Dimensions. Thanks to better access to full-text publication corpora in recent years, information scientists have gone far beyond traditional bibliometrics, tapping into advances in full-text data processing to measure the impact of scientific publications in contextual terms. This has led to technical developments in citation classification, citation sentiment analysis, citation summarisation, and citation-based recommendation. This article narratively reviews the studies behind these developments, focusing on publications that have used natural language processing and machine learning techniques to analyse citations.
Capturing and Exploiting Citation Knowledge for the Recommendation of Scientific Publications
With the continuous growth of scientific literature, it is becoming increasingly challenging to discover relevant scientific publications among the plethora of available academic digital libraries. Despite the current scale, significant efforts have been made towards the research and development of academic search engines, reference management tools, review management platforms, scientometrics systems, and recommender systems that help find a variety of relevant scientific items, such as publications, books, researchers, grants and events, among others.
This thesis focuses on recommender systems for scientific publications. Existing systems do not always provide the most relevant scientific publications to users, even when those publications are present in the recommendation space. A common limitation is the lack of access to the full content of the publications when designing the recommendation methods. Solutions are largely based on the exploitation of metadata (e.g., titles, abstracts, lists of references, etc.), but rarely on the full text of the publications. Another important limitation is the lack of time awareness. Existing works have not addressed the important scenario of recommending the most recent publications to users, owing to the challenge of recommending items for which no ratings (i.e., user preferences) have yet been provided. The lack of evaluation benchmarks also limits the evolution and progress of the field.
This thesis investigates the use of fine-grained forms of citation knowledge, extracted from the full textual content of scientific publications, to enhance recommendations: citation proximity, citation context, citation section, citation graph and citation intention. We design and develop new recommendation methods that incorporate such knowledge, individually and in combination.
By conducting offline evaluations, as well as user studies, we show how the use of citation knowledge helps enhance the performance of existing recommendation methods on two key tasks: (i) recommending scientific publications for a given work, and (ii) recommending recent scientific publications to a user. Two novel evaluation benchmarks have also been generated and made available to the scientific community.
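One of the citation-knowledge forms mentioned above, the citation context, is typically a window of text around an in-text citation marker. A minimal sketch of extracting such windows; the regex for bracketed markers, the window size, and the sample sentence are assumptions for illustration, not the thesis's actual extraction method:

```python
# Extract a fixed-size word window around bracketed citation markers like [12].
import re

def citation_contexts(text, window=3):
    """Return (marker, context) pairs: `window` words on each side of each marker."""
    contexts = []
    for match in re.finditer(r"\[\d+\]", text):
        before = text[:match.start()].split()[-window:]
        after = text[match.end():].split()[:window]
        contexts.append((match.group(), " ".join(before + [match.group()] + after)))
    return contexts

sentence = ("Prior work on recommender systems [12] exploits metadata, "
            "while later studies [34] use the full text.")
for marker, ctx in citation_contexts(sentence):
    print(marker, "->", ctx)
```

Running this prints one context window per marker, e.g. `[12] -> on recommender systems [12] exploits metadata, while`; richer forms such as citation section or citation intention would require sectioning and a trained classifier on top of windows like these.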