27 research outputs found

    k-NN Embedding Stability for word2vec Hyper-Parametrisation in Scientific Text

    Word embeddings are increasingly attracting the attention of researchers dealing with semantic similarity and analogy tasks. However, finding the optimal hyper-parameters remains an important challenge due to the resulting impact on the revealed analogies, particularly for domain-specific corpora. Since analogies are widely used for hypothesis synthesis, it is crucial to optimise word embedding hyper-parameters for precise hypothesis synthesis. Therefore, we propose, in this paper, a methodological approach for tuning word embedding hyper-parameters by using the stability of the k-nearest neighbors of word vectors within scientific corpora, and more specifically Computer Science corpora, with Machine Learning adopted as a case study. This approach is tested on a dataset created from NIPS (Conference on Neural Information Processing Systems) publications and evaluated with a curated ACM hierarchy and the Wikipedia Machine Learning outline as the gold standard. Our quantitative and qualitative analyses indicate that our approach not only reliably captures interesting patterns like "unsupervised_learning is to kmeans as supervised_learning is to knn", but also captures the analogical hierarchy structure of Machine Learning and consistently outperforms the state-of-the-art embeddings on syntactic accuracy (68% versus 61%).
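
    A minimal sketch of the k-NN stability signal described above, assuming the gensim library; the toy corpus, the keyword list, and the Jaccard-overlap measure are illustrative stand-ins rather than the paper's exact protocol:

        from gensim.models import Word2Vec

        # Toy sentences, repeated so co-occurrence counts are non-trivial.
        corpus = [
            ["unsupervised_learning", "kmeans", "clustering"],
            ["supervised_learning", "knn", "classification"],
        ] * 100

        def knn_set(model, word, k=3):
            # Set of the k nearest neighbours of `word` in embedding space.
            return {w for w, _ in model.wv.most_similar(word, topn=k)}

        def knn_stability(model_a, model_b, words, k=3):
            # Average Jaccard overlap of k-NN sets between two models.
            overlaps = []
            for w in words:
                a, b = knn_set(model_a, w, k), knn_set(model_b, w, k)
                overlaps.append(len(a & b) / len(a | b))
            return sum(overlaps) / len(overlaps)

        # Two candidate hyper-parameter settings; only the window differs.
        m1 = Word2Vec(corpus, vector_size=50, window=2, min_count=1, seed=1)
        m2 = Word2Vec(corpus, vector_size=50, window=5, min_count=1, seed=1)
        print(knn_stability(m1, m2, ["kmeans", "knn"], k=3))

    Under this reading, a hyper-parameter setting whose neighbourhoods stay stable across retrainings would be preferred over one whose neighbourhoods fluctuate.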

    Leveraging Temporal Word Embeddings for the Detection of Scientific Trends

    Tracking the dynamics of science and early detection of emerging research trends could potentially revolutionise the way research is done. For this reason, computational history of science and trend analysis have become an important area in academia and industry, owing to their significant implications for research funding and public policy. The literature presents several emerging approaches to detecting new research trends. Most of these approaches rely mainly on citation counting. While citations have been widely used as indicators of emerging research topics, they pose several limitations. Most importantly, citations can take months or even years to accumulate before revealing trends. Furthermore, they fail to dig into the paper content. To overcome this problem, this thesis leverages a natural language processing method, namely temporal word embeddings, that learns semantic and syntactic relations among words over time. The principal objective of this method is to study the change in pairwise similarities between pairs of scientific keywords over time, which helps to track the dynamics of science and detect emerging scientific trends. To this end, this thesis proposes a methodological approach to tune the hyper-parameters of word2vec, the word embedding technique used in this thesis, within scientific text. It then provides a suite of novel approaches that aim to perform the computational history of science by detecting emerging scientific trends and tracking the dynamics of science. The detection of emerging scientific trends is performed through two approaches, Hist2vec and Leap2Trend, devoted, respectively, to the detection of converging keywords and contextualising keywords. The dynamics of science, on the other hand, are tracked by Vec2Dynamics, which follows the evolution of the semantic neighborhood of keywords over time. All of the proposed approaches have been applied to the area of machine learning and validated against different gold standards. The obtained results reveal the effectiveness of the proposed approaches in detecting trends and tracking the dynamics of science. More specifically, Hist2vec strongly correlates with citation counts, with a 100% positive Spearman's correlation. Additionally, Leap2Trend performs with more than 80% accuracy and 90% precision in detecting emerging trends. Also, Vec2Dynamics shows great potential to trace the history of the machine learning literature exactly as the machine learning timeline does. Such significant findings evidence the utility of the proposed approaches for performing the computational history of science.
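
    The core temporal-embedding idea can be sketched briefly, assuming gensim: train one word2vec model per timespan and track the similarity of a keyword pair across timespans. The slice data and keywords below are invented for illustration:

        from gensim.models import Word2Vec

        # One tokenised corpus per timespan (toy stand-ins for real papers).
        slices = {
            2005: [["neural_networks", "svm", "kernel", "deep_learning"]] * 100,
            2010: [["neural_networks", "deep_learning", "gpu", "svm"]] * 100,
        }
        models = {t: Word2Vec(docs, vector_size=50, window=2, min_count=1, seed=1)
                  for t, docs in slices.items()}

        # A rising similarity series for a keyword pair suggests convergence.
        for t in sorted(models):
            sim = models[t].wv.similarity("neural_networks", "deep_learning")
            print(f"{t}: sim(neural_networks, deep_learning) = {sim:.3f}")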

    DeepHist: Towards a Deep Learning-based Computational History of Trends in the NIPS

    Research in the analysis of big scholarly data has increased in the recent past, and it aims to understand research dynamics and forecast research trends. The ultimate objective in this research is to design and implement novel and scalable methods for extracting knowledge and computational history. While citations are widely used to identify emerging/rising research topics, they can take months or even years to stabilise enough to reveal research trends. Consequently, it is necessary to develop faster yet accurate methods for trend analysis and computational history that dig into the content and semantics of an article. Therefore, this paper aims to conduct a fine-grained content analysis of scientific corpora from the domain of Machine Learning. This analysis uses DeepHist, a deep learning-based computational history approach; the approach relies on a dynamic word embedding that aims to represent words with low-dimensional vectors computed by deep neural networks. The scientific corpora come from 5991 publications from the Neural Information Processing Systems (NIPS) conference between 1987 and 2015, which are divided into six 5-year timespans. The analysis of these corpora generates visualisations produced by applying t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction. The qualitative and quantitative study reported here reveals the evolution of the prominent Machine Learning keywords; this evolution supports the popularity of current research topics in the field. This support is evident given how well the popularity of the detected keywords correlates with the citation counts received by their corresponding papers: Spearman's positive correlation is 100%. With such a strong result, this work evidences the utility of deep learning techniques for determining the computational history of science.
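
    The t-SNE visualisation step can be sketched as follows, assuming scikit-learn and matplotlib; the keyword vectors here are random placeholders rather than embeddings actually learned from the NIPS corpus:

        import numpy as np
        import matplotlib.pyplot as plt
        from sklearn.manifold import TSNE

        keywords = ["svm", "kernel", "deep_learning", "cnn", "sampling", "mcmc"]
        rng = np.random.default_rng(0)
        vectors = rng.normal(size=(len(keywords), 50))  # placeholder embeddings

        # perplexity must be smaller than the number of points projected.
        coords = TSNE(n_components=2, perplexity=2.0,
                      random_state=0).fit_transform(vectors)

        plt.scatter(coords[:, 0], coords[:, 1])
        for (x, y), word in zip(coords, keywords):
            plt.annotate(word, (x, y))
        plt.title("t-SNE projection of keyword embeddings (toy data)")
        plt.show()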

    Vec2Dynamics: A Temporal Word Embedding Approach to Exploring the Dynamics of Scientific Keywords—Machine Learning as a Case Study

    The study of the dynamics, or the progress, of science has been widely explored with descriptive and statistical analyses. This study has also attracted several computational approaches that are labelled together as the Computational History of Science, especially with the rise of data science and the development of increasingly powerful computers. Among these approaches, some works have studied dynamism in scientific literature by employing text analysis techniques that rely on topic models to study the dynamics of research topics. Unlike topic models, which do not delve deeper into the content of scientific publications, this paper, for the first time, uses temporal word embeddings to automatically track the dynamics of scientific keywords over time. To this end, we propose Vec2Dynamics, a neural-based computational history approach that reports the stability of the k-nearest neighbors of scientific keywords over time; the stability indicates whether the keywords are acquiring new neighborhoods due to the evolution of the scientific literature. To evaluate how Vec2Dynamics models such relationships in the domain of Machine Learning (ML), we constructed scientific corpora from the papers published in the Neural Information Processing Systems (NIPS, now abbreviated NeurIPS) conference between 1987 and 2016. The descriptive analysis that we performed in this paper verifies the efficacy of our proposed approach. In fact, we found a generally strong consistency between the obtained results and the Machine Learning timeline.
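
    A minimal numpy sketch of the neighborhood-stability bookkeeping behind this idea; the vocabulary, timespans, and embedding tables below are random placeholders, not the paper's trained vectors:

        import numpy as np

        rng = np.random.default_rng(42)
        vocab = ["perceptron", "backprop", "svm", "kernel", "boosting"]
        years = [1990, 1995, 2000]
        # One embedding table per timespan; random placeholders here.
        emb = {y: {w: rng.normal(size=20) for w in vocab} for y in years}

        def knn(table, word, k=2):
            # k nearest neighbours of `word` by cosine similarity.
            v = table[word]
            sims = {w: float(np.dot(v, u) /
                             (np.linalg.norm(v) * np.linalg.norm(u)))
                    for w, u in table.items() if w != word}
            return set(sorted(sims, key=sims.get, reverse=True)[:k])

        # Low overlap between consecutive timespans signals that the
        # keyword's semantic neighborhood is shifting.
        for prev, curr in zip(years, years[1:]):
            a, b = knn(emb[prev], "svm"), knn(emb[curr], "svm")
            print(f"{prev}->{curr}: overlap = {len(a & b) / len(a | b):.2f}")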

    An Empirical Study on the Fairness of Pre-trained Word Embeddings

    Pre-trained word embedding models are easily distributed and applied, as they spare users the effort of training models themselves. With widely distributed models, it is important to ensure that they do not exhibit undesired behaviour, such as biases against population groups. For this purpose, we carry out an empirical study on evaluating the bias of 15 publicly available, pre-trained word embedding models based on three training algorithms (GloVe, word2vec, and fastText) with regard to four bias metrics (WEAT, SEMBIAS, DIRECT BIAS, and ECT). The choice of word embedding models and bias metrics is motivated by a literature survey of 37 publications that quantified bias on pre-trained word embeddings. Our results indicate that fastText is the least biased model (in 8 out of 12 cases) and that small vector lengths lead to a higher bias.
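
    Of the four metrics, WEAT is the simplest to sketch: the test statistic is the differential association of two target word sets with two attribute sets. The hedged illustration below uses placeholder vectors rather than a real pre-trained model, and the set labels in the comments are hypothetical examples:

        import numpy as np

        def cos(u, v):
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

        def weat_score(X, Y, A, B):
            # WEAT test statistic: differential association of the target
            # sets X and Y with the attribute sets A and B.
            def assoc(w):
                return (np.mean([cos(w, a) for a in A]) -
                        np.mean([cos(w, b) for b in B]))
            return sum(assoc(x) for x in X) - sum(assoc(y) for y in Y)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(4, 50))  # e.g. career words
        Y = rng.normal(size=(4, 50))  # e.g. family words
        A = rng.normal(size=(4, 50))  # e.g. one group of names
        B = rng.normal(size=(4, 50))  # e.g. another group of names
        print(weat_score(X, Y, A, B))  # close to 0 for random vectors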

    Leap2Trend: A Temporal Word Embedding Approach for Instant Detection of Emerging Scientific Trends

    Early detection of emerging research trends could potentially revolutionise the way research is done. For this reason, trend analysis has become an area of paramount importance in academia and industry, owing to its significant implications for research funding and public policy. The literature presents several emerging approaches to detecting new research trends. Most of these approaches rely mainly on citation counting. While citations have been widely used as indicators of emerging research topics, they suffer from some limitations. For instance, citations can take months to years to accumulate before revealing trends. Furthermore, they fail to dig into paper content. To overcome this problem, we introduce Leap2Trend, a novel approach to instant detection of research trends. Leap2Trend relies on temporal word embeddings (word2vec) to track the dynamics of similarities between pairs of keywords, their rankings, and respective uprankings (ascents) over time. We applied Leap2Trend to two scientific corpora on different research areas, namely computer science and bioinformatics, and we evaluated it against two gold standards: Google Trends hits and Google Scholar citations. The obtained results reveal the effectiveness of our approach in detecting trends, with more than 80% accuracy and 90% precision in some cases. Such significant findings evidence the utility of our Leap2Trend approach for tracking and detecting emerging research trends instantly.
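
    The ranking-and-upranking step lends itself to a short sketch; the keyword pairs and similarity values below are invented for illustration, not results from the paper:

        # Similarity of each keyword pair at two epochs (invented values).
        sim_by_epoch = {
            2012: {("deep_learning", "gpu"): 0.31, ("svm", "kernel"): 0.72,
                   ("cnn", "vision"): 0.40},
            2014: {("deep_learning", "gpu"): 0.65, ("svm", "kernel"): 0.70,
                   ("cnn", "vision"): 0.48},
        }

        def ranks(sims):
            # Map each pair to its rank (1 = most similar pair).
            ordered = sorted(sims, key=sims.get, reverse=True)
            return {pair: i + 1 for i, pair in enumerate(ordered)}

        r_old, r_new = ranks(sim_by_epoch[2012]), ranks(sim_by_epoch[2014])
        for pair in r_old:
            jump = r_old[pair] - r_new[pair]  # positive = moved up the ranking
            print(pair, "upranking:", jump)

    A pair that leaps up the ranking between epochs is flagged as a candidate emerging trend.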

    Managing credit risk and the cost of equity with machine learning techniques

    Credit risk and the cost of equity can influence market participants' activities in many ways. In-depth analysis can help participants reduce potential costs and devise profitable strategies. This kind of study is usually conducted with conventional statistical models built on researchers' knowledge. However, with the advancement of technology, a massive amount of financial data, increasing in volume, subjectivity, and heterogeneity, has become challenging to process conventionally. Machine learning (ML) techniques have been utilised to handle this difficulty in real-life applications. This PhD thesis consists of three major empirical essays. We employ state-of-the-art machine learning techniques to predict peer-to-peer (P2P) lending default risk, P2P lending decisions, and the effects of Environmental, Social, and Corporate Governance (ESG) factors on firms' cost of equity. In the era of financial technology, P2P lending has gained considerable attention among academics and market participants. In the first essay (Chapter 2), we investigate the determinants of P2P lending default prediction in relation to borrowers' characteristics and credit history. Applying machine learning techniques, we document substantial predictive ability compared with the benchmark logit model. Further, we find that LightGBM has superior predictive power and outperforms all other models in all out-of-sample predictions. Finally, we offer insights into different levels of uncertainty in P2P loan groups and the value of machine learning in credit risk mitigation for P2P loan providers. The macroeconomic impact on funding decisions or lending standards reflects the risk-taking behaviour of market participants and has been widely discussed by academics, but in the era of financial technology there is a gap in the evidence on how lending standards change in a FinTech nonbank financial organisation. The second essay (Chapter 3) aims to fill this gap by introducing loan-level and macroeconomic variables into predictive models to estimate the P2P loan funding decision. Over 12 million empirical instances are studied, while big data techniques, including text mining and five state-of-the-art approaches, are utilised. We note that macroeconomic conditions affect individual risk-taking and reaching-for-yield behaviour. Finally, we offer insight into the macroeconomic impact in terms of different levels of uncertainty in different P2P loan application groups. In the third essay (Chapter 4), we use up-to-date machine learning techniques to provide new evidence for the impact of ESG on the cost of equity. Using 15,229 firm-year observations from 51 different countries over the past 18 years, we document negative causal effects on the cost of equity. In addition, we uncover non-linear effects, as the effect of ESG on the cost of equity decreases as ESG performance improves. Furthermore, by breaking down our sample, we note heterogeneity in ESG effects across regions. Finally, we find that global crises change the sensitivity of the cost of equity towards ESG, and that this change varies across regions.
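
    For the first essay's model family, a minimal hedged sketch of a LightGBM default classifier, assuming the lightgbm and scikit-learn libraries; the synthetic data stands in for borrower characteristics and credit history, not the thesis's actual dataset:

        from lightgbm import LGBMClassifier
        from sklearn.datasets import make_classification
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        # Synthetic, imbalanced data (roughly 10% defaults), as loan
        # default data typically is.
        X, y = make_classification(n_samples=2000, n_features=20,
                                   weights=[0.9], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)

        model = LGBMClassifier(n_estimators=200, learning_rate=0.05,
                               random_state=0)
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print("out-of-sample AUC:", auc)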

    Gaining Insight into Determinants of Physical Activity using Bayesian Network Learning

    BNAIC/BeneLearn 2020

    On the role of Computational Logic in Data Science: representing, learning, reasoning, and explaining knowledge

    In this thesis we discuss in what ways computational logic (CL) and data science (DS) can jointly contribute to the management of knowledge within the scope of modern and future artificial intelligence (AI), and how technically sound software technologies can be realised along the path. An agent-oriented mindset permeates the whole discussion, stressing the pivotal role of autonomous agents in exploiting both means to reach higher degrees of intelligence. Accordingly, the goals of this thesis are manifold. First, we elicit the analogies and differences between CL and DS, looking for possible synergies and complementarities along four major knowledge-related dimensions, namely representation, acquisition (a.k.a. learning), inference (a.k.a. reasoning), and explanation. In this regard, we propose a conceptual framework through which bridges between these disciplines can be described and designed. We then survey the current state of the art of AI technologies with respect to their capability to support bridging CL and DS in practice. After identifying gaps and opportunities, we propose the notion of the logic ecosystem as a new conceptual, architectural, and technological solution supporting the incremental integration of symbolic and sub-symbolic AI. Finally, we discuss how our notion of the logic ecosystem can be reified into actual software technology and extended towards many DS-related directions.

    Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference
