15 research outputs found

    Vec2Dynamics: A Temporal Word Embedding Approach to Exploring the Dynamics of Scientific Keywords—Machine Learning as a Case Study

    The dynamics and progress of science have been widely explored through descriptive and statistical analyses. With the rise of data science and the development of increasingly powerful computers, this line of study has also attracted a number of computational approaches, collectively labelled the Computational History of Science. Among these, several works have studied dynamism in scientific literature using text-analysis techniques that rely on topic models to trace the dynamics of research topics. Unlike topic models, which do not delve deeply into the content of scientific publications, this paper is the first to use temporal word embeddings to automatically track the dynamics of scientific keywords over time. To this end, we propose Vec2Dynamics, a neural-based computational history approach that reports the stability of the k-nearest neighbors of scientific keywords over time; this stability indicates whether a keyword is acquiring a new neighborhood as the scientific literature evolves. To evaluate how Vec2Dynamics models such relationships in the domain of Machine Learning (ML), we constructed scientific corpora from the papers published at the Neural Information Processing Systems (NIPS, now NeurIPS) conference between 1987 and 2016. The descriptive analysis performed in this paper verifies the efficacy of the proposed approach: we found a generally strong consistency between the obtained results and the Machine Learning timeline.
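The neighborhood-stability idea can be sketched as follows. The toy embeddings, the keyword names, and the `knn_stability` helper are illustrative assumptions, not the paper's actual implementation, which trains temporal word embeddings on the NIPS corpora.

```python
import numpy as np

def top_k_neighbors(word, embeddings, k=2):
    """Return the k nearest neighbors of `word` by cosine similarity."""
    target = embeddings[word]
    scores = {}
    for other, vec in embeddings.items():
        if other == word:
            continue
        scores[other] = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def knn_stability(word, emb_t1, emb_t2, k=2):
    """Jaccard overlap of the word's k-NN sets in two time slices."""
    n1 = top_k_neighbors(word, emb_t1, k)
    n2 = top_k_neighbors(word, emb_t2, k)
    return len(n1 & n2) / len(n1 | n2)

# Toy embeddings for two time periods (illustrative only).
emb_1990 = {"svm": np.array([1.0, 0.1]), "kernel": np.array([0.9, 0.2]),
            "margin": np.array([0.8, 0.3]), "neural": np.array([0.1, 1.0])}
emb_2015 = {"svm": np.array([1.0, 0.1]), "kernel": np.array([0.9, 0.2]),
            "margin": np.array([0.2, 0.9]), "neural": np.array([0.8, 0.3])}

# Low overlap signals that "svm" is acquiring a new neighborhood.
print(knn_stability("svm", emb_1990, emb_2015, k=2))
```

A stability near 1 means the keyword's semantic neighborhood is unchanged between periods; a value near 0 flags a shift worth inspecting against the field's timeline.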

    Leveraging Temporal Word Embeddings for the Detection of Scientific Trends

    Tracking the dynamics of science and early detection of emerging research trends could potentially revolutionise the way research is done. For this reason, the computational history of science and trend analysis have become important areas in academia and industry, owing to their significant implications for research funding and public policy. The literature presents several emerging approaches to detecting new research trends, most of which rely mainly on citation counting. While citations have been widely used as indicators of emerging research topics, they pose several limitations. Most importantly, citations can take months or even years to accumulate and reveal trends; furthermore, they fail to dig into the paper's content. To overcome this problem, this thesis leverages a natural language processing method, namely temporal word embeddings, that learns semantic and syntactic relations among words over time. The principal objective of this method is to study the change in similarity between pairs of scientific keywords over time, which helps to track the dynamics of science and detect emerging scientific trends. To this end, the thesis first proposes a methodological approach to tuning the hyper-parameters of word2vec, the word embedding technique used throughout, on scientific text. It then provides a suite of novel approaches that perform the computational history of science by detecting emerging scientific trends and tracking the dynamics of science. Trend detection is performed by two approaches, Hist2vec and Leap2Trend, devoted respectively to detecting converging keywords and contextualising keywords. The dynamics of science, in turn, are tracked by Vec2Dynamics, which follows the evolution of the semantic neighborhood of keywords over time.
All of the proposed approaches have been applied to the area of machine learning and validated against different gold standards. The obtained results reveal the effectiveness of the proposed approaches for detecting trends and tracking the dynamics of science. More specifically, Hist2vec correlates perfectly with citation counts (100% positive Spearman correlation), Leap2Trend detects emerging trends with more than 80% accuracy and 90% precision, and Vec2Dynamics shows great potential to trace the history of the machine learning literature much as the machine learning timeline does. These findings evidence the utility of the proposed approaches for performing the computational history of science.
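The core signal behind convergence detection, the change in pairwise cosine similarity across yearly embedding spaces, can be sketched as below. The yearly vectors, the keyword pair, and the `min_rise` threshold are illustrative assumptions, not the thesis's trained models.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_series(pair, embeddings_by_year):
    """Cosine similarity of a keyword pair in each yearly embedding space."""
    w1, w2 = pair
    return [cosine(emb[w1], emb[w2]) for _, emb in sorted(embeddings_by_year.items())]

def is_converging(series, min_rise=0.2):
    """Flag a pair whose similarity rises by at least `min_rise` overall."""
    return series[-1] - series[0] >= min_rise

# Toy yearly embeddings (illustrative): "deep" and "learning" drift together.
years = {
    2006: {"deep": np.array([1.0, 0.0]), "learning": np.array([0.0, 1.0])},
    2010: {"deep": np.array([1.0, 0.4]), "learning": np.array([0.4, 1.0])},
    2014: {"deep": np.array([1.0, 0.9]), "learning": np.array([0.9, 1.0])},
}
series = similarity_series(("deep", "learning"), years)
print(is_converging(series))  # the pair's similarity rises over time
```

A steadily rising series marks a converging keyword pair, the kind of signal that would surface before citation counts can accumulate.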

    A Sociology of Gab: A Computational Analysis of a Far-Right Social Network

    This dissertation examines the racial discourse circulated on Gab, a microblogging and social networking platform used by the far right to proliferate hate speech, and how far-right discourse has evolved on the platform. Gab was created in 2016 in response to mainstream social media's increased content moderation and deplatforming of extremist users to curtail hate speech and harassment. The platform gained a substantial number of new users after the Charlottesville rally of 2017. In this thesis, I examine the creation of Gab, as an alternative social media platform, as a strategic site of socio-technical innovation, as well as the important part far-right discourse on Gab plays in the asymmetric polarization of the media ecosystem. This project asks: how has Gab developed, and what discourses about race circulate on it? To answer these questions, I draw on a large dataset of digital trace data covering the entire Gab platform, approximately 10 million posts from 2016 to 2019, which constitutes an archive of far-right activity on an important alternative social media platform. I use computational methodologies, including structural topic modeling, word embeddings, and qualitative analysis, to examine the materials and form conclusions about the impacts of asymmetric polarization of the far right on social media. I argue that Gab occupies an important position in the social media ecosystem in the context of mainstream platforms' post-Charlottesville deplatforming and the absence of legislation regulating hate speech. Gab, as an opportunistic innovation, is emblematic of an alternative social media ecosystem that flourishes online, drawing in users who are rejecting, and rejected by, mainstream social media platforms and searching for platforms with looser content moderation that reflect their absolutist freedom-of-speech stance.
This environment is conducive to asymmetric polarization, spreading anti-liberal, anti-mainstream-media, anti-Semitic, and anti-immigration discourses. These ideological strands are not new but consistent with earlier articulations of the ideology of white supremacy, merely taking place on a newer platform that may itself enable new variations of the same theme.

    Detecting Cryptojacking Web Threats: An Approach with Autoencoders and Deep Dense Neural Networks

    With the growing popularity of cryptocurrencies, now an important part of day-to-day transactions over the Internet, the so-called cryptomining service has attracted investors who wish to earn quick profits by computing transactional records for the blockchain network. Since most users cannot afford specialized or standardized mining hardware, new techniques have been developed to make mining easier and to minimize the computational cost required. Developers of large cryptocurrency houses have made available executable binaries and, mainly, browser-side scripts that tap into users' collective resources to complete proof-of-work puzzles. However, malicious actors have taken advantage of this capability to insert malicious scripts and mine illegally, without the user's knowledge. This cyber-attack, known as cryptojacking, is stealthy and difficult to analyze; solutions based on anti-malware extensions, blocklists, or disabling JavaScript, among others, are therefore not sufficient for accurate detection, leaving a gap in multi-layer security mechanisms. Although the state of the art offers alternative solutions, mainly based on machine learning techniques, one important open issue is the correct characterization of network and host samples in the face of ever-escalating tampering and obfuscation techniques. This paper develops a method that fingerprints possibly malicious sites, which are then characterized by an autoencoding algorithm that preserves the best information from the infection traces, thus maximizing the classification power of a deep dense neural network.
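The autoencoding stage, compressing high-dimensional traffic features into a small latent code that preserves the most informative structure, can be sketched with a minimal linear autoencoder trained by gradient descent. The random "traffic feature" matrix and the network sizes are assumptions for illustration; the paper's actual features come from fingerprinting real sites, and its classifier is a deep dense network trained on the codes, omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "traffic feature" matrix: 200 samples x 8 features (an illustrative
# stand-in for real network/host fingerprints, which this sketch does not model).
X = rng.normal(size=(200, 8))
X[:, 4:] = X[:, :4] @ rng.normal(size=(4, 4)) * 0.5  # correlated features to compress

n_in, n_hidden = X.shape[1], 3
W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))

def reconstruction_loss():
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

loss_before = reconstruction_loss()
lr = 0.01
for _ in range(500):
    Z = X @ W_enc                # latent codes: the compressed representation
    err = Z @ W_dec - X          # reconstruction error
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)
loss_after = reconstruction_loss()

print(loss_after < loss_before)  # training reduces reconstruction error
```

After training, the 3-dimensional codes `X @ W_enc` would feed the downstream dense classifier; in practice a nonlinear, multi-layer autoencoder plays this role.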

    Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet

    Words have different meanings (i.e., senses) depending on the context, and disambiguating the correct sense is an important and challenging task in natural language processing. An intuitive approach is to select the sense whose definition has the highest similarity to the context, using the definitions provided by WordNet, a large lexical database of English in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting words that overlap between the context and the sense definitions, which must match exactly. Instead, similarity should be computed from how words are related: by representing the context and sense definitions in a vector space model and analyzing the distributional semantic relationships among them, for instance with latent semantic analysis (LSA). However, as a corpus grows, LSA consumes much more memory and does not scale flexibly to training on huge text collections; a word-embedding approach has the advantage here. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space through either the skip-gram or the continuous bag-of-words (CBOW) model, and it captures semantic and syntactic word similarities from a huge corpus more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense definition vectors, then scores each word sense by the cosine similarity between those sentence vectors. The sense definitions are also expanded with sense relations retrieved from WordNet. If a score does not exceed a specific threshold, it is combined with the probability of that sense's distribution learned from SEMCOR, a large sense-tagged corpus. The possible answer senses are those with high scores.
Our method achieves 50.9% accuracy (48.7% without the sense-distribution probability), which is higher than the baselines (the original, simplified, adapted, and LSA Lesk) and outperforms many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
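The scoring scheme, averaging word vectors into sentence vectors and picking the sense definition closest to the context by cosine similarity, can be sketched as follows. The three-dimensional toy vectors and the two "bank" senses are assumptions for illustration; the paper uses Word2vec vectors and WordNet glosses.

```python
import numpy as np

# Toy word vectors (illustrative; the method uses Word2vec trained on a large corpus).
vectors = {
    "money":   np.array([1.0, 0.1, 0.0]),
    "deposit": np.array([0.9, 0.2, 0.1]),
    "river":   np.array([0.0, 0.1, 1.0]),
    "water":   np.array([0.1, 0.0, 0.9]),
    "land":    np.array([0.1, 0.2, 0.8]),
}

def sentence_vector(words):
    """Average the vectors of known words to represent a sentence."""
    known = [vectors[w] for w in words if w in vectors]
    return np.mean(known, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def disambiguate(context, sense_definitions):
    """Pick the sense whose definition vector is closest to the context vector."""
    ctx = sentence_vector(context)
    scores = {sense: cosine(ctx, sentence_vector(defn))
              for sense, defn in sense_definitions.items()}
    return max(scores, key=scores.get)

senses = {
    "bank.n.01": ["land", "river", "water"],  # sloping land beside water
    "bank.n.02": ["money", "deposit"],        # financial institution
}
print(disambiguate(["deposit", "money"], senses))  # → bank.n.02
```

In the full method the definition word lists are expanded with related senses from WordNet, and low-confidence scores are backed off to SEMCOR sense frequencies.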

    Study on open science: The general state of the play in Open Science principles and practices at European life sciences institutes

    Open science is currently a hot topic at all levels and one of the priorities of the European Research Area. Components commonly associated with open science are open access, open data, open methodology, open source, open peer review, open science policies, and citizen science. Open science may have great potential to connect and influence the practices of researchers, funding institutions, and the public. In this paper, we evaluate the level of openness, based on public surveys, at four European life sciences institutes.

    The Palgrave Handbook of Digital Russia Studies

    This open access handbook presents a multidisciplinary and multifaceted perspective on how the ‘digital’ is simultaneously changing Russia and the research methods scholars use to study Russia. It provides a critical update on how Russian society, politics, economy, and culture are reconfigured in the context of ubiquitous connectivity and accounts for the political and societal responses to digitalization. In addition, it answers practical and methodological questions in handling Russian data and a wide array of digital methods. The volume makes a timely intervention in our understanding of the changing field of Russian Studies and is an essential guide for scholars, advanced undergraduate and graduate students studying Russia today.
