5 research outputs found

    Vec2Dynamics: A Temporal Word Embedding Approach to Exploring the Dynamics of Scientific Keywords—Machine Learning as a Case Study

    The study of the dynamics or the progress of science has been widely explored with descriptive and statistical analyses. This study has also attracted several computational approaches that are labelled together as the Computational History of Science, especially with the rise of data science and the development of increasingly powerful computers. Among these approaches, some works have studied dynamism in scientific literature by employing text analysis techniques that rely on topic models to study the dynamics of research topics. Unlike topic models, which do not delve deeper into the content of scientific publications, this paper uses, for the first time, temporal word embeddings to automatically track the dynamics of scientific keywords over time. To this end, we propose Vec2Dynamics, a neural-based computational history approach that reports the stability of the k-nearest neighbors of scientific keywords over time; the stability indicates whether the keywords are taking on new neighborhoods due to the evolution of scientific literature. To evaluate how Vec2Dynamics models such relationships in the domain of Machine Learning (ML), we constructed scientific corpora from the papers published in the Neural Information Processing Systems (NIPS; now abbreviated NeurIPS) conference between 1987 and 2016. The descriptive analysis that we performed in this paper verifies the efficacy of our proposed approach. In fact, we found a generally strong consistency between the obtained results and the Machine Learning timeline.
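    The neighborhood-stability idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how stability is measured, so the Jaccard overlap used here is an assumption, and the function names and toy two-dimensional vectors are hypothetical. Neighbors are ranked by cosine similarity within each time window, and the two neighbor sets are then compared.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn(embeddings, keyword, k):
    """Return the set of the k words closest to `keyword` in one time window."""
    target = embeddings[keyword]
    others = [w for w in embeddings if w != keyword]
    others.sort(key=lambda w: cosine(target, embeddings[w]), reverse=True)
    return set(others[:k])


def neighborhood_stability(emb_t0, emb_t1, keyword, k):
    """Jaccard overlap of the keyword's k-NN sets in two windows.

    Assumption: Jaccard overlap stands in for the stability measure,
    which the abstract does not specify. 1.0 means identical neighbors,
    0.0 means a completely new neighborhood.
    """
    a = knn(emb_t0, keyword, k)
    b = knn(emb_t1, keyword, k)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical toy embeddings for two time windows.
window_a = {
    "kernel": [1.0, 0.0],
    "svm": [0.9, 0.1],
    "margin": [0.8, 0.2],
    "neuron": [0.0, 1.0],
    "layer": [0.1, 0.9],
}
window_b = dict(window_a)
window_b["kernel"] = [0.0, 1.0]  # the keyword has drifted toward other terms

stability = neighborhood_stability(window_a, window_b, "kernel", k=2)
```

    Because neighbors are computed inside each window separately, this sketch does not need the two embedding spaces to be aligned with each other.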

    Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet

    Words have different meanings (i.e., senses) depending on the context. Disambiguating the correct sense is an important and challenging task for natural language processing. An intuitive way is to select the sense with the highest similarity between the context and the sense definitions provided by a large lexical database of English, WordNet. In this database, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting overlapping words between the context and sense definitions, which must match exactly. Similarity should instead be computed based on how words are related rather than on overlap, by representing the context and sense definitions in a vector space model and analyzing distributional semantic relationships among them using latent semantic analysis (LSA). As a text corpus grows, LSA consumes much more memory and is not flexible enough to train on a huge corpus. A word-embedding approach has an advantage here. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space model through either the skip-gram or the continuous bag-of-words (CBOW) model. Word2vec also captures semantic and syntactic word similarities from a huge corpus of text more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense definition vectors, then gives each word sense a score by using cosine similarity to compute the similarity between those sentence vectors. The sense definition is also expanded with sense relations retrieved from WordNet. If the score is not higher than a specific threshold, the score is combined with the probability of that sense distribution learned from a large sense-tagged corpus, SEMCOR. The possible answer senses are those with high scores.
Our method shows that the result (50.9% or 48.7% without the probability of sense distribution) is higher than the baselines (i.e., original, simplified, adapted and LSA Lesk) and outperforms many unsupervised systems participating in the SENSEVAL-3 English lexical sample task.
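    The scoring step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: sentence vectors are built by averaging word vectors, the score is combined with the sense prior by simple addition, and the sense labels, toy vectors, and default threshold are all hypothetical. The abstract does not specify any of these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding.

    Assumption: averaging is one common way to build a sentence vector;
    the abstract does not say how the vectors are composed.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def score_senses(context, sense_definitions, word_vectors,
                 sense_prior=None, threshold=0.5):
    """Score each candidate sense against the context sentence."""
    ctx = sentence_vector(context, word_vectors)
    scores = {}
    for sense, definition in sense_definitions.items():
        vec = sentence_vector(definition, word_vectors)
        score = cosine(ctx, vec) if ctx and vec else 0.0
        # Assumed combination rule: add the prior when the score is low.
        if sense_prior is not None and score <= threshold:
            score += sense_prior.get(sense, 0.0)
        scores[sense] = score
    return scores


# Hypothetical toy word vectors and sense definitions for "bank".
word_vectors = {
    "money": [1.0, 0.0],
    "deposit": [0.9, 0.1],
    "loan": [0.8, 0.2],
    "river": [0.0, 1.0],
    "water": [0.1, 0.9],
}
senses = {
    "bank_finance": ["money", "loan"],
    "bank_river": ["river", "water"],
}
scores = score_senses(["deposit", "money", "loan"], senses, word_vectors)
best_sense = max(scores, key=scores.get)
```

    In practice the word vectors would come from a trained Word2vec model and the definitions from WordNet glosses expanded with related senses, as the abstract describes.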

    Detecting Cryptojacking Web Threats: An Approach with Autoencoders and Deep Dense Neural Networks

    With the growing popularity of cryptocurrencies, which are an important part of day-to-day transactions over the Internet, the interest in being part of the so-called cryptomining service has attracted the attention of investors who wish to quickly earn profits by computing powerful transactional records towards the blockchain network. Since most users cannot afford the cost of specialized or standardized hardware for mining purposes, new techniques have been developed to make mining easier, minimizing the computational cost required. Developers of large cryptocurrency houses have made available executable binaries and mainly browser-side scripts in order to authoritatively tap into users’ collective resources and effectively complete the calculation of puzzles to complete a proof of work. However, malicious actors have taken advantage of this capability to insert malicious scripts and illegally mine data without the user’s knowledge. This cyber-attack, also known as cryptojacking, is stealthy and difficult to analyze, so solutions based on anti-malware extensions, blocklists, and JavaScript disabling, among others, are not sufficient for accurate detection, creating a gap in multi-layer security mechanisms. Although alternative solutions exist in the state of the art, mainly using machine learning techniques, one of the important issues still to be solved is the correct characterization of network and host samples in the face of the increasing escalation of new tampering or obfuscation techniques. This paper develops a method that performs a fingerprinting technique to detect possible malicious sites, which are then characterized by an autoencoding algorithm that preserves the best information of the infection traces, thus maximizing the classification power by means of a deep dense neural network.

    A Sociology of Gab: A Computational Analysis of a Far-Right Social Network

    This dissertation examines the racial discourse circulated on Gab, a microblogging and social networking platform, by the far right to proliferate hate speech, and how the far-right discourse has evolved on the platform. Gab was created in 2016 in response to mainstream social media’s increase of content moderation and deplatforming of extremist users to curtail hate speech and harassment. The platform gained a substantial number of new users after the Charlottesville incident of 2017. In this thesis, I examine the creation of Gab, as an alternative social media platform, as a strategic site of socio-technical innovation, as well as the important part far-right discourse on Gab plays in the asymmetric polarization phenomenon of the media ecosystem. This project asks: How has Gab developed and what discourses about race circulate on Gab? To answer these questions, I draw on a large dataset of digital trace data covering the entire Gab platform, approximately 10 million posts from 2016 to 2019. This constitutes an archive of far-right activities on an important alternative social media platform. I use computational methodologies including structural topic modeling, word embeddings and qualitative analysis to examine the materials, and form conclusions about the impacts of asymmetric polarization of the far right on social media. I argue that Gab occupies an important position in the social media ecosystem in the context of mainstream social media platforms’ deplatforming post-Charlottesville, and the absence of legislation that regulates hate speech. Gab, as an opportunistic innovation, is emblematic of an alternative social media ecosystem that flourishes online, drawing in users who are rejecting, and rejected by, mainstream social media platforms and searching for platforms with looser content moderation that reflect their absolutist freedom-of-speech stance.
This environment is conducive to asymmetric polarization, spreading anti-liberal, anti-mainstream media, anti-Semitic, and anti-immigration discourses. These ideological strings are not new but consistent with earlier articulations of the ideology of white supremacy, just taking place on a newer platform that itself may enable new variations of the same theme.