
    Methods of Disambiguating and De-anonymizing Authorship in Large Scale Operational Data

    Operational data from software development, social networks, and other domains are often contaminated with incorrect or missing values. Examples include misspelled or changed names, multiple emails belonging to the same person, and user profiles that vary across systems. Such digital traces are extensively used in research and practice to study collaborating communities of various kinds. To achieve a realistic representation of the networks that represent these communities, accurate identities are essential. In this work, we aim to identify, model, and correct identity errors in data from open-source software repositories, which include more than 23M developer IDs and nearly 1B Git commits (developer activity records). Our investigation into the nature and prevalence of identity errors in software activity data reveals that they differ from, and occur at much higher rates than, errors in other domains. Existing techniques relying on string comparisons can only disambiguate Synonyms, but not Homonyms, which are common in software activity traces. Therefore, we introduce measures of behavioral fingerprinting to improve the accuracy of Synonym resolution and to disambiguate Homonyms. Fingerprints are constructed from the traces of developers' activities, such as the style of writing in commit messages, the patterns in files modified and projects participated in, and the patterns in the timing of the developers' activity. Furthermore, to address the lack of training data necessary for the supervised learning approaches used in disambiguation, we design an active learning procedure that minimizes the manual effort needed to create training data for developer identity matching. We extensively evaluate the proposed approach on over 16,000 OpenStack developers in 1,200 projects, against commercial tools and the most recent research approaches, and further against recent research on a much larger sample of over 2,000,000 IDs. The results demonstrate that our method significantly outperforms both the recent research and the commercial methods. We also conduct experiments demonstrating that such erroneous data have a significant impact on developer networks. We hope that the proposed approach will expedite research progress in software engineering, especially in applications for which graphs of social networks are critical.
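    To make the Synonym/Homonym distinction concrete, here is a minimal Python sketch, not the paper's implementation: it pairs a string similarity (for Synonyms) with one illustrative behavioral fingerprint, a commit-hour histogram (the paper's fingerprints also cover commit-message style, files, and projects). All feature choices and thresholds below are assumptions for illustration.

    from difflib import SequenceMatcher
    import math

    def name_similarity(a, b):
        """String-based score: catches Synonyms such as 'J. Smith' vs 'John Smith'."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def hour_fingerprint(commit_hours):
        """L2-normalized 24-bin histogram of commit hours: one simple behavioral trait."""
        hist = [0.0] * 24
        for h in commit_hours:
            hist[h % 24] += 1.0
        norm = math.sqrt(sum(v * v for v in hist)) or 1.0
        return [v / norm for v in hist]

    def fingerprint_similarity(hours_a, hours_b):
        """Cosine similarity between two activity-hour fingerprints."""
        fa, fb = hour_fingerprint(hours_a), hour_fingerprint(hours_b)
        return sum(x * y for x, y in zip(fa, fb))

    def same_developer(id_a, id_b, name_t=0.75, behav_t=0.7):
        """Merge two IDs only if name AND behavior agree (thresholds are illustrative guesses)."""
        return (name_similarity(id_a["name"], id_b["name"]) >= name_t
                and fingerprint_similarity(id_a["hours"], id_b["hours"]) >= behav_t)

    a = {"name": "J. Smith", "hours": [9, 10, 10, 11, 14, 15]}
    b = {"name": "John Smith", "hours": [9, 10, 11, 11, 14, 16]}
    print(same_developer(a, b))  # True: similar spelling and similar working hours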

    Deriving Verb Predicates By Clustering Verbs with Arguments

    Hand-built verb clusters such as the widely used Levin classes (Levin, 1993) have proved useful, but have limited coverage. Verb classes automatically induced from corpus data, such as those from VerbKB (Wijaya, 2016), on the other hand, can give clusters with much larger coverage and can be adapted to specific corpora such as Twitter. We present a method for clustering the outputs of VerbKB: verbs with their multiple argument types, e.g. "marry(person, person)", "feel(person, emotion)". We make use of a novel low-dimensional embedding of verbs and their arguments to produce high-quality clusters in which the same verb can be in different clusters depending on its argument type. The resulting verb clusters do a better job than hand-built clusters at predicting sarcasm, sentiment, and locus of control in tweets.
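    As an illustration of the clustering setup (not VerbKB's actual learned low-dimensional embedding), the sketch below represents each typed predicate by sparse verb and argument-type features and clusters them with K-means, so the same verb can land in different clusters depending on its argument signature; the toy predicates and feature weights are assumptions.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.cluster import KMeans

    predicates = [
        ("marry", ("person", "person")),
        ("divorce", ("person", "person")),
        ("feel", ("person", "emotion")),
        ("sense", ("person", "emotion")),
        ("feel", ("person", "texture")),   # same verb, different argument type
        ("touch", ("person", "texture")),
    ]

    def features(verb, args):
        # Weight argument types above the verb so typed senses separate;
        # the 2.0 weight is an arbitrary choice for this toy example.
        feats = {f"verb={verb}": 1.0}
        for i, arg in enumerate(args):
            feats[f"arg{i}={arg}"] = 2.0
        return feats

    X = DictVectorizer().fit_transform([features(v, a) for v, a in predicates])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    for (verb, args), c in sorted(zip(predicates, labels), key=lambda x: x[1]):
        print(f"cluster {c}: {verb}({', '.join(args)})")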

    Sentiment analysis in SemEval: a review of sentiment identification approaches

    Social media platforms are becoming the foundations of social interactions, including messaging and opinion expression. In this regard, sentiment analysis techniques focus on providing solutions to ensure the retrieval and analysis of user-generated data, including sentiments, emotions, and discussed topics. International competitions such as the International Workshop on Semantic Evaluation (SemEval) have attracted many researchers and practitioners with a special research interest in building sentiment analysis systems. In our work, we study the top-ranking systems of each SemEval edition during the 2013-2021 period; a total of 658 teams participated in these editions, with increasing interest over the years. We analyze the proposed systems, marking the evolution of research trends with a focus on the main components of sentiment analysis systems: data acquisition, preprocessing, and classification. Our study shows an active use of preprocessing techniques, an evolution of feature engineering and word representation from lexicon-based approaches to word embeddings, and the dominance of neural networks and transformers in the classification phase, fostering the use of ready-to-use models. Moreover, we provide researchers with insights based on the systems studied, which will allow rapid prototyping of new systems and help practitioners build entries for future SemEval editions.
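    A minimal sketch of the pre-neural recipe the survey traces (preprocessing, feature engineering, classification), using a TF-IDF plus logistic regression pipeline; the tiny dataset and normalization rules are placeholders, not taken from any actual SemEval system.

    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def preprocess(tweet):
        """Typical Twitter normalization: lowercase, mask URLs and @mentions."""
        tweet = tweet.lower()
        tweet = re.sub(r"https?://\S+", "<url>", tweet)
        return re.sub(r"@\w+", "<user>", tweet)

    train_texts = ["I love this phone!", "Worst service ever @support",
                   "Great game today", "This movie was terrible"]
    train_labels = ["positive", "negative", "positive", "negative"]

    model = make_pipeline(
        TfidfVectorizer(preprocessor=preprocess, ngram_range=(1, 2)),
        LogisticRegression(),
    )
    model.fit(train_texts, train_labels)
    print(model.predict(["I love the new update"]))  # likely ['positive']: 'love' is a learned cue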

    Social Media Text Normalization Using Word2vec, Levenshtein Distance, and Jaro-Winkler Distance

    Most internet users in Indonesia use social media to obtain information regularly. However, the high frequency of social media use is accompanied by the informal, non-standard spelling that users adopt in social media content to ease communication. This informal spelling not only hinders other social media users but also complicates the automated processing of social media content data, commonly called Natural Language Processing. Previous research proposed the word2vec concept, which has proved able to learn vector representations of words relatively quickly from fairly large datasets, along with text correction/normalization solutions based on the Levenshtein (edit) distance and Jaro-Winkler distance algorithms. Of the word2vec models trained, the best result was the 8th model, with an accuracy of 25%; the parameter that most strongly determined the training process was the learning algorithm. For the data correction samples tested, the best accuracy was 79.56%, obtained with a threshold of 70%.
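    A sketch of the correction step under stated assumptions: a hand-made dictionary stands in for the word2vec-derived vocabulary used in the research, Jaro-Winkler similarity is implemented from its standard definition, and the 0.7 threshold mirrors the 70% threshold reported as best.

    def jaro(s1, s2):
        """Jaro similarity, implemented from the standard definition."""
        if s1 == s2:
            return 1.0
        len1, len2 = len(s1), len(s2)
        if not len1 or not len2:
            return 0.0
        window = max(len1, len2) // 2 - 1
        match1, match2 = [False] * len1, [False] * len2
        matches = 0
        for i, c in enumerate(s1):  # count characters matching within the window
            for j in range(max(0, i - window), min(len2, i + window + 1)):
                if not match2[j] and s2[j] == c:
                    match1[i] = match2[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        k = transpositions = 0  # matched characters out of order, counted pairwise
        for i in range(len1):
            if match1[i]:
                while not match2[k]:
                    k += 1
                if s1[i] != s2[k]:
                    transpositions += 1
                k += 1
        t = transpositions // 2
        return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

    def jaro_winkler(s1, s2, p=0.1):
        """Jaro-Winkler: boost the Jaro score for a shared prefix (up to 4 chars)."""
        j = jaro(s1, s2)
        prefix = 0
        for a, b in zip(s1, s2):
            if a != b or prefix == 4:
                break
            prefix += 1
        return j + prefix * p * (1 - j)

    def normalize(token, dictionary, threshold=0.7):
        """Replace a token with its best dictionary match above the threshold."""
        best, score = token, threshold
        for word in dictionary:
            s = jaro_winkler(token, word)
            if s > score:
                best, score = word, s
        return best

    dictionary = ["informasi", "pemerintah", "media", "sosial"]
    for token in ["informsi", "pemrintah", "sosial", "xyz"]:
        print(token, "->", normalize(token, dictionary))  # "xyz" stays unmatched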

    A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models

    Word representation has always been an important research area in the history of natural language processing (NLP). Understanding such complex text data is imperative, given that it is rich in information and can be used widely across various applications. In this survey, we explore different word representation models and their power of expression, from the classical models to the modern state-of-the-art word representation language models (LMs). We describe a variety of text representation methods and model designs that have blossomed in the context of NLP, including SOTA LMs. These models can transform large volumes of text into effective vector representations that capture the underlying semantic information, and such representations can in turn be used by various machine learning (ML) algorithms for a variety of NLP-related tasks. Finally, this survey briefly discusses the commonly used ML- and DL-based classifiers, evaluation metrics, and the applications of these word embeddings in different NLP tasks.
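    As a toy illustration of the classical-to-dense spectrum the survey covers, the sketch below contrasts a sparse TF-IDF representation with a dense one obtained via LSA (truncated SVD), used here merely as a stand-in for learned embedding models; the documents are placeholders.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the cat sat on the mat", "the dog sat on the log",
            "stocks fell on weak earnings", "markets rallied on strong earnings"]

    sparse = TfidfVectorizer().fit_transform(docs)  # classical: one dimension per word
    dense = TruncatedSVD(n_components=2, random_state=0).fit_transform(sparse)

    print(sparse.shape)    # (4, vocabulary size): high-dimensional and sparse
    print(dense.round(2))  # (4, 2): compact vectors that group related documents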

    The Effect of Preprocessing on Short Document Clustering

    Natural Language Processing has become a common tool for extracting relevant information from unstructured data. Messages on social media, customer reviews, and military messages are all very short and therefore harder to handle than longer texts. Document clustering is essential for gaining insight from these unlabeled texts and is typically performed after some preprocessing steps. Preprocessing often removes words, which can become risky in short texts, where the main message is carried by only a few words. This paper therefore analyzes the effect of preprocessing and feature extraction on short documents. Six different levels of text normalization are combined with four different feature extraction methods. These settings are all applied to K-means clustering and tested on three different datasets. The anticipated results could not be confirmed; however, other findings are insightful with regard to the connection between text cleaning and feature extraction.
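    A toy version of such an experimental grid (the paper's six normalization levels, four feature extractors, and three datasets are not reproduced here): cross preprocessing variants with feature extractors, cluster with K-means, and compare against known labels via the Adjusted Rand Index.

    import re
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    docs = ["Engine failure on Route 9!!", "engine broke down again",
            "Great pizza in town :)", "best pizza place ever"]
    labels = [0, 0, 1, 1]  # known topics, used only for evaluation

    levels = {
        "none": lambda t: t,
        "lowercase": str.lower,
        "lower+no_punct": lambda t: re.sub(r"[^\w\s]", "", t.lower()),
    }
    extractors = {"counts": CountVectorizer, "tfidf": TfidfVectorizer}

    for lvl, prep in levels.items():
        for name, Vec in extractors.items():
            # lowercase=False so the normalization level, not the vectorizer,
            # controls the casing
            X = Vec(lowercase=False).fit_transform([prep(d) for d in docs])
            pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
            print(f"{lvl:15s} {name:7s} ARI={adjusted_rand_score(labels, pred):.2f}")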

    Real-time road traffic events detection and geo-parsing

    In the 21st century, there is an increasing number of vehicles on the road combined with a limited road infrastructure. These aspects culminate in daily challenges for the average commuter due to congestion and slow-moving traffic. In the United States alone, this costs the average driver $1,200 every year in fuel and time. Some positive steps, including (a) the introduction of push notification systems and (b) the deployment of more law enforcement troops, have been taken toward better traffic management. However, these methods have limitations and require extensive planning. Another way to deal with traffic problems is to track congested areas in a city using social media and re-route law enforcement resources to these areas on a real-time basis. Given the ever-increasing number of smartphone devices, social media can be used as a source of information to track traffic-related incidents. Social media sites allow users to share their opinions and information. Platforms like Twitter, Facebook, and Instagram are very popular among users and enable them to share whatever they want in the form of text and images; Facebook users alone generate millions of posts per minute. On these platforms, abundant data, including news, trends, events, opinions, and product reviews, are generated on a daily basis. Worldwide, organizations use social media for marketing purposes, but this data can also be used to analyze traffic-related events like congestion, construction work, and slow-moving traffic. The motivation behind this research is thus to use social media posts to extract information relevant to traffic, with effective and proactive traffic administration as the primary focus. I propose an intuitive two-step process for using Twitter posts to retrieve traffic-related information on a real-time basis: a text classifier first retains only the posts that contain traffic information, and a Part-Of-Speech (POS) tagger then finds the geolocation information. A prototype of the proposed system is implemented using a distributed microservices architecture.
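    A compact sketch of the two-step idea, with placeholder training data and a capitalized-phrase heuristic standing in for the POS-tagger-based geo-parsing step; neither is the thesis implementation.

    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Step 1: a binary text classifier that keeps only traffic-related posts.
    train = ["heavy traffic on Meridian Street", "accident near Exit 12, avoid!",
             "lovely weather today", "nothing much on TV tonight"]
    y = ["traffic", "traffic", "other", "other"]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(train, y)

    # Step 2: location extraction; capitalized phrases after a spatial
    # preposition stand in for the POS-based geo-parsing.
    def extract_locations(tweet):
        return [m.strip() for m in
                re.findall(r"\b(?:on|at|near)\s+((?:[A-Z]\w+\s?)+)", tweet)]

    stream = ["Heavy traffic on Washington Boulevard right now",
              "Concert tickets on sale now!"]
    for tweet in stream:  # only posts flagged as traffic are geo-parsed
        if clf.predict([tweet])[0] == "traffic":
            print(tweet, "->", extract_locations(tweet))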