
    Methods of Disambiguating and De-anonymizing Authorship in Large Scale Operational Data

    Operational data from software development, social networks, and other domains are often contaminated with incorrect or missing values. Examples include misspelled or changed names, multiple emails belonging to the same person, and user profiles that vary across systems. Such digital traces are extensively used in research and practice to study collaborating communities of various kinds. To achieve a realistic representation of the networks that represent these communities, accurate identities are essential. In this work, we aim to identify, model, and correct identity errors in data from open-source software repositories, which include more than 23M developer IDs and nearly 1B Git commits (developer activity records). Our investigation into the nature and prevalence of identity errors in software activity data reveals that they differ from, and occur at much higher rates than, errors in other domains. Existing techniques relying on string comparisons can only disambiguate Synonyms, not Homonyms, which are common in software activity traces. Therefore, we introduce measures of behavioral fingerprinting to improve the accuracy of Synonym resolution and to disambiguate Homonyms. Fingerprints are constructed from the traces of developers' activities, such as the style of writing in commit messages, the patterns in files modified and projects participated in by developers, and the patterns related to the timing of the developers' activity. Furthermore, to address the lack of training data necessary for the supervised learning approaches used in disambiguation, we design an active learning procedure that minimizes the manual effort needed to create training data in the domain of developer identity matching. We extensively evaluate the proposed approach on over 16,000 OpenStack developers in 1,200 projects against commercial tools and the most recent research approaches, and further against recent research on a much larger sample of over 2,000,000 IDs.
Results demonstrate that our method is significantly better than both the recent research and commercial methods. We also conduct experiments demonstrating that such erroneous data have a significant impact on developer networks. We hope that the proposed approach will expedite research progress in software engineering, especially in applications for which graphs of social networks are critical.
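The behavioral-fingerprint idea described above can be illustrated with a minimal sketch: combine a string-similarity signal (which resolves Synonyms) with an activity-timing fingerprint (which separates Homonyms). This is an illustrative toy under assumed inputs, not the paper's method; the field names, weights, and threshold are all assumptions.

```python
from difflib import SequenceMatcher
from math import sqrt

def string_sim(a, b):
    """Name/email similarity -- the Synonym signal."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def timing_fingerprint(commit_hours):
    """24-bin histogram of commit hours, normalized to unit length."""
    hist = [0.0] * 24
    for h in commit_hours:
        hist[h % 24] += 1.0
    norm = sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))

def same_developer(id_a, id_b, w_string=0.5, w_timing=0.5, threshold=0.7):
    """Combine string and behavioral evidence into one match score."""
    s = string_sim(id_a["name"], id_b["name"])
    t = cosine(timing_fingerprint(id_a["hours"]),
               timing_fingerprint(id_b["hours"]))
    return w_string * s + w_timing * t >= threshold
```

Two IDs with similar names and overlapping commit hours match; two identically named IDs whose activity falls in disjoint hours do not, which is the Homonym case that string comparison alone cannot catch.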

    Reading Polish with Czech Eyes: Distance and Surprisal in Quantitative, Qualitative, and Error Analyses of Intelligibility

    In CHAPTER I, I first introduce the thesis in the context of the project workflow (section 1). I then summarise the methods and findings from the project publications about the languages in focus, and introduce the relevant concepts and terminology viewed in the literature as possible predictors of intercomprehension and processing difficulty. CHAPTER II presents a quantitative (section 4) and a qualitative (section 5) analysis of the results of the cooperative translation experiments. The focus of this thesis – the language pair PL-CS – is explained and the hypotheses are introduced in section 6. The experiment website is introduced in section 7, with an overview of the participants, the different experiments conducted, and the sections in which they are discussed. In CHAPTER IV, free translation experiments are discussed in which two different sets of individual word stimuli were presented to Czech readers: (i) cognates that are transformable with regular PL-CS correspondences (section 12) and (ii) the 100 most frequent PL nouns (section 13). CHAPTER V presents the findings of experiments in which PL NPs in two different linearisation conditions were presented to Czech readers (sections 14.1-14.6). A short digression is made when I turn to experiments with PL internationalisms which were presented to German readers (section 14.7). CHAPTER VI discusses the methods and results of cloze translation experiments with highly predictable target words in sentential context (section 15) and in random context, with sentences from the cooperative translation experiments (section 16). A final synthesis of the findings, together with an outlook, is provided in CHAPTER VII.

    Textualism and Obstacle Preemption

    Commentators, both on the bench and in the academy, have perceived an inconsistency between the Supreme Court's trend, in recent decades, towards an increasingly formalist approach to statutory interpretation and the Court's continued willingness to find state laws preempted as obstacles to "the accomplishment and execution of the full purposes and objectives of Congress" -- so-called obstacle preemption. This Article argues that by giving the meaning contextually implied in a statutory text ordinary, operative legal force, we can justify most of the current scope of obstacle preemption based solely on theoretical moves textualism already is committed to making. The Article first sketches the history of both textualism and obstacle preemption, showing why the two doctrines seem so obviously to be in tension with one another. It then introduces the field of linguistic pragmatics - the study of context's role in determining meaning - paying special attention to the theory of scalar implicature, a framework that attempts to systematize our intuitions that we often say one thing but imply another. The Article then proceeds to apply this theory to the obstacle-preemption case law, contending that scalar implicature, properly adjusted to the legal context, can justify the result in most obstacle-preemption cases. Next, the Article argues that textualists are committed to accepting this justification of obstacle preemption because of two deep theoretical presuppositions of their theory. Finally, the Article closes by suggesting that this justification of obstacle preemption not only challenges widely shared assumptions about the inconsistency of textualism and one of the most common types of preemption; it also has the potential to reshape our understanding of both textualism and obstacle preemption.

    Sustainability Conversations for Impact: Transdisciplinarity on Four Scales

    Sustainability is a dynamic, multi-scale endeavor. Coherence can be lost between scales – from project teams, to organizations, to networks, and, most importantly, down to conversations. Sustainability researchers have embraced transdisciplinarity, as it is grounded in science, shared language, broad participation, and respect for difference. Yet transdisciplinarity at these four scales is not well-defined. In this dissertation I extend transdisciplinarity out from the project to networks and organizations, and down into conversation, adding novel lenses and quantitative approaches. In Chapter 2, I propose that transdisciplinarity incorporate academic disciplines which help cross scales: Organizational Learning, Knowledge Management, Applied Cooperation, and Data Science. In Chapter 3 I then use a mixed-method approach to study a transdisciplinary organization, the Maine Aquaculture Hub, as it develops strategy. Using social network analysis and conversation analytics, I evaluate how the Hub's network-convening, strategic thinking, and conversation practices turn organization-scale transdisciplinarity into strategic advantage. In Chapters 4 and 5, conversation is the nexus of transdisciplinarity. I study seven public aquaculture lease scoping meetings (informal town halls) and classify conversation activity by "discussion discipline," i.e., rhetorical and social intent. I compute the relationship between discussion-discipline proportions and three sustainability outcomes: intent-to-act, options-generation, and relationship-building. I consider exogenous factors such as signaling, gender balance, timing, and location. I show that where inquiry is high, so is innovation. Where acknowledgement is high, so is intent-to-act. Where respect is high, so is relationship-building. Indirectness and sarcasm dampen outcomes. I propose seven interventions to improve sustainability conversation capacity, such as nudging, networks, and using empirical models.
Chapter 5 explores those empirical models: I use natural language processing (NLP) to detect the discussion disciplines by training a model on the previously coded transcripts. I then use that model to classify 591 open-source conversation transcripts and regress the sustainability outcomes, per transcript, on discussion-discipline proportions. I show that all three conversation outcomes can be predicted by the discussion disciplines, with the most statistically significant being intent-to-act, which responds directly to acknowledgement and respect. Conversation AI is the next frontier of transdisciplinarity for sustainability solutions.
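The classify-then-regress pipeline described above can be sketched in miniature: tag each utterance with a discipline, compute per-transcript proportions, and fit a regression of an outcome on a proportion. The keyword classifier here is a hypothetical stand-in for the trained NLP model, and all names and rules are illustrative assumptions.

```python
def discipline_proportions(utterances, classify):
    """Fraction of a transcript's utterances in each discussion discipline."""
    counts = {}
    for u in utterances:
        d = classify(u)
        counts[d] = counts.get(d, 0) + 1
    total = len(utterances) or 1
    return {d: c / total for d, c in counts.items()}

def keyword_classify(utterance):
    """Toy stand-in for the trained classifier (hypothetical keyword rules)."""
    text = utterance.lower()
    if "?" in text:
        return "inquiry"
    if "thanks" in text or "good point" in text:
        return "acknowledgement"
    return "statement"

def simple_regression(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) or 1.0
    slope = cov / var
    return slope, my - slope * mx
```

In the dissertation's setting the predictors would be the per-transcript discipline proportions across 591 transcripts and the responses the coded sustainability outcomes; a positive slope corresponds to findings such as "where inquiry is high, so is innovation."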

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    Experiment and bias: the case of parsimony in comparative cognition

    Comparative cognition is the interdisciplinary field of animal cognition and behavior studies, which includes comparative psychology and branches of ethology, biology, and neuroscience. My dissertation shows that the quasi-epistemic value of parsimony plays a problematic role in the experimental setting of comparative cognition. More specifically, I argue that an idiosyncratic interpretation of the statistical hypothesis-testing method known as the Neyman-Pearson Method (NPM) embeds an Occamist parsimony preference into experimental methodology in comparative cognition, which results in an underattribution bias, or a bias in favor of allegedly simple cognitive ontologies. I trace this parsimony preference to the content of the null hypothesis within the NPM, and defend a strategy for modifying the NPM to guard against the underattribution bias. I recommend adopting an evidence-driven strategy for choosing the null hypothesis. Further, I suggest a role for non-empirical values, such as ethical concerns, in the weighting of Type I and Type II error rates. I contend that statistical models are deeply embedded in experimental practice and are not value-free. These models provide an often overlooked door through which values, both epistemic and non-epistemic, can enter scientific research. Since statistical models generally, and the NPM in particular, play a role in a wide variety of scientific disciplines, this dissertation can also be seen as a case study illustrating the importance of attending to the choice of a particular statistical model. This conclusion suggests that various philosophical investigations of scientific practice - from inquiry into the nature of scientific evidence to analysis of the role of values in science - would be greatly enriched by increased attention to experimental methodology, including the choice and interpretation of statistical models.
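How the null hypothesis and the choice of significance level shape attribution can be illustrated with a minimal one-sided binomial test. This is an illustrative sketch, not the dissertation's analysis: the null is the conventional "performance at chance," and the point is that the same data license or withhold a cognitive attribution depending solely on how strictly alpha (the Type I error rate) is set.

```python
from math import comb

def binom_p_value(successes, n, p0=0.5):
    """One-sided p-value: P(X >= successes) under the null rate p0."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))

def attribute_capacity(successes, n, alpha):
    """Reject the null 'performance at chance' when the p-value falls below alpha."""
    return binom_p_value(successes, n) < alpha
```

For 15 successes in 20 trials the p-value is about 0.021, so the capacity is attributed at alpha = 0.05 but not at alpha = 0.01: tightening alpha lowers Type I error at the cost of more Type II errors (missed capacities), which is exactly where the underattribution bias can enter.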

    Data Science and Knowledge Discovery

    Data Science (DS) is gaining significant importance in the decision process because it draws on a mix of areas, including Computer Science, Machine Learning, Math and Statistics, domain/business knowledge, software development, and traditional research. In the business field, applying DS allows the use of scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data to support the decision process. After collecting the data, it is crucial to discover the knowledge. In this step, Knowledge Discovery (KD) tasks are used to create knowledge from structured and unstructured sources (e.g., text, data, and images). The output needs to be in a readable and interpretable format, and it must represent knowledge in a manner that facilitates inferencing. KD is applied in several areas, such as education, health, accounting, energy, and public administration. This book includes fourteen excellent articles which discuss this trending topic and present innovative solutions to show the importance of Data Science and Knowledge Discovery to researchers, managers, industry, society, and other communities. The chapters address several topics, such as Data Mining, Deep Learning, Data Visualization and Analytics, Semantic Data, Geospatial and Spatio-Temporal Data, Data Augmentation, and Text Mining.

    Mathematical linguistics

    Get PDF
    but in fact this is still an early draft, version 0.56, August 1 2001. Please d