2,644 research outputs found

    Deep Clustering for Data Cleaning and Integration

    Get PDF
    Deep Learning (DL) techniques now constitute the state-of-theart for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the potential of DC for data management tasks remains unexplored. In this paper, we address this gap by investigating the suitability of DC for data cleaning and integration tasks, specifically schema inference, entity resolution and domain discovery, from the perspective of tables, rows and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. Experiments also show consistently strong performance compared with state-of-the-art bespoke algorithms for each of the data integration tasks

    Optimizing digital archiving: An artificial intelligence approach for OCR error correction

    Get PDF
    Project Work presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business AnalyticsThis thesis research scopes the knowledge gap for effective ways to address OCR errors and the importance to have training datasets adequated size and quality, to promote digital documents OCR recognition efficiency. The main goal is to examine the effects regarding the following dimensions of sourcing data: input size vs performance vs time efficiency, and to propose a new design that includes a machine translation model, to automate the errors correction caused by OCR scan. The study implemented various LSTM, with different thresholds, to recover errors generated by OCR systems. However, the results did not overcomed the performance of existing OCR systems, due to dataset size limitations, a step further was achieved. A relationship between performance and input size was established, providing meaningful insights for future digital archiving systems optimisation. This dissertation creates a new approach, to deal with OCR problems and implementation considerations, that can be further followed, to optimise digital archive systems efficiency and results

    How Do We Learn What We Cannot Say?

    Full text link
    The contributions of this thesis are two-fold. First, this thesis presents UDTube, an easily usable software developed to perform morphological analysis in a multi-task fashion. This work shows the strong performance of UDTube versus the current state-of-the-art, UDPipe, across eight languages, primarily in the annotation of morphological features. The second contribution of this thesis is a exploration into the study of defectivity. UDTube is used to annotate a large amount of data in Greek and Russian which is ultimately used to investigate the plausibility of Indirect Negative Evidence (INE), a popular approach to the acquisition of morphological defectivity. The reported findings raise a challenge to INE

    Essays on Corporate Disclosure of Value Creation

    Get PDF
    Information on a firm’s business model helps investors understand an entity’s resource requirements, priorities for action, and prospects (FASB, 2001, pp. 14-15; IASB, 2010, p. 12). Disclosures of strategy and business model (SBM) are therefore considered a central element of effective annual report commentary (Guillaume, 2018; IIRC, 2011). By applying natural language processing techniques, I explore what SBM disclosures look like when management are pressed to say something, analyse determinants of cross-sectional variation in SBM reporting properties, and assess whether and how managers respond to regulatory interventions seeking to promote SBM annual report commentary. This dissertation contains three main chapters. Chapter 2 presents a systematic review of the academic literature on non-financial reporting and the emerging literature on SBM reporting. Here, I also introduce my institutional setting. Chapter 3 and Chapter 4 form the empirical sections of this thesis. In Chapter 3, I construct the first large sample corpus of SBM annual report commentary and provide the first systematic analysis of the properties of such disclosures. My topic modelling analysis rejects the hypothesis that such disclosure is merely padding; instead finding themes align with popular strategy frameworks and management tailor the mix of SBM topics to reflect their unique approach to value creation. However, SBM commentary is less specific, less precise about time horizon (short- and long-term), and less balanced (more positive) in tone relative to general management commentary. My findings suggest symbolic compliance and legitimisation characterize the typical annual report discussion of SBM. Further analysis identifies proprietary cost considerations and obfuscation incentives as key determinants of symbolic reporting. In Chapter 4, I seek evidence on how managers respond to regulatory mandates by adapting the properties of disclosure and investigate whether the form of the mandate matters. Using a differences-in-differences research design, my results suggest a modest incremental response by treatment firms to the introduction of a comply or explain provision to provide disclosure on strategy and business model. In contrast, I find a substantial response to enacting the same requirements in law. My analysis provides clear and consistent evidence that treatment firms incrementally increase the volume of SBM disclosure, improve coverage across a broad range of topics as well as providing commentary with greater focus on the long term. My results point to substantial changes in SBM reporting properties following regulatory mandates, but the form of the mandate does matter. Overall, this dissertation contributes to the accounting literature by examining how firms discuss a central topic to economic decision making in annual reports and how firms respond to different forms of disclosure mandate. Furthermore, the results of my analysis are likely to be of value for regulators and policymakers currently reviewing or considering mandating disclosure requirements. By examining how companies adapt their reporting to different types of regulations, this study provides an empirical basis for recalibrating SBM disclosure mandates, thereby enhancing the information set of capital market participants and promoting stakeholder engagement in a landscape increasingly shaped by non-financial information

    Patterns and Variation in English Language Discourse

    Get PDF
    The publication is reviewed post-conference proceedings from the international 9th Brno Conference on Linguistics Studies in English, held on 16–17 September 2021 and organised by the Faculty of Education, Masaryk University in Brno. The papers revolve around the themes of patterns and variation in specialised discourses (namely the media, academic, business, tourism, educational and learner discourses), effective interaction between the addressor and addressees and the current trends and development in specialised discourses. The principal methodological perspectives are the comparative approach involving discourses in English and another language, critical and corpus analysis, as well as identification of pragmatic strategies and appropriate rhetorical means. The authors of papers are researchers from the Czech Republic, Italy, Luxembourg, Serbia and Georgia

    FIREBALL: A Dataset of Dungeons and Dragons Actual-Play with Structured Game State Information

    Full text link
    Dungeons & Dragons (D&D) is a tabletop roleplaying game with complex natural language interactions between players and hidden state information. Recent work has shown that large language models (LLMs) that have access to state information can generate higher quality game turns than LLMs that use dialog history alone. However, previous work used game state information that was heuristically created and was not a true gold standard game state. We present FIREBALL, a large dataset containing nearly 25,000 unique sessions from real D\&D gameplay on Discord with true game state info. We recorded game play sessions of players who used the Avrae bot, which was developed to aid people in playing D&D online, capturing language, game commands and underlying game state information. We demonstrate that FIREBALL can improve natural language generation (NLG) by using Avrae state information, improving both automated metrics and human judgments of quality. Additionally, we show that LLMs can generate executable Avrae commands, particularly after finetuning.Comment: 21 pages, 2 figures. Accepted at ACL 202

    The Role of Preprocessing for Word Representation Learning in Affective Tasks

    Get PDF
    Affective tasks, including sentiment analysis, emotion classification, and sarcasm detection have drawn a lot of attention in recent years due to a broad range of useful applications in various domains. The main goal of affect detection tasks is to recognize states such as mood, sentiment, and emotions from textual data (e.g., news articles or product reviews). Despite the importance of utilizing preprocessing steps in different stages (i.e., word representation learning and building a classification model) of affect detection tasks, this topic has not been studied well. To that end, we explore whether applying various preprocessing methods (stemming, lemmatization, stopword removal, punctuation removal and so on) and their combinations in different stages of the affect detection pipeline can improve the model performance. The are many preprocessing approaches that can be utilized in affect detection tasks. However, their influence on the final performance depends on the type of preprocessing and the stages that they are applied. Moreover, the preprocessing impacts vary across different affective tasks. Our analysis provides thorough insights into how preprocessing steps can be applied in building an effect detection pipeline and their respective influence on performance

    How complex is professional academic writing? A corpus-based analysis of research articles in 'hard' and 'soft' disciplines

    Get PDF
    This study focuses on the analysis of linguistic complexity in professional academic writing in light of the empirical evidence provided by a 1,597,000-word corpus of ‘hard’ (life and physical sciences) and ‘soft’ (arts and social) scientific research articles published in leading peer-review journals. Specifically, this investigation aims both to describe the complexity features of texts written by professional authors and to test the hypothesis that linguistic complexity varies across disciplines. Since previous studies have revealed that automatic complexity indices do not sufficiently succeed in providing a comprehensive description of complexity of texts, in this paper complexity has been measured in two ways: quantitatively through the indexes provided by Lu’s (2010) L2 Syntactic Complexity Analyser, and through the more qualitative analysis of a selection of metrics associated with clausal and phrasal complexity in seminal studies. The data show, first, that syntactic complexity indices (basically, strategies of coordination and subordination) are statistically relevant to the characterisation of specifically the soft-science disciplines; second, that there is a continuum across subdisciplines within the broad distinction of soft versus hard genres; and, third, that the soft genre demonstrates a more stable productivity of clausal-complexity strategies, while phrasal-complexity features are more pervasive in the hard-science subcorpus.Ministerio de Ciencia e Innovación | Ref. PID2020-117541GB-I00Xunta de Galicia | Ref. ED431C 2021/5

    Referring to discourse participants in Ibero-Romance languages

    Get PDF
    Synopsis: This volume brings together contributions by researchers focusing on personal pronouns in Ibero-Romance languages, going beyond the well-established variable of expressed vs. non-expressed subjects. While factors such as agreement morphology, topic shift and contrast or emphasis have been argued to account for variable subject expression, several corpus studies on Ibero-Romance languages have shown that the expression of subject pronouns goes beyond these traditionally established factors and is also subject to considerable dialectal variation. One of the factors affecting choice and expression of personal pronouns or other referential devices is whether the construction is used personally or impersonally. The use and emergence of new impersonal constructions, eventually also new (im)personal pronouns, as well as the variation found in the expression of human impersonality in different Ibero-Romance language varieties is another interesting research area that has gained ground in the recent years. In addition to variable subject expression, similar methods and theoretical approaches have been applied to study the expression of objects. Finally, the reference to the addressee(s) using different address pronouns and other address forms is an important field of study that is closely connected to the variable expression of pronouns. The present book sheds light on all these aspects of reference to discourse participants. The volume contains contributions with a strong empirical background and various methods and both written and spoken corpus data from Ibero-Romance languages. The focus on discourse participants highlights the special properties of first and second person referents and the factors affecting them that are often different from the anaphoric third person. The chapters are organized into three thematic sections: (i) Variable expression of subjects and objects, (ii) Between personal and impersonal, and (iii) Reference to the addressee
    corecore