45 research outputs found

    Discovery of sensitive data with natural language processing

    The process of protecting sensitive data is continually growing in importance, especially as a result of the directives and laws imposed by the European Union. The effort to create automatic systems is continuous, but in most cases the underlying processes are still manual or semi-automatic. In this work, we developed a component that can extract and classify sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and comply with legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques, such as rule-based/lexicon-based models, machine learning algorithms, and neural networks. The rule-based and lexicon-based approaches were used only for a set of specific classes. For the remaining classes of entities, the SpaCy and Stanford NLP tools were tested, two statistical models (Conditional Random Fields and Random Forest) were implemented and, finally, a Bidirectional LSTM approach was experimented with. The best results were achieved with the Stanford NER model (86.41%) from the Stanford NLP tool. Regarding the statistical models, we found that Conditional Random Fields obtains the best results, with an f1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.
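    As a rough illustration of the hybrid pipeline described above, the sketch below combines a rule-based layer for a few fixed-format sensitive classes with an off-the-shelf statistical NER model. The regex patterns, class names, and the pt_core_news_sm model choice are illustrative assumptions, not the configuration used in the work.

```python
# Minimal sketch of a hybrid sensitive-data recognizer (illustrative only):
# a rule-based layer handles fixed-format classes, and a statistical NER
# model covers the remaining entity types.
import re
import spacy  # pip install spacy; python -m spacy download pt_core_news_sm

# Hypothetical rule/lexicon layer: patterns for a few fixed-format classes.
RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b9\d{8}\b"),             # Portuguese mobile numbers
    "POSTAL_CODE": re.compile(r"\b\d{4}-\d{3}\b"),  # Portuguese postal codes
}

nlp = spacy.load("pt_core_news_sm")  # statistical model for remaining classes

def extract_sensitive(text):
    """Return (span_text, label) pairs found by both layers."""
    entities = []
    for label, pattern in RULES.items():  # rule-based layer first
        entities += [(m.group(), label) for m in pattern.finditer(text)]
    doc = nlp(text)                       # statistical layer for the rest
    entities += [(ent.text, ent.label_) for ent in doc.ents]
    return entities

print(extract_sensitive("A Maria Silva (maria@exemplo.pt, 912345678) mora no Porto."))
```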

    Universal grammar and second language acquisition

    This monograph has one purpose, namely: to investigate how second language teaching praxis can be integrated into Noam Chomsky's linguistic and philosophical theoretical background. However, this work is not an exhaustive treatment of every question involved; rather, it is a succinct presentation of questions, possible solutions, and data, whose aim is to provide different insights concerning the purpose of this investigation. In this context, this text is a contribution to the debate about how humans acquire language. In this sense, studying SLA will shed light on how L2 works and how it is interwoven with the functioning of the mind, the brain, and teaching.

    Head-driven machine translation

    Despite initial optimism about the feasibility of Machine Translation, it is now accepted as being an extremely difficult task to implement. This is due in part to our lack of understanding of the human processes involved in language comprehension and production in general, and translation in particular. In addition, the myriad of problems posed by ambiguities caused by structural differences, category options, etc., which in most cases are resolved subconsciously by humans, have slowed down the development of a Fully Automatic, High-Quality Machine Translation System, and have convinced many people that this goal is completely unattainable. This thesis is an investigation of the suitability of Head-Driven Phrase Structure Grammar (HPSG, Pollard and Sag, 1987, 1994) for use in a transfer-based translation environment. It provides an account of some of the problems tackled by such a system, as well as the reasons behind the decisions to choose HPSG and a transfer approach. Moreover, some of the possible inadequacies of HPSG's current semantic framework are addressed and some potential alternatives are suggested, namely the incorporation of case grammars and semantic features to guide lexical selection in the target language. The evaluation of these ideas is based on an implementation of these proposals in a system for translation between German and English, using the Attribute Logic Engine (ALE, Carpenter, 1992) for the purposes of monolingual analysis.
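    HPSG-style analysis rests on unification of feature structures, so a minimal sketch of that core operation may help make the approach concrete. This toy version unifies untyped attribute-value maps only; it is a simplification for illustration, not how ALE, which works with typed feature structures and inheritance, is actually implemented.

```python
# Minimal sketch of (untyped) feature-structure unification, the core
# operation behind HPSG-style grammars. Structures are nested dicts of
# features mapping to atoms or further structures.

FAIL = object()  # sentinel marking unification failure

def unify(a, b):
    """Unify two feature structures; return the merged structure or FAIL."""
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is FAIL:
                    return FAIL  # conflicting values: unification fails
                result[feature] = merged
            else:
                result[feature] = value  # feature only in b: just add it
        return result
    return FAIL  # differing atoms cannot unify

# Example: a verb's subject requirements unified with a candidate NP.
verb_subj = {"CAT": "np", "AGR": {"NUM": "sg", "PER": "3"}}
candidate = {"CAT": "np", "AGR": {"NUM": "sg"}, "CASE": "nom"}
print(unify(verb_subj, candidate))
# -> {'CAT': 'np', 'AGR': {'NUM': 'sg', 'PER': '3'}, 'CASE': 'nom'}
```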

    RST Signalling Corpus: A Corpus of Signals of Coherence Relations

    We present the RST Signalling Corpus (Das et al. in RST signalling corpus, LDC2015T10. https://catalog.ldc.upenn.edu/LDC2015T10, 2015), a corpus annotated for signals of coherence relations. The corpus is developed over the RST Discourse Treebank (Carlson et al. in RST Discourse Treebank, LDC2002T07. https://catalog.ldc.upenn.edu/LDC2002T07, 2002), which is annotated for coherence relations. In the RST Signalling Corpus, these relations are further annotated with signalling information. The corpus includes annotation not only for discourse markers, which are considered to be the most typical (or sometimes the only) type of signals in discourse, but also for a wide array of other signals, such as reference, lexical, semantic, syntactic, graphical and genre features, as potential indicators of coherence relations. We describe the research underlying the development of the corpus and the annotation process, and provide details of the corpus. We also present the results of an inter-annotator agreement study, illustrating the validity and reproducibility of the annotation. The corpus is available through the Linguistic Data Consortium, and can be used to investigate the psycholinguistic mechanisms behind the interpretation of relations through signalling, and also to develop discourse-specific computational systems such as discourse parsing applications.
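    To make the annotation scheme concrete, here is a minimal sketch of how a signalled coherence relation might be represented programmatically. The field names and signal labels are illustrative assumptions and do not reflect the corpus's actual release format.

```python
# Illustrative data structure for a coherence relation plus its signals.
from dataclasses import dataclass, field

@dataclass
class Signal:
    signal_class: str  # e.g. "discourse marker", "reference", "syntactic"
    signal_type: str   # finer-grained subtype within the class
    tokens: list       # surface tokens that act as the signal

@dataclass
class CoherenceRelation:
    relation: str      # relation label, e.g. "Cause", "Elaboration"
    nucleus: str       # nucleus text span
    satellite: str     # satellite text span
    signals: list = field(default_factory=list)

rel = CoherenceRelation(
    relation="Cause",
    nucleus="the market fell sharply",
    satellite="because interest rates rose",
    signals=[Signal("discourse marker", "subordinating conjunction", ["because"])],
)
print(rel.relation, [s.tokens for s in rel.signals])
```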

    Right Dislocation and Afterthought in German - Investigations on Multiple Levels

    When investigating the right sentence periphery in German, two constructions are encountered that appear rather similar at first glance: right dislocation and afterthought. Irrespective of this superficial similarity, right dislocation and afterthought can be distinguished at multiple levels of linguistic description. This thesis aims to provide a more nuanced understanding of right dislocation and afterthought through empirical investigations, both qualitative and quantitative in nature, employing analyses of experimentally acquired data as well as corpus analyses. It is shown that right dislocation and afterthought are best defined on the basis of the functions they take in discourse rather than on the basis of their prosodic realisations, and that their functional differences are reflected in a number of linguistic parameters, such as their morpho-syntactic constraints, their degree of syntactic integratedness, their prosodic features, and even their punctuation in written texts.

    The lexical classifier parameter & the L2 acquisition of Cantonese nominals.

    by Wai-Hoo Au Yeung. Thesis submitted in 1997. Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves i-v (3rd gp.)). Abstract also in Chinese.
    Contents: Acknowledgments; Abstract; Contents; Abbreviations & Tables
    Chapter 1. Introduction: 1.1. What is a parameter?; 1.2. Parameter resetting in SLA; 1.3. Parameter as feature checking; 1.4. Research goals and outline of the thesis
    Chapter 2. Parameterization in Chinese Nominals: 2.1. DP-analysis and its parameterization; 2.2. Evaluation of the four models of Chinese nominal structures; 2.3. Parameterization in Cantonese and Mandarin nominals; 2.4. The Lexical CL Parameter
    Chapter 3. Methodology: 3.1. The subject; 3.2. Timing of recording; 3.3. What is recorded; 3.4. The corpus; 3.5. Criteria of counting utterances and point of acquisition
    Chapter 4. The Acquisition of Cantonese Nominals: 4.1. Overall development of Cantonese nominal structure; 4.2. Acquisition of Cantonese-specific CLP properties; 4.3. Comparison with native Cantonese children's data; 4.4. Acquisition by resetting the Lexical CL Parameter
    Chapter 5. An Informal Experiment on Generic di and di-N Phrases: 5.1. Design of the experiment; 5.2. Materials; 5.3. Procedures and results; 5.4. Comparison between Ching's and native children's results
    Chapter 6. Conclusion: 6.1. Theoretical and acquisition findings; 6.2. Further evidence for parameter resetting; 6.3. Implications for future research; 6.4. Concluding remarks
    References
    Appendix A: 3 sample files of the corpus

    On the left periphery of Latin embedded clauses

    The main topic of the present thesis is word order in Latin embedded clauses. More specifically, it deals with a specific surface order in which one or more constituents are found in the left periphery of the embedded clause, to the left of a subordinating conjunction. This particular pattern is referred to as 'Left Edge Fronting', henceforth LEF. The theoretical framework used is the so-called 'cartographic' variety of generative grammar, which assumes that a richly articulated (functional) structure forms the syntactic backbone of clauses and noun phrases. The first chapter provides some background concerning the theoretical framework on the one hand and the 'discourse configurational' nature of Latin on the other. Chapter 2 focuses on the syntax of the particular subtype of embedded clauses that I investigate, namely adverbial clauses (ACs). Special attention is given to the distribution and availability of so-called Main Clause Phenomena in ACs. Chapter 3 gives an overview of the results of a large-scale corpus study on word order in ACs, in which texts from 180 BC to 120 AD were taken into account. These results reveal a quantitative left-right asymmetry: it is shown that LEF occurs most frequently in clause-initial ACs. Moreover, relative and demonstrative pronouns are exclusively found in an LEF position in clause-initial ACs. These two observations give rise to a distinction between two types of LEF: pronoun fronting in initial ACs (LEF1) and XP-fronting in both initial and final ACs (LEF2). The syntax of LEF1 is analyzed in chapters 4 (on relative pronouns) and 5 (on demonstratives): the phenomenon is characterized as a type of topicalization, which is derived in two steps. First, the pronoun undergoes 'internal movement' to the edge of the embedded clause. This step is followed by an operation of clausal pied-piping, targeting the left periphery of the superordinate clause. A derivation along these lines successfully explains the left-right asymmetry mentioned earlier. LEF2, on the other hand, is argued to be a type of non-contrastive focalization (chapter 6), which can occur in initial and final ACs alike. Chapter 7 focuses on the diachronic evolution of LEF2. The observed decline of this phenomenon is related to a change that took place in the same period, viz. the decreasing frequency of INFL-final clauses.
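    The quantitative left-right asymmetry reported above amounts to tallying LEF occurrences by clause position; a toy version of that kind of count is sketched below. The records and field layout are invented for illustration and are not the thesis's actual data.

```python
# Illustrative frequency count of LEF by adverbial-clause position.
from collections import Counter

# Each toy record: (clause position, whether the AC shows Left Edge Fronting)
clauses = [
    ("initial", True), ("initial", True), ("initial", False),
    ("final", False), ("final", True), ("final", False),
]

lef_counts = Counter(pos for pos, has_lef in clauses if has_lef)
totals = Counter(pos for pos, _ in clauses)
for pos in totals:
    print(f"{pos}: {lef_counts[pos]}/{totals[pos]} ACs show LEF")
```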

    Fuzzy Coherence : Making Sense of Continuity in Hypertext Narratives

    Hypertexts are digital texts characterized by interactive hyperlinking and a fragmented textual organization. Increasingly prominent since the early 1990s, hypertexts have become a common text type both on the Internet and in a variety of other digital contexts. Although studied widely in disciplines like hypertext theory and media studies, formal linguistic approaches to hypertext remain relatively rare. This study examines coherence negotiation in hypertext, with particular reference to hypertext fiction. Coherence, or the quality of making sense, is a fundamental property of textness. Proceeding from the premise that coherence is a subjectively evaluated property rather than an objective quality arising directly from textual cues, the study focuses on the processes through which readers interact with hyperlinks and negotiate continuity between hypertextual fragments. The study begins with a typological discussion of textuality and an overview of the historical and technological precedents of modern hypertexts. Then, making use of text linguistic, discourse analytical, pragmatic, and narratological approaches to textual coherence, the study takes established models developed for analyzing and describing conventional texts and examines their applicability to hypertext. Primary data derived from a collection of hyperfictions is used throughout to illustrate the mechanisms in practice. Hypertextual coherence negotiation is shown to require the ability to operate cognitively between local and global coherence by processing lexical cohesion, discourse topical continuities, inferences and implications, and shifting cognitive frames. The main conclusion of the study is that the style of reading required by hypertextuality fosters a new paradigm of coherence. Defined as fuzzy coherence, this new approach to textual sensemaking is predicated on an acceptance of the coherence challenges readers experience when the act of reading comes to involve repeated encounters with referentially imprecise hyperlinks and discourse topical shifts. A practical application of fuzzy coherence is shown to be in effect in the way coherence is actively manipulated in hypertext narratives.