683 research outputs found

    ETL4PROFILING: extending ETL4LOD to analyze datasets completeness – a DBpedia case study

    As the amount of data in the world grows, it is important to keep it accessible and usable as well as correct and reliable. In addition, FAIR principle R1 (Reuse) holds that data are easier to find and reuse when they carry many labels, and good data quality is essential for any repository that aims to support openness and reuse. This study therefore analyzes the current state of several datasets, with a special focus on DBpedia, an open project that serves as a central hub in the Linked Open Data Cloud. Although it holds more than six million structured data items and is widely used in research and machine learning, it contains many incomplete data and misclassified resources, which hinders its openness and use in external projects. The research is based on extending the ETL4LOD plugins to analyze different versions of DBpedia through their templates, producing a detailed profile of the data (Data Profiling). Among other findings, this analysis showed a completeness of 58.3% for Brazilian municipalities in DBpedia pt, compared with 97.3% for Japanese cities in DBpedia ja. In short, although DBpedia is important for linked data, it still contains incomplete data, especially in the Portuguese version, which need to be addressed so that the repository becomes more complete and consequently supports reuse in future research and projects.
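As a rough illustration of the kind of completeness figure reported above, the sketch below computes the share of expected template properties that actually hold a value across a set of resources. It is a minimal sketch under stated assumptions, not the ETL4LOD plugin code: the function `completeness`, the property names, and the sample records are all hypothetical.

```python
from typing import Iterable, Mapping, Sequence


def completeness(records: Iterable[Mapping[str, object]],
                 expected: Sequence[str]) -> float:
    """Return the percentage of expected property slots that hold a value."""
    filled = 0
    total = 0
    for record in records:
        for prop in expected:
            total += 1
            value = record.get(prop)
            # Missing keys, None, and blank strings all count as incomplete.
            if value is not None and str(value).strip() != "":
                filled += 1
    return 100.0 * filled / total if total else 0.0


# Hypothetical infobox-style records; these are not real DBpedia data.
EXPECTED = ["name", "population", "area"]
SAMPLE = [
    {"name": "A", "population": 1000, "area": 12.5},
    {"name": "B", "population": None, "area": ""},
]

if __name__ == "__main__":
    print(f"{completeness(SAMPLE, EXPECTED):.1f}%")
```

Counting slots rather than whole records is one possible design choice; a stricter variant could count a resource as complete only when every expected property is present.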


    Brazilian Portuguese (henceforth BP) has long been considered a null-subject language due to its variability with regard to subject expression (e.g. Era bom porque eu diminuía de peso... era muito gordinha ‘That was good because then I could lose some weight… (I) was a bit chubby.’ C33:179). Such variability has been attributed to the language's once rich inflectional system, and the reported increase in the rate of subject expression has been seen as a result of changes to the system (Barbosa, Duarte, & Kato, 2005; Monteiro, 1994b; Negrão & Viotti, 2000). Moreover, there is agreement among several scholars that the variability can still be accounted for in terms of traditional factors such as emphasis, clarity, and ambiguity of the Tense, Aspect, and Mood (TAM) system. In this work, I demonstrate that, rather than an effect of such pragmatic factors as these, subject expression in BP is to a large degree an artifact of the frequency of use of certain constructions of different degrees of fixedness. The analysis proposed here falls under the framework of usage-based linguistics, in which grammar is believed to be shaped by discourse as speakers produce it online (Bybee, 2006). Thus, any linguistic pattern observed in speech is emergent and a result of repetition (Bybee, 2006; Hopper, 1998). Therefore, it is believed that the patterns of subject expression found in the data are a result of the speaker's experiences with those patterns. The data used for the study are drawn from the corpus of oral Portuguese as spoken by educated speakers from Fortaleza (PORCUFORT) (Monteiro, 1994a). The analysis is based on 8066 tokens of 1sg, 2sg, and animate 3sg subjects culled from three different registers (Conversations, Interviews, and Lectures) across three different age groups (22-35, 36-50, and over 51).
These tokens are subjected to a number of multivariate analyses to identify the contexts that significantly contribute to the realization of pronominal subjects in these data. The methodology follows the tenets of the comparative method in variationist theory, in that comparison across the different subjects allows us to identify the contexts that contribute to the overall pattern of pronominal subjects. Moreover, this analysis also takes into account the role of frequency and constructions in shaping the grammar of speakers. These analyses and approaches yield two major findings: (1) the three persons behave very differently in their patterning with pronominal subjects, since different factor groups condition the realization of pronominal subjects and, within these factor groups, the factors show different directions of effect depending on the person; (2) high-frequency verbs and constructions also behave differently in their distribution with pronominal subjects. In fact, their behavior needs to be examined in isolation, because some show regular patterning with pronominal subjects while others are realized without pronominal subjects.

    On the nature of crosslinguistic influence: root infinitives revisited

    Root Infinitives (RI) in Spanish have an infinitival marker, while in English they are bare forms. For languages like English, the RI stage has been said to be longer and to have a higher incidence than in Spanish. Within Liceras, Bel, and Perales’ (2006) typology of an RI universal stage, Spanish is a [+Person (P), +Infinitival marker (R)] language while English is [−P, −R]. Our analysis of the English and Spanish RIs produced by English-Spanish bilingual children and English and Spanish monolingual children reveals no interfering influence from English into Spanish and no positive influence from Spanish into English, which suggests that the degree of lexical transparency of the [+P, +R] features of Spanish is not strong enough to trigger acceleration in overcoming the bilingual English RI stage. Funding: Junta de Castilla y León and ERDF (European Regional Development Fund) (research project support program, ref. VA009P17); Ministerio de Ciencia, Innovación y Universidades and ERDF (ref. PGC2018-097693-B-I00); Ministerio de Ciencia y Tecnología (HUM2007-62213) and ERDF (BFF2002-00442).

    Automatic Detection of Proverbs and their Variants

    This article presents the task of automatic detection of proverbs in Brazilian Portuguese, based on the intersection of the regular syntactic structure of proverbs and their core elements. We created finite-state automata that enabled us to look for these word combinations in running text. The rationale behind this method is that, although proverbs may have a normal sentence structure and often a very commonly used lexicon, their specific word combinations may enable us to identify them and their variants irrespective of the syntactic or structural changes the proverb may undergo. The goal of this task is to gather the largest number of proverbs and their variants. The results showed a precision of 60.15%
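The idea of spotting a proverb through its core word combination, whatever material is inserted between the core words, can be sketched as a very small automaton. This is a minimal sketch under assumptions, not the authors' automata: the function names, the `max_gap` parameter, and the choice of core words for the example proverb are hypothetical.

```python
import re
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def matches_core(tokens: Sequence[str], core: Sequence[str],
                 max_gap: int = 6) -> bool:
    """State i means the first i core words have been seen in order.

    Once the first core word is matched, at most `max_gap` other tokens may
    occur before the next core word; otherwise the automaton resets.
    """
    if not core:
        return False
    state = 0
    gap = 0
    for tok in tokens:
        if tok == core[state]:
            state += 1
            gap = 0
            if state == len(core):
                return True
        elif state > 0:
            gap += 1
            if gap > max_gap:
                # Too much intervening material: restart, possibly at this token.
                state = 1 if tok == core[0] else 0
                gap = 0
    return False


def contains_proverb(text: str, core: Sequence[str]) -> bool:
    """Check whether the core words of a proverb occur, in order, in the text."""
    return matches_core(tokenize(text), core)


# Hypothetical core elements for "Água mole em pedra dura tanto bate até que fura".
CORE = ["água", "pedra", "fura"]

if __name__ == "__main__":
    print(contains_proverb("Água mole, pedra dura, tanto bate que um dia fura", CORE))
    print(contains_proverb("A água do rio está fria", CORE))
```

A matcher this permissive will also fire on unrelated sentences that happen to contain the core words in order, which is one plausible reason precision matters as the reported metric.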

    Child Second Language Development in Immersion Education

    Language acquisition has been the subject of decades of research. Most previous research on second language acquisition has centered on adult learners, leaving child learners understudied by comparison. This book focuses on child second language development. The cross-sectional empirical study herein investigates the syntax-semantics interface in English-speaking children acquiring German and French as second languages. The author discusses variables such as crosslinguistic influence, the complexity of the learning tasks, cognitive maturity, and the learning context. By focusing on child second language acquisition in immersion education, this book not only substantially contributes to the field of second language acquisition but also offers important insights into teaching in an immersion context

    Semantic relations between sentences: from lexical to linguistically inspired semantic features and beyond

    This thesis is concerned with the identification of semantic equivalence between pairs of natural language sentences, by studying and computing models to address Natural Language Processing tasks where some form of semantic equivalence is assessed. In such tasks, given two sentences, our models output either a class label, corresponding to the semantic relation between the sentences, based on a predefined set of semantic relations, or a continuous score, corresponding to their similarity on a predefined scale. The former setup corresponds to the tasks of Paraphrase Identification and Natural Language Inference, while the latter corresponds to the task of Semantic Textual Similarity. We present several models for English and Portuguese, where various types of features are considered, for instance based on distances between alternative representations of each sentence, following lexical and semantic frameworks, or embeddings from pre-trained Bidirectional Encoder Representations from Transformers models. For English, a new set of semantic features is proposed, from the formal semantic representation of Discourse Representation Structure. In Portuguese, suitable corpora are scarce and formal semantic representations are unavailable, hence an evaluation of currently available features and corpora is conducted, following the modelling setup employed for English. Competitive results are achieved on all tasks, for both English and Portuguese, particularly when considering that our models are based on generally available tools and technologies, and that all features and models are suitable for computation in most modern computers, except for those based on embeddings. 
In particular, for English, our semantic features from DRS are able to improve the performance of other models when integrated into the feature set of such models, and state-of-the-art results are achieved for Portuguese with models based on fine-tuning embeddings to a specific task.
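To make the notion of lexical features computed from distances between sentence representations more concrete, the sketch below derives two simple similarity scores for a sentence pair. It is a minimal sketch under assumptions and not the thesis's feature set: the function names and the choice of Jaccard overlap and bag-of-words cosine are illustrative only.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokens(sentence: str) -> List[str]:
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def jaccard(a: str, b: str) -> float:
    """Overlap between the two token sets, from 0.0 to 1.0."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lexical_features(a: str, b: str) -> Dict[str, float]:
    """Bundle the scores into a feature dictionary for a downstream model."""
    return {"jaccard": jaccard(a, b), "cosine": cosine(a, b)}


if __name__ == "__main__":
    print(lexical_features("The cat sleeps", "The dog sleeps"))
```

Features of this kind would typically be fed to a classifier (for Paraphrase Identification or Natural Language Inference) or a regressor (for Semantic Textual Similarity); that downstream step is left out here.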


    The article analyzes the relationship between police violence and its repercussions in the narratives that Brazilian rappers present in their lyrics. The central argument is that such violence is neither a consequence nor a cause of the phenomenon, but a structural element in the dynamics of urban violence. Such violence is essential for establishing the “Us vs. Them” logic that materializes in the feuds between policemen and criminals expressed in the lyrics scrutinized here. Methodologically, the historical-discursive approach of Critical Discourse Analysis was employed. The analytical basis was the set of procedures suggested for text analysis, and the investigation focused on the lyrics of the rap band Facção Central. The recursive relation between criminals and policemen was highlighted, in which violence functions as a mediating instrument. Violence between these groups emerges as a solidary element, i.e., it is the social cement providing cohesion to the interaction between these groups

    Long distance agreement in Spanish

    Final project for the Master's in Cognitive Science and Language, Faculty of Philosophy, Universitat de Barcelona, academic year 2016-2017. Supervisors: Ángel J. Gallego & M. Lluïsa Hernan