
    Authorship attribution in Portuguese using character n-grams

    For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. For English, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and of various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, it demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches evaluated on the same corpus.
    Funding: Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)
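
    A minimal sketch of this kind of character n-gram classifier, assuming scikit-learn and placeholder training texts; the paper's specific n-gram typing, feature representations, and corpus are not reproduced here:

        # Character n-gram authorship attribution (illustrative sketch).
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        # Placeholder corpus: training texts with known authors.
        documents = ["primeiro texto de treino do autor A ...",
                     "primeiro texto de treino do autor B ..."]
        authors = ["A", "B"]

        # Character 3- and 4-grams within word boundaries; the paper tunes
        # the length and type of n-grams and the feature representation.
        model = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
            LinearSVC(),
        )
        model.fit(documents, authors)
        print(model.predict(["texto de autoria desconhecida ..."]))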

    Automatic identification of whole-part relations in Portuguese

    Master's dissertation, Language Sciences, Faculdade de Ciências Humanas e Sociais, Universidade do Algarve, 2014.
    In this work, we improve the extraction of semantic relations between textual elements as currently performed by STRING, a hybrid statistical and rule-based Natural Language Processing (NLP) chain for Portuguese, by targeting whole-part relations (meronymy), that is, a semantic relation between an entity that is perceived as a constituent part of another entity, or a member of a set. Here we focus on the type of meronymy involving human entities and body-part nouns (Nbp); e.g., O Pedro partiu uma perna 'Pedro broke a leg': WHOLE-PART(Pedro,perna). To extract this type of whole-part relation, a rule-based meronymy extraction module was built and integrated in the grammar of the STRING system. Around 17,000 Nbp instances were extracted from the first fragment of the CETEMPúblico corpus for the evaluation of this work. We also retrieved 79 instances of disease nouns (Nsick) derived from an underlying Nbp (e.g., gastrite-estômago 'gastritis-stomach'). To produce a gold standard for the evaluation, a stratified random sample of 1,000 sentences was selected, keeping the proportion of the total frequency of Nbp in the source corpus. This sample also includes a small number of Nsick (6 lemmas, 17 sentences). These instances were annotated by four native speakers of Portuguese; 100 sentences were given to all annotators, and the inter-annotator agreement computed on them was deemed "fair" to "good". Comparing the system's output against the gold standard, the results for Nbp show 0.57 precision, 0.38 recall, 0.46 F-measure, and 0.81 accuracy. The recall is relatively low (0.38), which can be explained by several factors, such as the fact that in many sentences the whole and the part are not syntactically related, and are sometimes quite distant from each other. The precision is somewhat better (0.57). The accuracy is relatively high (0.81), since there is a large number of true-negative cases. The results for Nsick, though the number of instances is small, show 0.50 precision, 0.11 recall, 0.17 F-measure, and 0.76 accuracy. A detailed error analysis was performed, some improvements were made, and a second evaluation of the system's performance was carried out. Precision improved by 0.13 (from 0.57 to 0.70), recall by 0.11 (from 0.38 to 0.49), F-measure by 0.12 (from 0.46 to 0.58), and accuracy by 0.04 (from 0.81 to 0.85). The results for Nsick remained the same. In short, this work may be considered a first attempt to extract whole-part relations involving human entities and Nbp in Portuguese. A rule-based module was built and integrated in the STRING system, and it was evaluated with promising results.
    Funding: Erasmus Mundus Action 2 2011-2574 Triple I - Integration, Interaction and Institutions scholarship
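
    As a toy illustration of the rule-based approach, assuming a hypothetical body-part lexicon and one regex-approximated rule (STRING's actual module operates on full syntactic analyses, not surface patterns):

        # Toy WHOLE-PART extractor for sentences like "O Pedro partiu uma perna".
        import re

        NBP = {"perna", "braço", "cabeça", "estômago"}  # tiny Nbp lexicon (assumption)

        # One illustrative rule: <Human> <verb> (o|a|um|uma) <Nbp>.
        RULE = re.compile(
            r"\bO\s+(?P<whole>[A-Z]\w+)\s+\w+\s+(?:o|a|um|uma)\s+(?P<part>\w+)"
        )

        def extract_whole_part(sentence: str):
            m = RULE.search(sentence)
            if m and m.group("part") in NBP:
                return f"WHOLE-PART({m.group('whole')},{m.group('part')})"
            return None  # no rule fired

        print(extract_whole_part("O Pedro partiu uma perna"))
        # -> WHOLE-PART(Pedro,perna)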

    Automatic Identification of Whole-Part Relations in Portuguese

    In this paper, we improve the extraction of semantic relations between textual elements as currently performed by STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese, by targeting whole-part relations (meronymy), that is, a semantic relation between an entity that is perceived as a constituent part of another entity, or a member of a set. In this case, we focus on the type of meronymy involving human entities and body-part nouns (Nbp).

    Basic Structure of the General Information Theory

    The basic structure of the General Information Theory (GIT) is presented in the paper. The main divisions of the GIT are outlined, and some new results are pointed out.

    The Staple Commodities of the Knowledge Market

    In this paper, the "Information Market" is introduced as a paid exchange of information and the information interaction based on it. In addition, a special kind of Information Market, the Knowledge Market, is outlined. The main focus of the paper is the investigation of the staple commodities of knowledge markets. These are introduced as a kind of information object, called "knowledge information objects". Their main distinctive characteristic is that they contain information models that concern sets of information models and the interconnections between them.
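
    One possible reading of such an object as a data structure, assuming nothing beyond the description above (all names are illustrative, not drawn from GIT):

        # A "knowledge information object": a set of information models plus
        # the interconnections between them.
        from dataclasses import dataclass, field

        @dataclass
        class KnowledgeInformationObject:
            name: str
            models: set = field(default_factory=set)  # contained information models
            links: set = field(default_factory=set)   # interconnections (model pairs)

            def connect(self, a: str, b: str) -> None:
                assert a in self.models and b in self.models
                self.links.add((a, b))

        kio = KnowledgeInformationObject("pricing", models={"demand", "supply"})
        kio.connect("demand", "supply")
        print(kio.links)  # {('demand', 'supply')}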

    CGX: Adaptive System Support for Communication-Efficient Deep Learning

    The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, in particular via hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order-of-magnitude price difference between "cloud-grade" servers with such support and their popular "consumer-grade" counterparts, although single server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, both for multi-GPU single-node training and for larger-scale multi-node training. CGX is based on two technical advances. At the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly efficient support for compressed communication. At the application level, it provides seamless, parameter-free integration with popular frameworks, so that end-users have to modify neither training recipes nor significant amounts of training code. This is complemented by a layer-wise adaptive compression technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
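
    A minimal sketch of the kind of compressed communication involved, assuming plain NumPy in place of the GPU communication stack (CGX's actual stack and its adaptive, layer-wise policy are far more involved):

        # Uniform 8-bit gradient quantization before an (abstracted) all-reduce.
        import numpy as np

        def quantize(grad: np.ndarray, bits: int = 8):
            """Scale gradients into the signed integer range for `bits` bits."""
            levels = 2 ** (bits - 1) - 1
            scale = float(np.abs(grad).max()) / levels or 1.0
            return np.round(grad / scale).astype(np.int8), scale

        def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
            return q.astype(np.float32) * scale

        grad = np.random.randn(1024).astype(np.float32)
        q, s = quantize(grad)        # int8 payload: 4x smaller than float32
        restored = dequantize(q, s)  # what peers would decode after all-reduce
        print(float(np.abs(grad - restored).max()))  # small quantization error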

    The Information

    The current formal as well as informal definitions of the concept "Information" are presented in the paper.

    Business Informatics

    A definition of the concept "business informatics", based on the General Information Theory, is discussed in the paper.

    L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning

    Data-parallel distributed training of deep neural networks (DNNs) has gained very widespread adoption, but can still experience communication bottlenecks. To address this issue, entire families of compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all known compression schemes apply compression uniformly across DNN layers, although layers are heterogeneous in terms of parameter count and their impact on model accuracy. In this work, we provide a general framework for adapting the degree of compression across the model's layers dynamically during training, improving the overall compression while leading to substantial speedups, without sacrificing accuracy. Our framework, called L-GreCo, is based on an adaptive algorithm which automatically picks the optimal compression parameters for model layers, guaranteeing the best compression ratio while satisfying an error constraint. Extensive experiments over image classification and language modeling tasks show that L-GreCo is effective across all existing families of compression methods, and achieves up to 2.5x training speedup and up to 5x compression improvement over efficient implementations of existing approaches, while recovering full accuracy. Moreover, L-GreCo is complementary to existing adaptive algorithms, improving their compression ratio by 50% and practical throughput by 66%.
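
    A hedged sketch of this layer-wise selection, assuming top-k sparsification as the compression family and a simple relative-error budget (the candidate ratios, error model, and greedy scan are illustrative, not the paper's algorithm):

        # Pick, per layer, the sparsest top-k density whose dropped-mass error fits a budget.
        import numpy as np

        DENSITIES = [0.5, 0.25, 0.1, 0.05, 0.01]  # fraction of entries kept, densest first

        def topk_error(grad: np.ndarray, density: float) -> float:
            """L2 norm of the entries dropped by top-k sparsification."""
            k = max(1, int(density * grad.size))
            dropped = np.sort(np.abs(grad).ravel())[:-k]
            return float(np.sqrt(np.sum(dropped ** 2)))

        def pick_densities(layers: dict, budget: float) -> dict:
            choice = {}
            for name, grad in layers.items():
                choice[name] = 1.0  # default: no compression
                norm = float(np.linalg.norm(grad)) or 1.0
                for d in DENSITIES:  # densest -> sparsest; error grows as density shrinks
                    if topk_error(grad, d) / norm <= budget:
                        choice[name] = d  # keep the sparsest density still in budget
            return choice

        layers = {"conv1": np.random.randn(64, 3, 3, 3), "fc": np.random.randn(256, 128)}
        print(pick_densities(layers, budget=0.5))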

    Advance of the Access Methods

    The goal of this paper is to outline the advance of access methods over the last ten years, as well as to review all methods available in the accessible bibliography.