Authorship attribution in Portuguese using character N-grams
For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus.
Funding: Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)
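The feature pipeline this abstract describes is easy to sketch. A minimal illustration follows, assuming scikit-learn and toy data; the char_wb analyzer only approximates the typed (affix and word) character n-grams studied in the paper, and the n-gram range, classifier, and corpus are illustrative assumptions rather than the paper's tuned configuration.

    # Minimal sketch: character n-gram features for authorship attribution.
    # Corpus, labels, and parameter choices are toy assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["o gato dorme no sofa", "a chuva cai devagar"]  # toy documents
    authors = ["A", "B"]                                     # toy author labels

    # analyzer="char_wb" builds character n-grams within word boundaries,
    # which roughly mimics typed n-grams (prefixes, suffixes, whole words).
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
        LogisticRegression(),
    )
    model.fit(texts, authors)
    print(model.predict(["o cao dorme no tapete"]))

Selecting which n-gram types and lengths enter the vocabulary, rather than taking all of them, is the fine-tuning step the abstract credits for the improvement.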
Automatic identification of whole-part relations in Portuguese
Master's dissertation, Ciências da Linguagem (Language Sciences), Faculdade de Ciências Humanas e Sociais, Universidade do Algarve, 2014.
In this work, we improve the extraction of semantic relations between textual elements as it is currently
performed by STRING, a hybrid statistical and rule-based Natural Language Processing (NLP) chain for
Portuguese, by targeting whole-part relations (meronymy), that is, a semantic relation between an entity
that is perceived as a constituent part of another entity, or a member of a set. In this case, we focus on
the type of meronymy involving human entities and body-part nouns (Nbp); e.g., O Pedro partiu uma perna
'Pedro broke a leg': WHOLE-PART(Pedro,perna). In order to extract this type of whole-part relation,
a rule-based meronymy extraction module has been built and integrated in the grammar of the STRING
system.
Around 17,000 Nbp instances were extracted from the first fragment of the CETEMPúblico corpus
for the evaluation of this work. We also retrieved 79 instances of disease nouns (Nsick), which are derived
from an underlying Nbp (e.g., gastrite-estômago 'gastritis-stomach'). In order to produce a gold
standard for the evaluation, a random stratified sample of 1,000 sentences was selected, keeping the proportion
of the total frequency of Nbp in the source corpus. This sample also includes a small number of
Nsick (6 lemmas, 17 sentences). These instances were annotated by four native Portuguese speakers, and
for 100 of them the inter-annotator agreement was calculated and was deemed from “fair” to “good”.
Comparing the system's output against the gold standard, the results for Nbp
show 0.57 precision, 0.38 recall, 0.46 F-measure, and 0.81 accuracy. The recall is relatively small (0.38),
which can be explained by many factors such as the fact that in many sentences, the whole and the part
are not syntactically related. The precision is somewhat better (0.57). The accuracy is relatively high
(0.81) since there is a large number of true-negative cases. The results for Nsick, though the number of
instances is small, show 0.50 precision, 0.11 recall, 0.17 F-measure, and 0.76 accuracy. A detailed error
analysis was performed, some improvements were made, and a second evaluation of the system's
performance was carried out. It showed that the precision improved by 0.13 (from 0.57 to 0.70), the recall
by 0.11 (from 0.38 to 0.49), the F-measure by 0.12 (from 0.46 to 0.58), and the accuracy by 0.04 (from 0.81
to 0.85). The results for Nsick remained the same.
In short, this work may be considered a first attempt to extract whole-part relations involving
human entities and Nbp in Portuguese. A rule-based module was built and integrated into the STRING
system, and it was evaluated with promising results.
Funding: Erasmus Mundus Action 2 2011-2574 Triple I - Integration, Interaction and Institutions scholarship
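As a rough illustration of what one extraction rule in such a module can look like (a sketch only: the dependency-triple input format, the whole_part function, and the toy Nbp lexicon are assumptions, not STRING's actual grammar), a single rule can link a human subject to a body-part noun in object position:

    # Minimal sketch of one whole-part rule, assuming sentences arrive
    # pre-parsed as (relation, head, dependent) dependency triples.
    BODY_PART_NOUNS = {"perna", "braco", "cabeca"}  # toy Nbp lexicon

    def whole_part(triples):
        """Emit WHOLE-PART(whole, part) when a verb links a subject
        (the whole) to a direct object that is a body-part noun."""
        subjects = {head: dep for rel, head, dep in triples if rel == "subj"}
        relations = []
        for rel, head, dep in triples:
            if rel == "obj" and dep in BODY_PART_NOUNS and head in subjects:
                relations.append(("WHOLE-PART", subjects[head], dep))
        return relations

    # "O Pedro partiu uma perna" -> subj(partiu, Pedro), obj(partiu, perna)
    print(whole_part([("subj", "partiu", "Pedro"), ("obj", "partiu", "perna")]))
    # [('WHOLE-PART', 'Pedro', 'perna')]

The low recall reported above fits this picture: a rule of this kind fires only when the whole and the part are syntactically linked, which many sentences do not guarantee.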
Automatic Identification of Whole-Part Relations in Portuguese
In this paper, we improve the extraction of semantic relations between textual elements as it is currently performed by STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese, by targeting whole-part relations (meronymy), that is, a semantic relation between an entity that is perceived as a constituent part of another entity, or a member of a set. In this case, we focus on the type of meronymy involving human entities and body-part nouns.
Basic Structure of the General Information Theory
The basic structure of the General Information Theory (GIT) is presented in the paper. The main
divisions of the GIT are outlined, and some new results are pointed out.
The Staple Commodities of the Knowledge Market
In this paper, the “Information Market” is introduced as a paid information exchange, together with the information interaction based on it. In addition, a special kind of Information Market, the Knowledge Market, is outlined. The main focus of the paper is the investigation of the staple commodities of knowledge markets. These are introduced as a kind of information object, called “knowledge information objects”. Their main distinctive characteristic is that they contain information models, which concern sets of information models and the interconnections between them.
CGX: Adaptive System Support for Communication-Efficient Deep Learning
The ability to scale out training workloads has been one of the key
performance enablers of deep learning. The main scaling approach is
data-parallel GPU-based training, which has been boosted by hardware and
software support for highly efficient point-to-point communication, and in
particular via hardware bandwidth overprovisioning. Overprovisioning comes at a
cost: there is an order-of-magnitude price difference between "cloud-grade"
servers with such support and their popular "consumer-grade"
counterparts, although single server-grade and consumer-grade GPUs can have
similar computational envelopes.
In this paper, we show that the costly hardware overprovisioning approach can
be supplanted via algorithmic and system design, and propose a framework called
CGX, which provides efficient software support for compressed communication in
ML applications, both for multi-GPU single-node training and for
larger-scale multi-node training. CGX is based on two technical advances:
\emph{At the system level}, it relies on a re-developed communication stack for
ML frameworks, which provides flexible, highly-efficient support for compressed
communication. \emph{At the application level}, it provides \emph{seamless,
parameter-free} integration with popular frameworks, so that end-users do not
have to modify training recipes or make significant changes to training code. This is
complemented by a \emph{layer-wise adaptive compression} technique which
dynamically balances compression gains with accuracy preservation. CGX
integrates with popular ML frameworks, providing up to 3X speedups for
multi-GPU nodes based on commodity hardware, and order-of-magnitude
improvements in the multi-node setting, with negligible impact on accuracy.
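A minimal sketch of the layer-wise adaptive idea follows; uniform quantization and the relative-error thresholds are illustrative stand-ins, not CGX's actual compression stack or selection rule.

    # Minimal sketch: pick a per-layer quantization bit-width by measuring
    # the relative error it introduces. All thresholds are toy assumptions.
    import numpy as np

    def quantize(grad, bits):
        """Uniformly quantize a gradient tensor to 2**bits levels."""
        levels = 2 ** bits - 1
        lo, hi = grad.min(), grad.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        return np.round((grad - lo) / scale) * scale + lo

    def pick_bits(grad, max_rel_error=0.05, candidates=(2, 4, 8)):
        """Return the smallest bit-width whose relative L2 error is acceptable."""
        norm = np.linalg.norm(grad)
        for bits in candidates:
            if norm == 0 or np.linalg.norm(grad - quantize(grad, bits)) / norm <= max_rel_error:
                return bits
        return candidates[-1]

    rng = np.random.default_rng(0)
    for name, g in {"conv1": rng.normal(size=1000), "fc": rng.normal(size=100)}.items():
        print(name, "->", pick_bits(g), "bits")

Layers that tolerate aggressive quantization are compressed harder, which is how compression gains are balanced against accuracy preservation.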
The Information
The current formal as well as not formal definitions of the concept "Information” are presented in
the paper
Business Informatics
A definition of the concept "business informatics" based on the General Information Theory is
discussed in the paper
L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning
Data-parallel distributed training of deep neural networks (DNN) has gained
very widespread adoption, but can still experience communication bottlenecks.
To address this issue, entire families of compression mechanisms have been
developed, including quantization, sparsification, and low-rank approximation,
some of which are seeing significant practical adoption. Despite this progress,
almost all known compression schemes apply compression uniformly across DNN
layers, although layers are heterogeneous in terms of parameter count and their
impact on model accuracy. In this work, we provide a general framework for
adapting the degree of compression across the model's layers dynamically during
training, improving the overall compression, while leading to substantial
speedups, without sacrificing accuracy. Our framework, called L-GreCo, is based
on an adaptive algorithm, which automatically picks the optimal compression
parameters for model layers, guaranteeing the best compression ratio while
satisfying an error constraint. Extensive experiments over image classification
and language modeling tasks show that L-GreCo is effective across all existing
families of compression methods, achieving up to 2.5x training
speedup and up to 5x compression improvement over efficient
implementations of existing approaches while recovering full accuracy.
Moreover, L-GreCo is complementary to existing adaptive algorithms, improving
their compression ratio by 50% and practical throughput by 66%.
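To make the per-layer selection concrete, the sketch below picks, for each layer, the smallest top-k density that satisfies an error budget; top-k sparsification as the compression family and the greedy threshold search are illustrative assumptions, a simplification of the constrained optimization the abstract describes.

    # Minimal sketch: per-layer top-k density chosen as the smallest density
    # whose relative L2 error stays within a budget. Values are toy choices.
    import numpy as np

    def topk_error(grad, density):
        """Relative L2 error of keeping only the top `density` fraction."""
        k = max(1, int(density * grad.size))
        kept = np.zeros_like(grad)
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        kept[idx] = grad[idx]
        return np.linalg.norm(grad - kept) / np.linalg.norm(grad)

    def pick_density(grad, budget=0.3, densities=(0.01, 0.05, 0.1, 0.5, 1.0)):
        """Smallest density (highest compression) meeting the error budget."""
        for d in densities:
            if topk_error(grad, d) <= budget:
                return d
        return 1.0

    rng = np.random.default_rng(1)
    layers = {"embed": rng.laplace(size=5000), "head": rng.normal(size=500)}
    for name, g in layers.items():
        print(name, "-> density", pick_density(g))

Layers whose gradients concentrate their norm in few coordinates admit much lower densities under the same budget; this heterogeneity across layers is what adaptive schemes exploit.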
Advance of the Access Methods
The goal of this paper is to outline the advance of access methods in the last ten years, as well as
to review all methods available in the accessible bibliography.