A Large-Scale Comparison of Historical Text Normalization Systems
There is no consensus on the state-of-the-art approach to historical text
normalization. Many techniques have been proposed, including rule-based
methods, distance metrics, character-based statistical machine translation, and
neural encoder-decoder models, but studies have used different datasets,
different evaluation methods, and have come to different conclusions. This
paper presents the largest study of historical text normalization done so far.
We critically survey the existing literature and report experiments on eight
languages, comparing systems spanning all categories of proposed normalization
techniques, analysing the effect of training data quantity, and using different
evaluation methods. The datasets and scripts are made publicly available.
Comment: Accepted at NAACL 2019
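The most common evaluation method for normalization systems is word-level accuracy. A minimal sketch of that metric (the example tokens are invented; the paper's released scripts are the authoritative implementation):

```python
# Word-level accuracy for historical text normalization: the fraction of
# tokens whose predicted normalization matches the gold-standard form.
# The token pairs below are invented for illustration.

def word_accuracy(predictions, gold):
    """Fraction of predicted normalizations that match the gold forms."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

predictions = ["year", "olde", "night"]
gold        = ["year", "old",  "night"]
print(word_accuracy(predictions, gold))  # 2 of 3 tokens correct
```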
Techniques for Automatic Normalization of Orthographically Variant Yiddish Texts
Yiddish is characterized by a multitude of orthographic systems. A number of approaches to automatic normalization of variant orthography have been explored for the processing of historic texts of languages whose orthography has since been standardized. However, these approaches have not yet been applied to Yiddish.
Using a manually normalized set of 16 Yiddish documents as a training and test corpus, four techniques for automatic normalization were compared: a hand-crafted set of transformation rules, an off-the-shelf spell checker, edit distance minimization with manually set weights, and edit distance minimization with weights learned through a training set.
Performance was evaluated by the proportion of correctly normalized words in a test set, and by precision and recall in an information retrieval test.
For the given test corpus, normalization by edit distance minimization with multi-character edit operations and learned weights was found to perform best in all tests.
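The best-performing technique above can be sketched as follows: a variant spelling is mapped to the candidate lexicon entry with the lowest weighted edit distance. The weight table and example words here are hypothetical; in the study's setup the weights are learned from a training set of normalization pairs, and multi-character operations are also supported.

```python
# Normalization by weighted edit distance minimization: pick the lexicon
# entry with the lowest edit cost to the input. The substitution-cost table
# is a hypothetical stand-in for learned weights.

def weighted_edit_distance(source, target, sub_cost):
    """Levenshtein distance with per-character-pair substitution costs."""
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s, t = source[i - 1], target[j - 1]
            cost = 0.0 if s == t else sub_cost.get((s, t), 1.0)
            d[i][j] = min(d[i - 1][j] + 1.0,       # deletion
                          d[i][j - 1] + 1.0,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def normalize(word, lexicon, sub_cost):
    """Return the lexicon entry with minimal weighted distance to `word`."""
    return min(lexicon, key=lambda cand: weighted_edit_distance(word, cand, sub_cost))

# Hypothetical example: a learned table might assign a low cost to a
# frequent variant substitution such as 'y' -> 'i'.
sub_cost = {("y", "i"): 0.1}
print(normalize("wyf", ["wif", "wolf", "way"], sub_cost))  # -> wif
```

Learning the weights then amounts to estimating substitution costs from aligned variant/normalized training pairs, e.g. by making frequently attested character correspondences cheap.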
LL(O)D and NLP perspectives on semantic change for humanities research
This paper presents an overview of LL(O)D and NLP methods, tools and data for detecting and representing semantic change, with its main application in humanities research. The paper's aim is to provide a starting point for the construction of a workflow and a set of multilingual diachronic ontologies within the humanities use case of the COST Action Nexus Linguarum, the European network for Web-centred linguistic data science (CA18209). The survey focuses on the essential aspects needed to understand current trends and to build applications in this area of study.
Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters
Contains fulltext: 120250.pdf (publisher's version, closed access). 13 November 2013. 27 p.
CLARIN
The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.
CLARIN. The infrastructure for language resources
CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future.
The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).
Sumarização Personalizada e Subjectiva de Texto (Personalized and Subjective Text Summarization)
A text can be summarized, i.e., its subject or concept can be represented in a more succinct form. The most common representation of a summary is written text, since people constantly produce summaries when they want to describe a particular subject.
Over recent years, use of the Internet has become widespread, and the amount of information available on this huge network has grown exponentially, a phenomenon known as information overload. This raises a number of problems, among them finding relevant information on a given topic. Someone searching for that information wants to find it efficiently, that is, quickly and directly on the intended subject. There are ways to search by subject, but as for speed, we are faced with an enormous amount of information that often differs from what we are looking for, and reading all of it is very time-consuming.
One way to address this problem is to summarize the text that is found, so that we can more quickly get a sense of its topic. In the field of summarization, several techniques exist for producing a more specific summary.
This dissertation is based on combining several established techniques: word relevance and informativeness, objectivity, topic segmentation, and the use of words that represent the domain of the text.
In the statistical approach, the key elements are term relevance, computed from term frequencies in the text and in a corpus; keyword extraction, based on the relevance of words in the text; and sentence position in the document, which can be scored in different ways depending on the text type. Since the evaluation uses news texts, a positional heuristic was implemented that assigns more relevance to sentences near the top of the text. The approach based on the subjectivity of a text is implemented using the lexical resource SentiWordNet [BES10]. A hybrid approach was also implemented that combines all or some of the methods above.
The system was evaluated on two news datasets: one from the Document Understanding Conference (DUC 2001), the other the TeMário corpus. To evaluate the produced summaries automatically, a Java implementation of the ROUGE tool (Recall-Oriented Understudy for Gisting Evaluation) was used. Comparing the hybrid method with the others, with and without topic identification, showed that the sentence-position heuristic obtains the best results; accordingly, hybrid methods in which this feature carries more weight than the others generally perform better, both with and without topic segmentation. The best overall performance is obtained by the hybrid method that assigns the greatest weight to the positional heuristic, without topic identification.
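The scoring scheme described above can be sketched as a small extractive summarizer: each sentence is scored by the term-frequency relevance of its words plus a positional heuristic favouring sentences near the top of a news text, and summaries can be scored against a reference with ROUGE-1 recall. The weights and toy inputs are illustrative only; the dissertation's system also uses SentiWordNet-based subjectivity, keyword extraction, and topic segmentation, which are omitted here.

```python
# Hybrid extractive scoring sketch: term-frequency relevance combined with a
# positional heuristic, plus a ROUGE-1 recall scorer for evaluation.
# Weights (w_freq, w_pos) are illustrative, not the dissertation's settings.
from collections import Counter

def summarize(sentences, n=1, w_freq=0.5, w_pos=0.5):
    """Return the n highest-scoring sentences, in original order."""
    words = [s.lower().split() for s in sentences]
    tf = Counter(w for sent in words for w in sent)
    max_tf = max(tf.values())
    def score(i):
        rel = sum(tf[w] for w in words[i]) / (max_tf * len(words[i]))  # term relevance
        pos = 1.0 - i / len(sentences)                                  # positional heuristic
        return w_freq * rel + w_pos * pos
    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]

def rouge_1_recall(summary, reference):
    """Unigram recall of the reference summary, as in ROUGE-1."""
    s = Counter(summary.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum(min(s[w], r[w]) for w in r)
    return overlap / sum(r.values())
```

Raising `w_pos` relative to `w_freq` reproduces the finding reported above: on news text, weighting sentence position more heavily tends to improve the selected summary.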