49,249 research outputs found
Challenges and solutions for Latin named entity recognition
Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity
Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many down-stream applications, from machine translation to digital historiography, enabling Classicists, historians, and archaeologists for instance, to track
the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree
of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality
Spanish named entity recognition in the biomedical domain
Named Entity Recognition in the clinical domain and in languages different from English has the difficulty of the absence of complete dictionaries, the informality of texts, the polysemy of terms, the lack of accordance in the boundaries of an entity, the scarcity of corpora and of other resources available. We present a Named Entity Recognition method for poorly resourced languages. The method was tested with Spanish radiology reports and compared with a conditional random fields system.Peer ReviewedPostprint (author's final draft
Tracking the History and Evolution of Entities: Entity-centric Temporal Analysis of Large Social Media Archives
How did the popularity of the Greek Prime Minister evolve in 2015? How did
the predominant sentiment about him vary during that period? Were there any
controversial sub-periods? What other entities were related to him during these
periods? To answer these questions, one needs to analyze archived documents and
data about the query entities, such as old news articles or social media
archives. In particular, user-generated content posted in social networks, like
Twitter and Facebook, can be seen as a comprehensive documentation of our
society, and thus meaningful analysis methods over such archived data are of
immense value for sociologists, historians and other interested parties who
want to study the history and evolution of entities and events. To this end, in
this paper we propose an entity-centric approach to analyze social media
archives and we define measures that allow studying how entities were reflected
in social media in different time periods and under different aspects, like
popularity, attitude, controversiality, and connectedness with other entities.
A case study using a large Twitter archive of four years illustrates the
insights that can be gained by such an entity-centric and multi-aspect
analysis.Comment: This is a preprint of an article accepted for publication in the
International Journal on Digital Libraries (2018
Readers and Reading in the First World War
This essay consists of three individually authored and interlinked sections. In ‘A Digital Humanities Approach’, Francesca Benatti looks at datasets and databases (including the UK Reading Experience Database) and shows how a systematic, macro-analytical use of digital humanities tools and resources might yield answers to some key questions about reading in the First World War. In ‘Reading behind the Wire in the First World War’ Edmund G. C. King scrutinizes the reading practices and preferences of Allied prisoners of war in Mainz, showing that reading circumscribed by the contingencies of a prison camp created an unique literary community, whose legacy can be traced through their literary output after the war. In ‘Book-hunger in Salonika’, Shafquat Towheed examines the record of a single reader in a specific and fairly static frontline, and argues that in the case of the Salonika campaign, reading communities emerged in close proximity to existing centres of print culture. The focus of this essay moves from the general to the particular, from the scoping of large datasets, to the analyses of identified readers within a specific geographical and temporal space. The authors engage with the wider issues and problems of recovering, interpreting, visualizing, narrating, and representing readers in the First World War
Discovering Power Laws in Entity Length
This paper presents a discovery that the length of the entities in various
datasets follows a family of scale-free power law distributions. The concept of
entity here broadly includes the named entity, entity mention, time expression,
aspect term, and domain-specific entity that are well investigated in natural
language processing and related areas. The entity length denotes the number of
words in an entity. The power law distributions in entity length possess the
scale-free property and have well-defined means and finite variances. We
explain the phenomenon of power laws in entity length by the principle of least
effort in communication and the preferential mechanism
- …