1,303 research outputs found
Reconciling and Using Historical Person Registers as Linked Open Data in the AcademySampo Portal and Data Service
This paper presents a method for extracting and reassembling a genealogical network automatically from a biographical register of historical people. The method is applied to a dataset of short textual biographies about all 28 000 Finnish and Swedish academic people educated in 1640-1899 in Finland. The aim is to connect and disambiguate the relatives mentioned in the biographies in order to build a continuous, genealogical network, which can be used in Digital Humanities for data and network analysis of historical academic people and their lives. An artificial neural network approach is presented for solving a supervised learning task to disambiguate relatives mentioned in the register descriptions using basic biographical information enhanced with an ontology of vocations and additional occasionally sparse genealogical information. Evaluation results of the record linkage are promising and provide novel insights into the problem of historical people register reconciliation. The outcome of the work has been used in practice as part of the in-use AcademySampo portal and linked open data service, a new member in the Sampo series of cultural heritage applications for Digital Humanities. Peer reviewe
Entity reconciliation in big data sources: A systematic mapping study
The entity reconciliation (ER) problem has aroused much interest as a research topic in today’s Big Data era, full of big and open heterogeneous data sources. This problem arises when relevant information on a topic needs to be obtained using methods based on: (i) identifying records that represent the same real-world entity, and (ii) identifying those records that are similar but do not correspond to the same real-world entity. ER is an operational intelligence process, whereby organizations can unify different and heterogeneous data sources in order to relate possible matches of non-obvious entities. In addition to the complexity that the heterogeneity of data sources involves, the large number of records and the differences among languages, for instance, must be taken into account. This paper describes a Systematic Mapping Study (SMS) of journal articles, conferences and workshops published from 2010 to 2017 to solve the problem described above, first trying to understand the state of the art, and then identifying any gaps in current research. Eleven digital libraries were analyzed following a systematic, semiautomatic and rigorous process that resulted in 61 primary studies. They represent a great variety of intelligent proposals that aim to solve ER. The conclusion obtained is that most of the research is based on the operational phase as opposed to the design phase, and most studies have been tested on real-world data sources, many of which are heterogeneous, but just a few apply to industry. There is a clear trend in research techniques based on clustering/blocking and graphs, although the level of automation of the proposals is hardly ever mentioned in the research work. Ministerio de Economía y Competitividad TIN2013-46928-C3-3-R; Ministerio de Economía y Competitividad TIN2016-76956-C3-2-R; Ministerio de Economía y Competitividad TIN2015-71938-RED
Genetic variation in prehistoric Sardinia
We sampled teeth from 53 ancient Sardinian (Nuragic) individuals who lived in the Late Bronze Age and Iron Age, between 3,430 and 2,700 years ago. After eliminating the samples that, in preliminary biochemical tests, did not show a high probability of yielding reproducible results, we obtained 23 sequences of the mitochondrial DNA control region, which were assigned to haplogroups by comparison with a dataset of modern sequences. The Nuragic samples show a remarkably low genetic diversity, comparable to that observed in ancient Iberians, but much lower than among the Etruscans. Most of these sequences have exact matches in two modern Sardinian populations, supporting a clear genealogical continuity from the Late Bronze Age up to current times. The Nuragic populations appear to be part of a large and geographically unstructured cluster of modern European populations, thus making it difficult to infer their evolutionary relationships. However, the low levels of genetic diversity, both within and among ancient samples, as opposed to the sharp differences among modern Sardinian samples, support the hypothesis of the expansion of a small group of maternally related individuals, and of comparatively recent differentiation of the Sardinian gene pools. © Springer-Verlag 2007
A Tale of Two Transcriptions : Machine-Assisted Transcription of Historical Sources
This article is part of the "Norwegian Historical Population Register" project financed by the Norwegian Research Council (grant # 225950) and the Advanced Grant Project "Five Centuries of Marriages" (2011-2016) funded by the European Research Council (# ERC 2010-AdG_20100407). This article explains how two projects implement semi-automated transcription routines: for census sheets in Norway and marriage protocols from Barcelona. The Spanish system was created to transcribe the marriage license books from 1451 to 1905 for the Barcelona area, one of the world's longest series of preserved vital records. Thus, in the project "Five Centuries of Marriages" (5CofM) at the Autonomous University of Barcelona's Center for Demographic Studies, the Barcelona Historical Marriage Database has been built. More than 600,000 records were transcribed by 150 transcribers working online. The Norwegian material is cross-sectional, as it is the 1891 census, recorded on one sheet per person. This format and the underlining of keywords for several variables made it more feasible to semi-automate data entry than when many persons are listed on the same page. While Optical Character Recognition (OCR) for printed text is scientifically mature, computer vision research is now focused on more difficult problems such as handwriting recognition. In the marriage project, document analysis methods have been proposed to automatically recognize the marriage licenses. Fully automatic recognition is still a challenge, but some promising results have been obtained. In Spain, Norway and elsewhere the source material is available as scanned pictures on the Internet, opening up the possibility for further international cooperation on automating the transcription of historic source materials. As in projects to digitize printed materials, the optimal solution for handwritten sources is likely to be a combination of manual transcription and machine-assisted recognition.
Web service for 19th century Irish personal name matching
Before the first Irish civil registration in 1864, census materials were
mostly lost or incomplete. So genealogical research uses parish records
and also some ‘census substitute’ documents, such as land ownership
and tenancy records. However, some of these documents may not
contain enough information to identify individuals. Some of them
contain a name and address, whereas others might contain only a
name.
Record linkage is one method to gather scattered information among
many documents. It uses a person's name as a reference to link that
person's information between many documents. With patience, more
complete information about that person can be obtained.
Therefore linking or matching a person's name is important in the
process. Unfortunately, in the 19th century, in Ireland, there was no
standard spelling of names, handwriting could be difficult to read
and contractions or abbreviations were often used. Names with
the same pronunciation, referring to the same individual, could be
written in many different ways. Moreover, Irish-language equivalents
of English names were used; for example, the Irish version
of ‘Smith’ could be ‘Gowan’. A further complication is that historical
and genealogical research often requires large quantities of names to
be matched.
To handle these name variations, various solutions have been created
to match different names that refer to the same person.
However, to the best of our knowledge, there is as yet no public system
which combines those solutions and provides a service for
bulk name matching. Thus, we developed a web service using
the Ruby on Rails framework to achieve our goal. The system initially
implements 4 matching algorithms: Levenshtein distance, soundex,
Irish soundex, and lookup table. We also present a web interface for
a client to use the system from the web browser. It is designed to be
simple and extensible through the use of inheritance.
The system performs matching on large quantities of names in
a reasonable time. We tested our system with 12,944 name matchings
and the run completed in no more than half a minute (28,786
milliseconds, to be precise). However, the system consumes a large
amount of memory (around 373 megabytes). We believe that, with
proper optimisation, we could reduce the memory usage and
shorten the processing time. Further matching algorithms could also
be implemented for names in other languages, so that it can handle a
broader domain of names.
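As a rough illustration of three of the four matching strategies named in the abstract above (Levenshtein distance, soundex, and lookup table), the following Python sketch combines them into a single check. It is not the authors' Ruby on Rails implementation: the function names, the `max_distance` threshold, and the `EQUIVALENTS` table are assumptions made for the example, the soundex shown is the standard American variant, and the Irish soundex is not reproduced because the abstract does not describe its rules.

```python
# Illustrative sketch only; names and thresholds are hypothetical.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Standard American Soundex consonant groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character phonetic code: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (encoded + "000")[:4]


# Hypothetical lookup table; the Smith/Gowan pair is the abstract's example.
EQUIVALENTS = {frozenset({"smith", "gowan"})}


def names_match(a: str, b: str, max_distance: int = 1) -> bool:
    """True if any of the three strategies considers the names a match."""
    a_l, b_l = a.lower(), b.lower()
    if frozenset({a_l, b_l}) in EQUIVALENTS:
        return True
    if soundex(a) and soundex(a) == soundex(b):
        return True
    return levenshtein(a_l, b_l) <= max_distance
```

With these assumptions, `names_match("Smith", "Smyth")` succeeds on the phonetic code, `names_match("Smith", "Gowan")` succeeds only through the lookup table, and an unrelated pair such as "Smith" and "Kelly" fails all three checks.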
Aspects of Record Linkage
This thesis is an exploration of the subject of historical record linkage. The general goal of historical record linkage is to discover relations between historical entities in a database, for any specific definition of relation, entity and database. Although this task originates from historical research, multiple disciplines are involved. Increasing volumes of data necessitate the use of automated or semi-automated linkage procedures, which is in the domain of computer science. Linkage methodologies depend heavily on the nature of the data itself, often requiring analysis based on onomastics (i.e., the study of person names) or general linguistics. To understand the dynamics of natural language one could be tempted to look at the source of language, i.e., humans, either on the individual cognitive level or as group behaviour. This further increases the multidisciplinarity of the subject by including cognitive psychology. Every discipline addresses a subset of problem aspects, all of which can contribute either to practical solutions for linkage problems or to further insights into the subject matter. Algorithms and the Foundations of Software technolog
- …