1,303 research outputs found

    Reconciling and Using Historical Person Registers as Linked Open Data in the AcademySampo Portal and Data Service

    This paper presents a method for extracting and reassembling a genealogical network automatically from a biographical register of historical people. The method is applied to a dataset of short textual biographies about all 28 000 Finnish and Swedish academic people educated in 1640-1899 in Finland. The aim is to connect and disambiguate the relatives mentioned in the biographies in order to build a continuous genealogical network, which can be used in Digital Humanities for data and network analysis of historical academic people and their lives. An artificial neural network approach is presented for solving a supervised learning task to disambiguate relatives mentioned in the register descriptions using basic biographical information enhanced with an ontology of vocations and additional, occasionally sparse, genealogical information. Evaluation results of the record linkage are promising and provide novel insights into the problem of historical people register reconciliation. The outcome of the work has been used in practice as part of the in-use AcademySampo portal and linked open data service, a new member in the Sampo series of cultural heritage applications for Digital Humanities. Peer reviewed.

    Entity reconciliation in big data sources: A systematic mapping study

    The entity reconciliation (ER) problem has aroused much interest as a research topic in today's Big Data era, full of big and open heterogeneous data sources. This problem arises when relevant information on a topic needs to be obtained using methods based on: (i) identifying records that represent the same real-world entity, and (ii) identifying those records that are similar but do not correspond to the same real-world entity. ER is an operational intelligence process, whereby organizations can unify different and heterogeneous data sources in order to relate possible matches of non-obvious entities. In addition to the complexity that the heterogeneity of data sources involves, the large number of records and differences among languages, for instance, must be taken into account. This paper describes a Systematic Mapping Study (SMS) of journal articles, conferences and workshops published from 2010 to 2017 to solve the problem described above, first trying to understand the state of the art, and then identifying any gaps in current research. Eleven digital libraries were analyzed following a systematic, semiautomatic and rigorous process that has resulted in 61 primary studies. They represent a great variety of intelligent proposals that aim to solve ER. The conclusion obtained is that most of the research is based on the operational phase as opposed to the design phase, and most studies have been tested on real-world data sources, many of which are heterogeneous, but just a few apply to industry. There is a clear trend in research techniques based on clustering/blocking and graphs, although the level of automation of the proposals is hardly ever mentioned in the research work. Funding: Ministerio de Economía y Competitividad TIN2013-46928-C3-3-R; Ministerio de Economía y Competitividad TIN2016-76956-C3-2-R; Ministerio de Economía y Competitividad TIN2015-71938-RED.

    Genetic variation in prehistoric Sardinia

    We sampled teeth from 53 ancient Sardinian (Nuragic) individuals who lived in the Late Bronze Age and Iron Age, between 3,430 and 2,700 years ago. After eliminating the samples that, in preliminary biochemical tests, did not show a high probability of yielding reproducible results, we obtained 23 sequences of the mitochondrial DNA control region, which were assigned to haplogroups by comparison with a dataset of modern sequences. The Nuragic samples show a remarkably low genetic diversity, comparable to that observed in ancient Iberians, but much lower than among the Etruscans. Most of these sequences have exact matches in two modern Sardinian populations, supporting a clear genealogical continuity from the Late Bronze Age up to current times. The Nuragic populations appear to be part of a large and geographically unstructured cluster of modern European populations, thus making it difficult to infer their evolutionary relationships. However, the low levels of genetic diversity, both within and among ancient samples, as opposed to the sharp differences among modern Sardinian samples, support the hypothesis of the expansion of a small group of maternally related individuals, and of comparatively recent differentiation of the Sardinian gene pools. © Springer-Verlag 2007

    A Tale of Two Transcriptions : Machine-Assisted Transcription of Historical Sources

    This article is part of the "Norwegian Historical Population Register" project financed by the Norwegian Research Council (grant # 225950) and the Advanced Grant Project "Five Centuries of Marriages" (2011-2016) funded by the European Research Council (# ERC 2010-AdG_20100407). This article explains how two projects implement semi-automated transcription routines: for census sheets in Norway and marriage protocols from Barcelona. The Spanish system was created to transcribe the marriage license books from 1451 to 1905 for the Barcelona area, one of the world's longest series of preserved vital records. Thus, in the project "Five Centuries of Marriages" (5CofM) at the Autonomous University of Barcelona's Center for Demographic Studies, the Barcelona Historical Marriage Database has been built. More than 600,000 records were transcribed by 150 transcribers working online. The Norwegian material is cross-sectional, as it is the 1891 census, recorded on one sheet per person. This format and the underlining of keywords for several variables made it more feasible to semi-automate data entry than when many persons are listed on the same page. While Optical Character Recognition (OCR) for printed text is scientifically mature, computer vision research is now focused on more difficult problems such as handwriting recognition. In the marriage project, document analysis methods have been proposed to automatically recognize the marriage licenses. Fully automatic recognition is still a challenge, but some promising results have been obtained. In Spain, Norway and elsewhere the source material is available as scanned pictures on the Internet, opening up the possibility for further international cooperation on automating the transcription of historic source materials. As in projects to digitize printed materials, the optimal solution for handwritten sources is likely to be a combination of manual transcription and machine-assisted recognition.

    Web service for 19th century Irish personal name matching

    Before the first Irish civil registration in 1864, census materials were mostly lost or incomplete. Genealogical research therefore uses parish records and also some 'census substitute' documents, such as land ownership and tenancy records. However, some of these documents may not contain enough information to identify individuals. Some of them contain a name and address, whereas others might contain only a name. Record linkage is one method of gathering information scattered among many documents. It uses a person's name as a reference to link that person's information across many documents. With patience, more complete information about that person can be obtained. Linking or matching a person's name is therefore important in the process. Unfortunately, in 19th century Ireland there was no standard spelling of names, handwriting could be difficult to read, and contractions or abbreviations were often used. Names with the same pronunciation and for the same individual could be written in many different ways. Moreover, names in the Irish language which are equivalent to English names were used; for example, the Irish version of 'Smith' could be 'Gowan'. A further complication is that historical and genealogical research often requires large quantities of names to be matched. To handle these name variations, various solutions have been created to find different names that refer to the same person. However, to the best of our knowledge, there is as yet no public system which combines those solutions and provides a bulk name matching service. Thus, we developed a web service system using the Ruby on Rails framework to achieve our goal. The system initially implements 4 matching algorithms: Levenshtein distance, soundex, Irish soundex, and a lookup table. We also present a web interface for a client to use the system from the web browser. It is designed to be simple and extensible through inheritance.
    The system performs matchings on large quantities of names in a reasonable time. We tested our system with 12,944 name matchings and the results were completed in no more than half a minute (28,786 milliseconds, to be precise). However, the system consumes a large amount of memory (around 373 megabytes). We believe that, with proper optimisation, we could reduce the memory usage and shorten the processing time. Further matching algorithms could also be implemented for names in other languages, so that the system can handle a broader domain of names.
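
A minimal sketch of how three of the four matchers named in the abstract above might be combined, assuming Python rather than the Ruby on Rails stack the authors report. The function names, the `max_distance` threshold and the single lookup entry (taken from the abstract's 'Smith'/'Gowan' example) are illustrative assumptions, not the authors' implementation. The Irish soundex variant is omitted because the abstract does not give its rules; standard American Soundex stands in for the generic soundex matcher.

```python
# Hypothetical lookup table of equivalent names; only the pair mentioned
# in the abstract is included.
EQUIVALENTS = {frozenset({"smith", "gowan"})}

# Standard American Soundex consonant codes.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def soundex(name: str) -> str:
    """Four-character American Soundex code, or '' for an empty name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate equal codes
            prev = code
    return (result + "000")[:4]


def names_match(a: str, b: str, max_distance: int = 1) -> bool:
    """True if any matcher accepts the pair (assumed combination rule)."""
    x, y = a.strip().lower(), b.strip().lower()
    if x == y or frozenset({x, y}) in EQUIVALENTS:
        return True
    if soundex(x) == soundex(y):
        return True
    return levenshtein(x, y) <= max_distance
```

For example, `names_match("Smith", "Gowan")` is accepted by the lookup table alone, while `names_match("Byrne", "Burn")` is accepted because both names share a Soundex code. Accepting a pair when any single matcher fires favours recall over precision; a real service would probably expose each matcher separately and let the client choose.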

    Aspects of Record Linkage

    This thesis is an exploration of the subject of historical record linkage. The general goal of historical record linkage is to discover relations between historical entities in a database, for any specific definition of relation, entity and database. Although this task originates from historical research, multiple disciplines are involved. Increasing volumes of data necessitate the use of automated or semi-automated linkage procedures, which is in the domain of computer science. Linkage methodologies depend heavily on the nature of the data itself, often requiring analysis based on onomastics (i.e., the study of person names) or general linguistics. To understand the dynamics of natural language one could be tempted to look at the source of language, i.e., humans, either on the individual cognitive level or as group behaviour. This further increases the multidisciplinarity of the subject by including cognitive psychology. Every discipline addresses a subset of problem aspects, all of which can contribute either to practical solutions for linkage problems or to further insights into the subject matter. Algorithms and the Foundations of Software technology