8 research outputs found

    Textbasierte Annotation von Abbildungen mit Kategorien von Wikimedia

    This master's thesis deals with the automatic annotation of images using the Wikipedia category system. The annotation is based on the images' captions and their respective text references. A suitable category is suggested for each existing image. The images are figures from scientific articles published in open access journals. The aim of the work is to develop a conceptual procedure and to carry out and evaluate it on a selected number of images. The images are to be made available for further research and for projects of the Wikimedia Foundation. The annotation method is used in the NOA project (Nachnutzung von Open Access Abbildungen, i.e. reuse of open access media)

    Structural Analysis of Contract Renewals

    No full text
    In the present paper we sketch an automated procedure to compare different versions of a contract. The contract texts used for this purpose are PDF files with differing structures, which are converted into structured XML files by identifying and classifying text boxes. A classifier trained on manually annotated contracts achieves an accuracy of 87% on this task. We align contract versions and classify aligned text fragments into different similarity classes that support the manual comparison of changes between document versions. The main challenges are dealing with OCR errors and with different layouts of identical or similar texts. We demonstrate the procedure using some freely available contracts from the City of Hamburg written in German. The methods, however, are language agnostic and can be applied to other contracts as well

    Preparing Legal Documents for NLP Analysis: Improving the Classification of Text Elements by Using Page Features

    No full text
    Legal documents often have a complex layout with many different headings, headers and footers, side notes, etc. For further processing, it is important to extract these individual components correctly from a legally binding document, for example a signed PDF. A common approach is to classify each (text) region of a page using its geometric and textual features. This approach works well when the training and test data have a similar structure and when the documents of a collection to be analyzed have a rather uniform layout. We show that the use of global page properties can improve the accuracy of text element classification: we first classify each page into one of three layout types. After that, we can train a classifier for each of the three page types and thereby improve the accuracy on a manually annotated collection of 70 legal documents consisting of 20,938 text elements. When we split by page type, we achieve an improvement from 0.95 to 0.98 for single-column pages with left marginalia and from 0.95 to 0.96 for double-column pages. We developed our own feature-based method for page layout detection, which we benchmark against a standard implementation of a CNN image classifier. The approach presented here is based on a corpus of freely available German contracts and general terms and conditions. Both the corpus and all manual annotations are made freely available. The method is language agnostic

    Detecting Paraphrases of Standard Clause Titles in Insurance Contracts

    No full text
    For the analysis of contract texts, validated model texts, such as model clauses, can be used to identify the clauses used in a contract. This paper investigates how the similarity between titles of model clauses and headings extracted from contracts can be computed, and which similarity measure is most suitable for this. To calculate the similarities between title pairs, we tested several variants of string similarity and token-based similarity. We also compare two additional semantic similarity measures based on word embeddings: pre-trained embeddings and embeddings trained on contract texts. The identification of the model clause title can be used as a starting point for the mapping of clauses found in contracts to verified clauses

    Generalisierung von formelhaften Textbestandteilen in juristischen Korpora: Einsatz- und Entwicklungspotential

    No full text
    Generalized legal documents, in which the text positions of a contract's individual specifics are known, can be used, first, to provide automated support for the approval process for new contracts and, second, as a contract generator that makes new legal documents available in preselected form. Using known legal texts, this paper shows how formulaic text passages can be identified and frequent individual variations classified so that they can serve as template passages. It presents areas of application and highlights the existing potential for legal tech applications

    Representing Standard Text Formulations as Directed Graphs

    No full text
    In order to ensure validity in legal texts like contracts and case law, lawyers rely on standardised formulations that are written carefully but also represent a kind of code with a meaning and function known to all legal experts. Using directed (acyclic) graphs to represent standardised text fragments, we are able to capture variations concerning time specifications, slight rephrasings, names, places and also OCR errors. We show how we can find such text fragments by sentence clustering, pattern detection and clustering patterns. To test the proposed methods, we use two corpora of German contracts and court decisions, specially compiled for this purpose. However, the entire process for representing standardised text fragments is language-agnostic. We analyze and compare both corpora, give a quantitative and qualitative analysis of the text fragments found, and present a number of examples from both corpora

    Whole-organism clone tracing using single-cell sequencing

    No full text
    Embryonic development is a crucial period in the life of a multicellular organism, during which limited sets of embryonic progenitors produce all cells in the adult body. Determining which fate these progenitors acquire in adult tissues requires the simultaneous measurement of clonal history and cell identity at single-cell resolution, which has been a major challenge. Clonal history has traditionally been investigated by microscopically tracking cells during development, monitoring the heritable expression of genetically encoded fluorescent proteins and, more recently, using next-generation sequencing technologies that exploit somatic mutations, microsatellite instability, transposon tagging, viral barcoding, CRISPR-Cas9 genome editing and Cre-loxP recombination. Single-cell transcriptomics provides a powerful platform for unbiased cell-type classification. Here we present ScarTrace, a single-cell sequencing strategy that enables the simultaneous quantification of clonal history and cell type for thousands of cells obtained from different organs of the adult zebrafish. Using ScarTrace, we show that a small set of multipotent embryonic progenitors generate all haematopoietic cells in the kidney marrow, and that many progenitors produce specific cell types in the eyes and brain. In addition, we study when embryonic progenitors commit to the left or right eye. ScarTrace reveals that epidermal and mesenchymal cells in the caudal fin arise from the same progenitors, and that osteoblast-restricted precursors can produce mesenchymal cells during regeneration. Furthermore, we identify resident immune cells in the fin with a distinct clonal origin from other blood cell types. We envision that similar approaches will have major applications in other experimental systems, in which the matching of embryonic clonal origin to adult cell type will ultimately allow reconstruction of how the adult body is built from a single cell