    Mining and Analysing One Billion Requests to Linguistic Services

    From 2004 to 2016 the Leipzig Linguistic Services (LLS) existed as a SOAP-based cyberinfrastructure of atomic micro-services for the Wortschatz project, which covered textual corpora of varying sizes in more than 230 languages. The LLS were developed in 2004 and went live in 2005 in order to provide a Web-service-based API to these corpus databases. In 2006, the LLS infrastructure began to systematically log and store requests made to the text collection, and in August 2016 the LLS were shut down. This article summarises the experience of the past ten years of running such a cyberinfrastructure, which handled a total of nearly one billion requests. It explains the technical decisions and limitations and also provides an overview of how the services were used.
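
    As a rough sketch of what querying such a SOAP-based micro-service API can look like from client code, the snippet below uses the Python zeep library against a hypothetical WSDL endpoint; the URL, operation name, and corpus identifier are illustrative assumptions, not the actual LLS interface.

```python
# Minimal sketch of a SOAP client for a corpus micro-service.
# NOTE: the WSDL URL, operation name and corpus identifier below are
# hypothetical placeholders, not the real Leipzig Linguistic Services API.
from zeep import Client

WSDL_URL = "http://example.org/corpus-service?wsdl"  # hypothetical endpoint


def word_frequency(word: str, corpus: str = "deu_news") -> int:
    """Ask the (hypothetical) service for the frequency of a word in a corpus."""
    client = Client(WSDL_URL)             # fetches and parses the WSDL
    response = client.service.Frequency(  # hypothetical SOAP operation
        word=word,
        corpusName=corpus,
    )
    return int(response)


if __name__ == "__main__":
    print(word_frequency("Haus"))
```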

    Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm

    This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies on computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses needed to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a cleanliness above approximately 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
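
    To make the attribution setup concrete, here is a minimal sketch of a binary authorship classifier built on character n-gram features, which tend to be robust to OCR/HTR noise; the toy "letters", the train-on-OCR / test-on-HTR split, and the scikit-learn pipeline are illustrative assumptions, not the project's actual data or code.

```python
# Illustrative sketch: binary authorship attribution on noisy transcriptions.
# The tiny example texts below are placeholders, not Grimm correspondence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training letters digitized with one pipeline (e.g. OCR) ...
train_texts = [
    "Lieber Bruder, ich sende dir die versprochenen Blaetter ...",
    "Die Sammlung der Maerchen waechst von Woche zu Woche ...",
]
train_labels = ["Jacob", "Wilhelm"]

# ... and test letters from a different, noisier pipeline (e.g. HTR).
test_texts = ["L1eber Brudcr, ich scnde d1r die versprochcnen Blaetter ..."]

# Character n-grams degrade gracefully under character-level noise.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)
print(model.predict(test_texts))  # predicted author for each test letter
```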

    Open Philology at the University of Leipzig

    The Open Philology Project at the University of Leipzig aspires to re-assert the value of philology in its broadest sense. Philology signifies the widest possible use of the linguistic record to enable a deep understanding of the complete lived experience of humanity. Pragmatically, we focus on Greek and Latin because (1) substantial collections and services are already available for these languages, (2) substantial user communities exist (c. 35,000 unique users a month at the Perseus Digital Library), and (3) a European-based project is better positioned to process extensive cultural heritage materials in these languages than in, for example, Chinese or Sanskrit. The Open Philology Project has been designed with the hope that it can contribute to any historical language that survives within the human record. It includes three tasks: (1) the creation of an open, extensible, repurposable collection of machine-readable linguistic sources; (2) the development of dynamic textbooks that use annotated corpora to customize the vocabulary and grammar of texts that learners want to read, while engaging students in collaboratively producing new annotated data; (3) the establishment of new workflows for, and forms of, publication, from individual annotations with argumentation to traditional publications with integrated machine-actionable data.
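
    As a small sketch of the idea behind the dynamic textbooks, the snippet below uses a lemmatized text from an annotated corpus to measure how much of it a learner can already read and which lemmas to pre-teach; the lemma lists are made-up examples, not project data.

```python
# Illustrative sketch: vocabulary coverage of a target text for one learner,
# computed from lemmatized tokens.  All data below is invented for the example.
from collections import Counter

# Lemmatized tokens of the text the learner wants to read.
target_lemmas = ["arma", "vir", "cano", "Troia", "qui", "primus", "ab", "ora", "vir"]

# Lemmas the learner has already studied.
known_lemmas = {"arma", "vir", "qui", "ab"}

counts = Counter(target_lemmas)
covered = sum(n for lemma, n in counts.items() if lemma in known_lemmas)
coverage = covered / sum(counts.values())
print(f"known vocabulary covers {coverage:.0%} of the text")

# Unknown lemmas, most frequent first: candidates for the customized vocabulary list.
to_teach = [lemma for lemma, _ in counts.most_common() if lemma not in known_lemmas]
print("suggested new vocabulary:", to_teach)
```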

    The impact of surgical delay on resectability of colorectal cancer: An international prospective cohort study

    AIM: The SARS-CoV-2 pandemic has provided a unique opportunity to explore the impact of surgical delays on cancer resectability. This study aimed to compare resectability for colorectal cancer patients undergoing delayed versus non-delayed surgery. METHODS: This was an international prospective cohort study of consecutive colorectal cancer patients with a decision for curative surgery (January-April 2020). Surgical delay was defined as an operation taking place more than 4 weeks after the treatment decision in a patient who did not receive neoadjuvant therapy. A subgroup analysis explored the effects of delay in elective patients only. The impact of longer delays was explored in a sensitivity analysis. The primary outcome was complete resection, defined as curative resection with an R0 margin. RESULTS: Overall, 5453 patients from 304 hospitals in 47 countries were included, of whom 6.6% (358/5453) did not receive their planned operation. Of the 4304 operated patients without neoadjuvant therapy, 40.5% (1744/4304) were delayed beyond 4 weeks. Delayed patients were more likely to be older, male and more comorbid, to have a higher body mass index, and to have rectal cancer and early-stage disease. Delayed patients had higher unadjusted rates of complete resection (93.7% vs. 91.9%, P = 0.032) and lower rates of emergency surgery (4.5% vs. 22.5%, P < 0.001). After adjustment, delay was not associated with a lower rate of complete resection (OR 1.18, 95% CI 0.90-1.55, P = 0.224), a result that was consistent in elective patients only (OR 0.94, 95% CI 0.69-1.27, P = 0.672). Longer delays were not associated with poorer outcomes. CONCLUSION: One in 15 colorectal cancer patients did not receive their planned operation during the first wave of COVID-19. Surgical delay did not appear to compromise resectability, raising the hypothesis that any reduction in long-term survival attributable to delays is likely to be due to micro-metastatic disease.
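
    A sketch of how an adjusted estimate like the one reported here can be produced: a logistic regression of complete resection on delay plus covariates, with the delay coefficient exponentiated into an odds ratio and 95% confidence interval. The file name, column names, and covariate set are assumptions for illustration, not the study's actual model or data.

```python
# Illustrative sketch: adjusted odds ratio for complete resection vs. surgical delay.
# File name, columns and covariates are assumed for the example only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("colorectal_cohort.csv")  # hypothetical dataset

# complete_resection: 1 = R0 resection; delayed: 1 = operation > 4 weeks after decision.
fit = smf.logit(
    "complete_resection ~ delayed + age + sex + bmi + comorbidity + tumour_site + stage",
    data=df,
).fit()

or_delayed = np.exp(fit.params["delayed"])
ci_low, ci_high = np.exp(fit.conf_int().loc["delayed"])
print(f"adjusted OR for delay: {or_delayed:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```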

    L’Open Philology Project dell’Università di Lipsia: per una filologia “sostenibile” in un mondo globale (The Open Philology Project at the University of Leipzig: Towards a “Sustainable” Philology in a Global World)

    This article presents the Open Philology Project of the Humboldt Chair in Digital Humanities at the University of Leipzig. The project grew out of the activities of the Perseus Project at Tufts University, and its primary aims are the development of a collection of machine-readable Greek and Latin linguistic resources, the creation of dynamic textbooks based on annotated corpora, and the launch of new forms of publication for the classical languages, which may include both individual annotations and traditional editions integrated with machine-actionable data. The Open Philology Project comprises three main components: Open Greek and Latin, the Historical Languages e-Learning Project, and Open Access Publishing.

    Scaling historical text re-use

    Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, thus complicating text re-use detection. Furthermore, it increases the chances of redundancy in a digital library. In Natural Language Processing it is crucial to remove these redundancies before any kind of machine learning technique can be applied to the text. In the humanities, these redundancies are central to textual criticism and allow scholars to identify lines of transmission. Identification can be accomplished by way of automatic or semi-automatic methods. Text re-use detection algorithms, however, have quadratic complexity and call for substantial computational power. The present paper addresses this issue of complexity, with a particular focus on its algorithmic implications and solutions.
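
    One common way to soften the quadratic cost described above is to avoid direct all-pairs comparison and instead index character n-gram "shingles", comparing only passages that share at least one shingle; the sketch below shows that generic candidate-filtering idea and is not the specific algorithm developed in the paper.

```python
# Illustrative sketch: cut down O(n^2) pairwise comparison by indexing n-gram
# shingles and pairing only passages that share at least one shingle.
from collections import defaultdict
from itertools import combinations


def shingles(text: str, n: int = 5) -> set:
    """Character n-grams of a whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def candidate_pairs(passages):
    """Pairs of passage indices that share at least one shingle."""
    index = defaultdict(set)  # shingle -> ids of passages containing it
    for i, passage in enumerate(passages):
        for s in shingles(passage):
            index[s].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


passages = [
    "In the beginning was the Word",
    "in the beginning was the word and the word was with god",
    "A completely unrelated sentence about corpora",
]
print(candidate_pairs(passages))  # only the two near-matching passages are paired
```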

    Szerzőazonosítás Jacob és Wilhelm Grimm zajos, digitalizált levelezésében (Authorship Attribution in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm)

    This article presents the results of a multidisciplinary project that explores how usable different digitization strategies are for computational text analysis. More specifically, we attempted to automatically distinguish the authorship of Jacob and Wilhelm Grimm in a correspondence corpus processed, without correction, by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), assessing what effect the resulting noise has on identifying the brothers' different writing styles. In summary, OCR appears to be a reliable substitute for manual transcription, at least as far as authorship attribution is concerned. Our results further indicate that even training and test sets drawn from different digitization procedures can be used for authorship attribution. As for HTR, the research demonstrates that although this automated transcription significantly increases the risk of misclassifying texts compared with OCR, a cleanliness above roughly 20% is already sufficient for correct binary attribution to be more likely than chance.
