25 research outputs found
Mining and Analysing One Billion Requests to Linguistic Services
From 2004 to 2016 the Leipzig Linguistic Services (LLS) existed as a SOAP-based cyberinfrastructure of atomic micro-services for the Wortschatz project, which covered different-sized textual corpora in more than 230 languages. The LLS were developed in 2004 and went live in 2005 in order to provide a Web-service-based API to these corpus databases. In 2006, the LLS infrastructure began to systematically log and store requests made to the text collection, and in August 2016 the LLS were shut down. This article summarises the experience of the past ten years of running such a cyberinfrastructure, with a total of nearly one billion requests. It includes an explanation of the technical decisions and limitations, but also provides an overview of how the services were used.
Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm
This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
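The abstract does not specify the attribution pipeline, but the binary classification task it describes can be illustrated with a common, noise-tolerant approach: character n-gram profiles compared by cosine similarity. The sketch below is purely illustrative (the training texts are placeholders, not the Grimm correspondence, and the method is an assumption, not the authors' actual model):

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Frequency profile of character n-grams, a stylistic feature
    that degrades gracefully under OCR/HTR noise."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    shared = set(p) & set(q)
    dot = sum(p[g] * q[g] for g in shared)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def attribute(test_text, profile_a, profile_b):
    """Binary attribution: assign the test letter to the closer author profile."""
    t = char_ngrams(test_text)
    return "A" if cosine(t, profile_a) >= cosine(t, profile_b) else "B"

# Hypothetical stand-ins for each brother's training letters
profile_a = char_ngrams("the quick brown fox jumps over the lazy dog " * 20)
profile_b = char_ngrams("colourless green ideas sleep furiously tonight " * 20)

print(attribute("quick foxes jump over lazy dogs", profile_a, profile_b))  # → A
```

With such profiles, partial noise still leaves enough shared n-grams for a higher-than-chance decision, which is consistent with the article's finding that even heavily degraded HTR output retains an attributable signal.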
Open Philology at the University of Leipzig
The Open Philology Project at the University of Leipzig aspires to re-assert the value of philology in its broadest sense. Philology signifies the widest possible use of the linguistic record to enable a deep understanding of the complete lived experience of humanity. Pragmatically, we focus on Greek and Latin because (1) substantial collections and services are already available within these languages, (2) substantial user communities exist (c. 35,000 unique users a month at the Perseus Digital Library), and (3) a European-based project is better positioned to process extensive cultural heritage materials in these languages than in Chinese or Sanskrit. The Open Philology Project has been designed with the hope that it can contribute to any historical language that survives within the human record. It includes three tasks: (1) the creation of an open, extensible, repurposable collection of machine-readable linguistic sources; (2) the development of dynamic textbooks that use annotated corpora to customize the vocabulary and grammar of texts that learners want to read, and at the same time engage students in collaboratively producing new annotated data; (3) the establishment of new workflows for, and forms of, publication, from individual annotations with argumentation to traditional publications with integrated machine-actionable data.
The impact of surgical delay on resectability of colorectal cancer: An international prospective cohort study
AIM: The SARS-CoV-2 pandemic has provided a unique opportunity to explore the impact of surgical delays on cancer resectability. This study aimed to compare resectability for colorectal cancer patients undergoing delayed versus non-delayed surgery. METHODS: This was an international prospective cohort study of consecutive colorectal cancer patients with a decision for curative surgery (January-April 2020). Surgical delay was defined as an operation taking place more than 4 weeks after treatment decision, in a patient who did not receive neoadjuvant therapy. A subgroup analysis explored the effects of delay in elective patients only. The impact of longer delays was explored in a sensitivity analysis. The primary outcome was complete resection, defined as curative resection with an R0 margin. RESULTS: Overall, 5453 patients from 304 hospitals in 47 countries were included, of whom 6.6% (358/5453) did not receive their planned operation. Of the 4304 operated patients without neoadjuvant therapy, 40.5% (1744/4304) were delayed beyond 4 weeks. Delayed patients were more likely to be older, male and more comorbid, and to have a higher body mass index, rectal cancer and early-stage disease. Delayed patients had higher unadjusted rates of complete resection (93.7% vs. 91.9%, P = 0.032) and lower rates of emergency surgery (4.5% vs. 22.5%, P < 0.001). After adjustment, delay was not associated with a lower rate of complete resection (OR 1.18, 95% CI 0.90-1.55, P = 0.224), which was consistent in elective patients only (OR 0.94, 95% CI 0.69-1.27, P = 0.672). Longer delays were not associated with poorer outcomes. CONCLUSION: One in 15 colorectal cancer patients did not receive their planned operation during the first wave of COVID-19. Surgical delay did not appear to compromise resectability, raising the hypothesis that any reduction in long-term survival attributable to delays is likely to be due to micro-metastatic disease.
The Open Philology Project of the University of Leipzig: towards a "sustainable" philology in a global world
This article presents the Open Philology Project of the Humboldt Chair in Digital Humanities at the University of Leipzig. The project grew out of the activities of the Perseus Project at Tufts University, and its primary goals are the development of a collection of machine-readable Greek and Latin linguistic resources, the creation of dynamic textbooks based on annotated corpora, and the launch of new forms of publication for the classical languages, which may include both individual annotations and traditional editions integrated with machine-actionable data. The Open Philology Project comprises three main components: Open Greek and Latin, the Historical Languages e-Learning Project, and Open Access Publishing.
Scaling historical text re-use
Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, thus adding complication to text re-use detection. Furthermore, it increases the chances of redundancy in a digital library. In Natural Language Processing it is crucial to remove these redundancies before any kind of machine learning technique can be applied to the text. In the Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. Identification can be accomplished by way of automatic or semi-automatic methods. Text re-use algorithms, however, are of quadratic complexity and call for greater computational power. The present paper addresses this issue of complexity, with a particular focus on its algorithmic implications and solutions.
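The quadratic cost mentioned in the abstract comes from comparing every passage against every other. A minimal sketch of this naive all-pairs approach, using word shingles and Jaccard similarity (one standard re-use measure, not necessarily the paper's own algorithm; the threshold and shingle size are illustrative assumptions):

```python
from itertools import combinations

def shingles(text, n=2):
    """Set of word n-grams ("shingles") representing a passage."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets; 1.0 means identical passages."""
    return len(a & b) / len(a | b) if a | b else 0.0

def find_reuse(passages, threshold=0.2, n=2):
    """Naive all-pairs comparison: O(k^2) in the number of passages,
    the quadratic complexity the article addresses."""
    sets = [shingles(p, n) for p in passages]
    return [(i, j) for (i, si), (j, sj) in combinations(enumerate(sets), 2)
            if jaccard(si, sj) >= threshold]

passages = [
    "sing muse of the wrath of achilles",
    "of the wrath of achilles sing goddess",
    "an entirely different sentence here",
]
print(find_reuse(passages))  # → [(0, 1)]
```

At corpus scale this pairwise loop becomes infeasible, which is why practical systems replace it with indexing or fingerprinting schemes that avoid comparing every pair.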
Authorship Attribution in the Noisy, Digitized Correspondence of Jacob and Wilhelm Grimm
This article presents the results of a multidisciplinary project that examines how different digitization strategies can be used in computational text analysis. More specifically, we attempted to automatically distinguish the authorship of Jacob and Wilhelm Grimm in a correspondence corpus processed, without correction, by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), assessing the effect of the resulting noise on identifying the brothers' different writing styles. In summary, OCR appears to be a reliable substitute for manual transcription, at least as far as authorship attribution is concerned. Our results further suggest that even training and test sets drawn from different digitization pipelines can be used for attribution. As for HTR, the research demonstrates that although this automated transcription significantly increases the risk of misclassifying texts compared to OCR, a cleanliness above roughly 20% is already sufficient for a higher-than-chance probability of correct binary attribution.