200 research outputs found
Neural network classification of handwritten document images based on probabilistic indexing
The classification of manuscript documents based on content is an important task
that is usually performed in archives and libraries by experts with a great deal of knowledge about the content of the documents. Unfortunately, many manuscript collections are so vast that it is not feasible to rely solely on experts to perform this task. Current approaches to manuscript classification based on textual content generally require manuscript text images to be transcribed and converted into electronic text. For large collections of historical manuscripts, however, manual transcription is generally infeasible. And because of inherent textual inaccuracies, uncertainties due to archaic lexicon, and the state of preservation of the documents, automatic transcription fails to obtain sufficiently accurate results.
This project proposes a new approach to this classification task that does not require explicit transcriptions of the images. It is based on 'probabilistic indexing', a relatively novel technology that efficiently represents the intrinsic uncertainties that are generally present in historical manuscript texts. The proposal is to work on 17th-century files from the Provincial Historical Archive of Cadiz. Each file contains hundreds of handwritten notarial records of various types (Sale, Lease, Power of Attorney, Will, etc.). The objective is to classify each record into its corresponding type. A system that satisfactorily solves this task would be widely applicable in hundreds, or thousands, of archives and libraries that hold millions of documents that have not been properly catalogued because the task is too large for purely manual processing.
On the other hand, the methodology to be developed in this project may open doors to many other text analytics tasks on large volumes of untranscribed manuscript text images. Therefore, these developments are also of great scientific and technical interest and may lead to relevant academic publications.
In conclusion, all the code developed during the project will be deposited in a public
repository, so that future work can continue from what has been done in this project.
Flores Arellano, JJ. (2021). Clasificación de imágenes de documentos manuscritos a partir de índices probabilisticos mediante redes neuronales. Universitat Politècnica de València. http://hdl.handle.net/10251/172234 TFG
Jewish Studies in the Digital Age
The digitisation boom of the last two decades, and the rapid advancement of digital tools to analyse data in myriad ways, have opened up new avenues for humanities research. This volume discusses how the so-called digital turn has affected the field of Jewish Studies, explores the current state of the art and probes how digital developments can be harnessed to address the specific questions, challenges and problems in the field.
Archives, Access and Artificial Intelligence
Digital archives are transforming the Humanities and the Sciences. Digitized collections of newspapers and books have pushed scholars to develop new, data-rich methods. Born-digital archives are now better preserved and managed thanks to the development of open-access and commercial software. Digital Humanities have moved from the fringe to the center of academia. Yet, the path from the appraisal of records to their analysis is far from smooth. This book explores crossovers between various disciplines to improve the discoverability, accessibility, and use of born-digital archives and other cultural assets.
Archives, Access and Artificial Intelligence: Working with Born-Digital and Digitized Archival Collections
Digital archives are transforming the Humanities and the Sciences. Digitized collections of newspapers and books have pushed scholars to develop new, data-rich methods. Born-digital archives are now better preserved and managed thanks to the development of open-access and commercial software. Digital Humanities have moved from the fringe to the center of academia. Yet, the path from the appraisal of records to their analysis is far from smooth. This book explores crossovers between various disciplines to improve the discoverability, accessibility, and use of born-digital archives and other cultural assets.
Big Data for Qualitative Research
Big Data for Qualitative Research covers everything small data researchers need to know about big data, from the potentials of big data analytics to its methodological and ethical challenges. The data that we generate in everyday life is now digitally mediated, stored, and analyzed by web sites, companies, institutions, and governments. Big data is large volume, rapidly generated, digitally encoded information that is often related to other networked data, and can provide valuable evidence for study of phenomena. This book explores the potentials of qualitative methods and analysis for big data, including text mining, sentiment analysis, information and data visualization, netnography, follow-the-thing methods, mobile research methods, multimodal analysis, and rhythmanalysis. It debates new concerns about ethics, privacy, and dataveillance for big data qualitative researchers. This book is essential reading for those who do qualitative and mixed methods research, and are curious, excited, or even skeptical about big data and what it means for future research. Now is the time for researchers to understand, debate, and envisage the new possibilities and challenges of the rapidly developing and dynamic field of big data from the vantage point of the qualitative researcher.
Solving the Orphan Works Problem for the United States
Over the last decade, the problem of orphan works—i.e., copyrighted works whose owners cannot be located by a reasonably diligent search—has come sharply into focus as libraries, archives and other large repositories of copyrighted works have sought to digitize and make available their collections online. Combined with new technology that has changed the way that copyrighted works are created and the way that consumers expect to access and use copyrighted works, the orphan works problem has grown into a significant and, as former Register of Copyrights Marybeth Peters observed, a “pervasive” problem. Although this problem is certainly not limited to digital libraries, it has proven especially challenging for these organizations because they hold diverse collections that include millions of books, articles, letters, photographs, home movies, films and other types of works. Many items come with a complex, unknown and often unknowable history of copyright ownership. Because U.S. copyright law provides for both strong injunctive relief and monetary damages (in the form of statutory damages of up to $150,000 per work infringed), organizations that cannot obtain permission often do not make their collections available at all. Large projects, such as Google Book Search and the HathiTrust digital library, which aim in part to address orphan works on a larger scale, have been drawn into litigation