Search CORE

60 research outputs found

Information retrieval and text mining technologies for chemistry

Author: Abacha A. B.
Alberts D.
Alfonso Valencia
American Chemical Society
Anália Lourenço
Aphinyanaphongs Y.
Appelt D. E.
Aramaki E.
Aronson A. R.
Asahara M.
Babych B.
Baeza-Yates R.
Bambenek J.
Barnard J. M.
Bast H.
Batista-Navarro R.
Batista-Navarro R. T.
Bian J.
Bies A.
Bikel D. M.
Blaschke C.
Brecher J. S.
Brill E.
Bunescu R.
Bunescu R. C.
Califf M. E.
Carpenter B.
Caruana R.
Chee B. W.
Chhieng D.
Chinchor N.
Chiticariu L.
Chowdhury M. F. M.
Chowdhury M. F. M.
Ciravegna F.
Cleverdon C. W.
Coden A.
Cohen R.
Collier N.
Corbett P.
Corbett P.
Cover T. M.
Craven M.
Cummings M. D.
Currano J. N.
Currano J. N.
Currano J. N.
Currano J. N.
Cutting D. R.
Davis C. H.
Dieb T. M.
Dieb T. M.
Dogan R. I.
Downs G. M.
Dunikowski L. G.
Embarek M.
Eom J.-H.
Faber J.
Fall C. J.
Fattore M.
Fennell R. W.
Freund Y.
Fujiyoshi A.
Fukuda K.
Gale W. A.
Garcelon N.
Garnier J.-P.
Garten Y.
Ginn R.
Giuliano C.
Gold S.
Grefenstette G.
Grishman R.
Gurulingappa H.
Gurulingappa H.
Gusfield D.
He Y.
Hearst M. A.
Hersh W.
Hersh W.
Hirschman L.
Hobbs J. R.
Hodge G. M.
Holzinger A.
Hsueh P.-Y.
Huber T.
Iyer S. V
Jackson P.
Joachims T.
Johnson D.
Jonnalagadda S.
Jonnalagadda S.
Julen Oyarzabal
Jurafsky D.
Kaewphan S.
Kaewphan S.
Karkaletsis V.
Katragadda S.
Kazama J.
Kazawa H.
Kelly L.
Kenny P. W.
Kim J.-D.
Kim Y.
Kleene S. C.
Kolárik C.
Kongburan W.
Kornai A.
Kraaij W.
Krallinger M.
Krallinger M.
Krallinger M.
Kremer G.
Kreuzthaler M.
Kucera H.
Lai H.
Lawson A. J.
Leaman R.
Leaman R.
Lee C.-H.
Levenshtein V. I.
Levin M. A.
Li J.
Li N.
Li Y.
Liu X.
Locke W. N.
Lovins J. B.
Lowe D. M.
Lupu M.
Lupu M.
Mackenzie C. E.
Manning C. D.
Mansouri A.
Martin E.
Martin Krallinger
Mattmann C.
Maynard D.
McCallum A.
McEwen L.
McKnight L.
McNaught A.
Meystre S. M.
Michalski S. R.
Michie D.
Mihalcea R.
Mitton R.
Miwa M.
Mollá D.
Murray-Rust P.
Müller B.
Nebel A.
Nikfarjam A.
Névéol A.
Névéol A.
Obdulia Rabal
Pang B.
Panico R.
Perez-Iratxeta C.
Ponomareva N.
Ratinov L.
Ratnaparkhi A.
Read J.
Rebholz-Schuhmann D.
Reeker L. H.
Rocchio J. J.
Rohbeck H.-G.
Rosario B.
Roth D. L.
Rupp C. J.
Rupp C. J.
Sagae K.
Salim N.
Salton G.
Sanchez-Cisneros D.
Saracevic T.
Sasaki Y.
Schapire R. E.
Schenck R.
Schenck R. J.
Schlaf A.
Schuemie M. J.
Segura Bedmar I.
Segura-Bedmar I.
Sekine S.
Sequeira E.
Settles B.
Settles B.
Sewell W.
Shen D.
Shidha M. V
Singhal A.
Smith E. G.
Stamatatos E.
Sutton C.
Sætre R.
Taylor K. T.
Tharatipyakul A.
Tomanek K.
Tomanek K.
Tsuruoka Y.
Tsuruoka Y.
Täger W.
Urbain J.
van Rijsbergen C. J.
Vapnik V. N.
Vasserman A.
Visweswaran S.
Voorhees E. M.
Wang W.
Wang Y.
Wei C.-H.
Wei C.-H.
Wermter J.
Wilbur W. J.
Willett P.
Willett P.
Williams A. J.
Witten I. H.
Workman M. L.
Wrublewski D. T.
Xu R.
Xue N.
Yan S.
Yang C.
Yang C. C.
Yang Y.
Zass E.
Zipf G. K.
Zipf G. K.
Zitnik S.
Publication venue: 'American Chemical Society (ACS)'
Publication date: 01/01/2017
Field of study

Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo Garciá -Yoldi for useful feedback and discussions during the preparation of the manuscript.info:eu-repo/semantics/publishedVersio

Universidade do Minho: RepositoriUM

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

A survey of computer uses in music

Author: Russell Maria N
Publication venue: Digital Scholarship@UNLV
Publication date: 01/01/1997
Field of study

This thesis covers research into the mathematical basis inherent in music including review of projects related to optical character recognition (OCR) of musical symbols. Research was done about fractals creating new pieces by assigning pitches to numbers. Existing musical pieces can be taken apart and reassembled creating new ideas for composers. Musical notation understanding is covered and its requirement for the recognition of a music sheet by the computer for editing and reproduction purposes is explained. The first phase of a musical OCR was created in this thesis with the recognition of staff lines on a good quality image. Modifications will need to be made to take care of noise and tilted images that may result from scanning

University of Nevada, Las Vegas Repository

Data Service Outsourcing and Privacy Protection in Mobile Internet

Author: Deng Fuhu
Ding Yi
Qin Zhen
Xiong Hu
Zhao Yang
Zhou Erqiang
Publication venue: 'IntechOpen'
Publication date: 01/01/2018
Field of study

Mobile Internet data have the characteristics of large scale, variety of patterns, and complex association. On the one hand, it needs efficient data processing model to provide support for data services, and on the other hand, it needs certain computing resources to provide data security services. Due to the limited resources of mobile terminals, it is impossible to complete large-scale data computation and storage. However, outsourcing to third parties may cause some risks in user privacy protection. This monography focuses on key technologies of data service outsourcing and privacy protection, including the existing methods of data analysis and processing, the fine-grained data access control through effective user privacy protection mechanism, and the data sharing in the mobile Internet

IntechOpen

Crossref

Directory of Open Access Books (DOAB)

Design of a secure architecture for the exchange of biomedical information in m-Health scenarios

Author: Alesanco Iglesias Álvaro
García Moros José
Rubio Martín Óscar Jesús
Publication venue: Universidad de Zaragoza, Prensas de la Universidad
Publication date: 01/01/2016
Field of study

El paradigma de m-Salud (salud móvil) aboga por la integración masiva de las más avanzadas tecnologías de comunicación, red móvil y sensores en aplicaciones y sistemas de salud, para fomentar el despliegue de un nuevo modelo de atención clínica centrada en el usuario/paciente. Este modelo tiene por objetivos el empoderamiento de los usuarios en la gestión de su propia salud (p.ej. aumentando sus conocimientos, promocionando estilos de vida saludable y previniendo enfermedades), la prestación de una mejor tele-asistencia sanitaria en el hogar para ancianos y pacientes crónicos y una notable disminución del gasto de los Sistemas de Salud gracias a la reducción del número y la duración de las hospitalizaciones. No obstante, estas ventajas, atribuidas a las aplicaciones de m-Salud, suelen venir acompañadas del requisito de un alto grado de disponibilidad de la información biomédica de sus usuarios para garantizar una alta calidad de servicio, p.ej. fusionar varias señales de un usuario para obtener un diagnóstico más preciso. La consecuencia negativa de cumplir esta demanda es el aumento directo de las superficies potencialmente vulnerables a ataques, lo que sitúa a la seguridad (y a la privacidad) del modelo de m-Salud como factor crítico para su éxito. Como requisito no funcional de las aplicaciones de m-Salud, la seguridad ha recibido menos atención que otros requisitos técnicos que eran más urgentes en etapas de desarrollo previas, tales como la robustez, la eficiencia, la interoperabilidad o la usabilidad. Otro factor importante que ha contribuido a retrasar la implementación de políticas de seguridad sólidas es que garantizar un determinado nivel de seguridad implica unos costes que pueden ser muy relevantes en varias dimensiones, en especial en la económica (p.ej. sobrecostes por la inclusión de hardware extra para la autenticación de usuarios), en el rendimiento (p.ej. reducción de la eficiencia y de la interoperabilidad debido a la integración de elementos de seguridad) y en la usabilidad (p.ej. configuración más complicada de dispositivos y aplicaciones de salud debido a las nuevas opciones de seguridad). Por tanto, las soluciones de seguridad que persigan satisfacer a todos los actores del contexto de m-Salud (usuarios, pacientes, personal médico, personal técnico, legisladores, fabricantes de dispositivos y equipos, etc.) deben ser robustas y al mismo tiempo minimizar sus costes asociados. Esta Tesis detalla una propuesta de seguridad, compuesta por cuatro grandes bloques interconectados, para dotar de seguridad a las arquitecturas de m-Salud con unos costes reducidos. El primer bloque define un esquema global que proporciona unos niveles de seguridad e interoperabilidad acordes con las características de las distintas aplicaciones de m-Salud. Este esquema está compuesto por tres capas diferenciadas, diseñadas a la medidas de los dominios de m-Salud y de sus restricciones, incluyendo medidas de seguridad adecuadas para la defensa contra las amenazas asociadas a sus aplicaciones de m-Salud. El segundo bloque establece la extensión de seguridad de aquellos protocolos estándar que permiten la adquisición, el intercambio y/o la administración de información biomédica -- por tanto, usados por muchas aplicaciones de m-Salud -- pero no reúnen los niveles de seguridad detallados en el esquema previo. Estas extensiones se concretan para los estándares biomédicos ISO/IEEE 11073 PHD y SCP-ECG. El tercer bloque propone nuevas formas de fortalecer la seguridad de los tests biomédicos, que constituyen el elemento esencial de muchas aplicaciones de m-Salud de carácter clínico, mediante codificaciones novedosas. Finalmente el cuarto bloque, que se sitúa en paralelo a los anteriores, selecciona herramientas genéricas de seguridad (elementos de autenticación y criptográficos) cuya integración en los otros bloques resulta idónea, y desarrolla nuevas herramientas de seguridad, basadas en señal -- embedding y keytagging --, para reforzar la protección de los test biomédicos.The paradigm of m-Health (mobile health) advocates for the massive integration of advanced mobile communications, network and sensor technologies in healthcare applications and systems to foster the deployment of a new, user/patient-centered healthcare model enabling the empowerment of users in the management of their health (e.g. by increasing their health literacy, promoting healthy lifestyles and the prevention of diseases), a better home-based healthcare delivery for elderly and chronic patients and important savings for healthcare systems due to the reduction of hospitalizations in number and duration. It is a fact that many m-Health applications demand high availability of biomedical information from their users (for further accurate analysis, e.g. by fusion of various signals) to guarantee high quality of service, which on the other hand entails increasing the potential surfaces for attacks. Therefore, it is not surprising that security (and privacy) is commonly included among the most important barriers for the success of m-Health. As a non-functional requirement for m-Health applications, security has received less attention than other technical issues that were more pressing at earlier development stages, such as reliability, eficiency, interoperability or usability. Another fact that has contributed to delaying the enforcement of robust security policies is that guaranteeing a certain security level implies costs that can be very relevant and that span along diferent dimensions. These include budgeting (e.g. the demand of extra hardware for user authentication), performance (e.g. lower eficiency and interoperability due to the addition of security elements) and usability (e.g. cumbersome configuration of devices and applications due to security options). Therefore, security solutions that aim to satisfy all the stakeholders in the m-Health context (users/patients, medical staff, technical staff, systems and devices manufacturers, regulators, etc.) shall be robust and, at the same time, minimize their associated costs. This Thesis details a proposal, composed of four interrelated blocks, to integrate appropriate levels of security in m-Health architectures in a cost-efcient manner. The first block designes a global scheme that provides different security and interoperability levels accordingto how critical are the m-Health applications to be implemented. This consists ofthree layers tailored to the m-Health domains and their constraints, whose security countermeasures defend against the threats of their associated m-Health applications. Next, the second block addresses the security extension of those standard protocols that enable the acquisition, exchange and/or management of biomedical information | thus, used by many m-Health applications | but do not meet the security levels described in the former scheme. These extensions are materialized for the biomedical standards ISO/IEEE 11073 PHD and SCP-ECG. Then, the third block proposes new ways of enhancing the security of biomedical standards, which are the centerpiece of many clinical m-Health applications, by means of novel codings. Finally the fourth block, with is parallel to the others, selects generic security methods (for user authentication and cryptographic protection) whose integration in the other blocks results optimal, and also develops novel signal-based methods (embedding and keytagging) for strengthening the security of biomedical tests. The layer-based extensions of the standards ISO/IEEE 11073 PHD and SCP-ECG can be considered as robust, cost-eficient and respectful with their original features and contents. The former adds no attributes to its data information model, four new frames to the service model |and extends four with new sub-frames|, and only one new sub-state to the communication model. Furthermore, a lightweight architecture consisting of a personal health device mounting a 9 MHz processor and an aggregator mounting a 1 GHz processor is enough to transmit a 3-lead electrocardiogram in real-time implementing the top security layer. The extra requirements associated to this extension are an initial configuration of the health device and the aggregator, tokens for identification/authentication of users if these devices are to be shared and the implementation of certain IHE profiles in the aggregator to enable the integration of measurements in healthcare systems. As regards to the extension of SCP-ECG, it only adds a new section with selected security elements and syntax in order to protect the rest of file contents and provide proper role-based access control. The overhead introduced in the protected SCP-ECG is typically 2{13 % of the regular file size, and the extra delays to protect a newly generated SCP-ECG file and to access it for interpretation are respectively a 2{10 % and a 5 % of the regular delays. As regards to the signal-based security techniques developed, the embedding method is the basis for the proposal of a generic coding for tests composed of biomedical signals, periodic measurements and contextual information. This has been adjusted and evaluated with electrocardiogram and electroencephalogram-based tests, proving the objective clinical quality of the coded tests, the capacity of the coding-access system to operate in real-time (overall delays of 2 s for electrocardiograms and 3.3 s for electroencephalograms) and its high usability. Despite of the embedding of security and metadata to enable m-Health services, the compression ratios obtained by this coding range from ' 3 in real-time transmission to ' 5 in offline operation. Complementarily, keytagging permits associating information to images (and other signals) by means of keys in a secure and non-distorting fashion, which has been availed to implement security measures such as image authentication, integrity control and location of tampered areas, private captioning with role-based access control, traceability and copyright protection. The tests conducted indicate a remarkable robustness-capacity tradeoff that permits implementing all this measures simultaneously, and the compatibility of keytagging with JPEG2000 compression, maintaining this tradeoff while setting the overall keytagging delay in only ' 120 ms for any image size | evidencing the scalability of this technique. As a general conclusion, it has been demonstrated and illustrated with examples that there are various, complementary and structured manners to contribute in the implementation of suitable security levels for m-Health architectures with a moderate cost in budget, performance, interoperability and usability. The m-Health landscape is evolving permanently along all their dimensions, and this Thesis aims to do so with its security. Furthermore, the lessons learned herein may offer further guidance for the elaboration of more comprehensive and updated security schemes, for the extension of other biomedical standards featuring low emphasis on security or privacy, and for the improvement of the state of the art regarding signal-based protection methods and applications

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Universidad de Zaragoza

Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries

Author
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 17/08/2022
Field of study

This two-volume set LNCS 12962 and 12963 constitutes the thoroughly refereed proceedings of the 7th International MICCAI Brainlesion Workshop, BrainLes 2021, as well as the RSNA-ASNR-MICCAI Brain Tumor Segmentation (BraTS) Challenge, the Federated Tumor Segmentation (FeTS) Challenge, the Cross-Modality Domain Adaptation (CrossMoDA) Challenge, and the challenge on Quantification of Uncertainties in Biomedical Image Quantification (QUBIQ). These were held jointly at the 23rd Medical Image Computing for Computer Assisted Intervention Conference, MICCAI 2020, in September 2021. The 91 revised papers presented in these volumes were selected form 151 submissions. Due to COVID-19 pandemic the conference was held virtually. This is an open access book

Directory of Open Access Books (DOAB)

Applications in Electronics Pervading Industry, Environment and Society

Author
Publication venue: 'MDPI AG'
Publication date: 11/01/2022
Field of study

This book features the manuscripts accepted for the Special Issue “Applications in Electronics Pervading Industry, Environment and Society—Sensing Systems and Pervasive Intelligence” of the MDPI journal Sensors. Most of the papers come from a selection of the best papers of the 2019 edition of the “Applications in Electronics Pervading Industry, Environment and Society” (APPLEPIES) Conference, which was held in November 2019. All these papers have been significantly enhanced with novel experimental results. The papers give an overview of the trends in research and development activities concerning the pervasive application of electronics in industry, the environment, and society. The focus of these papers is on cyber physical systems (CPS), with research proposals for new sensor acquisition and ADC (analog to digital converter) methods, high-speed communication systems, cybersecurity, big data management, and data processing including emerging machine learning techniques. Physical implementation aspects are discussed as well as the trade-off found between functional performance and hardware/system costs

Directory of Open Access Books (DOAB)

Exploiting Spatio-Temporal Coherence for Video Object Detection in Robotics

Author: Fernandez-Chaves David
Gonzalez-Jimenez Javier
Matez-Bandera Jose Luis
Monroy Javier
Petkov Nicolai
Ruiz-Sarmiento Jose Raul
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021
Field of study

This paper proposes a method to enhance video object detection for indoor environments in robotics. Concretely, it exploits knowledge about the camera motion between frames to propagate previously detected objects to successive frames. The proposal is rooted in the concepts of planar homography to propose regions of interest where to find objects, and recursive Bayesian filtering to integrate observations over time. The proposal is evaluated on six virtual, indoor environments, accounting for the detection of nine object classes over a total of ∼ 7k frames. Results show that our proposal improves the recall and the F1-score by a factor of 1.41 and 1.27, respectively, as well as it achieves a significant reduction of the object categorization entropy (58.8%) when compared to a two-stage video object detection method used as baseline, at the cost of small time overheads (120 ms) and precision loss (0.92).</p

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Security in Distributed, Grid, Mobile, and Pervasive Computing

Author: Xiao Yang
Publication venue: 'Informa UK Limited'
Publication date
Field of study

This book addresses the increasing demand to guarantee privacy, integrity, and availability of resources in networks and distributed systems. It first reviews security issues and challenges in content distribution networks, describes key agreement protocols based on the Diffie-Hellman key exchange and key management protocols for complex distributed systems like the Internet, and discusses securing design patterns for distributed systems. The next section focuses on security in mobile computing and wireless networks. After a section on grid computing security, the book presents an overview of security solutions for pervasive healthcare systems and surveys wireless sensor network security

OAPEN Library

Learning from Noisy Data in Statistical Machine Translation

Author: Mediani Mohammed
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2017
Field of study

In dieser Arbeit wurden Methoden entwickelt, die in der Lage sind die negativen Effekte von verrauschten Daten in SMT Systemen zu senken und dadurch die Leistung des Systems zu steigern. Hierbei wird das Problem in zwei verschiedenen Schritten des Lernprozesses behandelt: Bei der Vorverarbeitung und während der Modellierung. Bei der Vorverarbeitung werden zwei Methoden zur Verbesserung der statistischen Modelle durch die Erhöhung der Qualität von Trainingsdaten entwickelt. Bei der Modellierung werden verschiedene Möglichkeiten vorgestellt, um Daten nach ihrer Nützlichkeit zu gewichten. Zunächst wird der Effekt des Entfernens von False-Positives vom Parallel Corpus gezeigt. Ein Parallel Corpus besteht aus einem Text in zwei Sprachen, wobei jeder Satz einer Sprache mit dem entsprechenden Satz der anderen Sprache gepaart ist. Hierbei wird vorausgesetzt, dass die Anzahl der Sätzen in beiden Sprachversionen gleich ist. False-Positives in diesem Sinne sind Satzpaare, die im Parallel Corpus gepaart sind aber keine Übersetzung voneinander sind. Um diese zu erkennen wird ein kleiner und fehlerfreier paralleler Corpus (Clean Corpus) vorausgesetzt. Mit Hilfe verschiedenen lexikalischen Eigenschaften werden zuverlässig False-Positives vor der Modellierungsphase gefiltert. Eine wichtige lexikalische Eigenschaft hierbei ist das vom Clean Corpus erzeugte bilinguale Lexikon. In der Extraktion dieses bilingualen Lexikons werden verschiedene Heuristiken implementiert, die zu einer verbesserten Leistung führen. Danach betrachten wir das Problem vom Extrahieren der nützlichsten Teile der Trainingsdaten. Dabei ordnen wir die Daten basierend auf ihren Bezug zur Zieldomaine. Dies geschieht unter der Annahme der Existenz eines guten repräsentativen Tuning Datensatzes. Da solche Tuning Daten typischerweise beschränkte Größe haben, werden Wortähnlichkeiten benutzt um die Abdeckung der Tuning Daten zu erweitern. Die im vorherigen Schritt verwendeten Wortähnlichkeiten sind entscheidend für die Qualität des Verfahrens. Aus diesem Grund werden in der Arbeit verschiedene automatische Methoden zur Ermittlung von solche Wortähnlichkeiten ausgehend von monoligual und biligual Corpora vorgestellt. Interessanterweise ist dies auch bei beschränkten Daten möglich, indem auch monolinguale Daten, die in großen Mengen zur Verfügung stehen, zur Ermittlung der Wortähnlichkeit herangezogen werden. Bei bilingualen Daten, die häufig nur in beschränkter Größe zur Verfügung stehen, können auch weitere Sprachpaare herangezogen werden, die mindestens eine Sprache mit dem vorgegebenen Sprachpaar teilen. Im Modellierungsschritt behandeln wir das Problem mit verrauschten Daten, indem die Trainingsdaten anhand der Güte des Corpus gewichtet werden. Wir benutzen Statistik signifikante Messgrößen, um die weniger verlässlichen Sequenzen zu finden und ihre Gewichtung zu reduzieren. Ähnlich zu den vorherigen Ansätzen, werden Wortähnlichkeiten benutzt um das Problem bei begrenzten Daten zu behandeln. Ein weiteres Problem tritt allerdings auf sobald die absolute Häufigkeiten mit den gewichteten Häufigkeiten ersetzt werden. In dieser Arbeit werden hierfür Techniken zur Glättung der Wahrscheinlichkeiten in dieser Situation entwickelt. Die Größe der Trainingsdaten werden problematisch sobald man mit Corpora von erheblichem Volumen arbeitet. Hierbei treten zwei Hauptschwierigkeiten auf: Die Länge der Trainingszeit und der begrenzte Arbeitsspeicher. Für das Problem der Trainingszeit wird ein Algorithmus entwickelt, der die rechenaufwendigen Berechnungen auf mehrere Prozessoren mit gemeinsamem Speicher ausführt. Für das Speicherproblem werden speziale Datenstrukturen und Algorithmen für externe Speicher benutzt. Dies erlaubt ein effizientes Training von extrem großen Modellne in Hardware mit begrenztem Speicher

KITopen