8 research outputs found

    Improving a Recognition and Normalization Engine for Natural-Language Temporal Expressions

    Digitally stored data are used more and more; in particular, computer-based communication via e-mail, SMS, messengers and the like has almost completely displaced classical means of communication. Generating value from these data is crucial in both business and private contexts. One way to support users is to analyze their textual data comprehensively, highlight certain elements, and create, or at least prepare, entries for calendars, address books and the like on their behalf. Another possibility is semantic search over the user's data. Even with full-text search, one so far has to know the exact wording when looking for a specific piece of information. With a deep understanding of time, however, it becomes possible to find, via a timeline, all data linked to a specific point in time or time span. Many approaches exist for performing Named Entity Recognition fully or semi-automatically, but methods that work largely language-independently, and thus scale easily to many languages, are rarely published. Based on extensive analyses, this thesis presents ways to improve such a method for natural-language temporal expressions. In particular, a strategy based on machine learning is developed that reduces the manual effort required to support new languages. These and further strategies were implemented and integrated into the existing architecture of the temporal-expression engine of the ExB Group.
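    The language-scaling idea described above can be illustrated with a toy sketch: recognition and normalization driven by per-language pattern tables, so that supporting a new language means only adding table entries. All names and patterns below are hypothetical illustrations, not part of the ExB engine.

```python
import re
from datetime import date, timedelta

# Per-language pattern tables; supporting a new language = adding entries here.
# Patterns and names are illustrative only, not the ExB engine's rules.
PATTERNS = {
    "en": [
        (re.compile(r"\btoday\b", re.I), lambda ref: ref),
        (re.compile(r"\btomorrow\b", re.I), lambda ref: ref + timedelta(days=1)),
        (re.compile(r"\byesterday\b", re.I), lambda ref: ref - timedelta(days=1)),
    ],
    "de": [
        (re.compile(r"\bheute\b", re.I), lambda ref: ref),
        (re.compile(r"\bmorgen\b", re.I), lambda ref: ref + timedelta(days=1)),
        (re.compile(r"\bgestern\b", re.I), lambda ref: ref - timedelta(days=1)),
    ],
}

def normalize_timex(text, lang, ref):
    """Return (surface form, ISO date) pairs for recognized time expressions,
    normalized relative to the reference date `ref`."""
    results = []
    for rx, resolve in PATTERNS.get(lang, []):
        for m in rx.finditer(text):
            results.append((m.group(0), resolve(ref).isoformat()))
    return results
```

    A normalized date makes timeline search possible: `normalize_timex("Let's meet tomorrow", "en", date(2015, 3, 12))` yields `[("tomorrow", "2015-03-13")]`, which can be indexed and retrieved by time span rather than by wording.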

    The Digital Classicist 2013

    This edited volume collects peer-reviewed papers that originated as presentations at Digital Classicist seminars and conference panels. This wide-ranging volume showcases exemplary applications of digital scholarship to the ancient world and critically examines the many challenges and opportunities afforded by such research. The chapters included here demonstrate innovative approaches that drive forward the research interests of both humanists and technologists while showing that rigorous scholarship is as central to digital research as it is to mainstream classical studies. As with the earlier Digital Classicist publications, our aim is not to give a broad overview of the field of digital classics; rather, we present a snapshot of some of the varied research of our members in order to engage with and contribute to the development of scholarship both in classical antiquity and in the Digital Humanities more broadly.

    Contours in Visualization

    This thesis studies the visualization of set collections via the relations among contours.

    In the first part, dynamic Euler diagrams are used to communicate and semi-manually improve the results of clustering methods that allow clusters to overlap arbitrarily. The contours of the Euler diagram are rendered as implicit surfaces, called blobs in computer graphics. The interaction metaphor is moving items into or out of these blobs. The utility of the method is demonstrated on data arising from the analysis of gene expressions. The method works well for small datasets of up to one hundred items and few clusters.

    In the second part, these limitations are mitigated by a GPU-based rendering of Euler diagrams that mixes textures and colors to resolve overlapping regions better. The GPU-based approach subdivides the screen into triangles on which it performs contour interpolation, i.e. a fragment shader determines for each pixel which zones of an Euler diagram it belongs to. The rendering speed is thus increased to allow several hundred items. The method is applied to an example comparing different document clustering results.

    The contour tree compactly describes scalar field topology. From the viewpoint of graph drawing, it is a tree with attributes at the vertices and optionally on the edges. Standard tree drawing algorithms emphasize structural properties of the tree and neglect the attributes. Adapting popular graph drawing approaches to the problem of contour tree drawing, it is found that they are unable to convey this information. Five aesthetic criteria for drawing contour trees are proposed, and a novel algorithm for drawing contour trees in the plane that satisfies four of these criteria is presented. The implementation is fast and effective for contour tree sizes usually used in interactive systems and also produces readable pictures for larger trees.

    Dynamical models that explain the formation of spatial structures of RNA molecules have reached a complexity that requires novel visualization methods to analyze these models' validity. The fourth part of the thesis focuses on the visualization of so-called folding landscapes of a growing RNA molecule. Folding landscapes describe the energy of a molecule as a function of its spatial configuration; they are huge and high-dimensional. Their most salient features are captured by their so-called barrier tree, a contour tree for discrete observation spaces. The changing folding landscapes of a growing RNA chain are visualized as an animation of the corresponding barrier tree sequence. The animation is created as an adaptation of the foresight layout with tolerance algorithm for dynamic graph layout. The adaptation requires changes to the concept of the supergraph and its layout. The thesis finishes with some thoughts on how these approaches can be combined and how the task an application should support can help inform the choice of visualization modality.
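    The barrier tree mentioned above can be illustrated on a toy one-dimensional landscape: sweeping the values from low to high with a union-find structure, a saddle is recorded whenever two basins meet. This is a minimal sketch of the general idea under the assumption of distinct values, not the algorithm used in the thesis.

```python
def barrier_tree_1d(energy):
    """Sweep a 1D discrete landscape from low to high energy; record a
    barrier (saddle) whenever two basins (components around local minima)
    meet. Returns (saddle_height, minimum_a, minimum_b) triples."""
    n = len(energy)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    active = [False] * n
    minimum = {}   # component root -> index of its lowest point
    barriers = []
    for i in sorted(range(n), key=lambda i: energy[i]):
        active[i] = True
        roots = list({find(j) for j in (i - 1, i + 1) if 0 <= j < n and active[j]})
        if not roots:
            minimum[i] = i  # i opens a new basin (it is a local minimum)
            continue
        if len(roots) == 2:  # two basins meet at i: i is a saddle
            barriers.append((energy[i], minimum[roots[0]], minimum[roots[1]]))
        lowest = min((minimum[r] for r in roots), key=lambda k: energy[k])
        for r in roots:
            parent[r] = i    # i becomes the new component root
        minimum[i] = lowest
    return barriers
```

    On `[1, 3, 0, 4, 2, 5]` the sweep finds the minima at indices 0, 2 and 4 and the two saddles (height 3 and 4) that join them, which is exactly the edge set of the barrier tree.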

    The Corpus of English Language Videos: a new online corpus tool for data-driven learning

    The main topic of this research is the use of corpora in the teaching and learning of the English language; its main objective was the conception and development of the Corpus of English Language Videos (CELV), along with an on-line tool for queries in this corpus, which includes, among other functions, the generation of concordance lines. The main theoretical basis for the development and application of the corpus and its query tool was Data-Driven Learning, an approach to language teaching and learning which casts the student as a language investigator who conducts empirical observation of the linguistic data contained in a corpus, aided by the teacher and the concordancer and following an inductive cognitive process of learning. The main methodological bases for the implementation of the project were Corpus Linguistics, the field of Linguistics dedicated to the compilation and analysis of language samples in electronic format using computers, and Computational Linguistics, the field of Linguistics aimed at the creation and use of computer tools for the processing of languages. A corpus was compiled from YouTube video subtitles, and a computer system for searching this corpus was developed and published on the internet, to be used by researchers, teachers and students interested in the teaching and learning of English. The corpus is large and linguistically varied enough to serve as a source of examples of the English language in use, and its query tool enables not only the observation of concordance lines in written form but also access to the original videos that compose the sample, allowing the user to watch and listen to the searched words and expressions in audiovisual format. The main source of inspiration for the compilation of the corpus and the development of the tool was the Corpus of Contemporary American English (COCA).
    The CELV website has a simple, easy-to-use search interface, which aims to help spread Corpus Linguistics techniques so that more researchers, teachers and students may benefit from the advantages of using corpora in their research, teaching and learning activities. After the development of the corpus and the tool was concluded, both were tested with English teachers, whose opinions on the usefulness of CELV in their teaching activities were collected; the feedback was generally positive. The teachers also suggested topics of the English language whose learning can be enriched by this tool. It is hoped that the product of this research contributes to the teaching and learning of English and to the literature on the applications of corpora in this context.
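    The concordance-line generation mentioned above can be sketched as a classic keyword-in-context (KWIC) routine: each hit is shown with a fixed window of left and right context, aligned around the keyword. The function below is an illustrative toy, not the actual CELV implementation.

```python
import re

def concordance(corpus, query, width=30):
    """Return keyword-in-context lines for `query`:
    right-aligned left context, [keyword], left-aligned right context."""
    lines = []
    pattern = re.compile(r"\b%s\b" % re.escape(query), re.IGNORECASE)
    for m in pattern.finditer(corpus):
        left = corpus[max(0, m.start() - width):m.start()]
        right = corpus[m.end():m.end() + width]
        lines.append("%s [%s] %s" % (left.rjust(width), m.group(0), right.ljust(width)))
    return lines
```

    In a video-corpus setting such as CELV, each line would additionally carry a pointer (e.g. video id and timestamp) so the user can jump from the written concordance line to the audiovisual source.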

    Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources

    This thesis develops a Relation Extraction algorithm to extract knowledge from automotive data. While most approaches to Relation Extraction are evaluated only on newspaper data dealing with general relations from the business world, their applicability to other data sets is not well studied.

    Part I of this thesis deals with the theoretical foundations of Information Extraction algorithms. Text mining cannot be seen as the simple application of data mining methods to textual data. Instead, sophisticated methods have to be employed to accurately extract knowledge from text, which can then be mined using statistical methods from the field of data mining. Information Extraction itself can be divided into two subtasks: Entity Detection and Relation Extraction. The detection of entities is very domain-dependent due to terminology, abbreviations and general language use within the given domain. Thus, this task has to be solved for each domain using thesauri or another type of lexicon. Supervised approaches to Named Entity Recognition will not achieve reasonable results unless they have been trained for the given type of data. The task of Relation Extraction can be approached with pattern-based and kernel-based algorithms. The latter achieve state-of-the-art results on newspaper data and point out the importance of linguistic features. In order to analyze relations contained in textual data, syntactic features like part-of-speech tags and syntactic parses are essential. Chapter 4 presents machine learning approaches and linguistic foundations essential for the syntactic annotation of textual data and for Relation Extraction. Chapter 6 analyzes the performance of state-of-the-art algorithms for POS tagging, syntactic parsing and Relation Extraction on automotive data. The finding is that supervised methods trained on newspaper corpora do not achieve accurate results when applied to automotive data, for several reasons. Besides low-quality text, the nature of automotive relations poses the main challenge: automotive relation types of interest (e.g. component – symptom) are rather arbitrary compared to well-studied relation types like is-a or is-head-of. In order to achieve acceptable results, algorithms have to be trained directly on this kind of data. As the manual annotation of data for each language and data type is too costly and inflexible, unsupervised methods are the ones to rely on.

    Part II deals with the development of dedicated algorithms for all three essential tasks. Unsupervised POS tagging (Chapter 7) is a well-studied task, and algorithms achieving accurate tagging exist. However, none of them disambiguates high-frequency words; only out-of-lexicon words are disambiguated. Most high-frequency words bear syntactic information, so it is very important to differentiate between their different functions. Domain languages in particular contain ambiguous, highly frequent words bearing semantic information (e.g. pump). In order to improve POS tagging, an algorithm for disambiguation is developed and used to enhance an existing state-of-the-art tagger. This approach is based on context clustering, which is used to detect a word type's different syntactic functions. Evaluation shows that tagging accuracy is raised significantly. An approach to unsupervised syntactic parsing (Chapter 8) is developed to meet the requirements of Relation Extraction. These requirements include high-precision results on nominal and prepositional phrases, as they contain the entities relevant for Relation Extraction. Furthermore, accurate shallow parsing is more desirable than deep binary parsing, as it facilitates Relation Extraction more than deep parsing does. Endocentric and exocentric constructions can be distinguished, which improves proper phrase labeling. unsuParse is based on preferred positions of word types within phrases to detect phrase candidates. Iterating the detection of simple phrases successively induces deeper structures. The proposed algorithm fulfills all demanded criteria and achieves competitive results on standard evaluation setups. Syntactic Relation Extraction (Chapter 9) exploits syntactic statistics and text characteristics to extract relations between previously annotated entities. The approach is based on entity distributions in a corpus and thus provides a way to extend text mining processes to new data in an unsupervised manner. Evaluation on two languages and two text types from the automotive domain shows that it achieves accurate results on repair-order data. Results are less accurate on internet data, but the tasks of sentiment analysis and extraction of the opinion target can be mastered. Thus, the incorporation of internet data is possible and important, as it provides useful insight into the customer's thoughts.

    To conclude, this thesis presents a complete unsupervised workflow for Relation Extraction – except for the highly domain-dependent Entity Detection task – improving the performance of each of the involved subtasks compared to state-of-the-art approaches. Furthermore, this work applies Natural Language Processing methods and Relation Extraction approaches to real-world data, unveiling challenges that do not occur in high-quality newspaper corpora.
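    The context-clustering idea for disambiguating a high-frequency word such as pump can be sketched as follows: each occurrence is represented by its immediate neighbors, and occurrences that share a context feature are grouped into one cluster, approximating one syntactic function per cluster. This is a deliberately crude stand-in for the clustering used in the thesis.

```python
from collections import defaultdict

def context_clusters(sentences, target):
    """Group occurrences of `target` (as (sentence, token) positions) so that
    occurrences sharing an immediate-neighbor feature land in one cluster."""
    occurrences = []  # ((sentence_idx, token_idx), context feature set)
    for s, tokens in enumerate(sentences):
        for t, word in enumerate(tokens):
            if word == target:
                ctx = set()
                if t > 0:
                    ctx.add(("L", tokens[t - 1]))   # left neighbor
                if t + 1 < len(tokens):
                    ctx.add(("R", tokens[t + 1]))   # right neighbor
                occurrences.append(((s, t), ctx))

    # union-find over occurrences that share a context feature
    parent = list(range(len(occurrences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    by_feature = defaultdict(list)
    for i, (_, ctx) in enumerate(occurrences):
        for feature in ctx:
            by_feature[feature].append(i)
    for indices in by_feature.values():
        for i in indices[1:]:
            parent[find(i)] = find(indices[0])

    clusters = defaultdict(list)
    for i, (position, _) in enumerate(occurrences):
        clusters[find(i)].append(position)
    return list(clusters.values())
```

    With determiner contexts ("the pump leaks") in one group and verbal contexts ("we pump water") in another, the resulting clusters can be fed to a tagger as distinct pseudo-word-types, which is the spirit of the enhancement described above.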

    Statistical and Computational Models for Whole Word Morphology

    The goal of this thesis is to formulate a machine learning approach to language morphology in which the latter is modeled as string transformations on whole words rather than as the segmentation of words into smaller structural units. The contribution consists of two main parts. First, a computational model is formulated in which morphological rules are defined as functions on strings. Such functions can easily be translated into finite-state transducers, which provides a solid algorithmic foundation for the approach. Second, a statistical model for graphs of word derivations is introduced. Inference in this model is carried out with the Monte Carlo Expectation Maximization algorithm, and expectations over graphs are approximated with a Metropolis-Hastings sampler. The model is evaluated on a range of practical tasks: clustering inflected forms, learning lemmatization, predicting the part of speech of unknown words, and generating new words.
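    The idea of morphological rules as partial functions on whole-word strings can be sketched in a few lines; the rules below are simplified illustrations, not the model of the thesis (where such functions are compiled into finite-state transducers).

```python
def rule(strip, add):
    """Build a whole-word morphological rule as a partial string function:
    applicable iff the word ends in `strip`, which is then replaced by `add`.
    Returns None when the rule does not apply."""
    def apply(word):
        if word.endswith(strip):
            return word[:len(word) - len(strip)] + add
        return None
    return apply

# Illustrative, highly simplified rules:
de_plural = rule("ung", "ungen")  # German: Ableitung -> Ableitungen
en_plural = rule("y", "ies")      # English: city -> cities
```

    Because each rule is a suffix substitution, it corresponds directly to a small finite-state transducer (copy the stem, rewrite the suffix), which is what makes the whole-word view algorithmically tractable.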

    Elements of knowledge-free and unsupervised lexical acquisition

    No full text
    P326, Lexicology -- Data processing