6 research outputs found

    Toward context-based text-to-3D scene generation

    No full text
    People can describe spatial scenes with language and, vice versa, create images based on linguistic descriptions. However, current systems do not come close to matching human performance when it comes to reconstructing a scene from a given text. Even the steady development of ever better Transformer-based models has not yet achieved this. This task, the automatic generation of a 3D scene from an input text, is called text-to-3D scene generation. The key challenges, and the focus of this dissertation, relate to the following topics: (a) analyses of how well current language models understand spatial information, how static embeddings compare, and whether they can be improved by anaphora resolution; (b) automated resource generation for context expansion and grounding that can help in the creation of realistic scenes; (c) creation of a VR-based text-to-3D scene system that can be used as an annotation and active-learning environment, and that can also be easily extended in a modular way with additional features to cover further contexts in the future; (d) analysis of existing practices and tools for digital and virtual teaching, learning, and collaboration, as well as the conditions and strategies in the context of VR.

    In the first part of this work, we show that static word embeddings do not benefit significantly from pronoun substitution. We explain this result by the loss of contextual information, the reduction in the relative frequency of rare words, and the absence of pronouns to be substituted. However, we were able to show that both static and contextualized language models appear to encode object knowledge, but require a sophisticated apparatus to retrieve it. The models, in combination with the retrieval measures, differ greatly in the amount of knowledge they allow to be extracted. Classifier-based variants perform significantly better than the unsupervised methods from bias research, although this is partly due to overfitting. The resources generated for this evaluation later also form an important component of point three.

    In the second part, we present AffordanceUPT, a modularization of UPT trained on the HICO-DET dataset, which we have extended with Gibsonian/telic annotations. We then show that AffordanceUPT can effectively make the Gibsonian/telic distinction and that the model learns other correlations in the data to make such distinctions (e.g., the presence of hands in the image), which have important implications for grounding images to language.

    The third part first presents a VR project to support spatial annotation according to IsoSpace. The direct spatial visualization and the immediate interaction with the 3D objects are intended to make labeling more intuitive and thus easier. This project is later incorporated into the Semantic Scene Builder (SeSB). It in turn relies on Text2SceneVR, presented here, for generating spatial hypertext, which itself is based on VAnnotatoR. Finally, we introduce the Semantic Scene Builder (SeSB), a VR-based text-to-3D scene framework that uses the Semantic Annotation Framework (SemAF) as a scheme for annotating semantic relations. It integrates a wide range of tools and resources by utilizing SemAF and UIMA as a unified data structure to generate 3D scenes from textual descriptions, and it also supports annotation.
When evaluating SeSB against another state-of-the-art tool, we found that our approach not only performed better but also allowed a wider variety of scenes to be modeled. The final part reviews existing practices and tools for digital and virtual teaching, learning, and collaboration, as well as the conditions and strategies needed to make the most of technological opportunities in the future.
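    To make the pronoun-substitution experiments from the first part concrete, the following minimal Python sketch illustrates the general idea under simplified assumptions: pronouns are replaced by their resolved antecedents before a static embedding model is trained. The coreference pairs, the toy corpus, and the use of gensim's Word2Vec are illustrative stand-ins, not the dissertation's actual pipeline.

        # Minimal sketch (not the dissertation's pipeline): substitute pronouns
        # with their resolved antecedents, then train static word embeddings.
        from gensim.models import Word2Vec

        def substitute_pronouns(tokens, chains):
            """Replace each pronoun token with its resolved antecedent."""
            resolved = list(tokens)
            for pronoun_idx, antecedent in chains:
                resolved[pronoun_idx] = antecedent
            return resolved

        # Toy corpus: one sentence with a resolved chain ("it" -> "table").
        corpus = [
            (["the", "table", "is", "red", "and", "it", "is", "round"],
             [(5, "table")]),
        ]

        sentences = [substitute_pronouns(toks, chains) for toks, chains in corpus]
        model = Word2Vec(sentences, vector_size=50, window=5, min_count=1)

    The hypothesized benefit is that rare nouns gain extra training contexts from their pronominal mentions; the dissertation's finding is that, in practice, this does not significantly help static embeddings.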

    Voting for POS tagging of Latin texts: using the flair of FLAIR to better ensemble classifiers by example of Latin

    No full text
    Despite the great historical importance of the Latin language, relatively few resources are available today for developing modern NLP tools for it. To address this, the EvaLatin Shared Task for Lemmatization and Part-of-Speech (POS) tagging was published at the LT4HALA workshop. In our work, we addressed the second EvaLatin task, POS tagging. Since most of the available Latin word embeddings were trained on either scarce or inaccurate data, we first trained several embeddings on better data. Based on these embeddings, we trained several state-of-the-art taggers and used their outputs as input for an ensemble classifier called LSTMVoter. We achieved the best results for both the cross-genre and the cross-time task (90.64% and 87.00%) without using additional annotated data (closed modality). Since then, we have further improved the system and achieved even better results (96.91% on classical, 90.87% on cross-genre, and 87.35% on cross-time).
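    As an illustration of the ensemble idea, the sketch below combines aligned per-token predictions from several taggers by simple majority voting. This is only a simplified baseline for exposition; LSTMVoter itself is a learned, recurrent combiner, and the tagger outputs shown here are invented.

        # Simplified voting baseline (not LSTMVoter): combine aligned per-token
        # POS predictions from several taggers by majority vote.
        from collections import Counter

        def majority_vote(tagger_outputs):
            """Combine aligned per-token POS predictions by majority vote."""
            combined = []
            for token_tags in zip(*tagger_outputs):
                tag, _count = Counter(token_tags).most_common(1)[0]
                combined.append(tag)
            return combined

        # Three taggers disagree on the second token; the vote resolves it.
        predictions = [
            ["NOUN", "VERB", "ADJ"],
            ["NOUN", "NOUN", "ADJ"],
            ["NOUN", "VERB", "ADJ"],
        ]
        print(majority_vote(predictions))  # ['NOUN', 'VERB', 'ADJ']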

    Data_Sheet_1_Grounding human-object interaction to affordance behavior in multimodal datasets.PDF

    No full text
    While affordance detection and Human-Object Interaction (HOI) detection tasks are related, the theoretical foundation of affordances makes it clear that the two are distinct. In particular, affordance researchers distinguish between J. J. Gibson's traditional definition of an affordance, “the action possibilities” of the object within the environment, and the definition of a telic affordance, one defined by conventionalized purpose or use. We augment the HICO-DET dataset with annotations for Gibsonian and telic affordances, and a subset of the dataset with annotations for the orientation of the humans and objects involved. We then train an adapted Human-Object Interaction (HOI) model and evaluate a pre-trained viewpoint estimation system on this augmented dataset. Our model, AffordanceUPT, is based on a two-stage adaptation of the Unary-Pairwise Transformer (UPT), which we modularize to make affordance detection independent of object detection. Our approach generalizes to new objects and actions, can effectively make the Gibsonian/telic distinction, and shows that this distinction is correlated with features in the data that are not captured by the HOI annotations of the HICO-DET dataset.
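    The following PyTorch sketch illustrates the modularization idea described above: pairwise human-object features feed two separate heads, so affordance prediction does not hinge on the object classifier's output. All module names, dimensions, and the single-linear-layer heads are simplifying assumptions for illustration, not the actual AffordanceUPT code.

        # Illustrative sketch in the spirit of AffordanceUPT's modular design:
        # shared pairwise features, separate interaction and affordance heads.
        import torch
        import torch.nn as nn

        class ModularAffordanceHead(nn.Module):
            def __init__(self, feat_dim=256, n_hoi=600, n_affordance=2):
                super().__init__()
                # Pairwise human-object features -> HOI interaction logits.
                self.interaction_head = nn.Linear(2 * feat_dim, n_hoi)
                # Same pairwise features -> Gibsonian vs. telic logits,
                # decoupled from the object-category branch.
                self.affordance_head = nn.Linear(2 * feat_dim, n_affordance)

            def forward(self, human_feats, object_feats):
                pair = torch.cat([human_feats, object_feats], dim=-1)
                return self.interaction_head(pair), self.affordance_head(pair)

        # Stage 1 (not shown): a pre-trained detector yields per-box features.
        human_feats = torch.randn(4, 256)   # 4 candidate human boxes
        object_feats = torch.randn(4, 256)  # 4 paired object boxes
        hoi_logits, aff_logits = ModularAffordanceHead()(human_feats, object_feats)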

    The Frankfurt Latin Lexicon from morphological expansion and word embeddings to SemioGraphs

    No full text
    Geelhaar T, Mehler A, Jussen B, et al. The Frankfurt Latin Lexicon from morphological expansion and word embeddings to SemioGraphs. Studi e saggi linguistici. 2020;58(1):45-81.

    Machine learning for phase-resolved reconstruction of nonlinear ocean wave surface elevations from sparse remote sensing data

    No full text
    Accurate short-term predictions of phase-resolved water wave conditions are crucial for decision-making in ocean engineering. However, the initialization of remote-sensing-based wave prediction models first requires a reconstruction of wave surfaces from sparse measurements such as radar. Existing reconstruction methods either rely on computationally intensive optimization procedures or on simplistic modelling assumptions that compromise the real-time capability or accuracy of the subsequent prediction process. We therefore address these issues by proposing a novel approach for phase-resolved wave surface reconstruction using neural networks based on the U-Net and Fourier neural operator (FNO) architectures. Our approach utilizes synthetic yet highly realistic training data on uniform one-dimensional grids, generated by the high-order spectral method for wave simulation and a geometric radar modelling approach. The investigation reveals that both models deliver accurate wave reconstruction results and generalize well across different sea states when trained with spatio-temporal radar data containing multiple historic radar snapshots in each input. Notably, the FNO demonstrates superior performance in handling the data structure imposed by wave physics due to its global approach to learning the mapping between input and output in Fourier space.
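    The core building block that gives the FNO its global receptive field is the spectral convolution: transform to Fourier space, apply a learned linear map to a truncated set of low-frequency modes, and transform back. The one-dimensional sketch below shows this mechanism; the channel counts, mode count, and grid size are illustrative assumptions rather than the paper's configuration.

        # Minimal 1D spectral convolution in the style of the Fourier neural
        # operator (FNO); hyperparameters are illustrative assumptions.
        import torch
        import torch.nn as nn

        class SpectralConv1d(nn.Module):
            def __init__(self, in_ch, out_ch, modes):
                super().__init__()
                self.modes = modes  # number of retained low Fourier modes
                scale = 1.0 / (in_ch * out_ch)
                self.weight = nn.Parameter(
                    scale * torch.randn(in_ch, out_ch, modes, dtype=torch.cfloat)
                )

            def forward(self, x):               # x: (batch, in_ch, grid)
                x_ft = torch.fft.rfft(x)        # to Fourier space
                out_ft = torch.zeros(
                    x.size(0), self.weight.size(1), x_ft.size(-1),
                    dtype=torch.cfloat,
                )
                # Learned linear map on the lowest `modes` frequencies only.
                out_ft[:, :, : self.modes] = torch.einsum(
                    "bim,iom->bom", x_ft[:, :, : self.modes], self.weight
                )
                return torch.fft.irfft(out_ft, n=x.size(-1))  # back to space

        layer = SpectralConv1d(in_ch=1, out_ch=16, modes=12)
        radar_snapshot = torch.randn(8, 1, 256)  # (batch, channels, grid points)
        features = layer(radar_snapshot)         # (8, 16, 256)

    Because every retained Fourier mode mixes information from the whole grid, a single such layer couples all spatial locations at once, which is what the abstract refers to as the FNO's global approach.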

    A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin

    No full text
    The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are then compared with older methods and implementations for coarse-grained POS tagging, as well as fine-grained (morphological) POS tagging (e.g. case, number, mood). We examine to what degree recent advances in tagger development have improved accuracy, and at what cost in terms of training and processing time. We also conduct in-domain vs. out-of-domain evaluation; the latter is particularly pertinent because the distribution of data to be tagged will typically differ from the distribution of data used to train the tagger. Pipeline tagging is then compared with a tagging approach that acknowledges dependencies between inflectional categories. Finally, we evaluate three lemmatization techniques.
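    As a minimal illustration of the in-domain vs. out-of-domain comparison described above, the sketch below computes token-level tagging accuracy on two corpora; the tagger interface and the tiny Latin examples are hypothetical stand-ins for the surveyed tools and datasets.

        # Toy sketch: token-level accuracy of a tagger, measured in-domain
        # vs. out-of-domain. The `tag(tokens) -> tags` interface is a
        # hypothetical stand-in for any of the surveyed tools.
        def accuracy(tagger, corpus):
            """corpus: list of (tokens, gold_tags) sentence pairs."""
            correct = total = 0
            for tokens, gold in corpus:
                pred = tagger.tag(tokens)
                correct += sum(p == g for p, g in zip(pred, gold))
                total += len(gold)
            return correct / total

        class MajorityBaseline:
            """Assigns every token the most frequent tag (here: 'NOUN')."""
            def tag(self, tokens):
                return ["NOUN"] * len(tokens)

        in_domain = [(["arma", "virumque", "cano"], ["NOUN", "NOUN", "VERB"])]
        out_of_domain = [(["deus", "caritas", "est"], ["NOUN", "NOUN", "VERB"])]

        baseline = MajorityBaseline()
        print(accuracy(baseline, in_domain), accuracy(baseline, out_of_domain))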