2,347 research outputs found

    Math Search for the Masses: Multimodal Search Interfaces and Appearance-Based Retrieval

    Full text link
    We summarize math search engines and search interfaces produced by the Document and Pattern Recognition Lab in recent years, and in particular the min math search interface and the Tangent search engine. Source code for both systems are publicly available. "The Masses" refers to our emphasis on creating systems for mathematical non-experts, who may be looking to define unfamiliar notation, or browse documents based on the visual appearance of formulae rather than their mathematical semantics.Comment: Paper for Invited Talk at 2015 Conference on Intelligent Computer Mathematics (July, Washington DC

    Automatic system of reading numbers

    Get PDF
    This paper presents a brief introduction about text-to-spcech (TTS) systems, its main structure and alternatíves. In a more specific way, it was developed three algorithms of a system that automatic reads numbers. Each algorithm hás its own functions and its own way to approach the problem. The algorithms have been programmed using Matiab sofhvare. The audio signals have been recorded and edited using Praat software. Finally a perceptual evaluation was made on each algorithm and was assigned a rating to each one. Generally, thc MOS gives a very good levei ofclassification for the three algorithms.info:eu-repo/semantics/publishedVersio

    Bag-of-words representations for computer audition

    Get PDF
    Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning, taking the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue is nevertheless that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms, as they require a static, fixed-length input. Moreover, even for dynamic classifiers, compressing the information of the LLDs over a temporal block by summarising them can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired by the bag-of-words method in natural language processing, which forms a histogram of the terms present in a document. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis.
The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits while preserving the advantages of functionals, such as data-independence. Furthermore, it is shown that both representations are complementary and that their fusion improves the performance of a machine listening system.
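The BoAW pipeline the abstract describes can be sketched in a few lines: learn a codebook over frame-level LLDs, quantise every frame to its nearest codeword, and count codeword occurrences per clip to obtain a fixed-length representation. This mirrors the openXBOW approach only in outline; the codebook size, the plain k-means learner, and the random LLDs below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def learn_codebook(frames, k=4, iters=10):
    """Plain k-means over LLD frames; returns a (k, dim) codebook."""
    codebook = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest codeword, then update centroids
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = frames[labels == j].mean(axis=0)
    return codebook

def bag_of_audio_words(frames, codebook):
    """Histogram of nearest-codeword assignments, normalised to sum to 1."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(codebook)).astype(float)
    return hist / hist.sum()

llds = rng.normal(size=(200, 13))          # e.g. 200 frames of 13 MFCCs
codebook = learn_codebook(llds, k=4)
boaw = bag_of_audio_words(llds, codebook)  # fixed-length clip representation
print(boaw.shape)
```

Note how the output length depends only on the codebook size, not on the number of frames, which is exactly what makes BoAW usable with static-input classifiers.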

    Modeling and Recognizing Assembly Actions

    Get PDF
    We develop the task of assembly understanding by applying concepts from computer vision, robotics, and sequence modeling. Motivated by the need to develop tools for recording and analyzing experimental data for a collaborative study of spatial cognition in humans, we gradually extend an application-specific model into a framework that is broadly applicable across data modalities and application instances. The core of our approach is a sequence model that relates assembly actions to their structural consequences. We combine this sequence model with increasingly-general observation models. With each iteration we increase the variety of applications that can be considered by our framework, and decrease the complexity of modeling decisions that designers are required to make. First we present an initial solution for modeling and recognizing assembly activities in our primary application: videos of children performing a block-assembly task. We develop a symbolic model that completely characterizes the fine-grained temporal and geometric structure of assembly sequences, then combine this sequence model with a probabilistic visual observation model that operates by rendering and registering template images of each assembly hypothesis. Then, we extend this perception system by incorporating kinematic sensor-based observations. We use a part-based observation model that compares mid-level attributes derived from sensor streams with their corresponding predictions from assembly hypotheses. We additionally address the joint segmentation and classification of assembly sequences for the first time, resulting in a feature-based segmental CRF framework. Finally, we address the task of learning observation models rather than constructing them by hand. To achieve this we incorporate contemporary, vision-based action recognition models into our segmental CRF framework. 
In this approach, the only information required from a tool designer is a mapping from human-centric activities to our previously defined task-centric activities. These innovations culminate in a method for modeling fine-grained assembly actions that can be applied generally to any kinematic structure, along with a set of techniques for recognizing assembly actions and structures from a variety of modalities and sensors.
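The central idea of a sequence model that "relates assembly actions to their structural consequences" can be sketched as a state machine: the assembly state is a set of part-to-part connections and each action edits that set. The connect/disconnect action vocabulary below is an assumed simplification, not the thesis's fine-grained geometric model.

```python
# Assembly state = set of part-to-part connections; each action edits it.
# Illustrative sketch only; the real model also tracks geometry and timing.

def apply_action(state, action):
    """Return the assembly state after one connect/disconnect action."""
    verb, a, b = action
    edge = frozenset((a, b))
    if verb == "connect":
        return state | {edge}
    if verb == "disconnect":
        return state - {edge}
    raise ValueError(f"unknown action: {verb}")

def replay(actions):
    """Map an action sequence to the sequence of states it induces."""
    states = [frozenset()]
    for action in actions:
        states.append(frozenset(apply_action(states[-1], action)))
    return states

actions = [("connect", "block1", "block2"),
           ("connect", "block2", "block3"),
           ("disconnect", "block1", "block2")]
print(replay(actions)[-1])
```

Recognition then inverts this mapping: given observations of the evolving structure (from video or kinematic sensors), infer the action sequence whose replayed states best explain them.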

    CYCLIC GESTURES AND MULTIMODAL SYMBOLIC ASSEMBLIES: AN ARGUMENT FOR SYMBOLIC COMPLEXITY IN GESTURE

    Get PDF
    In this dissertation, I seek to better understand the nature of the relationship between meanings expressed in gesture and those expressed in speech. This research focuses on the use of cyclic gestures in English. Cyclic gestures are manual co-speech gestures that are characterized by a circular movement of the hand or arm. Despite cyclic gestures being commonplace in many types of spoken discourse, no previous studies have specifically explored the functions these gestures serve in English. Broadly, this dissertation addresses two questions: (1) What functions do cyclic gestures serve in interaction in English, and (2) how are cyclic gestures integrated with other meaningful units in multimodal expressions? Using data collected from television talk shows, I examine the functional-semantic properties of spoken language expressions that accompany cyclic gestures and identify properties of meaning that repeatedly align with the expression of the gestures. I also explore relationships between fine-grained formal properties of cyclic gestural expressions and functional-semantic properties of the co-expressed speech. The study finds a number of significant relationships between gesture forms and spoken language meanings. For example, when cyclic gestures were expressed with spoken constructions serving an evaluative function, they were significantly associated with bimanual asynchronous rotations and finger spreading (p < .001) with a moderately strong effect size (φc = 0.26). Drawing on the patterns identified in the analysis of the data, I analyze cyclic gestures as component symbolic structures that profile schematic processes. I argue that formal properties that accompany cyclic movement gestures (e.g., handshapes and locations of the hands in space) have the potential to be meaningful.
Data from English suggest that cyclic gestures can integrate simultaneously with other symbolic structures in gesture to form complex gestural expressions (i.e., symbolic assemblies). Extending theoretical tools from the framework of Cognitive Grammar (Langacker, 1987, 1991), I explore how the schematic meaning of cyclic gestures is instantiated in specific complex gestural expressions and how those gestural constructions interact with symbolic structures in speech. This work challenges traditional assumptions about the nature of gesture meaning, which treat gestures as simplex, holistic structures. Instead, the findings of this research suggest that gestures are best analyzed as constructions.
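The statistics reported in the abstract (a significant association with effect size φc, i.e. Cramér's V) are computed from a contingency table crossing gesture form with speech function. The sketch below shows that computation; the counts in the table are made up for illustration and do not reproduce the dissertation's data.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a 2-D contingency table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Effect size: phi_c = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    n = sum(sum(r) for r in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (k - 1)))

# Hypothetical counts: evaluative vs. other speech function (rows)
# crossed with two gesture-form categories (columns).
table = [[40, 20],
         [15, 45]]
print(round(cramers_v(table), 3))
```

A value around 0.26, as reported in the abstract, conventionally counts as a moderate association for a 2×2 table, which matches the dissertation's characterization of the effect as moderately strong.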