
    VisualGREP: a systematic method to compare and retrieve video sequences

    In this paper, we consider the problem of similarity between video sequences. Three basic questions are raised and (partially) answered. Firstly, at what temporal duration can video sequences be compared? The frame, shot, scene and video levels are identified. Secondly, given some image or video feature, what are the requirements on its distance measure, and how can it be "easily" transformed into the visual similarity desired by the inquirer? Thirdly, how can video sequences be compared at different levels? A general approach based on either a set or a sequence representation with variable degrees of aggregation is proposed and applied recursively over the different levels of temporal resolution. It allows the inquirer to fully control the importance of temporal ordering and duration. The general approach is illustrated by introducing and discussing some of the many possible image and video features. Promising experimental results are presented.

    QuickCSG: Fast Arbitrary Boolean Combinations of N Solids

    QuickCSG computes the result for general N-polyhedron boolean expressions without an intermediate tree of solids. We propose a vertex-centric view of the problem, which simplifies the identification of final geometric contributions and facilitates its spatial decomposition. The problem is then cast as a single KD-tree exploration, geared toward the result by early pruning of any region of space not contributing to the final surface. We assume strong regularity properties on the input meshes and that they are in general position. This simplifying assumption, in combination with our vertex-centric approach, improves the speed of the approach. Complemented with a task-stealing parallelization, the algorithm achieves breakthrough performance, with speedups of one to two orders of magnitude over state-of-the-art CPU algorithms on boolean operations over two to dozens of polyhedra. The algorithm also outperforms GPU implementations with approximate discretizations, while producing an output without redundant facets. Despite the restrictive assumptions on the input, we show the usefulness of QuickCSG for applications with large CSG problems and strong temporal constraints, e.g. modeling for 3D printers, reconstruction from visual hulls and collision detection.

    Integration of document representation, processing and management

    This paper describes a method of document representation and proposes an approach towards an integrated document processing and management system. The approach is intended to capture essentially freely structured documents, like those typically used in the office domain. The document analysis system ANASTASIL is capable of revealing the structure of complex paper documents, as well as the logical objects within them, such as receiver, footnote and date. Moreover, it facilitates the handling of the contained information. Analyzed documents are stored by the management system KRISYS, which is connected to several different subsequent services. The described integrated system can be considered an ideal extension of the human clerk, making his tasks in information processing easier. The symbolic representation of the analysis results allows an easy transformation into a given international standard, e.g., ODA/ODIF or SGML, and interchange via a global network.

    Building and Refining Abstract Planning Cases by Change of Representation Language

    Abstraction is one of the most promising approaches to improve the performance of problem solvers. In several domains abstraction by dropping sentences of a domain description -- as used in most hierarchical planners -- has proven useful. In this paper we present examples which illustrate significant drawbacks of abstraction by dropping sentences. To overcome these drawbacks, we propose a more general view of abstraction involving the change of representation language. We have developed a new abstraction methodology and a related sound and complete learning algorithm that allows the complete change of representation language of planning cases from concrete to abstract. However, to achieve a powerful change of the representation language, the abstract language itself as well as rules which describe admissible ways of abstracting states must be provided in the domain model. This new abstraction approach is the core of Paris (Plan Abstraction and Refinement in an Integrated System), a system in which abstract planning cases are automatically learned from given concrete cases. An empirical study in the domain of process planning in mechanical engineering shows significant advantages of the proposed reasoning from abstract cases over classical hierarchical planning. Comment: See http://www.jair.org/ for an online appendix and other files accompanying this article.

    Linguistically-motivated sub-word modeling with applications to speech recognition

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Includes bibliographical references (p. 173-185). Despite the proliferation of speech-enabled applications and devices, speech-driven human-machine interaction still faces several challenges. One of these issues is the new-word or out-of-vocabulary (OOV) problem, which occurs when the underlying automatic speech recognizer (ASR) encounters a word it does not "know". With ASR being deployed in constantly evolving domains such as restaurant ratings or music querying, as well as on handheld devices, the new-word problem continues to arise. This thesis is concerned with the OOV problem, and in particular with the process of modeling and learning the lexical properties of an OOV word through a linguistically-motivated sub-syllabic model. The linguistic model is designed using a context-free grammar which describes the sub-syllabic structure of English words and encapsulates phonotactic and phonological constraints. The context-free grammar is supported by a probability model, which captures the statistics of the parses generated by the grammar and encodes spatio-temporal context. The two main outcomes of the grammar design are: (1) sub-word units, which encode pronunciation information and can be viewed as clusters of phonemes; and (2) a high-quality alignment between graphemic and sub-word units, which results in hybrid entities denoted as spellnemes. The spellneme units are used in the design of a statistical bi-directional letter-to-sound (L2S) model, which plays a significant role in automatically learning the spelling and pronunciation of a new word. The sub-word units and the L2S model are assessed on the task of automatic lexicon generation. In a first set of experiments, knowledge of the spelling of the lexicon is assumed. It is shown that the phonemic pronunciations associated with the lexicon can be successfully learned using the L2S model as well as a sub-word recognizer. In a second set of experiments, the assumption of perfect spelling knowledge is relaxed, and an iterative and unsupervised algorithm, denoted as Turbo-style, makes use of spoken instances of both spellings and words to learn the lexical entries in a dictionary. Sub-word speech recognition is also embedded in a parallel fashion as a backoff mechanism for a word recognizer. The resulting hybrid model is evaluated in a lexical access application, whereby a word recognizer first attempts to recognize an isolated word. Upon failure of the word recognizer, the sub-word recognizer is manually triggered. Preliminary results show that such a hybrid set-up outperforms a large-vocabulary recognizer. Finally, the sub-word units are embedded in a flat hybrid OOV model for continuous ASR. The hybrid ASR is deployed as a front-end to a song retrieval application, which is queried via spoken lyrics. Vocabulary compression and open-ended query recognition are achieved by designing a hybrid ASR. The performance of the front-end recognition system is reported in terms of sentence, word, and sub-word error rates. The hybrid ASR is shown to outperform a word-only system over a range of out-of-vocabulary rates (1%-50%). The retrieval performance is thoroughly assessed as a function of ASR N-best size, language model order, and index size. Moreover, it is shown that the sub-words outperform alternative linguistically-motivated sub-lexical units such as phonemes. Finally, it is observed that a dramatic vocabulary compression, by more than a factor of 10, is accompanied by only a minor loss in song retrieval performance. By Ghinwa F. Choueiter. Ph.D.

    Query-based video summarization using machine learning and coordinated representations

    Video constitutes humanity's primary substrate of information; consider, for example, the video data uploaded daily to platforms such as YouTube: 300 hours of video per minute. Video analysis is currently one of the most active areas in computer science and industry, and includes fields such as video classification, video retrieval and video summarization (VSUMM). VSUMM is a very active research field because it lets human users simplify the information processing required to watch and analyze sets of videos, for example by reducing the number of hours of recorded video that security personnel must review. In addition, many video analysis tasks and systems need to reduce their computational load using segmentation schemes, compression algorithms, and video summarization techniques. Many approaches to VSUMM have been studied. However, the problem has no single solution because of its subjective and interpretative nature: deciding which parts of the input video should be preserved requires a subjective estimate of an importance score. This score can be related to how interesting some video segments are, how closely they represent the complete video, and how the segments relate to the task a human user is performing in a given situation. For example, producing a movie trailer is, in part, a VSUMM task, but one aimed at preserving promising and interesting parts of the movie rather than at allowing the movie's content to be reconstructed from them; that is, trailers contain interesting scenes but not representative ones. By contrast, in a surveillance setting, a summary of closed-circuit camera footage needs to be both representative and interesting and, in some situations, related to particular objects of interest, for example when a person or a car must be found. Because written natural language is the main human-machine communication interface, recent work has begun to incorporate textual queries into the VSUMM process to guide summarization, in the sense that video segments related to the query are considered important. In this thesis, we present a computational framework for summarizing an input video that lets the user enter free-form sentences and keyword queries to guide the process, taking into account user or task intention as well as general objectives such as representativeness and interestingness. Our framework relies on pre-trained deep visual and linguistic models, although we also trained our own visual-linguistic coordination model. We expect this framework to be of interest in cases where VSUMM tasks require a high degree of specification of user or task intentions, with few training stages and rapid deployment. Minciencias. Doctorate.