
    Deep Multimodal Speaker Naming

    Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a TV/movie/live show video. The problem is challenging mainly because of its multimodal nature: face cues alone are insufficient to achieve good performance. Previous multimodal approaches usually process the data of different modalities individually and merge them using handcrafted heuristics. Such approaches work well for simple scenes, but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural network (CNN) based learning framework to automatically learn the fusion function of both face and audio cues. We show that without using face tracking, facial landmark localization or subtitles/transcripts, our system with robust multimodal feature extraction achieves state-of-the-art speaker naming performance on two diverse TV series. The dataset and implementation of our algorithm are publicly available online.

    Look, Listen and Learn - A Multimodal LSTM for Speaker Identification

    Speaker identification refers to the task of localizing the face of a person who has the same identity as the ongoing voice in a video. This task requires not only collective perception over both visual and auditory signals, but also robustness to severe quality degradations and unconstrained content variations. In this paper, we describe a novel multimodal Long Short-Term Memory (LSTM) architecture which seamlessly unifies both visual and auditory modalities from the beginning of each sequence input. The key idea is to extend the conventional LSTM by not only sharing weights across time steps, but also sharing weights across modalities. We show that modeling the temporal dependency across face and voice can significantly improve the robustness to content quality degradations and variations. We also found that our multimodal LSTM is robust to distractors, namely the non-speaking identities. We applied our multimodal LSTM to The Big Bang Theory dataset and showed that our system outperforms the state-of-the-art systems in speaker identification with a lower false alarm rate and higher recognition accuracy. Comment: The 30th AAAI Conference on Artificial Intelligence (AAAI-16)
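The weight-sharing idea in this abstract can be illustrated with a small sketch. It is not the authors' architecture: the abstract does not give layer sizes, feature extractors, or how the modalities are combined, so the dimensions, the helper names (`make_cell`, `lstm_step`, `encode`), the random inputs, and the final concatenation are all assumptions made for illustration. The only point shown is that one set of LSTM weights is reused both across time steps and across the face and audio sequences.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_cell(input_dim, hidden_dim, seed=0):
    """Create one set of LSTM weights (hypothetical sizes, random init)."""
    rng = random.Random(seed)
    rows = 4 * hidden_dim          # input, forget, output, candidate gates
    cols = input_dim + hidden_dim  # weights act on [x, h] concatenated
    weights = [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W": weights, "b": [0.0] * rows, "hidden_dim": hidden_dim}


def lstm_step(cell, x, h, c):
    """One standard LSTM step using the weights stored in `cell`."""
    n = cell["hidden_dim"]
    xh = list(x) + list(h)
    z = [sum(w * v for w, v in zip(row, xh)) + bias
         for row, bias in zip(cell["W"], cell["b"])]
    i, f, o, g = z[:n], z[n:2 * n], z[2 * n:3 * n], z[3 * n:]
    c_new = [sigmoid(f[k]) * c[k] + sigmoid(i[k]) * math.tanh(g[k])
             for k in range(n)]
    h_new = [sigmoid(o[k]) * math.tanh(c_new[k]) for k in range(n)]
    return h_new, c_new


def encode(cell, sequence):
    """Run a whole sequence through the cell; weights are shared over time."""
    n = cell["hidden_dim"]
    h, c = [0.0] * n, [0.0] * n
    for x in sequence:
        h, c = lstm_step(cell, x, h, c)
    return h


# Stand-in features: 5 time steps of 8-dimensional face and audio vectors.
rng = random.Random(1)
face_seq = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
audio_seq = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]

# The same cell (same weights) encodes both modalities.
shared_cell = make_cell(input_dim=8, hidden_dim=4)
fused = encode(shared_cell, face_seq) + encode(shared_cell, audio_seq)
```

Using a single `shared_cell` for both sequences is what distinguishes this from training one LSTM per modality; a per-modality baseline would call `make_cell` twice.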

    Two-Stage Predict+Optimize for Mixed Integer Linear Programs with Unknown Parameters in Constraints

    Consider the setting of constrained optimization, with some parameters unknown at solving time and requiring prediction from relevant features. Predict+Optimize is a recent framework for end-to-end training of supervised learning models for such predictions, incorporating information about the optimization problem in the training process in order to yield better predictions in terms of the quality of the predicted solution under the true parameters. Almost all prior works have focused on the special case where the unknowns appear only in the optimization objective and not the constraints. Hu et al. proposed the first adaptation of Predict+Optimize to handle unknowns appearing in constraints, but the framework has somewhat ad-hoc elements, and they provided a training algorithm only for covering and packing linear programs. In this work, we give a new, simpler and more powerful framework called Two-Stage Predict+Optimize, which we believe should be the canonical framework for the Predict+Optimize setting. We also give a training algorithm usable for all mixed integer linear programs, vastly generalizing the applicability of the framework. Experimental results demonstrate the superior prediction performance of our training framework over all classical and state-of-the-art methods.
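To make "quality of the predicted solution under the true parameters" concrete, here is a minimal sketch of a regret computation. It covers only the simple case the abstract attributes to prior work, where the unknowns appear in the objective; it does not implement the two-stage framework or the constraint case. The toy problem (pick the single most valuable item), the function names, and the numbers are all assumptions chosen for illustration.

```python
def best_item(values):
    """Solve the toy optimization problem: pick the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def regret(predicted_values, true_values):
    """True objective lost by optimizing against predictions instead of the truth."""
    chosen = best_item(predicted_values)   # solution found with predicted parameters
    optimal = best_item(true_values)       # solution found with true parameters
    return true_values[optimal] - true_values[chosen]


true_values = [3.0, 5.0, 4.0]

# Small numeric errors, but the ranking of the top item is wrong.
close_predictions = [3.1, 4.4, 4.5]

# Large numeric errors, but the top item is still identified correctly.
rough_predictions = [1.0, 9.0, 2.0]

regret_close = regret(close_predictions, true_values)
regret_rough = regret(rough_predictions, true_values)
```

In this example the numerically closer predictions lead to the worse decision, which is the motivation for training against solution quality rather than prediction error alone.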

    Generalized and Scalable Optimal Sparse Decision Trees

    Decision tree optimization is notoriously difficult from a computational perspective but essential for the field of interpretable machine learning. Despite efforts over the past 40 years, only recently have optimization breakthroughs been made that have allowed practical algorithms to find optimal decision trees. These new techniques have the potential to trigger a paradigm shift where it is possible to construct sparse decision trees to efficiently optimize a variety of objective functions without relying on greedy splitting and pruning heuristics that often lead to suboptimal solutions. The contribution of this work is a general framework for decision tree optimization that addresses the two significant open problems in the area: treatment of imbalanced data and fully optimizing over continuous variables. We present techniques that produce optimal decision trees over a variety of objectives including F-score, AUC, and partial area under the ROC convex hull. We also introduce a scalable algorithm that produces provably optimal results in the presence of continuous variables and speeds up decision tree construction by several orders of magnitude relative to the state of the art. Comment: This paper was published in ICML 202

    Working with the homeless: The case of a non-profit organisation in Shanghai

    This article has a two-pronged objective: to bring to the fore the much-neglected social issue of homelessness, and to explore the dynamics of state-society relations in contemporary China, through a case study of a non-profit organisation (NPO) working with the homeless in Shanghai. It shows that the largely invisible homelessness in Chinese cities was substantially due to exclusionary institutions, such as the combined household registration and 'detention and deportation' systems. Official policy has become much more supportive since 2003, when the latter was replaced with government-run shelters, but we argue that the NPO case demonstrates the potential for enhanced longer-term support and for enabling active citizenship among homeless people. By analysing the ways in which the NPO offers services through collaboration and partnership with public (and private) actors, we also argue that the transformations in post-reform China and the changes within the state and civil society have significantly blurred their boundaries, rendering state-society relations much more complex, dynamic, fluid and mutually embedded.

    Phosphorothioate DNA Mediated Sequence-Insensitive Etching and Ripening of Silver Nanoparticles

    Many DNA-functionalized nanomaterials and biosensors have been reported, but most have ignored the influence of DNA on the stability of nanoparticles. We observed that cytosine-rich DNA oligonucleotides can etch silver nanoparticles (AgNPs). In this work, we showed that phosphorothioate (PS)-modified DNA (PS-DNA) can etch AgNPs independently of DNA sequence, suggesting that the thio-modifications play the major role in etching. Compared to unmodified DNA (e.g., poly-cytosine DNA), the required concentration of PS-DNA decreases sharply, and the reaction rate increases. Furthermore, etching by PS-DNA occurs largely independently of pH, which also differs from unmodified DNA. The PS-DNA mediated etching could also be controlled well by varying DNA length and conformation, and the number and location of PS modifications. With a higher activity of PS-DNA, etching, ripening, and further etching took place sequentially. The etching ability is inhibited by forming duplex DNA, and thus etching can be used to measure the concentration of complementary DNA.