504,773 research outputs found

    Fast Near Neighbor Search in High-Dimensional Binary Data

    Numerous applications in search, databases, machine learning, and computer vision can benefit from efficient algorithms for near neighbor search. This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a simple and effective strategy for sub-linear time near neighbor search by creating hash tables directly from the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections. NSF Grant #113184
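The hash-table idea in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `make_hashes`, `bbit_signature`, and `BBitMinHashIndex`, the linear hash family, and the default parameters are all assumptions made for illustration. Each binary vector is represented as a non-empty set of the integer indices of its nonzero entries.

```python
import random
from collections import defaultdict

PRIME = (1 << 61) - 1  # large prime for the hash family (illustrative choice)

def make_hashes(count, seed):
    """Draw `count` random linear hash functions x -> (a*x + c) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]

def bbit_signature(items, hashes, b):
    """Keep only the lowest b bits of the minimum hash value per hash function."""
    mask = (1 << b) - 1
    return tuple(min((a * x + c) % PRIME for x in items) & mask for a, c in hashes)

class BBitMinHashIndex:
    """Several hash tables, each keyed by k concatenated b-bit minwise hashes."""

    def __init__(self, num_tables=8, k=4, b=2, seed=0):
        self.b = b
        self.hashes = [make_hashes(k, seed + t) for t in range(num_tables)]
        self.tables = [defaultdict(set) for _ in range(num_tables)]

    def add(self, key, items):
        # Store the key in one bucket per table.
        for hashes, table in zip(self.hashes, self.tables):
            table[bbit_signature(items, hashes, self.b)].add(key)

    def query(self, items):
        # Union of the buckets the query falls into; candidates still need
        # re-ranking by an exact similarity measure.
        candidates = set()
        for hashes, table in zip(self.hashes, self.tables):
            candidates |= table.get(bbit_signature(items, hashes, self.b), set())
        return candidates
```

A query only inspects one bucket per table instead of scanning every stored item, which is where the sub-linear lookup comes from. Larger `k` or `b` makes buckets more selective, while more tables raise the chance that a true near neighbor is retrieved.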

    PACE: Pattern Accurate Computationally Efficient Bootstrapping for Timely Discovery of Cyber-Security Concepts

    Public disclosure of important security information, such as knowledge of vulnerabilities or exploits, often occurs in blogs, tweets, mailing lists, and other online sources months before proper classification into structured databases. In order to facilitate timely discovery of such knowledge, we propose a novel semi-supervised learning algorithm, PACE, for identifying and classifying relevant entities in text sources. The main contribution of this paper is an enhancement of the traditional bootstrapping method for entity extraction by employing a time-memory trade-off that simultaneously circumvents a costly corpus search while strengthening pattern nomination, which should increase accuracy. An implementation in the cyber-security domain is discussed as well as challenges to Natural Language Processing imposed by the security domain. Comment: 6 pages, 3 figures, ieeeTran conference. International Conference on Machine Learning and Applications 201

    Can Automatic Abstracting Improve on Current Extracting Techniques in Aiding Users to Judge the Relevance of Pages in Search Engine Results?

    Current search engines use sentence extraction techniques to produce snippet result summaries, which users may find less than ideal for determining the relevance of pages. Unlike extracting, abstracting programs analyse the context of documents and rewrite them into informative summaries. Our project aims to produce abstracting summaries which are coherent and easy to read, thereby lessening the time users spend judging the relevance of pages. However, automatic abstracting techniques are restricted to particular domains. To solve this problem, we propose to employ text classification techniques. We propose a new approach that first classifies whole web documents into sixteen top-level ODP categories by using machine learning and a Bayesian classifier. We then manually create sixteen templates for each category. The summarisation techniques we use include natural language processing techniques that weight words and analyse lexical chains to identify salient phrases, which are placed into relevant template slots to produce summaries.

    Optimizing Photonic Nanostructures via Multi-fidelity Gaussian Processes

    We apply numerical methods in combination with finite-difference time-domain (FDTD) simulations to optimize the transmission properties of plasmonic mirror color filters, using a multi-objective figure of merit over a five-dimensional parameter space and a novel multi-fidelity Gaussian process approach. We compare these results with conventional derivative-free global search algorithms, such as a (single-fidelity) Gaussian process optimization scheme and Particle Swarm Optimization, a method commonly used in the nanophotonics community and implemented in the Lumerical commercial photonics software. We demonstrate the performance of various numerical optimization approaches on several pre-collected real-world datasets and show that by properly trading off expensive information sources with cheap simulations, one can more effectively optimize the transmission properties with a fixed budget. Comment: NIPS 2018 Workshop on Machine Learning for Molecules and Materials. arXiv admin note: substantial text overlap with arXiv:1811.0075

    Adapting Data Mining for German Named Entity Recognition

    In recent decades, machine learning approaches have been intensively experimented with for natural language processing. Most of the time, systems rely on statistics, analyzing texts at the token level and, for labelling tasks, categorizing each token among possible classes. One may notice that previous symbolic approaches (e.g. transducers) were designed to delimit pieces of text. Our research team developed mXS, a system that aims at combining both approaches. It locates boundaries of entities by using sequential pattern mining and machine learning. This system, initially developed for French, has been adapted to German.

    Automatic Video Annotation

    Currently all video search engines are text-based, i.e. they search the text labels associated with videos to retrieve the desired ones. However, this can lead to incorrect or inaccurate results, as labelling or annotating a video is mainly done manually. Consequently, mislabelled videos generate many false positive results during video searches. To solve this problem we need to improve the process of video annotation. This can be achieved by automatic annotation of videos based on their actual content, rather than text labels or tags. To accomplish this we need to enable computers to extract video “storylines”, composed of the events or actions taking place in each video. This has the potential to save time and provide better results for online video searches, as well as improve event detection in real-world surveillance footage. The project aims to facilitate Probabilistic Semantic Search and Query Answering by annotating videos in the way described, through machine le

    Probabilistic Bag-Of-Hyperlinks Model for Entity Linking

    Many fundamental problems in natural language processing rely on determining what entities appear in a given text. Commonly referred to as entity linking, this step is a fundamental component of many NLP tasks such as text understanding, automatic summarization, semantic search or machine translation. Name ambiguity, word polysemy, context dependencies and a heavy-tailed distribution of entities contribute to the complexity of this problem. We here propose a probabilistic approach that makes use of an effective graphical model to perform collective entity disambiguation. Input mentions (i.e., linkable token spans) are disambiguated jointly across an entire document by combining a document-level prior of entity co-occurrences with local information captured from mentions and their surrounding context. The model is based on simple sufficient statistics extracted from data, thus relying on few parameters to be learned. Our method does not require extensive feature engineering, nor an expensive training procedure. We use loopy belief propagation to perform approximate inference. The low complexity of our model makes this step sufficiently fast for real-time usage. We demonstrate the accuracy of our approach on a wide range of benchmark datasets, showing that it matches, and in many cases outperforms, existing state-of-the-art methods.

    Deep Learning for Predicting Molecular Electronic Properties

    Applications of novel materials have a significant positive impact on our lives. To search for such novel materials, material scientists traverse massive datasets of prospective materials, identifying ones with favourable properties. Prospective materials are screened by studying suitable spectra of these materials. Contemporary methods like high-throughput screening are very time-consuming even for moderately sized datasets. Recently, deep learning algorithms have proven successful in modelling very complex functions, such as the mapping from image to text, used for image captioning, and the mapping from text in one language to another, used for machine translation. In this thesis, we propose deep learning methods that are able to predict molecular orbital energies and spectra from only the charges and coordinates of the constituent atoms of test molecules. Our proposed machine learning (ML) model surpassed the state of the art in prediction accuracy for molecular orbital energies and, based on our literature review, it is the first ML model to predict molecular spectra.