Fast Near Neighbor Search in High-Dimensional Binary Data
Numerous applications in search, databases, machine learning,
and computer vision, can benefit from efficient algorithms for near
neighbor search. This paper proposes a simple framework for fast near
neighbor search in high-dimensional binary data, which are common in
practice (e.g., text). We develop a very simple and effective strategy for
sub-linear time near neighbor search, by creating hash tables directly
using the bits generated by b-bit minwise hashing. The advantages of
our method are demonstrated through thorough comparisons with two
strong baselines: spectral hashing and sign (1-bit) random projections.
NSF Grant #113184
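The idea of building hash tables directly from b-bit minwise hash bits can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `BBitMinHashIndex`, the parameters `num_tables`, `k` and `b`, and the use of linear hash functions modulo a prime in place of true random permutations are all assumptions made for the example. Each binary vector is represented as the non-empty set of its nonzero indices.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1 (assumed modulus for the hash family)


def make_hash_funcs(count, rng):
    """Draw `count` random (a, c) pairs for h(x) = (a*x + c) mod PRIME."""
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def bbit_signature(items, funcs, b):
    """Keep only the lowest b bits of each minwise hash of the non-empty set `items`."""
    mask = (1 << b) - 1
    return tuple(min((a * x + c) % PRIME for x in items) & mask for a, c in funcs)


class BBitMinHashIndex:
    """Hypothetical index: `num_tables` tables, each keyed by k b-bit hash values."""

    def __init__(self, num_tables=8, k=4, b=2, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.funcs = [make_hash_funcs(k, rng) for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.data = {}

    def add(self, key, items):
        """Store the set and file its key under one bucket per table."""
        items = frozenset(items)
        self.data[key] = items
        for funcs, table in zip(self.funcs, self.tables):
            table[bbit_signature(items, funcs, self.b)].append(key)

    def query(self, items, top=5):
        """Collect keys sharing a bucket in any table, then re-rank by exact Jaccard."""
        items = frozenset(items)
        candidates = set()
        for funcs, table in zip(self.funcs, self.tables):
            candidates.update(table.get(bbit_signature(items, funcs, self.b), ()))

        def jaccard(key):
            other = self.data[key]
            return len(items & other) / len(items | other)

        return sorted(candidates, key=jaccard, reverse=True)[:top]
```

Only the buckets that match the query signature are inspected, so the number of candidates examined can be much smaller than the collection size; larger `k` or `b` makes buckets more selective, while more tables raise the chance of retrieving true neighbors.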
PACE: Pattern Accurate Computationally Efficient Bootstrapping for Timely Discovery of Cyber-Security Concepts
Public disclosure of important security information, such as knowledge of
vulnerabilities or exploits, often occurs in blogs, tweets, mailing lists, and
other online sources months before proper classification into structured
databases. In order to facilitate timely discovery of such knowledge, we
propose a novel semi-supervised learning algorithm, PACE, for identifying and
classifying relevant entities in text sources. The main contribution of this
paper is an enhancement of the traditional bootstrapping method for entity
extraction by employing a time-memory trade-off that simultaneously circumvents
a costly corpus search while strengthening pattern nomination, which should
increase accuracy. An implementation in the cyber-security domain is discussed
as well as challenges to Natural Language Processing imposed by the security
domain.
Comment: 6 pages, 3 figures, ieeeTran conference. International Conference on
Machine Learning and Applications 201
Can Automatic Abstracting Improve on Current Extracting Techniques in Aiding Users to Judge the Relevance of Pages in Search Engine Results?
Current search engines use sentence extraction techniques to produce snippet result summaries, which users may find less than ideal for determining the relevance of pages. Unlike extracting programs, abstracting programs analyse the context of documents and rewrite them into informative summaries. Our project aims to produce abstracting summaries that are coherent and easy to read, thereby reducing the time users spend judging the relevance of pages. However, automatic abstracting techniques are restricted to particular domains. To address this problem, we propose to employ text classification techniques. We propose a new approach that first classifies whole web documents into sixteen top-level ODP categories using machine learning and a Bayesian classifier. We then manually create sixteen templates for each category. The summarisation techniques we use include natural language processing techniques that weight words and analyse lexical chains to identify salient phrases, which are then placed into the relevant template slots to produce summaries.
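The classification step can be illustrated with a small multinomial naive Bayes classifier. This is a sketch under stated assumptions, not the project's classifier: the class name `NaiveBayesTextClassifier`, whitespace tokenisation, add-one smoothing, and the two example category labels are all chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Hypothetical multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocab = set()                        # all words seen in training

    def train(self, text, category):
        words = text.lower().split()
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocab.update(words)

    def classify(self, text):
        """Return the category with the highest log posterior for `text`."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                score += math.log((counts[word] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best, best_score = category, score
        return best


clf = NaiveBayesTextClassifier()
clf.train("football match goal team", "Sports")
clf.train("stock market bank shares", "Business")
predicted = clf.classify("team scored a goal")
```

In a pipeline like the one described, the predicted category would then select which hand-written template to fill with salient phrases.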
Optimizing Photonic Nanostructures via Multi-fidelity Gaussian Processes
We apply numerical methods in combination with finite-difference-time-domain
(FDTD) simulations to optimize transmission properties of plasmonic mirror
color filters using a multi-objective figure of merit over a five-dimensional
parameter space by utilizing a novel multi-fidelity Gaussian process approach.
We compare these results with conventional derivative-free global search
algorithms, such as a (single-fidelity) Gaussian process optimization scheme
and Particle Swarm Optimization, a commonly used method in the nanophotonics
community that is implemented in the Lumerical commercial photonics software. We
demonstrate the performance of various numerical optimization approaches on
several pre-collected real-world datasets and show that by properly trading off
expensive information sources with cheap simulations, one can more effectively
optimize the transmission properties with a fixed budget.
Comment: NIPS 2018 Workshop on Machine Learning for Molecules and Materials.
arXiv admin note: substantial text overlap with arXiv:1811.0075
Adapting Data Mining for German Named Entity Recognition
In recent decades, machine learning approaches have been intensively experimented with for natural language processing. Most of the time, systems rely on statistics, analyzing texts at the token level and, for labelling tasks, categorizing each token among the possible classes. One may notice that previous symbolic approaches (e.g. transducers) were designed to delimit pieces of text. Our research team developed mXS, a system that aims at combining both approaches. It locates the boundaries of entities by using sequential pattern mining and machine learning. This system, initially developed for French, has been adapted to German.
Automatic Video Annotation
Currently all video search engines are text-based, i.e. they search the text labels associated with each video to retrieve the desired ones. However, this can lead to incorrect or inaccurate results, as labelling or annotating a video is mainly done manually. Consequently, mislabelled videos generate many false positive results during video searches. To solve this problem we need to improve the process of video annotation. This can be achieved by automatically annotating videos based on their actual content, rather than on text labels or tags. To accomplish this we need to enable computers to extract video “storylines”, composed of the events or actions taking place in each video. This has the potential to save time and provide better results for online video searches, as well as improve event detection in real-world surveillance footage. The project aims to facilitate Probabilistic Semantic Search and Query Answering by annotating videos in the way described, through machine le
Probabilistic Bag-Of-Hyperlinks Model for Entity Linking
Many fundamental problems in natural language processing rely on determining
what entities appear in a given text. Commonly referenced as entity linking,
this step is a fundamental component of many NLP tasks such as text
understanding, automatic summarization, semantic search or machine translation.
Name ambiguity, word polysemy, context dependencies and a heavy-tailed
distribution of entities contribute to the complexity of this problem.
We here propose a probabilistic approach that makes use of an effective
graphical model to perform collective entity disambiguation. Input mentions
(i.e., linkable token spans) are disambiguated jointly across an entire
document by combining a document-level prior of entity co-occurrences with
local information captured from mentions and their surrounding context. The
model is based on simple sufficient statistics extracted from data, thus
relying on few parameters to be learned.
Our method does not require extensive feature engineering, nor an expensive
training procedure. We use loopy belief propagation to perform approximate
inference. The low complexity of our model makes this step sufficiently fast
for real-time usage. We demonstrate the accuracy of our approach on a wide
range of benchmark datasets, showing that it matches, and in many cases
outperforms, existing state-of-the-art methods.
Deep Learning for Predicting Molecular Electronic Properties
Applications of novel materials have a significant positive impact on our lives. To search for such novel materials, material scientists traverse massive datasets of prospective materials, identifying ones with favourable properties. Prospective materials are screened by studying suitable spectra of these materials. Contemporary methods like high-throughput screening are very time-consuming for moderately sized datasets.
Recently, deep learning algorithms have proven successful in modelling very complex functions, such as the mapping from image to text, used for image captioning, and the mapping from text in one language to another, used for machine translation.
In this thesis, we propose deep learning methods that are able to predict molecular orbital energies and spectra from only the charges and coordinates of the constituent atoms of test molecules. Our proposed machine learning (ML) model surpassed the state of the art in prediction accuracy for the molecular orbital energies and, based on our literature review, it is the first ML model to predict molecular spectra.