84 research outputs found
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is the problem of finding the data
items whose distances to a query item are the smallest in a large database.
Various methods have been developed to address this problem, and recently much
effort has been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work on locality sensitive hashing. We divide the hashing
algorithms into two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution, and learning to hash, which
learns hash functions according to the data distribution, and review them from
various aspects, including hash function design, distance measure, and search
scheme in the hash coding space.
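To make the data-independent category concrete, the following is a minimal sketch of random hyperplane hashing, one well-known locality sensitive hashing family for cosine similarity. It is an illustration chosen here, not a method taken from the survey abstract; the function names, the 16-bit code length, and the sample vectors are all hypothetical.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw random Gaussian hyperplanes; no training data is consulted."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical example vectors (assumed for illustration only).
planes = make_hyperplanes(num_bits=16, dim=3)
query = [1.0, 0.2, -0.5]
scaled = [2.0 * x for x in query]   # same direction as the query
opposite = [-x for x in query]      # opposite direction

code_query = hash_code(query, planes)
code_scaled = hash_code(scaled, planes)
code_opposite = hash_code(opposite, planes)
```

Vectors pointing in the same direction receive identical codes, while a vector pointing the opposite way falls on the other side of every hyperplane, so Hamming distance in the code space tracks angular distance in the original space.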
Optimized Cartesian K-Means
Product quantization-based approaches are effective at encoding
high-dimensional data points for approximate nearest neighbor search. The space
is decomposed into a Cartesian product of low-dimensional subspaces, each of
which generates a sub codebook. Data points are encoded as compact binary codes
using these sub codebooks, and the distance between two data points can be
approximated efficiently from their codes by the precomputed lookup tables.
Traditionally, to encode a subvector of a data point in a subspace, only one
sub codeword in the corresponding sub codebook is selected, which may impose
strict restrictions on the search accuracy. In this paper, we propose a novel
approach, named Optimized Cartesian K-Means (OCKM), to better encode the data
points for more accurate approximate nearest neighbor search. In OCKM, multiple
sub codewords are used to encode the subvector of a data point in a subspace.
Each sub codeword stems from different sub codebooks in each subspace, which
are optimally generated with regard to the minimization of the distortion
errors. The high-dimensional data point is then encoded as the concatenation of
the indices of multiple sub codewords from all the subspaces. This can provide
more flexibility and lower distortion errors than traditional methods.
Experimental results on the standard real-life datasets demonstrate the
superiority over state-of-the-art approaches for approximate nearest neighbor
search. Comment: to appear in IEEE TKDE, accepted in Apr. 201
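The traditional product quantization scheme that the abstract above starts from (one sub codeword per subspace, distances read from precomputed lookup tables) can be sketched as follows. This is not OCKM itself, which selects multiple sub codewords per subspace. The sub codebooks are hand-picked hypothetical values rather than learned ones, and all function names are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vector, num_subspaces):
    """Cut a vector into equal-length subvectors, one per subspace."""
    step = len(vector) // num_subspaces
    return [vector[i * step:(i + 1) * step] for i in range(num_subspaces)]

def encode(vector, codebooks):
    """Per subspace, store the index of the nearest sub codeword."""
    return [
        min(range(len(book)), key=lambda k: sq_dist(sub, book[k]))
        for sub, book in zip(split(vector, len(codebooks)), codebooks)
    ]

def build_tables(codebooks):
    """Precompute codeword-to-codeword squared distances per subspace."""
    return [[[sq_dist(u, v) for v in book] for u in book] for book in codebooks]

def approx_dist(code_a, code_b, tables):
    """Approximate squared distance from two codes using table lookups only."""
    return sum(t[i][j] for t, i, j in zip(tables, code_a, code_b))

# Hypothetical sub codebooks: a 4-D space split into two 2-D subspaces.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
tables = build_tables(codebooks)

code_a = encode([0.9, 1.1, 0.1, 0.8], codebooks)
code_b = encode([0.1, -0.1, 0.9, 0.2], codebooks)
estimate = approx_dist(code_a, code_b, tables)
```

Once the tables are built, comparing two encoded points costs one lookup and one addition per subspace, regardless of the original dimensionality. The estimate is only as accurate as the codebooks allow, which is the restriction OCKM aims to relax.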
Self-Supervised Video Hashing with Hierarchical Binary Auto-encoder
Existing video hash functions are built on three isolated stages: frame
pooling, relaxed learning, and binarization, which have not adequately explored
the temporal order of video frames in a joint binary optimization model,
resulting in severe information loss. In this paper, we propose a novel
unsupervised video hashing framework dubbed Self-Supervised Video Hashing
(SSVH), that is able to capture the temporal nature of videos in an end-to-end
learning-to-hash fashion. We specifically address two central problems: 1) how
to design an encoder-decoder architecture to generate binary codes for videos;
and 2) how to equip the binary codes with the ability of accurate video
retrieval. We design a hierarchical binary autoencoder to model the temporal
dependencies in videos with multiple granularities, and embed the videos into
binary codes with fewer computations than the stacked architecture. Then, we
encourage the binary codes to simultaneously reconstruct the visual content and
neighborhood structure of the videos. Experiments on two real-world datasets
(FCVID and YFCC) show that our SSVH method can significantly outperform the
state-of-the-art methods and achieve the currently best performance on the task
of unsupervised video retrieval.
Macro- and microplastic accumulation in soil after 32 years of plastic film mulching
Plastic film mulch (PFM) is a double-edged-sword agricultural technology, which greatly improves global agricultural production but can also cause severe plastic pollution of the environment. Here, we characterized and quantified the amount of macro- and microplastics accumulated after 32 years of continuous plastic mulch film use in an agricultural field. An interactive field trial was established in 1987, where the effect of plastic mulching and N fertilization on maize yield was investigated. We assessed the abundance and type of macroplastics (>5 mm) at 0–20 cm soil depth and microplastics (<5 mm) at 0–100 cm depth. In the PFM plot, we found about 10 times more macroplastic particles in the fertilized plots than in the non-fertilized plots (6796 vs 653 pieces/m²), and film microplastics were about twice as abundant in the fertilized plots as in the non-fertilized plots (3.7 × 10⁶ vs 2.2 × 10⁶ particles/kg soil). These differences can be explained by entanglement of plastics with plant roots and stems, which made it more difficult to remove plastic film after harvest. Macroplastics consisted mainly of films, while microplastics consisted of films, fibers, and granules, with the films being identified as polyethylene originating from the plastic mulch films. Plastic mulch films contributed 33%–56% to the total microplastics in 0–100 cm depth. The total number of microplastics in the topsoil (0–10 cm) ranged from 7183 to 10,586 particles/kg, with an average of 8885 particles/kg. In the deep subsoil (80–100 cm) the plastic concentration ranged from 2268 to 3529 particles/kg, with an average of 2899 particles/kg. Long-term use of plastic mulch films caused considerable pollution of not only surface, but also subsurface soil. Migration of plastic to deeper soil layers makes removal and remediation more difficult, implying that the plastic pollution legacy will remain in soil for centuries.
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models
With the emergence of Large Language Models (LLMs), there has been a
significant improvement in the programming capabilities of models, attracting
growing attention from researchers. We propose CodeApex, a bilingual benchmark
dataset focusing on the programming comprehension and code generation abilities
of LLMs. CodeApex comprises three types of multiple-choice questions:
conceptual understanding, commonsense reasoning, and multi-hop reasoning,
designed to evaluate LLMs on programming comprehension tasks. Additionally,
CodeApex utilizes algorithmic questions and corresponding test cases to assess
the code quality generated by LLMs. We evaluate 14 state-of-the-art LLMs,
including both general-purpose and specialized models. GPT exhibits the best
programming capabilities, achieving approximate accuracies of 50% and 56% on
the two tasks, respectively. There is still significant room for improvement in
programming tasks. We hope that CodeApex can serve as a reference for
evaluating the coding capabilities of LLMs, further promoting their development
and growth. Datasets are released at https://github.com/APEXLAB/CodeApex.git.
CodeApex submission website is https://apex.sjtu.edu.cn/codeapex/. Comment: 21 page
Modeling Rett Syndrome Using TALEN-Edited MECP2 Mutant Cynomolgus Monkeys
Gene-editing technologies have made it feasible to create nonhuman primate models for human genetic disorders. Here, we report detailed genotypes and phenotypes of TALEN-edited MECP2 mutant cynomolgus monkeys serving as a model for a neurodevelopmental disorder, Rett syndrome (RTT), which is caused by loss-of-function mutations in the human MECP2 gene. Male mutant monkeys were embryonic lethal, reiterating that RTT is a disease of females. Through a battery of behavioral analyses, including primate-unique eye-tracking tests, in combination with brain imaging via MRI, we found a series of physiological, behavioral, and structural abnormalities resembling clinical manifestations of RTT. Moreover, blood transcriptome profiling revealed that mutant monkeys resembled RTT patients in immune gene dysregulation. Taken together, the stark similarity in phenotype and/or endophenotype between monkeys and patients suggested that gene-edited RTT founder monkeys would be of value for disease mechanistic studies as well as development of potential therapeutic interventions for RTT.