84 research outputs found

    Hashing for Similarity Search: A Survey

    Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently considerable effort has been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
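
    As a rough illustration of the data-independent category described in this abstract, the sketch below hashes vectors with random hyperplanes (one well-known locality sensitive hashing family for angular similarity). The function names, the seed, and the choice of this particular family are assumptions made for illustration; none of them come from the survey itself.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw random Gaussian hyperplanes; no data is inspected (data-independent)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical toy data: vectors pointing in similar directions tend to share bits.
planes = make_hyperplanes(num_bits=8, dim=3)
v = [1.0, 2.0, 3.0]
code_v = hash_code(v, planes)
code_scaled = hash_code([2.0, 4.0, 6.0], planes)      # same direction as v
code_opposite = hash_code([-1.0, -2.0, -3.0], planes)  # opposite direction
```

    Searching then amounts to comparing short binary codes by Hamming distance instead of comparing the original vectors.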

    Optimized Cartesian K-Means

    Product quantization-based approaches are effective for encoding high-dimensional data points for approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using these sub codebooks, and the distance between two data points can be approximated efficiently from their codes using precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace, only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian K-Means (OCKM), to better encode the data points for more accurate approximate nearest neighbor search. In OCKM, multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword stems from a different sub codebook in each subspace, and the sub codebooks are optimally generated to minimize the distortion errors. The high-dimensional data point is then encoded as the concatenation of the indices of multiple sub codewords from all the subspaces. This can provide more flexibility and lower distortion errors than traditional methods. Experimental results on standard real-life datasets demonstrate the superiority of OCKM over state-of-the-art approaches for approximate nearest neighbor search. Comment: to appear in IEEE TKDE, accepted in Apr. 201
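
    The traditional scheme described at the start of this abstract (one sub codeword per subspace, distances read from lookup tables) can be sketched as follows. This is a minimal illustration of plain product quantization, not of OCKM; the codebooks are hand-picked toy values rather than learned ones, and all function names are assumptions.

```python
def squared_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vec, num_subspaces):
    """Cut a vector into equally sized, contiguous subvectors."""
    size = len(vec) // num_subspaces
    return [vec[i * size:(i + 1) * size] for i in range(num_subspaces)]

def encode(vec, codebooks):
    """For each subspace, store the index of the nearest sub codeword."""
    return [
        min(range(len(book)), key=lambda k: squared_dist(sub, book[k]))
        for sub, book in zip(split(vec, len(codebooks)), codebooks)
    ]

def lookup_tables(query, codebooks):
    """Precompute query-to-codeword squared distances, one table per subspace."""
    return [
        [squared_dist(sub, word) for word in book]
        for sub, book in zip(split(query, len(codebooks)), codebooks)
    ]

def approx_dist(tables, code):
    """Approximate squared distance: sum one table entry per subspace."""
    return sum(table[idx] for table, idx in zip(tables, code))

# Hypothetical toy setup: 4-dimensional points, 2 subspaces, 2 codewords each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
point = [0.9, 1.1, 0.1, 0.8]
code = encode(point, codebooks)
tables = lookup_tables([0.0, 0.0, 0.0, 1.0], codebooks)
estimate = approx_dist(tables, code)
```

    The lookup tables are built once per query, so each database point costs only one table read per subspace.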

    Self-Supervised Video Hashing with Hierarchical Binary Auto-encoder

    Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization, which have not adequately explored the temporal order of video frames in a joint binary optimization model, resulting in severe information loss. In this paper, we propose a novel unsupervised video hashing framework dubbed Self-Supervised Video Hashing (SSVH), which is able to capture the temporal nature of videos in an end-to-end learning-to-hash fashion. We specifically address two central problems: 1) how to design an encoder-decoder architecture to generate binary codes for videos; and 2) how to equip the binary codes with the ability of accurate video retrieval. We design a hierarchical binary autoencoder to model the temporal dependencies in videos with multiple granularities, and embed the videos into binary codes with fewer computations than the stacked architecture. Then, we encourage the binary codes to simultaneously reconstruct the visual content and neighborhood structure of the videos. Experiments on two real-world datasets (FCVID and YFCC) show that our SSVH method can significantly outperform the state-of-the-art methods and achieve the currently best performance on the task of unsupervised video retrieval.

    Macro- and microplastic accumulation in soil after 32 years of plastic film mulching

    Plastic film mulch (PFM) is a double-edged-sword agricultural technology, which greatly improves global agricultural production but can also cause severe plastic pollution of the environment. Here, we characterized and quantified the amount of macro- and microplastics accumulated after 32 years of continuous plastic mulch film use in an agricultural field. An interactive field trial was established in 1987, where the effect of plastic mulching and N fertilization on maize yield was investigated. We assessed the abundance and type of macroplastics (>5 mm) at 0–20 cm soil depth and microplastics (<5 mm) at 0–100 cm depth. In the PFM plot, we found about 10 times more macroplastic particles in the fertilized plots than in the non-fertilized plots (6796 vs 653 pieces/m²), and film microplastics were about twice as abundant in the fertilized plots as in the non-fertilized plots (3.7 × 10⁶ vs 2.2 × 10⁶ particles/kg soil). These differences can be explained by entanglement of plastics with plant roots and stems, which made it more difficult to remove plastic film after harvest. Macroplastics consisted mainly of films, while microplastics consisted of films, fibers, and granules, with the films being identified as polyethylene originating from the plastic mulch films. Plastic mulch films contributed 33%–56% to the total microplastics in 0–100 cm depth. The total number of microplastics in the topsoil (0–10 cm) ranged from 7183 to 10,586 particles/kg, with an average of 8885 particles/kg. In the deep subsoil (80–100 cm) the plastic concentration ranged from 2268 to 3529 particles/kg, with an average of 2899 particles/kg. Long-term use of plastic mulch films caused considerable pollution of not only surface, but also subsurface soil. Migration of plastic to deeper soil layers makes removal and remediation more difficult, implying that the plastic pollution legacy will remain in soil for centuries.

    CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

    With the emergence of Large Language Models (LLMs), there has been a significant improvement in the programming capabilities of models, attracting growing attention from researchers. We propose CodeApex, a bilingual benchmark dataset focusing on the programming comprehension and code generation abilities of LLMs. CodeApex comprises three types of multiple-choice questions: conceptual understanding, commonsense reasoning, and multi-hop reasoning, designed to evaluate LLMs on programming comprehension tasks. Additionally, CodeApex utilizes algorithmic questions and corresponding test cases to assess the code quality generated by LLMs. We evaluate 14 state-of-the-art LLMs, including both general-purpose and specialized models. GPT exhibits the best programming capabilities, achieving approximate accuracies of 50% and 56% on the two tasks, respectively. There is still significant room for improvement in programming tasks. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs, further promoting their development and growth. Datasets are released at https://github.com/APEXLAB/CodeApex.git. CodeApex submission website is https://apex.sjtu.edu.cn/codeapex/. Comment: 21 page

    Modeling Rett Syndrome Using TALEN-Edited MECP2 Mutant Cynomolgus Monkeys

    Gene-editing technologies have made it feasible to create nonhuman primate models for human genetic disorders. Here, we report detailed genotypes and phenotypes of TALEN-edited MECP2 mutant cynomolgus monkeys serving as a model for a neurodevelopmental disorder, Rett syndrome (RTT), which is caused by loss-of-function mutations in the human MECP2 gene. Male mutant monkeys were embryonic lethal, reiterating that RTT is a disease of females. Through a battery of behavioral analyses, including primate-unique eye-tracking tests, in combination with brain imaging via MRI, we found a series of physiological, behavioral, and structural abnormalities resembling clinical manifestations of RTT. Moreover, blood transcriptome profiling revealed that mutant monkeys resembled RTT patients in immune gene dysregulation. Taken together, the stark similarity in phenotype and/or endophenotype between monkeys and patients suggested that gene-edited RTT founder monkeys would be of value for disease mechanistic studies as well as development of potential therapeutic interventions for RTT.

    Surface‐modified quantity of Fe 3
