172 research outputs found
MIAEC: Missing data imputation based on the evidence Chain
© 2013 IEEE. Missing or incorrect data caused by improper operations can seriously compromise security investigations. Missing data not only damages the integrity of the information but can also bias data mining and analysis. It is therefore necessary to impute missing values during data preprocessing, reducing the impact of data lost to human error and faulty operations. Existing imputation approaches cannot satisfy analysis requirements because of their low accuracy and poor stability; in particular, their imputation accuracy decreases rapidly as the rate of missing data increases. In this paper, we propose a novel missing value imputation algorithm based on the evidence chain (MIAEC), which first mines all relevant evidence of missing values in each data tuple and then combines this evidence to build an evidence chain for estimating the missing values. To extend MIAEC to large-scale data processing, we apply the map-reduce programming model to distribute and parallelize MIAEC. Experimental results show that the proposed approach provides higher imputation accuracy than the missing data imputation algorithms based on naive Bayes and K-nearest neighbor and the mode imputation algorithm. The imputation accuracy of MIAEC also remains stable as the rate of missing values increases or the positions of missing values change. MIAEC is further shown to be well suited to distributed computing platforms, achieving a near-ideal speedup ratio.
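The evidence-combination idea behind such imputation can be sketched as follows. This is a simplified illustration, not the authors' MIAEC algorithm: each observed attribute of a tuple acts as a piece of evidence, candidate values are scored by combining smoothed conditional frequencies, and the highest-scoring candidate fills the gap. The function name and data are hypothetical.

```python
from collections import Counter, defaultdict

def impute_by_evidence(rows, miss_idx):
    """Fill the missing (None) value at column `miss_idx` of each row by
    combining evidence from the observed attributes of that row.

    Complete rows contribute co-occurrence counts; candidates are scored by
    a prior times the product of Laplace-smoothed conditional frequencies.
    Simplified illustration only, not the exact MIAEC procedure."""
    complete = [r for r in rows if r[miss_idx] is not None]
    candidates = Counter(r[miss_idx] for r in complete)
    # co-occurrence counts: (evidence column, evidence value) -> candidate counts
    cooc = defaultdict(Counter)
    for r in complete:
        for j, v in enumerate(r):
            if j != miss_idx:
                cooc[(j, v)][r[miss_idx]] += 1
    filled = []
    for r in rows:
        if r[miss_idx] is not None:
            filled.append(list(r))
            continue
        best, best_score = None, -1.0
        for cand, base in candidates.items():
            score = base / len(complete)  # prior for this candidate
            for j, v in enumerate(r):
                if j == miss_idx or v is None:
                    continue
                seen = cooc[(j, v)]
                # smoothed conditional frequency of cand given this evidence
                score *= (seen[cand] + 1) / (sum(seen.values()) + len(candidates))
            if score > best_score:
                best, best_score = cand, score
        out = list(r)
        out[miss_idx] = best
        filled.append(out)
    return filled
```

With tuples where the first two attributes strongly co-occur with the third, the combined evidence picks the consistent candidate.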
Breaking through Deterministic Barriers: Randomized Pruning Mask Generation and Selection
It is widely acknowledged that large and sparse models have higher accuracy
than small and dense models under the same model size constraints. This
motivates us to train a large model and then remove its redundant neurons or
weights by pruning. Most existing works prune networks in a deterministic
way, so performance depends solely on a single pruning criterion and thus
lacks variety. Instead, in this paper, we propose a model pruning strategy
that first generates several pruning masks in a designed random way.
Subsequently, along with an effective mask-selection rule, the optimal mask is
chosen from the pool of mask candidates. To further enhance efficiency, we
introduce an early mask evaluation strategy, mitigating the overhead associated
with training multiple masks. Our extensive experiments demonstrate that this
approach achieves state-of-the-art performance across eight datasets from GLUE,
particularly excelling at high levels of sparsity.
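The generate-then-select loop can be sketched as follows. This is a hedged illustration under simple assumptions, not the paper's exact mask-generation rule or selection criterion: masks are sampled by perturbing weight-magnitude scores with multiplicative noise, and the mask whose pruned weights minimize a supplied loss is kept.

```python
import numpy as np

def sample_masks(weights, sparsity, n_masks, noise=0.1, seed=0):
    """Sample several pruning masks by perturbing magnitude scores with
    multiplicative noise, then keeping the top (1 - sparsity) fraction.
    A sketch of randomized mask generation, not the paper's exact scheme."""
    rng = np.random.default_rng(seed)
    k = int(round(weights.size * (1 - sparsity)))  # number of weights to keep
    masks = []
    for _ in range(n_masks):
        scores = np.abs(weights) * rng.uniform(1 - noise, 1 + noise, weights.shape)
        keep = np.argsort(scores.ravel())[-k:]     # indices of the k best scores
        m = np.zeros(weights.size, dtype=bool)
        m[keep] = True
        masks.append(m.reshape(weights.shape))
    return masks

def select_mask(weights, masks, loss_fn):
    """Pick the candidate mask whose pruned weights give the lowest loss."""
    losses = [loss_fn(weights * m) for m in masks]
    return masks[int(np.argmin(losses))]
```

In practice the loss would be measured on a small validation batch; the early mask evaluation strategy in the paper further cuts the cost of scoring each candidate.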
Biomimetic Polymer Film with Brilliant Brightness Using a One-Step Water Vapor-Induced Phase Separation Method
The scales of the white Cyphochilus beetles are endowed with unusual whiteness arising from the exceptional scattering efficiency of their disordered ultrastructure optimized through millions of years of evolution. Here, a simple, one-step method based on water vapor-induced phase separation is developed to prepare thin polystyrene films with similar microstructure and comparable optical performance. A typical biomimetic 3.5 ÎŒm PS film exhibits a diffuse reflectance of 61% at 500 nm wavelength, which translates into a transport mean free path below 1 ÎŒm. A complete optical characterization through Monte Carlo simulations reveals how such a scattering performance arises from the scattering coefficient and scattering anisotropy, whose interplay provides insight into the morphological properties of the material. The potential of bright-white coatings as smart sensors or wearable devices is highlighted using a treated 3.5 ÎŒm film as a real-time sensor for human exhalation.
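The link between film thickness, mean free path, and diffuse reflectance can be illustrated with a toy Monte Carlo random walk. This is a deliberately crude sketch (isotropic scattering, no absorption, 1D depth tracking), whereas the paper's simulations also account for scattering anisotropy; it only shows the qualitative trend that a shorter mean free path raises reflectance.

```python
import math
import random

def slab_reflectance(thickness_um, mfp_um, n_photons=20000, seed=1):
    """Estimate diffuse reflectance of a non-absorbing slab via a photon
    random walk with isotropic scattering. Toy model: tracks only depth z
    and the direction cosine mu; a photon exiting at z <= 0 counts as
    reflected, at z >= thickness as transmitted."""
    rng = random.Random(seed)
    reflected = 0
    for _ in range(n_photons):
        z, mu = 0.0, 1.0  # enter at the top surface, heading inward
        while True:
            # exponentially distributed free path between scattering events
            step = -mfp_um * math.log(1.0 - rng.random())
            z += mu * step
            if z <= 0.0:
                reflected += 1
                break
            if z >= thickness_um:
                break  # transmitted out of the bottom
            mu = rng.uniform(-1.0, 1.0)  # isotropic re-scatter
    return reflected / n_photons
```

For a film of fixed thickness, shrinking the mean free path (stronger scattering) pushes the reflectance up, which is the regime the beetle-inspired microstructure exploits.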
Heterogeneous Graph Reasoning for Fact Checking over Texts and Tables
Fact checking aims to predict claim veracity by reasoning over multiple
evidence pieces. It usually involves evidence retrieval and veracity reasoning.
In this paper, we focus on the latter, reasoning over unstructured text and
structured table information. Previous works have primarily relied on
fine-tuning pretrained language models or training homogeneous-graph-based
models. Despite their effectiveness, we argue that they fail to explore the
rich semantic information underlying the evidence with different structures. To
address this, we propose a novel word-level Heterogeneous-graph-based model for
Fact Checking over unstructured and structured information, namely HeterFC. Our
approach leverages a heterogeneous evidence graph, with words as nodes and
thoughtfully designed edges representing different evidence properties. We
perform information propagation via a relational graph neural network,
facilitating interactions between claims and evidence. An attention-based
method is utilized to integrate information, combined with a language model for
generating predictions. We introduce a multitask loss function to account for
potential inaccuracies in evidence retrieval. Comprehensive experiments on the
large fact checking dataset FEVEROUS demonstrate the effectiveness of HeterFC.
Code will be released at: https://github.com/Deno-V/HeterFC. Comment: Accepted by the 38th Conference of the Association for the Advancement of Artificial Intelligence (AAAI).
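The relation-aware propagation step can be sketched as follows. This is a generic RGCN-style message-passing round over a word-level graph with typed edges, shown as a hedged illustration of heterogeneous-graph reasoning; it is not HeterFC's actual layer, and all names and shapes are assumptions.

```python
import numpy as np

def relational_propagate(h, edges, weights, self_w):
    """One round of relation-aware message passing (RGCN-style sketch).

    h: (n_nodes, d) node features.
    edges: dict mapping relation name -> list of (src, dst) pairs.
    weights: dict mapping relation name -> (d, d) transform matrix.
    self_w: (d, d) self-loop transform.
    Each node receives mean-aggregated, relation-specific messages from its
    neighbors, plus its own transformed state, followed by a ReLU."""
    n, _ = h.shape
    out = h @ self_w
    for rel, pairs in edges.items():
        agg = np.zeros_like(h)
        deg = np.zeros(n)
        for s, t in pairs:
            agg[t] += h[s] @ weights[rel]  # message along this relation type
            deg[t] += 1
        out += agg / np.maximum(deg, 1)[:, None]  # mean over incoming edges
    return np.maximum(out, 0.0)  # ReLU
```

In a fact-checking setting, distinct relations would connect claim words to evidence words from text versus table cells, letting each edge type learn its own transform.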
Performance optimization of convolution calculation by blocking and sparsity on GPU
Convolutional neural networks (CNNs) play a paramount role in machine learning
and have made significant contributions to medical image classification,
natural language processing, recommender systems, and more. A successful
CNN achieves excellent performance with fast execution time, and the
convolution operation dominates the total operation time of
convolution neural network. Therefore, in this paper, we propose a novel
convolution method for Graphics Processing Units (GPUs) that reduces the
convolution operation time and improves execution speed by approximately 2X
over the state-of-the-art convolution algorithm. Our work is based on the
observation that the input feature map of the convolution operation is often
highly sparse, and its zero values are redundant for the convolution result.
We therefore skip calculations on zero values and improve speed by
compressing the feature map. Moreover, the feature maps of deep layers are
small, which limits the number of threads; with a limited number of threads,
reducing the amount of calculation is necessary to increase calculation
speed. Our algorithm works well for convolutions over deep-layer feature
maps with large sparsity and small size.
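The zero-skipping idea can be sketched in plain Python. This is an illustrative CPU version, not the paper's GPU kernel: the feature map's nonzeros are compressed into (row, col, value) triples, and each nonzero scatters its contributions into the output, so zero activations cost nothing.

```python
import numpy as np

def sparse_conv2d(fmap, kernel):
    """Valid cross-correlation that skips zero activations.

    Nonzero entries of the feature map are gathered once (the compressed
    representation) and scattered into the output: the nonzero at (p, q)
    contributes val * kernel[u, v] to out[p - u, q - v] whenever that
    output position is in bounds. Sketch of the zero-skipping idea only."""
    fh, fw = fmap.shape
    kh, kw = kernel.shape
    oh, ow = fh - kh + 1, fw - kw + 1
    out = np.zeros((oh, ow))
    rows, cols = np.nonzero(fmap)  # compressed feature-map representation
    for p, q, val in zip(rows, cols, fmap[rows, cols]):
        for u in range(kh):
            i = p - u
            if i < 0 or i >= oh:
                continue
            for v in range(kw):
                j = q - v
                if 0 <= j < ow:
                    out[i, j] += val * kernel[u, v]
    return out
```

The work done is proportional to the number of nonzeros times the kernel size, rather than the full output-times-kernel product, which is exactly why high sparsity pays off.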
Generalized bioinspired approach to a daytime radiative cooling "skin"
Energy-saving cooling materials with strong operability are desirable towards
sustainable thermal management. Inspired by the cooperative thermo-optical
effect in the fur of polar bears, we develop a flexible and reusable cooling skin
via laminating a polydimethylsiloxane film with a highly-scattering
polyethylene aerogel. Owing to its high porosity of 97.9% and tailored pore
size of 3.8 ± 1.4 ÎŒm, superior solar reflectance of 0.96 and high
transparency to irradiated thermal energy of 0.8 can be achieved at a thickness
of 2.7 mm. Combined with low thermal conductivity of 0.032 W/m/K of the
aerogel, the cooling skin exerts midday sub-ambient temperature drops of 5-6
degrees in a metropolitan environment, with an estimated limit of 14 degrees
under ideal service conditions. We envision that this generalized bilayer
approach will construct a bridge from night-time to daytime radiative cooling
and pave the way for economical, scalable, flexible and reusable cooling
materials. Comment: 15 pages, 4 figures; another version has been accepted by ACS AMI but not yet published.
Non-Orthogonal Multiple Access Enhanced Multi-User Semantic Communication
Semantic communication serves as a novel paradigm and attracts the broad
interest of researchers. One critical aspect of it is the multi-user semantic
communication theory, which can further promote its application to the
practical network environment. While most existing works focused on the design
of end-to-end single-user semantic transmission, a novel non-orthogonal
multiple access (NOMA)-based multi-user semantic communication system named
NOMASC is proposed in this paper. The proposed system supports semantic
transmission for multiple users with diverse modalities of source information. To
avoid high demand for hardware, an asymmetric quantizer is employed at the end
of the semantic encoder for discretizing the continuous full-resolution
semantic feature. In addition, a neural network model is proposed for mapping
the discrete feature into self-learned symbols and accomplishing intelligent
multi-user detection (MUD) at the receiver. Simulation results demonstrate that
the proposed system holds good performance in non-orthogonal transmission of
multiple user signals and outperforms the other methods, especially at
low-to-medium SNRs. Moreover, it has high robustness under various simulation
settings and mismatched test scenarios. Comment: Accepted by IEEE Transactions on Cognitive Communications and Networking.
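The non-orthogonal superposition that NOMA builds on can be sketched for two users. This is the classical power-domain baseline with hand-crafted successive interference cancellation (SIC) for BPSK symbols, shown only for intuition; NOMASC replaces this kind of detector with a learned neural multi-user detector over semantic symbols.

```python
import numpy as np

def noma_superpose(s1, s2, p1=0.8, p2=0.2):
    """Superpose two users' unit-power symbol streams in the power domain
    (p1 + p2 = 1, strong user gets p1). Classical power-domain NOMA."""
    return np.sqrt(p1) * s1 + np.sqrt(p2) * s2

def sic_detect(y, p1=0.8):
    """Successive interference cancellation for BPSK (±1) symbols:
    decide the strong user first, subtract its reconstructed signal,
    then decide the weak user from the residual."""
    s1_hat = np.sign(y)                      # strong-user decision
    residual = y - np.sqrt(p1) * s1_hat      # cancel the strong user
    s2_hat = np.sign(residual)               # weak-user decision
    return s1_hat, s2_hat
```

With no channel noise and this power split, SIC recovers both streams exactly, because the strong user's amplitude dominates the superposed signal's sign.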
Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment
High-dimensional vector similarity search (HVSS) is gaining prominence as a
powerful tool for various data science and AI applications. As vector data
scales up, in-memory indexes pose a significant challenge due to the
substantial increase in main memory requirements. A potential solution involves
leveraging disk-based implementation, which stores and searches vector data on
high-performance devices like NVMe SSDs. However, implementing HVSS for data
segments proves to be intricate in vector databases where a single machine
comprises multiple segments for system scalability. In this context, each
segment operates with limited memory and disk space, necessitating a delicate
balance between accuracy, efficiency, and space cost. Existing disk-based
methods fall short as they do not holistically address all these requirements
simultaneously. In this paper, we present Starling, an I/O-efficient
disk-resident graph index framework that optimizes data layout and search
strategy within the segment. It has two primary components: (1) a data layout
incorporating an in-memory navigation graph and a reordered disk-based graph
with enhanced locality, reducing the search path length and minimizing disk
bandwidth wastage; and (2) a block search strategy designed to minimize costly
disk I/O operations during vector query execution. Through extensive
experiments, we validate the effectiveness, efficiency, and scalability of
Starling. On a data segment with 2GB memory and 10GB disk capacity, Starling
can accommodate up to 33 million vectors in 128 dimensions, offering HVSS with
over 0.9 average precision and top-10 recall rate, and latency under 1
millisecond. The results showcase Starling's superior performance, exhibiting
43.9× higher throughput with 98% lower query latency compared to
state-of-the-art methods while maintaining the same level of accuracy. Comment: This paper has been accepted by SIGMOD 202
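The search side of such a disk-resident index can be sketched as a greedy walk over a proximity graph, counting how many distinct disk blocks the walk touches. This toy model (names, block mapping, and termination rule are all assumptions) only illustrates why Starling's locality-aware layout matters: co-locating a node with its neighbors in one block shrinks the simulated I/O count; it is not Starling's actual block search strategy.

```python
import numpy as np

def greedy_graph_search(vectors, graph, query, entry, block_of, max_hops=50):
    """Greedy nearest-neighbor walk over a graph index whose nodes live in
    disk blocks. At each hop, move to the neighbor closest to the query;
    stop when no neighbor improves. Returns the final node and the number
    of distinct blocks 'read' (a stand-in for disk I/O)."""
    cur = entry
    blocks_read = set()
    for _ in range(max_hops):
        blocks_read.add(block_of[cur])  # fetching a node reads its block
        d_cur = np.linalg.norm(vectors[cur] - query)
        nxt = min(graph[cur],
                  key=lambda n: np.linalg.norm(vectors[n] - query),
                  default=cur)
        if np.linalg.norm(vectors[nxt] - query) >= d_cur:
            break  # local minimum reached
        cur = nxt
    return cur, len(blocks_read)
```

Packing consecutive path nodes into shared blocks (as in the toy `block_of` below) means a five-hop walk costs only three block reads instead of five.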
Enhanced E-Commerce Attribute Extraction: Innovating with Decorative Relation Correction and LLAMA 2.0-Based Annotation
The rapid proliferation of e-commerce platforms accentuates the need for
advanced search and retrieval systems to foster a superior user experience.
Central to this endeavor is the precise extraction of product attributes from
customer queries, enabling refined search, comparison, and other crucial
e-commerce functionalities. Unlike traditional Named Entity Recognition (NER)
tasks, e-commerce queries present a unique challenge owing to the intrinsic
decorative relationship between product types and attributes. In this study, we
propose a pioneering framework that integrates BERT for classification, a
Conditional Random Fields (CRFs) layer for attribute value extraction, and
Large Language Models (LLMs) for data annotation, significantly advancing
attribute recognition from customer inquiries. Our approach capitalizes on the
robust representation learning of BERT, synergized with the sequence decoding
prowess of CRFs, to adeptly identify and extract attribute values. We introduce
a novel decorative relation correction mechanism to further refine the
extraction process based on the nuanced relationships between product types and
attributes inherent in e-commerce data. Employing LLMs, we annotate additional
data to expand the model's grasp and coverage of diverse attributes. Our
methodology is rigorously validated on various datasets, including Walmart,
BestBuy's e-commerce NER dataset, and the CoNLL dataset, demonstrating
substantial improvements in attribute recognition performance. Particularly,
the model showcased promising results during a two-month deployment in
Walmart's Sponsor Product Search, underscoring its practical utility and
effectiveness. Comment: 9 pages, 5 images.
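The sequence-decoding step that a CRF layer contributes on top of BERT can be sketched with standard Viterbi decoding. This is the generic algorithm, not the paper's specific model; the emission and transition scores below are stand-ins for what BERT and a trained CRF would produce over attribute tags.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the best-scoring tag sequence under a linear-chain CRF.

    emissions: (T, K) per-token tag scores (e.g., from a BERT encoder).
    transitions: (K, K) score of moving from tag i to tag j.
    Standard dynamic program: keep the best score per tag per step,
    then backtrack through the argmax pointers."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = best score ending in tag i, then moving to tag j
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)
    best = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        best.append(int(back[t, best[-1]]))
    return best[::-1]
```

The transition matrix is what lets the CRF veto locally attractive but globally inconsistent taggings, e.g. an attribute-value tag following a tag it may not follow.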
- 
