178 research outputs found
MIAEC: Missing data imputation based on the evidence Chain
© 2013 IEEE. Missing or incorrect data caused by improper operations can seriously compromise security investigation. Missing data not only damages the integrity of the information but can also bias data mining and analysis. It is therefore necessary to impute missing values during data preprocessing to reduce the impact of data lost through human error and faulty operations. Existing missing-value imputation approaches cannot satisfy analysis requirements because of their low accuracy and poor stability; in particular, their imputation accuracy drops rapidly as the rate of missing data increases. In this paper, we propose a novel missing value imputation algorithm based on the evidence chain (MIAEC), which first mines all relevant evidence of missing values in each data tuple and then combines this evidence to build an evidence chain for estimating the missing values. To extend MIAEC to large-scale data processing, we apply the map-reduce programming model to distribute and parallelize MIAEC. Experimental results show that the proposed approach provides higher imputation accuracy than the missing data imputation algorithm based on naive Bayes, the mode imputation algorithm, and the missing data imputation algorithm based on K-nearest neighbors. MIAEC maintains its imputation accuracy as the rate of missing values increases or the positions of missing values change. MIAEC is also shown to be suitable for distributed computing platforms and achieves a near-ideal speedup ratio.
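The abstract describes MIAEC only at a high level. As a rough illustration of the evidence-chain idea (not the paper's actual algorithm), the toy sketch below mines co-occurrence evidence between observed attribute values and candidate values of the missing attribute from complete tuples, then chains that evidence by summing support; the function name and the simple additive combination rule are assumptions for illustration.

```python
from collections import defaultdict

def impute_by_evidence(tuples, missing_idx):
    """Toy evidence-chain imputation: each observed attribute of an
    incomplete tuple contributes co-occurrence evidence for candidate
    values of the missing attribute; evidence is chained by summing
    co-occurrence counts and the best-supported candidate wins."""
    complete = [t for t in tuples if t[missing_idx] is not None]
    # Evidence table: (attr_index, attr_value, candidate) -> count
    evidence = defaultdict(int)
    candidates = set()
    for t in complete:
        candidates.add(t[missing_idx])
        for j, v in enumerate(t):
            if j != missing_idx:
                evidence[(j, v, t[missing_idx])] += 1

    imputed = []
    for t in tuples:
        if t[missing_idx] is not None:
            imputed.append(list(t))
            continue
        # Chain the evidence from every observed attribute
        support = {c: 0 for c in candidates}
        for j, v in enumerate(t):
            if j != missing_idx and v is not None:
                for c in candidates:
                    support[c] += evidence[(j, v, c)]
        best = max(support, key=support.get)
        filled = list(t)
        filled[missing_idx] = best
        imputed.append(filled)
    return imputed
```

In a map-reduce setting, the evidence-table construction naturally maps over data partitions and reduces by summing counts, which is presumably how distribution helps here.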
Breaking through Deterministic Barriers: Randomized Pruning Mask Generation and Selection
It is widely acknowledged that large and sparse models have higher accuracy
than small and dense models under the same model size constraints. This
motivates us to train a large model and then remove its redundant neurons or
weights by pruning. Most existing works prune networks deterministically, so the outcome depends solely on a single pruning criterion and thus lacks variety. Instead, in this paper, we propose a model pruning strategy
that first generates several pruning masks in a designed random way.
Subsequently, along with an effective mask-selection rule, the optimal mask is
chosen from the pool of mask candidates. To further enhance efficiency, we
introduce an early mask evaluation strategy, mitigating the overhead associated
with training multiple masks. Our extensive experiments demonstrate that this
approach achieves state-of-the-art performance across eight datasets from GLUE,
particularly excelling at high levels of sparsity.
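The abstract leaves the concrete randomization and selection rules to the paper. Below is a minimal sketch of the two-stage idea, assuming magnitude-proportional sampling for mask generation and a held-out loss proxy for mask selection; both are stand-ins, not the paper's method.

```python
import numpy as np

def random_masks(weights, sparsity, n_masks, rng):
    """Generate candidate pruning masks by sampling which weights to keep
    with probability proportional to |weight| (an assumed randomization;
    the paper's designed sampling scheme may differ)."""
    flat = np.abs(weights).ravel()
    keep = int(round(flat.size * (1.0 - sparsity)))
    probs = flat / flat.sum()
    masks = []
    for _ in range(n_masks):
        kept = rng.choice(flat.size, size=keep, replace=False, p=probs)
        m = np.zeros(flat.size, dtype=bool)
        m[kept] = True
        masks.append(m.reshape(weights.shape))
    return masks

def select_mask(weights, masks, loss_fn):
    """Mask selection: keep the candidate whose pruned weights give the
    lowest loss on a held-out proxy (a stand-in for the paper's rule)."""
    scored = [(loss_fn(weights * m), m) for m in masks]
    return min(scored, key=lambda s: s[0])[1]
```

The early mask evaluation strategy mentioned in the abstract would short-circuit `select_mask` before fully training with each candidate, but its details are not given here.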
Biomimetic Polymer Film with Brilliant Brightness Using a One-Step Water Vapor-Induced Phase Separation Method
The scales of the white Cyphochilus beetles are endowed with unusual whiteness arising from the exceptional scattering efficiency of their disordered ultrastructure, optimized through millions of years of evolution. Here, a simple, one-step method based on water vapor-induced phase separation is developed to prepare thin polystyrene films with a similar microstructure and comparable optical performance. A typical biomimetic 3.5 µm PS film exhibits a diffuse reflectance of 61% at 500 nm wavelength, which translates into a transport mean free path below 1 µm. A complete optical characterization through Monte Carlo simulations reveals how such scattering performance arises from the scattering coefficient and scattering anisotropy, whose interplay provides insight into the morphological properties of the material. The potential of bright-white coatings as smart sensors or wearable devices is highlighted using a treated 3.5 µm film as a real-time sensor for human exhalation.
Heterogeneous Graph Reasoning for Fact Checking over Texts and Tables
Fact checking aims to predict claim veracity by reasoning over multiple
evidence pieces. It usually involves evidence retrieval and veracity reasoning.
In this paper, we focus on the latter, reasoning over unstructured text and
structured table information. Previous works have primarily relied on
fine-tuning pretrained language models or training homogeneous-graph-based
models. Despite their effectiveness, we argue that they fail to explore the
rich semantic information underlying the evidence with different structures. To
address this, we propose a novel word-level Heterogeneous-graph-based model for
Fact Checking over unstructured and structured information, namely HeterFC. Our
approach leverages a heterogeneous evidence graph, with words as nodes and
thoughtfully designed edges representing different evidence properties. We
perform information propagation via a relational graph neural network,
facilitating interactions between claims and evidence. An attention-based
method is utilized to integrate information, combined with a language model for
generating predictions. We introduce a multitask loss function to account for
potential inaccuracies in evidence retrieval. Comprehensive experiments on the
large fact checking dataset FEVEROUS demonstrate the effectiveness of HeterFC.
Code will be released at: https://github.com/Deno-V/HeterFC. Comment: Accepted by the 38th AAAI Conference on Artificial Intelligence (AAAI)
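As a rough illustration of a word-level heterogeneous evidence graph (the paper's actual edge design is richer and feeds a relational GNN), the sketch below builds word nodes from the claim, evidence sentences, and table cells, with typed edges for intra-unit co-occurrence and for claim-evidence word matches; the function name and the specific edge types are assumptions.

```python
def build_heterogeneous_graph(claim, sentences, table_cells):
    """Sketch of a word-level heterogeneous evidence graph in the spirit
    of HeterFC: words are nodes, and edge types distinguish where words
    co-occur (same claim, same sentence, same table cell) and where claim
    words match evidence words."""
    nodes, edges = [], []

    def add_words(words, source_type):
        ids = []
        for w in words:
            ids.append(len(nodes))
            nodes.append((w, source_type))
        # Fully connect words from the same evidence unit with a typed edge
        for i in ids:
            for j in ids:
                if i < j:
                    edges.append((i, j, f"intra_{source_type}"))
        return ids

    claim_ids = add_words(claim.split(), "claim")
    unit_ids = [add_words(s.split(), "sentence") for s in sentences]
    unit_ids += [add_words(c.split(), "cell") for c in table_cells]
    # Link claim words to identical evidence words (claim-evidence edges)
    for ci in claim_ids:
        for ids in unit_ids:
            for ei in ids:
                if nodes[ci][0].lower() == nodes[ei][0].lower():
                    edges.append((ci, ei, "claim_evidence_match"))
    return nodes, edges
```

A relational graph neural network would then propagate information along these typed edges, which is the step this sketch stops short of.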
Performance optimization of convolution calculation by blocking and sparsity on GPU
Convolutional neural networks (CNNs) play a paramount role in machine learning and have made significant contributions in medical image classification, natural language processing, recommender systems, and so on. A successful CNN must deliver excellent accuracy with fast execution time, and the convolution operation dominates a CNN's total operation time. Therefore, in this paper, we propose a novel convolution method for Graphics Processing Units (GPUs) that reduces convolution time and improves execution speed by approximately 2x over the state-of-the-art convolution algorithm. Our work is based on the
observation that the input feature maps of convolution operations are highly sparse, and the zero values of a feature map are redundant for the convolution result. Therefore, we skip calculations on zero values and improve speed by compressing the feature map. Moreover, the feature maps in the deeper layers of a network are small and the number of available threads is limited, so with a limited number of threads it is necessary to reduce the amount of calculation to increase calculation speed. Our algorithm is particularly effective for convolutions over deep-layer feature maps with high sparsity and small size.
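The zero-skipping idea can be sketched on a CPU (the paper's contribution is a GPU blocking scheme, which this does not reproduce): only nonzero feature-map entries scatter contributions into the output of a valid cross-correlation, so every multiplication by zero is skipped.

```python
import numpy as np

def sparse_conv2d(fmap, kernel):
    """Direct valid-padding convolution (cross-correlation, as in CNN
    frameworks) that skips zero activations: only nonzero feature-map
    entries scatter their contribution into the output."""
    H, W = fmap.shape
    kh, kw = kernel.shape
    oh, ow = H - kh + 1, W - kw + 1
    out = np.zeros((oh, ow))
    # Compressed representation: coordinates of nonzero activations only
    ys, xs = np.nonzero(fmap)
    for y, x, v in zip(ys, xs, fmap[ys, xs]):
        # Each nonzero input contributes to every output whose receptive
        # field covers position (y, x).
        for dy in range(kh):
            for dx in range(kw):
                oy, ox = y - dy, x - dx
                if 0 <= oy < oh and 0 <= ox < ow:
                    out[oy, ox] += v * kernel[dy, dx]
    return out
```

The work saved is proportional to the sparsity of the feature map, matching the abstract's observation that the gain is largest for highly sparse deep-layer inputs.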
Large Language Model based Situational Dialogues for Second Language Learning
In second language learning, scenario-based conversation practice is
important for language learners to achieve fluency in speaking, but students
often lack sufficient opportunities to practice their conversational skills
with qualified instructors or native speakers. To bridge this gap, we propose
situational dialogue models for students to engage in conversational practice.
Our situational dialogue models are fine-tuned on large language models (LLMs),
with the aim of combining the engaging nature of an open-ended conversation
with the focused practice of scenario-based tasks. Leveraging the
generalization capabilities of LLMs, we demonstrate that our situational
dialogue models perform effectively not only on training topics but also on
topics not encountered during training. This offers a promising solution to
support a wide range of conversational topics without extensive manual work.
Additionally, research in the field of dialogue systems still lacks reliable
automatic evaluation metrics, leading to human evaluation as the gold standard
(Smith et al., 2022), which is typically expensive. To address the limitations
of existing evaluation methods, we present a novel automatic evaluation method
that employs fine-tuned LLMs to efficiently and effectively assess the
performance of situational dialogue models. Comment: 14 pages, 6 figures
Generalized bioinspired approach to a daytime radiative cooling "skin"
Energy-saving cooling materials with strong operability are desirable towards
sustainable thermal management. Inspired by the cooperative thermo-optical effect in the fur of polar bears, we develop a flexible and reusable cooling skin
via laminating a polydimethylsiloxane film with a highly-scattering
polyethylene aerogel. Owing to its high porosity of 97.9% and tailored pore
size of 3.8 ± 1.4 µm, a superior solar reflectance of 0.96 and a high
transparency to irradiated thermal energy of 0.8 can be achieved at a thickness
of 2.7 mm. Combined with the aerogel's low thermal conductivity of 0.032 W/m/K, the cooling skin achieves midday sub-ambient temperature drops of 5-6
degrees in a metropolitan environment, with an estimated limit of 14 degrees
under ideal service conditions. We envision that this generalized bilayer
approach will construct a bridge from night-time to daytime radiative cooling
and pave the way for economical, scalable, flexible and reusable cooling
materials. Comment: 15 pages, 4 figures; another version has been accepted by ACS Applied Materials & Interfaces but not yet published
Non-Orthogonal Multiple Access Enhanced Multi-User Semantic Communication
Semantic communication serves as a novel paradigm and attracts the broad
interest of researchers. One critical aspect of it is the multi-user semantic
communication theory, which can further promote its application to the
practical network environment. While most existing works focused on the design
of end-to-end single-user semantic transmission, a novel non-orthogonal
multiple access (NOMA)-based multi-user semantic communication system named
NOMASC is proposed in this paper. The proposed system can support semantic
transmission for multiple users with diverse modalities of source information. To avoid high hardware demands, an asymmetric quantizer is employed at the end
of the semantic encoder for discretizing the continuous full-resolution
semantic feature. In addition, a neural network model is proposed for mapping
the discrete feature into self-learned symbols and accomplishing intelligent
multi-user detection (MUD) at the receiver. Simulation results demonstrate that
the proposed system performs well in the non-orthogonal transmission of multiple user signals and outperforms other methods, especially at
low-to-medium SNRs. Moreover, it has high robustness under various simulation
settings and mismatched test scenarios. Comment: accepted by IEEE Transactions on Cognitive Communications and Networking
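The abstract does not specify the quantizer's design; one plausible reading of an "asymmetric quantizer" is a uniform scalar quantizer with a zero-point offset, as used in model quantization. The sketch below implements that assumption, not the paper's scheme.

```python
import numpy as np

def asymmetric_quantize(x, n_bits=2):
    """Asymmetric scalar quantization of a continuous feature vector:
    the range [x.min(), x.max()] is mapped onto 2**n_bits integer levels
    with a zero-point offset, so the grid need not be symmetric about 0."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(int)  # discrete symbols 0..levels
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct an approximation of the continuous feature."""
    return q * scale + lo
```

In a semantic communication pipeline, the integer symbols would then be mapped to transmit symbols by the learned model described in the abstract.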
Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment
High-dimensional vector similarity search (HVSS) is gaining prominence as a
powerful tool for various data science and AI applications. As vector data
scales up, in-memory indexes pose a significant challenge due to the
substantial increase in main memory requirements. A potential solution involves
leveraging disk-based implementation, which stores and searches vector data on
high-performance devices like NVMe SSDs. However, implementing HVSS for data
segments proves to be intricate in vector databases where a single machine
comprises multiple segments for system scalability. In this context, each
segment operates with limited memory and disk space, necessitating a delicate
balance between accuracy, efficiency, and space cost. Existing disk-based
methods fall short as they do not holistically address all these requirements
simultaneously. In this paper, we present Starling, an I/O-efficient
disk-resident graph index framework that optimizes data layout and search
strategy within the segment. It has two primary components: (1) a data layout
incorporating an in-memory navigation graph and a reordered disk-based graph
with enhanced locality, reducing the search path length and minimizing disk
bandwidth wastage; and (2) a block search strategy designed to minimize costly
disk I/O operations during vector query execution. Through extensive
experiments, we validate the effectiveness, efficiency, and scalability of
Starling. On a data segment with 2GB memory and 10GB disk capacity, Starling
can accommodate up to 33 million vectors in 128 dimensions, offering HVSS with
over 0.9 average precision and top-10 recall rate, and latency under 1
millisecond. The results showcase Starling's superior performance, exhibiting
43.9× higher throughput with 98% lower query latency compared to state-of-the-art methods while maintaining the same level of accuracy. Comment: This paper has been accepted by SIGMOD 2024
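As a rough illustration of a block-aware search (not Starling's actual implementation), the toy below runs best-first graph search in which reading a node's "disk" block scores every vector co-located in it, amortizing one simulated I/O across all residents; the function name, data layout, and visit budget are assumptions.

```python
import heapq
import numpy as np

def block_greedy_search(query, vectors, graph, block_of, blocks, entry,
                        k=5, max_visits=64):
    """Toy block-aware best-first graph search: whenever a node's block
    is read from 'disk', every vector co-located in that block is scored,
    so one simulated I/O is amortized over all residents. The real system
    also reorders the on-disk graph for locality and keeps a separate
    in-memory navigation graph, neither of which is modeled here."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited_blocks, scored = set(), {}

    def read_block(i):
        b = block_of[i]
        if b not in visited_blocks:      # one simulated disk I/O per block
            visited_blocks.add(b)
            for j in blocks[b]:          # score every co-located vector
                scored[j] = dist(j)

    read_block(entry)
    frontier, seen, visits = [(scored[entry], entry)], {entry}, 0
    while frontier and visits < max_visits:
        _, cur = heapq.heappop(frontier)
        visits += 1
        for nb in graph[cur]:
            if nb not in seen:
                seen.add(nb)
                read_block(nb)
                heapq.heappush(frontier, (scored[nb], nb))
    top_k = sorted(scored, key=scored.get)[:k]
    return top_k, len(visited_blocks)
```

Returning the block-read count alongside the results makes the I/O savings of co-locating similar vectors directly measurable.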