Qualitative analyses on the classification model of bystander behavior in cyberbullying
Introduction: Bystanders account for the largest proportion of those involved in cyberbullying and play an important role in how cyberbullying incidents develop. Regarding the classification of bystander behavior in cyberbullying, previous research has some limitations, such as not accounting for the complexity of the online environment. Therefore, this study constructed a new classification model of bystander behavior in cyberbullying.
Methods: Using questionnaires and experimental methods separately, the study collected participants' behavioral intentions and actual behavioral responses when dealing with cyberbullying incidents.
Results: Based on two qualitative studies, this study summarized a new classification model comprising three first-level factors and six second-level factors. Specifically, the model included positive bystander behavior (i.e., pointing at the victim, bully, and others), neutral bystander behavior (i.e., inaction), and negative bystander behavior (i.e., supporting and excessively confronting the bully).
Discussion: The classification model makes important contributions to research on bystander behavior in cyberbullying. It helps researchers develop more effective intervention approaches to cyberbullying from the perspective of each category of bystander behavior.
DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data
Data preprocessing is a crucial step in the machine learning process that
transforms raw data into a more usable format for downstream ML models.
However, it can be costly and time-consuming, often requiring the expertise of
domain experts. Existing automated machine learning (AutoML) frameworks claim
to automate data preprocessing. However, they often use a restricted search
space of data preprocessing pipelines which limits the potential performance
gains, and they are often too slow as they require training the ML model
multiple times. In this paper, we propose DiffPrep, a method that can
automatically and efficiently search for a data preprocessing pipeline for a
given tabular dataset and a differentiable ML model such that the performance
of the ML model is maximized. We formalize the problem of data preprocessing
pipeline search as a bi-level optimization problem. To solve this problem
efficiently, we transform and relax the discrete, non-differentiable search space
into a continuous and differentiable one, which allows us to perform the
pipeline search using gradient descent while training the ML model only once.
Our experiments show that DiffPrep achieves the best test accuracy on 15 out of
the 18 real-world datasets evaluated and improves the model's test accuracy by
up to 6.6 percentage points.
Comment: Published at SIGMOD 2023
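The core idea, relaxing the discrete choice among candidate preprocessing operators into a softmax-weighted mixture that gradient descent can optimize, can be sketched in a few lines. The snippet below is a minimal DARTS-style illustration of this bi-level setup under assumed inputs, not DiffPrep's actual implementation; the MixedTransform class, the candidate operators, and the toy data are all assumptions.

```python
# Minimal sketch of a differentiable relaxation over preprocessing operators
# (DARTS-style). MixedTransform, the candidate ops, and the toy data are
# illustrative assumptions, not DiffPrep's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedTransform(nn.Module):
    """Softmax-weighted mixture over candidate operators, turning the
    discrete pipeline choice into a differentiable one."""
    def __init__(self, ops):
        super().__init__()
        self.ops = ops
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # pipeline parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# Candidate preprocessing operators as simple callables.
ops = [
    lambda x: (x - x.mean(0)) / (x.std(0) + 1e-8),  # standardize
    lambda x: (x - x.min(0).values) / (x.max(0).values - x.min(0).values + 1e-8),  # min-max
    lambda x: x,  # identity (no preprocessing)
]

prep, model = MixedTransform(ops), nn.Linear(4, 2)
opt_model = torch.optim.SGD(model.parameters(), lr=0.1)  # inner level: training loss
opt_prep = torch.optim.Adam([prep.alpha], lr=0.01)       # outer level: validation loss

x_tr, y_tr = torch.randn(64, 4), torch.randint(0, 2, (64,))
x_va, y_va = torch.randn(64, 4), torch.randint(0, 2, (64,))

for _ in range(100):
    opt_model.zero_grad()
    F.cross_entropy(model(prep(x_tr)), y_tr).backward()  # update model weights
    opt_model.step()
    opt_prep.zero_grad()
    F.cross_entropy(model(prep(x_va)), y_va).backward()  # update pipeline weights
    opt_prep.step()

print(F.softmax(prep.alpha, dim=0))  # learned preference over operators
```

After this single training run, the highest-weighted operator would serve as the discrete pipeline choice, which is what makes a one-shot search possible in this style of relaxation.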
Rethinking Similarity Search: Embracing Smarter Mechanisms over Smarter Data
In this vision paper, we propose a shift in perspective for improving the
effectiveness of similarity search. Rather than focusing solely on enhancing
the data quality, particularly machine learning-generated embeddings, we
advocate for a more comprehensive approach that also enhances the underpinning
search mechanisms. We highlight three novel avenues that call for a
redefinition of the similarity search problem: exploiting implicit data
structures and distributions, engaging users in an iterative feedback loop, and
moving beyond a single query vector. These novel pathways have gained relevance
in emerging applications such as large-scale language models, video clip
retrieval, and data labeling. We discuss the corresponding research challenges
posed by these new problem areas and share insights from our preliminary
discoveries.
Falcon: Fair Active Learning using Multi-armed Bandits
Biased data can lead to unfair machine learning models, highlighting the
importance of embedding fairness at the beginning of data analysis,
particularly during dataset curation and labeling. In response, we propose
Falcon, a scalable fair active learning framework. Falcon adopts a data-centric
approach that improves machine learning model fairness via strategic sample
selection. Given a user-specified group fairness measure, Falcon identifies
samples from "target groups" (e.g., (attribute=female, label=positive)) that
are the most informative for improving fairness. However, a challenge arises
since these target groups are defined using ground truth labels that are not
available during sample selection. To handle this, we propose a novel
trial-and-error method: we postpone using a sample if its predicted label
differs from the expected one and the sample thus falls outside the target
group. We also observe a trade-off: selecting more informative samples results
in a higher likelihood of postponing due to undesired label predictions, and
the optimal balance varies per dataset. We capture the trade-off between informativeness
and postpone rate as policies and propose to automatically select the best
policy using adversarial multi-armed bandit methods, given their computational
efficiency and theoretical guarantees. Experiments show that Falcon
significantly outperforms existing fair active learning approaches in terms of
fairness and accuracy and is more efficient. In particular, only Falcon
supports a proper trade-off between accuracy and fairness where its maximum
fairness score is 1.8-4.5x higher than the second-best results.
Comment: Accepted to VLDB 2024
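The policy-selection step can be made concrete with EXP3, a standard algorithm from the adversarial multi-armed bandit family the abstract refers to. The sketch below is a generic EXP3 loop under assumed inputs; the reward function, which stands in for an observed fairness improvement, and the number of policies are placeholders rather than Falcon's actual design.

```python
# Generic EXP3 adversarial bandit loop for choosing among candidate policies.
# The reward function below is a placeholder standing in for an observed
# fairness improvement; it is not Falcon's actual reward design.
import math
import random

def exp3(n_arms, n_rounds, reward_fn, gamma=0.1):
    """EXP3: sample an arm from exploration-smoothed weights, then apply an
    importance-weighted exponential update with a reward in [0, 1]."""
    weights = [1.0] * n_arms
    for _ in range(n_rounds):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm)  # must lie in [0, 1]
        weights[arm] *= math.exp(gamma * (reward / probs[arm]) / n_arms)
    return weights

# Hypothetical usage: three policies trading off informativeness vs. postpone rate.
def toy_reward(arm):
    return random.random() * (0.4 + 0.2 * arm)  # placeholder reward in [0, 1]

print(exp3(n_arms=3, n_rounds=200, reward_fn=toy_reward))
```

EXP3's regret guarantees hold even when rewards are chosen adversarially, which matches the setting where per-dataset reward behavior is unknown in advance.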
DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization
With the increase in the scale of Deep Learning (DL) training workloads in
terms of compute resources and time consumption, the likelihood of encountering
in-training failures rises substantially, leading to lost work and resource
wastage. Such failures are typically offset by a checkpointing mechanism, which
comes at the cost of storage and network bandwidth overhead. State-of-the-art
approaches involve lossy model compression mechanisms, which induce a tradeoff
between the resulting model quality (accuracy) and compression ratio. Delta
compression is then used to further reduce the overhead by only storing the
difference between consecutive checkpoints. We make a key enabling observation
that the sensitivity of model weights to compression varies during training,
and different weights benefit from different quantization levels (ranging from
retaining full precision to pruning). We propose (1) a non-uniform quantization
scheme that leverages this variation, (2) an efficient search mechanism that
dynamically finds the best quantization configurations, and (3) a
quantization-aware delta compression mechanism that rearranges weights to
minimize checkpoint differences, thereby maximizing compression. We instantiate
these contributions in DynaQuant - a framework for DL workload checkpoint
compression. Our experiments show that DynaQuant consistently achieves a better
tradeoff between accuracy and compression ratios compared to prior works,
enabling a compression ratio up to 39x and withstanding up to 10 restores with
negligible accuracy impact for fault-tolerant training. DynaQuant achieves at
least an order of magnitude reduction in checkpoint storage overhead for
training failure recovery as well as transfer learning use cases without any
loss of accuracy.
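The first and third contributions can be illustrated together: quantize each tensor at a bit-width driven by a sensitivity proxy, then store only the (highly compressible) difference between consecutive checkpoints' quantized codes. Everything in the sketch below, including the sensitivity heuristic, the bit-width thresholds, and the zlib back end, is an illustrative assumption rather than DynaQuant's actual mechanism.

```python
# Sketch: sensitivity-dependent bit-widths plus delta compression between
# consecutive checkpoints. The sensitivity proxy, thresholds, and zlib
# back end are illustrative assumptions, not DynaQuant's mechanisms.
import numpy as np
import zlib

def quantize(w, bits):
    """Uniform affine quantization of a float array to 2**bits levels."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.uint16)
    return q, lo, scale

def checkpoint(weights, prev_codes=None):
    """Quantize each tensor, then delta-encode its codes against the previous
    checkpoint (differences are mostly zero, so they compress well)."""
    blobs, codes = {}, {}
    for name, w in weights.items():
        # Assumed sensitivity proxy: larger relative spread -> more bits.
        sensitivity = w.std() / (np.abs(w).mean() + 1e-12)
        bits = 8 if sensitivity > 1.0 else 4
        q, lo, scale = quantize(w, bits)
        codes[name] = q
        delta = q if prev_codes is None else q - prev_codes[name]  # wraps mod 2**16
        blobs[name] = (zlib.compress(delta.tobytes()), lo, scale, bits)
    return blobs, codes

# Hypothetical two consecutive training-step checkpoints: the second one is
# much smaller because most quantized codes did not change.
w0 = {"layer": np.random.randn(1000).astype(np.float32)}
blobs0, codes0 = checkpoint(w0)
w1 = {"layer": w0["layer"] + 1e-4 * np.random.randn(1000).astype(np.float32)}
blobs1, _ = checkpoint(w1, prev_codes=codos0 if False else codes0)
print(len(blobs0["layer"][0]), len(blobs1["layer"][0]))
```

The delta stream concentrates its mass on zero when weights move little between checkpoints, which is exactly the regime where a general-purpose entropy coder pays off.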