Towards Code Watermarking with Dual-Channel Transformations
The expansion of the open source community and the rise of large language
models have raised ethical and security concerns about the distribution of
source code, such as misconduct involving copyrighted code, distribution
without proper licenses, or misuse of code for malicious purposes. It is
therefore important to track the ownership of source code, for which
watermarking is a major technique. Yet, in stark contrast to natural language,
source code watermarking requires far stricter and more complicated rules to
preserve both the readability and the functionality of the code. Hence we introduce
SrcMarker, a watermarking system to unobtrusively encode ID bitstrings into
source code, without affecting the usage and semantics of the code. To this
end, SrcMarker performs transformations on an AST-based intermediate
representation that enables unified transformations across different
programming languages. The core of the system utilizes learning-based embedding
and extraction modules to select rule-based transformations for watermarking.
In addition, a novel feature-approximation technique is designed to tackle the
inherent non-differentiability of rule selection, thus seamlessly integrating
the rule-based transformations and learning-based networks into an
interconnected system to enable end-to-end training. Extensive experiments
demonstrate the superiority of SrcMarker over existing methods in various
watermarking requirements.
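The abstract does not spell out the feature-approximation technique, but the core difficulty it addresses, selecting one discrete rule-based transformation inside a differentiable training loop, is commonly handled with a relaxation such as a straight-through softmax. The sketch below is an illustration of that general idea under assumed shapes, not SrcMarker's actual design: the forward pass commits to a hard rule choice while the softmax weights give a smooth approximation of the chosen rule's features.

```python
import numpy as np

def straight_through_select(logits, rule_features, tau=1.0):
    """Differentiable surrogate for discrete rule selection (illustrative).

    The hard argmax picks one transformation rule, while the softmax
    weights give a smooth mixture of per-rule feature vectors that a
    gradient-based learner can train through.
    """
    # Temperature-scaled softmax over candidate transformation rules
    exp = np.exp((logits - logits.max()) / tau)
    soft = exp / exp.sum()
    # Hard one-hot choice, as used when actually applying a rule
    chosen = int(np.argmax(soft))
    # Approximate features: soft mixture of the candidates' features
    approx_features = soft @ rule_features
    return chosen, approx_features

# Three candidate rules, each described by a 2-dim feature vector
rule_features = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
idx, feat = straight_through_select(np.array([2.0, 0.5, 0.1]), rule_features)
```

Because the mixture weights sum to one, the approximate feature vector stays on the convex hull of the candidates' features, so it remains a plausible stand-in for the hard selection.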
Achieving Adversarial Robustness via Sparsity
Network pruning is known to produce compact models without much accuracy
degradation. However, how the pruning process affects a network's robustness,
and the working mechanism behind it, remain unresolved. In this work, we
theoretically prove that the sparsity of network weights is closely associated
with model robustness. Through experiments on a variety of adversarial pruning
methods, we find that weight sparsity does not hurt but rather improves
robustness, with both weight inheritance from the lottery ticket and
adversarial training improving model robustness in network pruning. Based on
these findings, we propose a novel adversarial training method called inverse
weights inheritance, which imposes a sparse weight distribution on a large
network by inheriting weights from a small network, thereby improving the
robustness of the large network.
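The abstract only names the mechanism, so the following is a minimal sketch of the inheritance idea rather than the paper's exact procedure: copy a small pretrained layer's weights into a block of the large layer and zero the rest, which directly imposes a sparse weight distribution on the large network before adversarial training continues.

```python
import numpy as np

def inherit_weights(small_w, large_shape):
    """Illustrative sketch (not the paper's exact procedure): impose a
    sparse weight distribution on a large layer by inheriting a small
    pretrained layer's weights and zeroing the remaining entries."""
    large_w = np.zeros(large_shape)
    r, c = small_w.shape
    large_w[:r, :c] = small_w  # inherited block from the small network
    return large_w

small = np.full((2, 2), 0.5)           # pretrained small-network layer
large = inherit_weights(small, (4, 4))  # large-network layer to initialize
sparsity = 1.0 - np.count_nonzero(large) / large.size  # 4 of 16 nonzero
```

In a real setting the zeroed entries would typically still be trainable, so the inherited sparsity acts as an initialization bias rather than a hard mask.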
TableGPT: Towards Unifying Tables, Natural Language and Commands into One GPT
Tables are prevalent in real-world databases, requiring significant time and
effort for humans to analyze and manipulate. The advancements in large language
models (LLMs) have made it possible to interact with tables using natural
language input, bringing this capability closer to reality. In this paper, we
present TableGPT, a unified fine-tuned framework that enables LLMs to
understand and operate on tables using external functional commands. It
introduces the capability to seamlessly interact with tables, enabling a wide
range of functionalities such as question answering, data manipulation (e.g.,
insert, delete, query, and modify operations), data visualization, analysis
report generation, and automated prediction. TableGPT aims to provide
convenience and accessibility to users by empowering them to effortlessly
leverage tabular data. At the core of TableGPT lies the novel concept of global
tabular representations, which empowers LLMs to gain a comprehensive
understanding of the entire table beyond meta-information. By jointly training
LLMs on both table and text modalities, TableGPT achieves a deep understanding
of tabular data and the ability to perform complex operations on tables through
chain-of-command instructions. Importantly, TableGPT offers the advantage of
being a self-contained system rather than relying on external API interfaces.
Moreover, it supports efficient data process flow, query rejection (when
appropriate) and private deployment, enabling faster domain data fine-tuning
and ensuring data privacy, which enhances the framework's adaptability to
specific use cases.
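The specific command vocabulary is not given in the abstract, so the toy dispatcher below only illustrates the general shape of "external functional commands": instead of emitting raw SQL, the model emits a structured command (the `op`, `where`, and `row` fields here are hypothetical names) that a backend executes against the table.

```python
# Toy command dispatcher illustrating the functional-command idea;
# the command schema below is invented for this sketch.
table = [
    {"city": "Paris", "population_m": 2.1},
    {"city": "Tokyo", "population_m": 13.9},
]

def execute(command, rows):
    """Execute one structured command against a list-of-dicts table."""
    if command["op"] == "query":
        key, value = command["where"]
        return [r for r in rows if r[key] == value]
    if command["op"] == "insert":
        return rows + [command["row"]]
    raise ValueError(f"unsupported command: {command['op']}")

hits = execute({"op": "query", "where": ("city", "Tokyo")}, table)
grown = execute({"op": "insert", "row": {"city": "Lima", "population_m": 10.0}},
                table)
```

Keeping execution in a deterministic backend like this is what lets a system reject malformed or out-of-scope queries before any data is touched, matching the query-rejection behavior the abstract mentions.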
Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States
Short-term probabilistic forecasts of the trajectory of the COVID-19 pandemic in the United States have served as a visible and important communication channel between the scientific modeling community and both the general public and decision-makers. Forecasting models provide specific, quantitative, and evaluable predictions that inform short-term decisions such as healthcare staffing needs, school closures, and allocation of medical supplies. Starting in April 2020, the US COVID-19 Forecast Hub (https://covid19forecasthub.org/) collected, disseminated, and synthesized tens of millions of specific predictions from more than 90 different academic, industry, and independent research groups. A multimodel ensemble forecast that combined predictions from dozens of groups every week provided the most consistently accurate probabilistic forecasts of incident deaths due to COVID-19 at the state and national level from April 2020 through October 2021. The performance of 27 individual models that submitted complete forecasts of COVID-19 deaths consistently throughout this year showed high variability in forecast skill across time, geospatial units, and forecast horizons. Two-thirds of the models evaluated showed better accuracy than a naïve baseline model. Forecast accuracy degraded as models made predictions further into the future, with probabilistic error at a 20-wk horizon three to five times larger than when predicting at a 1-wk horizon. This project underscores the role that collaboration and active coordination between governmental public-health agencies, academic modeling teams, and industry partners can play in developing modern modeling capabilities to support local, state, and federal response to outbreaks.
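One simple way to combine many models' quantile forecasts into a multimodel ensemble, sketched here with made-up numbers rather than actual Hub submissions, is to take the per-quantile median across members:

```python
import statistics

# Three models' predictive quantiles (0.25, 0.5, 0.75) for weekly
# incident deaths; values are invented for illustration.
model_quantiles = {
    "model_a": [90, 100, 115],
    "model_b": [80, 105, 130],
    "model_c": [95, 110, 120],
}

def median_ensemble(quantile_sets):
    """Combine member forecasts by taking the median of each quantile."""
    members = list(quantile_sets.values())
    return [statistics.median(q[i] for q in members)
            for i in range(len(members[0]))]

ensemble = median_ensemble(model_quantiles)  # → [90, 105, 120]
```

A median combination is robust to any single member submitting an extreme forecast, which is one reason quantile-median ensembles tend to be consistently accurate even when individual models vary widely in skill.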
The United States COVID-19 Forecast Hub dataset
Academic researchers, government agencies, industry groups, and individuals have produced forecasts at an unprecedented scale during the COVID-19 pandemic. To leverage these forecasts, the United States Centers for Disease Control and Prevention (CDC) partnered with an academic research lab at the University of Massachusetts Amherst to create the US COVID-19 Forecast Hub. Launched in April 2020, the Forecast Hub is a dataset with point and probabilistic forecasts of incident cases, incident hospitalizations, incident deaths, and cumulative deaths due to COVID-19 at the county, state, and national levels in the United States. Included forecasts represent a variety of modeling approaches, data sources, and assumptions regarding the spread of COVID-19. The goal of this dataset is to establish a standardized and comparable set of short-term forecasts from modeling teams. These data can be used to develop ensemble models, communicate forecasts to the public, create visualizations, compare models, and inform policies regarding COVID-19 mitigation. These open-source data are available via download from GitHub, through an online API, and through R packages.
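Submissions to the Hub are standardized in a long "point and quantile" layout; the snippet below parses a few illustrative rows, with column names that approximate (but may not exactly match) the published schema:

```python
import csv
import io

# Illustrative rows in a long point-and-quantile forecast layout;
# column names approximate the Hub's schema and values are invented.
raw = """target,location,type,quantile,value
1 wk ahead inc death,US,point,NA,1200
1 wk ahead inc death,US,quantile,0.025,950
1 wk ahead inc death,US,quantile,0.5,1180
1 wk ahead inc death,US,quantile,0.975,1500
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Pull out the point forecast and the predictive median
point = [float(r["value"]) for r in rows if r["type"] == "point"]
median = [float(r["value"]) for r in rows
          if r["type"] == "quantile" and r["quantile"] == "0.5"]
```

Storing every model's output in one long, typed layout like this is what makes forecasts directly comparable across dozens of teams without per-model parsing code.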
Learning Towards Better Accuracy and Privacy
The thesis starts from resolving a real-world issue: providing indoor localization services that satisfy contextual and ephemeral needs, e.g., at conferences or exhibition events. We design, implement, and evaluate Tack, a new mobile application framework that uses a combination of known landmark locations, contacts over Bluetooth Low Energy, crowdsourcing, and dead reckoning to estimate and refine user locations. At its core, an inference algorithm integrates all the modules and runs on mobile devices to return accurate position estimates. However, such a crowdsourcing system raises privacy concerns: a user is rarely willing to share their private location with others, yet still wishes to retrieve accurate location estimates from the untrusted crowdsourcing server. Hence the second part of the thesis aims at designing a practical privacy-preserving inference system that produces statistically accurate learning results with privacy guarantees. Extending the inference algorithm to general machine learning frameworks, the last piece of the thesis focuses on the contradictory requirements of accuracy and privacy: a stricter privacy guarantee is usually achieved with degraded learning accuracy, and such degradation is even worse with deep learning. We observe that the fundamental cause of this problem is that the relationship between model accuracy and data privacy is not well characterized, leading to overly strict privacy constraints. We address the problem from an optimization perspective, formulating it as minimizing the accuracy loss given a set of privacy constraints. As a highlight of our privacy mechanism, it is highly robust in the high-privacy regime, and against any change in the neural network structure and experimental settings.
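The accuracy-privacy tension described above can be made concrete with the standard Gaussian mechanism from differential privacy (not the thesis's own mechanism): the noise scale required for an (ε, δ) guarantee grows as ε shrinks, so stricter privacy directly costs accuracy.

```python
import math
import numpy as np

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Release a statistic with (epsilon, delta)-differential privacy
    via the classic Gaussian mechanism calibration (valid for
    epsilon <= 1). Smaller epsilon (stricter privacy) means a larger
    noise scale sigma, i.e., a larger expected accuracy loss."""
    sigma = math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon
    return value + rng.normal(0.0, sigma), sigma

rng = np.random.default_rng(0)
_, sigma_strict = gaussian_mechanism(1.0, 1.0, 0.1, 1e-5, rng)  # epsilon = 0.1
_, sigma_loose = gaussian_mechanism(1.0, 1.0, 0.5, 1e-5, rng)   # epsilon = 0.5
```

Since sigma scales as 1/ε, tightening ε by a factor of five inflates the noise by the same factor, which is exactly the kind of degradation the constrained-optimization formulation tries to minimize.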
Mobile Offloading for Energy-efficient Computation on Smartphones
Mobile offloading enables mobile devices to distribute computation-intensive tasks to the cloud or other devices for energy conservation or performance gains. In principle, the idea is to trade a relatively low communication energy expense for high computation power consumption. In this thesis, we first focus on mobile code offloading to the cloud by proposing the new technique of coalesced offloading, which exploits the potential for multiple applications to coordinate their offloading requests with the objective of saving additional energy on mobile devices. We then turn our attention to collaborative mobile computing, where a group of mobile users with a common target job form coalitions to reduce the overall energy costs. We propose distributed collaboration strategies through game theory, formulating the problem as a non-transferable-utility coalitional game and solving it with merge-and-split rules.
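The basic energy trade-off behind offloading can be sketched with a deliberately simplified model (parameter names and values are illustrative, not from the thesis): offload a task when the energy to transmit its data is lower than the energy to compute it locally.

```python
def should_offload(cycles, cpu_power_w, cpu_speed_hz,
                   data_bits, radio_power_w, bandwidth_bps):
    """Toy offloading decision: compare local compute energy against
    radio transmission energy (both in joules, illustrative model)."""
    e_local = cpu_power_w * cycles / cpu_speed_hz       # time * power
    e_offload = radio_power_w * data_bits / bandwidth_bps
    return e_offload < e_local

# Heavy computation over little data: offloading pays off
heavy_task = should_offload(cycles=5e9, cpu_power_w=2.0, cpu_speed_hz=1e9,
                            data_bits=1e6, radio_power_w=1.0,
                            bandwidth_bps=5e6)

# Light computation over a large payload: compute locally instead
chatty_task = should_offload(cycles=1e6, cpu_power_w=2.0, cpu_speed_hz=1e9,
                             data_bits=1e9, radio_power_w=1.0,
                             bandwidth_bps=5e6)
```

Coalesced offloading refines this trade-off further: by batching several applications' requests into one radio wake-up, the fixed energy cost of bringing the radio out of its idle state is amortized across requests.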