Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
To demonstrate and address the underlying maliciousness, we propose a
theoretical hypothesis and analytical approach, and introduce a new black-box
jailbreak attack methodology named IntentObfuscator, which exploits this
identified flaw by obfuscating the true intentions behind user prompts. This
approach compels LLMs to inadvertently generate restricted content, bypassing
their built-in content security measures. We detail two implementations under
this framework, "Obscure Intention" and "Create Ambiguity", which manipulate
query complexity and ambiguity to effectively evade malicious-intent detection. We
empirically validate the effectiveness of the IntentObfuscator method across
several models, including ChatGPT-3.5, ChatGPT-4, Qwen, and Baichuan, achieving
an average jailbreak success rate of 69.21%. Notably, our tests on
ChatGPT-3.5, which reportedly has 100 million weekly active users, achieved a
remarkable success rate of 83.65%. We also extend our validation to diverse
types of sensitive content, including graphic violence, racism, sexism, political
sensitivity, cybersecurity threats, and criminal skills, further demonstrating the
substantial impact of our findings on enhancing 'Red Team' strategies against
LLM content security frameworks.
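
For context on how such numbers can be reported, the sketch below tallies an average jailbreak success rate from per-model judgments. The boolean judgments and the unweighted averaging over models are illustrative assumptions, not the paper's exact evaluation procedure.

def success_rate(judgments):
    # Percentage of attack attempts judged successful for one target model.
    return 100.0 * sum(judgments) / len(judgments)

def average_success_rate(judgments_by_model):
    # Unweighted mean of the per-model success rates.
    rates = [success_rate(j) for j in judgments_by_model.values()]
    return sum(rates) / len(rates)

# Hypothetical judged attempts (True = restricted content was produced).
example = {
    "ChatGPT-3.5": [True, True, False, True],
    "ChatGPT-4": [True, False, False, True],
    "Qwen": [True, True, False, False],
    "Baichuan": [False, True, True, False],
}
print(f"Average jailbreak success rate: {average_success_rate(example):.2f}%")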
TeleChat Technical Report
In this technical report, we present TeleChat, a collection of large language
models (LLMs) with 3 billion, 7 billion, and 12 billion parameters. It
includes pretrained language models as well as fine-tuned chat models that are
aligned with human preferences. TeleChat is initially pretrained on an
extensive corpus comprising trillions of tokens of diverse English and
Chinese text. Subsequently, the model
undergoes fine-tuning to align with human preferences, following a detailed
methodology that we describe. We evaluate the performance of TeleChat on
various tasks, including language understanding, mathematics, reasoning, code
generation, and knowledge-based question answering. Our findings indicate that
TeleChat achieves comparable performance to other open-source models of similar
size across a wide range of public benchmarks. To support future research and
applications utilizing LLMs, we release the fine-tuned model checkpoints of
TeleChat's 7B and 12B variants, along with code and a portion of our pretraining
data, to the public community.
Comments: 28 pages, 2 figures
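
For readers who want to try the released checkpoints, below is a minimal sketch of loading a TeleChat chat model with the Hugging Face transformers library. The repository id "Tele-AI/telechat-7B" and the use of trust_remote_code are assumptions about how the public release is hosted, not details taken from the report; adjust them to the actual location given in the TeleChat code release.

# Minimal sketch: load an assumed TeleChat 7B chat checkpoint and generate text.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tele-AI/telechat-7B"  # assumed repository id for the 7B variant

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # custom model code, if shipped with the checkpoint
    device_map="auto",
)

prompt = "Briefly explain what a large language model is."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))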