The emergence of LLM (Large Language Model) integrated virtual assistants has
brought about a rapid transformation in communication dynamics. During virtual
assistant development, some developers prefer to leverage the system message,
also known as an initial prompt or custom prompt, for preconditioning purposes.
However, it is important to recognize that an excessive reliance on this
functionality raises the risk of manipulation by malicious actors who can
exploit it with carefully crafted prompts. Such malicious manipulation poses a
significant threat, potentially compromising the accuracy and reliability of
the virtual assistant's responses. Consequently, safeguarding the virtual
assistants with detection and defense mechanisms becomes of paramount
importance to ensure their safety and integrity. In this study, we explored
three detection and defense mechanisms aimed at countering attacks that target
the system message. These mechanisms include inserting a reference key,
utilizing an LLM evaluator, and implementing a Self-Reminder. To showcase the
efficacy of these mechanisms, they were tested against prominent attack
techniques. Our findings demonstrate that the investigated mechanisms are
capable of accurately identifying and counteracting the attacks. The
effectiveness of these mechanisms underscores their potential in safeguarding
the integrity and reliability of virtual assistants, reinforcing the importance
of their implementation in real-world scenarios. By prioritizing the security
of virtual assistants, organizations can maintain user trust, preserve the
integrity of the application, and uphold the high standards expected in this
era of transformative technologies.Comment: Accepted to be published in the Proceedings of the 10th IEEE CSDE
2023, the Asia-Pacific Conference on Computer Science and Data Engineering
202