Quick Points:
- Paper Title: “Semantic Stealth: Crafting Covert Adversarial Patches for Sentiment Classifiers Using Large Language Models”
- Paper Link (via Google Scholar): https://dl.acm.org/doi/pdf/10.1145/3689932.3694758
New research (Nov 2024) shows that LLMs are just as susceptible to adversarial patterns as facial-recognition systems are. What does this mean? Let’s try to understand.
Adversarial Patterns
An “Adversarial Pattern” is a pattern specifically designed to break a targeted Deep Neural Network-based system… so an ‘AI’. Some researchers have created “Adversarial Patches” (https://arxiv.org/pdf/1712.09665): special patterns that cause image classifiers to misclassify in a predictable way. They did this by printing the pattern out as a sticker and placing it in the camera’s field of view.
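To make that concrete, here’s a minimal sketch of how an image-space adversarial patch is usually trained: paste a small trainable patch onto a batch of images and optimize it by gradient descent until the classifier predicts an attacker-chosen class. Everything here (PyTorch, a stock ResNet-18, the patch size and location, the “toaster” target) is an illustrative assumption, not the exact setup from the linked paper.

```python
import torch
import torchvision.models as models

# Illustrative target: any pretrained ImageNet classifier works for the sketch.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# The patch itself: a small trainable image region (size/location chosen arbitrarily here).
patch = torch.rand(3, 50, 50, requires_grad=True)
optimizer = torch.optim.Adam([patch], lr=0.01)
target_class = 859  # ImageNet "toaster", the classic example from the adversarial-patch paper

def apply_patch(images, patch, x=80, y=80):
    """Paste the patch over a fixed region of each image in the batch."""
    patched = images.clone()
    patched[:, :, y:y + patch.shape[1], x:x + patch.shape[2]] = patch
    return patched

def patch_training_step(images):
    """One optimization step: push patched images toward the attacker's target class.

    `images` is assumed to be an (N, 3, 224, 224) batch scaled to [0, 1]
    (normalization omitted for brevity).
    """
    optimizer.zero_grad()
    logits = model(apply_patch(images, patch.clamp(0, 1)))
    labels = torch.full((images.shape[0],), target_class)
    loss = torch.nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()  # the optimizer updates only the patch pixels
    return loss.item()
```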
This has been an active area of research (my own included) for some time, with some very interesting results. For those interested, this whitepaper from Tencent (https://keenlab.tencent.com/en/whitepapers/Experimental_Security_Research_of_Tesla_Autopilot.pdf) shows several attacks against Tesla’s Autopilot, including some interesting perturbation-style (‘semi-random noise’ pattern) attacks.
Why This is Interesting
We knew that image-based systems were essentially Swiss cheese when it came to adversarial patterns (just head to Google Scholar and search ‘adversarial patterns’). What is interesting is that this paper ports a traditionally vision-based attack to a text-based LLM. SUPER interesting.
Remember, kids: under the hood, there’s a Deep Neural Network. Attack not the application, but its foundation.
The Attack
These researchers did away with perturbation-based attacks (“Unconstrained Gradient-Based Attacks” in the paper) because those attacks produce noise-like perturbations. In an image, this looks like … random noise. It’s pretty easy to see that something has been added to the input, which makes the attack relatively easy to detect. In an LLM, this style of attack would produce random characters (letters, symbols, etc.) rather than a true, grammatically correct ‘sentence’.
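For contrast, here’s roughly what an unconstrained gradient-based attack looks like in the image domain, a classic FGSM-style step rather than anything from this paper. The point is that the perturbation touches every pixel and reads as noise:

```python
import torch

def fgsm_perturb(model, x, y, epsilon=0.03):
    """Fast Gradient Sign Method: one unconstrained gradient step on the raw input.

    The resulting delta is spread over every pixel and looks like noise, which is
    why this family of attacks is comparatively easy to spot or filter out.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Nudge every pixel a small step in the direction that increases the loss.
    return (x + epsilon * x.grad.sign()).clamp(0, 1).detach()
```

There’s no clean text analogue of nudging every pixel by a tiny amount, which is why the text version of this idea tends to degenerate into gibberish characters.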
This paper proposes a method that can “generate grammatically correct and semantically meaningful text to craft adversarial patches that seamlessly blend in with the original input text.” The patch hides in plain sight as an ordinary-looking sentence while carrying a hidden attack. Pretty cool.
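The paper steers the LLM against the classifier; as a rough stand-in for that pipeline (my sketch, not the authors’ method), here’s a simple sample-and-rank loop: ask an instruction-tuned LLM for fluent candidate sentences and keep the one that most erodes the classifier’s confidence in the original label. The model names, the prompt, and appending the patch at the end of the review are all assumptions for illustration.

```python
from transformers import pipeline

# Both checkpoints are illustrative stand-ins, not the paper's exact models.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
classifier = pipeline(
    "text-classification", model="distilbert-base-uncased-finetuned-sst-2-english"
)

def craft_patch(review, n_candidates=20):
    """Sample fluent candidate sentences; keep the one that most weakens the true label."""
    prompt = (
        "Write one short, natural-sounding sentence that could appear in a movie review "
        f"and subtly contradicts the sentiment of this review:\n\n{review}\n\nSentence:"
    )
    original = classifier(review)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.98}
    best_patch, best_score = None, float("inf")
    for _ in range(n_candidates):
        out = generator(prompt, max_new_tokens=30, do_sample=True, temperature=0.9)
        candidate = out[0]["generated_text"][len(prompt):].strip().split("\n")[0]
        patched = classifier(review + " " + candidate)[0]
        # Grey-box signal: the classifier's remaining confidence in the original label.
        confidence = (
            patched["score"] if patched["label"] == original["label"] else 1 - patched["score"]
        )
        if confidence < best_score:
            best_patch, best_score = candidate, confidence
    return best_patch, best_score
```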
Finally, what did they actually do to prove this?
“We demonstrate the feasibility of our approach using open-source LLMs, including Intel’s Neural Chat, Llama2, and Mistral-Instruct, to generate adversarial patches capable of altering the predictions of a distilBERT model fine-tuned on the IMDB reviews dataset for sentiment classification.”
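If you wanted to sanity-check that claim yourself, a hedged sketch of the evaluation loop might look like this: classify each review before and after adding a patch and count label flips. The lvwerra/distilbert-imdb checkpoint is a public stand-in for the authors’ own fine-tune, and craft_patch is the hypothetical helper sketched above (in practice you’d point it at the same classifier you’re evaluating).

```python
from datasets import load_dataset
from transformers import pipeline

# Assumption: a public IMDB-fine-tuned DistilBERT stands in for the paper's own fine-tune.
clf = pipeline("text-classification", model="lvwerra/distilbert-imdb", truncation=True)
reviews = load_dataset("imdb", split="test").shuffle(seed=0).select(range(100))

flips = 0
for row in reviews:
    before = clf(row["text"])[0]["label"]
    patch, _ = craft_patch(row["text"])  # hypothetical helper from the sketch above
    after = clf(row["text"] + " " + patch)[0]["label"]
    flips += before != after

print(f"Prediction flip rate: {flips / len(reviews):.0%}")
```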
The Caveat (There’s Always One)
The attack requires grey-box training against the target model, meaning the attackers need some access to the target model’s internal scores while crafting the patches. This significantly limits the attack’s “real-world” applicability.
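Concretely, “access to internal scores” means the attacker can read the classifier’s output probabilities rather than just its predicted label. A quick sketch of the difference (the SST-2 DistilBERT checkpoint is an illustrative stand-in for the paper’s IMDB model):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; the paper targets its own IMDB-fine-tuned distilBERT.
name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

inputs = tokenizer("A surprisingly heartfelt film.", return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

print(probs)             # grey-box signal: a smooth score the attacker can optimize against
print(probs.argmax(-1))  # black-box signal: only the predicted label, far less feedback
```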