Caleb’s Guide to Getting Started in Attacking AI

You might have heard about them. Maybe you saw a headline or read an article talking about them. Or maybe you overheard a conversation. Regardless of where or how, more people are talking about attacks on AI or Machine Learning. Which I find fantastic!

For me, It’s been almost 10 years. My master’s degree was situated around using machine learning to attack other systems, and now my PhD is focused on attacking the very algorithms that facilitated my previous research.

But what if someone wanted to learn about these attacks? Or even learn some of them? Unfortunately the information around attacking AI is scattered between academia and industry, including papers, random websites, blogs (perhaps I am a contributor to this problem), social media posts, and the like. So I did some research on some of the more pertinent areas to provide the list of resources below.

This is not an exhaustive list by any means (and I plan to update it regularly), but it is a great start for anyone interested.

The list below is broken up into a few categories:

“Intro and General“: This is for new-comers who are interested in the area but don’t know where to start. It includes some overviews of attacking AI, some more in-depth resources that cover all the bases (but might be more on the ‘verbose’ side), and some articles/topics that don’t fall into the other categories.
“Practical“: This category includes practical attacks or instructions on AI or Machine Learning (AI/ML). Some of the other categories might have some practical content too, but this is where most of the more applied attacks will go.
“Assessment“: This category is more about methodology. There is an approach to attacking a system, and the methodology is important. This category will include topics that I find useful in assessing a system. It will include things that may help find “bad code smells”, or red flags, but could also include general advice.
“Academic“: This category will be less practical but represents the bleeding-edge of research in the area. I will update this area from time to time with sources (not necessarily papers, but sites where you can find your own research) to keep things fresh.

So let’s get started.

Intro and General

A good intro to attacking AI

https://www.belfercenter.org/publication/AttackingAI
From Harvard, 2019, and still applicable as an entry point into the area.
Gives a great summary and examples of various styles of attack.

NIST “Quick” Intro to Attacks and Taxonomy

https://www.nist.gov/news-events/news/2024/01/nist-identifies-types-cyberattacks-manipulate-behavior-ai-systems
Actual Taxonomy and document (Very verbose, but this is NIST): https://csrc.nist.gov/pubs/ai/100/2/e2023/final
Although NIST creates some of the most verbose/dense documentation around, the content tends to be good — once you get to it.
This is an article about some of the attacks they are seeing, and the second link is their document that tracks all of the attacks they have seen. If you want an information dump of all these attacks, this might be a good source. Just … bring coffee.

An Intro to Adversarial Image Attacks

https://www.unite.ai/why-adversarial-image-attacks-are-no-joke/
This is a great end-to-end example of an attack against image-based AI systems

Intro to Prompt Injection Attacks

https://research.nccgroup.com/2022/12/05/exploring-prompt-injection-attacks/
A good place to start with Large Language Model (LLM) prompt attacks

Practical

PortSwigger Academy: LLM Attacks

https://portswigger.net/web-security/llm-attacks
Attacking Large Language Models (LLMs)
Gives a great introduction to what LLMs are then dives right into showing how they can be attacked
Provides labs to practice what you just learned in a safe environment

OWASP Machine Learning Security Top Ten

https://owasp.org/www-project-machine-learning-security-top-10/
OWASP has a few top tens, and all of them are good. This is their Machine Learning (ML) security top ten.
Each item in their top ten includes a description, advice on preventing the given attack, various risk factors, and then example attack scenarios.
Although these are not “learn to attack models” in purpose, we can derive useful information about attacks and learn to conduct them ourselves.

Learn Prompt Injection Attacks (With Lab)

https://learnprompting.org/docs/prompt_hacking/injection
Practical learning with areas to apply what you learned

Gandalf LLM Pentesting Lab

https://gandalf.lakera.ai/
Classic pentesting / CTF format, but for LLMs. Is quite fun.

Assessment

OWASP AI Security and Privacy Guide

https://owasp.org/www-project-ai-security-and-privacy-guide/
A general rule of thumb when assessing privacy and security adherence is simply to check the standard:
- If it says “MUST”, check that the system actually does that.
- If it says “MUST NOT”, check that the system actually does NOT do that.
- You will be surprised how often a system only partially implements the standard.
As an offensive researcher, we are looking for red flags. If the standard says “reduce the access to sensitive information”, and you see that there might be some sensitive information in the training set, that indicates that the standard/guideline was not followed well. Perhaps there are other more glaring issues there too?
This is one such guide but there are many others. Read the guidelines. Understand what is expected by them, and look for where a system falls short of those expectations.
Caveat: Some items in a standard/guideline may not apply to an attacker. It is up to you to figure out what is applicable and what is not.

Academic

If you are looking for publications, you might want to start here:

Offensive AI Lab: https://offensive-ai-lab.github.io/
Papers with Code: https://paperswithcode.com/
NOTE: There are plenty of publications on google scholar if you search for “adversarial machine learning” or “attacking AI”

Specific Publications of Note:

https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html
- The authors are able to extract some of the original training data from a model
https://www.usenix.org/conference/usenixsecurity23/presentation/tao
- “General” adversarial patch attack that fools object detection (image)

Caleb Shortt

Technology and Other Interesting Topics

Caleb’s Guide to Getting Started in Attacking AI

Intro and General

Practical

Assessment

Academic

Intro and General

Practical

Assessment

Academic

Share this: