You might have heard about them. Maybe you saw a headline or read an article talking about them. Or maybe you overheard a conversation. Regardless of where or how, more people are talking about attacks on AI or Machine Learning. Which I find fantastic!
For me, It’s been almost 10 years. My master’s degree was situated around using machine learning to attack other systems, and now my PhD is focused on attacking the very algorithms that facilitated my previous research.
But what if someone wanted to learn about these attacks? Or even learn some of them? Unfortunately the information around attacking AI is scattered between academia and industry, including papers, random websites, blogs (perhaps I am a contributor to this problem), social media posts, and the like. So I did some research on some of the more pertinent areas to provide the list of resources below.
This is not an exhaustive list by any means (and I plan to update it regularly), but it is a great start for anyone interested.
The list below is broken up into a few categories:
- “Intro and General“: This is for new-comers who are interested in the area but don’t know where to start. It includes some overviews of attacking AI, some more in-depth resources that cover all the bases (but might be more on the ‘verbose’ side), and some articles/topics that don’t fall into the other categories.
- “Practical“: This category includes practical attacks or instructions on AI or Machine Learning (AI/ML). Some of the other categories might have some practical content too, but this is where most of the more applied attacks will go.
- “Assessment“: This category is more about methodology. There is an approach to attacking a system, and the methodology is important. This category will include topics that I find useful in assessing a system. It will include things that may help find “bad code smells”, or red flags, but could also include general advice.
- “Academic“: This category will be less practical but represents the bleeding-edge of research in the area. I will update this area from time to time with sources (not necessarily papers, but sites where you can find your own research) to keep things fresh.
So let’s get started.
Intro and General
A good intro to attacking AI
- https://www.belfercenter.org/publication/AttackingAI
- From Harvard, 2019, and still applicable as an entry point into the area.
- Gives a great summary and examples of various styles of attack.
NIST “Quick” Intro to Attacks and Taxonomy
- https://www.nist.gov/news-events/news/2024/01/nist-identifies-types-cyberattacks-manipulate-behavior-ai-systems
- Actual Taxonomy and document (Very verbose, but this is NIST): https://csrc.nist.gov/pubs/ai/100/2/e2023/final
- Although NIST creates some of the most verbose/dense documentation around, the content tends to be good — once you get to it.
- This is an article about some of the attacks they are seeing, and the second link is their document that tracks all of the attacks they have seen. If you want an information dump of all these attacks, this might be a good source. Just … bring coffee.
An Intro to Adversarial Image Attacks
- https://www.unite.ai/why-adversarial-image-attacks-are-no-joke/
- This is a great end-to-end example of an attack against image-based AI systems
Intro to Prompt Injection Attacks
- https://research.nccgroup.com/2022/12/05/exploring-prompt-injection-attacks/
- A good place to start with Large Language Model (LLM) prompt attacks
Practical
PortSwigger Academy: LLM Attacks
- https://portswigger.net/web-security/llm-attacks
- Attacking Large Language Models (LLMs)
- Gives a great introduction to what LLMs are then dives right into showing how they can be attacked
- Provides labs to practice what you just learned in a safe environment
OWASP Machine Learning Security Top Ten
- https://owasp.org/www-project-machine-learning-security-top-10/
- OWASP has a few top tens, and all of them are good. This is their Machine Learning (ML) security top ten.
- Each item in their top ten includes a description, advice on preventing the given attack, various risk factors, and then example attack scenarios.
- Although these are not “learn to attack models” in purpose, we can derive useful information about attacks and learn to conduct them ourselves.
Learn Prompt Injection Attacks (With Lab)
- https://learnprompting.org/docs/prompt_hacking/injection
- Practical learning with areas to apply what you learned
Gandalf LLM Pentesting Lab
- https://gandalf.lakera.ai/
- Classic pentesting / CTF format, but for LLMs. Is quite fun.
Assessment
OWASP AI Security and Privacy Guide
- https://owasp.org/www-project-ai-security-and-privacy-guide/
- A general rule of thumb when assessing privacy and security adherence is simply to check the standard:
- If it says “MUST”, check that the system actually does that.
- If it says “MUST NOT”, check that the system actually does NOT do that.
- You will be surprised how often a system only partially implements the standard.
- As an offensive researcher, we are looking for red flags. If the standard says “reduce the access to sensitive information”, and you see that there might be some sensitive information in the training set, that indicates that the standard/guideline was not followed well. Perhaps there are other more glaring issues there too?
- This is one such guide but there are many others. Read the guidelines. Understand what is expected by them, and look for where a system falls short of those expectations.
- Caveat: Some items in a standard/guideline may not apply to an attacker. It is up to you to figure out what is applicable and what is not.
Academic
If you are looking for publications, you might want to start here:
- Offensive AI Lab: https://offensive-ai-lab.github.io/
- Papers with Code: https://paperswithcode.com/
- NOTE: There are plenty of publications on google scholar if you search for “adversarial machine learning” or “attacking AI”
Specific Publications of Note:
- https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html
- The authors are able to extract some of the original training data from a model
- https://www.usenix.org/conference/usenixsecurity23/presentation/tao
- “General” adversarial patch attack that fools object detection (image)