Automating Red Teaming for Scalable AI Trust

Background

Earlier this year, Salesforce’s Office of Ethical and Humane Use and Offensive Security teams performed security and ethics assessments on Salesforce’s first version of Copilot, now Agentforce, an autonomous AI agent for CRM use cases. The assessments focused on using direct prompt injection to bypass internal and external model guardrails used at various points in the LLM interaction lifecycle to prevent offensive, irrelevant, or incorrect outputs. During the assessment, we wanted to automatically modify prompts utilizing well-known prompt injection methods while maintaining the original prompt’s semantics. 

As described in our introductory post on red teaming, automated approaches enhance human-driven testing by using scripts, algorithms, and software tools to simulate numerous attacks or adversarial scenarios quickly. This allows us to test systems against thousands of attack vectors, providing a broad assessment of potential vulnerabilities. These tests can be consistently repeated, making them cost-effective and enabling large-scale testing with minimal human intervention.

Fortunately, a member of Salesforce’s Offensive Security team with expertise in vulnerability research and exploit development noticed that the problem (programmatically altering prompt injection attacks while preserving the prompt’s semantics) resembled something a “fuzzer” could solve. A fuzzer is a tool that automatically injects data into a program to find vulnerabilities and bugs, and this team member specialized in writing them.

We started by examining prior work and found a paper published a few months earlier titled FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models. Bolstered by the discovery of a similar idea and its promising results, we drew on our past experience using fuzzing libraries (LibAFL) and writing them (feroxfuzz) to design our own framework for fuzzing Large Language Models (LLMs), named fuzzai (pronounced fuzz-ay-eye). Fuzzai is an internal Salesforce library for automated red teaming, led by Salesforce’s Offensive Security team with major contributions from the Office of Ethical and Humane Use. We found further validation roughly two months later when Microsoft released its own AI red teaming toolkit, PyRIT, whose high-level architecture and design have many parallels to fuzzai.

It has been nine months since the inception of fuzzai. The internal-only library has powered many use cases within Salesforce, ranging from generating synthetic data to conducting various assessments and automated red teaming, which all help us achieve faster, high-quality evaluations. In this article, we’ll discuss fuzzai’s architecture, how it’s being used at Salesforce, and where we see its development going in the coming months.

Architecture

The Importance of Abstraction

When designing a system for automated red teaming, focusing on abstract components using object-oriented programming is extremely important. Not only does it make the development of the system easier, but it also makes the system flexible enough to serve different use cases: some teams may use it for automated red teaming, while others may use it for synthetic data generation. Below, we discuss the main abstractions we use to enable this kind of flexibility.

Diagram of an automated pipeline using all the fuzzai components

Models

The ‘Models’ class is primarily responsible for connecting to the appropriate APIs for specific models, whether open source or closed source. This means handling any request parameters necessary, such as decoding parameters or the prompts to pass to the model. ‘Models’ are thus the foundation for two primary objects in fuzzai:

  1. ‘Mutators’: These are a subclass of ‘Models’ that include a ‘mutation’ function. In our library, the mutation function orchestrates specific prompt strategies to mutate, or alter, some text into a different version with similar semantic meaning. This is great for generating a variety of prompt injection attacks that work the same way but use different words, or for creating toxic text from benign text.
  2. ‘Judges’: These are a subclass of ‘Models’ that include a ‘classify’ function. These can be generative or discriminative models that classify specific text, or even a rule-based engine if appropriate for the task. Overriding the ‘classify’ function allows developers to define the exact labeling scheme they want, which can be used for automated labeling or red teaming.

Having an abstract ‘Models’ class lets us flexibly support the specific model instantiations each use case requires. This abstraction also makes it easier to develop our own tooling that interacts with the class’s API seamlessly.
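
As a concrete sketch, the ‘Models’ abstraction and its two subclasses might be expressed roughly as follows. The class names, constructor parameters, and the ‘generate’ and ‘mutate’ method names are illustrative assumptions for this post, not fuzzai’s exact API.

from abc import ABC, abstractmethod


class Model(ABC):
    """Owns the connection details and request parameters for one model API."""

    def __init__(self, endpoint: str, temperature: float = 0.7):
        self.endpoint = endpoint          # API endpoint for the backing model
        self.temperature = temperature    # example decoding parameter

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Send the prompt to the backing API and return the raw completion."""


class Mutator(Model):
    """A model that rewrites text while preserving its semantic meaning."""

    @abstractmethod
    def mutate(self, text: str) -> str:
        """Return an altered version of the text with similar semantics."""


class Judge(Model):
    """A model (or rule-based engine) that assigns a label to text."""

    @abstractmethod
    def classify(self, text: str) -> str:
        """Return a label such as 'refusal' or 'prompt_injection'."""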

Messages and Records

Defining data classes standardizes how data passes through a system. Two standard classes we use are ‘Messages’ and ‘Records’.

‘Messages’ are primarily used for passing data to external systems. Thus, any pre- or post-processing functions we build will take in ‘Messages’ as the standard data type. As discussed above, ‘Models’ primarily work with ‘Messages’ as inputs to send to external model APIs.

‘Records’ are primarily used to communicate with components internal to the package. These are essentially a metadata wrapper around ‘Messages’, enabling the storage and saving of entire conversations with logs and classifications. Moreover, since ‘Records’ contain classification and error metadata, they interact with our ‘Analysis’ module, which helps automate the calculation of machine learning metrics for reporting purposes.
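
As a rough illustration of how the two classes relate (the field names below are assumptions for this sketch, not fuzzai’s actual schema), consider:

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Message:
    """Payload passed to external systems such as model APIs."""
    role: str       # e.g. "user" or "assistant"
    content: str    # the prompt or completion text


@dataclass
class Record:
    """Metadata wrapper around Messages, used by components internal to the package."""
    messages: list[Message] = field(default_factory=list)  # the full conversation
    classification: Optional[str] = None                   # label assigned by a Judge
    error: Optional[str] = None                            # e.g. a failed mutation or API error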

Labels

‘Labels’ are what we use to represent classifications and errors, as well as the specific metrics we want to compute. Having an abstract class for labeling makes our other components more flexible.

For example, sometimes the model backing a ‘Mutator’ fails to mutate text. This can happen for a myriad of reasons, such as guardrails on the model, jailbreak text working against the model, or the model answering the request contained in the text instead of mutating it. When this happens, the output is not a usable mutation. Thus, when automating these processes, we need some way of detecting and resolving these errors. For those situations, we’ve developed ‘Judges’ that validate the output of ‘Mutators’, where the ‘classify’ function is overridden with specific functionality to detect these kinds of ‘Mutator’ failures.
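
As one hedged illustration, a validating judge might look something like the sketch below; the label values and the refusal heuristic are placeholders for whatever a specific use case requires, not fuzzai’s built-in logic.

from enum import Enum


class MutationLabel(Enum):
    SUCCESS = "success"      # output is a usable mutation
    REFUSAL = "refusal"      # the backing model declined to mutate
    ANSWERED = "answered"    # the model answered the request instead of mutating it


class MutationValidator:
    """In fuzzai, this role is filled by a 'Judge' subclass with 'classify' overridden."""

    REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")

    def classify(self, text: str) -> MutationLabel:
        lowered = text.lower()
        if any(marker in lowered for marker in self.REFUSAL_MARKERS):
            return MutationLabel.REFUSAL
        # Further checks (e.g. semantic similarity to the seed prompt) would go here.
        return MutationLabel.SUCCESS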

Input/Output (I/O)

Data can come in many formats, such as JSON, JSONL, CSV, plain text, etc. Having a standard I/O layer provides a common functional interface for searching through, loading, and saving data without worrying about the specific file type. Thus, when developing components of fuzzai, the specific data I/O formats are abstracted away and we can focus on the internal logic of the package.
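
One way to picture this interface, as a sketch rather than fuzzai’s actual API (the class and method names here are assumptions), is a family of sources that all yield the same in-memory representation:

import csv
import json
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Yields prompt strings regardless of the underlying file format."""

    @abstractmethod
    def load(self) -> list[str]:
        ...


class CSVSource(DataSource):
    def __init__(self, path: str, column: str = "prompt"):
        self.path, self.column = path, column

    def load(self) -> list[str]:
        with open(self.path, newline="") as f:
            return [row[self.column] for row in csv.DictReader(f)]


class JSONLSource(DataSource):
    def __init__(self, path: str, key: str = "prompt"):
        self.path, self.key = path, key

    def load(self) -> list[str]:
        with open(self.path) as f:
            return [json.loads(line)[self.key] for line in f if line.strip()]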

Pipeline

‘Pipeline’ orchestrates the automation of the above components and subcomponents. We have intentionally designed it with flexibility in mind so that teams can choose whether to use all the components for automated red teaming or only specific components for their own use cases. To make this more concrete, here is some pseudocode for how we construct pipelines:

Python code: 
mutator1 = Bias()                # mutators rewrite input text while preserving its semantics
mutator2 = Paraphrase()

model = TargetModel()            # the model or application under test

judge = BiasDetector()           # judges classify the target model's responses
validator = AppropriateOutput()  # validators check that mutated text is usable

input_data = CSV('some/path/to/data.csv')           # source: seed data to mutate
output_format = JSONL('some/path/to/output.jsonl')  # sink: where results are written

pipeline = Pipeline(
    mutators=[mutator1, mutator2],
    models=[model],
    judges=[judge],
    validators=[validator],
    sources=[input_data],
    sinks=[output_format]
)

pipeline.run()

For machine learning engineers and data scientists alike, this should resemble the kinds of components that scikit-learn, Hugging Face, and spaCy employ. Essentially, it is an easy and flexible development framework that handles the automation, orchestration, logging, and error handling of the different components. Passing in only `Mutators` and `Validators` yields a synthetic data pipeline; passing in only `Judges` creates an automated labeling pipeline. Thus, this abstraction supports many use cases while giving engineering teams an easy way to implement their own.
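
As a hedged example of that flexibility, continuing the illustrative pseudocode above (whether `Pipeline` accepts a partial set of components exactly like this is an assumption for this sketch), the two reduced pipelines might be assembled as:

# Synthetic data generation: mutate seed prompts and keep only valid mutations.
synthetic_pipeline = Pipeline(
    mutators=[Paraphrase()],
    validators=[AppropriateOutput()],
    sources=[CSV('some/path/to/seeds.csv')],
    sinks=[JSONL('some/path/to/synthetic.jsonl')]
)
synthetic_pipeline.run()

# Automated labeling: classify existing model outputs with a Judge, no mutation step.
labeling_pipeline = Pipeline(
    judges=[BiasDetector()],
    sources=[JSONL('some/path/to/model_outputs.jsonl')],
    sinks=[JSONL('some/path/to/labeled.jsonl')]
)
labeling_pipeline.run()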

Applications at Salesforce

Fuzzai has been used in a variety of use cases over the past nine months, ranging from synthetic data generation to semi-automated testing to automated red teaming. Examples include:

Prompt injection defense data generation

While building a Prompt Injection Detection model destined for Salesforce’s Trust Layer, we partnered with the AI Research team to design and build a mutate, train, and test pipeline. We used both public and private datasets as seeds for fuzzai’s `Mutators` to create the synthetic data needed to further enhance the training set. After training, we ran a fuzzai `Pipeline` that prompted the model with our standard and adversarial prompts, labeled the output, and analyzed the results. After a handful of iterations, the pipeline helped improve the model’s recall by 27%, improve balanced accuracy by 15%, and reduce its false positive rate to one-third of the previous model’s value.

Reduction in human exposure to toxic content

To prevent unnecessary employee exposure to harmful image content, fuzzai was used to build a system in which image guardrails could be tested and analyzed without any human interaction. Given an unsafe image as input, we used one of fuzzai’s out-of-the-box refusal `Judges` to help measure the refusal rates of LLMs. In addition to reducing the number of human hours spent annotating from scratch, it helped ensure compliance with our moderator safety policy and legal requirements regarding toxic content exposure, which is imperative in a high-risk area like multimodal toxicity.

Automated testing for products and features

We’ve also used fuzzai to automatically evaluate and analyze a myriad of Salesforce AI models, applications, and components. Some notable assessment targets include Einstein Copilot, Agentforce for Developers, Agentforce Sales Development Representative, and fine-tuned variants of our public xGen LLMs, such as xGen-Sales. One key aspect of these assessments is devising context-specific model inputs. In addition to sending prompts to the models, judging responses, and labeling results, we also use fuzzai’s ‘Mutators’ to expand context-specific inputs into more robust and complete datasets. ‘Mutators’ augment context-specific data points with more linguistic diversity and different risk vectors. This method adds both breadth and depth to our assessments and has led to the discovery of nearly 60 findings across eight Salesforce technologies, all of which were reported and triaged by product teams.

Future Work: Agents and Red Teaming

We are pursuing several directions related to agents and red teaming. We recently teamed up with Salesforce’s AI Research team to develop a method for generating context-specific inputs, which greatly accelerates the red teaming of specific product features. For example, to red team a call transcript summarization feature, it would be ideal to provide basic documentation about the product feature to a red teaming agent, which then rapidly generates attack strategies and a seed set of example prompts that in turn get mutated by our `Mutator` module. Recent research suggests that the ability to automatically reason through different attack strategies will be important in building such a red teaming agent. This presents an exciting challenge for us: continuing to add more attack strategies to the fuzzai package (many interesting ones have come out recently, such as many-shot jailbreaking) and intelligently deciding between them.

In parallel, our team is starting to take on work beyond red teaming chat models: red teaming agentic systems that can call tools and autonomously execute actions. Here, we have found that red teaming in an environment similar to the one in which LLM agents operate will be necessary, since the risks we need to uncover may only surface when an LLM agent interacts with its environment. For example, a privacy risk may only surface in a function call that retrieves information from a Salesforce database. Additionally, these agentic systems may consist not of a single LLM agent but of multiple LLM agents, some of which perform specialized tasks and then communicate back to a more general agent. To this end, our team is starting to investigate options for agent sandboxes: synthetic environments that approximate the environments in which multiple LLM agents operate and interact with one another.

Acknowledgements: 

Thanks to Jason Ross and Srinivas Batchu, core members of Salesforce’s Offensive Security team, for their insights and contributions to fuzzai.
