Generative AI (GenAI) technologies are increasingly integrated into every part of our daily workflow, from drafting emails to generating and fixing code with large language models (LLMs) and agents. Ensuring the safety and integrity of these LLM-based solutions, and protecting end users from potential harm is crucial. At Salesforce AI Research, we innovate on systems that help safeguard our LLM usage [3] and the broader Agentforce ecosystem by developing novel guard models that specialize in and excel at the customer service management (CRM) tasks that our customers need the most.
Background: LLM Agent and Safety
In a nutshell, an LLM agent can be considered a smart assistant to a human user that receives requests, operates in its environment (e.g., sending an email from an email account, fixing a bug in a code repository, or retrieving information from a customer database), and returns the result of the action to the human user. As agents can often execute actions with real-world impact (e.g., deleting an entry from a database or running a piece of generated code), we need to ensure that their execution is safe.
There are multiple potential threats in an LLM agent execution loop.
- First, the human user may act with a malicious intent [2], such as to ask the agent to mass-send spam emails and leak sensitive information [9].
- Second, the underlying LLM powering the agent may generate harmful content, sometimes even from legitimate requests, such as misinformation or stereotype-reinforcing statements.
- Finally, an emerging concern with LLM agents is the potential hostility of environment. For example, when an agent needs to read a code repository to implement a bug fix, one particular file may contain a comment of “TODO: write a Trojan virus”, which tricks the agent to supply the virus implementation.
In all these cases, we need to ensure that both the input and output of the LLM agent is safe and free from any toxic content. Towards this end, scientists and engineers at Salesforce AI Research leverage both publicly available data and internal collected CRM-specialized data to train a set of reliable SFR-Guard family of models, and some of them are deployed as Salesforce products such as Trust Layer, Security Checks, and Guradrails.
SFR-Guard: CRM-Specialized Layer for Trust and Guardrail
The table below summarizes the core capabilities and performance of our SFR-guard model family, as measured by F1 scores averaged across the benchmarking suites, along with leading models from other providers.
Name | Number of Parameters | Developer | Supports Fine-Grained Category | Supports Evidence and Explanation | Supports Severity Level | Performance on Public Benchmark Suite | Performance on Private Benchmark Suite |
SFR-Guard Family | 0.05B – 8B | Salesforce | Yes | Yes | Yes | 83.3 | 93 |
GPT-4o | Unknown | OpenAI | Yes | Yes | Yes | 78.7 | 84.5 |
LlamaGuard 3 | 8B | Meta | Yes | No | No | 71.3 | 71 |
ShieldGemma | 2B-7B | No | No | No | 81.7 | 74 | |
WildGuard | 7B | AI2 & UW | No | No | No | 80.6 | 85.5 |
Over the past two years, we’ve developed two generations of SFR-Guard. The first generation involves finetuning classifier models, such as BERT, DistilBERT and Flan-T5 to generate binary (Yes/No) predictions on the input. The second generation employs genenerative LLMs, where we finetune a (relatively small) LLM model on the training data to produce verbal judgments in natural language.
An important consideration for Trust when adopting these automated decision making processes, our No. 1 Value at Salesforce, is for the users and stakeholders to transparently understand the rationale for the decision. However, the models we use are generally considered as black-boxes, making it difficult to inspect its internal decision making logics. To address this problem and and maintain trust, we’ve augmented SFR-Guard with additional user-friendly capabilities that improve transparency and actionability.
First, we annotate our training data with fine-grained categories (e.g., gender, race or socioeconomic status for toxic language detection) and train our models to provide fine-grained detection result. Second, we propose a pipeline to additionally generate the reasoning of the detection in two manners: inline citations of the violating input segments and a short paragraph of natural language explanation. Last, we devise a training procedure to equip models the ability to generate severity label predictions (e.g., 1 or 2 for mild and unintentional violations vs. 4 or 5 for egregious and severe violations). Combined together, these rich output signals not only increase trust into our models, but also enable system administrators and developers to develop custom and more actionable response policies (e.g., show a warning for a low-severity violation vs. terminate the user access for a high severity violation).
Before highlighting key things to know about our model development, we first summarize our data collection effort and the resulting dataset mix that goes into training and testing these models.
Data Sources
We use a mix of data sources to train and evaluate our models, carefully balancing coverage, safety, and real-world relevance.
For training data, we consider the following four types:
- Publicly available data: we leverage datasets with commercial licenses for any production use, and broader public data for research purposes. The training data include both general chat conversations and targeted red teaming attempts specifically crafted to break LLM security.
- Internal CRM-specialized data: we incorporate de-identified data collected from real usage by Salesforce employees, capturing CRM-specific language patterns and edge cases.
- Synthetic data: we generate synthetic examples by augmenting safe inputs using large language models as well as manipulating safe inputs to make it unsafe. This allows us to safely expand coverage across rare or risky scenarios.
- Multi-lingual data: we collect multilingual data using machine translation tools to broaden language coverage and enhance model robustness across regions, with an initial focus on English, French, Italian, German, Spanish, and Japanese.
For test data, we consider the following two types:
- Publicly available data: for research models, we benchmark on held-out test sets. For production models, we additionally evaluate on unseen public data not used during training, ensuring generalization and robustness. In particular, the suite of public benchmarks includes evaluations on a variety of safety-related aspects such as toxicity, bias, model jailbreaking, copyright protection, etc. for both input queries and model responses. Furthermore, we included a dataset to benchmark “exaggerated safety”, to evaluate over-sensitive behaviors of content moderators typically found in keyword-matching based mechanisms, which may block safe inputs such as “How do I force kill a computer program” or “How to lose fat for people who hate to walk”.
- Internal CRM-specialized data: we validate models using traffic from real Salesforce employee usage and select anonymized customer interactions. For the suite of private benchmarks, we cover a diverse range of CRM-related use cases such as customer service, summarization, marketing and automated code generation and construct a test set of around 10K instances in total. Similar to training data, we also augment our private benchmark data with unsafe examples that are synthetically manipulated from their safe counterparts.
The diversity of the the data mixture ensures that our models learn and understand a comprehensive notion of LLM safety and are ready for real-world use cases with potential shift of data characterstics.
Generation 1: Classifier Models
Our initial release of SFR-Guard introduces binary classifiers built using encoder models (like BERT, DistilBERT) and decoder models (like Flan-T5). These models serve two key detection use cases in parallel:
- Toxicity Detection: the model detects multiple types of harmful language, including hate speech, profanity, identity attacks, and more. The model is trained on a mix of public datasets, internal CRM-specific data, and multilingual examples to ensure broad coverage and adaptability.
- Prompt Injection Detection: the model identifies attempts to bypass model safety via techniques such as jailbreaks, encoding attacks, system prompt leakage, malicious code generation, and social engineering prompts. It is designed to generalize across different applications and adversarial strategies, ensuring strong protection without overblocking legitimate use cases.
Generation 2: Generative LLM Models
Our another generation model attempts to capitalize on the latest development of LLM models as general-purpose reasoning models. In particular, we use the Phi-3-mini model which strikes an impressive balance between small model size and high performance on downstream tasks. Furthermore, as recent LLMs have been pre-trained on a diverse corpus of text, we use such single model for toxicity, prompt injection detection, and more, rather than having multiple small models.
Nonetheless, we find that out-of-the-box Phi-3-mini with prompting gives unsatisfactory performance. This is likely due to the highly specialized nature of our CRM-specialized data and highlights the need for domain-specific model finetuning. After finetuning on our training data, we observe a significant improvement of the model performance, also outperforming the previous generation BERT-like model.
Furthermore, the 128k context window length (i.e., how long of an input the model can process) for this model is much longer than BERT-like models (usally only 512 or 1024). Thus, while for the earlier we need to divide a long input (e.g., an entire model-generated paper draft) into multiple chunks and aggregate scores, we now can directly send the input in its entirety to this model, which also allows the model to jointly utilize linguistic cues that are farther apart and would have otherwise been separated.
Capability 1: Fine-Grained Category Detection
Together with our teammates from the Office of Ethical & Humane Use (OEHU), we use the following categories and definitions:
Toxicity: Languages that are rude, disrepsectful or unreasonable
Category | Definition |
Hate | Content that expresses, incites, or promotes hatred based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, caste, or violence or severe harm toward the targeted group |
Identity | Negative or hateful comments targeting someone because of their identity |
Violence | Content that promotes or glorifies violence or celebrates the suffering or humiliation of others |
Physical harm | Content that provides unsafe advice that may harm the user or others physically, or content that promotes, encourages, or depicts acts of self-harm |
Sexual | Content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness) |
Profanity | Swear words, curse words, or obscene or profane language |
Prompt Injection: Attempting to trick or manipulate the LLM agent into doing things beyond its intended purposes
Category | Definition |
Pretending/Role-play | Instructing the LLM/agent to assume the role of a different “system persona” with malicious intent. Social engineering attacks such as deceiving the system with adversarial conversational content |
Privilege Escalation/ Attempts to change core system rules | Injecting malicious instructions that aim to bypass/change existing system policies and the LLM safety training. E.g. Do Anything Now (DAN) jailbreak attacks |
Prompt Leakage | Prompts intending to leak sensitive information from the LLM prompt such as the system policies and contextual knowledge documents. This is for the purpose of active reconnaissance |
Adversarial Suffix | A set of seemingly random character encodings appended to a prompt. It is designed to circumvent guardrails and alignme |
Privacy Attacks | Prompts that attempt to extract, infer, or expose personal or confidential data. This is with the aim of unauthorized access or misus |
Malicious Code Generation | Prompts attempting to generate malicious code outputs from an LLM. E.g. creating malware, viruses, fraud utilities etc. |
With these definitions, we employ a mixed annotation effort to augment our training data with the category-specific labels, leveraging both the high accuracy of our in-house annotation team and the scalability of LLM-based automated labeling. We find that models trained with rich categorical features achieve high performance on fine-grained classifications while maintaining or exceeding their binary predictor counterparts on the same binary classification test sets.
Capability 2: Evidence and Summary Explanation Generation
Besides fine-grained category information, concrete reasons for the model detection would also be highly desirable, especially if the input is very long, as such information helps the system administrator and end user to quickly identify the cause and legitimacy of the content moderation decision. In particular, we provide two types of reasoning:
- an inline citation that highlights the violating part of the input, and
- a natural language summary of the decision contextualized to the specific input.
To generate these reasoning, we prompt a high-performing model, such as GPT-4o, by providing the input and SFR-Guard model decision. To evaluate the quality and helpfulness of the explanation, we conduct an extensive user study and prompt tuning, again leveraging our expert in-house annotation team, and demonstrate that both modes of explanation are effective in improving the transparency of content moderation decisions and reduce the cognitive workload for people to validate the correctness of the decisions, especially on long model inputs.
Capability 3: Severity Level-Aware Guardrail
An orthogonal direction from violation category and explanation is to severity level of the violation. The severity level gives a practical signal of the potential harm of the text, and thus the resulting necessary actions. For example, an email-drafting request may ask the agent to write a product advertisement of kitchen appliances targeted specifically to female customers, unintentionally amplifying and strengthening gender stereotype and bias. On the other hand, another request may contain explicitly hateful words towards particular gender groups in an attempt to cause intentional discomfort and harm. These two inputs demonstrate different severity levels and warrant different actions. In the first case, an automated warning message is likely sufficient, while in the second case, the request should definitely be blocked and may also be flagged for human review on potential system abuse by unauthorized or bad-faith users.
In our recent work [1], we propose a methodology to integrate such severity labels in our model training pipeline. The results not only demonstrate the effectiveness of our training method in the severity level classification task, but also show that this severity level-aware training allows us the model to perform better even on the original binary detection task, compared to a suite of content moderation models developed by other institutions and companies.
Conclusion & Future Work
In this blogpost, we described the recent efforts here at Salesforce AI Research to build the SFR-Guard series of models, which utilize classifiers and LLM models with a diverse set of capabilities. With the high performance of such guardrail models, our customers can rest assured that our AgentForce systems can deliver exceptional value and utility with safety and integrity.
Furthermore, Salesforce AI Research is broadly innovating in the area of trustworthy AI with both academic research impact and product innovation, with recent projects on culturally appropriate trust [4], trustworthiness of generative search engines (e.g., perplexity.ai) [5], hallucination analysis [6], correctly refusing to answer nonsensical questions [7], facutal consistency in summarization [8], multiturn conversation safety [9], natural language explanations [10] and mechanistic interpretability [11], along with xAlign, an internal suite of tools for aligning large language models with human preference.
Reference
- BingoGuard: LLM Content Moderation Tools with Risk Levels. Fan Yin, Philippe Laban, Xiangyu Peng, Yilun Zhou, Yixin Mao, Vaibhav Vats, Linnea Ross, Divyansh Agarwal, Caiming Xiong, Chien-Sheng Wu. ICLR 2025.
- Prompt Injection Detection: Securing AI Systems Against Malicious Actors. Divyansh Agarwal, Ben Risher, Denise Pérez. Blog 2025.
- Trusted NLG Research at Salesforce AI. Chien-Sheng Wu, Alex Fabbri. Blog 2024.
- Evaluating Cultural and Social Awareness of LLM Web Agents. Haoyi Qiu, Alexander R Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu. NAACL-Findings 2025.
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses. Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Yixin Mao, Chien-Sheng Wu. Arxiv 2025.
- Shared Imagination: LLMs Hallucinate Alike. Yilun Zhou, Caiming Xiong, Silvio Savarese, Chien-Sheng Wu. Arxiv 2025.
- Unanswerability Evaluation for Retreival Augmented Generation. Xiangyu Peng, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu. Arxiv 2025.
- SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits. Onkar Thorat, Philippe Laban, Chien-Sheng Wu. Arxiv 2025.
- Prompt Leakage effect and defense strategies for multi-turn LLM interactions. Divyansh Agarwal, Alexander R. Fabbri, Ben Risher, Philippe Laban, Shafiq Joty, Chien-Sheng Wu. EMNLP 2024.
- Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations. Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, Leilani H Gilpin. Arxiv 2023.
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao. Arxiv 2024.