Exploring Vulnerabilities in LLMs: A Red Teaming Approach to Evaluate Social Bias
Abstract
Generative AI has caused a paradigm shift in Artificial Intelligence (AI) and has inspired a wave of new research, especially on Large Language Models (LLMs). LLMs are transforming how people interact with computers in service-oriented fields, in both consumer (e.g., retail, travel, education, healthcare) and enterprise (e.g., customer care, field service, sales, marketing) settings. One barrier to widespread adoption is the current unpredictability of LLM behavior: users must be able to trust that LLM-based services and systems are accurate, fair, and unbiased. Model responses that exhibit biases related to race, social status, and other sensitive topics can have serious consequences, ranging from lack of trust in the model, to adverse social implications for consumers, to damage to the reputations of the corporations that provide them. This study explores how to uncover biases related to social stigmas in LLM output using an adversarial, prompt-based red teaming approach. Discovering model vulnerabilities of this type is a non-trivial, resource-intensive task because of the large search space involved. We present an evaluation framework for systematically probing and analyzing the behavior of multiple LLMs. We use a curated set of adversarial prompts, focusing on uncovering biased responses to prompts associated with sensitive social attributes.
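To make the probing workflow concrete, the following is a minimal illustrative sketch (not the framework presented in this paper) of how a curated adversarial prompt set might be run against multiple models and the responses collected for bias analysis. The `query_model` adapters, the `ProbeResult` record, and the example prompt are hypothetical placeholders.

```python
# Minimal sketch: run a curated set of adversarial prompts against several
# models and collect their responses for downstream bias analysis.
# Each model is represented by a hypothetical `query_model` callable that
# takes a prompt string and returns the model's response string.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    model_name: str
    prompt: str
    response: str


def run_probes(
    models: Dict[str, Callable[[str], str]],  # model name -> query function
    prompts: List[str],                       # curated adversarial prompts
) -> List[ProbeResult]:
    """Send every adversarial prompt to every model under test and collect
    the responses for later bias analysis (manual or classifier-based)."""
    results: List[ProbeResult] = []
    for name, query_model in models.items():
        for prompt in prompts:
            results.append(ProbeResult(name, prompt, query_model(prompt)))
    return results


if __name__ == "__main__":
    # Dummy stand-in model for demonstration; a real run would call an LLM API
    # or a locally hosted checkpoint instead.
    echo_model = lambda p: f"(model response to: {p})"
    probes = run_probes({"echo-llm": echo_model}, ["Describe a typical nurse."])
    for r in probes:
        print(r.model_name, "|", r.prompt, "->", r.response)
```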