Adversarial Robustness in LLMs Pt.2: Guardrails

Jul 1, 2024

5 min read

In our previous post in this series, “Adversarial Robustness in LLMs: Identifying and Mitigating Threats,” we discussed the various types of adversarial attacks modern LLMs are subjected to, some intentional and some not. These attacks range from something as simple as prompt injection to a full jailbreak that uses the LLM as a gateway to the underlying system or database.

Such attacks, whether motivated by malice or by a desire to raise awareness, have wiped billions of dollars from organizations’ market value, as seen in the aftermath of Alphabet’s Gemini launch. Building on that discussion, this post dives deeper into the subject and covers the safeguards and guardrails one can implement to protect such systems.


Guardrails can be defined as the parameters that monitor and dictate the user-side behavior of an LLM application. These parameters are not limited to chatbots; they can also be implemented on LLM-aided applications such as Multi-Modal Interpreters, Database Management Systems, and more.

Going beyond enforcing consistent output from an LLM, such guardrails are essential for defining and policing the structure and quality of generated responses in ways that simple prompt engineering cannot.
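To make the idea concrete, here is a minimal, purely illustrative input rail in Python. The patterns and function names are hypothetical and not any particular framework's API; real guardrail toolkits use far more sophisticated classifiers than a regex blocklist.

```python
import re

# Hypothetical blocklist for an input rail; patterns are illustrative only.
BLOCKED_PATTERNS = [
    r"ignore (all|previous) instructions",  # naive prompt-injection tell
    r"\b\d{3}-\d{2}-\d{4}\b",               # US SSN-shaped PII
]

def input_rail(prompt: str) -> bool:
    """Return True if the prompt passes the input guardrail."""
    return not any(re.search(p, prompt, re.IGNORECASE) for p in BLOCKED_PATTERNS)

def guarded_generate(prompt: str, llm) -> str:
    """Call the LLM only when the input rail passes; otherwise refuse."""
    if not input_rail(prompt):
        return "Request blocked by input guardrail."
    return llm(prompt)
```

The tools below package this same wrap-the-model pattern with trained classifiers, taxonomies, and validation logic instead of hand-written patterns.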

Llama Guard

Llama Guard represents a significant advancement in ensuring the safe deployment of large language models in human-AI interactions. By integrating a comprehensive safety risk taxonomy, it effectively categorizes potential risks in both prompts and generated responses. This taxonomy not only aids in identifying and mitigating harmful content but also enhances transparency and accountability in AI-generated interactions. Llama Guard's robust performance on benchmarks like the OpenAI Moderation Evaluation dataset underscores its efficacy in content moderation, often surpassing traditional tools.

Moreover, Llama Guard's provision of model weights encourages community-driven development and adaptation, fostering continual improvements in AI safety standards. The model's ability to perform multi-class classification and generate binary decision scores ensures nuanced response evaluations, supporting diverse use cases without compromising safety protocols. Its flexibility in taxonomy adjustment allows for the alignment of safety measures with specific operational needs, promoting effective zero-shot or few-shot prompting across different risk categories.
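Because Llama Guard returns its verdict as plain text, a small parser is often useful downstream. The sketch below assumes the documented output shape ("safe", or "unsafe" followed by the violated taxonomy category codes such as O3 on the next line); the function name is our own, not part of any official API.

```python
def parse_llama_guard_verdict(raw: str):
    """Parse a Llama Guard text verdict into (is_safe, violated_categories).

    Expected shapes: "safe", or "unsafe\nO3,O6" where the second line lists
    the taxonomy categories the content was flagged under.
    """
    lines = [line.strip() for line in raw.strip().splitlines()]
    if lines and lines[0].lower() == "safe":
        return True, []
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]
```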

Documentation: Click Me

Tags: Taxonomy, OpenSource, Tools, LLaMA

Nvidia NeMo

NeMo Guardrails addresses critical challenges in deploying LLMs by offering a robust toolkit for enhancing steerability and trustworthiness in conversational AI systems. By enabling the implementation of programmable rails, NeMo Guardrails empowers users to define and enforce constraints on LLM outputs, such as avoiding harmful topics or adhering to predefined dialogue paths.

Moreover, NeMo Guardrails safeguards against prompt injection attacks by providing mechanisms to control and validate user inputs, ensuring the model's outputs align with intended usage scenarios. The toolkit leverages a programmable runtime engine that acts as a dialogue manager, interpreting rules defined in Colang—a specialized modeling language—to govern LLM interactions. This approach complements traditional model alignment and prompt engineering strategies by enabling runtime customization of behavior rules, tailored to specific application needs.
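As an illustration, a Colang rail that keeps an assistant away from political topics might look like the following; the flow and message names here are hypothetical, adapted from the toolkit's documented define-user/define-bot/define-flow pattern.

```colang
define user ask politics
  "what do you think about the government?"
  "who should I vote for?"

define bot refuse politics
  "I'm a support assistant, so I won't weigh in on political topics."

define flow politics rail
  user ask politics
  bot refuse politics
```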

Documentation: Click Me

Tags: Colang, Prompt Validation

Guardrails AI

Guardrails-AI is an open-source project ensuring responsible use of LLMs. It applies strict validations to user prompts, safeguarding against risks like PII, toxic language, and API secrets in closed-source models. It also prevents prompt injection and jailbreak attempts, crucial for maintaining data security in non-local deployments. Additionally, Guardrails-AI verifies LLM-generated responses to filter out toxic language, hallucinations, and sensitive competitive information, ensuring ethical and compliant AI interactions.

Guardrails-AI offers modular components to enforce safety measures on both input prompts and LLM-generated outputs. It supports structured output generation and rigorous validation to uphold transparency and reliability in AI applications. This approach strengthens data protection and ethical standards, making it suitable for secure deployment in sensitive contexts.
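The output-validation idea can be sketched in plain Python. This is a conceptual stand-in, not Guardrails-AI's actual interface (the real library wraps LLM calls in `Guard` objects with pluggable validators); the secret-detection pattern below is illustrative.

```python
import re

def validate_no_secrets(output: str) -> dict:
    """Toy output validator: flag responses that appear to leak an API key.

    The "sk-..." pattern is a stand-in for real secret-scanning rules.
    """
    if re.search(r"sk-[A-Za-z0-9]{20,}", output):
        return {"valid": False, "reason": "possible API secret in output"}
    return {"valid": True, "reason": None}
```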

Documentation: Click Me

Tags: API, prompt safety, hallucination


TruLens

TruLens is a powerful software tool designed to streamline the development and evaluation of applications utilizing LLMs. By integrating sophisticated feedback functions, TruLens enables developers to objectively measure the quality and effectiveness of inputs, outputs, and intermediate results across a range of use cases including question answering, summarization, retrieval-augmented generation, and agent-based applications. This capability allows teams to expedite experiment evaluation, identify areas for improvement, and scale application development with confidence.

Its extensible library of built-in feedback functions supports iterative refinement of prompts, hyperparameters, and overall model performance, facilitating rapid adaptation to evolving project needs.
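At its core, a TruLens-style feedback function is a callable that scores some aspect of an LLM interaction on a 0-1 scale. The toy groundedness-flavored metric below is purely illustrative and much cruder than the library's built-in feedback functions.

```python
def context_overlap_feedback(response: str, context: str) -> float:
    """Toy feedback function: fraction of response words that also appear
    in the retrieved context, as a rough stand-in for a groundedness score.
    Returns a value in [0, 1], mirroring the feedback-function convention.
    """
    resp_words = set(response.lower().split())
    ctx_words = set(context.lower().split())
    if not resp_words:
        return 0.0
    return len(resp_words & ctx_words) / len(resp_words)
```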

Documentation: Click Me

Tags: evaluation, feedback, RAGs


LMQL

LMQL is an innovative programming language designed for LLMs, expanding upon Python with integrated LLM interaction. It allows developers to seamlessly embed LLM calls within their code, blending traditional algorithmic logic with natural language processing capabilities. LMQL programs resemble standard Python syntax, where top-level strings serve as query strings passed to LLMs for completion.

Developers can control LLM behavior using the 'where' keyword to specify constraints and guide reasoning processes. LMQL supports various decoding algorithms like argmax, sample, beam search, and best_k, offering flexibility in executing programs and optimizing outputs.
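A short query in this style might look like the following sketch; the model name and the specific constraints are illustrative, not a recommendation.

```lmql
# Hypothetical LMQL query: decode with argmax while constraining the answer.
argmax
    "Q: What is the capital of France?\n"
    "A: [ANSWER]"
from
    "openai/text-davinci-003"
where
    STOPS_AT(ANSWER, "\n") and len(TOKENS(ANSWER)) < 20
```

The `where` clause is enforced during decoding, so constraints such as stop sequences and length bounds shape generation itself rather than filtering output after the fact.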

Documentation: Click Me

Tags: syntax prompting, logic, Python


In the evolving landscape of LLMs, guardrails have become an ethical and moral responsibility for enterprises and individuals alike. They ensure that GenAI models are intelligent and moderated, built not just as products but as services that mitigate off-topic responses, misinformation, and security vulnerabilities stemming from the expansive, often sensitive data these models learn and grow from.

Continuing this series breaking down adversarial robustness in the current LLM landscape, we will next look at how to implement guardrails on Tune Studio and, as a case study, the legal consequences of weak LLM guardrails.


Written by

Aryan Kargwal

Data Evangelist