AI Safety Guardrails
🟢 Smarter AI 🟢
✅ Main Question
Ask... Hey myAgents.PLUS AI, what specific guardrails in your AI protect the user from harm?
⚡We Care. Period.
Real HITL · AI Testing + Bias Protections · Agent Verifier · Compliant LLM Gateway · Compliant Subprocessors
⚡ Our AI Answers...
1. Safety Filters on Harmful Actions
I’m restricted from meaningfully assisting with:
Physical harm (weapons construction, violent planning)
Self-harm or suicide facilitation
Criminal activity (fraud, hacking, scams, evasion)
Abuse, harassment, or exploitation
Biological or chemical weaponization
These aren’t just keyword blocks — they’re intent-aware. Even indirect or “hypothetical” attempts are flagged.
Why this protects users: It prevents accidental escalation, misuse, or following advice that could seriously harm you or others.
2. No Professional Overreach
I do not present myself as:
A doctor
A lawyer
A financial advisor
A therapist
When topics touch medical, legal, or financial risk, I:
Give high-level educational info only
Encourage professional help when appropriate
Avoid prescriptions, diagnoses, or definitive instructions
Why this protects users: Prevents false authority and dangerous reliance on AI for high-stakes decisions.
3. Uncertainty & Hallucination Controls
I’m trained to:
Say “I don’t know” rather than fabricate facts
Signal uncertainty when confidence is low
Avoid invented citations, fake studies, or fake APIs
Prefer conservative, verifiable explanations
Why this protects users: Reduces misinformation and false confidence — especially critical in technical, legal, or policy contexts.
4. Bias & Hate Mitigation
I’m explicitly constrained from:
Promoting hate speech
Encouraging discrimination or dehumanization
Endorsing extremist ideologies
I’m also trained to challenge harmful framing rather than amplify it.
Why this protects users: Prevents social harm, harassment, and normalization of violence or prejudice.
5. Privacy & Data Protection
I do not:
Remember personal data unless you explicitly ask me to
Track you across conversations without consent
Access private databases, emails, or accounts
Know your identity beyond what you share
Even stored memory is:
Limited
User-controllable
Designed for usefulness, not surveillance
Why this protects users: Reduces risk of data misuse, profiling, or unintended exposure.
6. Manipulation & Persuasion Limits
I’m restricted from:
Political persuasion targeted at individuals
Psychological manipulation
Covert influence or behavior steering
Gaslighting or emotional coercion
Why this protects users: Prevents exploitation, especially in vulnerable or high-pressure situations.
7. Refusal with De-Escalation
When I do refuse a request:
I keep it calm and respectful
I explain the boundary (at a high level)
I redirect to safer alternatives when possible
No shaming, no threats, no moral lectures.
Why this protects users: Keeps conversations constructive instead of adversarial or distressing.
8. User Autonomy Preserved
I don’t:
Force opinions
Claim exclusive truth
Demand compliance
Override your judgment
I aim to support decisions, not replace them.
Why this protects users: Maintains agency and reduces dependency or over-trust.
9. Continuous Evaluation & Updates
My behavior is:
Red-teamed by security researchers
Audited for failure modes
Updated as new risks emerge
Guardrails evolve based on real-world misuse patterns.
Why this protects users: Reduces long-term risk as AI capabilities increase.
Goals... The Bottom Line
The goal of these guardrails is not control — it’s:
Fewer irreversible mistakes
Less misinformation
Less harm at scale
More trustable assistance
✅ Comparisons
Ask... Compare these guardrails to other AI systems, explain where guardrails can still fail, and help us design transparent guardrails for our AI agent SaaS. Keep it clear and operator-level, not marketing fluff.
⚡ Our AI Answers, in 3 parts to keep it practical for our AI agent SaaS (aka workbench/system/PaaS)...
1. Guardrails Comparison: This AI vs Other AI
Below is a realistic comparison of how major AI systems typically handle guardrails.
A. Our AI (an OpenAI-class system but better)
Strengths
Intent-aware moderation (context > keywords)
Multi-layer safety (model + policy + runtime checks)
Strong refusal + redirection patterns
Limits on persuasion, manipulation, and professional overreach
Privacy-by-default memory model
Weaknesses
Conservative refusals (can block benign edge cases)
Not always transparent why something is blocked
Guardrails mostly centralized (less developer control)
B. Open-Source LLMs (Llama, Mistral, Mixtral, Qwen, etc.)
Strengths
Full control over behavior
Custom fine-tuning possible
Easier to remove guardrails for internal use
Weaknesses
Safety depends entirely on you
Weak or inconsistent refusal behavior
Easier to jailbreak
No baked-in legal / compliance protection
Best for: Internal tools, research, or heavily sandboxed agents
C. Voice Agent Platforms (Vapi, Retell, ElevenLabs stacks)
Strengths
Real-time speech moderation
Call flow constraints
Easier to restrict domains (appointments, support)
Weaknesses
Thin reasoning guardrails
Harder to detect subtle manipulation
Often rely on upstream LLM safety only
Best for: Receptionists, schedulers, sales triage
D. Autonomous Agent Frameworks (AutoGPT, CrewAI, LangGraph)
Strengths
Flexible tool orchestration
Task-level constraints possible
Can enforce step-by-step approval
Weaknesses
Guardrails degrade with autonomy
Tool misuse risk
Agents can amplify errors fast
Best for: Controlled automation, not open-ended autonomy
E. Enterprise AI (Microsoft Copilot, Salesforce Einstein)
Strengths
Strong compliance + audit logging
RBAC and data isolation
Policy enforcement at org level
Weaknesses
Slow iteration
Overly rigid
Limited customization
Best for: Regulated enterprises
Summary Table
| System | Guardrail strength | Developer control | Misuse risk |
| --- | --- | --- | --- |
| OpenAI-class | ⭐⭐⭐⭐ | ⭐⭐ | Medium |
| Open-source | ⭐⭐ | ⭐⭐⭐⭐ | High |
| Voice platforms | ⭐⭐⭐ | ⭐⭐ | Medium |
| Agent frameworks | ⭐⭐ | ⭐⭐⭐ | High |
| Enterprise AI | ⭐⭐⭐⭐ | ⭐ | Low–Medium |
2. Where Guardrails Still Fail (This Is Important)
A. Prompt Chaining & Context Drift
A harmless request becomes harmful over multiple turns.
Example:
“Summarize this” → “Optimize it” → “Make it more persuasive” → “Target a vulnerable group”
Why it fails: Guardrails often evaluate local intent, not global trajectory.
B. Tool Abuse
The AI is safe, but the tools aren’t.
Example:
Agent can send emails
Agent generates phishing-like content
Tool executes it without secondary review
Reality: Most damage happens outside the model.
C. Over-Trust by Users
Even correct answers can cause harm if users:
Treat AI output as authoritative
Skip human review
Use it in high-risk decisions
This is a UX failure, not just a model failure.
D. Jailbreak-by-Structure
Not prompts — workflows.
Example:
Split harmful task into 10 safe subtasks
Combine outputs externally
Guardrails don’t always catch compositional harm.
E. Voice & Emotion Exploits
Voice agents can:
Sound confident
Influence emotionally
Escalate user stress unintentionally
This is underregulated today.
F. False Positives
Overblocking:
Medical education
Security research
Historical analysis
This causes users to work around guardrails — increasing risk.
3. Transparent Guardrails for Our AI Agents
This is where we can do better than most platforms.
Principle: Guardrails should be visible, explainable, and overridable (with permission)
A. Guardrails Stack (Recommended Architecture)
1️⃣ Policy Layer (Human-readable)
Clear list of what agents can / can’t do
Visible to users and admins
Example:
“This agent cannot provide legal advice or take financial actions.”
2️⃣ Capability Layer (Technical)
Each agent explicitly declares:
Allowed tools
Forbidden tools
Max autonomy level
Approval requirements
Example:
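A minimal sketch of what such a declaration could look like (field names here are illustrative assumptions, not a fixed spec):

```jsonc
{
  // Illustrative capability declaration for a single agent
  "allowed_tools": ["calendar_lookup", "send_email"],
  "forbidden_tools": ["payment_processor", "database_delete"],
  "max_autonomy_level": "assist",
  "requires_approval_for": ["external_communications"]
}
```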
3️⃣ Intent Classification Layer
Before execution:
Classify intent (safe / sensitive / prohibited)
Log confidence score
Trigger escalation if ambiguous
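For example, each request could produce a small classification record before anything executes (shape and fields are illustrative):

```jsonc
{
  // Illustrative per-request intent record, logged before execution
  "intent_class": "sensitive",
  "confidence": 0.74,
  "escalate": false,
  "reason": "touches billing data but no prohibited action"
}
```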
4️⃣ Runtime Constraints
Rate limits
Token caps per task
Step count limits
Timeouts
Autonomy without limits is how things go sideways.
5️⃣ Tool Guardrails (Critical)
Every tool should have:
Input validation
Output validation
Scope limits
Audit logs
Never trust the agent. Trust the tool boundaries.
B. Transparent UX Patterns (This Builds Trust)
When Blocking:
“This action was blocked because it involves financial decisions without human approval.”
When Uncertain:
“This response may be incomplete. A human review is recommended.”
When Escalating:
“This request is outside my safe operating scope. I’ve notified an admin.”
C. Agent Safety Modes (Strongly Recommended)
Safe → informational only
Assist → suggestions, no execution
Execute → actions with guardrails
Autonomous → strict caps + audits
Make this user-visible.
D. Logs That Actually Matter
Log:
Intent classification
Tool calls
Refusals
Overrides
Human approvals
Not for surveillance — for post-incident learning.
E. Guardrails as a Feature (Not a Limitation)
Market this as:
“Explainable AI behavior”
“Action-level approvals”
“Org-defined safety rules”
Most SaaS don’t — you can.
Final Take
Guardrails aren’t about stopping AI — they’re about:
Slowing it down at the right moments
Making risk visible
Keeping humans in the loop
✅ AI Agent Guardrails Schema
This schema defines explicit, inspectable, and enforceable guardrails for AI agents in a SaaS environment. It is designed to be:
Human-readable
Machine-enforceable
Auditable
Extensible across text, voice, and tool-using agents
1. Agent Metadata
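A minimal sketch of what an agent metadata block could contain (field names are illustrative assumptions, not a fixed spec):

```jsonc
{
  // Illustrative metadata block: identifies the agent, its owner, and its modalities
  "agent_id": "agt_receptionist_01",
  "name": "Front Desk Receptionist",
  "owner_org": "acme-dental",
  "version": "1.3.0",
  "modalities": ["text", "voice"]
}
```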
2. Safety Mode
Defines the maximum autonomy of the agent.
{"safety_mode": "assist"}
Allowed values:
safe → informational only
assist → suggestions, no execution
execute → actions with constraints
autonomous → actions + planning (strict limits)
3. Allowed & Forbidden Domains
Controls what topics the agent may operate in.
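One way this could be expressed (domain labels are illustrative):

```jsonc
{
  // Topics the agent may and may not operate in
  "allowed_domains": ["scheduling", "product_support", "billing_questions"],
  "forbidden_domains": ["legal_advice", "medical_advice", "financial_transactions"]
}
```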
4. Intent Sensitivity Thresholds
Defines how the agent reacts to ambiguous or risky intent.
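A possible shape for these thresholds (values and action names are illustrative):

```jsonc
{
  // How the agent reacts as classified risk increases
  "intent_thresholds": {
    "sensitive":  { "min_confidence": 0.6, "action": "add_disclaimer" },
    "risky":      { "min_confidence": 0.5, "action": "escalate_to_human" },
    "prohibited": { "min_confidence": 0.3, "action": "refuse_and_log" }
  }
}
```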
5. Capability Controls
Explicit declaration of what the agent can and cannot do.
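For example (capability names are illustrative):

```jsonc
{
  // Hard on/off switches for agent capabilities
  "capabilities": {
    "can_browse_web": false,
    "can_send_messages": true,
    "can_modify_records": false,
    "can_spend_money": false
  }
}
```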
6. Tool Access Guardrails
Each tool has scope, limits, and approval rules.
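A sketch of per-tool guardrails (tool and field names are illustrative):

```jsonc
{
  // Scope, limits, and approval rules attached to each tool
  "tools": [
    {
      "name": "send_email",
      "scope": ["internal_recipients_only"],
      "max_calls_per_task": 3,
      "requires_approval": true,
      "input_validation": "strict",
      "audit_log": true
    }
  ]
}
```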
7. Action Approval Rules
Defines when a human must approve.
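One possible form (rule names are illustrative):

```jsonc
{
  // When a human must approve before an action executes
  "approval_rules": {
    "financial_actions": "always",
    "external_communications": "always",
    "data_deletion": "always",
    "read_only_queries": "never"
  }
}
```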
8. Runtime Limits
Prevents runaway behavior.
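For instance (values are illustrative; tune them per agent):

```jsonc
{
  // Caps that prevent runaway or looping behavior
  "runtime_limits": {
    "max_steps_per_task": 20,
    "max_tool_calls_per_task": 10,
    "max_tokens_per_task": 50000,
    "task_timeout_seconds": 300
  }
}
```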
9. Voice-Specific Guardrails (Optional)
Applies to AI voice agents.
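A sketch of voice-specific constraints (field names are illustrative):

```jsonc
{
  // Extra constraints that only apply to voice agents
  "voice": {
    "must_disclose_ai_identity": true,
    "no_emotional_pressure_tactics": true,
    "max_call_duration_minutes": 15,
    "escalate_to_human_on_distress": true
  }
}
```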
10. Transparency & User Disclosure
Controls how guardrails are surfaced to users.
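For example (field names are illustrative):

```jsonc
{
  // How guardrails are surfaced to end users
  "transparency": {
    "show_safety_mode_badge": true,
    "explain_blocked_actions": true,
    "show_approval_required_indicator": true
  }
}
```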
11. Logging & Audit Policy
Critical for compliance and learning.
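A possible audit policy block (fields and retention values are illustrative):

```jsonc
{
  // What gets logged for compliance and post-incident learning
  "audit": {
    "log_intent_classifications": true,
    "log_tool_calls": true,
    "log_refusals": true,
    "log_overrides": true,
    "retention_days": 90
  }
}
```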
12. Override & Emergency Controls
Allows controlled bypass when needed.
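One way to model this (field names are illustrative):

```jsonc
{
  // Controlled bypass and kill-switch settings
  "overrides": {
    "admin_override_allowed": true,
    "override_requires_reason": true,
    "emergency_pause_enabled": true
  }
}
```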
13. Example: Full Guardrails Object
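A condensed composite of the sections above, shown as a single object (illustrative only; adapt the field names to your own spec):

```jsonc
{
  // Illustrative composite that combines the per-section sketches above
  "agent_id": "agt_receptionist_01",
  "safety_mode": "assist",
  "allowed_domains": ["scheduling", "product_support"],
  "forbidden_domains": ["legal_advice", "medical_advice"],
  "intent_thresholds": {
    "prohibited": { "min_confidence": 0.3, "action": "refuse_and_log" }
  },
  "capabilities": { "can_send_messages": true, "can_spend_money": false },
  "tools": [
    { "name": "send_email", "requires_approval": true, "max_calls_per_task": 3 }
  ],
  "approval_rules": { "external_communications": "always" },
  "runtime_limits": { "max_steps_per_task": 20, "task_timeout_seconds": 300 },
  "voice": { "must_disclose_ai_identity": true },
  "transparency": { "explain_blocked_actions": true },
  "audit": { "log_tool_calls": true, "retention_days": 90 },
  "overrides": { "admin_override_allowed": true, "emergency_pause_enabled": true }
}
```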
Design Philosophy (Why This Works)
Explicit > implicit
Capabilities > prompts
Tools are the real danger surface
Transparency reduces misuse
Autonomy must be earned, not default
Treat this production-ready guardrails schema as a living spec and iterate on it.
This schema is intentionally:
Explicit (no hidden behavior)
Composable (text, voice, tool agents)
Enforceable at runtime (not just prompt-level)
Auditable (post-incident analysis)
How We Use This Effectively
Here’s how teams typically adopt it without slowing velocity:
Store it per-agent, not global
Each agent gets its own guardrails object
Org admins can clone + modify templates
Enforce at three choke points
Before LLM call → intent + domain check
Before tool call → capability + approval check
Before output → transparency + refusal rules
Treat “tools” as the real risk surface
The schema assumes the model will fail sometimes
Safety lives in what the agent is allowed to do, not what it says
Expose parts of this to users
Safety mode badge
“Why this was blocked” explanations
Approval-required indicators
This dramatically reduces misuse and jailbreak attempts.
Why does this schema scale?
Works for single agents or multi-agents
Cleanly maps to RBAC, audit logs, and compliance
Supports our voice agents (which most schemas ignore)
Lets you safely offer autonomy as a paid feature (included in ALL plans)