Your AI Feature Is a New Attack Surface

Fourteen years in IT have made me a product person before anything else. I teach AI product building in every workshop I run, with a framework for pressure-testing whether an idea can become a real business. And I have a conviction about what India actually needs from AI: not one unicorn, but a thousand companies each doing a million dollars in revenue — the same billion-dollar impact, spread across far more founders, empowering far more Indians and solving far more real problems. That's the scale at which AI changes a country.

But building products is only one half of the competence. The other half is knowing what today's models can't do — because those are the failure modes you'd be building on top of. You don't ship a product on a foundation whose cracks you haven't mapped. I knew the capabilities well; I wanted to know the limits just as well.

So I entered OpenAI's gpt-oss-20b Red-Teaming Challenge on Kaggle — to study LLM weakness firsthand by attacking an open model the way a bad actor would. It taught me more about what these models actually are than any capability demo ever had.

A bank can't run on the hope of no attack

A bank doesn't operate with no security and hope nobody robs it. It assumes attack and hardens against it. LLMs need the same assumption — because for all their intelligence, they are remarkably gullible, especially against someone deliberately trying to break them. That was a surprising find — intelligent but gullible is a genuinely dangerous combination.

This stops being abstract the moment you ship. Wire an LLM chatbot to your systems, and a bad actor can talk their way past it and pull data out of your secure database. Your shiny new LLM feature, instead of adding capability, becomes the thing that leaks what should never have been exposed. A bad actor only has to find one hole; you have to close all of them.

The competition taught me a distinction we should all be aware of but aren't — I wasn't aware of it myself. There are two kinds of red teaming:

Model-level — weaknesses baked into the model's weights. The model owner (OpenAI, Anthropic, Google) fixes these, and advertises that red-teaming loudly. That loud voice is itself part of the problem: it pulls attention away from a quieter, more critical class of errors.
Application-level — weaknesses in how you connect the model to tools, data, and users. No one fixes these for you.

Application-level red teaming is how you secure your own feature — and adopting it was an important addition to my AI product-building approach. Building a feature is one thing; hardening it is a separate discipline.

And these exploits are alarmingly simple — because an LLM is distilled from human intelligence, it inherits human weak spots. Tell a person you're an authority and they tend to cave under pressure; the model does exactly the same. Often, simply claiming authority — "I'm authorized to read this" — is enough to get access. The same effect comes through a different language, or a request wrapped in a poem or a story — the guardrails slip far more easily. None of it takes sophistication, and with programmatic attacks — thousands of automated attempts a minute — the risk compounds fast.

Three jobs every AI feature needs: red-team, defend, verify

That's the real lesson from the competition — not just where LLMs break, but how to find the breaks and close them. The solution is what counts. And you don't do any of it by hand; there's open-source tooling for each piece. Shipping a real AI feature takes three jobs, not one, and all three are essential:

1. Red-team it — find the vulnerabilities. Can someone make it leak private data, jailbreak its rules, or corrupt its behaviour? You attack your own app before an adversary does.

promptfoo and DeepTeam (50+ known vulnerabilities, mapped to the OWASP LLM Top 10); Garak (NVIDIA) and PyRIT (Microsoft) for deeper scans.

2. Defend it — guard at runtime. Once you know the holes, put a layer in front of the model that filters malicious inputs and unsafe outputs while it's live.

NeMo Guardrails (NVIDIA), Guardrails AI, LLM Guard.

3. Verify the business logic — track that it actually works. The one people forget. An LLM is non-deterministic; its performance varies — across prompts, across versions, as the provider quietly updates the model under you. You can't test once and trust forever. You need standing checks that the feature still does its job: relevant, accurate, grounded answers, every time.

DeepEval — "Pytest for LLMs": scores answer relevancy, faithfulness, and hallucination. Ragas does the same for RAG: is the retrieved context relevant, is the answer grounded in it?

Skip any one and you've shipped something incomplete. A feature that's secure but wrong is useless; one that's accurate but jailbreakable is a liability; one that's both today but unmonitored will drift into failure tomorrow. Integrate all three the way you'd wire in any other dependency.

Here's the mental model. Every engineer learns — usually after graduating, the first time a bug ships to production — that you don't write real software without tests. An AI feature is no different, except you're testing an intelligence: it has to be secure, correct, and consistently so — because the model underneath you keeps changing. Red-teaming, guardrails, and business-logic checks are its unit tests. They aren't optional extras; they're an essential part of building an AI application.

That same competition pushed me toward a deeper, more unsettling discovery — one that matters most for the AI models entire nations are now building. That's the next piece.

This is the kind of grounded, both-sides view of AI I bring to enterprise teams through Purna Medha — including AI Strategy and Safety sessions on where LLMs actually break and how to harden the application layer you own. If your team is shipping LLM features without a clear picture of model-level vs application-level risk, that's the gap we close. → purnamedha.ai

Built for OpenAI's gpt-oss-20b Red-Teaming Challenge, Kaggle.