PII guardrails for .NET applications - Part 1: TasmanianDevil library

2026/06/26 · 2180 words · 11 minutes to read

ai net security

A few months ago I introduced AgentGuard, a library for declarative guardrails and safety controls for .NET AI agents. One of the rules it shipped with from day one was PII redaction, but back then it was a fairly basic, regex-only affair - good enough to scrub an email address or a credit card number, but not much more. Since then I have rebuilt that part of the library from the ground up into a proper, offline PII detection and de-identification engine, and it has grown enough that it no longer makes sense to keep it locked inside AgentGuard.

As of the 0.10.0 release, the PII engine lives in its own standalone, brand-neutral library called TasmanianDevil - framework-agnostic, no dependency on AgentGuard, usable anywhere you have a string you would rather not leak. AgentGuard now consumes it like any other NuGet package.

This is the first of a two-part series. In this part I will walk through TasmanianDevil on its own - detection, anonymization, the reversible round-trip, structured data, and the optional multilingual NER. In part two I will wire it into Microsoft Agent Framework (MAF) agents, where it really comes into its own.

Series overview

Part 1 (this part): TasmanianDevil library
Part 2: Agent Framework agents

Why PII needs its own engine

If you build anything with language models, personally identifiable information has a way of showing up where you do not want it. A user pastes their full account details into a chat. A RAG pipeline indexes a document full of customer records. A tool returns a database row with an email, a phone number and a social security number in it, and that row is about to be fed straight back into the model’s context. In all of these cases, data that should never leave your boundary is one prompt away from being sent to a cloud model provider - cached, logged and out of your control.

Cloud content-safety services can help, but they come with trade-offs: they are per-call, they run in the cloud, and ironically they often need to see the very PII you are trying to protect in order to classify it. For a whole class of applications - regulated industries, on-premise deployments, anything privacy-sensitive - what you really want is powerful yet lightweight offline redaction that runs locally within your own process.

That is what TasmanianDevil is - a fully offline, deterministic PII detection and de-identification engine whose architecture is inspired by Microsoft Presidio, the well-known Python PII toolkit, rebuilt as idiomatic, dependency-light C#. Its only runtime dependency for the core engine is libphonenumber-csharp for phone parsing; everything else is regular expressions and checksum validation.

Quick start

The fastest way in is the PiiEngine facade, which wires the analyzer, anonymizer, and de-anonymizer together from a single options object:

using TasmanianDevil;

var engine = new PiiEngine();
var result = engine.Deidentify("Email jane@contoso.com or call +1 425 555 0100.");
Console.WriteLine(result.AnonymizedText);
// Email <EMAIL_ADDRESS> or call <PHONE_NUMBER>.

That is the whole thing - no model to download, no service to call. But the facade hides a fair amount of machinery, so let’s open it up.

Detection, scores and context

Detection is built around an AnalyzerEngine that runs a registry of recognizers over the text, scores each candidate, and boosts the score when supporting context words appear nearby:

using TasmanianDevil;
using TasmanianDevil.Analyzer;
using TasmanianDevil.Analyzer.Context;

// always-on generic + US recognizers, plus the opt-in German country pack
var registry = PiiRecognizers.CreateRegistry("en", [PiiCountries.De]);
var analyzer = new AnalyzerEngine(registry, new LemmaContextAwareEnhancer(), defaultScoreThreshold: 0.4);

const string text =
    "Email jane.doe@acme.com, SSN 078-05-1120, card 4012888888881881, " +
    "IBAN DE89370400440532013000, VAT DE123456788, phone +1 415 555 0132.";

foreach (var d in analyzer.Analyze(text, language: "en").OrderBy(d => d.Start))
    Console.WriteLine($"  {d.EntityType,-14} score={d.Score:F2} '{text[d.Start..d.End]}'");

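which prints the following (the context-boosted marker comes from the fuller showcase sample; the loop above prints only the type, score and span):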

  EMAIL_ADDRESS  score=1.00 (context-boosted) 'jane.doe@acme.com'
  URL            score=0.50                   'jane.do'
  URL            score=0.50                   'acme.com'
  PHONE_NUMBER   score=0.40                   '078-05-1120'
  CREDIT_CARD    score=1.00 (context-boosted) '4012888888881881'
  IBAN_CODE      score=1.00 (context-boosted) 'DE89370400440532013000'
  DE_VAT_ID      score=1.00 (context-boosted) 'DE123456788'
  PHONE_NUMBER   score=0.75 (context-boosted) '+1 415 555 0132'

There are a few things worth unpacking here. The credit card is validated with the Luhn algorithm, the IBAN with the mod-97 check, the German VAT id with its ISO-7064 checksum - so a random 16-digit string will not be flagged as a card. The scores are also not static: the word “card” sitting right before the number boosts the credit card recognizer to a full 1.00, which is what context-boosted means. The context matching is lemma-aware - it normalizes both the surrounding tokens and the recognizer’s context words through a dependency-free Porter stemmer, so “cards” and “card” match the same way. And because I enabled the German country pack (PiiCountries.De), the DE_VAT_ID recognizer fires; without it, only the generic and US recognizers would run.
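To make the checksum point concrete, here is a minimal sketch of the two best-known checks mentioned above: Luhn for card numbers and mod-97 for IBANs. The function names PassesLuhn and PassesIbanMod97 are made up for this illustration and are not TasmanianDevil's API; the algorithms themselves are the standard ones.

```csharp
using System;
using System.Linq;

// Luhn: walking from the right, double every second digit (subtracting 9 when
// the doubled value exceeds 9) and require the total to be divisible by 10.
static bool PassesLuhn(string digits)
{
    if (digits.Length == 0 || !digits.All(char.IsAsciiDigit)) return false;

    int sum = 0;
    bool doubleIt = false;
    for (int i = digits.Length - 1; i >= 0; i--)
    {
        int d = digits[i] - '0';
        if (doubleIt)
        {
            d *= 2;
            if (d > 9) d -= 9;
        }
        sum += d;
        doubleIt = !doubleIt;
    }
    return sum % 10 == 0;
}

// IBAN mod-97: move the first four characters to the end, map letters to
// numbers (A=10 ... Z=35), and require the resulting integer mod 97 to be 1.
static bool PassesIbanMod97(string iban)
{
    string s = iban.Replace(" ", "").ToUpperInvariant();
    if (s.Length < 5 || !s.All(char.IsAsciiLetterOrDigit)) return false;

    string rearranged = s[4..] + s[..4];
    int remainder = 0;
    foreach (char c in rearranged)
    {
        int value = char.IsAsciiDigit(c) ? c - '0' : c - 'A' + 10;
        // A letter contributes two decimal digits, a digit contributes one.
        remainder = value >= 10
            ? (remainder * 100 + value) % 97
            : (remainder * 10 + value) % 97;
    }
    return remainder == 1;
}

Console.WriteLine(PassesLuhn("4012888888881881"));            // True
Console.WriteLine(PassesLuhn("4012888888881882"));            // False
Console.WriteLine(PassesIbanMod97("DE89370400440532013000")); // True
```

This is why a random 16-digit string rarely survives: only about one in ten passes Luhn by chance.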

What you see above is the raw candidate list, before overlap resolution. Notice that two fragments of the email address ('jane.do' and 'acme.com') also match the URL recognizer at a lower confidence of 0.50, and the SSN gets picked up by the phone recognizer here. This is by design - recognizers run independently and are allowed to overlap. When you anonymize, the engine resolves those overlaps (highest score and longest span win), so the email span swallows the URL fragments and you get a single clean <EMAIL_ADDRESS>. The defaultScoreThreshold of 0.4 is what keeps the weakest candidates out in the first place; the 0.40 phone match only just clears it.
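The "highest score and longest span win" rule can be sketched as a simple greedy pass. This is an illustration under stated assumptions, not the library's implementation: the tuple shape and the ResolveOverlaps name are hypothetical, and the offsets are the positions of the first few candidates in the sample text above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical candidate shape: (entity type, start, exclusive end, score).
static List<(string Type, int Start, int End, double Score)> ResolveOverlaps(
    IEnumerable<(string Type, int Start, int End, double Score)> spans)
{
    var kept = new List<(string Type, int Start, int End, double Score)>();

    // Strongest first: higher score wins, then the longer span.
    var ranked = spans
        .OrderByDescending(s => s.Score)
        .ThenByDescending(s => s.End - s.Start);

    foreach (var candidate in ranked)
    {
        // Two half-open ranges overlap when each starts before the other ends.
        bool overlapsKept = kept.Any(k => candidate.Start < k.End && k.Start < candidate.End);
        if (!overlapsKept) kept.Add(candidate);
    }
    return kept.OrderBy(k => k.Start).ToList();
}

var raw = new (string Type, int Start, int End, double Score)[]
{
    ("EMAIL_ADDRESS", 6, 23, 1.00), // jane.doe@acme.com
    ("URL", 6, 13, 0.50),           // jane.do
    ("URL", 15, 23, 0.50),          // acme.com
    ("PHONE_NUMBER", 29, 40, 0.40), // 078-05-1120
};

var resolved = ResolveOverlaps(raw);
foreach (var r in resolved)
    Console.WriteLine($"{r.Type} [{r.Start}..{r.End})");
// EMAIL_ADDRESS [6..23)
// PHONE_NUMBER [29..40)
```

Both URL fragments sit inside the higher-scoring email span, so they are dropped, while the phone candidate overlaps nothing and survives.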

The recognizer catalog is reasonably broad. The generic, always-on recognizers cover email, credit cards, IBANs, crypto wallet addresses (base58 and bech32), IP and MAC addresses, URLs, and phone numbers (via libphonenumber). On top of that, a US pack is always on (SSN, ITIN, ABA routing number, bank account, driver’s license, passport, NPI, MBI, medical license), and there are opt-in country packs for the UK, Germany, India, Italy and Spain - opt-in because enabling all of them at once tends to explode the false positive rate.

Anonymization operators

Detection is only half the story. Once you know where the PII is, you need to decide what to do with it. The AnonymizerEngine supports a set of operators that you can configure per entity type: replace (the default <ENTITY_TYPE> tags), redact (remove entirely), mask, hash, encrypt, keep, or a custom lambda.

using TasmanianDevil.Anonymizer;
using TasmanianDevil.Anonymizer.Operators;

const string text = "Reach John at john@acme.com or +1 415 555 0132; card 4012888888881881.";

var operators = new Dictionary<string, OperatorConfig>
{
    ["EMAIL_ADDRESS"] = new("mask", new() { [OperatorParams.MaskingChar] = "*", [OperatorParams.CharsToMask] = 6 }),
    ["PHONE_NUMBER"]  = new("hash", new() { [OperatorParams.Salt] = "0123456789abcdef" }),
    ["CREDIT_CARD"]   = new("redact"),
};

var anonymizer = new AnonymizerEngine();
var result = anonymizer.Anonymize(text, analyzer.Analyze(text, language: "en"), operators);
Console.WriteLine(result.Text);

Reach John at ******cme.com or 6dd03ba38e32c873855f5730ad5d612a4b2a338c045bb0fce1473ab2f1567e3c; card .

The email is partially masked, the phone is replaced with a salted SHA-256 hash, and the card is removed entirely.
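For intuition, the mask and hash operators boil down to very little code. The sketch below is a stand-in, not the library's operators: MaskLeading and SaltedSha256Hex are hypothetical names, and prepending the salt to the value before hashing is an assumption, so the digest it produces is not expected to match the one printed above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Replace the first charsToMask characters with the masking character.
static string MaskLeading(string value, char maskingChar, int charsToMask)
{
    int n = Math.Clamp(charsToMask, 0, value.Length);
    return new string(maskingChar, n) + value[n..];
}

// Assumption: the salt is prepended to the value and the UTF-8 bytes are hashed.
// The library may combine salt and value differently.
static string SaltedSha256Hex(string value, string salt)
{
    byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(salt + value));
    return Convert.ToHexString(digest).ToLowerInvariant();
}

Console.WriteLine(MaskLeading("john@acme.com", '*', 6)); // ******cme.com

string phoneHash = SaltedSha256Hex("+1 415 555 0132", "0123456789abcdef");
Console.WriteLine(phoneHash.Length); // 64
```

A salted hash is deterministic for a given salt, so the same phone number always maps to the same token, which keeps records joinable without exposing the value.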

The most interesting operator is encrypt, because it is reversible. You can anonymize text with an AES key, hand the redacted version to a third party, and later decrypt it back to the exact original:

const string key = "0123456789abcdef"; // 128-bit AES key (demo value; load real keys from a secret store)
const string secret = "Patient Mary booked at mary@clinic.org on file 078-05-1120.";

var encryptOps = new Dictionary<string, OperatorConfig>
{
    ["DEFAULT"] = new("encrypt", new() { [OperatorParams.Key] = key }),
};

var encrypted = anonymizer.Anonymize(secret, analyzer.Analyze(secret, language: "en"), encryptOps);
var deid = PiiDeidentificationResult.FromEngineResult(encrypted);

// later, decrypt back to the original
var restoreOps = new Dictionary<string, OperatorConfig>
{
    ["DEFAULT"] = new("decrypt", new() { [OperatorParams.Key] = key }),
};
var restored = new DeanonymizerEngine().Deanonymize(deid.AnonymizedText, deid.Items, restoreOps);

before   : Patient Mary booked at mary@clinic.org on file 078-05-1120.
after    : Patient Mary booked at YxRj-hBtcI08d8FJ1mUelfm3Ck7iOS7KWXk4KvS6MCs on file SCjy_ijxZh8m8lWsgkC-VTqK04-kFMGO-cI1IqUs1UU.
reversible? True
restored : Patient Mary booked at mary@clinic.org on file 078-05-1120.

This round-trip is the foundation of the reversible redaction pattern I will show in part two, where the encrypted tokens go to the model and the real values come back.
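To show what a reversible operator has to do, here is a minimal sketch of an encrypt/decrypt pair built on the .NET Aes class. The token layout is an assumption made for this example (AES-CBC, a random IV prepended to the ciphertext, base64url without padding); EncryptToken and DecryptToken are hypothetical names and the library's actual format may differ.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Assumed token layout: base64url( IV || AES-CBC(PKCS7) ciphertext ).
static string EncryptToken(string plaintext, byte[] key)
{
    using var aes = Aes.Create();
    aes.Key = key;
    byte[] iv = RandomNumberGenerator.GetBytes(16); // fresh IV for every value
    byte[] cipher = aes.EncryptCbc(Encoding.UTF8.GetBytes(plaintext), iv);

    byte[] payload = new byte[iv.Length + cipher.Length];
    iv.CopyTo(payload, 0);
    cipher.CopyTo(payload, iv.Length);
    return Convert.ToBase64String(payload).TrimEnd('=').Replace('+', '-').Replace('/', '_');
}

static string DecryptToken(string token, byte[] key)
{
    // Undo the base64url alphabet and restore the stripped padding.
    string b64 = token.Replace('-', '+').Replace('_', '/');
    b64 = b64.PadRight(b64.Length + (4 - b64.Length % 4) % 4, '=');
    byte[] payload = Convert.FromBase64String(b64);

    using var aes = Aes.Create();
    aes.Key = key;
    byte[] plain = aes.DecryptCbc(payload[16..], payload[..16]);
    return Encoding.UTF8.GetString(plain);
}

byte[] demoKey = Encoding.UTF8.GetBytes("0123456789abcdef"); // 16 bytes = AES-128, demo only
string token = EncryptToken("mary@clinic.org", demoKey);
string roundTripped = DecryptToken(token, demoKey);

Console.WriteLine(token);
Console.WriteLine(roundTripped); // mary@clinic.org
```

Because the IV is random, encrypting the same value twice yields different tokens. That is good for confidentiality, but it means encrypted tokens cannot be used to correlate records the way a salted hash can.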

If none of the built-in operators fit, you can supply a custom lambda. For example, keeping only the last four digits of a card:

var customOps = new Dictionary<string, OperatorConfig>
{
    ["CREDIT_CARD"] = new("custom", new()
    {
        [CustomOperator.Lambda] = (Func<string, string>)(s =>
            s.Length <= 4 ? s : new string('#', s.Length - 4) + s[^4..]),
    }),
    ["DEFAULT"] = new("keep"), // leave everything else untouched
};

before : Reach John at john@acme.com or +1 415 555 0132; card 4012888888881881.
after  : Reach John at john@acme.com or +1 415 555 0132; card ############1881.

Structured data and batches

Real-world PII rarely arrives as a single tidy sentence. More often it is a JSON payload, a CSV export, or a batch of records. The engine handles all three.

For JSON, the StructuredEngine redacts values by dotted key path, preserving the shape of the document and leaving non-string types alone. You can scope it with an allow-list or a deny-list of paths - here, redacting only user.email:

var structured = new StructuredEngine(analyzer);
var redacted = structured.AnonymizeJson(json, new JsonRedactionScope { IncludePaths = ["user.email"] }, writeIndented: true);

{
  "id": 4821,
  "user": {
    "name": "Acme Ltd",
    "email": "<EMAIL_ADDRESS>",
    "phone": "+1 415 555 0132"
  },
  "active": true,
  "notes": "VIP since 2019"
}
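The idea of path-scoped redaction can be sketched with System.Text.Json alone. Everything here is illustrative rather than the StructuredEngine's code: RedactAtPaths is a hypothetical helper, the input document and its email value are made up, and a simple regex stands in for the analyzer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Json.Nodes;
using System.Text.RegularExpressions;

// Walk the tree, building dotted paths, and rewrite only string values whose
// path is on the allow-list. Numbers, booleans and nulls are never touched.
// Strings sitting directly inside arrays are left alone in this sketch.
static void RedactAtPaths(JsonNode? node, string path, ISet<string> includePaths, Func<string, string> redact)
{
    if (node is JsonObject obj)
    {
        // Snapshot the properties so values can be assigned while iterating.
        foreach (var (name, child) in obj.ToList())
        {
            string childPath = path.Length == 0 ? name : $"{path}.{name}";
            if (child is JsonValue value && value.TryGetValue<string>(out var text))
            {
                if (includePaths.Contains(childPath)) obj[name] = redact(text);
            }
            else
            {
                RedactAtPaths(child, childPath, includePaths, redact);
            }
        }
    }
    else if (node is JsonArray array)
    {
        foreach (var item in array) RedactAtPaths(item, path, includePaths, redact);
    }
}

// Hypothetical input document.
const string json = """
{ "id": 4821, "user": { "name": "Acme Ltd", "email": "billing@acme.com", "phone": "+1 415 555 0132" }, "active": true }
""";

var root = JsonNode.Parse(json)!;
var emailPattern = new Regex(@"[^\s@]+@[^\s@]+\.[^\s@]+"); // stand-in for the analyzer

RedactAtPaths(root, "", new HashSet<string> { "user.email" },
    s => emailPattern.Replace(s, "<EMAIL_ADDRESS>"));

// The relaxed encoder keeps '<', '>' and '+' readable instead of \uXXXX escapes.
Console.WriteLine(root.ToJsonString(new JsonSerializerOptions
{
    WriteIndented = true,
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
}));
```

The phone number stays in place even though it is clearly PII, because only user.email is in scope. That is the trade-off of an allow-list: precise, but it redacts nothing you did not name.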

For CSV, it infers which columns contain PII by sampling their values, then redacts those columns while leaving the benign ones in place:

name     | email           | card          | city
Acme Ltd | <EMAIL_ADDRESS> | <CREDIT_CARD> | Berlin
Globex   | <EMAIL_ADDRESS> | <CREDIT_CARD> | Oslo
Initech  | <EMAIL_ADDRESS> | <CREDIT_CARD> | Madrid
inferred PII columns: email=EMAIL_ADDRESS, card=CREDIT_CARD
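Column inference by sampling can be sketched as "run each detector over a sample of the column and flag it when most values match". The names, the 80% threshold and the regex detectors below are assumptions for illustration; the library's sampling strategy and thresholds are not described in this post. The email and card values are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Flag a column as PII when at least minShare of the sampled values match a detector.
static Dictionary<string, string> InferPiiColumns(
    string[] header,
    IReadOnlyList<string[]> rows,
    IReadOnlyDictionary<string, Regex> detectors,
    int sampleSize = 100,
    double minShare = 0.8)
{
    var inferred = new Dictionary<string, string>();
    var sample = rows.Take(sampleSize).ToList();
    if (sample.Count == 0) return inferred;

    for (int col = 0; col < header.Length; col++)
    {
        foreach (var (entityType, pattern) in detectors)
        {
            int hits = sample.Count(row => col < row.Length && pattern.IsMatch(row[col]));
            if (hits >= minShare * sample.Count)
            {
                inferred[header[col]] = entityType;
                break; // first detector that clears the threshold wins
            }
        }
    }
    return inferred;
}

// Regex stand-ins for real recognizers.
var detectors = new Dictionary<string, Regex>
{
    ["EMAIL_ADDRESS"] = new Regex(@"^[^\s@]+@[^\s@]+\.[^\s@]+$"),
    ["CREDIT_CARD"] = new Regex(@"^\d{13,19}$"),
};

string[] header = { "name", "email", "card", "city" };
var rows = new List<string[]>
{
    new[] { "Acme Ltd", "billing@acme.example", "4012888888881881", "Berlin" },
    new[] { "Globex", "ap@globex.example", "4111111111111111", "Oslo" },
    new[] { "Initech", "finance@initech.example", "5555555555554444", "Madrid" },
};

var piiColumns = InferPiiColumns(header, rows, detectors);
Console.WriteLine(string.Join(", ",
    header.Where(piiColumns.ContainsKey).Select(h => $"{h}={piiColumns[h]}")));
// email=EMAIL_ADDRESS, card=CREDIT_CARD
```

Requiring a share of matches rather than a single hit keeps one stray email address in a free-text column from getting the whole column redacted.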

And there is a batch API over plain IEnumerable<string> as well as keyed IReadOnlyDictionary<string, string> records - the keyed variant even feeds the key into the detection context, so a field named support_phone gives the phone recognizer a little nudge:

billing_email: billing@acme.com    ->  billing_email: <EMAIL_ADDRESS>
support_phone: +1 415 555 0132     ->  support_phone: <PHONE_NUMBER>
ticket_title: Cannot log in        ->  ticket_title: Cannot log in

The third record is left untouched, which is the point - the engine only redacts what it actually recognizes.
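How a field name might nudge a recognizer can be sketched as a context boost driven by the key's tokens. This is a guess at the mechanism, not the library's code: ScoreWithKeyContext and the context word list are hypothetical, matching is exact rather than stemmed, and the 0.35 boost is simply the difference between the unboosted (0.40) and context-boosted (0.75) phone scores shown earlier.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Split the key into tokens and boost the score when any token is a context word.
static double ScoreWithKeyContext(double baseScore, string fieldKey, string[] contextWords, double boost = 0.35)
{
    string[] keyTokens = fieldKey.ToLowerInvariant()
        .Split(new[] { '_', '-', '.', ' ' }, StringSplitOptions.RemoveEmptyEntries);

    bool supported = keyTokens.Any(t => contextWords.Contains(t));
    return supported ? Math.Min(1.0, baseScore + boost) : baseScore;
}

string[] phoneContext = { "phone", "telephone", "mobile", "call" }; // hypothetical context words

double boosted = ScoreWithKeyContext(0.40, "support_phone", phoneContext);
double plain = ScoreWithKeyContext(0.40, "ticket_title", phoneContext);

Console.WriteLine(boosted.ToString("F2", CultureInfo.InvariantCulture)); // 0.75
Console.WriteLine(plain.ToString("F2", CultureInfo.InvariantCulture));   // 0.40
```

The boost matters most for borderline candidates: a phone-shaped number that barely clears the threshold on its own becomes a confident match once the field name agrees.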

Multilingual NER (optional)

Regular expressions and checksums are great for structured PII - things with a definable shape, like a card number or an IBAN. They are useless for the unstructured kind: a person’s name, a city, a company, a date. There is no regex for “this is somebody’s name.”

For that, TasmanianDevil has an optional named-entity recognition add-on in a companion package, TasmanianDevil.Onnx. It detects PERSON, LOCATION, ORGANIZATION and DATE_TIME spans via a zero-shot GLiNER model (an mDeBERTa-v3 backbone, multilingual out of the box), and registers as an ordinary recognizer - so its spans flow through the same analyzer pass, overlap resolution and anonymization as the regex/checksum entities:

using TasmanianDevil.Onnx;

var ner = new GlinerNerRecognizer(new GlinerNerOptions
{
    ModelPath = modelPath, TokenizerPath = spmPath, ConfigPath = configPath,
});
registry.AddRecognizer(ner); // now PERSON/LOCATION/... join the same analyzer pass

Under the hood, TasmanianDevil.Onnx runs the model through Kyoto - another library that came out of the same 0.10.0 split, holding the ONNX inference engine (session pooling plus a set of ready-to-use text classifiers). It is opt-in and bring-your-own-download: the ONNX export is published on Hugging Face and weighs around 580 MB, so it is not bundled. Getting a span-NER model to run correctly in pure .NET - tokenization, the span decoding, the whole assembly - is genuinely the hard part, and this is what TasmanianDevil packages up for you.

The reason it is worth the extra weight is multilingual coverage. Because the backbone is multilingual, it catches names and places across languages where an English-leaning approach would fall flat:

[en]
before : Jane Doe joined ACME Corp in Berlin on March 3rd; email jane.doe@acme.com.
after  : <PERSON> joined <ORGANIZATION> in <LOCATION> on <DATE_TIME>; email <EMAIL_ADDRESS>.

[de]
before : Klaus Müller arbeitet bei der Siemens AG in München seit dem 5. Mai.
after  : <PERSON> arbeitet bei der <ORGANIZATION> in <LOCATION> seit dem <DATE_TIME>.

[ru]
before : Иван Петров живёт в Москве и работает в компании Газпром.
after  : <PERSON> живёт в <LOCATION> и работает в компании <ORGANIZATION>.

Notice that the German date 5. Mai (in seit dem 5. Mai) and the Russian Москве (an inflected form of Москва, Moscow) are both correctly tagged - and in the English example, the NER spans flow through the same pipeline as the regex email recognizer, so they all get redacted in one pass.

Final thoughts

The thing I like about handling PII this way is that it is deterministic and offline by default. The core engine is regex and checksums - no model, no network call, no cloud dependency. You get reproducible behavior you can unit-test, and your customers’ data stays on your machine. When you need the extra reach of name and place detection across languages, the optional ONNX NER layer is there, but it composes into the same pipeline rather than being a separate thing bolted on.

TasmanianDevil is published on NuGet (TasmanianDevil and the optional TasmanianDevil.Onnx), targets .NET 10 and is MIT licensed. The source and a narrated, end-to-end showcase sample - which is where I harvested all the code and output in this post - are on GitHub.

In part two, I will take this exact engine and wire it into Microsoft Agent Framework agents through AgentGuard, so PII is handled automatically on the way into the model, on the way out, and - the surface people most often miss - in the results that come back from tool calls.

About

Hi! I'm Filip W., a software architect from Zürich 🇨🇭. I like Toronto Maple Leafs 🇨🇦, Rancid and quantum computing. Oh, and I love the Lowlands 🏴󠁧󠁢󠁳󠁣󠁴󠁿.

You can find me on Github, on Mastodon and on Bluesky.

StrathWeb. A free flowing tech monologue.
