Anonymising PII in PostgreSQL with pgEdge Anonymizer
Data privacy regulations such as GDPR, CCPA, and HIPAA have made it increasingly important for organisations to protect personally identifiable information (PII) in their databases. Whether you're creating a development environment from production data, sharing datasets with third parties, or simply trying to minimise risk, you'll often need to anonymise sensitive data whilst maintaining the structure and relationships within your database.
I've been working on a tool to address this need: pgEdge Anonymizer. It's a command-line utility that replaces PII in PostgreSQL databases with realistic but fake values, all whilst preserving referential integrity and data consistency.
The Problem
Consider a typical scenario: you have a production database containing customer records, and you need to create a copy for your development team. The data includes names, email addresses, phone numbers, National Insurance numbers, and credit card details. You can't simply hand over the production data as that would be a compliance nightmare, but you also need the development database to contain realistic data that exercises the same code paths as production.
Manually anonymising this data is tedious and error-prone. You need to ensure that:
The same customer email appears consistently across all tables
Foreign key relationships remain intact
The anonymised data looks realistic (not just "XXXX" or "test@test.com")
The process is repeatable and auditable
Enter pgEdge Anonymizer
pgEdge Anonymizer addresses these challenges with a simple YAML-based configuration approach. You define which columns contain PII and what type of data they hold, and the tool handles the rest.
Installation
Building from source is straightforward:
git clone https://github.com/pgEdge/pgedge-anonymizer.git
cd pgedge-anonymizer
make buildThis produces a single binary in the bin/ directory that you can copy wherever you need it.
Configuration
The configuration file defines your database connection and the columns to anonymise. Here's a typical example:
database:
host: localhost
port: 5432
database: myapp
columns:
- column: public.customers.first_name
pattern: PERSON_FIRST_NAME
- column: public.customers.last_name
pattern: PERSON_LAST_NAME
- column: public.customers.email
pattern: EMAIL
- column: public.customers.phone
pattern: UK_PHONE
- column: public.customers.ni_number
pattern: UK_NI
- column: public.employees.first_name
pattern: PERSON_FIRST_NAME
- column: public.employees.last_name
pattern: PERSON_LAST_NAME
- column: public.audit_log.user_email
pattern: EMAIL
Each column is specified using its fully-qualified name (schema.table.column) and assigned a pattern that determines how the data should be anonymised.
Running the Anonymizer
Before making any changes, it's wise to validate your configuration:
pgedge-anonymizer validate --user myuser --password mypasswordThis checks that the configuration file is valid, the database is accessible, and all specified columns exist. Once you're satisfied, run the anonymisation:
pgedge-anonymizer run --user myuser --password mypasswordYou'll see progress output as the tool processes each column:
Processing public.customers.email (est. 50000 rows)...
10000 rows processed
20000 rows processed
30000 rows processed
40000 rows processed
50000 rows processed
Completed: 50000 rows, 48234 values anonymized
=== Anonymization Statistics ===
Total columns processed: 6
Total rows processed: 127500
Total values anonymized: 98432
Total duration: 5.67s
Throughput: 22487 rows/secBuilt-in Patterns
One of the things I'm particularly pleased with is the range of built-in patterns. There are over 100 patterns covering common PII types, with country-specific support for 19 countries.
For those of us in the UK, the relevant patterns include:
| Pattern | Description | Example Output |
|---|---|---|
| UK_PHONE | UK phone numbers | +44 20 7946 0958 |
| UK_NI | National Insurance numbers | AB123456C |
| UK_NHS | NHS numbers | 485 777 3456 |
| UK_POSTCODE | Postcodes | SW1A 1AA |
| UK_ADDRESS | Full addresses | 42 Oak Street, Manchester, Greater Manchester |
Similar patterns exist for the US, Canada, Germany, France, Australia, and many other countries. The tool also includes patterns for credit cards, passports, dates of birth, IP addresses, and free-text fields.
Format Preservation
A nice touch is that the tool preserves the format of the original data where possible. If your phone numbers use dashes (555-123-4567), the anonymised values will too. If they use spaces or parentheses, that format is maintained. The same applies to dates, credit card numbers, and other formatted data.
Consistency and Referential Integrity
Perhaps the most important feature is consistency. Within a single anonymisation run, the same input value always produces the same output value. This means that if john.smith@example.com appears in three different tables, it will be replaced with the same anonymised email address in all three places.
The tool also analyses foreign key relationships automatically. If a column has referencing foreign keys with ON UPDATE CASCADE, the tool updates the source column and lets PostgreSQL propagate the changes. This ensures that your anonymised database maintains full referential integrity.
Performance Considerations
For large databases, performance matters. pgEdge Anonymizer uses server-side cursors to fetch rows in batches (10,000 by default), and performs updates using efficient CTID-based batch operations. There's also a tiered caching system with an LRU in-memory cache that spills over to SQLite for very large value dictionaries.
All changes are made within a single transaction, so if anything goes wrong, the entire operation is rolled back cleanly.
Custom Patterns
Whilst the built-in patterns cover most common cases, you can define custom patterns for application-specific data. Custom patterns support three types:
Date formats using strftime codes:
patterns:
- name: HIRE_DATE
format: "%Y-%m-%d"
type: date
min_year: 2010
max_year: 2024Number formats using printf codes:
patterns:
- name: ORDER_NUMBER
format: "ORD-%08d"
type: number
min: 1
max: 99999999Mask formats using character placeholders:
patterns:
- name: PRODUCT_SKU
format: "SKU-AA-####"
type: maskBest Practices
A few recommendations from my experience:
Always back up first - Anonymisation is irreversible
Test on a copy - Validate on a non-production database before running against anything important
Review all columns - It's easy to miss a column that contains PII
Use SSL - Especially if connecting over a network
Run during quiet periods - Large anonymisation jobs can generate significant WAL traffic
Getting Started
pgEdge Anonymizer is available on GitHub at github.com/pgEdge/pgedge-anonymizer. The repository includes comprehensive documentation, example configurations, and a test dataset you can use to explore the tool's capabilities.
I'd welcome feedback and contributions. If you encounter any issues or have suggestions for new patterns, please open an issue on GitHub.
Got questions or feedback about the pgEdge Postgres MCP server? Hit us up on the pgEdge Discord or open an issue on GitHub. We're here to help.



