Anonymising PII in PostgreSQL with pgEdge Anonymizer

Data privacy regulations such as GDPR, CCPA, and HIPAA have made it increasingly important for organisations to protect personally identifiable information (PII) in their databases. Whether you're creating a development environment from production data, sharing datasets with third parties, or simply trying to minimise risk, you'll often need to anonymise sensitive data whilst maintaining the structure and relationships within your database.

I've been working on a tool to address this need: pgEdge Anonymizer. It's a command-line utility that replaces PII in PostgreSQL databases with realistic but fake values, all whilst preserving referential integrity and data consistency.

The Problem

Consider a typical scenario: you have a production database containing customer records, and you need to create a copy for your development team. The data includes names, email addresses, phone numbers, National Insurance numbers, and credit card details. You can't simply hand over the production data as that would be a compliance nightmare, but you also need the development database to contain realistic data that exercises the same code paths as production.

Manually anonymising this data is tedious and error-prone. You need to ensure that:

The same customer email appears consistently across all tables
Foreign key relationships remain intact
The anonymised data looks realistic (not just "XXXX" or "test@test.com")
The process is repeatable and auditable

Enter pgEdge Anonymizer

pgEdge Anonymizer addresses these challenges with a simple YAML-based configuration approach. You define which columns contain PII and what type of data they hold, and the tool handles the rest.

Installation

Building from source is straightforward:

git clone https://github.com/pgEdge/pgedge-anonymizer.git
cd pgedge-anonymizer
make build

This produces a single binary in the bin/ directory that you can copy wherever you need it.

Configuration

The configuration file defines your database connection and the columns to anonymise. Here's a typical example:

database:
  host: localhost
  port: 5432
  database: myapp

columns:
  - column: public.customers.first_name
    pattern: PERSON_FIRST_NAME

  - column: public.customers.last_name
    pattern: PERSON_LAST_NAME

  - column: public.customers.email
    pattern: EMAIL

  - column: public.customers.phone
    pattern: UK_PHONE

  - column: public.customers.ni_number
    pattern: UK_NI

  - column: public.employees.first_name
    pattern: PERSON_FIRST_NAME

  - column: public.employees.last_name
    pattern: PERSON_LAST_NAME

  - column: public.audit_log.user_email
    pattern: EMAIL

Each column is specified using its fully-qualified name (schema.table.column) and assigned a pattern that determines how the data should be anonymised.

Running the Anonymizer

Before making any changes, it's wise to validate your configuration:

pgedge-anonymizer validate --user myuser --password mypassword

This checks that the configuration file is valid, the database is accessible, and all specified columns exist. Once you're satisfied, run the anonymisation:

pgedge-anonymizer run --user myuser --password mypassword

You'll see progress output as the tool processes each column:

Processing public.customers.email (est. 50000 rows)...
  10000 rows processed
  20000 rows processed
  30000 rows processed
  40000 rows processed
  50000 rows processed
  Completed: 50000 rows, 48234 values anonymized

=== Anonymization Statistics ===
Total columns processed: 6
Total rows processed:    127500
Total values anonymized: 98432
Total duration:          5.67s
Throughput:              22487 rows/sec

Built-in Patterns

One of the things I'm particularly pleased with is the range of built-in patterns. There are over 100 patterns covering common PII types, with country-specific support for 19 countries.

For those of us in the UK, the relevant patterns include:

Pattern	Description	Example Output
UK_PHONE	UK phone numbers	+44 20 7946 0958
UK_NI	National Insurance numbers	AB123456C
UK_NHS	NHS numbers	485 777 3456
UK_POSTCODE	Postcodes	SW1A 1AA
UK_ADDRESS	Full addresses	42 Oak Street, Manchester, Greater Manchester

Similar patterns exist for the US, Canada, Germany, France, Australia, and many other countries. The tool also includes patterns for credit cards, passports, dates of birth, IP addresses, and free-text fields.

Format Preservation

A nice touch is that the tool preserves the format of the original data where possible. If your phone numbers use dashes (555-123-4567), the anonymised values will too. If they use spaces or parentheses, that format is maintained. The same applies to dates, credit card numbers, and other formatted data.

Consistency and Referential Integrity

Perhaps the most important feature is consistency. Within a single anonymisation run, the same input value always produces the same output value. This means that if john.smith@example.com appears in three different tables, it will be replaced with the same anonymised email address in all three places.

The tool also analyses foreign key relationships automatically. If a column has referencing foreign keys with ON UPDATE CASCADE, the tool updates the source column and lets PostgreSQL propagate the changes. This ensures that your anonymised database maintains full referential integrity.

Performance Considerations

For large databases, performance matters. pgEdge Anonymizer uses server-side cursors to fetch rows in batches (10,000 by default), and performs updates using efficient CTID-based batch operations. There's also a tiered caching system with an LRU in-memory cache that spills over to SQLite for very large value dictionaries.

All changes are made within a single transaction, so if anything goes wrong, the entire operation is rolled back cleanly.

Custom Patterns

Whilst the built-in patterns cover most common cases, you can define custom patterns for application-specific data. Custom patterns support three types:

Date formats using strftime codes:

patterns:
  - name: HIRE_DATE
    format: "%Y-%m-%d"
    type: date
    min_year: 2010
    max_year: 2024

Number formats using printf codes:

patterns:
  - name: ORDER_NUMBER
    format: "ORD-%08d"
    type: number
    min: 1
    max: 99999999

Mask formats using character placeholders:

patterns:
  - name: PRODUCT_SKU
    format: "SKU-AA-####"
    type: mask

Best Practices

A few recommendations from my experience:

Always back up first - Anonymisation is irreversible
Test on a copy - Validate on a non-production database before running against anything important
Review all columns - It's easy to miss a column that contains PII
Use SSL - Especially if connecting over a network
Run during quiet periods - Large anonymisation jobs can generate significant WAL traffic

Getting Started

pgEdge Anonymizer is available on GitHub at github.com/pgEdge/pgedge-anonymizer. The repository includes comprehensive documentation, example configurations, and a test dataset you can use to explore the tool's capabilities.

I'd welcome feedback and contributions. If you encounter any issues or have suggestions for new patterns, please open an issue on GitHub.

Anonymising PII in PostgreSQL with pgEdge Anonymizer

The Problem

Enter pgEdge Anonymizer

Installation

Configuration

Running the Anonymizer

Built-in Patterns

Format Preservation

Consistency and Referential Integrity

Performance Considerations

Custom Patterns

Best Practices

Getting Started

Dave Page

Chief Technology Officer

Get started today.

Anonymising PII in PostgreSQL with pgEdge Anonymizer

The Problem

Enter pgEdge Anonymizer

Installation

Configuration

Running the Anonymizer

Built-in Patterns

Format Preservation

Consistency and Referential Integrity

Performance Considerations

Custom Patterns

Best Practices

Getting Started

Dave Page

Chief Technology Officer

SUBSCRIBE TO BLOG

Get started today.