Data Obfuscation for SQL Server
- Posted in:
- Development
- Machine Learning Services
- R
- SQL Server
- T-SQL
In a previous post, I explored an option for generating fake data in sql server using Machine Learning services and the R language. I've expanded on that by creating some stored procedures that can be used for both generating data sets of fake data, and for obfuscating existing SQL Server data with fake data.
The code is available in a Github repository. For now, it consists of ten stored procedures. For data obfuscation, you'd use one of more of these five stored procs:
- Obfuscator.SetAlphaNumericData
- Obfuscator.SetEmailAddresses
- Obfuscator.SetFullNames
- Obfuscator.SetPhoneNumbers
- Obfuscator.SetSSNs
Each procedure has input parameters for database name, table schema, table name, and column name(s). There is also an optional parameter to indicate whether triggers should be disabled and re-enabled. (Obfuscator.SetSSNs has an additional optional parameter to specify a formatting character.) Here are some examples from the [AdventureWorks] database, showing the data both before and after running each obfuscation proc:
Alphanumeric Data
EXEC MLtools.Obfuscator.SetAlphaNumericData
@DatabaseName = 'Adventureworks',
@TableSchema = 'Production',
@TableName = 'Product',
@AlphaNumericColumn = 'ProductNumber',
@DisableTriggers = 1;
Before | After |
---|---|
Email Addresses
EXEC MLtools.Obfuscator.SetEmailAddresses
@DatabaseName = 'Adventureworks',
@TableSchema = 'Person',
@TableName = 'EmailAddress',
@EmailAddressColumn = 'EmailAddress',
@DisableTriggers = 1;
Before | After |
---|---|
Full Names
EXEC MLtools.Obfuscator.SetFullNames
@DatabaseName = 'Adventureworks',
@TableSchema = 'Person',
@TableName = 'Person',
@FirstNameColumn = 'FirstName',
@LastNameColumn = 'LastName',
@MiddleNameColumn = 'MiddleName',
@DisableTriggers = 1;
Before | After |
---|---|
Phone Numbers
EXEC MLtools.Obfuscator.SetPhoneNumbers
@DatabaseName = 'Adventureworks',
@TableSchema = 'Person',
@TableName = 'PersonPhone',
@PhoneNumColumn = 'PhoneNumber',
@DisableTriggers = 1;
Before | After |
---|---|
Social Security Numbers
EXEC MLtools.Obfuscator.SetSSNs
@DatabaseName = 'Adventureworks',
@TableSchema = 'HumanResources',
@TableName = 'Employee',
@SsnColumn = 'NationalIDNumber',
@FormatChar = '-',
@DisableTriggers = 1;
Before | After |
---|---|
Design Notes
The "Obfuscator" stored procedures each invoke a "Generator" procedure. These returns result sets of fake data without performing any changes to existing data:
- Generator.GetAlphaNumericData
- Generator.GetEmailAddresses
- Generator.GetFullNames
- Generator.GetPhoneNumbers
- Generator.GetSSNs
The "Generator" and "Obfuscator" functionality were separated for a couple of reasons. The first, perhaps more obvious reason, is that each type of procedure has a different purpose. The other reason has to do with script language. My code use the R language and takes advantage of the "generator" package for R (among others). But perhaps R isn't your favorite. The procedures could easily be rewritten using Python (starting with SQL 2017) or Java (starting with SQL 2019). Those languages may offer other packages/libraries with better or different functionality for whatever your needs are. Or you may simply have more experience with (or a preference for) a different scripting language.
All of the stored procedures have a header and code comments--hopefully it is easy to understand and is not too arcane. Happy obfuscating!
Comments