I've been thinking about R and how it can be used by developers, DBAs, and other SQL Server professionals that aren't data scientists per se. A recent article about generating a data set of fake transactional data got me thinking about this again and I wondered, can R be used to obfuscate PII data?
In a word, yes. Well, mostly. (More on this in a bit.) As with anything R-related, there are probably multiple packages that are useful for any given task. For this one, I'll focus on the "generator" package. It includes the following functions (whose names should be fairly self-explanatory) for generating fake data:
All of the functions have a required n parameter for the number of observations to generate. For dates of birth and phone numbers, there are some additional optional parameters for customization. Here's an example (using the functions I find most useful) via SQL Server's Machine Learning Service and sp_execute_external_script:
@language = N'R',
@script = N'library(generator)
num_observations <- 100
df <- as.data.frame(r_full_names(num_observations))
df$phone <- r_phone_numbers(num_observations, use_hyphens = TRUE)
df$DOB <- as.character(r_date_of_births(num_observations))
df$email <- r_email_addresses(num_observations)
df$SSN <- r_national_identification_numbers(num_observations)
OutputDataSet <- df'
WITH RESULT SETS ((
This presents a really interesting option, although there are many questions I haven't broached that are outside the scope of this post. Generating random dates and digits is fairly trivial with T-SQL. Generating random (and meaningful) strings is not. Those fake names are pretty cool, although if you are outside of the US, the diversity of those names (or lack of) may be a concern. The email addresses are not realistic in my opinion. They look too random, as you can see by looking at a few of the top-level domain name suffixes. (.lhw? .nxa?)
There's probably not enough functions in the generator package to roll your own complete PII obfuscation solution. However, it appears to be version 1.0. The documentation notes functions for home address and birthplace (among others) are "To be added". If those functions arrive any time soon, I'll be considering using "generator" for some non-production environments.
Are there any other R packages you know of for generating fake PII? Leave a message in the comments!