An Alternative to BCP
- Posted in:
- Machine Learning Services
- SQL Server
BCP and OPENROWSET are long-lived SQL Server options for working with data in external files. I've blogged about OPENROWSET, including a recent article showing a way to deal with quoted data. One of the shortcomings I've never been able to overcome is an inconsistent data file with data fields in some rows enclosed in double quotes, but not all. I've never found a way around this limitation.
Let's demonstrate with BCP. Below is a sample data file I'll attempt to load into a SQL Server table. Note the data fields highlighted in yellow, which are enclosed in double quotes and contain the field terminator , (comma) character. For reference, the file is also available on Github.
Stanley Lind,5819626239,2013-07-16,firstname.lastname@example.org,241-63-2894,Chartered loss adjuster,"Mante, Mante and Mante"
Hale Bins,3212496148,1903-02-25,email@example.com,206-93-3199,Sports development officer,"Borer, Borer and Borer"
Mikaela O'Keefe,5245629478,1927-09-01,firstname.lastname@example.org,159-33-2513,Museum/gallery conservator,"Jacobs, Jacobs and Jacobs"
Keagan Hackett,1374574537,1992-12-01,email@example.com,803-23-7144,Private music teacher,Quitzon LLC
Kiana Cronin m.d.,3297585362,2017-01-24,firstname.lastname@example.org,133-84-8085,Fisheries officer,Witting Ltd
Harlow Durgan,9239627469,1918-04-13,email@example.com,858-94-5148,"Nurse, mental health",Wilkinson PLC
Sheila Donnelly-Blanda,4657648749,2005-07-19,firstname.lastname@example.org,737-58-1638,"Therapist, occupational",Barrows PLC
Mr Colin Turcotte,5634176974,1952-01-18,email@example.com,946-84-5702,Maintenance engineer,Kub-Kub
Lesta Krajcik,6244527243,1921-04-20,firstname.lastname@example.org,432-99-2126,Nutritional therapist,White-White
Here is the bcp command:
bcp tempdb.guest.ImaginaryPeople in "C:\Output\ImaginaryPeople.csv" -S .\SQL2017 -T -c -t, -r "\n" -F 2
And here are the results when we query the newly loaded table. BCP still includes the double quotes as we see on the first three rows, although the , (comma) character is intact and in the right place within the column/field data. However, lines 7 and 8 are wrong. The "job" data is split across two columns/fields. And there's a field terminator , (comma) character now within the "company".
Now let's look at an alternative to BCP. Using SQL Server Machine Learning services, we can harness all of the flexibility the R language has to offer (or Python or Java, depending on which version of SQL Server you are using). The code below truncates the table and executes sp_execute_external_script, invoking the R function read.csv().
TRUNCATE TABLE tempdb.guest.ImaginaryPeople;
INSERT INTO tempdb.guest.ImaginaryPeople
@language = N'R',
@script = N'
csvFile <- read.csv(file = "C:\\Output\\ImaginaryPeople.csv",
sep = ",",
header = TRUE,
numerals = "no.loss")
csvFile$phone <- as.character(csvFile$phone)',
@output_data_1_name = N'csvFile';
Here are the query results for the newly loaded table. The double quotes that previously enclosed data columns/fields are gone, all the , (comma) characters are where they should be, and all the field data are in the correct columns.
BCP and OPENROWSET have stood the test of time, and rightly so. They're great options. And while you might argue they're imperfect, I will be quick to remind you they are extremely fast. In situations where those tools may not work or are not preferable, there are certainly other options. (SSIS, PowerShell, and C# come to mind.) Maybe R, Python, or Java in-database are other options you'd consider? Comments welcome below. Happy coding!