Getting Started With OPENROWSET and the BULK Rowset Provider - Part 2
- Posted in:
- Development
- SQL Server
In a previous post, we looked at the basics to get started with OPENROWSET(), the BULK rowset provider, and text files of fixed-width data fields. Now, we'll move on to text files with delimited data fields.
Before we get too far, a few words about Ad Hoc Distributed Queries. I will neither advocate nor decry the enabling and usage of this Server Configuration Option. This post is merely an exercise to show some of OPENROWSET's functionality that may fit a particular use case. Is it ok to enable it? Decide for yourself. To use OPENROWSET, know that you will need to enable Ad Hoc Distributed Queries.
I was working with SQL Server 2012 when this post was written. For other versions, your mileage may vary. All of the files used for this post resided directly on the SQL Server host file system.
Here's the basic syntax for OPENROWSET() with the BULK Rowset Provider:
SELECT *
FROM OPENROWSET (
BULK '[Data_Source_File]',
FORMATFILE = '[Format_File]'
) AS p
As you can see, there are two parameters:
Here are a few lines of data from a sample data file. These are "Products" from the AdventureWorks database.
AR-5381,Adjustable Race,1000,0.00,0.00,06/01/2002,
BA-8327,Bearing Ball,1000,0.00,0.00,06/01/2002,
BE-2349,BB Ball Bearing,800,0.00,0.00,06/01/2002,
BE-2908,Headset Ball Bearings,800,0.00,0.00,06/01/2002,
FK-1639,LL Fork,500,65.81,148.22,07/01/2006,06/30/2007
FK-5136,ML Fork,500,77.92,175.49,07/01/2006,06/30/2007
Now let's take a look at a format file for use with the above data:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="50" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="7"/>
<FIELD ID="4" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="30"/>
<FIELD ID="5" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="30"/>
<FIELD ID="6" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="24"/>
<FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="24"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="ProductNumber" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="2" NAME="ProductName" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="3" NAME="SafetyStockLevel" xsi:type="SQLSMALLINT"/>
<COLUMN SOURCE="4" NAME="StandardCost" xsi:type="SQLMONEY"/>
<COLUMN SOURCE="5" NAME="ListPrice" xsi:type="SQLMONEY"/>
<COLUMN SOURCE="6" NAME="SellStartDate" xsi:type="SQLDATETIME"/>
<COLUMN SOURCE="7" NAME="SellEndDate" xsi:type="SQLDATETIME"/>
</ROW>
</BCPFORMAT>
And now, let's run OPENROWSET() with the above data file and format file. Copy and paste the "Products" data into one file and copy/paste the format file XML into another file. Save them to a local drive on the SQL Server host. For this example, the files will be in C:\SQL Server\ETL. Here's our tsql command, which I'll run from SSMS:
SELECT *
FROM OPENROWSET (
--The data file.
BULK 'C:\SQL Server\ETL\AdventureWorks Products.csv',
--The format file.
FORMATFILE = 'C:\SQL Server\ETL\Product.xml'
) AS p
Here is the result:
SQL Server returns the data as a set of rows and named columns. It looks a lot like querying a table or view, wouldn't you say? You can do more than simple SELECT * queries. Here's a few examples:
/* Only the "seat" products */
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks Products.csv',
FORMATFILE = 'C:\SQL Server\ETL\Product.xml'
) AS p
WHERE p.Name LIKE '%seat%'
/* Products that can no longer be sold. */
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks Products.csv',
FORMATFILE = 'C:\SQL Server\ETL\Product.xml'
) AS p
WHERE p.SellEndDate < CURRENT_TIMESTAMP
/* Avg product cost and list price */
SELECT AVG(p.StandardCost) AvgCost, AVG(p.ListPrice) AvgPrice
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks Products.csv',
FORMATFILE = 'C:\SQL Server\ETL\Product.xml'
) AS p
Data Types
Let's take a closer look at the format file 1Hopefully, it looks familiar. There is a single <RECORD> element with multiple <FIELD> elements that correspond to the fields in the data file. There is also a single <ROW> element with multiple <COLUMN> elements that correspond to table columns. Note the xsi:type attributes that specify the SQL data types for the columns of the returned rowset.
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="50" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="7"/>
<FIELD ID="4" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="30"/>
<FIELD ID="5" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="30"/>
<FIELD ID="6" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="24"/>
<FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="24"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="ProductNumber" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="2" NAME="ProductName" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="3" NAME="SafetyStockLevel" xsi:type="SQLSMALLINT"/>
<COLUMN SOURCE="4" NAME="StandardCost" xsi:type="SQLMONEY"/>
<COLUMN SOURCE="5" NAME="ListPrice" xsi:type="SQLMONEY"/>
<COLUMN SOURCE="6" NAME="SellStartDate" xsi:type="SQLDATETIME"/>
<COLUMN SOURCE="7" NAME="SellEndDate" xsi:type="SQLDATETIME"/>
</ROW>
</BCPFORMAT>
You'll get various results if the data in a field doesn't match the corresponding xsi:type. For instance, let's change "SafetyStockLevel" from SQLSMALLINT to SQLMONEY in the format file:
<COLUMN SOURCE="3" NAME="SafetyStockLevel" xsi:type="SQLMONEY"/>
Here is the result. Notice the decimal point with two digits of precision.
Now let's try the reverse and change "StandardCost" from SQLMONEY to SQLSMALLINT in the format file:
<COLUMN SOURCE="4" NAME="StandardCost" xsi:type="SQLSMALLINT"/>
Unfortunately this doesn't work. Instead of implicitly converting the data from one type to another, we get an error:
Msg 4864, Level 16, State 1, Line 1
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 4 (StandardCost).
You can see another example of a data conversion error if you change one or more of the dates in the data file to an invalid date. For example, change the "SellEndDate" in last row as follows:
AR-5381,Adjustable Race,1000,0.00,0.00,06/01/2002,
BA-8327,Bearing Ball,1000,0.00,0.00,06/01/2002,
BE-2349,BB Ball Bearing,800,0.00,0.00,06/01/2002,
BE-2908,Headset Ball Bearings,800,0.00,0.00,06/01/2002,
FK-1639,LL Fork,500,65.81,148.22,07/01/2006,06/30/2007
FK-5136,ML Fork,500,77.92,175.49,07/01/2006,13/30/2007
Try any of the previous OPENROWSET() query examples to see the error:
Msg 4864, Level 16, State 1, Line 1
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 6, column 7 (SellEndDate).
Undo Changes
Before we proceed, let's "undo" all our changes. Change the "SellEndDate" in the last line of the data file back to a valid date as follows:
FK-5136,ML Fork,500,77.92,175.49,07/01/2006,06/30/2007
Now change the xsi:types in the XML format file back to their original values. "SafetyStockLevel" will be SQLSMALLINT and "StandardCost" will be SQLMONEY as follows:
<COLUMN SOURCE="3" NAME="SafetyStockLevel" xsi:type="SQLSMALLINT"/>
<COLUMN SOURCE="4" NAME="StandardCost" xsi:type="SQLMONEY"/>
Header Rows
Let's add a header row to the data file:
Prod Num,Prod Name,Stock Level,Std Cost,List Price,Start Date, End Date
AR-5381,Adjustable Race,1000,0.00,0.00,06/01/2002,
BA-8327,Bearing Ball,1000,0.00,0.00,06/01/2002,
BE-2349,BB Ball Bearing,800,0.00,0.00,06/01/2002,
BE-2908,Headset Ball Bearings,800,0.00,0.00,06/01/2002,
FK-1639,LL Fork,500,65.81,148.22,07/01/2006,06/30/2007
FK-5136,ML Fork,500,77.92,175.49,07/01/2006,06/30/2007
A header row will be fairly common when working with a data file of delimited fields. There's a good chance it will cause OPENROWSET to return an error. Even if it doesn't, it is "data" you don't want returned in the result set. The FIRSTROW parameter is your friend. It specifies the number of the first row. The default value is 1 (row numbers are 1-based). Here's an example:
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks Products.csv',
FORMATFILE = 'C:\SQL Server\ETL\Product.xml',
FIRSTROW = 2
) AS p
Creating Format Files
When you're ready to try using OPENROWSET() with some real data, you may be asking yourself "Where does the format file come from?" Most of the time, the responsibility of creating the format file falls on you. Chances are, a data file given to you will have something that explains the layout of the data. It might be in an email, a Word document, an Excel spreadsheet, etc. At the very least, I would expect the names of the data fields to be contained within a header row. You'll use that information to create the format file yourself. The MSDN documentation for XML Format Files is a great resource if you want to create it by hand in a text editor.
Alternatively, you may find it easier to use the bcp utility to create format files. For example, let's say you are given the sample data source file (with header row) used here in this post. We know the following data field names and their positions:
Looking at a sample of the data rows, you can probably infer the data type and length (if applicable). With that info in hand, create a temp table to match:
CREATE TABLE AdventureWorks.guest.Product (
ProductNumber VARCHAR(32),
ProductName VARCHAR(64),
StockLevel TINYINT,
StandardCost MONEY,
ListPrice MONEY,
StartDate DATE,
EndDate DATE
)
Next, run bcp with these parameters:
bcp AdventureWorks.guest.Product format nul -c -x -f "C:\SQL Server\ETL\ProductTest.xml" -t, -S SqlHostName -T
The name of the table is specified, -c for character data, -x to specify the XML format file type, -f with the full path of the XML format file to create, -t, specifies the field terminator (a comma), -S is the server name of the SQL host, and -T tells bcp to connect to SQL with a trusted connection.
Here is the contents of the "ProductTest.xml" format file created by bcp:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="32" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="64" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="5"/>
<FIELD ID="4" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="30"/>
<FIELD ID="5" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="30"/>
<FIELD ID="6" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="11"/>
<FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="11"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="ProductNumber" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="2" NAME="ProductName" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="3" NAME="StockLevel" xsi:type="SQLTINYINT"/>
<COLUMN SOURCE="4" NAME="StandardCost" xsi:type="SQLMONEY"/>
<COLUMN SOURCE="5" NAME="ListPrice" xsi:type="SQLMONEY"/>
<COLUMN SOURCE="6" NAME="StartDate" xsi:type="SQLDATE"/>
<COLUMN SOURCE="7" NAME="EndDate" xsi:type="SQLDATE"/>
</ROW>
</BCPFORMAT>
You'll notice some subtle differences between this format file and the original one: the basic structure is the same, but some of the MAX_LENGTH and xsi:type values may be different. If you encounter data conversion errors, you can adjust the MAX_LENGTH and xsi:type values as needed. As an example, MAX_LENGTH of FIELD ID="1" might be extended from 32 to 50.
I find this method easier than creating XML format files by hand. The initial format might not be perfect, but it's easy enough to edit it and make it work. Afterwards, table AdventureWorks.guest.Product can be dropped, if desired.
Data Containing a Field Delimiter
Let's add a another row to the data file. NOTE: I removed the header row to keep things simple.
AR-5381,Adjustable Race,1000,0.00,0.00,06/01/2002,
BA-8327,Bearing Ball,1000,0.00,0.00,06/01/2002,
BE-2349,BB Ball Bearing,800,0.00,0.00,06/01/2002,
BE-2908,Headset Ball Bearings,800,0.00,0.00,06/01/2002,
FK-1639,LL Fork,500,65.81,148.22,07/01/2006,06/30/2007
FK-5136,ML Fork,500,77.92,175.49,07/01/2006,06/30/2007
HL-U509,Sport-100 Helmet, Black,4,13.0863,34.99,07/01/2005,
Notice the Product Name: Sport-100 Helmet, Black. It contains a comma, which is our delimiter. With the data file and the XML format file in their current states, OPENROWSET will encounter errors. Any of our previous OPENROWSET() queries will generate an error:
Msg 4864, Level 16, State 1, Line 9
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 7, column 3 (SafetyStockLevel).
Since it's a data conversion error, I *could* try to get around it by using a flexible xsi:type for every column and a generous MAX_LENGTH value for every FIELD. Let's try that with a new XML format file named "ProductTest.xml" as follows:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100"/>
<FIELD ID="4" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100"/>
<FIELD ID="5" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100"/>
<FIELD ID="6" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="100"/>
<FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="100"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="ProductNumber" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="2" NAME="ProductName" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="3" NAME="StockLevel" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="4" NAME="StandardCost" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="5" NAME="ListPrice" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="6" NAME="StartDate" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="7" NAME="EndDate" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>
Every FIELD has a MAX_LENGTH of 100, and every COLUMN has xsi:type SQLVARYCHAR. Now let's try our standard "SELECT *" query with the new XML format file:
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks Products.csv',
FORMATFILE = 'C:\SQL Server\ETL\ProductTest.xml',
) AS p
Below is the result. The data fields in row 7 are skewed, with two data fields merged into the last column (EndDate).
Delimiters within data fields is a common problem with delimited files. Some common tactics to address this include:
- Recreate the data file with all occurrences of the delimiter stripped out of the data fields before they are written to file: this eliminates OPENROWSET errors, but does not preserve the integrity of the data.
- Recreate the data file with a different delimiter character: in my experience, a tab delimiter is a better choice. It's less common to encounter a tab character than a comma within data. But it's certainly not unheard of. I've seen tabs within data too.
- Enclose data fields in double quotes: this requires some tweaks to the XML format file. We'll get to this in a bit...
Manually editing the data file might be do-able for any of the above options. But it can be tedious, especially for large files. (Just opening a file of several GB's in Notepad.exe is an exercise in patience.) Realistically, you'd want the author to re-create it for you. Option #1 should always "work". But again, there's the data integrity issue you may not be able to live with. Option #2 will probably work for many cases, but it's not bulletproof. Option #3 isn't bulletproof either (it's always *possible* to have a delimiter within a data field), but it's about as close as you can get. Plus, it preserves data integrity. But there are some caveats.
On more than a few occasions, I've been given a data file that's formatted like what you see below. Go ahead and copy & paste that into a new file named "Quoted Products.csv".
AR-5381,Adjustable Race,1000,0,0,6/1/2002,
BA-8327,Bearing Ball,1000,0,0,6/1/2002,
BE-2349,BB Ball Bearing,800,0,0,6/1/2002,
BE-2908,Headset Ball Bearings,800,0,0,6/1/2002,
FK-1639,LL Fork,500,65.81,148.22,7/1/2006,6/30/2007
FK-5136,ML Fork,500,77.92,175.49,7/1/2006,6/30/2007
HL-U509,"Sport-100 Helmet, Black",4,13.0863,34.99,7/1/2005,
Let's try the OPENROWSET() query with our original XML format file and the new data source file:
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\Quoted Products.csv',
FORMATFILE = 'C:\SQL Server\ETL\Product.xml',
) AS p
Once again, we get a data conversion error:
Msg 4864, Level 16, State 1, Line 10
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 7, column 3 (SafetyStockLevel).
OPENROWSET() doesn't recognize the data enclosed in double quotes as one data field, so it tries to read in Black as the third field. But it can't--field three is xsi:type SQLSMALLINT. To get OPENROWSET() to recognize a data field enclosed in double-quotes, we must edit our XML format file as follows:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR=',"' MAX_LENGTH="50" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR='",' MAX_LENGTH="100" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR=',' MAX_LENGTH="7"/>
<FIELD ID="4" xsi:type="CharTerm" TERMINATOR=',' MAX_LENGTH="30"/>
<FIELD ID="5" xsi:type="CharTerm" TERMINATOR=',' MAX_LENGTH="30"/>
<FIELD ID="6" xsi:type="CharTerm" TERMINATOR=',' MAX_LENGTH="24"/>
<FIELD ID="7" xsi:type="CharTerm" TERMINATOR='\r\n' MAX_LENGTH="24"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="ProductNumber" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="2" NAME="ProductName" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="3" NAME="SafetyStockLevel" xsi:type="SQLSMALLINT"/>
<COLUMN SOURCE="4" NAME="StandardCost" xsi:type="SQLMONEY"/>
<COLUMN SOURCE="5" NAME="ListPrice" xsi:type="SQLMONEY"/>
<COLUMN SOURCE="6" NAME="SellStartDate" xsi:type="SQLDATETIME"/>
<COLUMN SOURCE="7" NAME="SellEndDate" xsi:type="SQLDATETIME"/>
</ROW>
</BCPFORMAT>
Note the new terminators for FIELDs 1 and 2. This is how we tell OPENROWSET the 2nd data field is enclosed in double-quotes. Also note the terminator for the remaining fields is ','. For consistency, I changed this from double-quotes to single-quotes.
We're getting close, but there's one more problem to fix. If we try the OPENROWSET() query again, we still get an error:
Msg 4864, Level 16, State 1, Line 11
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 1 (ProductNumber).
We haven't enclosed the ProductName in double-quotes in *all* of the rows in the data file. I have not found a way around this limitation. If the data file has some data fields enclosed in double-quotes (presumably, the fields that contain a delimiter character), but not all, OPENROWSET() won't work for you. It's got to be all or nothing in this regard 2I would be thrilled if I was wrong about this. If you have a solution, please share!
Let's make one final edit to the data file, enclosing all ProductNames in double-quotes:
AR-5381,"Adjustable Race",1000,0,0,6/1/2002,
BA-8327,"Bearing Ball",1000,0,0,6/1/2002,
BE-2349,"BB Ball Bearing",800,0,0,6/1/2002,
BE-2908,"Headset Ball Bearings",800,0,0,6/1/2002,
FK-1639,"LL Fork",500,65.81,148.22,7/1/2006,6/30/2007
FK-5136,"ML Fork",500,77.92,175.49,7/1/2006,6/30/2007
HL-U509,"Sport-100 Helmet, Black",4,13.0863,34.99,7/1/2005,
And now, our OPENROWSET() query should work:
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\Quoted Products.csv',
FORMATFILE = 'C:\SQL Server\ETL\Product.xml',
) AS p
Here is the result:
Comments