Getting Started With OPENROWSET and the BULK Rowset Provider - Part 1
- Posted in:
- Development
- SQL Server
The built-in SQL Server function OPENROWSET() provides a way to access remote data from an OLE DB data source. It can be used with the BULK rowset provider to read data from a file without loading the data into a target table. This post will show the basics to get started with OPENROWSET(), the BULK rowset provider, and text files of fixed-width data fields.
Before we get too far, a few words about Ad Hoc Distributed Queries. I am neither advocating nor decrying the enabling and usage of this Server Configuration Option. This post is merely an exercise to show some of OPENROWSET's functionality that may fit a particular use case. Is it ok to enable it? Decide for yourself. To use OPENROWSET, you'll need to enable Ad Hoc Distributed Queries.
I was working with SQL Server 2012 when this post was written. For other versions, your mileage may vary. All of the files used for this post resided directly on the SQL Server host file system.
Here's the basic syntax for OPENROWSET() with the BULK Rowset Provider:
As you can see, there are two parameters:
SELECT *
FROM OPENROWSET (
BULK '[Data_Source_File]',
FORMATFILE = '[Format_File]'
) AS d
Here are a few lines of data from a sample data file. These are "Persons" from the AdventureWorks database. You can probably guess what some of the data fields are and where they begin and end. Others may not be so apparent.
EM Ken Sánchez 0 02/08/2003
EM Terri Duffy 1 02/24/2002
EM Gary Altman III 0 12/27/2003
EM Rob Walters 0 12/29/2001
EMMs. Gail Erickson 0 01/30/2002
EMMr. Jossef Goldberg 0 02/17/2002
Now let's take a look at a format file for use with the above data:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharFixed" LENGTH="2" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharFixed" LENGTH="8" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="4" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="5" xsi:type="CharFixed" LENGTH="10" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="6" xsi:type="CharFixed" LENGTH="5" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="10" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="PersonType" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="2" NAME="Title" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="5" NAME="Suffix" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="6" NAME="EmailPromotion" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="7" NAME="HireDate" xsi:type="SQLCHAR"/>
</ROW>
</BCPFORMAT>
And now, let's run OPENROWSET() with the above data file and format file. Copy and paste the "Persons" data into one file and copy/paste the format file XML into another file. Save them to a local drive on the SQL Server host. For this example, the files will be in C:\SQL Server\ETL. Here's our tsql command, which I'll run from SSMS:
SELECT *
FROM OPENROWSET (
--The data file.
BULK 'C:\SQL Server\ETL\AdventureWorks People.txt',
--The format file.
FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
Here is the result:
SQL Server returns the data as a set of rows and named columns. It looks a lot like querying a table or view, wouldn't you say? You can do more than just SELECT *. Here's a few examples:
--Only people with last name "Duffy"
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks People.txt',
FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
WHERE d.LastName = 'Duffy'
--People sorted by first name
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks People.txt',
FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
ORDER BY d.FirstName
--Concatenate fields to produce "Full Name"
SELECT RTRIM(d.FirstName) + ' ' + RTRIM(d.LastName) AS [Full Name]
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks People.txt',
FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
So far, so good. Let's look at another query:
--People sorted by hire date
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks People.txt',
FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
ORDER BY d.HireDate
Here is the result:
Does anything look wrong? Is hire date sorted as expected? Probably not for most of us. Hire date is not sorted as it would be if it was a DATE data type (or SMALLDATETIME, DATETIME, etc.) And that's because it isn't. It's a CHAR() data type. For the reason why, let's take a closer look at the format file. There is a single <RECORD> element with multiple <FIELD> elements that correspond to the fields in the data file. There is also a single <ROW> element with multiple <COLUMN> elements that correspond to table columns.
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharFixed" LENGTH="2" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharFixed" LENGTH="8" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="4" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="5" xsi:type="CharFixed" LENGTH="10" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="6" xsi:type="CharFixed" LENGTH="5" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="10" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="PersonType" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="2" NAME="Title" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="5" NAME="Suffix" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="6" NAME="EmailPromotion" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="7" NAME="HireDate" xsi:type="SQLCHAR"/>
</ROW>
</BCPFORMAT>
The "type" for the HireDate column is "SQLCHAR"--that's why OPENROWSET() returns HireDate as a CHAR() data type in the rowset. To have the HireDate properly typed, edit the format file and change the "type" of the HireDate to "SQLDATE" as follows:
<COLUMN SOURCE="7" NAME="HireDate" xsi:type="SQLDATE"/>
Our previous query should sort correctly now:
--People sorted by hire date
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks People.txt',
FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
ORDER BY d.HireDate
Here is the result:
Data Types
Let's add another line of data to our data source file, so that it now looks like this:
If we try any OPENROWSET() queries with our current format file, we'll get an error:
EM Ken Sánchez 0 02/08/2003
EM Terri Duffy 1 02/24/2002
EM Gary Altman III 0 12/27/2003
EM Rob Walters 0 12/29/2001
EMMs. Gail Erickson 0 01/30/2002
EMMr. Jossef Goldberg 0 02/17/2002
IN Kristy Ortega 1 13/31/2008
The error is telling us that there is invalid data for the HireDate field in row 7 (13/31/2008). Our format file still specifies "SQLDATE" for the type, and SQL cannot properly convert the data. To address this error, there are a few options:
Msg 4864, Level 16, State 1, Line 1
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 7, column 7 (HireDate).
Header Rows
Although not common, you may encounter a header row with a data file of fixed-width fields. There's a good chance it will cause an error. Even if it doesn't, it's "data" you don't want returned in the result set. The FIRSTROW parameter is your friend. It specifies the number of the first row. The default value is 1 (row numbers are 1-based). Here's an example:
SELECT *
FROM OPENROWSET (
BULK 'C:\SQL Server\ETL\AdventureWorks People.txt',
FORMATFILE = 'C:\SQL Server\ETL\People.xml',
FIRSTROW = 2
) AS d
Creating Format Files
When you're ready to try using OPENROWSET() with some real data, you may be asking yourself "Where does the format file come from?" Most of the time, the responsibility of creating the format file falls on you. Chances are, a data file given to you will have something that explains the layout of the data. It might be in an email, a Word document, an Excel spreadsheet, etc. It will be pretty rare when someone gives you an actual format file along with the data source file. So you'll have to create the format file yourself. The MSDN documentation for XML Format Files is a great resource if you want to create it by hand in a text editor.
Alternatively, you may find it easier to use the bcp utility to create format files. For example, let's say you are given a data source file of fixed-width fields, along with this layout information:
Field Number | Field Name | Field Width |
1 | Employee ID | 15 |
2 | First Name | 25 |
3 | Last Name | 25 |
4 | Department | 12 |
5 | Supervisor ID | 15 |
You could create a temp table with columns and lengths that match the fields in the data source file, per the layout info:
CREATE TABLE AdventureWorks.guest.Employee(
EmployeeID CHAR(15),
FirstName CHAR(25),
LastName CHAR(25),
Department CHAR(12),
SupervisorID CHAR(15)
)
Notice the use of CHAR data types for all of the columns. The width of each column matches the field widths from the file layout information that was provided. Next, run bcp with these parameters:
bcp AdventureWorks.guest.Employee format nul -c -x -f "C:\SQL Server\ETL\Employee.xml" -t -S SqlHostName -T
The name of the table is specified, -c for character data, -x to specify the XML format file type, -f with the full path of the XML format file to create, -S is the server name of the SQL host, and -T tells bcp to connect to SQL with a trusted connection.
-t specifies the field terminator. Since we will be working with fixed-width fields, the parameter is followed by a space, indicating there is no field terminator.
Here is the contents of the XML format file created by bcp:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharFixed" LENGTH="15" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="2" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="4" xsi:type="CharFixed" LENGTH="12" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="5" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="15" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="EmployeeID" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="2" NAME="FirstName" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="3" NAME="LastName" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="4" NAME="Department" xsi:type="SQLCHAR"/>
<COLUMN SOURCE="5" NAME="SupervisorID" xsi:type="SQLCHAR"/>
</ROW>
</BCPFORMAT>
Afterwards, table AdventureWorks.guest.Employee can be dropped, if desired. I find this method easier than creating XML format files by hand. The initial format might not be perfect, but it can easily be edited after creation. For instance, we may determine that the EmployeeID and SupervisorID fields are numeric and change their types to "SQLINT".
In the next post, I'll show how to use OPENROWSET(), the BULK Rowset Provider, and text files with delimited data fields...
Comments