Getting Started With OPENROWSET and the BULK Rowset Provider - Part 1

2016-01-18

Posted in:
Development
SQL Server

The built-in SQL Server function OPENROWSET() provides a way to access remote data from an OLE DB data source. It can be used with the BULK rowset provider to read data from a file without loading the data into a target table. This post will show the basics to get started with OPENROWSET(), the BULK rowset provider, and text files of fixed-width data fields.

Before we get too far, a few words about Ad Hoc Distributed Queries. I am neither advocating nor decrying the enabling and usage of this Server Configuration Option. This post is merely an exercise to show some of OPENROWSET's functionality that may fit a particular use case. Is it ok to enable it? Decide for yourself. To use OPENROWSET, you'll need to enable Ad Hoc Distributed Queries.

NOTES:
I was working with SQL Server 2012 when this post was written. For other versions, your mileage may vary. All of the files used for this post resided directly on the SQL Server host file system.

Here's the basic syntax for OPENROWSET() with the BULK Rowset Provider:


SELECT *
FROM OPENROWSET (
    BULK '[Data_Source_File]',
    FORMATFILE = '[Format_File]'
) AS d

As you can see, there are two parameters:

Data_Source_File: this is the full path of the file that contains the data.

Format_File: this is the full path of the format file. It describes the data fields in the data source file and specifies column names and data types for the result set.

Here are a few lines of data from a sample data file. These are "Persons" from the AdventureWorks database. You can probably guess what some of the data fields are and where they begin and end. Others may not be so apparent.


EM        Ken                      Sánchez                            0    02/08/2003
EM        Terri                    Duffy                              1    02/24/2002
EM        Gary                     Altman                   III       0    12/27/2003
EM        Rob                      Walters                            0    12/29/2001
EMMs.     Gail                     Erickson                           0    01/30/2002
EMMr.     Jossef                   Goldberg                           0    02/17/2002

Now let's take a look at a format file for use with the above data:


<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharFixed" LENGTH="2" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="2" xsi:type="CharFixed" LENGTH="8" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="3" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="4" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="5" xsi:type="CharFixed" LENGTH="10" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="6" xsi:type="CharFixed" LENGTH="5" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="10" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="PersonType" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="2" NAME="Title" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="5" NAME="Suffix" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="6" NAME="EmailPromotion" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="7" NAME="HireDate" xsi:type="SQLCHAR"/>
 </ROW>
</BCPFORMAT>

And now, let's run OPENROWSET() with the above data file and format file. Copy and paste the "Persons" data into one file and copy/paste the format file XML into another file. Save them to a local drive on the SQL Server host. For this example, the files will be in C:\SQL Server\ETL. Here's our tsql command, which I'll run from SSMS:


SELECT *
FROM OPENROWSET (
    --The data file.
    BULK 'C:\SQL Server\ETL\AdventureWorks People.txt', 

    --The format file.
    FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d

Here is the result:

SQL Server returns the data as a set of rows and named columns. It looks a lot like querying a table or view, wouldn't you say? You can do more than just SELECT *. Here's a few examples:


--Only people with last name "Duffy"
SELECT * 
FROM OPENROWSET (
    BULK 'C:\SQL Server\ETL\AdventureWorks People.txt', 
    FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
WHERE d.LastName = 'Duffy'

--People sorted by first name
SELECT * 
FROM OPENROWSET (
    BULK 'C:\SQL Server\ETL\AdventureWorks People.txt', 
    FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
ORDER BY d.FirstName

--Concatenate fields to produce "Full Name"
SELECT RTRIM(d.FirstName) + ' ' + RTRIM(d.LastName) AS [Full Name] 
FROM OPENROWSET (
    BULK 'C:\SQL Server\ETL\AdventureWorks People.txt', 
    FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d

So far, so good. Let's look at another query:


--People sorted by hire date
SELECT * 
FROM OPENROWSET (
    BULK 'C:\SQL Server\ETL\AdventureWorks People.txt', 
    FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
ORDER BY d.HireDate

Here is the result:

Does anything look wrong? Is hire date sorted as expected? Probably not for most of us. Hire date is not sorted as it would be if it was a DATE data type (or SMALLDATETIME, DATETIME, etc.) And that's because it isn't. It's a CHAR() data type. For the reason why, let's take a closer look at the format file. There is a single <RECORD> element with multiple <FIELD> elements that correspond to the fields in the data file. There is also a single <ROW> element with multiple <COLUMN> elements that correspond to table columns.


<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharFixed" LENGTH="2" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="2" xsi:type="CharFixed" LENGTH="8" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="3" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="4" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="5" xsi:type="CharFixed" LENGTH="10" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="6" xsi:type="CharFixed" LENGTH="5" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="10" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="PersonType" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="2" NAME="Title" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="5" NAME="Suffix" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="6" NAME="EmailPromotion" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="7" NAME="HireDate" xsi:type="SQLCHAR"/>
 </ROW>
</BCPFORMAT>

The "type" for the HireDate column is "SQLCHAR"--that's why OPENROWSET() returns HireDate as a CHAR() data type in the rowset. To have the HireDate properly typed, edit the format file and change the "type" of the HireDate to "SQLDATE" as follows:


<COLUMN SOURCE="7" NAME="HireDate" xsi:type="SQLDATE"/>

Our previous query should sort correctly now:


--People sorted by hire date
SELECT * 
FROM OPENROWSET (
    BULK 'C:\SQL Server\ETL\AdventureWorks People.txt', 
    FORMATFILE = 'C:\SQL Server\ETL\People.xml'
) AS d
ORDER BY d.HireDate

Here is the result:

Data Types

Let's add another line of data to our data source file, so that it now looks like this:


EM        Ken                      Sánchez                            0    02/08/2003
EM        Terri                    Duffy                              1    02/24/2002
EM        Gary                     Altman                   III       0    12/27/2003
EM        Rob                      Walters                            0    12/29/2001
EMMs.     Gail                     Erickson                           0    01/30/2002
EMMr.     Jossef                   Goldberg                           0    02/17/2002
IN        Kristy                   Ortega                             1    13/31/2008

If we try any OPENROWSET() queries with our current format file, we'll get an error:


Msg 4864, Level 16, State 1, Line 1
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 7, column 7 (HireDate).

The error is telling us that there is invalid data for the HireDate field in row 7 (13/31/2008). Our format file still specifies "SQLDATE" for the type, and SQL cannot properly convert the data. To address this error, there are a few options:

Edit the format file, and change the HireDate type back to "SQLCHAR". This makes the error go away, but it doesn't really "fix" the problem.

Remove the row of data from the data source file. This also makes the error go away, but it doesn't really "fix" the problem.

Fix the invalid data (change "13/31/2008" to a valid date). For this, you'd have to seek out the creator of the data source file and have them supply you with the correct date.

Header Rows

Although not common, you may encounter a header row with a data file of fixed-width fields. There's a good chance it will cause an error. Even if it doesn't, it's "data" you don't want returned in the result set. The FIRSTROW parameter is your friend. It specifies the number of the first row. The default value is 1 (row numbers are 1-based). Here's an example:


SELECT *
FROM OPENROWSET (
    BULK 'C:\SQL Server\ETL\AdventureWorks People.txt', 
    FORMATFILE = 'C:\SQL Server\ETL\People.xml',
    FIRSTROW = 2
) AS d

Creating Format Files

When you're ready to try using OPENROWSET() with some real data, you may be asking yourself "Where does the format file come from?" Most of the time, the responsibility of creating the format file falls on you. Chances are, a data file given to you will have something that explains the layout of the data. It might be in an email, a Word document, an Excel spreadsheet, etc. It will be pretty rare when someone gives you an actual format file along with the data source file. So you'll have to create the format file yourself. The MSDN documentation for XML Format Files is a great resource if you want to create it by hand in a text editor.

Alternatively, you may find it easier to use the bcp utility to create format files. For example, let's say you are given a data source file of fixed-width fields, along with this layout information:

Field Number	Field Name	Field Width
1	Employee ID	15
2	First Name	25
3	Last Name	25
4	Department	12
5	Supervisor ID	15

You could create a temp table with columns and lengths that match the fields in the data source file, per the layout info:


CREATE TABLE AdventureWorks.guest.Employee(
    EmployeeID CHAR(15),
    FirstName CHAR(25),
    LastName CHAR(25),
    Department CHAR(12),
    SupervisorID CHAR(15)
)

Notice the use of CHAR data types for all of the columns. The width of each column matches the field widths from the file layout information that was provided. Next, run bcp with these parameters:


bcp AdventureWorks.guest.Employee format nul -c -x -f "C:\SQL Server\ETL\Employee.xml" -t -S SqlHostName -T

The name of the table is specified, -c for character data, -x to specify the XML format file type, -f with the full path of the XML format file to create, -S is the server name of the SQL host, and -T tells bcp to connect to SQL with a trusted connection.

-t specifies the field terminator. Since we will be working with fixed-width fields, the parameter is followed by a space, indicating there is no field terminator.

Here is the contents of the XML format file created by bcp:


<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharFixed" LENGTH="15" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="2" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="3" xsi:type="CharFixed" LENGTH="25" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="4" xsi:type="CharFixed" LENGTH="12" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="15" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="EmployeeID" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="2" NAME="FirstName" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="3" NAME="LastName" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="4" NAME="Department" xsi:type="SQLCHAR"/>
  <COLUMN SOURCE="5" NAME="SupervisorID" xsi:type="SQLCHAR"/>
 </ROW>
</BCPFORMAT>

Afterwards, table AdventureWorks.guest.Employee can be dropped, if desired. I find this method easier than creating XML format files by hand. The initial format might not be perfect, but it can easily be edited after creation. For instance, we may determine that the EmployeeID and SupervisorID fields are numeric and change their types to "SQLINT".

In the next post, I'll show how to use OPENROWSET(), the BULK Rowset Provider, and text files with delimited data fields...

Data Types

Header Rows

Creating Format Files

Comments