CSV Row Counts
Posted in: Development, R, T-SQL
This is one of those posts where if my opening premise is wrong, the whole thing is a moot point. I've been wrong before, so let's roll! There doesn't seem to be an efficient way to get a count of the number of records in a CSV file. Every record can be a different number of characters/bytes, so there's no way to derive the count from the file size alone. In my not-so-expert conclusion, you have to open the CSV file and read it in its entirety to get the record count. That can be quite inefficient for large files.
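For reference, that brute-force scan can be sketched in a few lines. This is Python rather than the languages timed below, and `count_records` is a name of my own choosing; the point is that counting newline bytes chunk by chunk still touches every byte of the file, but never holds more than one chunk in memory:

```python
def count_records(path, chunk_size=1 << 20):
    """Count newline-terminated records by scanning the file in 1 MB chunks.

    There is no shortcut: every byte gets read. But streaming fixed-size
    chunks keeps memory use flat no matter how large the file is.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count
```

(This assumes every record, including the last, ends with a newline; a file without a trailing newline would count one short.)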
I was reminded of this recently as I was working with R, trying to read a nearly 2 GB data file. I wanted to read in 5% of the data and output it to a smaller file that would make the test code run faster. The particular function I was working with needed a row count as one of its parameters. For me, that meant I had to determine the number of rows in the source file and multiply by 0.05. I tied the code for all of those tasks into one script block.
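As an aside, if the goal is just a ~5% sample file, taking every 20th row sidesteps the row count entirely. A sketch in Python (the function name `sample_every_nth` is mine, not from the R code I was using, which did require an explicit row count):

```python
def sample_every_nth(src, dst, n=20):
    """Copy the header plus every n-th data row of src to dst.

    Every 20th row yields roughly a 5% sample in one streaming pass,
    with no need to know the total row count in advance.
    """
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        fout.write(next(fin))          # keep the header row
        for i, line in enumerate(fin):
            if i % n == 0:
                fout.write(line)
```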
Now, to no surprise of mine, it was slow. In my short experience, I've found R isn't particularly snappy, even when the data fits comfortably in memory. I was pretty sure I could beat R's record-count performance handily with C#. And I did. I found some related questions on StackOverflow, where a small handful of answers discussed the efficiency of various approaches. I only tried two C# variations: my original attempt, and a second version that was supposed to be faster (the improvement was nominal).
Sometime later I remembered OPENROWSET for SQL Server. And I wondered...how fast would SQL Server count records in a CSV file? Pretty fast, it turns out. I didn't spend a ton of time on the R, C#, or T-SQL code. It's below if you want to take a look. In short, R was the slowest, C# was pretty fast, and T-SQL with OPENROWSET was the fastest.
| R | C# (1st version) | C# (2nd version) | T-SQL |
|---|---|---|---|
| 122.60 seconds | 12.7982 seconds | 12.2451 seconds | 6.996 seconds |
R
> begin <- proc.time()
> setwd("C:/Data/")
> filepath <- file("yellow_tripdata_2016-01.csv",open="r")
> maxLinesToRead <- 20000
> counter <- 0
> ( while((linesread <- length(readLines(con = filepath, n = maxLinesToRead))) > 0 )
+ counter <- counter+linesread )
NULL
Warning message:
closing unused connection 3 (yellow_tripdata_2016-01.csv)
> close(filepath)
> counter
[1] 10906859
> proc.time() - begin
user system elapsed
116.90 3.08 122.60
> paste("Record Count", counter, sep = ": ")
[1] "Record Count: 10906859"
C#
//Version 1
int counter = 0;
DateTime start = DateTime.Now;
using (System.IO.StreamReader file =
new System.IO.StreamReader(@"C:\Data\yellow_tripdata_2016-01.csv"))
{
    while (file.ReadLine() != null)
    {
        counter++;
    }
    //no explicit Close() needed; the using block disposes the reader
}
DateTime end = DateTime.Now;
TimeSpan elapsed = end - start;
Console.WriteLine("Record count: " + counter.ToString());
Console.WriteLine(elapsed.ToString("c"));
//Version 2 (Count() requires a "using System.Linq;" directive)
int counter = 0;
DateTime start = DateTime.Now;
counter = System.IO.File.ReadLines(@"C:\Data\yellow_tripdata_2016-01.csv").Count();
DateTime end = DateTime.Now;
TimeSpan elapsed = end - start;
Console.WriteLine("Record count: " + counter.ToString());
Console.WriteLine(elapsed.ToString("c"));
T-SQL
DECLARE @Start DATETIME2(7) = CURRENT_TIMESTAMP;
SELECT COUNT(*)
FROM OPENROWSET (
BULK 'C:\Data\yellow_tripdata_2016-01.csv',
FORMATFILE = 'C:\Data\VarCharMax.xml'
) AS d
SELECT DATEDIFF_BIG(MILLISECOND, @Start, CURRENT_TIMESTAMP) / 1000.0
OPENROWSET Format File
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="VarCharMax" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>