CSV Row Counts
Posted in: Development, R, T-SQL
This is one of those posts where if my opening premise is wrong, the whole thing is a moot point. I've been wrong before, so let's roll! There doesn't seem to be an efficient way to get a count of the number of records in a CSV file. Every record can be a different number of characters/bytes, so there's no way to derive the count from the file size alone. In my not-so-expert conclusion, you have to open the CSV file and read it in its entirety to get the record count. That can be quite inefficient for large files.
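For reference, that brute-force scan can be sketched in a few lines. This is Python rather than the languages timed below, and `count_records` is a name of my own choosing; the point is that counting newline bytes chunk by chunk still touches every byte of the file, but never holds more than one chunk in memory:

```python
def count_records(path, chunk_size=1 << 20):
    """Count newline-terminated records by scanning the file in 1 MB chunks.

    There is no shortcut: every byte gets read. But streaming fixed-size
    chunks keeps memory use flat no matter how large the file is.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count
```

(This assumes every record, including the last, ends with a newline; a file without a trailing newline would count one short.)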
I was reminded of this recently as I was working with R, trying to read a nearly 2 GB data file. I wanted to read in 5% of the data and output it to a smaller file that would make the test code run faster. The particular function I was working with needed a row count as one of its parameters. For me, that meant I had to determine the number of rows in the source file and multiply by 0.05. I tied the code for all of those tasks into one script block.
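As an aside, if the goal is just a ~5% sample file, taking every 20th row sidesteps the row count entirely. A sketch in Python (the function name `sample_every_nth` is mine, not from the R code I was using, which did require an explicit row count):

```python
def sample_every_nth(src, dst, n=20):
    """Copy the header plus every n-th data row of src to dst.

    Every 20th row yields roughly a 5% sample in one streaming pass,
    with no need to know the total row count in advance.
    """
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        fout.write(next(fin))          # keep the header row
        for i, line in enumerate(fin):
            if i % n == 0:
                fout.write(line)
```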
Now, to no surprise of mine, it was slow. In my short experience, I've found R isn't particularly snappy, even when the data fits comfortably in memory. I was pretty sure I could beat R's record-count performance handily with C#. And I did. I found some related questions on StackOverflow, where a small handful of answers discussed the efficiency of various approaches. I only tried two C# variations: my original attempt, and a second version that was supposed to be faster (the improvement was nominal).
Sometime later I remembered OPENROWSET for SQL Server. And I wondered...how fast would SQL Server count records in a CSV file? Pretty fast, it turns out. I didn't spend a ton of time on the R, C#, or T-SQL code. It's below if you want to take a look. In short, R was the slowest, C# was pretty fast, and T-SQL with OPENROWSET was the fastest.
| R | C# (1st version) | C# (2nd version) | T-SQL |
|---|---|---|---|
| 122.60 seconds | 12.7982 seconds | 12.2451 seconds | 6.996 seconds |
R
> begin <- proc.time()
> setwd("C:/Data/")
> filepath <- file("yellow_tripdata_2016-01.csv",open="r")
> maxLinesToRead <- 20000
> counter <- 0
> ( while((linesread <- length(readLines(con = filepath, n = maxLinesToRead))) > 0 )
+ counter <- counter+linesread )
NULL
Warning message:
closing unused connection 3 (yellow_tripdata_2016-01.csv)
> close(filepath)
> counter
[1] 10906859
> proc.time() - begin
user system elapsed
116.90 3.08 122.60
> paste("Record Count", counter, sep = ": ")
[1] "Record Count: 10906859"
C#
//Version 1
int counter = 0;
DateTime start = DateTime.Now;
using (System.IO.StreamReader file =
new System.IO.StreamReader(@"C:\Data\yellow_tripdata_2016-01.csv"))
{
    while (file.ReadLine() != null)
    {
        counter++;
    }
    //no explicit Close() needed; the using block disposes the reader
}
DateTime end = DateTime.Now;
TimeSpan elapsed = end - start;
Console.WriteLine("Record count: " + counter.ToString());
Console.WriteLine(elapsed.ToString("c"));
//Version 2 (Count() requires a "using System.Linq;" directive)
int counter = 0;
DateTime start = DateTime.Now;
counter = System.IO.File.ReadLines(@"C:\Data\yellow_tripdata_2016-01.csv").Count();
DateTime end = DateTime.Now;
TimeSpan elapsed = end - start;
Console.WriteLine("Record count: " + counter.ToString());
Console.WriteLine(elapsed.ToString("c"));
T-SQL
DECLARE @Start DATETIME2(7) = CURRENT_TIMESTAMP;
SELECT COUNT(*)
FROM OPENROWSET (
BULK 'C:\Data\yellow_tripdata_2016-01.csv',
FORMATFILE = 'C:\Data\VarCharMax.xml'
) AS d
SELECT DATEDIFF_BIG(MILLISECOND, @Start, CURRENT_TIMESTAMP) / 1000.0
OPENROWSET Format File
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="VarCharMax" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>