Fastest way to bulk insert new rows to MSSQL?
I have a slightly strange case where my code's only interaction with an MSSQL database is to insert a lot of entries that may already exist in the table. If an entry already exists, it should be skipped; otherwise it should be inserted. The common use case is that most of the data being inserted already exists in the table. A .csv or .xlsx file is uploaded to a web portal, validated, and then pushed over to the database. I'm currently running this as a .NET Core 2.2 console app.
I've tried a couple different ways of doing this, but it's a bit slower than I'd like on the test set I'm using (78,043 rows by 27 columns). The actual sets will vary from about that size to maybe 5x as large.
1. I first tried EF Core's `AddRange`, but it was painfully slow; so slow that I never let it complete.
2. Then I installed the EFCore.BulkExtensions package and used `BulkInsertOrUpdate`. This inserted the test set in 35-60 seconds. When I ran it again, with all the data already in the table, it took the same amount of time. I figured this couldn't be the quickest way, because it writes every row whether it already exists or not.
3. So I created a unique index on the primary key (a single column), set it to ignore duplicate keys, and used `BulkInsert` instead of `BulkInsertOrUpdate`. There was no noticeable improvement, and it again took about the same amount of time when all the data already existed in the database.
4. Then I ditched EF Core and tried `System.Data.SqlClient.SqlBulkCopy.WriteToServer` while keeping the ignore-duplicate-key unique index. I had an `IEnumerable` for the data set and didn't want to change code to test it out, so I used FastMember's `ObjectReader.Create` to feed it into `WriteToServer` (see the bottom of the FastMember README for the code example I used; it looked roughly like the sketch below). Once again, no noticeable improvement; it might have even been slower.
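For reference, the attempt 4 code followed this shape. All table, column, and connection-string names here are placeholders I've made up for the example, and the entity is trimmed down from the real 27 columns:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using FastMember;

// Hypothetical entity; the real table has 27 columns.
public class Entry
{
    public int Id { get; set; }
    public string Name { get; set; }
    public decimal Value { get; set; }
}

public static class BulkLoader
{
    // One-time setup, run beforehand: a unique index on the key that
    // silently drops duplicate rows instead of raising an error.
    //
    //   CREATE UNIQUE INDEX UX_Entries_Id
    //       ON dbo.Entries (Id)
    //       WITH (IGNORE_DUP_KEY = ON);

    public static void Load(IEnumerable<Entry> entries, string connectionString)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        // ObjectReader wraps the IEnumerable as an IDataReader; the member
        // names passed here must match the Entry properties.
        using (var reader = ObjectReader.Create(entries, "Id", "Name", "Value"))
        {
            bulk.DestinationTableName = "dbo.Entries";
            // ColumnMappings may be needed if the table's column order
            // doesn't match the reader's column order.
            bulk.WriteToServer(reader);
        }
    }
}
```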
So I kind of expected the similar results in attempts 2, 3, and 4, since as far as I can tell they all end up using SqlBulkCopy under the hood. Still, it all seems significantly slower than it should be, and I'm not sure where the bottleneck is:
- The CPU isn't limiting anything; usage is negligible except during .csv and .xlsx parsing.
- There's plenty of memory available.
- A .csv file is parsed and fully modeled in ~0.6 seconds.
- A .xlsx file is parsed and fully modeled in ~7 seconds.
- Network usage stays around 5 Mbps for the entirety of the operation, on a 250 Mbps+ wired connection to the database.
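Given how little of the connection gets used, I've been wondering whether SqlBulkCopy's batching and locking settings would make a difference. These are the knobs I'm aware of; the destination name is made up and the values below are guesses, not measured recommendations:

```csharp
using System.Data.SqlClient;

public static class BulkCopyTuning
{
    public static SqlBulkCopy Create(string connectionString)
    {
        // TableLock takes a bulk-update table lock instead of per-row
        // locks, which is generally faster for large loads.
        var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock);

        bulk.DestinationTableName = "dbo.Entries"; // hypothetical table name
        bulk.BatchSize = 10_000;     // rows per batch; 0 (the default) sends everything in one batch
        bulk.BulkCopyTimeout = 0;    // no timeout; the default is 30 seconds
        bulk.EnableStreaming = true; // stream rows from the IDataReader instead of buffering them

        return bulk;
    }
}
```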
I'm pretty rusty with database work in general, as I haven't touched it much for a couple of years. This is a pretty short, one-off project that will get replaced in a couple of months, so performance isn't that important, but I'd still like to figure out the 'best' way to do this within those constraints.
Another point of reference for the speed: a `SELECT * FROM TableName`, when the table is filled with the 78,043 rows of 27 columns, takes 25-30 seconds. I'm going to check later whether there's some rate limiting set on the SQL Server instance; the `SELECT *` taking 25-30 seconds makes me suspect there might be, though I still doubt it. Another spitball guess is that some hardware on the database server is failing.
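One approach I haven't tried yet, which I've seen recommended for exactly this kind of skip-duplicates load, is to bulk copy everything into an empty staging table and let the server filter out existing rows in one set-based statement. A rough sketch, again with made-up object names (`dbo.Entries_Staging` would be an indexless copy of the target table's shape):

```csharp
using System.Data.SqlClient;

public static class StagedLoader
{
    public static void MergeStagingIntoTarget(string connectionString)
    {
        // Assumes the file has already been SqlBulkCopy'd into
        // dbo.Entries_Staging (no indexes or constraints), and that the
        // staged data itself contains no duplicate keys.
        const string mergeSql = @"
            INSERT INTO dbo.Entries (Id, Name, Value)
            SELECT s.Id, s.Name, s.Value
            FROM dbo.Entries_Staging AS s
            WHERE NOT EXISTS (SELECT 1 FROM dbo.Entries AS e WHERE e.Id = s.Id);

            TRUNCATE TABLE dbo.Entries_Staging;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(mergeSql, conn))
        {
            cmd.CommandTimeout = 0; // large loads can exceed the 30-second default
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

If I understand it right, that would turn the common case (row already exists) into a server-side anti-join instead of shipping every row through the ignore-duplicates index, but I haven't measured it.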
Does anyone have any guesses on how to improve the insertion time? Or is it wishful thinking and it's already running about as fast as it will run?