Dataflow vs Channels: Evolution of AsyncFileWriter
Last week I published this Reddit post: https://redd.it/8a7uri
In it, I experimented with using Dataflow classes as a means to funnel bytes from multiple threads into a single file.
Some important notes and realizations:
1) System.IO.FileStream can append data from multiple threads if the FileShare mode is set to Write. However, creating and disposing of a separate FileStream object for each individual block of data is quite inefficient and can be very slow.
A single FileStream object is not thread-safe.
2) Asynchronous writes, or even setting the FileStream to be asynchronous, do not mean more speed. They just mean potentially less total blocking at the cost of more latency. If you are trying to shove as many bytes into a file as fast as possible, repeated calls to .WriteAsync are likely to lengthen the total time it takes to write all bytes to the file.
3) Pipelines have an unintuitive potential to fill at each segment before moving on to the next segment. Therefore a single-segment pipeline can be completely full before the consumer can begin draining the queue. This is extremely problematic when the producing threads have priority equal to or greater than that of the subsequent consumer. Great care has to be taken to ensure the consumer (draining the queue) is either not interrupted or somehow takes priority over producing Tasks/Threads. This effect is less obvious for unbounded queues, as an entire queue can fill up before starting to drain. Bounded queues can then easily become problematic if data is postponed or forced to block/wait.
Introducing Channels
Detailed Explanation: https://github.com/stephentoub/corefxlab/blob/master/src/System.Threading.Tasks.Channels/README.md
Source Code: https://github.com/dotnet/corefx/tree/master/src/System.Threading.Channels (currently pre-release on NuGet)
Principal Author: Stephen Toub (Thank you!)
Are Channels Better?
The simple answer is YES. They're much faster than Dataflow blocks, and as explained in the read-me above, may end up replacing the internal mechanisms of some Dataflow blocks. When it comes to producer/consumer queues, Channels not only are more performant, but the API is more elegant and simple.
But if you need interlocking, block-like functionality, then Dataflow may be easier to implement and understand.
How do they differ from Dataflow?
Channels provide a .Writer and a .Reader property which can be (obviously) written to and read from. They are specific to producer/consumer scenarios and don't offer some of the control flow that Dataflow blocks offer.
.WaitToWriteAsync and .WaitToReadAsync offer a simple means of waiting until the channel is available for writing or reading.
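To make the Writer/Reader split concrete, here is a minimal sketch of a bounded channel with one producer and one consumer. The class and method names (ChannelSketch, RunAsync) are made up for illustration and are not part of AsyncFileWriter; only the System.Threading.Channels calls themselves are real API.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelSketch
{
    // Hypothetical helper: pushes `count` integers through a bounded channel
    // and returns the items in the order the reader received them.
    public static async Task<List<int>> RunAsync(int capacity, int count)
    {
        var channel = Channel.CreateBounded<int>(capacity);

        // Producer: try the synchronous fast path first and only wait when the channel is full.
        var producer = Task.Run(async () =>
        {
            for (var i = 0; i < count; i++)
            {
                while (!channel.Writer.TryWrite(i))
                {
                    // Yields false if the channel was completed, so stop producing.
                    if (!await channel.Writer.WaitToWriteAsync())
                        return;
                }
            }
            channel.Writer.Complete(); // signals the reader that no more items are coming
        });

        // Consumer: wait until data is available, then drain everything that is ready.
        var received = new List<int>();
        while (await channel.Reader.WaitToReadAsync())
        {
            while (channel.Reader.TryRead(out var item))
                received.Add(item);
        }

        await producer;
        return received;
    }

    public static async Task Main()
    {
        var items = await RunAsync(capacity: 4, count: 20);
        Console.WriteLine(string.Join(",", items));
    }
}
```

With a single producer, the reader sees the items in the order they were written. The inner TryWrite/TryRead loops avoid an await per item, which is the same pattern the final implementation uses on the read side.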
Final Implementation
https://github.com/electricessence/AsyncFileWriter/blob/master/AsyncFileWriter/AsyncFileWriter.cs
Below is the internal task that not only does the work of writing the bytes, but offers up the Completion task to the user.
async Task ProcessBytes(CancellationToken token)
{
    while (await _channel.Reader.WaitToReadAsync(token))
    {
        using (var fs = new FileStream(FilePath, FileMode.Append, FileAccess.Write, FileShareMode))
        {
            while (_channel.Reader.TryRead(out byte[] bytes))
            {
                token.ThrowIfCancellationRequested();
                fs.Write(bytes, 0, bytes.Length);
            }
        }
    }
}
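For context, here is a rough, self-contained sketch of how several producers might feed a consumer loop shaped like the one above. It is not the actual AsyncFileWriter class: the names FunnelSketch and WriteAllAsync, the use of FileShare.Read, and the line-per-message payload are all assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class FunnelSketch
{
    // Hypothetical helper: `producers` tasks each queue `linesPerProducer` lines,
    // and a single consumer appends them to the file at `path`.
    public static async Task WriteAllAsync(
        string path, int producers, int linesPerProducer, int capacity,
        CancellationToken token = default)
    {
        var channel = Channel.CreateBounded<byte[]>(new BoundedChannelOptions(capacity)
        {
            SingleReader = true // only the consumer loop below reads
        });

        // Consumer: same shape as ProcessBytes. Started first so draining can begin promptly.
        var consumer = Task.Run(async () =>
        {
            while (await channel.Reader.WaitToReadAsync(token))
            {
                // FileShare.Read is an assumption; the original uses a configurable FileShareMode.
                using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read))
                {
                    while (channel.Reader.TryRead(out byte[] bytes))
                    {
                        token.ThrowIfCancellationRequested();
                        fs.Write(bytes, 0, bytes.Length);
                    }
                }
            }
        });

        // Producers: WriteAsync waits asynchronously whenever the bounded channel is full.
        var producerTasks = Enumerable.Range(0, producers).Select(p => Task.Run(async () =>
        {
            for (var i = 0; i < linesPerProducer; i++)
            {
                var bytes = Encoding.UTF8.GetBytes($"producer {p} line {i}\n");
                await channel.Writer.WriteAsync(bytes, token);
            }
        })).ToArray();

        await Task.WhenAll(producerTasks);
        channel.Writer.Complete(); // lets WaitToReadAsync return false once drained
        await consumer;            // plays the role of the Completion task
    }

    public static async Task Main()
    {
        var path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        await WriteAllAsync(path, producers: 4, linesPerProducer: 1000, capacity: 100);
        Console.WriteLine(File.ReadAllLines(path).Length);
        File.Delete(path);
    }
}
```

Because each byte array is written by the single consumer in one call, lines from different producers interleave but are never split. The FileStream is only held open while there is data to drain, so the file is released between bursts.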
Testing & Performance
Notice below that the higher the bounded capacity, the faster the end result. Obviously, mileage may vary depending on file system performance. The proportion of synchronous versus asynchronous queuing operations will also affect how these results compare to the benchmarks.
Synchronized file stream benchmark
Simply uses a single FileStream and acquires a lock before writing.
Total Time: 2.8247813 seconds
Total Bytes: 114,888,890
Total Blocking Time: 00:00:11.1608481
------------------------
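The synchronized benchmark is not shown in this post, but the idea described above (one shared FileStream guarded by a lock) can be sketched roughly as follows. The class names, the FileShare.Read setting, and the Parallel.For driver are assumptions for illustration, not the actual benchmark code.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Hypothetical wrapper: a single FileStream is not thread-safe,
// so every write takes the same lock.
public sealed class LockedFileWriter : IDisposable
{
    private readonly object _sync = new object();
    private readonly FileStream _stream;

    public LockedFileWriter(string path)
    {
        _stream = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read);
    }

    public void Write(byte[] bytes)
    {
        lock (_sync) // callers block here while another thread is writing
        {
            _stream.Write(bytes, 0, bytes.Length);
        }
    }

    public void Dispose()
    {
        lock (_sync)
        {
            _stream.Dispose();
        }
    }
}

public static class LockedWriterDemo
{
    // Writes from several threads at once and returns the number of lines in the file.
    public static int Run(string path, int writers, int linesPerWriter)
    {
        using (var writer = new LockedFileWriter(path))
        {
            Parallel.For(0, writers, w =>
            {
                for (var i = 0; i < linesPerWriter; i++)
                    writer.Write(Encoding.UTF8.GetBytes($"writer {w} line {i}\n"));
            });
        }
        return File.ReadAllLines(path).Length;
    }

    public static void Main()
    {
        var path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Console.WriteLine(Run(path, writers: 4, linesPerWriter: 1000));
        File.Delete(path);
    }
}
```

Every caller pays for the write while holding the lock, which is where the blocking time in this benchmark comes from. The channel-based approach moves that cost onto a single consumer instead.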
100,000 bounded capacity
For testing, anything over 100,000 entries seemed to have little or no effect, keeping in mind that my test environment has extremely fast file access (practically a RAM Drive).
Total Time: 1.956621 seconds
Total Bytes: 114,888,890
Total Blocking Time: 00:00:03.6122002
------------------------
10,000 bounded capacity
Total Time: 2.2669638 seconds
Total Bytes: 114,888,890
Total Blocking Time: 00:00:07.0796629
------------------------
1,000 bounded capacity
Total Time: 6.875023 seconds
Total Bytes: 114,888,890
Total Blocking Time: 00:00:58.8355981
------------------------
100 bounded capacity
Total Time: 30.9377557 seconds
Total Bytes: 114,888,890
Total Blocking Time: 00:12:17.8732101
------------------------
Multiple file stream benchmark
As shown here (and as expected), creating a new FileStream instance for each write is much slower than the alternatives.
Total Time: 63.881702 seconds
Total Bytes: 114,888,890
Total Blocking Time: 00:45:23.6739778
------------------------