When building a Kappa Architecture, replaying historic events is an important property of the system. EventHub and IoTHub support EventHub Capture, a feature that automatically archives all incoming messages to Azure Blob Storage or Azure Data Lake Store; this takes care of the archiving part.
To replay those messages back onto an EventHub (preferably a different one!) I created nathan-gs/eventhubs-reingest, a Spark-based application that reads the Avro capture files, sorts them, randomly repartitions the events, and writes them as fast as possible to EventHub.
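The random repartitioning step is what keeps the replay balanced: if events were written back in their original capture order, a few hot partitions would throttle the whole job. A minimal plain-Python sketch of that idea (not the actual Spark implementation; `assign_to_partitions` is a hypothetical name for illustration):

```python
import random
from collections import defaultdict

def assign_to_partitions(events, num_partitions, rng=random):
    """Spread events across target partitions by random assignment,
    so no single partition becomes a hotspot during replay."""
    partitions = defaultdict(list)
    for event in events:
        partitions[rng.randrange(num_partitions)].append(event)
    return partitions

# Replay 1,000 dummy events onto 12 partitions.
events = [{"id": i, "body": f"event-{i}"} for i in range(1000)]
partitions = assign_to_partitions(events, num_partitions=12)

# Every event lands in exactly one partition.
assert sum(len(p) for p in partitions.values()) == 1000
```

In Spark the same effect comes from repartitioning the dataset by a random key before the write, so each executor pushes a roughly equal share of events to each target partition.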
- a 3 worker node HDI cluster
- a target EventHub with:
  - 12 partitions
  - 12 throughput units
- 5.6 GB of capture files, with some small and some large events:
  - 1,592 blobs
  - 5,616,207,929 bytes
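For reference, those capture numbers work out to an average blob of roughly 3.4 MiB, so the job deals with many small-to-medium files rather than a few huge ones:

```python
# Quick sanity check on the capture figures above.
total_bytes = 5_616_207_929
num_blobs = 1_592

total_gb = total_bytes / 1e9                    # decimal gigabytes
avg_blob_mib = total_bytes / num_blobs / (1024 * 1024)

print(f"{total_gb:.2f} GB total")               # ~5.62 GB, i.e. the 5.6 GB above
print(f"{avg_blob_mib:.2f} MiB per blob on average")
```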
We manage to process the data in