When building a Kappa Architecture, the ability to replay historic events is an important property of the system. EventHub and IoTHub support EventHub Capture, a feature that automatically archives all incoming messages to Azure Blob Storage or Azure Data Lake Store; this takes care of the archiving part.

To replay those messages onto an EventHub (preferably a different one!), I created nathan-gs/eventhubs-reingest, a Spark-based application that reads the Avro messages, sorts them, repartitions them randomly, and writes them as fast as possible to EventHub.
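The core of the approach fits in a few lines of Spark. The sketch below is not the actual eventhubs-reingest code; it assumes Spark's built-in Avro reader and the azure-event-hubs-spark connector, and the storage path, connection string, and partition count are placeholders:

```scala
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("eventhubs-reingest-sketch").getOrCreate()

// EventHub Capture writes Avro files with a fixed schema:
// SequenceNumber, Offset, EnqueuedTimeUtc, SystemProperties, Properties, Body.
val events = spark.read
  .format("avro") // built into Spark 2.4+; older versions need the spark-avro package
  .load("wasbs://capture@<account>.blob.core.windows.net/<ns>/<hub>/*/*/*/*/*/*/*.avro")

// Sort by enqueue time to keep a rough global ordering, then round-robin
// repartition so every writer task pushes a similar share of the data.
val shuffled = events
  .orderBy("EnqueuedTimeUtc")
  .repartition(12) // placeholder: match the target EventHub's partition count

// Batch-write to the target EventHub; the connector expects a 'body' column.
val ehConf = EventHubsConf(
  ConnectionStringBuilder("<target-connection-string>")
    .setEventHubName("<target-eventhub>")
    .build)

shuffled
  .select(shuffled("Body").alias("body"))
  .write
  .format("eventhubs")
  .options(ehConf.toMap)
  .save()
```

Sorting first and then repartitioning round-robin keeps a rough time ordering per writer task while spreading the load evenly across them.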

Other sources

Not only EventHub Capture Avro messages are supported: any Spark SQL or Hive datasource works. Just specify a query as part of the configuration; take a look at the README for more information.
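For illustration (the table names here are hypothetical), anything Spark SQL can address works as a source; only the query changes:

```scala
// A registered Hive table as the source:
val fromHive = spark.sql("SELECT body FROM archive.raw_events")

// Or a file-backed datasource addressed directly in SQL:
val fromParquet = spark.sql("SELECT body FROM parquet.`/data/raw_events`")
```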

Performance

On a:

  • 3-worker-node HDI cluster
  • a target EventHub with:
    • 12 partitions
    • 12 throughput units
  • 5.6 GB of capture files, with a mix of small and large events:
    • 1,592 blobs
    • 5,616,207,929 bytes

We managed to process the data in 15 minutes.
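Back-of-the-envelope, that is 5,616,207,929 bytes / 900 s ≈ 6 MB/s sustained, roughly half of the 12 MB/s ingress ceiling that 12 throughput units provide (each unit allows 1 MB/s of ingress).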

Throughput in MB

[Figure: Incoming megabytes per minute]

Throughput in msg/s

[Figure: Incoming messages per minute]

Distribution of Spark tasks

[Figure: Spark task distribution]

Do notice that the task time is highly correlated with the input size.

[Figure: Spark task distribution, related to size]