I’ve used Camus, LinkedIn’s Kafka->HDFS pipeline. Unfortunately, the HDFS files it generates are too small in my case (roughly 20 KB to 4 MB). Such small files are a killer for the MapReduce jobs that run afterwards: processing took up to 5 hours per job.
The Camus repository contains a project called Camus Sweeper. Camus Sweeper is an M/R job that collects the hourly files and aggregates them into daily files. So instead of 24 files per day, you end up with one bigger file, which is much better for M/R jobs.
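To make the hourly-to-daily idea concrete, here is a minimal Python sketch of just the grouping step. It is not the Sweeper's actual code: the function name is made up for illustration, and it assumes the files sit under a `<topic>/hourly/YYYY/MM/DD/HH/` directory layout.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_hourly_files_by_day(paths):
    """Map (topic, 'YYYY/MM/DD') to the sorted list of hourly files for that day.

    Assumption: every path looks like .../<topic>/hourly/YYYY/MM/DD/HH/<file>.
    """
    groups = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        # Find the "hourly" marker; the topic is the directory right before it.
        idx = parts.index("hourly")
        topic = parts[idx - 1]
        year, month, day = parts[idx + 1:idx + 4]
        groups[(topic, f"{year}/{month}/{day}")].append(path)
    # Each group would become one daily output file.
    return {key: sorted(files) for key, files in groups.items()}
```

Each resulting group holds the (up to 24) hourly inputs that would be merged into a single daily file.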
But I had some issues getting the Sweeper to run. So I fixed the errors I found and added the ability to configure mappings from Kafka topics to Avro schemas (the schema classes therefore have to be on the classpath).
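As a rough illustration of what such a topic-to-schema mapping could look like, here is a small Python sketch that parses one. The `topic=SchemaClass` comma-separated format, the function name, and the class names are all hypothetical, chosen only for this example; they are not the actual configuration keys or format.

```python
def parse_topic_schema_mapping(value):
    """Parse 'topicA=com.example.SchemaA,topicB=com.example.SchemaB' into a dict.

    Hypothetical format for illustration only, not the real config syntax.
    """
    mapping = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and blank entries
        topic, sep, schema_class = entry.partition("=")
        topic, schema_class = topic.strip(), schema_class.strip()
        if not sep or not topic or not schema_class:
            raise ValueError(f"expected topic=SchemaClass, got {entry!r}")
        mapping[topic] = schema_class
    return mapping
```

With a mapping like this, the job can look up the schema class name for a topic and load it, which is why those classes need to be on the classpath.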
So here it is…
[GARD]