I agree with RaGe's suggestion to deduplicate on the consumer side. We use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in by the producer and is guaranteed to be unique. We use a 12-character random string (regexp: '^[A-Za-z0-9]{12}$').
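For illustration, here is a minimal sketch of how such an ID could be generated on the producer side. The `UniqId` helper class is hypothetical (not part of the original setup); it simply draws 12 characters from the alphanumeric alphabet using `SecureRandom`:

```java
import java.security.SecureRandom;

// Hypothetical helper (not from the original setup): generates a
// 12-character alphanumeric ID matching ^[A-Za-z0-9]{12}$.
public class UniqId {
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    public static String next() {
        StringBuilder sb = new StringBuilder(12);
        for (int i = 0; i < 12; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(next());
    }
}
```

With 62^12 possible values, collisions are negligible in practice for this kind of dedup window.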
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge the keys automatically. Sample code:
    Message msg = ...;  // e.g. ConsumerIterator.next().message().fromJson();
    Jedis jedis = ...;  // e.g. JedisPool.getResource();
    String key = "SPOUT:" + msg.uniqId;  // choose any prefix you like
    String val = Long.toString(System.currentTimeMillis());
    long rsps = jedis.setnx(key, val);   // 1 if the key was newly set, 0 if it already existed
    if (rsps <= 0) {
        log.warn("kafka dup: {}", msg.toJson());  // plus whatever duplicate handling you need
    } else {
        jedis.expire(key, 7200);  // 2 hours is enough for our production environment
        // Note: SETNX and EXPIRE are two separate calls; if the consumer dies
        // between them, the key never expires. Redis >= 2.6.12 can do both
        // atomically with SET key val NX EX 7200.
    }
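The check-and-set pattern behind SETNX ("first writer wins") can be illustrated without a Redis server using `ConcurrentHashMap.putIfAbsent`, which has the same semantics in-process. This sketch is only an analogy, not a replacement: unlike Redis, the map is not shared across consumer processes and has no TTL.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-process sketch of the same check-and-set idea: putIfAbsent returns
// null only for the first writer, mirroring SETNX returning 1.
public class DedupSketch {
    private final ConcurrentMap<String, Long> seen = new ConcurrentHashMap<>();

    // Returns true if uniqId was already recorded, i.e. the message is a duplicate.
    public boolean isDuplicate(String uniqId) {
        return seen.putIfAbsent("SPOUT:" + uniqId,
                                System.currentTimeMillis()) != null;
    }

    public static void main(String[] args) {
        DedupSketch d = new DedupSketch();
        System.out.println(d.isDuplicate("abcDEF123456"));  // false: first delivery
        System.out.println(d.isDuplicate("abcDEF123456"));  // true: duplicate
    }
}
```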
The above code did detect duplicate messages several times when Kafka (version 0.8.x) ran into trouble, and our input/output balance audit logs confirmed that no messages were lost or duplicated.