Why events at all

An event-driven architecture replaces synchronous calls with asynchronous facts. Instead of "service A calls service B and waits", you have "service A publishes an event, and any interested service consumes it independently." The benefits are well-known: looser coupling, better resilience to downstream failures, the ability to add new consumers without changing producers, and a natural audit trail of business activity.

The costs are also real. You give up immediate consistency. You take on the complexity of partial failures, retries, duplicates, and ordering. Debugging gets harder because the trace is implicit, not a call stack. Observability has to be built deliberately or you will be blind.

Kafka is excellent at the substrate: high throughput, durable storage, replay, and ordered delivery within partitions. But Kafka does not make any of the application-level decisions for you. The quality of your event-driven system depends almost entirely on those decisions.

Topic design is the foundation

Every event-driven failure we have helped untangle traces back to topic design choices made early and never revisited. The most common mistakes:

  • One topic per service. A "user-service" topic that carries every kind of user event mixes domains and ties every consumer to every change.
  • One topic per database table. Treating Kafka as a change-data-capture (CDC) bus without business semantics. Consumers end up depending on internal data shape, defeating the loose coupling promise.
  • One topic for everything. A single "events" topic with type fields. It looks simple, but it forces consumers to process events they should not care about, and makes per-event-type scaling and retention impossible to tune.

The healthier default is one topic per business event type at the domain level: "order-placed", "shipment-dispatched", "invoice-issued". Each topic carries a well-defined business fact with a clear owner. Consumers subscribe only to what they need. Retention, partitioning, and access control can be tuned per topic.

Schemas and compatibility

Events outlive the code that produced them. The shape of an event today must remain consumable next year, when a new team has joined, when an old service has been rewritten, and when an event from six months ago needs to be replayed. Schemas are the contract that makes this possible.

Use a schema registry. Avro and Protobuf are both production-grade choices. JSON Schema is acceptable but easier to misuse. The schema registry enforces compatibility rules: backward compatibility (new consumers can read old events), forward compatibility (old consumers can read new events), or full compatibility. Pick a rule and enforce it in CI.
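As a toy illustration of what backward compatibility means in practice, the check below treats a schema as a mapping from field name to whether that field has a default. Real registries (e.g. Confluent Schema Registry with Avro or Protobuf) enforce this far more rigorously; the sketch only shows the rule itself: new readers must still cope with old data.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Toy check: schemas are {field_name: has_default} dicts."""
    for field, has_default in new_schema.items():
        if field not in old_schema and not has_default:
            # Old events lack this field and the new reader has no
            # default to fall back on: old data becomes unreadable.
            return False
    # Fields removed in the new schema are fine for backward
    # compatibility: the new reader simply ignores that data.
    return True

old = {"order_id": False, "amount": False}
safe = {"order_id": False, "amount": False, "currency": True}    # added with a default
unsafe = {"order_id": False, "amount": False, "currency": False} # added without one

assert is_backward_compatible(old, safe)
assert not is_backward_compatible(old, unsafe)
```

Enforcing exactly this kind of check in CI, via the registry's compatibility API, is what turns the schema from documentation into a contract.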

Two practical disciplines pay for themselves quickly. First, never remove or rename fields. Always add new fields with sensible defaults. Field deprecation is a multi-stage process, not a single commit. Second, distinguish facts that travel together: do not split a logically atomic event into multiple events that consumers have to recombine. The cost of that split is paid by every consumer, forever.

Idempotency is not optional

Kafka guarantees at-least-once delivery by default. In practice, this means every consumer must assume that an event may be delivered more than once, and the application logic must handle it. Idempotency is the property that processing the same event twice produces the same result as processing it once.

Common patterns to achieve idempotency:

  • Deterministic keys. Use the event ID (or a deterministic hash of its content) as the natural key in the downstream database. Inserts become upserts. Duplicates become no-ops.
  • Outbox pattern on the producer. Producers write the event to a local outbox table in the same database transaction as the business change. A separate process publishes from the outbox, guaranteeing that the event is sent at least once and that only events corresponding to committed changes are ever published.
  • Processed-event log on the consumer. Maintain a record of event IDs already processed (for a reasonable window). Skip events whose IDs are already known.
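Combining the first and third patterns, a consumer-side sketch might look like the following. The event shape, the in-memory store, and the handler are illustrative stand-ins, not a real Kafka client API; in production the processed-ID log would be a database table with a retention window.

```python
class IdempotentConsumer:
    def __init__(self):
        self.processed_ids = set()  # in production: a table with a retention window
        self.balances = {}          # stand-in for the downstream database

    def handle(self, event) -> bool:
        # Processed-event log: skip IDs we have already seen.
        if event["event_id"] in self.processed_ids:
            return False
        # Deterministic key: the order ID is the natural key downstream,
        # so re-applying the same event writes the same row (an upsert).
        self.balances[event["order_id"]] = event["amount"]
        self.processed_ids.add(event["event_id"])
        return True

consumer = IdempotentConsumer()
event = {"event_id": "evt-1", "order_id": "ord-42", "amount": 99.0}
consumer.handle(event)
consumer.handle(event)  # duplicate delivery: a no-op
```

Note that the dedup check and the write should commit atomically in a real consumer, otherwise a crash between them reintroduces the duplicate.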

Exactly-once semantics are possible with Kafka Streams and transactional producers, but they are narrower than they sound: they apply to the Kafka path itself, not to side effects in external systems. For most business flows, designing for at-least-once with idempotency is more reliable than trying to achieve exactly-once end to end.
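The producer-side outbox described above can be sketched with sqlite3 to keep the example self-contained; the table and column names are illustrative, and the final query stands in for the relay process that actually publishes to Kafka.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE outbox (event_id INTEGER PRIMARY KEY AUTOINCREMENT,"
             " topic TEXT, payload TEXT)")

def place_order(order_id: str, amount: float) -> None:
    # The business change and the outbox row commit atomically: either
    # both exist or neither does, so no event for an uncommitted change.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                     ("order-placed", order_id))

place_order("ord-1", 49.0)

# A separate relay process polls the outbox and publishes to Kafka,
# deleting (or marking) rows only after the broker acknowledges.
pending = conn.execute("SELECT topic, payload FROM outbox").fetchall()
assert pending == [("order-placed", "ord-1")]
```

The relay may publish a row twice if it crashes after producing but before marking the row, which is exactly why the consumer-side idempotency patterns above are the other half of this design.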

Ordering, partitioning, and the keys you choose

Kafka guarantees ordering only within a partition. The partition is determined by the message key. The key you choose therefore determines what stays ordered relative to what. This is one of the most important design decisions in any Kafka-based system, and it is often made carelessly.

The rule of thumb is: choose a key such that all events that must be processed in order share that key. For order events, the order ID. For shipment events, the shipment ID. For user events, the user ID. Avoid random keys unless ordering truly does not matter, and avoid "broadcast" keys that send everything to one partition (you have just serialized your entire system).
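The property that matters can be shown in a few lines. Kafka's default producer hashes the key bytes (with murmur2) to choose a partition; this sketch substitutes MD5 purely to stay dependency-free, since the point is only that the same key always lands on the same partition.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition, as a stand-in for Kafka's
    default partitioner (which uses murmur2 on the key bytes)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one order share a key, so they share a partition,
# and Kafka preserves their relative order within it.
assert partition_for("order-1001", 12) == partition_for("order-1001", 12)
```

It also shows why a "broadcast" key is dangerous: a constant key maps every event to one partition, and why repartitioning reshuffles keys, since changing num_partitions changes the result of the modulo.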

Partition count is also a long-term decision. Increasing partitions later breaks ordering for keys that get reshuffled. It is easier to start with more partitions than you currently need (within reason) than to repartition under load.

Dead-letter queues and retry strategy

Some events cannot be processed: a bug in the consumer, a corrupted message, a downstream system rejecting them. Naively blocking the consumer on those messages will halt the entire partition forever. Naively skipping them silently loses data.

The standard pattern is a tiered retry strategy. The consumer attempts the message a few times in place with backoff. If still failing, the message is moved to a retry topic with a delay. If still failing after several retry tiers, it ends up in a dead-letter queue (DLQ) for manual inspection. The DLQ is monitored, and someone is responsible for triaging it.
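A simplified version of that tiered flow is sketched below. The retry-topic names and tier counts are assumptions for the example, and the backoff between in-place attempts is elided; in production, the "move" steps are produces to real retry and DLQ topics.

```python
IN_PLACE_ATTEMPTS = 3
RETRY_TOPICS = ["orders-retry-1m", "orders-retry-10m"]  # illustrative names
DLQ_TOPIC = "orders-dlq"

def consume(message, process, tier=0) -> str:
    """Try the handler a few times in place; on repeated failure,
    escalate to the next retry topic, and finally to the DLQ."""
    for _ in range(IN_PLACE_ATTEMPTS):
        try:
            process(message)
            return "ok"
        except Exception:
            pass  # in production: back off between attempts and log the error
    if tier < len(RETRY_TOPICS):
        return RETRY_TOPICS[tier]  # republish with a delay; next consumer uses tier + 1
    return DLQ_TOPIC               # all tiers exhausted: park the message for triage

def always_fails(msg):
    raise RuntimeError("downstream rejected the message")

print(consume({"id": 1}, always_fails, tier=0))  # orders-retry-1m
print(consume({"id": 1}, always_fails, tier=2))  # orders-dlq
```

Crucially, escalating to a retry topic unblocks the partition: the consumer commits the offset and moves on, while the failing message waits its turn elsewhere.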

Two anti-patterns to avoid: a DLQ that nobody watches becomes a graveyard, and aggressive retries against a downstream system that is already struggling will turn a small failure into a cascading outage.

Operational pitfalls that bite at scale

Most event-driven systems work fine in staging and fall apart in production. The reasons are almost always operational, not algorithmic.

Unbounded retention

Topics with infinite retention quietly grow until disk is full. Set retention policies deliberately, and use log compaction where appropriate.

Consumer lag without alerting

Lag is your earliest warning signal. Without alerts, a stuck consumer can lose hours of business activity before anyone notices.
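Lag itself is just arithmetic on offsets, per partition; in practice, the numbers come from the consumer-group metadata or the admin API. A minimal sketch with made-up offsets:

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per partition: how far the committed offset trails the log end."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

end_offsets = {0: 1500, 1: 980}   # latest offsets in the log
committed = {0: 1500, 1: 410}     # offsets the consumer group has committed

lag = consumer_lag(end_offsets, committed)
assert lag == {0: 0, 1: 570}      # partition 1 is falling behind
```

The useful alert is not "lag > 0" (that is normal under load) but lag that exceeds a threshold and keeps growing for several minutes.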

Schema drift in the dark

Producers ship new schemas without registry checks, and consumers blow up in production. Compatibility rules must be enforced by tooling, not by convention.

No tracing

Following a business flow across services without distributed tracing is impossible. Propagate correlation IDs from the start.
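A hypothetical sketch of what propagating a correlation ID looks like: the publish and handle functions below stand in for a real client, and the header name is an assumption, but the shape is the same regardless of tooling — generate an ID at the edge, then copy it forward on every hop.

```python
import uuid

def publish(topic: str, payload: dict, headers: dict = None) -> dict:
    """Attach headers to an outgoing message, minting a correlation ID
    only if the caller did not supply one."""
    headers = dict(headers or {})
    headers.setdefault("correlation_id", str(uuid.uuid4()))
    return {"topic": topic, "payload": payload, "headers": headers}

def handle_and_republish(incoming: dict, next_topic: str, payload: dict) -> dict:
    # Copy the inbound correlation ID forward, so the whole business
    # flow shares one ID across every service it touches.
    return publish(next_topic, payload,
                   headers={"correlation_id": incoming["headers"]["correlation_id"]})

first = publish("order-placed", {"order_id": "ord-1"})
second = handle_and_republish(first, "invoice-issued", {"invoice": "inv-9"})
assert first["headers"]["correlation_id"] == second["headers"]["correlation_id"]
```

With that ID in every log line and span, "where did order ord-1 stall?" becomes a single search instead of an archaeology project.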

Forgotten reprocessing scenarios

You will eventually need to replay events. Design for it from day one: idempotent consumers, stable schemas, deliberate offset management.

Treating Kafka as a database

Kafka is great storage for events. It is a poor query engine. Consumers should project events into databases or caches that are designed for reads.

Final takeaway

Event-driven architecture is an investment, not a free upgrade. Done well, it unlocks loose coupling and resilience that no synchronous design can match. Done casually, it produces a distributed system whose bugs are nearly impossible to reproduce. The discipline lives in topic design, schema management, idempotency, and operations — not in the broker itself.

Designing or operating an event-driven system?

If you are starting an event-driven journey or trying to stabilize one in production, we can help you set the architecture, the patterns, and the operational practices that keep it healthy.

Talk to Soutello IT about event-driven architecture