Saturday, July 3, 2021

Event-Driven (Notification) Architecture with AWS

Event-driven architecture has become very popular. We recently built a billing system that is based entirely on event-driven architecture. Here is what we have learned.

When should we use Event-Driven architecture?

Event-driven architecture brings extra complexity (more error scenarios) and cost (more infrastructure) compared with API calls. So DO NOT use it unless you want:
  • to reverse dependencies
  • to split loads, e.g. generating 10k invoices takes a few minutes but processing 10k payments takes hours, so we can use a queue to decouple the two processes
  • eventual consistency (this requires additional event storage on the producer side so that events can be replayed)
And make sure the producer does not care about the process results (success or failure) from consumers.


There are 2 patterns used in our systems.

Pattern 1: one producer → many consumers (SNS)

We create an AWS SNS topic that allows many consumers to subscribe:
  • Lambda is used if a processor (such as generating invoices) is required to handle the events
  • An SQS queue is introduced if a service needs to know about the events
  • Emails are sent if a third party system (such as alerting) needs to be integrated
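
As a rough illustration of the producer side of this pattern, the sketch below publishes one event to an SNS topic. It assumes a boto3-style SNS client is passed in. The envelope fields (`event_id`, `event_type`, `occurred_at`, `payload`) and the function name are hypothetical and not taken from our system.

```python
import json
import uuid
from datetime import datetime, timezone


def publish_event(sns_client, topic_arn, event_type, payload):
    """Publish one event to an SNS topic and return the envelope that was sent."""
    # Hypothetical envelope layout; the field names are assumptions.
    envelope = {
        "event_id": str(uuid.uuid4()),  # lets consumers de-duplicate retries
        "event_type": event_type,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    sns_client.publish(
        TopicArn=topic_arn,
        Message=json.dumps(envelope),
        # Message attributes allow each subscription to filter by event type.
        MessageAttributes={
            "event_type": {"DataType": "String", "StringValue": event_type},
        },
    )
    return envelope
```

With boto3 installed, this could be called as `publish_event(boto3.client("sns"), topic_arn, "agreement.created", {"agreement_id": "a-1"})`, where the topic ARN and event name are placeholders. The producer knows only the topic, not who subscribes to it, which is what reverses the dependency.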

Error Handling

Consumers handle errors based on their own situations. The following scenarios are only for the producer:

Mitigating errors in the event producer rather than upstream isn't worth the effort, because it is impossible to guarantee 100% that incoming events are saved successfully. So make the event producer stateless, if you can, to remove unnecessary complexity.
Additionally, the event producer should provide the capability to re-send events or query for historical events (it can rely on upstream to re-create the events).

Pattern 2: one producer → one consumer (SQS)

Push model: a Lambda gets triggered by every message

Pull model: the consumer pulls messages from the queue regularly

Note: the push model does not reverse the dependencies, since the implementation of AWS SQS and AWS Lambda is tightly coupled. However, the coupling can be removed by adding an SNS topic between the event producer and SQS; please refer to Pattern 1.
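
When an SNS topic sits in front of the queue, the Lambda receives each message wrapped in an SNS envelope (unless raw message delivery is enabled on the subscription). The sketch below shows one way a push-model consumer could unwrap it. It assumes the producer publishes JSON, and `extract_events`, `handler` and `process` are illustrative names only.

```python
import json


def extract_events(sqs_event):
    """Return the original producer payloads from an SQS-triggered Lambda event."""
    events = []
    for record in sqs_event.get("Records", []):
        body = json.loads(record["body"])
        # Without raw message delivery, SNS wraps the published text in a JSON
        # envelope and puts the original text in its "Message" field.
        if body.get("Type") == "Notification" and "Message" in body:
            body = json.loads(body["Message"])
        events.append(body)
    return events


def process(billing_event):
    # Placeholder for the real work, e.g. generating an invoice.
    print("processing", billing_event)


def handler(event, context):
    """Lambda entry point: process every event delivered in this batch."""
    for billing_event in extract_events(event):
        process(billing_event)
```

Because the handler accepts both wrapped and unwrapped bodies, the same consumer keeps working whether the queue is fed by SNS or directly by the producer.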

Error Handling:

All the scenarios in Pattern 1 apply.

Additionally, for the push model:
  • A DLQ (dead-letter queue) can easily be configured for the original SQS queue.
  • A simple Lambda can be introduced to copy messages between queues to replay failed messages.
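
A replay function of this kind might look like the following sketch. It assumes a boto3-style SQS client is passed in and that only the message body needs to be copied (message attributes are not carried over here). The function name and the `max_messages` limit are hypothetical.

```python
def redrive(sqs_client, dlq_url, target_url, max_messages=100):
    """Move up to max_messages from the DLQ back to the original queue."""
    moved = 0
    while moved < max_messages:
        response = sqs_client.receive_message(
            QueueUrl=dlq_url,
            # SQS returns at most 10 messages per receive call.
            MaxNumberOfMessages=min(10, max_messages - moved),
            WaitTimeSeconds=1,
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # the DLQ is empty, or nothing is visible right now
        for message in messages:
            sqs_client.send_message(QueueUrl=target_url, MessageBody=message["Body"])
            # Delete only after the copy succeeded, so a crash cannot lose the message.
            sqs_client.delete_message(
                QueueUrl=dlq_url, ReceiptHandle=message["ReceiptHandle"]
            )
            moved += 1
    return moved
```

If the function crashes between the send and the delete, the message is replayed twice, which is one more reason the consumers need to be idempotent.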


Best practices:

In an event-driven architecture, message delivery is hard to predict. To reduce the number of error-handling scenarios, we can follow these practices:
  • idempotency (retry-safe design): each event can be safely processed multiple times without additional side effects
  • event-order insensitivity: the system should not assume that events always arrive in order; it should handle them correctly regardless of order
  • fully automated monitoring is in place
  • the producer should not care about the processing results (success or failure) from consumers
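
The first two practices can be sketched as a small consumer that skips duplicate deliveries and ignores stale updates. This is only an illustration under assumptions: the `event_id` and `version` fields are hypothetical, and the in-memory set stands in for whatever durable store a real consumer would use.

```python
class InvoiceProjection:
    """Applies each event at most once and ignores stale updates."""

    def __init__(self):
        self.seen_event_ids = set()  # in production this would be a durable store
        self.invoices = {}  # invoice_id -> {"version": int, "status": str}

    def apply(self, event):
        """Return True if the event changed state, False if it was skipped."""
        if event["event_id"] in self.seen_event_ids:
            return False  # duplicate delivery: already handled
        self.seen_event_ids.add(event["event_id"])

        current = self.invoices.get(event["invoice_id"])
        if current is not None and current["version"] >= event["version"]:
            return False  # arrived late: a newer version is already applied
        self.invoices[event["invoice_id"]] = {
            "version": event["version"],
            "status": event["status"],
        }
        return True
```

For example, if a "paid" event with version 2 arrives before the "issued" event with version 1, the late "issued" event is skipped and the invoice stays "paid". Delivering either event again changes nothing.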

An Example

The diagram describes a billing system that listens to an agreement (purchase) event and generates recurring invoices every day.
a billing system

Summary

Event-driven architecture is very useful when we want to decouple parts of a system. However, it also adds complexity to the overall system design and requires good developers who can follow the best practices to run it well. So use it carefully and make sure the benefits justify the cost.
