Federico Bevione

Transactions in Microservices: Part 2 - SAGA Pattern with Choreography

In the first article of this series, we introduced the SAGA pattern and demonstrated how a minimal Orchestration can manage distributed transactions with a central orchestrator.

Let’s get real! This time, we’ll dive into the Choreography approach, where services coordinate workflows by autonomously emitting and consuming events.

To make this practical, we’ll implement a multi-service healthcare workflow using Go and RabbitMQ. Each service will have its own main.go, making it easy to scale, test, and run independently.

What is SAGA Choreography?

Choreography relies on decentralized communication. Each service listens for events and triggers subsequent steps by emitting new events. There’s no central orchestrator; the flow emerges from the interactions of individual services.

Key Benefits:

  • Decoupled Services: Each service operates independently.
  • Scalability: Event-driven systems handle high loads efficiently.
  • Flexibility: Adding new services doesn’t require changing the workflow logic.

Challenges:

  • Debugging Complexity: Tracking events across multiple services can be tricky. (I'll write an article dedicated to this topic, stay tuned!)
  • Infrastructure Setup: Services require a robust message broker (e.g., RabbitMQ) to connect all the dots.
  • Event Storms: Poorly designed workflows can overwhelm the system with events.

Practical Example: Healthcare Workflow

Let’s revisit our healthcare workflow from the first article:

  1. Patient Service: Verifies patient details and insurance coverage.
  2. Scheduler Service: Schedules the procedure.
  3. Inventory Service: Reserves medical supplies.
  4. Billing Service: Processes billing.

Each service will:

  • Listen for specific events using RabbitMQ.
  • Emit new events to trigger subsequent steps.

Setting Up RabbitMQ with Docker

We’ll use RabbitMQ as the event queue. Run it locally using Docker:

docker run --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:4.0.5-management

Access the RabbitMQ management interface at http://localhost:15672 (username: guest, password: guest).

Exchanges, Queues, and Bindings Setup

We need to configure RabbitMQ to accommodate our events. Here’s an example init.go file for setting up the RabbitMQ infrastructure:

package main

import (
    "log"

    "github.com/rabbitmq/amqp091-go"
)

func main() {
    conn, err := amqp091.Dial("amqp://guest:guest@localhost:5672/")
    if err != nil {
        log.Fatalf("Failed to connect to RabbitMQ: %v", err)
    }
    defer conn.Close()

    ch, err := conn.Channel()
    if err != nil {
        log.Fatalf("Failed to open a channel: %v", err)
    }
    defer ch.Close()

    err = ch.ExchangeDeclare("events", "direct", true, false, false, false, nil)
    if err != nil {
        log.Fatalf("Failed to declare an exchange: %v", err)
    }

    // Declare and bind one queue per event; the queue name and routing key
    // both match the event name.
    events := []string{
        "PatientVerified", "ProcedureScheduled", "SuppliesReserved",
        "BillingSuccessful", "BillingFailed", "ReservedSuppliesReleased",
        "ProcedureScheduleCancelled",
    }
    for _, event := range events {
        _, err = ch.QueueDeclare(event, true, false, false, false, nil)
        if err != nil {
            log.Fatalf("Failed to declare queue %q: %v", event, err)
        }

        err = ch.QueueBind(event, event, "events", false, nil)
        if err != nil {
            log.Fatalf("Failed to bind queue %q: %v", event, err)
        }
    }
}

Full code here!

Note: In a production setting, you might want to manage this setup using a GitOps approach (e.g., with Terraform) or let each service handle its own queues dynamically.

Implementation: Service Files

Each service will have its own main.go. We’ll also include compensation actions for handling failures gracefully.

1. Patient Service

This service verifies patient details and emits a PatientVerified event. It also compensates by notifying the patient if a downstream failure occurs.

// patient/main.go
package main

import (
    "fmt"
    "log"

    "github.com/rabbitmq/amqp091-go"
    "github.com/thegoodapi/saga_tutorial/choreography/common"
)

func main() {
    conn, err := amqp091.Dial("amqp://guest:guest@localhost:5672/")
    if err != nil {
        log.Fatalf("Failed to connect to RabbitMQ: %v", err)
    }
    defer conn.Close()

    ch, err := conn.Channel()
    if err != nil {
        log.Fatalf("Failed to open a channel: %v", err)
    }
    defer ch.Close()

    go func() {
        fmt.Println("[PatientService] Waiting for events...")
        msgs, err := common.ConsumeEvent(ch, "ProcedureScheduleCancelled")
        if err != nil {
            log.Fatalf("Failed to consume event: %v", err)
        }

        for range msgs {
            fmt.Println("[PatientService] Processing event: ProcedureScheduleCancelled")
            if err := notifyProcedureScheduleCancellation(); err != nil {
                log.Fatalf("Failed to notify patient: %v", err)
            }
        }
    }()

    common.PublishEvent(ch, "events", "PatientVerified", "Patient details verified")
    fmt.Println("[PatientService] Event published: PatientVerified")

    select {}
}

func notifyProcedureScheduleCancellation() error {
    fmt.Println("Compensation: Notify patient of procedure cancellation.")
    return nil
}

2. Scheduler Service

This service listens for PatientVerified and emits ProcedureScheduled. It compensates by canceling the procedure if a downstream failure occurs.

// scheduler/main.go
package main

import (
    "fmt"
    "log"

    "github.com/rabbitmq/amqp091-go"
    "github.com/thegoodapi/saga_tutorial/choreography/common"
)

func main() {
    conn, err := amqp091.Dial("amqp://guest:guest@localhost:5672/")
    if err != nil {
        log.Fatalf("Failed to connect to RabbitMQ: %v", err)
    }
    defer conn.Close()

    ch, err := conn.Channel()
    if err != nil {
        log.Fatalf("Failed to open a channel: %v", err)
    }
    defer ch.Close()

    go func() {
        fmt.Println("[SchedulerService] Waiting for events...")
        msgs, err := common.ConsumeEvent(ch, "PatientVerified")
        if err != nil {
            log.Fatalf("Failed to consume event: %v", err)
        }

        for range msgs {
            fmt.Println("[SchedulerService] Processing event: PatientVerified")
            if err := scheduleProcedure(); err != nil {
                common.PublishEvent(ch, "events", "ProcedureScheduleFailed", "Failed to schedule procedure")
                fmt.Println("[SchedulerService] Compensation triggered: ProcedureScheduleFailed")
            } else {
                common.PublishEvent(ch, "events", "ProcedureScheduled", "Procedure scheduled successfully")
                fmt.Println("[SchedulerService] Event published: ProcedureScheduled")
            }
        }
    }()

    select {}
}

func scheduleProcedure() error {
    fmt.Println("Step 2: Scheduling procedure...")
    return nil // or simulate a failure
}

Additional Services

Include Inventory Service and Billing Service implementations, following the same structure as above. Each service listens for the previous event and emits the next one, ensuring compensation logic is in place for failures.

Full code here!


Running the Workflow

Start RabbitMQ:

   docker run --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:4.0.5-management

Run Each Service:
Open separate terminals and run:

   # one-time script to set up RabbitMQ
   go run choreography/init/main.go
   # services
   go run choreography/billing/main.go
   go run choreography/inventory/main.go
   go run choreography/scheduler/main.go
   go run choreography/patient/main.go

Observe Output:
Each service processes events in sequence, logging the workflow progress.

What happened?

Let's break it down!

First of all, for the purposes of this article, we are not implementing SuppliesReserveFailed and ProcedureScheduleFailed, to avoid useless complexity.

We are implementing the following events.

Steps (or transactions):

  • T1: (init): PatientVerified
  • T2: ProcedureScheduled
  • T3: SuppliesReserved
  • T4: BillingSuccessful

Compensations:

  • C4: BillingFailed
  • C3: ReservedSuppliesReleased
  • C2: ProcedureScheduleCancelled
  • C1: NotifyFailureToUser (not implemented)

We follow this implementation diagram:

[Diagram: high-level implementation flow]

This diagram represents a common approach to documenting choreography. However, I find it somewhat difficult to understand and a bit frustrating, particularly for those who are not familiar with the implementation or the pattern.

Let's break it down!

[Diagram: detailed implementation flow]

The diagram above is far more verbose, but it breaks down each step, making it easier to understand what's going on.

In a nutshell:

  1. Patient service verifies patient details successfully
  2. Patient service emits PatientVerified
  3. Scheduler service consumes PatientVerified
  4. Scheduler service schedules the appointment successfully
  5. Scheduler service emits ProcedureScheduled
  6. Inventory service consumes ProcedureScheduled
  7. Inventory service reserves the supplies successfully
  8. Inventory service emits SuppliesReserved
  9. Billing service consumes SuppliesReserved
  10. Billing service fails to charge the customer and starts the compensation
  11. Billing service emits BillingFailed
  12. Inventory service consumes BillingFailed
  13. Inventory service releases the supplies reserved in step 7
  14. Inventory service emits ReservedSuppliesReleased
  15. Scheduler service consumes ReservedSuppliesReleased
  16. Scheduler service deletes the appointment scheduled in step 4
  17. Scheduler service emits ProcedureScheduleCancelled
  18. Patient service consumes ProcedureScheduleCancelled
  19. Patient service notifies the customer of the error

Note that we are not implementing failures for steps 1, 4, and 7 for the sake of brevity; however, the approach would be the same. Each of these failures would trigger a rollback of the preceding steps.


Observability

Observability is essential for debugging and monitoring distributed systems. Implementing logs, metrics, and traces ensures that developers can understand system behavior and diagnose issues efficiently.

Logging

  • Use structured logging (e.g., JSON format) to capture events and metadata.
  • Include correlation IDs in logs to trace workflows across services.

Metrics

  • Monitor queue sizes and event processing times.
  • Use tools like Prometheus to collect and visualize metrics.

Tracing

  • Implement distributed tracing (e.g., with OpenTelemetry) to track events across services.
  • Annotate spans with relevant data (e.g., event names, timestamps) for better insights.

We'll dive into observability in choreography later in this series, stay tuned!


Key Takeaways

  • Decentralized Control: Choreography enables autonomous collaboration.
  • Event-Driven Simplicity: RabbitMQ simplifies message exchange.
  • Scalable Architecture: Adding new services is seamless.
  • Choreography can be very overwhelming at first, but as always: practice makes you better!

Stay tuned for the next article, where we’ll explore Orchestration!

Check out the full repository for this series here. Let’s discuss in the comments!
