
Monolith to Microservices: What Actually Happened


Nobody tells you how much the transition from monolith to microservices actually costs — in time, in team stress, in failure modes you never anticipated. They tell you about Netflix. They tell you about Amazon. They don't tell you about the three weeks your on-call rotation spent debugging a cascade failure that started in one service, propagated through three others, and surfaced as a completely unrelated error in a fourth.

I've been through this migration on a real .NET backend system. Not a toy project, not a proof of concept — a production system with paying users, a deployment cadence that was already causing pain, and a team that genuinely needed relief.

This post is my honest account: why we decided to migrate, what prerequisites we had to build first, how the extraction actually worked step by step, what went wrong, and what I would do differently if I started today.


Why We Had to Leave the Monolith

Let me be specific about the pain, because "the monolith was getting hard to maintain" is not a real reason to migrate. You need a real reason — one that shows up in sprint planning, in your incident postmortems, and in your engineers' daily experience.

Ours showed up in three places.

Deployment coupling was killing our velocity

We had four product teams working in the same codebase. When Team A shipped a feature, it required all four teams to coordinate a deployment window, run a full regression suite, and be on standby for rollback. Our deployment cycle had slowed to once every two weeks — not because the features weren't ready, but because the coordination overhead was overwhelming.

Atlassian faced this exact pattern before their Vertigo project — code conflicts between teams became frequent and changes to one area started introducing bugs in unrelated features. It took them two years to fully migrate, but the outcome was autonomous teams and a significantly healthier DevOps culture.

One module was consuming disproportionate resources

Our reporting module — complex aggregation queries running across large datasets — was competing for database connections with our real-time transactional APIs. We couldn't scale them independently. Scaling the whole application to relieve reporting pressure meant overprovisioning the API layer we didn't need to scale. The cost was real, and the response time for transactional operations was suffering.

Our domain knowledge had finally stabilized

This is the one I want to emphasize most, because it's the prerequisite that gets skipped in every migration checklist: we finally understood our domain well enough to draw stable service boundaries.

Two years earlier, we would have cut the wrong services. By the time we started, we'd had enough production experience to know exactly where the natural seams were — where one domain could change independently of another, where data ownership was clear, and where team responsibility mapped cleanly to functionality.

As the InfoQ analysis of hard-earned migration lessons confirms, a distributed architecture adds enormous complexity — and that complexity is only worth paying if you understand the system well enough to cut it at the right places.


What We Had to Build Before Writing a Single Line of Migration Code

This is the part most migration blog posts skip entirely. Before we extracted a single service, we spent two months building the infrastructure that would make the migration survivable.

Distributed tracing — non-negotiable

In a monolith, a single request produces a single log stream. You can follow it in sequence. In a microservices environment, one user request can touch five services, and each service writes to its own log. Without distributed tracing, debugging a production failure means manually correlating timestamps across five separate log streams. That's not debugging — that's archaeology.

We instrumented OpenTelemetry across all services before any went to production:

// Program.cs — applied to every service
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri(builder.Configuration["Otel:Endpoint"]!);
        }));

Every HTTP call between services automatically propagated the traceparent header, which gave us a single trace ID to filter on across all services when an incident occurred.
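
Correlating application logs with those traces comes almost for free once the context is flowing. Below is a minimal sketch of pulling the current trace ID from Activity.Current so it can be attached to structured log entries; the service class and log message are illustrative, not code from our system:

// Activity.Current carries the W3C trace context inside a request
using System.Diagnostics;
using Microsoft.Extensions.Logging;

public class NotificationSender
{
    private readonly ILogger<NotificationSender> _logger;

    public NotificationSender(ILogger<NotificationSender> logger) => _logger = logger;

    public void LogSendFailure(Guid userId, Exception ex)
    {
        // The same trace ID the OTLP exporter ships to the tracing backend,
        // so log searches line up with the distributed trace
        var traceId = Activity.Current?.TraceId.ToString();

        _logger.LogError(ex, "Notification send failed for {UserId} (trace {TraceId})", userId, traceId);
    }
}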

Independent CI/CD pipelines per service

Before we had microservices, we built the deployment pipelines. Each extracted service needed its own pipeline: build, test, containerize, deploy to staging, deploy to production. We used GitHub Actions with service-specific workflow files. Getting this right took time, but it meant that when we extracted a service, it was immediately deployable independently — which is the whole point.

Health endpoints and centralized alerting

Every service needed a /health and /health/ready endpoint before it went live. We configured alerts on P95 latency per service, error rate thresholds, and CPU/memory watermarks. The alerting needed to be in place before the first service was extracted — not retrofitted after the first incident.
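
In ASP.NET Core this is mostly built-in plumbing. A minimal sketch of the liveness and readiness endpoints follows; the "ready" tag convention and the placeholder check are illustrative, and real checks would probe the database, message broker, and other dependencies:

// Program.cs: liveness at /health, readiness at /health/ready
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    // Checks tagged "ready" gate the readiness endpoint; a real check would probe a dependency
    .AddCheck("placeholder", () => HealthCheckResult.Healthy(), tags: new[] { "ready" });

var app = builder.Build();

// Liveness: is the process up at all?
app.MapHealthChecks("/health");

// Readiness: can the service actually serve traffic right now?
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

app.Run();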


The Extraction: How We Actually Did It

We used the strangler fig pattern — extracting services one at a time while the monolith remained live and continued handling production traffic. Microservices.io describes this incremental extraction approach as the only reliable path: you implement the new service, run it alongside the monolith, and gradually route traffic from the monolith to the new service until the monolith no longer handles that domain.

Step 1: Pick the least risky service first

We chose the Notifications domain as our first extraction. It was:

  • Read-heavy with simple, stable business logic
  • Already partially isolated in its own folder in the monolith
  • Non-critical in terms of data consistency — if a notification was slightly delayed, no user would notice

This made it low-risk for our first attempt, even if we got the extraction wrong.

Step 2: Define the service boundary with an API contract

Before writing the new service, we documented its contract:

// Notifications service — public interface
[ApiController]
[Route("api/notifications")]
public class NotificationsController : ControllerBase
{
    [HttpPost("send")]
    public async Task<IActionResult> SendNotification(
        [FromBody] SendNotificationRequest request,
        CancellationToken cancellationToken)
    { ... }

    [HttpGet("{userId}")]
    public async Task<IActionResult> GetUserNotifications(
        Guid userId,
        CancellationToken cancellationToken)
    { ... }
}

The monolith would call this via HTTP. Other services would never reach into the notifications database directly.
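
On the monolith side, that HTTP call went through a typed client so the contract lived in one place rather than in scattered HttpClient calls. Here's a sketch of what that can look like; the interface name, DTO shape, and configuration key are illustrative:

// Monolith side: a typed client wraps the HTTP contract of the extracted service
using System.Net.Http.Json;

public record SendNotificationRequest(Guid UserId, string TemplateId, string Body);

public interface INotificationsClient
{
    Task SendAsync(SendNotificationRequest request, CancellationToken cancellationToken);
}

public class NotificationsClient : INotificationsClient
{
    private readonly HttpClient _http;

    public NotificationsClient(HttpClient http) => _http = http;

    public async Task SendAsync(SendNotificationRequest request, CancellationToken cancellationToken)
    {
        var response = await _http.PostAsJsonAsync("api/notifications/send", request, cancellationToken);
        response.EnsureSuccessStatusCode();
    }
}

// Program.cs: register the typed client; the base address comes from configuration, never hard-coded
builder.Services.AddHttpClient<INotificationsClient, NotificationsClient>(client =>
{
    client.BaseAddress = new Uri(builder.Configuration["Services:Notifications"]!);
});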

Step 3: Introduce schema isolation before splitting the database

We did not split the database on day one. That was a deliberate decision.

Instead, we moved notification tables into their own schema within the existing shared database:

-- Move the notification tables into their own schema (SQL Server)
CREATE SCHEMA notifications;
GO
ALTER SCHEMA notifications TRANSFER dbo.UserNotifications;
ALTER SCHEMA notifications TRANSFER dbo.NotificationTemplates;

We then enforced a rule: only the Notifications service's DbContext was allowed to reference the notifications schema. No other service, no other DbContext, no direct SQL joins from outside. The boundary existed in code before it existed in infrastructure.

public class NotificationsDbContext : DbContext
{
    public NotificationsDbContext(DbContextOptions<NotificationsDbContext> options)
        : base(options) { }

    public DbSet<UserNotification> UserNotifications { get; set; }
    public DbSet<NotificationTemplate> Templates { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Every entity in this context maps into the isolated notifications schema
        modelBuilder.HasDefaultSchema("notifications");
    }
}
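
That rule can also be guarded mechanically rather than by convention alone. Below is a sketch of a helper (illustrative, not from our codebase) that a unit test can run against every DbContext other than NotificationsDbContext, failing the build if anything maps into the notifications schema:

// Guard: fail if a DbContext other than NotificationsDbContext maps into the notifications schema
using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public static class SchemaBoundaryGuard
{
    public static void AssertDoesNotTouchNotifications(DbContext context)
    {
        var offendingTables = context.Model.GetEntityTypes()
            .Where(entity => entity.GetSchema() == "notifications")
            .Select(entity => entity.GetTableName())
            .ToList();

        if (offendingTables.Count > 0)
            throw new InvalidOperationException(
                $"Schema boundary violation: {string.Join(", ", offendingTables)} mapped into 'notifications'");
    }
}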

This meant the eventual database split — moving to a separate physical database — was a deployment-day decision, not an architectural one. The code was already correct.

Step 4: Route traffic through an API gateway

We introduced an API gateway (YARP, a .NET-native reverse proxy) in front of both the monolith and the new Notifications service. Initially, all traffic flowed to the monolith unchanged. We then incrementally shifted notification-related routes to the new service:

{
  "Routes": [
    {
      "RouteId": "notifications",
      "ClusterId": "notifications-service",
      "Match": { "Path": "/api/notifications/{**catch-all}" }
    },
    {
      "RouteId": "monolith",
      "ClusterId": "monolith",
      "Match": { "Path": "{**catch-all}" }
    }
  ]
}

The monolith continued running unchanged. Users and other services saw no difference. We ran both in parallel for two weeks, compared logs, verified behavior, and then decommissioned the notification code from the monolith.

We repeated this pattern — boundary definition, schema isolation, new service, gateway routing, parallel running, decommission — for each subsequent domain: Billing, Reporting, then User Management.


What Went Wrong

I'd be doing you a disservice if I only described what worked.

We built a distributed monolith first

Our second extraction — Billing — was done too fast. We were confident after Notifications, so we cut corners on boundary definition. The result: Billing called into the Orders service synchronously on every transaction, Orders sometimes called back into Billing for balance checks, and we had a circular synchronous dependency between two "independent" services.

This is precisely the distributed monolith anti-pattern — all the operational complexity of microservices, with none of the independence benefits. We had to go back and introduce an event-driven integration: Billing subscribed to OrderCompleted events instead of calling Orders directly. It cost us a full sprint to fix what proper boundary analysis would have prevented.
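
The consuming side of that integration can look roughly like the sketch below: Billing reacts to an OrderCompleted event instead of being called synchronously by Orders. MassTransit is named here only as an example consumer API, and the event shape and ledger abstraction are illustrative:

// Billing service: consume OrderCompleted instead of being called by Orders
using MassTransit;

public record OrderCompleted(Guid OrderId, Guid CustomerId, decimal Amount);

public class OrderCompletedConsumer : IConsumer<OrderCompleted>
{
    private readonly IBillingLedger _ledger; // illustrative billing abstraction

    public OrderCompletedConsumer(IBillingLedger ledger) => _ledger = ledger;

    public async Task Consume(ConsumeContext<OrderCompleted> context)
    {
        // Billing updates its own state asynchronously; Orders never blocks on this
        var order = context.Message;
        await _ledger.RecordChargeAsync(order.OrderId, order.CustomerId, order.Amount);
    }
}

public interface IBillingLedger
{
    Task RecordChargeAsync(Guid orderId, Guid customerId, decimal amount);
}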

Cross-service debugging took 5x longer than expected

Even with distributed tracing in place, diagnosing failures across services was significantly harder than in the monolith. The monolith's single log stream, its ability to use a debugger across the entire call stack, and the simplicity of reproducing issues locally — all of that disappeared.

We partially addressed this with local development tooling that spun up all services in Docker Compose, but it required ongoing maintenance as services evolved. The debugging gap is real and permanent — it's not something you "solve." You adapt to it.

Network failures that the monolith never had

Two classes of failures emerged that simply didn't exist in the monolith:

Timeout cascades. A slow response from Billing caused the Orders service to hold connections open, which eventually exhausted its thread pool, which caused timeouts for incoming API requests. We hadn't implemented circuit breakers. After this incident, we added Polly's circuit breaker pattern to every inter-service HTTP client:

// Resilient HTTP client: retry, circuit breaker, and per-attempt timeout
// (requires the Microsoft.Extensions.Http.Resilience package; strategies run outermost-first)
builder.Services.AddHttpClient<IBillingClient, BillingClient>()
    .AddResilienceHandler("billing-pipeline", pipeline =>
    {
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 2,
            Delay = TimeSpan.FromMilliseconds(300)
        });
        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,
            SamplingDuration = TimeSpan.FromSeconds(30),
            MinimumThroughput = 10,
            BreakDuration = TimeSpan.FromSeconds(15)
        });
        pipeline.AddTimeout(TimeSpan.FromSeconds(5));
    });

Message ordering problems with async events. When we moved Billing to an event-driven model, we discovered that events could arrive out of order under load. An OrderCancelled event occasionally arrived before the corresponding OrderCreated event was processed. We had to make all event handlers idempotent and add ordering guarantees using sequence numbers on the event payload.
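
One way to express that combination is sketched below: a per-order sequence number that ignores duplicates and defers anything that arrives ahead of the expected next event. The store abstraction and result enum are illustrative, not our exact implementation:

// Idempotent, order-aware event handling keyed by a per-order sequence number
public record OrderEvent(Guid OrderId, long SequenceNumber, string EventType);

public enum HandleResult { Applied, DuplicateIgnored, DeferredOutOfOrder }

public class OrderEventHandler
{
    private readonly IProcessedEventStore _store; // illustrative persistence abstraction

    public OrderEventHandler(IProcessedEventStore store) => _store = store;

    public async Task<HandleResult> HandleAsync(OrderEvent evt, CancellationToken cancellationToken)
    {
        var lastApplied = await _store.GetLastSequenceAsync(evt.OrderId, cancellationToken) ?? 0;

        // Already processed: replaying it must be a no-op (idempotency)
        if (evt.SequenceNumber <= lastApplied)
            return HandleResult.DuplicateIgnored;

        // Gap in the sequence: an earlier event hasn't arrived yet, so ask the broker to redeliver later
        if (evt.SequenceNumber != lastApplied + 1)
            return HandleResult.DeferredOutOfOrder;

        await ApplyAsync(evt, cancellationToken);
        await _store.SaveLastSequenceAsync(evt.OrderId, evt.SequenceNumber, cancellationToken);
        return HandleResult.Applied;
    }

    // Domain-specific effect of the event (update billing state, etc.)
    private Task ApplyAsync(OrderEvent evt, CancellationToken cancellationToken) => Task.CompletedTask;
}

public interface IProcessedEventStore
{
    Task<long?> GetLastSequenceAsync(Guid orderId, CancellationToken cancellationToken);
    Task SaveLastSequenceAsync(Guid orderId, long sequence, CancellationToken cancellationToken);
}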


What I'd Do Differently

Three things stand out:

1. Spend more time on domain modeling before touching code. Every service boundary problem we had was traceable to a domain boundary that wasn't stable or well-understood when we started extracting. Domain-Driven Design's bounded context concept is the right tool here — not because it's academically correct, but because it forces a conversation about data ownership and team responsibility that reveals the boundary issues before they become code issues.

2. Instrument everything before migrating anything. Observability is not a feature you add later. If your distributed tracing, centralized logging, and service health dashboards aren't in place before you extract a single service, you're flying blind from day one.

3. Migrate the team before migrating the code. Microservices require a different engineering culture — service ownership, API contracts as a first-class concern, and a mindset of "my service must degrade gracefully when its dependencies fail." This takes months to develop. Starting the cultural migration 3–6 months before the technical one is not optional; it's leverage.


Key Takeaways

  • Migrate because of real, measured pain — deployment coupling, resource contention, team bottlenecks — not because microservices sound modern.
  • Stable domain knowledge is a prerequisite. If you can't confidently draw your service boundaries today, you'll draw them wrong and pay for it later.
  • Build observability before you extract the first service. Distributed tracing, centralized logging, and per-service health dashboards are not optional — they're the foundation.
  • Use the strangler fig pattern. Extract one service at a time, run it in parallel with the monolith, verify, then decommission. Never attempt a big-bang rewrite.
  • Isolate schemas before splitting databases. Schema isolation in code enforces domain boundaries months before the database is physically split — with zero infrastructure risk.
  • Circuit breakers and timeouts are mandatory, not optional. Every synchronous inter-service call needs a timeout, a retry policy, and a circuit breaker. Without them, your services will cascade-fail under load.
  • The debugging cost is permanent. Distributed systems are fundamentally harder to debug than monoliths. Invest in tooling and team training to absorb this cost — it doesn't go away.
  • A distributed monolith is the worst possible outcome. If your services are tightly coupled after extraction, you have all the operational complexity and none of the independence benefits. Enforce boundaries ruthlessly.

Conclusion

Monolith to microservices is not an upgrade — it's a trade. You trade deployment simplicity for independent scalability. You trade debugging ease for team autonomy. You trade a single point of failure for distributed fault isolation. Every one of those trades has a real cost on both sides.

Done well — with clear domain boundaries, mature DevOps, and a team that understands distributed systems — the migration absolutely delivers. Our deploys went from biweekly coordination nightmares to daily independent releases per team. Our reporting module no longer competed with transactional APIs for resources. Engineers stopped stepping on each other's code.

Done poorly, you get YAML sprawl, cascading timeouts, a cloud bill big enough to need its own FinOps team, and engineers who miss the monolith. I've seen both outcomes up close.

The difference isn't the architecture. It's the preparation.

If this resonated with your own experience — or if you're in the middle of a migration and want to compare notes — drop a comment below or reach out on steve-bang.com. And if you want to go deeper on the related patterns, there's a lot more to explore here.


FAQ

Q: How long does a monolith to microservices migration take? A: Most real migrations take 6–18 months of incremental work. Attempting a full rewrite in a single phase is extremely high risk. The strangler fig pattern — extracting one service at a time while the monolith keeps running — consistently produces better outcomes than big-bang rewrites.

Q: What is the strangler fig pattern and why is it used for monolith migration? A: The strangler fig pattern means gradually extracting functionality from a monolith into new services while the monolith stays live. Traffic is incrementally rerouted via an API gateway. It reduces risk by keeping the original system running until each new service is verified in production.

Q: What is a distributed monolith and how do you avoid it? A: A distributed monolith is when services are split into separate deployables but remain tightly coupled — sharing databases, calling each other in synchronous chains, or requiring coordinated deployments. Avoid it by defining clear domain boundaries, ensuring each service owns its own data, and designing for independent deployability from the start.

Q: How do you handle database splitting when migrating to microservices? A: Introduce schema separation in the shared database first — give each domain its own schema and enforce that no service queries another service's schema directly. Once that discipline is in place, splitting into separate physical databases becomes a deployment concern rather than an architectural one.

Q: What monitoring do you need before going live with microservices? A: At minimum: distributed tracing (to follow a request across services), centralized structured logging (to correlate logs by trace ID), health check endpoints on every service, and alerting on P95 latency per service. Without distributed tracing, debugging cross-service failures will consume a disproportionate amount of your on-call time.