Monolith vs. Microservices: How We Made the Decision
Our team's actual decision-making process for whether to break up a Rails monolith. Spoiler: we didn't go full microservices.

Forty-five minute deploys. That's what started the conversation. Our CI pipeline would run tests, build the artifact, push to production – forty-five minutes from merge to live. For a single service. Every time someone merged a PR.
About 45 developers across 6 teams, all committing to the same Rails monolith. Authentication, order processing, inventory management, notifications, and several other domains – all in one codebase. It started as a small app years ago and grew the way monoliths always grow: features added, teams expanded, and eventually what was a clean little application became a massive thing that nobody fully understood anymore.
Deploy time was the visible symptom. Not the only one. Teams stepping on each other's database migrations. A change to the Order model breaking a test in the Notification module that nobody on the Orders team knew existed. Code reviews dragging on because PRs touched files from multiple domains and nobody knew who should approve what.
Something needed to change. The question: what kind of change?
The Obvious Answer We Didn't Take
Microservices. What everyone on Hacker News suggests. Break the monolith into independent services. Each team owns their service, deploys independently, picks their own stack. Clean boundaries. Independent scaling. Total autonomy.
Sounds great. Works for some organizations. Netflix runs hundreds of microservices. Amazon does too. Both companies have thousands of engineers and dedicated platform teams whose entire job is maintaining the infrastructure that makes microservices viable.
We have 45 engineers and zero platform team members. That detail turned out to be pretty important, I think.
What Microservices Would Have Actually Meant
Before talking about what we decided, I want to lay out what adopting microservices would have required. Not in theory. In practice.
Every function call becomes a network call. In the monolith, when Orders needs to check inventory, it calls Inventory.check_stock(product_id). Microseconds. Never fails due to network issues. Separate services: that becomes an HTTP or gRPC call. Now you handle timeouts. What if the Inventory service is slow? Down? Retry? How many times? What backoff strategy?
Every inter-service call needs error handling that didn't exist before. Circuit breakers detecting when a downstream service is struggling. Retry logic with exponential backoff so you don't pile onto a struggling service. Timeout configuration – too short and you get false failures, too long and one slow service cascades latency everywhere upstream.
None of that code existed in our monolith. None would need to exist if services stayed in the same process. Pure overhead created by architectural boundaries.
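To make the overhead concrete, here is a minimal sketch of the retry-with-backoff-and-timeout wrapper every remote call would need. The helper name and parameters are illustrative, not code from our app:

```ruby
require "timeout"

# Wrap a remote call with a timeout, retries, and exponential backoff.
# In-process method calls need none of this; network calls need all of it.
def with_retries(max_attempts: 3, base_delay: 0.1, timeout: 2)
  attempts = 0
  begin
    attempts += 1
    Timeout.timeout(timeout) { yield }
  rescue Timeout::Error, IOError => e
    raise e if attempts >= max_attempts
    sleep(base_delay * (2**(attempts - 1))) # 0.1s, 0.2s, 0.4s, ...
    retry
  end
end

# Usage (inventory_client is hypothetical):
#   with_retries { inventory_client.check_stock(product_id) }
```

Even this toy version forces decisions the monolith never asked of us: how many attempts, how long a timeout, what counts as retryable.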
Database transactions become distributed problems. Our order creation flow:
```ruby
# If any step raises, the whole block rolls back atomically.
ActiveRecord::Base.transaction do
  order = Order.create!(params)
  InventoryItem.decrement!(product_id, quantity)
  Payment.charge!(order.total, customer.payment_method)
  Notification.schedule!(customer, :order_confirmation)
end
```
If anything fails, everything rolls back. Inventory doesn't decrement without payment. Payment doesn't charge without an order. ACID guarantees. Simple.
Separate services: no shared database, no cross-service transactions. You'd implement the Saga pattern – each service performs its local transaction and publishes an event. If a later step fails, compensating events reverse earlier steps.
Order service creates order → publishes OrderCreated → Inventory service decrements stock → publishes StockReserved → Payment charges → if payment fails, publishes PaymentFailed → Inventory restores stock → order gets marked failed.
That's the happy path of the failure path. What if the Payment service crashes after charging but before publishing? What if the message broker loses a message? What if compensation fails?
Not unsolvable. People solve these daily. But the complexity jump from "wrap it in a transaction" to "implement distributed sagas with compensating actions" is enormous. Failure modes are subtle – you might not discover them for months until a specific timing-and-failure combination produces an inconsistent state.
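The shape of that machinery can be sketched in plain Ruby. This is a deliberately stripped-down coordinator, not our code and not a production saga library – each step pairs an action with a compensation, and completed steps are compensated in reverse order when something fails:

```ruby
# Toy saga: run steps in order; on failure, undo completed steps in reverse.
class Saga
  Step = Struct.new(:action, :compensation)

  def initialize
    @steps = []
  end

  def step(action:, compensation:)
    @steps << Step.new(action, compensation)
    self
  end

  def run
    completed = []
    @steps.each do |s|
      s.action.call
      completed << s
    end
    :ok
  rescue StandardError => e
    # Compensate in reverse: undo stock before undoing the order.
    completed.reverse_each { |s| s.compensation.call }
    [:rolled_back, e.message]
  end
end
```

Notice everything it still doesn't handle: a crash between an action and its event, a lost message, a compensation that itself fails. Those are exactly the subtle failure modes described above.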
Debugging requires distributed tracing. Monolith: an error produces a stack trace. See exactly which line failed, what called it, what the parameters were. Add a breakpoint, reproduce, step through.
Microservices: a user-facing error might originate in any of the involved services. Error in the API gateway caused by a timeout from Auth, caused by slow queries in User service, caused by a missing database index. Following that chain requires distributed tracing – correlation IDs in every header, trace collection infrastructure (OpenTelemetry), visualization tools (Jaeger, Zipkin, Datadog). I covered the observability requirements for this kind of distributed setup in my post on cloud-native architecture.
Setting up that infrastructure is a full project. Without it, debugging cross-service issues is guesswork.
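Even the most basic piece of that puzzle, correlation IDs, means new plumbing in every service. A minimal sketch of the idea – the `X-Correlation-Id` header name is a common convention, not a standard, and the helpers here are illustrative:

```ruby
require "securerandom"

# Generate a correlation ID at the edge if the caller didn't send one,
# then pass the same headers on every downstream call.
def ensure_correlation_id(headers)
  headers["X-Correlation-Id"] ||= SecureRandom.uuid
  headers
end

# Every log line includes the ID so one request can be followed
# across service boundaries.
def log_with_correlation(headers, message)
  "[#{headers['X-Correlation-Id']}] #{message}"
end
```

In practice a tracing library does this for you, but someone still has to deploy and maintain the collectors and the storage behind it.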
The Modular Monolith Option
A middle ground that gets less attention. Instead of splitting into separate services, enforce boundaries within the monolith itself. Each domain (Orders, Inventory, Payments, Notifications) becomes a module with an explicit public API. Modules can't directly access each other's database tables – they go through the module's API.
Organizational boundaries without the operational complexity of distributed systems. Teams work independently on their modules. Module APIs serve as contracts. But you still have a single deployment, a single database, and transaction support across modules when needed.
We explored this using Rails engines. Each domain becomes a gem/engine with its own models, routes, and tests. The engine defines a public interface – service objects or query objects – and other engines interact only through that interface.
Not perfect. Enforcing boundaries requires discipline because Ruby doesn't have the same module access controls as, say, Java packages. You can technically reach across boundaries, and developers under deadline pressure sometimes do. We'd need custom linting or CI checks to catch violations.
But compared to microservices, the operational overhead is close to zero. Same deployment pipeline. Same database. Same debugging workflow. Complexity increase is primarily in code organization, not infrastructure.
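The boundary convention looks something like this in plain Ruby. The module names, the API/Internal split, and the stock data are all illustrative of the pattern, not our actual engine code:

```ruby
module Inventory
  # The only entry point other engines are allowed to call.
  module API
    def self.check_stock(product_id)
      Internal::StockQuery.new(product_id).available_quantity
    end
  end

  # By convention, nothing outside this engine touches Internal
  # (or the inventory_items table directly).
  module Internal
    STOCK = { 42 => 7 }.freeze # stand-in for the real table

    class StockQuery
      def initialize(product_id)
        @product_id = product_id
      end

      def available_quantity
        STOCK.fetch(@product_id, 0)
      end
    end
  end
end
```

Note that nothing in Ruby stops `Orders` from calling `Inventory::Internal::StockQuery` directly – which is exactly why the lint rules and CI checks mentioned above matter.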
What We Chose
Hybrid approach. Felt like a compromise at the time – some people wanted full microservices, others wanted to just fix the test suite and call it done.
The tightly coupled domains – Orders, Inventory, Payments – stay in the monolith. We're refactoring into a modular monolith with strict internal boundaries. These domains have strong transactional requirements and tight coupling that would make separation painful. The cost of distributed sagas between orders and payments outweighed deployment independence.
Notifications and Search were pulled out into separate services. Notifications are loosely coupled by nature – an order triggers a notification, but the notification doesn't need to share the transaction. A message queue handles the async communication naturally. Search has different scaling characteristics – read-heavy, benefits from specialized infrastructure (Elasticsearch) that doesn't fit neatly in the Rails monolith.
For the CI problem, we reorganized the test suite so each module's tests run independently. PR only touches the Notification engine? Only Notification tests run. Average CI time dropped from 45 minutes to about 12 for most PRs. Full suite still runs on merges to main, but the fast feedback loop on feature branches made a visible difference to how productive people felt.
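The test-selection logic is simple once modules have clear directories. A sketch of the idea – the `engines/` layout and suite names here are illustrative, and a real setup would shell out to the test runner instead of returning symbols:

```ruby
# Map changed files to the engine suites that need to run.
# Changes outside engines/ (shared config, Gemfile, etc.) trigger everything.
ENGINE_DIRS = %w[orders inventory payments notifications search].freeze

def suites_for(changed_files)
  engines = changed_files.filter_map do |path|
    parts = path.split("/")
    parts[1] if parts[0] == "engines" && ENGINE_DIRS.include?(parts[1])
  end.uniq
  touched_shared = changed_files.any? { |p| !p.start_with?("engines/") }
  touched_shared ? [:full_suite] : engines.map { |e| :"#{e}_suite" }
end
```

In CI, `changed_files` comes from something like `git diff --name-only origin/main`, and the fallback to the full suite is what keeps this safe.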
What I'd Do Differently
Spent too long evaluating the full microservices option. Architecture spikes, proofs of concept, cost analyses. A month of engineering time exploring an approach we mostly didn't take. If doing this again, I'd start with the modular monolith and see how far it got before even considering extraction.
More upfront investment in boundary enforcement too. Our module boundaries are currently conventions backed by code review. That's fragile. Automated checks – lint rules flagging cross-module database queries, CI gates catching boundary violations – would have been worth doing from the start. Adding them now, but violations have already crept in.
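A boundary check doesn't have to be sophisticated to be useful. Real tools (packwerk, custom RuboCop cops) do this properly; this toy version just shows the shape, with the `Internal` naming convention assumed from the engine pattern:

```ruby
# Flag any line in one engine's source that reaches into another
# engine's Internal namespace.
FORBIDDEN = /\b(Orders|Inventory|Payments|Notifications)::Internal\b/

def boundary_violations(engine_name, source)
  source.each_line.with_index(1).filter_map do |line, number|
    match = line.match(FORBIDDEN)
    next unless match && match[1] != engine_name
    "#{engine_name}:#{number} references #{match[1]}::Internal"
  end
end
```

Run over every engine's files in CI and fail the build on any hit; grandfathered violations can live in an allowlist that's only allowed to shrink.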
The two extracted services work fine. Notifications deploys independently about twice a day. Search runs in Elasticsearch without issues. If you're considering extraction, the decision about GraphQL vs REST for inter-service communication is worth thinking through early – it affects how tightly coupled your services end up being. Neither extraction was as scary as expected, partly because we picked the easiest candidates first.
Whether we extract more in the future – probably, eventually, as the team grows and remaining domains develop different scaling needs. But I've grown more skeptical of microservices as a default architecture than I was before this project. The operational complexity is real and it's ongoing. Not a one-time migration cost – a permanent increase in the surface area of things that can break.
The Testing Story
One thing that changed more than I expected after the extractions: how we test.
In the monolith, integration tests were straightforward. Set up the test database, seed some data, call the endpoint, assert on the response. Everything runs in one process. If the test passes, the feature works.
With the extracted Notifications service, integration testing got more complicated. The service depends on receiving events from the monolith. To test end-to-end, you either need the monolith running alongside it (heavy, slow, fragile) or you need to mock the event stream (faster, but you're testing against your assumptions about what the events look like, not reality).
We went with contract testing. The monolith publishes event schemas. The Notifications service tests against those schemas. If either side changes the event format without updating the contract, the relevant tests fail. It's not as satisfying as a true end-to-end test, but it catches the most common integration bugs – field renames, type changes, missing required fields – without requiring both services to be running simultaneously.
For the monolith's internal modules, testing got better after the refactoring. Each module has its own test suite that can run in isolation. Before the modular boundaries, tests were tangled – an Order test might depend on Notification helpers that depended on User fixtures. Untangling those dependencies was tedious but the payoff is meaningful. A developer working on Orders can run the Order tests in under a minute instead of waiting for the full 45-minute suite.
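The core of a contract check fits in a few lines. Our real setup validates JSON Schema documents; this sketch uses a plain Ruby hash of required fields and types to show the idea, with an invented OrderCreated contract:

```ruby
# Producer publishes the contract; consumer tests sample events against it.
ORDER_CREATED_CONTRACT = {
  "order_id"       => Integer,
  "customer_email" => String,
  "total_cents"    => Integer
}.freeze

def contract_errors(contract, event)
  contract.flat_map do |field, type|
    if !event.key?(field)
      ["missing required field: #{field}"]
    elsif !event[field].is_a?(type)
      ["#{field} should be #{type}, got #{event[field].class}"]
    else
      []
    end
  end
end
```

Both sides run the same check in their own test suites, so a field rename fails a build before it fails in production – as long as both sides actually run their tests before deploying.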
The Database Question Nobody Asks Early Enough
One decision that gets overlooked during the monolith-vs-microservices debate: what happens to the database? In a monolith, everything shares one database. Every module reads from and writes to the same tables. Joins are easy. Transactions are easy. Reporting queries that span multiple domains just work.
The moment you extract a service, you face a choice. Does the new service get its own database, or does it share the monolith's database? Sharing is simpler in the short term but creates tight coupling – the service can't be deployed or scaled independently if it depends on the monolith's database schema. A migration in the monolith can break the extracted service. Separate databases enforce true independence but make cross-domain queries painful or impossible without building data synchronization pipelines.
We gave Notifications its own database. Easy call – notification data is write-heavy, rarely queried across domains, and the schema is simple. Search uses Elasticsearch, so that was already separate by nature. But if we ever extract Orders or Inventory, the database split gets ugly fast. Order reporting currently joins across five tables spanning three modules. Splitting those into separate databases means either duplicating data or building a data warehouse that aggregates from multiple sources. Neither option is free.
My advice to teams starting fresh: even inside a monolith, keep your schemas modular. Don't write cross-module joins in application code. Use the module's API to fetch data from another domain, even if it's technically in the same database. That discipline makes future extraction possible without rewriting every query that touches multiple domains.
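"Fetch through the module's API instead of joining" looks like this in practice. A deliberately tiny sketch – the module names, methods, and data are made up to show the composition pattern, not real reporting code:

```ruby
# Instead of: SELECT ... FROM orders JOIN inventory_items ...
# compose the two modules' public APIs in application code.
module Orders
  def self.recent_order_product_ids
    [10, 11] # stand-in for a query scoped to the orders tables
  end
end

module Inventory
  def self.stock_for(product_ids)
    all = { 10 => 3, 11 => 0, 12 => 9 } # stand-in for inventory tables
    all.slice(*product_ids)
  end
end

def low_stock_report
  Inventory.stock_for(Orders.recent_order_product_ids)
           .select { |_id, qty| qty.zero? }
           .keys
end
```

Two queries instead of one join is a real cost, but it's the price that keeps the schemas separable later.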
What I Tell Other Teams
For most teams at our size – 30 to 60 engineers – the modular monolith is probably the right starting point. Not because microservices don't work. Because they carry ongoing operational costs that you need a certain scale and staffing level to justify. A platform team maintaining the service mesh, the tracing infrastructure, the deployment pipelines, the cross-service testing frameworks – if you don't have people dedicated to that, the engineers building product features are the ones maintaining it, and their velocity suffers.
Extract services when there's a specific, concrete reason. A domain with different scaling characteristics. A team that actually needs independent deployment cycles. A component that would benefit from different technology (like our Elasticsearch-backed Search). These are extraction triggers. "Microservices are the modern way" is not an extraction trigger.
Our two extractions were justified. Six probably would not have been.
Communication Patterns Between Services
For the two services we did extract, we had to choose how they'd communicate with the monolith. Two main patterns exist: synchronous (HTTP/gRPC calls) and asynchronous (message queues/event streams).
Notifications is fully async. The monolith publishes events to a RabbitMQ queue – OrderCreated, PasswordReset, AccountLocked. The Notifications service consumes those events and sends the appropriate email or push notification. If the Notifications service is down for five minutes, the events queue up and get processed when it comes back. Nobody misses a notification – they just arrive slightly delayed. This tolerance for delay is what makes async work so well for this use case.
Search is a mix. Writes are async – when a product is created or updated in the monolith, an event goes to the queue, and the Search service indexes it in Elasticsearch. Reads are synchronous – when a user performs a search, the frontend calls the Search service's API directly and waits for results. The sync/async split matches the tolerance for delay: a product appearing in search results 30 seconds after creation is fine. A user waiting 30 seconds for search results is not.
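The "down for five minutes, nothing lost" property is the whole point of the queue in the middle. An in-memory stand-in makes the behavior visible – the real version publishes to a durable RabbitMQ queue via a client gem, and the class here is purely illustrative:

```ruby
# Toy stand-in for a durable queue: the producer publishes whether or not
# the consumer is up; when the consumer returns, it drains the backlog
# in order.
class EventQueue
  def initialize
    @backlog = []
  end

  def publish(event)
    @backlog << event
  end

  def drain(&handler)
    handled = []
    handled << handler.call(@backlog.shift) until @backlog.empty?
    handled
  end
end
```

A broker gives you the same contract plus persistence across restarts, acknowledgements, and delivery retries – which is why the queue, not HTTP, carries the notification traffic.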
The async communication introduced a new category of debugging challenge: eventual consistency. A product gets updated in the monolith but the search results still show the old version because the indexing event hasn't been processed yet. Support tickets come in: "I changed my product description but it still shows the old one." The answer is "wait 30 seconds and refresh." Not a bug, but it feels like one to the user, and explaining eventual consistency to a non-technical person is probably harder than it should be.
We added a simple mechanism to help: after updating a product, the monolith returns the event timestamp in the API response. The frontend displays a "changes may take a moment to appear in search" message until enough time has passed. Band-aid solution. Works well enough. Beats the alternative of making search synchronous and coupling the two services tightly.
We also learned to be careful about event schema evolution. When the monolith changes the shape of an event – say, renaming a field from user_name to userName – the Notifications service breaks because it's expecting the old field name. Contract testing catches this, but only if both sides run their tests before deploying. If the monolith deploys first and the Notifications service hasn't been updated, events start failing in production. We added a schema version field to every event and built backward-compatible deserialization on the consumer side – if the version is old, map the old field names to the new ones. Extra boilerplate. Prevents outages during deploy windows where the services are temporarily out of sync.
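The consumer-side upgrade step is mostly a lookup table. A sketch using the user_name → userName rename from the example above; the version numbers and field names are illustrative:

```ruby
# Per-version field renames, applied before the event is handled.
# Version 1 events used user_name; version 2 renamed it to userName.
FIELD_RENAMES = { 1 => { "user_name" => "userName" } }.freeze

def upgrade_event(event)
  version = event.fetch("schema_version", 1)
  renames = FIELD_RENAMES.fetch(version, {})
  upgraded = event.to_h do |key, value|
    [renames.fetch(key, key), value]
  end
  upgraded.merge("schema_version" => 2)
end
```

The handler code then only ever sees the current shape, no matter which side deployed first.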
One mistake we made early: not setting up a dead letter queue. When the Notifications service encountered an event it couldn't process (malformed data, a notification type it didn't recognize), it would fail, retry, fail again, retry, and eventually the event would block all subsequent events in the queue. A dead letter queue catches failed events after a configured number of retries and moves them aside so the rest of the queue keeps flowing. Should have been in place from day one. Wasn't. Added it after the first time we noticed emails were delayed by an hour because one bad event was jamming the pipe.
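The retry-then-park flow is easy to state in code. RabbitMQ does this natively (a dead-letter exchange configured on the queue); this toy processor just shows the behavior that was missing on day one, with made-up event shapes:

```ruby
# Retry a failing event a bounded number of times, then move it aside
# so it can't block the events behind it.
class DeadLetterProcessor
  attr_reader :dead_letters, :processed

  def initialize(max_retries: 3, &handler)
    @max_retries = max_retries
    @handler = handler
    @dead_letters = []
    @processed = []
  end

  def handle(event)
    attempts = 0
    begin
      attempts += 1
      @processed << @handler.call(event)
    rescue StandardError
      retry if attempts < @max_retries
      @dead_letters << event # parked; the rest of the queue keeps flowing
    end
  end
end
```

The parked events still need a human: we page on a non-empty dead letter queue and replay events once the bug that rejected them is fixed.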
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.