Why we chose OpenFGA when starting a platform with 17 microservices

#openfga #authorization #rebac #microservices #multi-tenant #architecture

At work we are starting a multi-tenant platform made up of a large fleet of independent microservices, organized around a hierarchy of tenants, teams and projects. One of the first architecture decisions we had to make was how authorization would work. The default option, the one you get if you do not stop to think about it, is that each microservice checks its own permissions. We decided not to do that and adopted OpenFGA from day one. This post explains why, and which concrete problem that choice avoids.

I am assuming you already know OpenFGA and its ReBAC model. If not, this other post explains it from scratch with a model, tuples and the Check operation; you can come back here afterwards.

The problem we were about to have

The question "can this user do X on resource Y?" appears in every endpoint of every microservice. The intuitive answer, especially when you start with two or three services, is to put the rule inside the service that owns the endpoint. It works for a while, partly because it is you or a teammate writing both services and your head remembers to keep the two bits of logic in sync.

The problem appears when the catalogue grows. Someone adds the viewer role, then guest appears, later the tenant admin arrives with access to everything except billing, and suddenly you have five slightly different role tables in five different microservices. The chance that a sixth service silently allows what the other five block is as high as you suspect, and when that bug appears, it appears at the worst possible time.

The middle-ground solution (extracting a shared authorization library) does not solve the problem either; it moves it. You end up with two services on different versions of the library, role migrations halfway through, patches that are supposed to be for one service but in practice leak into three others because they live in the same monorepo. Knowing that upfront, and knowing we were already going to start with more than fifteen services, we preferred not to enter that cycle at all.

How OpenFGA fits into a gRPC microservices system

The usual pattern in a fleet of gRPC services is to concentrate the check at the edge. The HTTP request reaches the gateway with a JWT, the gateway extracts the user identity, calls OpenFGA with the triplet (subject, relation, resource), and only if the answer is positive forwards the call to the corresponding microservice. The microservice trusts the gateway because the link between the two travels over mTLS or through an internal network nobody else can reach, and it stops worrying about authorization because that decision has already been made.

The mental picture we wanted was this, not anything more complicated:

[Diagram: the gateway asks OpenFGA for a decision and forwards the request to the microservice only if it is allowed]
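The gateway step described above can be sketched roughly as follows. This is an illustration under assumptions, not our production code: it builds the body for OpenFGA's HTTP Check endpoint (`POST /stores/{store_id}/check`), while the route table, the store ID placeholder and the `send` transport function are hypothetical names. The JWT is assumed to be verified already, with `claims` standing in for its payload.

```python
from typing import Callable, Dict, Tuple

# Hypothetical route table: (HTTP method, resource type) -> relation required.
REQUIRED_RELATION: Dict[Tuple[str, str], str] = {
    ("GET", "project"): "viewer",
    ("PUT", "project"): "editor",
}


def build_check_request(store_id: str, user_id: str, relation: str, obj: str) -> Tuple[str, dict]:
    """Return the path and JSON body of an OpenFGA Check call."""
    path = f"/stores/{store_id}/check"
    body = {"tuple_key": {"user": f"user:{user_id}", "relation": relation, "object": obj}}
    return path, body


def gateway_allows(
    method: str,
    resource_type: str,
    resource_id: str,
    claims: dict,
    send: Callable[[str, dict], dict],
    store_id: str = "STORE_ID_PLACEHOLDER",  # placeholder, not a real store ID
) -> bool:
    """Decide at the edge whether the request may be forwarded."""
    relation = REQUIRED_RELATION.get((method, resource_type))
    if relation is None:
        return False  # routes without a mapping are denied by default
    obj = f"{resource_type}:{resource_id}"
    path, body = build_check_request(store_id, claims["sub"], relation, obj)
    response = send(path, body)  # assumed to POST the JSON and return the decoded reply
    return response.get("allowed") is True
```

Passing `send` in as a parameter keeps the decision logic testable without a running OpenFGA server; a real gateway would plug in an HTTP or gRPC client there.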

There is a subtle trap worth making explicit from the beginning. If a microservice can receive calls from another microservice (a worker, a job runner, a fan-out), then the gateway is not the only entry point and the microservice does need to authorize. The clean way to handle that is to propagate the original user identity in every internal gRPC call and ask OpenFGA again whenever a trust boundary is crossed. It sounds expensive and it is not, because OpenFGA can handle thousands of checks per second without blinking.

The rule we wrote down was simple enough: external requests are authorized at the gateway; internal calls preserve the original user; and any service that can receive work outside the normal HTTP flow must treat that entry as a new boundary, not as a magical continuation of the previous one.
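A minimal sketch of that rule for an internal entry point, under stated assumptions: the metadata key `x-original-user`, the `handle_export_job` handler and the `check` callable are all hypothetical, and `check` stands in for whatever client performs the OpenFGA Check.

```python
from typing import Callable, Dict, Mapping

ORIGINAL_USER_KEY = "x-original-user"  # hypothetical gRPC metadata key


class PermissionDenied(Exception):
    """Raised when an internal call fails the authorization check."""


def outgoing_metadata(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Carry the original user into the metadata of the next internal call."""
    return {ORIGINAL_USER_KEY: incoming[ORIGINAL_USER_KEY]}


def handle_export_job(
    metadata: Mapping[str, str],
    project_id: str,
    check: Callable[[str, str, str], bool],
) -> str:
    """Worker entry point: treat the incoming call as a new trust boundary."""
    user = metadata.get(ORIGINAL_USER_KEY)
    if not user:
        # No propagated identity means nobody to authorize, so refuse.
        raise PermissionDenied("internal call without an original user")
    obj = f"project:{project_id}"
    if not check(user, "viewer", obj):
        raise PermissionDenied(f"{user} is not a viewer of {obj}")
    return f"export of {obj} started for {user}"
```

The handler never assumes that an earlier check still holds; it asks again with the propagated identity, which is what makes the worker safe to call from somewhere other than the gateway.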

The audit bonus

There is a side effect that I think is more valuable than it looks at first. Tuples are created and revoked through an API, and changes end up in a log. When the day comes (and that day does come) where someone asks "who had access to this resource on March 5 at 14:00?", the log answers for you. If your product stores sensitive records append-only for traceability, you already have half the work done because the same principle applies to permissions.
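To make the "who had access at that moment?" question concrete, here is a small sketch that replays a change log up to a point in time. It assumes entries shaped roughly like those returned by OpenFGA's ReadChanges endpoint (a `tuple_key`, an `operation` and a `timestamp`); the function names are hypothetical. It only reconstructs the tuples written directly on the object; access inherited through the model would still need to be evaluated against that reconstructed state.

```python
from datetime import datetime
from typing import Iterable, Set, Tuple


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-03-05T13:30:00Z."""
    # Real logs may carry more fractional digits and need a more tolerant parser.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def direct_tuples_at(changes: Iterable[dict], obj: str, at: datetime) -> Set[Tuple[str, str]]:
    """Return the (user, relation) pairs held directly on obj at time `at`."""
    active: Set[Tuple[str, str]] = set()
    for change in sorted(changes, key=lambda c: parse_ts(c["timestamp"])):
        if parse_ts(change["timestamp"]) > at:
            break  # everything after this happened later than the question
        key = change["tuple_key"]
        if key["object"] != obj:
            continue
        pair = (key["user"], key["relation"])
        if change["operation"] == "TUPLE_OPERATION_WRITE":
            active.add(pair)
        elif change["operation"] == "TUPLE_OPERATION_DELETE":
            active.discard(pair)
    return active
```

Calling `direct_tuples_at(changes, "project:42", some_datetime)` with a timezone-aware datetime returns the state of that object as the log saw it at that instant.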

In our case this mattered quite a bit because many permissions are born from business events: a project is created, someone joins a team, an invitation is revoked, a resource owner changes. If each event writes or deletes an explicit tuple, the access history stops being hidden across five databases and starts living in one place you can inspect.
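One way to picture this is a single function that translates each business event into the body of an OpenFGA Write request (`writes` and `deletes`, each with a list of `tuple_keys`). The event names, their fields and the relation names below are assumptions made up for the example, not our actual schema.

```python
from typing import Dict, List


def tuple_key(user: str, relation: str, obj: str) -> Dict[str, str]:
    return {"user": user, "relation": relation, "object": obj}


def write_body_for_event(event: dict) -> dict:
    """Translate a (hypothetical) business event into a Write request body."""
    writes: List[dict] = []
    deletes: List[dict] = []
    kind = event["type"]
    if kind == "project_created":
        project = f"project:{event['project']}"
        writes.append(tuple_key(f"user:{event['creator']}", "owner", project))
        writes.append(tuple_key(f"team:{event['team']}", "team", project))
    elif kind == "team_member_joined":
        writes.append(tuple_key(f"user:{event['user']}", "member", f"team:{event['team']}"))
    elif kind == "invitation_revoked":
        deletes.append(tuple_key(f"user:{event['user']}", "viewer", f"project:{event['project']}"))
    elif kind == "owner_changed":
        project = f"project:{event['project']}"
        deletes.append(tuple_key(f"user:{event['old_owner']}", "owner", project))
        writes.append(tuple_key(f"user:{event['new_owner']}", "owner", project))
    else:
        raise ValueError(f"no tuple mapping for event type {kind!r}")
    body: dict = {}
    if writes:  # only include a section when it has something in it
        body["writes"] = {"tuple_keys": writes}
    if deletes:
        body["deletes"] = {"tuple_keys": deletes}
    return body
```

Raising on unknown event types is deliberate in this sketch: an event that silently produces no tuple is exactly the forgotten-tuple bug the approach is meant to surface.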

Why we put OpenFGA in from the first commit

When you start a system like this, there are two moments where you can introduce OpenFGA: now, before a single endpoint exists, or later, when you already have permissions wired into fifteen places and need to migrate them. The first option is writing a model. The second is a multi-month project with regression risk in every service. The cost difference between those two paths is what made us decide before writing the first microservice.

The promise we are buying is this: in a platform with a long list of microservices that share identity and hierarchy, OpenFGA reduces a whole family of bugs to one statement that is much easier to reason about: "did you remember to write the tuple when you created the resource?". It is a bounded problem, with a clear answer and a single place to put it. The alternative problem, "did the third service deploy the new role table, and does the fourth interpret the same enum in the same way, and what about the fifth?", is one you never have a fully reassuring answer to.

The first afternoon is spent fighting the model and its from and or clauses, and it is better to accept that from the start. In exchange, writing a new endpoint no longer includes the step "now I will copy the permissions logic from the neighboring endpoint and hope I did not break anything". In a system that is going to grow to seventeen services, that copy-and-hope step is exactly the technical debt we do not want to take on.
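For readers who have not fought with those two keywords yet, this toy evaluator mimics what `or` and `from` mean in a model. The model in the comment and every tuple are invented for the example, and the hand-written function only imitates the semantics for one relation; OpenFGA itself evaluates the model generically.

```python
# Illustrative model, written in OpenFGA's modeling language:
#
#   type team
#     relations
#       define member: [user]
#   type project
#     relations
#       define team: [team]
#       define editor: [user]
#       define viewer: [user] or editor or member from team

# Made-up tuples: (user, relation, object).
TUPLES = {
    ("user:ana", "member", "team:core"),
    ("team:core", "team", "project:api"),
    ("user:bo", "editor", "project:api"),
}


def direct(user: str, relation: str, obj: str) -> bool:
    """True when the exact tuple has been written."""
    return (user, relation, obj) in TUPLES


def is_viewer(user: str, project: str) -> bool:
    # "or" is a union: any one branch being true is enough.
    if direct(user, "viewer", project):
        return True
    if direct(user, "editor", project):
        return True
    # "member from team": follow the project's team tuples, then ask
    # whether the user is a member of any of those teams.
    teams = [u for (u, r, o) in TUPLES if r == "team" and o == project]
    return any(direct(user, "member", team) for team in teams)
```

Here `is_viewer("user:ana", "project:api")` is true only through the team, and `is_viewer("user:bo", "project:api")` only through the editor branch; neither user has a viewer tuple of their own.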

We did not choose OpenFGA because it made complexity disappear. We chose it because we preferred to carry that complexity once in the model rather than spread it across seventeen different services and discover the differences once real data was already moving through the system.