Revocation

How to revoke a token

Revocation identifiers

Once a token has been created, we need a way to invalidate it, either as part of its lifecycle (logging out, decommissioning a service) or because it was compromised. Tokens can come with an expiration date, but expiration alone is not sufficient, as there will always be some delay between the leak and the expiration date. So we need a way to revoke existing tokens.

Revoking bearer tokens like biscuits is usually done through revocation lists: a list of tokens that are no longer accepted is shared with all verifying parties. When authorizing a biscuit token, the library will look it up in the list and refuse the request if it finds it.

Such a mechanism relies on being able to uniquely identify tokens: we want to revoke only the tokens that are no longer valid, without affecting other tokens (even tokens with the same content that have been issued to another holder). With offline attenuation, biscuits introduce another constraint: revoking a token should also revoke all derived tokens (otherwise it would be trivial to circumvent revocation).

The biscuit spec (and libraries) provides you with:

a way to uniquely identify tokens (two biscuits with the same payload and secret key will be different)
a way to identify groups of tokens derived from the same parent token
a way to reject tokens based on that identification during authorization

Biscuit's revocation identifiers are unique and generated directly from the token's structure, so there is no need to add them explicitly, as would be done with Macaroons or the jti claim in JWT.
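As a minimal sketch of the check described above, assume the revocation ids of a token have already been extracted by the library and are handled as hex strings. The function name is_revoked and the shortened ids are made up for illustration; they are not part of any biscuit library.

```python
def is_revoked(token_revocation_ids, revoked_ids):
    """Return True if any block of the token carries a revoked id.

    token_revocation_ids: one id per block, authority block first.
    revoked_ids: set of ids published in the revocation list.
    """
    return any(rid in revoked_ids for rid in token_revocation_ids)


# Hypothetical ids, shortened for readability.
parent = ["aa01"]             # authority block only
derived = ["aa01", "bb02"]    # parent attenuated with one more block
sibling = ["aa01", "cc03"]    # another attenuation of the same parent

# Revoking the parent's id rejects every token derived from it.
assert is_revoked(parent, {"aa01"})
assert is_revoked(derived, {"aa01"})

# Revoking the id of an attenuation block only rejects that branch.
assert not is_revoked(parent, {"bb02"})
assert is_revoked(derived, {"bb02"})
assert not is_revoked(sibling, {"bb02"})
```

Because a derived token keeps all the blocks of its parent, checking every block's id is what makes revocation of a parent carry over to its attenuated versions.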

The biscuit spec does not mandate how to publish revocation ids within your system; that depends a lot on the architecture and constraints. You can start simple with static revocation lists read through environment variables, and migrate to more complex systems as needed. We describe in this document various ways to achieve it.

Listing revocation ids for a token

The CLI can be used to inspect revocation ids:

❯ biscuit inspect test9_expired_token.bc --raw-input
Authority block:
== Datalog ==

== Revocation id ==
16d0a9d7f3d29ee2112d67451c8e4ff07bd5366a6cdb082cf4fcb66e6d15a57a22009ef1018fc4d0f9184edb0900df161807bc6f8287275f32eae6b5b1c57100

==========

Block n°1:
== Datalog ==
check if resource("file1");
check if time($date), $date <= 2018-12-20T00:00:00+00:00;

== Revocation id ==
0670d948462e0cc248ce45b7ea04cbfb126a7559c8d60b533f7f0a92696900ee4e432780b526462b845d372c9b7b223c43efc22e0441b14b0bc4661e05ebfe03

==========

🙈 Public key check skipped 🔑
🙈 Datalog check skipped 🛡️

Providing a revocation list during biscuit authorization

Why should we plan for token revocation?

There are two competing approaches to session management in authorization, and the choice between them drives architectural decisions:

in stateful systems, all authorizations are performed through one service or database that holds the list of currently active sessions
in stateless systems, authorization can be performed independently in any service, only using information from the token and the service. In particular, the service cannot know about all of the currently active sessions (there may not even be a concept of session)

Those two solutions are often compared on their ability to close a session. Why? Can't we just set an expiration date? Unfortunately, even with an expiration date we would still need a way to close a session, to implement the log out functionality. That feature is common, expected by users, and needed in multiple situations (for example, logging out of a public computer or disconnecting sessions from a stolen phone). Even for purely service to service communication, we will need to close the access once the client service is decommissioned.

In stateful systems, closing a session is easy: we delete the session's information from the database and that's it. In stateless systems, this is more complex: how do we make sure all services know that the session is invalid? That means reintroducing some shared state, so is the stateless design impossible after all? Shouldn't we go back to stateful systems?

If the architecture we are designing can rely on a central state, it will be the simplest approach and probably the right solution. But there are good reasons to choose the stateless design:

scaling: a central authorization service that is queried on every request is a single point of failure for the entire system. If it is down, not only can nobody log in, but existing sessions fail too. A stateless approach decouples session creation from authorization, so existing sessions keep working when the authentication service fails.
isolation: the service receiving the request might be less trusted and should not be able to access session information.
authentication delegation: authentication could be in a separate service (example: SSO) that can't be queried on every request. That service could even be managed by a different company.

In those cases, separating authorization from session creation makes sense, but then how do we close a session? It is usually done through token revocation: the authorizer needs to know a list of tokens that must be refused, and that list changes dynamically, so we are reintroducing some state in the system.

But revocation has properties that make it nice to implement in stateless architectures:

we do not need to know about all of the tokens, only those that were revoked, so the list will be much smaller
the list of revoked tokens will naturally grow, but if tokens have an expiration date, they can be purged from the list after a while
it is read-oriented and highly cacheable: once a token was added to the revocation list, we won't modify its entry (except when purging), so we don't need synchronization or consensus
revocation lists do not hold any critical or private information, they can be shared with every service

So handling revocation is adding some shared state, but much more limited than what we would have with a fully centralized architecture.

How to implement revocation in our infrastructure?

We need a reliable way to transmit revocation information to services. That will depend on how quickly we want to disseminate it, and how much complexity we can bear.

The basic solution: read the revocation list at startup

In some cases, like communication between automated services, revocation is rare, mostly when a service is stopped or a token is leaked, so the revocation list is mostly static and small. If we can accept some manual operations, and a (slight) delay in synchronization, we can have services read the revocation list at startup. They will check tokens against an in-memory list that stays the same for the entire life of the service (until it is restarted).

The tradeoff here is that if we need to revoke a token urgently, we will need to redeploy a lot of services at once.

In the case where there is only one service accepting tokens, the revocation list can be read from config (a config file or environment variables).
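A minimal sketch of the single-service case, assuming the list is passed as comma-separated hex ids in an environment variable. The variable name REVOKED_IDS and the function name are hypothetical choices for this example.

```python
import os


def load_revocation_list(var_name="REVOKED_IDS"):
    """Parse a comma-separated list of hex revocation ids from the environment."""
    raw = os.environ.get(var_name, "")
    ids = set()
    for item in raw.split(","):
        item = item.strip().lower()
        if not item:
            continue
        bytes.fromhex(item)  # raises ValueError on malformed ids
        ids.add(item)
    return frozenset(ids)


# Read once at startup; the set stays the same until the service restarts.
REVOKED = load_revocation_list()
```

Failing at startup on a malformed id is a deliberate choice here: a typo in the configuration would otherwise silently leave a token unrevoked.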

In the case where more services accept tokens, it will become necessary to have a centrally defined list that is then distributed to all services. Since the revocation list is small and changes rarely, it can be stored as a file in an object store like S3, and downloaded via HTTP. That file can be updated independently whenever a service stops, or when one of the tokens expires or is leaked.

Depending on how quickly you want revocation to take effect, you can either wait for the services to restart or trigger a restart of all affected services.

Slightly more advanced: download the revocation list regularly

The natural next step from the previous solution: instead of downloading the list once, it is downloaded regularly to keep it up to date. There is still a gap between revocation and its deployment, but that gap is configurable: we can decide how often a service checks for a new list.

The list can still be stored in an object store. It is a good idea to rely on HTTP caching solutions like the ETag or If-Modified-Since headers. If the revocation list grows and/or becomes more dynamic, this solution will incur a lot of traffic.
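The periodic download with ETag validation could look like the sketch below. It assumes the list is a text file with one hex revocation id per line; the class name RevocationCache, the helper http_fetch and the file layout are assumptions made for this example.

```python
import urllib.error
import urllib.request


def http_fetch(url, etag=None):
    """Conditional GET. Returns (changed, etag, body)."""
    headers = {"If-None-Match": etag} if etag else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read().decode("utf-8")
            return True, response.headers.get("ETag"), body
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the cached list
            return False, etag, None
        raise


class RevocationCache:
    def __init__(self, url, fetch=http_fetch):
        self.url = url
        self.fetch = fetch        # injectable, so it can be replaced in tests
        self.etag = None
        self.revoked = frozenset()

    def refresh(self):
        """Call this on a timer; returns True when the list changed."""
        changed, etag, body = self.fetch(self.url, self.etag)
        if changed:
            self.etag = etag
            self.revoked = frozenset(
                line.strip() for line in body.splitlines() if line.strip()
            )
        return changed
```

A service would call refresh() at the interval it has chosen. When the object store answers 304, no body is transferred and the cached set is kept.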

Download diffs

When the revocation list grows, it might be easier to only download the list of recently revoked tokens. Since that list is append-only, the easiest way might be to store the list in a database table with an incremented id column and give the latest id with the revocation list. When services try to download the revocation list, they can send their last known id, and the server can send the most recent changes.
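Here is a small sketch of that idea, using an in-memory SQLite database as a stand-in for the revocation service's storage. The table layout and the function names revoke and changes_since are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE revoked ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " revocation_id TEXT NOT NULL UNIQUE,"
    " expires_at INTEGER NOT NULL)"
)


def revoke(revocation_id, expires_at):
    """Append an entry; existing entries are never modified."""
    with db:
        db.execute(
            "INSERT OR IGNORE INTO revoked (revocation_id, expires_at)"
            " VALUES (?, ?)",
            (revocation_id, expires_at),
        )


def changes_since(last_id):
    """Server side: ids added after last_id, plus the new latest id."""
    rows = db.execute(
        "SELECT id, revocation_id FROM revoked WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    latest = rows[-1][0] if rows else last_id
    return latest, [rid for _, rid in rows]


# Client side: keep the last known id and merge each diff into the local set.
known, last_id = set(), 0
revoke("aa01", 1700000000)
revoke("bb02", 1700003600)
last_id, new_ids = changes_since(last_id)
known.update(new_ids)

revoke("cc03", 1700007200)
last_id, new_ids = changes_since(last_id)  # only "cc03" is sent this time
known.update(new_ids)
```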

While this relies on a central revocation service, it can be lighter than a stateful system because that central service is queried out of band, at regular intervals, instead of on every request to every service. Services will also be able to serve requests when the revocation service is down.

This can still be implemented over HTTP and rely on caching. It still suffers from a small delay before revocation is actually deployed.

Queue based systems

When we want a more dynamic solution, where revocation spreads as soon as possible, we should instead rely on a queue based system like RabbitMQ or Kafka, or something even simpler like Server Sent Events or WebSockets. In this architecture, every service subscribes to a queue at startup and receives newly revoked tokens as they are published.

This is the safest solution, as tokens are revoked everywhere as quickly as possible. It is also more complex to deploy because it needs a queueing system that must be monitored, scaled, etc. And every service must then integrate the client to connect to that queue.

Its usage will depend on the kind of queue provided by your system. With durable queues, a new service would read all of the messages from the beginning, then receive a new message for a new revoked token. If the service disconnects or restarts, it could reuse a saved local state and an offset in this queue to avoid reading everything again. This requires regular maintenance on the queue to remove expired tokens. With ephemeral queues, the service would need to get the initial state out of band then receive the stream of updates.
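The durable queue case can be illustrated with an in-memory stand-in. DurableLog and Subscriber are hypothetical classes written for this example and do not correspond to the client API of RabbitMQ, Kafka or any other queue; they only show how a saved offset avoids replaying the whole history.

```python
class DurableLog:
    """In-memory stand-in for a durable queue: messages keep their offset."""

    def __init__(self):
        self.messages = []

    def publish(self, revocation_id):
        self.messages.append(revocation_id)

    def read_from(self, offset):
        return self.messages[offset:]


class Subscriber:
    def __init__(self, log, revoked=None, offset=0):
        self.log = log
        self.revoked = set(revoked or ())  # saved local state, if any
        self.offset = offset               # next offset to read

    def poll(self):
        """Consume every message published since the stored offset."""
        batch = self.log.read_from(self.offset)
        self.revoked.update(batch)
        self.offset += len(batch)
        return batch


log = DurableLog()
log.publish("aa01")
log.publish("bb02")

service = Subscriber(log)  # new service: replays from the beginning
service.poll()

# Restart: reuse the saved state and offset instead of replaying everything.
saved_state, saved_offset = set(service.revoked), service.offset
log.publish("cc03")
restarted = Subscriber(log, saved_state, saved_offset)
new_messages = restarted.poll()  # only the message published since the save
```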

How the revocation service receives and stores data

The revocation service establishes the list of revoked tokens and regularly purges expired ones. While this looks simple, there are details to consider.

First, the service that creates tokens (user authentication, or microservice manager) should store the first block's revocation id, along with some metadata, like the creation date, expiration date and expected usage (user id, service id, etc). If a token expires, it is removed from the list. If a user logs out or a service is shut down, the revocation id and expiration date are sent to the revocation service.

If we want to revoke an attenuated token, there are more steps. The user cannot just provide the revocation ids, because we would have no way of knowing if they are trying to revoke a parent token. In that case, the entire token should be presented, then we look up the root block's expiration date in the data we already have, we extract the list of revocation ids from the token, and send the latest one with the expiration date to the revocation service.
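The lookup described above can be sketched as follows. The sketch assumes the token's signature has already been verified and its revocation ids extracted by the library; IssuedToken, issued, revoke_presented_token and the list standing in for the revocation service are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class IssuedToken:
    expires_at: int  # Unix timestamp
    subject: str     # user id, service id, etc.


# Kept by the service that creates tokens, keyed by the first block's
# revocation id.
issued = {"aa01": IssuedToken(expires_at=1700003600, subject="user-42")}

# Stand-in for the messages sent to the revocation service.
sent_to_revocation_service = []


def revoke_presented_token(block_revocation_ids):
    """block_revocation_ids: ids from the full token, root block first."""
    root_id = block_revocation_ids[0]
    record = issued.get(root_id)
    if record is None:
        raise KeyError("unknown root block, refusing to revoke")
    # Revoking the last block leaves the parent token and its other
    # attenuations valid.
    latest_id = block_revocation_ids[-1]
    sent_to_revocation_service.append((latest_id, record.expires_at))
    return latest_id


revoke_presented_token(["aa01", "bb02"])  # attenuated token: revokes "bb02"
```

Requiring the full token means the holder of an attenuated token can only ever revoke its own branch, never the parent it was derived from.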

All tokens should come with an expiration date, to prevent the revocation list from growing indefinitely.
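Purging can then be a simple filter on the stored expiration dates. This sketch assumes the revocation list is kept as a mapping from revocation id to expiration time (Unix timestamp); the function name purge_expired is made up for the example.

```python
import time


def purge_expired(revocation_list, now=None):
    """Drop entries whose token has expired anyway.

    revocation_list maps a revocation id to the token's expiration time.
    """
    now = time.time() if now is None else now
    return {rid: exp for rid, exp in revocation_list.items() if exp > now}


revocation_list = {"aa01": 1_700_000_000, "bb02": 1_700_003_600}
revocation_list = purge_expired(revocation_list, now=1_700_001_000)
# "aa01" expired before `now`, so only "bb02" is still listed.
```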

OAuth specific usage

In OAuth based systems, API clients hold an access token, used to query the API, and a refresh token, used to get a new access token. The idea here is that the access token is used often and potentially on less trusted services, so it has a short expiration, while the refresh token has a longer lifetime because it is only used once in a while, and only with the authorization server.

While it is common to see applications with a permanent refresh token, this causes issues with the revocation list, which keeps growing, and current practices have evolved towards a different approach.

It is now recommended to have a refresh token with an expiration date, that can be long, and have that refresh token be single use. When it is sent to the authorization server to get a new access token, the authorization server will revoke the old refresh token and issue new refresh and access tokens. The interesting property here is that if the authorization server sees the same refresh token twice, it means that the token was stolen: either the thief or the legitimate client already used the refresh token, and the other one is now requesting an access token too. In that case, the authorization server must revoke all current refresh and access tokens for this client.
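The rotation and reuse detection described above can be sketched with opaque random strings standing in for real tokens. The class AuthorizationServer and its methods are hypothetical, and a real deployment would publish the revoked tokens' revocation ids to the revocation list instead of keeping them in a local set.

```python
import secrets


class AuthorizationServer:
    def __init__(self):
        self.refresh_tokens = {}  # active refresh token -> client id
        self.access_tokens = {}   # active access token -> client id
        self.used = {}            # already rotated refresh token -> client id
        self.revoked = set()      # tokens revoked after a detected theft

    def issue(self, client_id):
        refresh, access = secrets.token_hex(16), secrets.token_hex(16)
        self.refresh_tokens[refresh] = client_id
        self.access_tokens[access] = client_id
        return refresh, access

    def refresh(self, refresh_token):
        if refresh_token in self.used:
            # Seen twice: either the thief or the legitimate client already
            # used it, so every current token of this client is revoked.
            self.revoke_client(self.used[refresh_token])
            raise PermissionError("refresh token reuse detected")
        client_id = self.refresh_tokens.pop(refresh_token, None)
        if client_id is None:
            raise PermissionError("unknown or revoked refresh token")
        self.used[refresh_token] = client_id  # single use
        return self.issue(client_id)

    def revoke_client(self, client_id):
        for store in (self.refresh_tokens, self.access_tokens):
            for token in [t for t, c in store.items() if c == client_id]:
                del store[token]
                self.revoked.add(token)


server = AuthorizationServer()
r1, a1 = server.issue("client-1")
r2, a2 = server.refresh(r1)  # normal rotation: r1 is now used up
```

Presenting r1 a second time would raise PermissionError and revoke r2, a1 and a2, forcing the client to authenticate again.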

This solution also has the nice side effect that refresh token expiration can be much shorter, since it is changed any time we change an access token.