Infrastructure – ottonova.tech

by Sergey Podgornyy - 2023-10-13 / 2023-11-13

Migrating from Self-Managed RabbitMQ to Cloud-Native AWS Amazon MQ: A Technical Odyssey

In the ever-evolving world of cloud-native solutions, it can be a daunting task to maintain message brokers. For a while, our team was responsible for a self-managed RabbitMQ instance. While this worked well initially, we encountered challenges in terms of maintenance, version updates, and data recovery. This led us to explore Amazon MQ, a fully managed message broker service offered by AWS.

In this article, we’ll discuss the advantages of both self-managed RabbitMQ and Amazon MQ, the reasons behind our migration, and the hurdles we faced during the transition. Our journey offers insights for other developers who are considering a similar migration path.

The Self-Managed RabbitMQ Era

Our experience with self-managed RabbitMQ was characterized by control, high availability, and the responsibility to ensure data integrity. Here are the main characteristics of this approach, as we experienced them:

Total Control
Running your own RabbitMQ server gives you complete control over configuration, security, and updates. You can fine-tune the setup to meet your specific requirements: ideal for organizations with complex or unique messaging needs.
High Availability
It’s worth noting that our instance was running on AWS EC2, whose SLA guarantees only 99.99%, but in practice we achieved a remarkable uptime of 99.999% with our self-managed RabbitMQ setup. Downtime was almost non-existent, which ensured a reliable message flow through our system. High availability is crucial for many mission-critical applications.
Data Recovery
Ironically, data recovery was a challenge with our self-managed RabbitMQ. In the event of a crash, we lacked confidence in our ability to restore data fully. This vulnerability pushed us to consider Amazon MQ, a fully managed solution.

The Shift to Amazon MQ

As time passed, it became apparent that managing RabbitMQ was no longer sustainable for our team. Here are the primary reasons that drove us to explore Amazon MQ as an alternative:

Skills Gap
Our team lacked in-house experts dedicated to managing RabbitMQ, which posed a risk to our operations. As new RabbitMQ versions were released, staying up to date became increasingly challenging. This skills gap was another reason to consider a fully managed solution.
AWS Integration
As an AWS service, Amazon MQ seamlessly integrated with our existing AWS infrastructure, providing us with a more cohesive and consistent cloud environment. It allowed us to leverage existing AWS services and tools, which resulted in a smooth migration process.
Managed Service
The promise of offloading the operational burden to AWS was enticing. Amazon MQ handles tasks like patching, maintenance, and scaling. This allows our team to focus on more strategic initiatives.
Enhanced Security
One key advantage of switching to Amazon MQ is its foundation on AWS infrastructure. This ensures not only robust security practices but also regular updates that are integrated into the system. It gives us confidence, as we know that potential vulnerabilities are actively monitored and managed.

The Amazon MQ Experience

While the move to Amazon MQ presented numerous benefits, we also encountered some challenges that are worth noting:

SLA Guarantees
Amazon MQ’s service level agreement (SLA) guarantees 99.9% availability. This is generally acceptable for many businesses but was a step down from our self-managed RabbitMQ’s 99.999% uptime. While the difference might seem small, it translates into more potential downtime: roughly 43 minutes per 30-day month at 99.9% versus about 26 seconds at 99.999%. It was a trade-off we had to accept.
Limited Configuration
Amazon MQ abstracts many configuration details. This simplifies management for most users. However, this simplicity comes at the cost of fine-grained control. For organizations with highly specialized requirements, this might be a drawback.
Cost Considerations
Amazon MQ is a managed service, which means there are associated costs. While the managed service helps reduce operational overhead, it’s crucial to factor in the cost implications when migrating.

What do three nines (99.9%) really mean?

Here are my calculations based on the Amazon MQ SLA:

if the monthly downtime is lower than ~43 minutes, they charge 100% of the monthly costs
if the monthly downtime is between ~43 minutes and ~7 hours, they charge 90% of the monthly costs
if the monthly downtime is between ~7 hours and ~1 day, they charge 75% of the monthly costs
and if the monthly downtime is higher than ~1 day, they do not charge anything
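
The downtime thresholds above follow from simple arithmetic on the availability percentage. Here is a minimal sketch of that calculation, assuming a 30-day month; the function name `downtime_budget` is made up for this illustration and is not part of any AWS tooling.

```python
from datetime import timedelta


def downtime_budget(availability_percent: float, days: int = 30) -> timedelta:
    """Return the downtime an availability target allows over `days` days."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    period = timedelta(days=days)
    # The allowed downtime is the share of the period that is NOT covered
    # by the availability target.
    return period * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Targets mentioned in this article, plus 99.0% for the ~7 hour boundary.
    for target in (99.999, 99.99, 99.9, 99.0):
        print(f"{target}% -> {downtime_budget(target)}")
```

For a 30-day month this gives roughly 43 minutes for 99.9% and roughly 7.2 hours for 99.0%, which matches the boundaries used in the list.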

Conclusion

Our migration from self-managed RabbitMQ to Amazon MQ represented a shift in the way we approach message brokers. While Amazon MQ offered many benefits, such as reduced operational burden and seamless AWS integration, it came with some trade-offs, including a lower SLA guarantee and less granular control.

Ultimately, the decision to migrate should be based on your organization’s specific needs, resources, and objectives. For us, the trade-offs were acceptable given the advantages of a managed service within our AWS ecosystem.

The path to a cloud-native solution isn’t always straightforward, but it can lead to more streamlined operations and a greater focus on innovation rather than infrastructure management. Understanding the pros and cons of both approaches is vital for an informed decision about your messaging infrastructure.

As technology continues to evolve, it’s essential to stay adaptable and leverage the right tools and services to meet your business needs. In our case, the migration to Amazon MQ allowed us to do just that.

by Jan Heller - 2022-06-30 / 2022-07-04

Introducing the ottonova Tech Radar

We have always promoted openness when it comes to our tech stack. The ottonova Tech Radar is the next step in that direction.

What is the Tech Radar?

The ottonova Tech Radar is a list of technologies. Each technology is given an assessment outcome, called a ring assignment. There are four rings, with the following definitions:

ADOPT – Technologies we have high confidence in to serve our purpose, also at large scale. Technologies with a usage culture in our ottonova production environment, low risk and recommended to be widely used.
TRIAL – Technologies that we have seen work with success in project work to solve a real problem; first serious usage experience that confirms benefits and can uncover limitations. TRIAL technologies are slightly more risky; some engineers in our organization have walked this path and will share knowledge and experiences.
ASSESS – Technologies that are promising and have clear potential value-add for us; technologies worth investing some research and prototyping effort in to see if they have impact. ASSESS technologies have higher risks; they are often brand new and highly unproven in our organisation. You will find some engineers who have knowledge of the technology and promote it; you may even find teams that have started a prototyping effort.
HOLD – Technologies not recommended to be used for new projects. Technologies that we think are not (yet) worth (further) investment. HOLD technologies should not be used for new projects, but usually can be continued for existing projects.

What do we use it for?

The Tech Radar is a tool to inspire and support engineering teams at ottonova to pick the best technologies for new projects. It provides a platform to share knowledge and experience in technologies, to reflect on technology decisions and continuously evolve our technology landscape.

Based on the pioneering work of ThoughtWorks, our Tech Radar sets out the changes in technologies that are interesting in software development — changes that we think our engineering teams should pay attention to and use in their projects.

When and how is the radar updated?

In general, discussions around technologies and their implementation happen everywhere across our tech departments. Once we identify that a new technology has been raised, we discuss and consolidate it in our Architecture Team.

We collect these entries and once per quarter the Architecture Team rates and assigns them to the appropriate ring definition.

Disclaimer: We used Zalando’s open source code to create our Tech Radar and were heavily influenced by their implementation. Feel free to do the same to create your own version.


Jan Heller

Part of ottonova’s Software Engineering team since the very beginning.
Always on the lookout to improve our teams and to take the next step.
Clean code and KISS evangelist.

by Sergey Podgornyy - 2020-12-05 / 2022-05-03

How and why we updated RabbitMQ queues on production

In this article, I would like to share with you and the whole internet our experience of dealing with live updates of RabbitMQ queues. You will learn some details about our architecture and use cases. Let’s start from the simplest… Why do we need RabbitMQ in our business?

Backend with synchronous tasks processing

Our Architecture

As a health insurance company, our business depends on many different third-party services to analyze risks, process claimable documents, charge monthly payments etc. All these processes take time, so to keep our services fast and independent of each other, we asynchronously process the tasks that can be done in the background. This approach speeds up responses and allows us to do more in the background, e.g. email sending, policy creation, acceptance verification etc.

Whenever a client expresses some intent to the API by making a request to it, this intent can create follow-up tasks. These tasks do not need to be handled synchronously, i.e. they do not need to be handled while processing the initial request. Instead, we put a message about this intent onto the message queue where it can be picked up asynchronously by another process and handled independently from the original request.
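
To make the idea concrete, here is a minimal in-process sketch of this hand-off. It uses Python's standard `queue.Queue` and a worker thread as a stand-in for RabbitMQ and a separate consumer process, so it only illustrates the pattern and is not our actual setup. The function names and the intent names are assumptions made up for this example.

```python
import queue
import threading


def handle_request(intent, tasks):
    """Answer the caller right away and leave the follow-up work on the queue."""
    tasks.put(intent)
    return "accepted"


def worker(tasks, done):
    """Consume intents one by one until the None sentinel arrives."""
    while True:
        intent = tasks.get()
        if intent is None:
            break
        # A real consumer would send the email, create the policy, etc.
        done.append(f"processed {intent}")


def run_demo():
    tasks = queue.Queue()  # stand-in for the message broker
    done = []
    consumer = threading.Thread(target=worker, args=(tasks, done))
    consumer.start()
    for intent in ("send-email", "create-policy"):
        handle_request(intent, tasks)  # returns without waiting for the worker
    tasks.put(None)  # tell the worker to stop
    consumer.join()
    return done
```

The request handler never waits for the follow-up task to finish; it only records the intent, which is what keeps responses fast.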

Problem

But with great opportunities comes great responsibility. Message processing is critical for our business. Some messages could expire without being consumed, or violate the restrictions set by the queue arguments. In theory, this should not happen, or only in very rare cases. But as we are working with customer data, we do not want to lose important messages. To keep dead messages in the message broker without leaving them stuck in the original queue, we use the dead-letter feature.

Messages are published to an exchange and can be sent to multiple queues depending on the routing key. As you can see from the image above, we used the same routing scheme for dead-lettering as for the original queues, so dead messages may end up in the wrong dead-letter queues. This is not very critical if you pick up dead messages manually (considering that they are rare), but it is still strange to find these messages in the wrong place.

To solve this problem, we need to add a new argument to the queue properties: x-dead-letter-routing-key, and its value should be unique. As a unique value for the routing key, we can use the queue name itself. This idea brought our team one step closer to a good solution: we don’t need a dead-letter exchange anymore 🎉. To simplify things, we can use the default nameless exchange "" with the dead-letter queue name as the routing key, and it will forward the message directly to the proper queue.

Dead-letter implementation with proper routing
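
The queue arguments described above can be sketched as follows. This is an illustration under assumptions, not our production code: the `.dead-letter` suffix and the function names are invented for this example, and `channel` is assumed to be any object with a pika-style `queue_declare(queue=..., durable=..., arguments=...)` method.

```python
def dead_letter_queue_name(queue_name: str) -> str:
    # Hypothetical naming convention, not taken from the article.
    return f"{queue_name}.dead-letter"


def queue_arguments(queue_name: str) -> dict:
    """Build the arguments that send dead messages straight to one queue."""
    return {
        # "" is the default (nameless) exchange. It routes a message to the
        # queue whose name equals the routing key.
        "x-dead-letter-exchange": "",
        # Unique per queue, so dead messages cannot land in the wrong place.
        "x-dead-letter-routing-key": dead_letter_queue_name(queue_name),
    }


def declare_with_dead_letter(channel, queue_name: str) -> None:
    """Declare the dead-letter queue first, then the work queue pointing at it."""
    channel.queue_declare(queue=dead_letter_queue_name(queue_name), durable=True)
    channel.queue_declare(
        queue=queue_name,
        durable=True,
        arguments=queue_arguments(queue_name),
    )
```

Because the default exchange delivers by queue name, no separate dead-letter exchange or binding has to be declared.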

Unfortunately, doing it is not as easy as writing or talking about it 😒. To maintain the consistency and stability of the message broker, RabbitMQ does not allow changing the arguments of already existing queues.

Deployment preparation

RabbitMQ does not allow you to change queue arguments at runtime, so the only way to do it is to remove the queues and re-create them with updated arguments. But this is not possible in production, as we might lose messages in the window when the old queues are already removed but the new ones do not exist yet. To solve this problem, we need to introduce temporary queues that handle these messages while the old queues are removed. For a simple system, this is possible with 4 releases:

Create temporary queues, but do not handle messages from them for now.
Switch to the new queues and remove the old queues. At this step, we already have properly configured queues, but their names are different. To return to the old names, we need to do the same steps again.
Create new queues with old names, but with updated arguments. Do not consume messages from them for now.
Switch to the new queues with updated arguments.

Four releases is quite a lot, right? This requires not only a lot of small tasks, but also attention to make sure everything goes right each time. How can we reduce the number? 🤔

The simplest thing we can do is agree to rename the queues. This halves the number of releases, since we do not need to rename them back. This was acceptable to us, and we even gained from it, as we improved the message handling process along the way. But that’s a completely different story 😉.

What else can we do? Enabling consumers and message handling on the new queues right away reduces the release count to just one, but we have to accept the risk of duplicated messages while the new queues already exist and the old ones are still being processed.

At this point, a teammate stopped me, because I had not taken our deployment process into account. We have a blue-green deployment process, which means there are multiple instances of the same service. When you deploy, you take one instance down, upgrade it and bring it back up, then take the other one down to upgrade it. This guarantees that something is always up. In our case, this means there is always a consumer running.

So, messages can definitely be duplicated if deployed during business hours. Deployment takes several minutes, which means that both old and new queues will be active for several minutes.

It was time to analyze and decide whether it is safe to deploy the application at night (and do we really want to do it 🙂), when the message flow is low, or whether it is worth introducing an additional service like Redis to check if a message has already been processed by some consumer, old or new.
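
For reference, the "has this message already been processed" check could look roughly like the sketch below. We did not build this, so everything here is hypothetical: the class and function names are invented, and the in-memory dictionary only stands in for a shared store such as Redis (where the same claim could be one atomic `SET key value NX EX ttl` command). It also assumes that every message carries a unique id set by the publisher.

```python
import time


class ProcessedMessages:
    """In-memory stand-in for a shared store such as Redis (not thread-safe)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # message id -> expiry timestamp

    def claim(self, message_id):
        """Return True only for the first caller that claims this id."""
        now = self._clock()
        # Drop expired entries so the store does not grow forever.
        self._seen = {k: v for k, v in self._seen.items() if v > now}
        if message_id in self._seen:
            return False
        self._seen[message_id] = now + self._ttl
        return True


def consume(message_id, body, store, handler):
    """Run `handler` unless another consumer already claimed the message."""
    if not store.claim(message_id):
        return False  # duplicate: already handled by an old or new consumer
    handler(body)
    return True
```

Note that claiming before handling trades duplicates for possible loss: if a consumer crashes after the claim, the message is skipped on redelivery. That extra complexity is part of what we weighed against simply deploying during a quiet period.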

Release

The easiest way to check the load on our message broker is to check the number of logs by day of the week and time. Since we are a highly focused company working only in Germany, we have a very low message load from late evening to early morning.
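
As a rough sketch of that kind of check, the snippet below counts log entries per weekday and hour. It assumes the log timestamps are available as ISO 8601 strings, which is an assumption for this example; the article does not depend on any particular log format, and the function names are made up.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def load_by_weekday_and_hour(timestamps):
    """Count log entries per (weekday, hour) bucket."""
    counts = Counter()
    for raw in timestamps:
        moment = datetime.fromisoformat(raw)
        counts[(WEEKDAYS[moment.weekday()], moment.hour)] += 1
    return counts


def quietest_hours(counts, top=3):
    """Return the `top` least busy buckets that appear in the logs."""
    # Buckets without any log entry are not in `counts`, so an hour with
    # zero messages has to be spotted by its absence.
    return sorted(counts.items(), key=lambda item: item[1])[:top]
```

Feeding a few weeks of timestamps into `load_by_weekday_and_hour` is enough to see which release windows carry the least traffic.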

The load is not that high, so we can accept the risk that some messages may be duplicated. Even if this happens, their number will be extremely small and we can resolve them manually. This saves the resources and time that two releases would require.

After trying to release after midnight, we found out that we couldn’t do it at night. Some of our third-party services are not available then, so the container simply cannot boot. Well, it was worth trying once; now we know for sure. Nighttime is for sleeping 😴.

But we can still do it late in the evening or early in the morning. One has only to pay attention to the RabbitMQ load.

Late in the evening:

Early in the morning:

We made the decision to press the release button early in the morning after a good night’s sleep. This time everything went fine and there were no duplicates.

It was not an easy way to solve this problem, but it was worth it. While solving it, our team and I learned a lot of interesting things about message consumption and deployment processes. Now the setup is even better than before, with correct queue settings and decoupled message handling 😎.

TL;DR

RabbitMQ does not allow renaming queues or changing queue arguments;
to change something in a queue, you have to remove it and re-create it;
to re-create it safely, you need to use temporary queues;
a stable system may run as multiple instances, so be aware of duplicated messages between old and new queues;
if your business is tied to one timezone and is not heavily loaded at night, it can be acceptable to have duplicated messages instead of over-engineering your consumers.