At pamediakopes.gr we have a lot of systems and several external and internal HTTP APIs. Unsurprisingly, these systems often need to communicate with one another to get work done, and we routinely see tens of thousands of calls being made from system to system.
When a system calls another there are two typical cases:
- The caller needs an immediate response in order to proceed with its work.
- The caller does not require an immediate response and just wants to send information to the called system.
In this post I’m going to talk about the second case which usually manifests itself as a simple POST to an API. Back in the olden days, we handled this transfer of information with the minimum amount of code and fuss we could - meaning that we just implemented the POST to the API.
Figure 1: The happy path.
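For illustration, the happy path boiled down to something like the following Ruby sketch (the endpoint URL and payload shape are made-up examples, not our actual API):

```ruby
require "net/http"
require "json"
require "uri"

# Build the notification POST once, with no retries -- the "happy path".
# The URL and payload below are hypothetical examples.
def build_notification_request(url, payload)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.path, "Content-Type" => "application/json")
  request.body = JSON.generate(payload)
  request
end

# Fire the request and hope nothing goes wrong on the way.
def send_notification(url, payload)
  uri = URI.parse(url)
  request = build_notification_request(url, payload)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

Minimal code, minimal fuss -- and, as the next figure shows, no answer for timeouts or outages.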
Things were simple, the code was small (so the bugs were kept to a minimum) and it was beautiful. Of course occasionally bad things happened, manifesting themselves as network timeouts or planned system outages but hey, nothing beats simplicity, right?
Figure 2: Two flavors of timeouts.
Then we grew and in came the background worker based on RedisQueue.Net, an open source library of ours. This application ran in the background and either picked up keys from Redis to see what was due for transfer or scanned the database to see if something needed to be sent over, then tried to do the POST itself.
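Stripped of the Redis and HTTP details, the worker pattern amounted to a loop like the following hedged sketch (`drain_pending` and the stub delivery block are hypothetical names, not RedisQueue.Net's API):

```ruby
# A toy version of the background-worker pattern: drain pending keys from a
# list (standing in for Redis here) and attempt delivery for each one.
# The block is a stand-in for the actual HTTP POST.
def drain_pending(pending, &deliver)
  failed = []
  until pending.empty?
    key = pending.shift
    begin
      deliver.call(key)
    rescue StandardError
      failed << key # keep it around so a later pass can retry
    end
  end
  failed
end
```

One worker running this is manageable; as we were about to find out, scores of them are not.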
Then we grew some more and the background workers multiplied. And multiplied. And pretty soon it wasn’t fun anymore. Amidst the scores of running worker instances, our cherished notions of simplicity went out the window, and developers started struggling to maintain the rapidly growing code base. All was not well in the state of Denmark, and after a while developers suddenly started crying for no apparent reason.
Then we grew some more and we decided to ~~write some more background workers~~ find a more generic way of dealing with the asynchronous transfer of information from one system to another. We quickly arrived at the conclusion that a message queue of some sort was what we desperately needed, implicitly deciding that we would go for at-least-once delivery (and skip the pain of trying to achieve exactly-once delivery).
Figure 3: Queue-based transfer of information.
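At-least-once delivery has one practical consequence worth spelling out: messages can arrive more than once, so recipients must be idempotent. A minimal sketch of the idea, with made-up names and an in-memory seen-set standing in for a real store:

```ruby
require "set"

# At-least-once delivery means duplicates can arrive, so the recipient must
# tolerate them. Sketch: remember processed message ids and skip repeats.
# In production the seen-set would live in a database, not in memory.
class IdempotentConsumer
  def initialize(&handler)
    @seen = Set.new
    @handler = handler
  end

  # Returns true if the message was processed, false if it was a duplicate.
  def handle(message_id, body)
    return false unless @seen.add?(message_id)
    @handler.call(body)
    true
  end
end
```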
Our criteria for choosing a solution were the following:
- The queue system had to be as unobtrusive as possible. Something that “just worked” without much fuss would be the best.
- The queue system should be highly available.
- In case the final recipient becomes unavailable, the queue system should be able to store messages for at least a week.
- The queue system should have SDKs for both our major application languages, C# and Ruby.
- The queue system should ideally operate without any maintenance from us.
We considered solutions like RabbitMQ and even MSMQ but since we are hosted in the Amazon cloud, we looked more closely at AWS SQS. It seemed that SQS was a great fit for our requirements and it could solve our woes.
Figure 4: Using SQS as the queue.
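Our week-long retention requirement maps directly onto SQS's `MessageRetentionPeriod` queue attribute, which is expressed in seconds and caps out at 14 days. A small sketch of building that attribute hash (the helper name is ours, not the SDK's; passing it to the SQS API is left out to keep the sketch self-contained):

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Build the SQS queue-attribute hash for a retention period of `days` days.
# SQS expects the value in seconds, as a string, and caps it at 14 days.
def retention_attributes(days)
  raise ArgumentError, "SQS caps retention at 14 days" if days > 14
  { "MessageRetentionPeriod" => (days * SECONDS_PER_DAY).to_s }
end
```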
Then AWS SNS caught our attention. SNS works a bit like SQS in the sense that a caller can send a message to it and it takes care of delivering that message to its intended recipient through an HTTP POST, without requiring any coding at all on the recipient side. But there were also more benefits. SNS can send the same message to a number of recipients through several different mechanisms, opening up interesting possibilities. We could have the same message posted to different servers, which would allow a staging or test system to be fed with production data. Or we could support a parallel run if we wanted to make a major (and thus risky) upgrade to one of our systems. We could also send a formatted email with the contents of the message, and we could keep messages in a queue in case we ever wanted to use them for other purposes.
Figure 5: SNS and some possibilities.
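The fan-out idea is easy to model: one publish call, several independent subscribers, each with its own delivery mechanism (production endpoint, staging endpoint, email, queue). The toy `Topic` class below is our own illustration, not the SNS SDK:

```ruby
# SNS fan-out in miniature: one publish, many subscribers. Subscribers here
# are plain callables standing in for SNS subscriptions.
class Topic
  def initialize
    @subscribers = []
  end

  def subscribe(&callback)
    @subscribers << callback
  end

  # Deliver the same message to every subscriber, in subscription order.
  def publish(message)
    @subscribers.each { |s| s.call(message) }
  end
end
```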
It appeared that SNS would solve our problems in a completely clean and simple way, so off we went! We created our first topic, the caller used the AWS SDK to send notifications to that topic, and the recipient readied their shiny new endpoint, waiting for messages. At go-live we released and happily patted ourselves on the back, seeing information flow from system to system with surprising ease. Pizzas were ordered and readily consumed and that was that.
Figure 6: Our initial production setup.
…not. After a classic case of RTFMism performed later than sooner, we discovered the following phrase in the AWS documentation:
“The maximum lifetime of a message in the system is one hour.”
We found this out when we started looking for ways to configure the delivery retry policy of SNS. This meant that we could not satisfy our requirement of keeping messages for a week in case a system becomes unavailable. Two things happened when we read this:
- A few brain neurons were damaged by the shock (the same effect could have been more joyfully achieved by consuming strong alcoholic beverages).
- We started looking for a way to overcome this.
Figure 7: SNS keeps messages for HTTP only for 1 hour.
We clearly liked the benefits of SNS, but it was equally clear that our systems, cool and robust as they might be, could still experience downtime of more than an hour for a variety of reasons. We thought of throwing SNS out the window and going back to the idea of using SQS directly, but this would have the undesired effect that recipients would need to write code to access SQS on their own. Yikes.
It was clear that the functionality missing from SNS was key to us. At that point we decided that we could write it ourselves, and thus AWS Redrive was born. This turned out to be a slight turn back towards the era of the background worker - but with a significant twist. AWS Redrive was coded as a multi-threaded application that could service any number of queues and therefore deliver messages from many queues to many API endpoints. The resulting setup is depicted below.
As you can see, we stuck to the initial plan of posting messages to an SNS topic. But instead of sending them directly to the final endpoint, we hooked an SQS queue to the topic, and AWS Redrive carries out the mundane work of reading messages from the queue and posting them to the final endpoint. If AWS Redrive succeeds with the POST, it removes the message from the queue; otherwise it leaves it there and tries again later. We also hooked a dead-letter queue into the scheme and monitor it from Zabbix - messages appearing in the dead-letter queue signify that something went wrong and we need to take a look.
Figure 8: AWS Redrive.
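The core loop of AWS Redrive can be sketched roughly as follows, with plain Ruby objects standing in for SQS and the HTTP client (names and structure are illustrative, not the actual implementation):

```ruby
# One delivery pass: attempt a POST for each visible message, delete the
# message on success, leave it in place on failure so a later pass retries it.
# `queue` is a plain array standing in for SQS; the block stands in for the
# HTTP POST and returns true on success.
def redrive_pass(queue, &post)
  delivered = []
  queue.dup.each do |message|
    next unless post.call(message) # POST failed: leave the message in place
    delivered << message
    queue.delete(message)          # POST succeeded: remove it from the queue
  end
  delivered
end
```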
Some significant aspects worth mentioning:
- Depending on the nature of messages flowing from a queue to an API, we configure delivery delays if needed. Generally we want messages to appear in the queue as soon as the caller puts them in SNS but there are exceptions and in one case we wanted the recipient to see messages after one minute.
- Since the final endpoint might not be available or might not like a particular message, special care is needed to ensure that the default visibility timeout is set to an appropriate value. If the API responds to a message with 500 and the default visibility timeout is set to, say, 1 second, then chances are that this message will be retried very quickly. Setting a value such as 60 seconds ensures that other messages get their chance to be processed if one fails.
- AWS Redrive uses long-polling to avoid hammering SQS (which also helps limit the number of operations it performs, keeps down network traffic, consumes less resources and minimizes the bill).
- Setting an appropriate number of retries before something is sent to the dead-letter queue also depends upon the purpose of each queue. In some cases, the final endpoint might not be ready to process a message because it’s missing some information but it might be ready to do so after a while. In cases such as these, it would make sense to allow a message to be read 10 times before it goes into the dead-letter queue.
- It is important to remember that SQS does not guarantee that the messages are delivered in the order they were received (good thing that wasn’t a requirement).
- With a few million messages delivered per month, the bill runs at about $2 per month.
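The visibility-timeout behavior described above can be modeled in a few lines. In this toy version (our own illustration, not SQS itself), time is passed in explicitly so the hide-then-reappear behavior is easy to follow:

```ruby
# A toy model of the SQS visibility timeout: after a receive, a message is
# hidden until `now + timeout`, so a failed delivery isn't retried instantly.
# Time is injected rather than read from the clock to keep things checkable.
class VisibleQueue
  def initialize(timeout:)
    @timeout = timeout
    @invisible_until = {}
  end

  # Return the first currently visible message (or nil) and hide it.
  def receive(messages, now:)
    msg = messages.find { |m| now >= (@invisible_until[m] || 0) }
    @invisible_until[msg] = now + @timeout if msg
    msg
  end
end
```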
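The dead-letter wiring itself is expressed in SQS as a `RedrivePolicy` queue attribute: a JSON document naming the target queue and the maximum receive count. A sketch of building one (the ARN is a made-up example and the helper name is ours):

```ruby
require "json"

# Build the JSON body of the SQS RedrivePolicy attribute: after a message has
# been received `max_receives` times without being deleted, SQS moves it to
# the dead-letter queue identified by `dlq_arn`.
def redrive_policy(dlq_arn, max_receives)
  JSON.generate(
    "deadLetterTargetArn" => dlq_arn,
    "maxReceiveCount" => max_receives.to_s
  )
end
```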
To date, AWS Redrive is one of the smallest and most trouble-free applications we’ve released to production (9 instances of SNS/SQS/AWS Redrive combos so far), so much so that we occasionally forget it’s there. It’s one of those rare cases where a couple of days of coding provide an inordinate amount of satisfaction. At the same time, it’s one of those cases where we’ll be happy when the day comes to finally trash our creation - and that will, of course, happen when enough users give Amazon feedback asking them to implement this sorely needed feature.
So, a call to action for all AWS users out there. If you happen to find yourself in the AWS SNS console, then please do everybody a favor: click the feedback button and copy-paste the following:
“Please allow SNS to support a maximum lifetime of 14 days for messages delivered to HTTP/HTTPS endpoints”.