Benchmarking SQS

SQS, Simple Message Queue, is a message-queue-as-a-service offering from Amazon Web Services. It supports only a handful of messaging operations, far from the complexity of e.g. AMQP, but thanks to the easy to understand interfaces, and the as-a-service nature, it is very useful in a number of situations.

But how fast is SQS? How does it scale? Is it useful only for low-volume messaging, or can it be used for high-load applications as well?


If you know how SQS works, and want to skip the details on the testing methodology, you can jump straight to the test results.

SQS semantics

SQS exposes an HTTP-based interface. To access it, you need AWS credentials to sign the requests. But that’s usually done by a client library (there are libraries for most popular languages; we’ll use the official Java SDK).

The basic message-related operations are:

  • send a message, up to 256 KB in size, encoded as a string. Messages can be sent in bulks of up to 10 (but the total size is capped at 256 KB).
  • receive a message. Up to 10 messages can be received in bulk, if available in the queue
  • long-polling of messages. The request will wait up to 20 seconds for messages, if none are available initially
  • delete a message

There are also some other operations, concerning security, delaying message delivery, and changing a messages’ visibility timeout, but we won’t use them in the tests.

SQS offers at-least-once delivery guarantee. If a message is received, then it is blocked for a period called “visibility timeout”. Unless the message is deleted within that period, it will become available for delivery again. Hence if a node processing a message crashes, it will be delivered again. However, we also run into the risk of processing a message twice (if e.g. the network connection when deleting the message dies, or if an SQS server dies), which we have to manage on the application side.

SQS is a replicated message queue, so you can be sure that once a message is sent, it is safe and will be delivered; quoting from the website:

Amazon SQS runs within Amazon’s high-availability data centers, so queues will be available whenever applications need them. To prevent messages from being lost or becoming unavailable, all messages are stored redundantly across multiple servers and data centers.

Testing methodology

To test how fast SQS is and how it scales, we will be running various numbers of nodes, each running various number of threads either sending or receiving simple, 100-byte messages.

Each sending node is parametrised with the number of messages to send, and it tries to do so as fast as possible. Messages are sent in bulk, with bulk sizes chosen randomly between 1 and 10. Message sends are synchronous, that is we want to be sure that the request completed successfully before sending the next bulk. At the end the node reports the average number of messages per second that were sent.

The receiving node receives messages in maximum bulks of 10. The AmazonSQSBufferedAsyncClient is used, which pre-fetches messages to speed up delivery. After receiving, each message is asynchronously deleted. The node assumes that testing is complete once it didn’t receive any messages within a minute, and reports the average number of messages per second that it received.

Each test sends from 10 000 to 50 000 messages per thread. So the tests are relatively short, 2-5 minutes. There are also longer tests, which last about 15 minutes.

The full (but still short) code is here: Sender, Receiver, SqsMq. One set of nodes runs the MqSender code, the other runs the MqReceiver code.

The sending and receiving nodes are m3.large EC2 servers in the eu-west region, hence with the following parameters:

  • 2 cores
  • Intel Xeon E5-2670 v2
  • 7.5 GB RAMs

The queue is of course also created in the eu-west region.

Minimal setup

The minimal setup consists of 1 sending node and 1 receiving node, both running a single thread. The results are, in messages/second:

  average min max
sender 429 365 466
receiver 427 363 463

Scaling threads

How do these results scale when we add more threads (still using one sender and one receiver node)? The tests were run with 1, 5, 25, 50 and 75 threads. The numbers are an average msg/second throughput.

number of threads: 1 5 25 50 75
sender per thread 429,33 407,35 354,15 289,88 193,71
sender total 429,33 2 036,76 8 853,75 14 493,83 14 528,25
receiver per thread 427,86 381,55 166,38 83,92 47,46
receiver total 427,86 1 907,76 4 159,50 4 196,17 3 559,50

As you can see, on the sender side, we get near-to-linear scalability as the number of thread increases, peaking at 14k msgs/second sent (on a single node!) with 50 threads. Going any further doesn’t seem to make a difference.


The receiving side is slower, and that is kind of expected, as receiving a single message is in fact two operations: receive + delete, while sending is a single operation. The scalability is worse, but still we can get as much as 4k msgs/second received.

Scaling nodes

Another (more promising) method of scaling is adding nodes, which is quite easy as we are “in the cloud”. The test results when running multiple nodes, each running a single thread are:

number of nodes: 1 2 4 8
sender per node 429,33 370,36 350,30 337,84
sender total 429,33 740,71 1 401,19 2 702,75
receiver per node 427,86 360,60 329,54 306,40
receiver total 427,86 721,19 1 318,15 2 451,23

In this case, both on the sending&receiving side, we get near-linear scalability, reaching 2.5k messages sent&received per second with 8 nodes.


Scaling nodes and threads

The natural next step is, of course, to scale up both the nodes, and the threads! Here are the results, when using 25 threads on each node:

number of nodes: 1 2 4 8
sender per node&thread 354,15 338,52 305,03 317,33
sender total 8 853,75 16 925,83 30 503,33 63 466,00
receiver per node&thread 166,38 159,13 170,09 174,26
receiver total 4 159,50 7 956,33 17 008,67 34 851,33

Again, we get great scalability results, with the number of receive operations about half the number of send operations per second. 34k msgs/second processed is a very nice number!


To the extreme

The highest results I managed to get are:

  • 108k msgs/second sent when using 50 threads and 8 nodes
  • 35k msgs/second received when using 25 threads and 8 nodes

I also tried running longer “stress” tests with 200k messages/thread, 8 nodes and 25 threads, and the results were the same as with the shorter tests.

Running the tests – technically

To run the tests, I built Docker images containing the Sender/Receiver binaries, pushed to Docker’s Hub, and downloaded on the nodes by Chef. To provision the servers, I used Amazon OpsWorks. This enabled me to quickly spin up and provision a lot of nodes for testing (up to 16 in the above tests). For details on how this works, see my “Cluster-wide Java/Scala application deployments with Docker, Chef and Amazon OpsWorks” blog.

The Sender/Receiver daemons monitored (by checking each second the last-modification date) a file on S3. If a modification was detected, the file was downloaded – it contained the test parameters – and the test started.

Summing up

SQS has good performance and really great scalability characteristics. I wasn’t able to reach the peak of its possibilities – which would probably require more than 16 nodes in total. But once your requirements get above 35k messages per second, chances are you need custom solutions anyway; not to mention that while SQS is cheap, it may become expensive with such loads.

From the results above, I think it is clear that SQS can be safely used for high-volume messaging applications, and scaled on-demand. Together with its reliability guarantees, it is a great fit both for small and large applications, which do any kind of asynchronous processing; especially if your service already resides in the Amazon cloud.

As benchmarking isn’t easy, any remarks on the testing methodology, ideas how to improve the testing code are welcome!

  • Michael Barreiro

    Thank you for this comparison. I was looking for some code to test it myself and it looks like you did all the hard work and already have (current) results.

  • Alex

    By default the Java SDK only allows a maximum of 50 concurrent connections, that’s probably where the performance cap at 50 threads comes from. If you change your client configuration you can probably go even faster per node.

  • Ah, that would make sense, thanks! Still I would expect it to max out not much further afterwards, but who knows :)

  • Martin

    This may be a simple question and I may have over looked it, but in each thread did you create a new client SQS object in each thread or were you sharing the same SQS object across threads?

  • wondering how much throughput is correlated with message size. did you ever try this with messages larger than 100 bytes?

  • No, I didn’t do such tests, but of course that would be useful to do as well

  • ok, i ran some tests today (using my own test scripts, so absolute numbers are not directly comparable).

    the effect of message size seems to be much smaller than expected — i only get around 10% less throughput when using 10KB messages, compared to 100 byte messages.

  • Thanks! Most probably most of the latency comes from actually doing the HTTP request than its size.

  • yeah, i agree. that’s also why batching makes a lot of sense. too bad the max batch size in SQS is 10.

  • Vivek Chaurasiya

    I have a simple sender/receiver and when I calcuate the trip time (currentTime – MessageenqueueTime) it comes ~ 500 ms. Is this expected? I am doing long polling.


  • I didn’t test for latency specifically (though a future update of may include such tests), so hard to say.

    Also, you are comparing timings from two different clocks, one of which you can’t control – so the results can be not precise.

  • Looking for a lover for meetin