Why keep TLS termination away from an MQTT Broker

What performance difference exists between TLS termination handled inside the broker and outside it?
I start with a short introduction to TLS and why you should always expose the secure port (8883) to the world while moving TLS termination outside the broker.

Do you already know the TLS basics?
[yo-yo-yo] Skip this paragraph!

TLS is a cryptographic protocol that provides end-to-end security of data sent between applications over the Internet. TLS uses asymmetric cryptography for securely generating and exchanging a session key. The session key is then used for encrypting the data transmitted by one party, and for decrypting the data received at the other end. Once the session is over, the session key is discarded. Asymmetric cryptography uses key pairs — a public key, and a private key. The public key is mathematically related to the private key, but given enough key length, it is computationally impractical to derive the private key from the public key. This allows the public key of the recipient to be used by the sender to encrypt the data they wish to send to them, but that data can only be decrypted with the private key of the recipient.

[2]

The advantage of asymmetric cryptography is that the process of sharing encryption keys does not have to be secure, but the mathematical relationship between public and private keys means that much larger key sizes are required. [1]

Every MQTT broker uses port 8883 exclusively for MQTT over TLS, and you should always use it.

In a perfect world, with TLS implemented properly on both client and server, mechanisms like session resumption (which lets an already negotiated TLS session be reused after reconnecting to the server) avoid performing a new handshake on every reconnect. But in such a Valhalla, DDoS attacks wouldn't exist either. Session renegotiation requires a disproportionate amount of server-side resources, making it a weak entry point for denial-of-service attacks. Why? To reuse sessions across load balancers, servers must share a session cache (such as Redis) in their session handlers.
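As a minimal sketch of what session reuse looks like in practice, here are NGINX's TLS session directives (paths are placeholders). Note that NGINX's built-in `ssl_session_cache` is shared only between the workers of a single instance; to resume sessions across load-balanced instances, one option is session tickets with the same ticket key distributed to every node:

```nginx
# Hypothetical snippet: TLS session reuse in NGINX.
ssl_session_cache      shared:SSL:10m;   # per-instance cache, shared by workers
ssl_session_timeout    1h;
ssl_session_tickets    on;
# Same key file on every node so tickets issued by one instance
# can be decrypted by the others (rotate this key regularly).
ssl_session_ticket_key /etc/nginx/tls/ticket.key;
```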

The MQTT protocol strongly recommends maintaining a long-lived TCP connection. If you deal with many reconnects, the handshake overhead can be significant. You can limit client connections, or move TLS termination out of the broker's field of action.

If you run a weak infrastructure with a single broker instance, during a restart or update you may run into degraded performance when all the clients try to reconnect concurrently.

The solution?
Terminate the TLS handshake outside the broker. You can use NGINX or HAProxy to forward the decrypted traffic to the internal 1883 port.
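A minimal sketch of this setup with NGINX's stream module (the upstream address and certificate paths are placeholders, not the configuration used in the tests below):

```nginx
# Hypothetical NGINX stream block: terminate TLS on 8883 and forward
# plain MQTT to the broker's internal 1883 port.
stream {
    upstream mqtt_broker {
        server 10.0.0.10:1883;   # internal broker address (example)
    }

    server {
        listen 8883 ssl;
        ssl_certificate     /etc/nginx/tls/server.crt;
        ssl_certificate_key /etc/nginx/tls/server.key;
        proxy_pass mqtt_broker;
    }
}
```

The stream module (not the http one) is the right tool here because MQTT is a plain TCP protocol, not HTTP.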

Test settings:

  • JMeter running on GCE (e2-standard-4, 4 vCPU / 16 GB)
  • EMQX deployed on GKE with no resource constraints (3 replicas) to avoid throttling issues
  • Prometheus + Grafana for graphs

EMQX exposed in two ways:

  • External load balancer (8883) — [DIRECT]
  • NGINX ingress controller (8883 with redirect to 1883) — [NGINX]

I’ve performed these tests with two different setups:

  • Connect Only [C.O.]: 1000 users sending CONNECT messages simultaneously and DISCONNECT after 800 ms (repeated 100 times)
  • Connect and Publish [C.P.]: same as C.O., with some PUBLISH messages (10 per client) before DISCONNECT.

This configuration produces an average input rate of ~3 client connects/s (200 per 60 s in the graph) in [C.O.] and an average publish rate of 33 msg/s in [C.P.].

Compared with the results below, these values show that the problem exists even at low rates.

In the following pictures we can observe two humps:

  • 1st hump: TLS termination performed inside the BROKER — [DIRECT]
  • 2nd hump: TLS termination performed inside NGINX — [NGINX]

N.B.
1. In the following pictures the two humps have the same height. This verifies that the input traffic is constant between the two simulations.

2. In the CPU graph there are several lines. The orange line represents the CPU sum across all broker replicas, since the application is deployed with three replicas.

[C.O.] Connection rate over time [DIRECT] — [NGINX]
[C.O.] Cpu usage over time [DIRECT] — [NGINX]

In the above Connect Only scenario, we have an avg CPU use of:

  • ~ 3.5 CPU [DIRECT]
  • ~ 1.5 CPU [NGINX]

which means 57% less consumption from the broker's point of view.

[C.P] Client and messages over time [DIRECT] — [NGINX]
[C.P] Cpu usage over time [DIRECT] — [NGINX]

In this Connect and Publish scenario, we have an avg CPU use of:

  • ~ 4 CPU [DIRECT]
  • ~ 2.5 CPU [NGINX]

which means roughly 40% less consumption on the broker side.
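The percentages follow directly from the average CPU figures above; a quick sanity check (the exact Connect and Publish saving is 37.5%, which the article rounds to 40%):

```python
def cpu_saving(direct: float, nginx: float) -> float:
    """Relative CPU reduction on the broker when TLS termination moves out."""
    return (direct - nginx) / direct * 100

# Connect Only: ~3.5 CPU [DIRECT] vs ~1.5 CPU [NGINX]
print(round(cpu_saving(3.5, 1.5)))  # 57
# Connect and Publish: ~4 CPU [DIRECT] vs ~2.5 CPU [NGINX]
print(round(cpu_saving(4.0, 2.5)))  # 38
```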

CONCLUSIONS:

Jumping to the conclusions: in the former case there is a drop of around 57%, in the latter around 40%. [Yeah Mr. White! Yeah Science!]

CPU decrease between the two setups

Given that TLS is a prerequisite for everyone, you have to deal with its CPU and bandwidth overhead (negligible compared with the security it delivers).

A way to sweeten the pill is to keep TLS termination away from the broker, especially when clients have unstable connections, or when there is a single point of failure between the client and the broker for the whole system. In the latter case, start reviewing your infrastructure soon!

EMQX is one of the strongest open-source brokers out there; I'll soon try the same test against other brokers.

Are you interested in the NGINX/HAProxy configuration? Leave a comment for the next episode!

DevOps Engineer at CARFAX EU | Working with Python, C++, Node.js, Kubernetes, Terraform, Docker and more