Heartbeats and failovers

Most critical network components nowadays have some sort of failover mechanism for redundancy. If all your network traffic passes through a proxy or firewall and that node fails, ideally the traffic is handed over seamlessly to another (similar) unit to avoid any network interruption. There are usually two implementations of redundancy: active-active and active-passive. For those of you used to Cisco technology, Gateway Load Balancing Protocol (GLBP) is an example of an active-active implementation: if you have two gateway routers, both actively pass network traffic at the same time. An example of active-passive is the Virtual Router Redundancy Protocol (VRRP). In this case only one gateway router is active and passing network traffic; the other is on standby and only comes online if the active one fails.

Most implementations use active-passive failover, basically because it’s simpler to implement. Active-passive failover depends on a “heartbeat”: a special packet that the active node sends to the backup node just to let it know that it’s still alive and kicking. If the backup node does not receive this heartbeat within a specified period of time, it assumes the primary active node has “died” and takes over.

The heartbeat interval is the amount of time between heartbeat packets that the primary sends. The heartbeat failure threshold is the number of heartbeat packets that can be “missed” before failing over to the backup.

So for example, if the heartbeat interval is one second, and the failure threshold is 5 heartbeats, then the backup will only take over after 5 seconds, since 5 consecutive heartbeats would not have made it through to the backup unit. The failure threshold is included to avoid “flapping” and transient issues.
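The interval and threshold logic above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name HeartbeatMonitor and its methods are made up for this example, and time is passed in explicitly (in seconds) so the behaviour is easy to follow.

```python
class HeartbeatMonitor:
    """Backup-side failure detector (hypothetical sketch, explicit clock in seconds)."""

    def __init__(self, interval, threshold, start=0.0):
        self.interval = interval      # seconds between expected heartbeats
        self.threshold = threshold    # consecutive misses tolerated before failover
        self.last_seen = start        # time the last heartbeat arrived

    def heartbeat(self, now):
        """Record a heartbeat received from the primary at time `now`."""
        self.last_seen = now

    def missed(self, now):
        """Whole heartbeat intervals that have elapsed since the last heartbeat."""
        return int((now - self.last_seen) // self.interval)

    def should_take_over(self, now):
        """True once the number of missed heartbeats reaches the threshold."""
        return self.missed(now) >= self.threshold


monitor = HeartbeatMonitor(interval=1.0, threshold=5)
monitor.heartbeat(now=10.0)                  # last heartbeat seen from the primary
print(monitor.should_take_over(now=14.9))    # False: only 4 heartbeats missed
print(monitor.should_take_over(now=15.0))    # True: 5 heartbeats missed
```

A single late or dropped heartbeat does not trigger a failover here, which is exactly the protection against flapping that the threshold is meant to give.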

However, some network admins take the above idea too far. Most admins want the backup unit to come up as soon as possible, so they set a very low heartbeat interval (such as one second) and reduce the failure threshold to something like 2, so the backup unit comes online after 2 seconds. By comparison, on a BlueCoat Proxy SG and most other vendors, the heartbeat interval could be something like 40 seconds, with a failure threshold of 3 (so the backup comes online only after about 2 minutes). It could also be some other combination that adds up to 2 minutes (such as a heartbeat interval of 60 seconds and a failure threshold of 2).

Network admins who drastically reduce the heartbeat values may end up with a backup unit that comes online quicker, but they run into other issues that you should be aware of. Two of the less well known but quite important points are:

  1. This may create an effect similar to a “broadcast storm”. Even on very high performing networks, with enough units and a low enough heartbeat interval, you could bring parts of your network to its knees simply due to the sheer volume of generated traffic.
  2. The other point which not a lot of admins keep in mind is MAC address anti-spoofing.
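To get a feel for point 1, here is a rough back-of-the-envelope calculation in Python. Everything in it is an assumption made for illustration: the function names, the 100-byte packet size, the 200 failover pairs, and the model of one heartbeat per pair per interval. Real protocols differ, so treat the numbers as a way of seeing how the load scales, not as measurements.

```python
def heartbeat_packets_per_second(pairs, interval_s):
    """Packets per second if each failover pair sends one heartbeat per interval."""
    return pairs / interval_s


def heartbeat_bits_per_second(pairs, interval_s, packet_bytes):
    """Bandwidth used by heartbeats alone, in bits per second."""
    return heartbeat_packets_per_second(pairs, interval_s) * packet_bytes * 8


# Hypothetical network: 200 failover pairs, 100-byte heartbeat packets.
for interval in (40, 1):
    pps = heartbeat_packets_per_second(pairs=200, interval_s=interval)
    bps = heartbeat_bits_per_second(pairs=200, interval_s=interval, packet_bytes=100)
    print(f"interval {interval:>2}s: {pps:g} packets/s, {bps:g} bits/s")
```

The load grows linearly with the number of units and inversely with the interval, so cutting the interval from 40 seconds to 1 second multiplies the heartbeat traffic by 40.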

To elaborate on point 2: as the Microsoft article below shows, most network nodes keep an IP to MAC address binding in cache for 2 minutes before refreshing it.

http://technet.microsoft.com/en-us/library/cc758357(WS.10).aspx

 

Now, a lot of very damaging and effective attacks are man-in-the-middle attacks, which are usually perpetrated using ARP spoofing. That is to say, an attacker poisons a host’s ARP cache to trick it into sending traffic to the wrong host. To prevent this, a lot of security appliances, such as those from Cisco and SonicWALL, integrate MAC anti-spoofing into their systems. They basically “learn” an IP to MAC address binding, and if a different binding for the same IP comes through, they block it and mark it as an attack. So, if a new IP to MAC address binding is received within that 2 minute time-frame, it is ignored.
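The learn-and-block behaviour described above can be sketched as follows. This is a simplified model for illustration only, not how any particular appliance works: the BindingTable class, its observe method and the 120 second lifetime are assumptions, and the MAC addresses are invented.

```python
class BindingTable:
    """Toy IP-to-MAC anti-spoofing table (hypothetical sketch)."""

    def __init__(self, lifetime=120.0):
        self.lifetime = lifetime   # seconds a learned binding is trusted
        self.bindings = {}         # ip -> (mac, time the binding was last confirmed)

    def observe(self, ip, mac, now):
        """Return 'accepted' or 'blocked' for a binding seen at time `now`."""
        entry = self.bindings.get(ip)
        if entry is not None:
            known_mac, confirmed_at = entry
            still_fresh = (now - confirmed_at) < self.lifetime
            if still_fresh and mac != known_mac:
                return "blocked"   # conflicting binding inside the window
        # New IP, same MAC again, or the old binding has expired: (re)learn it.
        self.bindings[ip] = (mac, now)
        return "accepted"


table = BindingTable(lifetime=120.0)
print(table.observe("10.0.0.1", "aa:aa:aa:aa:aa:01", now=0))    # accepted
print(table.observe("10.0.0.1", "aa:aa:aa:aa:aa:02", now=2))    # blocked
print(table.observe("10.0.0.1", "aa:aa:aa:aa:aa:02", now=125))  # accepted
```

The second call is what a backup unit looks like when it takes over the gateway IP with its own MAC address two seconds after the primary dies: to the appliance it is indistinguishable from a spoofing attempt.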

You see where this ties in to our previous discussion. On some security-aware networks, having a unit fail over in under 2 minutes could cause the backup to be treated as an ARP spoofing attempt, which is why vendors’ default values are around the two minute mark.
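As a quick sanity check, you can compare a heartbeat configuration against the binding lifetime. The sketch below assumes the 2 minute (120 second) figure discussed above and uses made-up function names; check your own devices for the real timeout.

```python
ARP_BINDING_LIFETIME_S = 120  # assumed 2 minute binding lifetime


def failover_time(interval_s, threshold):
    """Seconds of silence before the backup takes over."""
    return interval_s * threshold


def risks_spoofing_block(interval_s, threshold, binding_lifetime_s=ARP_BINDING_LIFETIME_S):
    """True if the backup would come online while the old binding is still trusted."""
    return failover_time(interval_s, threshold) < binding_lifetime_s


# (interval in seconds, failure threshold) pairs taken from the examples in this post
for interval, threshold in [(1, 2), (40, 3), (60, 2)]:
    seconds = failover_time(interval, threshold)
    risky = risks_spoofing_block(interval, threshold)
    print(f"{interval}s x {threshold} = {seconds}s, risk of being blocked: {risky}")
```

Only the aggressive 1 second, threshold 2 configuration comes up inside the window; both 2 minute combinations wait until the old binding would have expired.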

That issue can be overcome with something like virtual MAC addresses, in which both the active and passive units are represented by the same MAC address (SonicWALL does this, and Cisco GLBP also works this way). Not all units support this, though, so keep an eye out for it.