Why computer networks have problems staying reliable

published May 06, 2015 02:55 by admin ( last modified May 06, 2015 03:04 )

A good discussion of why computer networks have problems staying reliable.

Some choice quotes:

The 90-second network partition caused fileservers using Pacemaker and DRBD for HA failover to declare each other dead, and to issue STONITH (Shoot The Other Node In The Head) messages to one another. The network partition delayed delivery of those messages, causing some fileserver pairs to believe they were both active. When the network recovered, both nodes shot each other at the same time. With both nodes dead, files belonging to the pair were unavailable.
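The sequence in the quote above can be modeled with a toy simulation. This is only a sketch of the reasoning, not of how Pacemaker or DRBD actually implement fencing: the node names, the `in_flight` queue, and the `simulate_partition` function are all hypothetical.

```python
def simulate_partition():
    """Toy model of two HA peers fencing each other across a partition."""
    alive = {"a": True, "b": True}
    in_flight = []  # fence requests delayed by the partition: (sender, target)

    # During the partition neither node hears from its peer, so each one
    # declares the other dead and issues a fence request for it.
    for sender, target in (("a", "b"), ("b", "a")):
        in_flight.append((sender, target))

    # The network recovers and every delayed request is delivered.
    # Each request was sent while its sender was still alive, so both
    # are honored and both nodes end up shot.
    for _sender, target in in_flight:
        alive[target] = False

    return alive


if __name__ == "__main__":
    print(simulate_partition())
```

The point of the model is that each fence request is valid on its own; the outage comes from both being delivered after the fact, with nothing to break the tie.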

From what we can gather informally, all the major managed hosting providers experience regular network failures. One company running 100-200 nodes on a major hosting provider reported that in a 90-day period the provider’s network went through five distinct periods of partitions.

Large-scale virtualized environments are notorious for transient latency, dropped packets, and full-blown network partitions, often affecting a particular software version or availability zone. Sometimes the failures occur between specific subsections of the provider’s datacenter, revealing planes of cleavage in the underlying hardware topology.

Marc Donges and Michael Chan bring us a thrilling report of the popular Broadcom BCM5709 chipset abruptly dropping inbound but not outbound packets to a machine. Because the NIC dropped inbound packets, the node was unable to service requests. However, because it could still send heartbeats to its hot spare via keepalived, the spare considered the primary alive and refused to take over. The service was unavailable for five hours and did not recover without a reboot.
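Why the spare stayed passive can be shown with a small model. This is a sketch under stated assumptions, not keepalived's actual logic: the `Node` class and the functions below are hypothetical, and the round-trip probe is one possible alternative check that the quoted report does not describe.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of the primary's NIC state."""
    can_send: bool = True      # outbound packets leave the machine
    can_receive: bool = True   # inbound packets reach the machine


def heartbeat_seen(primary: Node) -> bool:
    # One-way heartbeat: only the primary's outbound path is exercised.
    return primary.can_send


def probe_answered(primary: Node) -> bool:
    # Round-trip probe: the request must get in and the reply must get out.
    return primary.can_receive and primary.can_send


def spare_should_take_over(primary: Node, use_round_trip: bool) -> bool:
    if use_round_trip:
        alive = probe_answered(primary)
    else:
        alive = heartbeat_seen(primary)
    return not alive


if __name__ == "__main__":
    # The failure from the report: inbound dropped, outbound still working.
    faulty = Node(can_send=True, can_receive=False)
    print(spare_should_take_over(faulty, use_round_trip=False))
    print(spare_should_take_over(faulty, use_round_trip=True))
```

In this model the one-way heartbeat keeps reporting the primary as alive even though it cannot receive requests, while a check that exercises both directions would notice the fault.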