Too Little, Too Noisy

02/10/2025 << back to Debugging Myself

All of us have had a colleague like this; I’m sure you can picture them: they arrive every morning with a paper cup of coffee, loudly complaining about how bad the machine is while the rest continue working diligently, their eyes fixed on their screens. At any opportunity, they draw attention to how bad the coffee tastes, how hot or cold it is, or how long it takes. When there's any kind of survey about improving the work environment, they propose changing the coffee machine—perhaps one with more variety of flavors.

You can swap the coffee for any insubstantial problem that the majority is indifferent to, but which someone complains about disproportionately.

One day, when the boss arrives and announces that the company has listened to the needs of its employees and will take steps to address their concerns, the disappointment is great when the big news is revealed to be a Nespresso machine in the lobby—which everyone can use by bringing their own capsules from home (and taking care of periodically refilling the 400ml water tank).

Where the old vending machine would bail you out for €0.80, you now have to come prepared with capsules and a cup, and refill the water yourself... The worst part is that no one's real problems have been solved, yet everyone is expected to be grateful. Essentially, management has spent resources on a cosmetic solution that validates the noise but actually harms the silent majority.

Something similar happens in systems when a component or service of very little importance generates a huge number of errors. These errors have no critical business impact, but volume gets confused with severity, which ends up masking other problems that might be truly important. In software engineering, this is what we call alert fatigue (or noise).
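To make the "volume is confused with severity" point concrete, here is a minimal sketch. The component names, the alert counts, and the 0-10 impact scale are all invented for illustration; the only point is that sorting by raw count and sorting by impact can put different components at the top of the list.

```python
from collections import Counter

# Hypothetical sample data: (component, business_impact) pairs.
# The 0-10 impact scale is an assumption made for this sketch.
alerts = [("report-service", 1)] * 500 + [("checkout", 9)] * 3


def rank_by_volume(alerts):
    """Order components by how many alerts they raised (the noisy view)."""
    counts = Counter(component for component, _ in alerts)
    return [component for component, _ in counts.most_common()]


def rank_by_impact(alerts):
    """Order components by the worst business impact seen for each one."""
    worst = {}
    for component, impact in alerts:
        worst[component] = max(worst.get(component, 0), impact)
    return sorted(worst, key=worst.get, reverse=True)


print(rank_by_volume(alerts))
print(rank_by_impact(alerts))
```

With this sample, the unimportant reporting service dominates the volume ranking, while the three checkout alerts only rise to the top once impact is taken into account.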

Carlo Maria Cipolla wrote a satirical essay defining the Basic Laws of Human Stupidity. He described four types of people, based on whether their actions, on average, benefited or harmed others and themselves. The worst type were the stupid, who typically harmed both others and themselves. We can have "stupid" components in our systems in this same way:

They are noisy: An endpoint with constant bot traffic, or a microservice that generates internal reports, can be poorly written or configured and produce a permanent flow of errors (5xx) and/or logs.
They harm others (the team): They generate alerts, notifications, and on-call pages. The team is forced to investigate and ends up silencing alerts, wasting time, energy, and trust in the monitoring system.
They harm themselves: They end up being silenced due to their poor reputation; like the boy who cried wolf, if they ever have a real problem, it will go unnoticed.

The damage caused by these components should not be underestimated. The problem, we can all agree, is that they shouldn't exist in the first place, but reality is stubborn. They always exist due to a rushed or incomplete requirement, an implementation by a junior or an unsupervised external company, or a piece of legacy code that was never migrated and that nobody understands. The causes are diverse, but the reason they exist is always the same: there is no time to fix them because they are not a priority. Thus, the price the team pays is:

Alert fatigue: The tolerance threshold keeps rising; until the whole building is on fire, no one seems concerned.
Signal masking: Noisy components flood dashboards and inboxes. Critical alerts can be buried under the avalanche of noise.
Demoralization: The team spends so much time extinguishing fires with no real impact that they end up frustrated and feel like they aren't working on important things.

The solution requires a dual approach:

1. Symptom Management (Short-Term Noise)

Improve how alerts are received: distinguish between those that have a real impact and those that do not, adjust the triggering thresholds, and clean up the alerts that don't require immediate action from an engineer. This includes implementing hierarchical filtering based on the importance and impact of the alerts.
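As a rough sketch of what such hierarchical filtering could look like, the snippet below routes each alert to one of three tiers. Everything here is an assumption for illustration: the `Alert` fields, the three routes, and both threshold values would need to come from your own monitoring setup and business priorities.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"          # wake someone up now
    TICKET = "ticket"      # look at it during working hours
    LOG_ONLY = "log_only"  # record it, notify nobody


@dataclass(frozen=True)
class Alert:
    component: str
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    business_critical: bool  # does this component affect customers or revenue?


# Hypothetical thresholds; tune them per component in a real system.
CRITICAL_PAGE_THRESHOLD = 0.05
NONCRITICAL_TICKET_THRESHOLD = 0.50


def route(alert: Alert) -> Route:
    """Decide who gets interrupted, based on impact first and volume second."""
    if alert.business_critical:
        if alert.error_rate >= CRITICAL_PAGE_THRESHOLD:
            return Route.PAGE
        return Route.TICKET
    # Non-critical components never page, however loud they are.
    if alert.error_rate >= NONCRITICAL_TICKET_THRESHOLD:
        return Route.TICKET
    return Route.LOG_ONLY


print(route(Alert("checkout", 0.10, True)))
print(route(Alert("report-service", 0.90, False)))
print(route(Alert("report-service", 0.10, False)))
```

The design choice worth noting is that importance is checked before the error rate: a noisy but unimportant component can at most open a ticket, so it cannot bury a page from a critical one.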

2. Cause Elimination (Long-Term Technical Debt)

Most importantly: dedicate explicit time to resolving technical debt, fixing or rewriting the unimportant components so that they cease to be noisy.
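One way to decide where that explicit time should go is to measure which components produce the most alerts that nobody acted on. The sketch below assumes you can export an alert history as (component, was_actionable) pairs; that format, the component names, and the sample numbers are hypothetical.

```python
from collections import defaultdict


def noise_report(alert_history):
    """Return (component, non_actionable_count, total_count), noisiest first."""
    totals = defaultdict(lambda: [0, 0])  # component -> [total, actionable]
    for component, was_actionable in alert_history:
        totals[component][0] += 1
        totals[component][1] += int(was_actionable)
    report = [
        (component, total - actionable, total)
        for component, (total, actionable) in totals.items()
    ]
    # Sort by the number of alerts that required no action, descending.
    return sorted(report, key=lambda row: row[1], reverse=True)


# Hypothetical history for illustration only.
history = (
    [("report-service", False)] * 40
    + [("bot-endpoint", False)] * 10
    + [("bot-endpoint", True)]
    + [("checkout", True)] * 2
)

for component, noise, total in noise_report(history):
    print(f"{component}: {noise} of {total} alerts needed no action")
```

Components at the top of this list are the candidates for fixing or rewriting; components at the bottom are the ones whose alerts still deserve trust.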

And returning to the initial point about noisy colleagues: how can we effectively manage this "human noise"? Far from being a joke, people who make noise about trivialities are often subconsciously trying to draw attention to other issues that they either don't know how, or don't dare, to express (lack of recognition, tooling issues, workload, etc.). Good human resource management must be able to detect and decipher the signal hidden behind the noise and try to solve employees' real problems.

exit(0);
