In physics, the observer effect means that measuring a system changes it. At DNS scale, the risk is real: if you are careless, observability can eat scheduler time, memory, and attention that belong to answering queries, and can even kill the patient outright. This talk is about the recent observability efforts around our authoritative nameserver, and how we use telemetry, OpenMetrics, system tracing, and probabilistic structures to see what we need at millions of packets per second while keeping observability from consuming a meaningful slice of server performance. We also treat cardinality and production tracing as problems you design for up front, not surprises you discover when the node is already unhappy.

At the core of our global DNS infrastructure lies erldns, a resilient Erlang authoritative nameserver. We recently rewrote its parent application in Elixir. Modernizing the codebase was one piece of work; another was deciding how much visibility we wanted and at what cost: data is gold, but mining it cannot cost more than the service it is meant to protect.

At millions of packets per second, naive patterns fail in predictable ways. High-cardinality labels are a well-known way to blow up memory; tracing on a busy node can do real damage if you use it the way you would on a dev laptop. There is an old, dark joke that always comes to my mind: "The autopsy concluded that the cause of death was the autopsy." We treat observability as powerful and dangerous precisely so our investigations never look like that.

This session walks through that operational picture: how we aggregate, sample, and expose metrics to OpenMetrics so reporting stays cheap at throughput; how we use probabilistic data structures (Count-Min Sketch, Quantile Sketch, etc.) to estimate DDoS-ish signals; and how we approach cardinality and tracing as first-class design concerns, not afterthoughts.
We want to cover lessons on:

- **Metrics that stay cheap:** Aggregation, sampling, and OpenMetrics
- **Cardinality by design:** Why label dimensions explode, and where probabilistic structures fit
- **Tracing without self-inflicted outages:** What makes tracing risky on loaded nodes
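To give a flavour of the probabilistic structures mentioned above, here is a minimal Count-Min Sketch in Python. It is an illustrative sketch only, not the erldns implementation: the class name, the `width` and `depth` defaults, the salted BLAKE2 hashing, and the idea of keying on a source IP are all assumptions made for this example. The point it shows is that per-key counts can be estimated in fixed memory, however many distinct keys arrive.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in fixed memory.

    Estimates can be too high (hash collisions) but never too low.
    Hypothetical example class, not taken from erldns.
    """

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        # depth rows of width counters: memory is fixed up front.
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key: str):
        # One hash per row, derived by salting BLAKE2b with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for row, col in self._indexes(key):
            self.rows[row][col] += count

    def estimate(self, key: str) -> int:
        # The smallest counter is the one least inflated by collisions.
        return min(self.rows[row][col] for row, col in self._indexes(key))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for _ in range(1000):
        sketch.add("198.51.100.7")  # a noisy client (documentation-range IP)
    sketch.add("203.0.113.9")       # a quiet client
    # The estimate is an upper bound on the true count.
    print(sketch.estimate("198.51.100.7") >= 1000)
```

The trade-off is the one the list above hints at: an exact per-client counter grows with the number of distinct clients, which is precisely the cardinality problem, whereas the sketch accepts a bounded overestimate in exchange for constant memory.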
Nelson is a child of a multicultural journey. Born in Venezuela to a family of engineers and economists, he grew up in Spain, where he studied pure maths at university, and then moved to Poland, where he became a self-taught programmer. After a few years as a C developer in the security and telecommunications domains, he's now a BEAM evangelist with an emphasis on performance and security. In his free time he's a sports addict, practising yoga and callisthenics, and also a history fanboy, devouring books every night.