13 Network Health Checks That Should Be Default — But Aren't

Every network has a health dashboard. Most of them check three things: ping responds, syslog isn't on fire, BGP sessions are up. That tells you nothing about the network you're actually running.

Below is the list I've been refining for about seven years across half a dozen networks I help operate. Thirteen checks. Every one of them has fired a real alert and stopped a real problem. Every one of them also has a false-positive mode that operations teams have to engineer around. Both are noted for each check.

One — ROA validity for your own prefixes

Checks: Your published ROAs match what your edge routers are actually announcing.
Why it matters: A ROA-prefix mismatch makes your prefix invalid from the perspective of any ROV-enforcing peer. Many large networks now drop invalids.
False positive risk: Low. The mismatch is real or it isn't.
Fix time: Hours, sometimes minutes. Update the ROA via the RIR portal.
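
To make the comparison concrete, here is a minimal sketch of origin validation in the spirit of RFC 6811, using only the standard library. The `Roa` class and `rov_state` function are names I made up for illustration; in practice the ROA list would come from your validator's export and the announcements from your edge routers.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Roa:
    prefix: str       # ROA prefix, e.g. "192.0.2.0/24"
    max_length: int   # longest prefix length the ROA authorizes
    asn: int          # authorized origin AS


def rov_state(announced_prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    """Classify one announcement as 'valid', 'invalid', or 'not-found'."""
    net = ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # A ROA only applies if it covers the announced prefix.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and net.prefixlen <= roa.max_length:
            return "valid"
    # Covered by some ROA but matched by none means invalid.
    return "invalid" if covered else "not-found"
```

Run it over every prefix your edge announces and alert on anything that is not "valid". A more-specific announced for traffic engineering that exceeds the ROA's max length is the classic way this fires.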

Two — IRR consistency

Checks: Your announced prefixes match the route objects in IRR databases your transits filter against.
Why it matters: Transit operators still filter on IRR. Mismatch means dropped prefixes.
False positive risk: Medium. Old IRR objects linger for years. Stale records trigger noise.
Fix time: Hours, but the cleanup of stale objects is the real work.
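
The core of this check is a two-way set difference. A rough sketch, assuming you have already pulled your announcements and your route objects into (prefix, origin ASN) pairs; the function names and the input shape are my own, not any IRR tool's API:

```python
from ipaddress import ip_network


def normalize(routes):
    """Canonicalize (prefix, origin_asn) pairs so formatting differences do not matter."""
    return {(str(ip_network(prefix)), int(asn)) for prefix, asn in routes}


def irr_diff(announced, route_objects):
    """Compare live announcements against registered route objects."""
    live = normalize(announced)
    registered = normalize(route_objects)
    return {
        # Announced but not registered: transits that filter on IRR may drop these.
        "missing_in_irr": sorted(live - registered),
        # Registered but not announced: the stale objects that generate noise.
        "stale_in_irr": sorted(registered - live),
    }
```

Treating the two lists differently helps with the noise problem: page on "missing_in_irr", and send "stale_in_irr" to a cleanup queue.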

Three — AS-path stability

Checks: AS-path lengths and AS-path content from external vantage points (RIPE Atlas or similar).
Why it matters: Sudden AS-path changes can indicate route leaks, hijacks, or upstream issues you don't see from inside.
False positive risk: Medium. Normal BGP churn looks like AS-path changes.
Fix time: Variable. Some are real incidents in minutes, some are noise from another AS's flap.
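
One way to keep normal churn from paging anyone is to compare each vantage point against its own baseline and only alert when enough of them agree. This is a sketch under assumptions: the input is a dict from vantage-point name to AS path (origin last), and the thresholds `max_len_delta` and `min_fraction` are placeholder values, not recommendations.

```python
def path_changes(baseline, current, max_len_delta=2):
    """Return {vantage_point: finding} for paths that moved away from baseline."""
    findings = {}
    for vp, old in baseline.items():
        new = current.get(vp)
        if not new:
            findings[vp] = "prefix not visible"
        elif new[-1] != old[-1]:
            # The last AS in the path is the origin; a change here is the serious case.
            findings[vp] = "origin changed"
        elif abs(len(new) - len(old)) > max_len_delta:
            findings[vp] = "path length shifted"
    return findings


def should_alert(findings, total_vantage_points, min_fraction=0.3):
    """Alert only when a meaningful share of vantage points see a change."""
    if total_vantage_points == 0:
        return False
    return len(findings) / total_vantage_points >= min_fraction
```

A single vantage point seeing a longer path is usually somebody else's flap. A third of them seeing a new origin is an incident.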

Four — Route-leak detection

Checks: Your prefixes propagating in ways that suggest a customer or peer re-announced them upstream.
Why it matters: Leaks redirect your traffic.
False positive risk: Medium. Anomalous propagation can have benign causes.
Fix time: Slow. You're chasing somebody else's misconfiguration. Often hours of coordination.
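
A simple heuristic for this, and it is only a heuristic: look at observed AS paths ending at your ASN and flag any where a neighbor that should not provide transit for you sits directly between you and a network that looks like an upstream. The two sets, `non_transit_neighbors` and `upstream_like`, are inputs you would have to maintain yourself; nothing here infers business relationships.

```python
def suspected_leaks(paths, my_asn, non_transit_neighbors, upstream_like):
    """Return (upstream, leaker) pairs from paths that look like a leak of our routes."""
    hits = []
    for path in paths:
        # Collapse prepending so repeated ASNs do not hide the adjacency.
        dedup = [asn for i, asn in enumerate(path) if i == 0 or asn != path[i - 1]]
        if len(dedup) < 3 or dedup[-1] != my_asn:
            continue
        neighbor, beyond = dedup[-2], dedup[-3]
        # A peer or customer handing our prefix to an upstream-like AS is suspicious.
        if neighbor in non_transit_neighbors and beyond in upstream_like:
            hits.append((beyond, neighbor))
    return hits
```

The benign causes mentioned above show up here as well, for example a peer that legitimately sells transit to the network on the far side. That is why the output is a list of suspects to review rather than an alert on its own.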

Five — RPKI cache freshness

Checks: The age of the RPKI data your routers are using to validate.
Why it matters: Stale cache means stale validation. If your relying-party (RP) software can't reach the RPKI repositories, validation degrades silently.
False positive risk: Low. The cache age is a hard number.
Fix time: Minutes. Investigate why the RP isn't refreshing.
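
Because the age is a hard number, the check itself is a threshold comparison. A minimal sketch; the one-hour and six-hour thresholds are assumptions for illustration, and how you obtain `last_update` depends on your RP software:

```python
from datetime import datetime, timedelta, timezone


def cache_status(last_update, now,
                 warn_after=timedelta(hours=1),
                 crit_after=timedelta(hours=6)):
    """Map the age of the last successful RP refresh to a status string."""
    age = now - last_update
    if age >= crit_after:
        return "critical"
    if age >= warn_after:
        return "warning"
    return "ok"


# Example: a cache last refreshed two hours ago.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
status = cache_status(now - timedelta(hours=2), now)
```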

Six — Abuse contact reachability

Checks: The email and phone listed in your WHOIS/RDAP actually work.
Why it matters: When somebody else's network thinks you're attacking them, this is how they reach you. A dead contact means they escalate to your upstream instead.
False positive risk: Low. Send a test message and look for delivery confirmation.
Fix time: Minutes. Update the RIR record.

Seven — IXP presence verification

Checks: The IXPs where you claim to peer in PeeringDB match where you actually peer.
Why it matters: Other networks make peering decisions based on PeeringDB. Stale entries lose you peering opportunities.
False positive risk: Very low.
Fix time: Minutes via PeeringDB self-service.

Eight — BGP session flap rate

Checks: Number of BGP session state changes per session per hour.
Why it matters: Flapping sessions cascade into route instability for downstream traffic.
False positive risk: Low. Flap counts don't lie.
Fix time: Variable. Could be a flaky physical link or an upstream issue.
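
Counting flaps per session per hour is a sliding-window problem. A small sketch, assuming you feed it a timestamp for every session state change in chronological order (from syslog, SNMP traps, or streaming telemetry); the class name is hypothetical:

```python
from collections import defaultdict, deque


class FlapCounter:
    """Count BGP session state changes inside a sliding time window."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window = window_seconds
        self.events = defaultdict(deque)

    def record(self, session: str, ts: float) -> None:
        """Record one state change for a session at Unix time ts."""
        self.events[session].append(ts)

    def count(self, session: str, now: float) -> int:
        """Return how many state changes fall within the window ending at now."""
        q = self.events[session]
        # Discard events that have aged out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)
```

Alert when `count` crosses a per-session threshold rather than on every state change.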

Nine — Route-views diff

Checks: What RouteViews/RIPE RIS say about your prefix, compared to what you intend to announce.
Why it matters: Aggregated external view catches things your edge routers can't see.
False positive risk: Low. The aggregate is the aggregate.
Fix time: Variable.
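
The diff is more useful when it says what kind of difference it found. A sketch that sorts collector observations into three buckets, assuming both inputs are (prefix, origin ASN) pairs already extracted from a RouteViews or RIS dump; the function and bucket names are mine:

```python
from ipaddress import ip_network


def classify_observed(intended, observed):
    """Compare what collectors see with what we mean to announce."""
    want = {(ip_network(p), asn) for p, asn in intended}
    seen = {(ip_network(p), asn) for p, asn in observed}
    report = {"missing": [], "unexpected_origin": [], "more_specific": []}

    for net, asn in want - seen:
        report["missing"].append((str(net), asn))

    for net, asn in seen - want:
        if any(net == w for w, _ in want):
            # Our exact prefix, but originated by a different AS.
            report["unexpected_origin"].append((str(net), asn))
        elif any(net.version == w.version and net.subnet_of(w) for w, _ in want):
            # A more-specific of our space that we did not intend to announce.
            report["more_specific"].append((str(net), asn))
        # Anything else is unrelated address space and is ignored.

    return {kind: sorted(items) for kind, items in report.items()}
```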

Ten — Geofeed publish status

Checks: Your geofeed is reachable, parseable, and matches the IP allocations you actually use.
Why it matters: Hyperscalers and CDNs use geofeed for geolocation. Stale geofeed means your customers in Berlin get a CDN edge in Sydney.
False positive risk: Low.
Fix time: Minutes once the file is right.
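
The "parseable and matches your allocations" half of this check can be sketched in a few lines. This assumes the RFC 8805 CSV layout (prefix, country, region, city, postal code, with `#` comment lines) and that you fetch the file yourself and pass its text in; reachability is a separate HTTP check not shown here.

```python
import csv
from ipaddress import ip_network


def check_geofeed(text, allocations):
    """Return a list of (line_number, problem) for a geofeed file's contents."""
    allocs = [ip_network(a) for a in allocations]
    problems = []
    for lineno, row in enumerate(csv.reader(text.splitlines()), start=1):
        if not row or row[0].lstrip().startswith("#"):
            continue  # blank line or comment
        try:
            net = ip_network(row[0].strip())
        except ValueError:
            problems.append((lineno, "unparseable prefix"))
            continue
        if not any(net.version == a.version and net.subnet_of(a) for a in allocs):
            problems.append((lineno, "outside our allocations"))
        country = row[1].strip() if len(row) > 1 else ""
        if country and len(country) != 2:
            problems.append((lineno, "bad country code"))
    return problems
```

An empty list means the file parsed and every entry falls inside address space you hold. It does not tell you the locations are right; that still needs a human who knows where the customers are.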

Eleven — ASPA records (where supported)

Checks: Your ASPA records exist and match your actual transit relationships.
Why it matters: Future-proofing. As ASPA enforcement spreads, missing records become operational liabilities.
False positive risk: Low.
Fix time: Minutes to publish, longer to keep them current as relationships change.

Twelve — DDoS sink test

Checks: Periodic verification that your DDoS scrubbing path is actually engaged when triggered.
Why it matters: A sink that hasn't been tested in twelve months might not work when you need it at 03:00.
False positive risk: Low if you control the test.
Fix time: Hours, depends on scrubber vendor.

Thirteen — BCP38 verification

Checks: Your edge actually drops packets with source addresses you don't own.
Why it matters: BCP38 (RFC 2827) has been "best practice" since 2000. Most networks never verify that their implementation works.
False positive risk: Low — either the spoofed packet egresses or it doesn't.
Fix time: Variable depending on edge complexity.
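
Sending the spoofed probes needs raw sockets and a collector outside your network, which is beyond a short example. The evaluation step is easy to sketch, though. This assumes you already have probe results as (source address, arrived at collector) pairs; the function name and input shape are hypothetical.

```python
from ipaddress import ip_address, ip_network


def bcp38_failures(owned_prefixes, probes):
    """Return spoofed source addresses that made it out past the edge."""
    owned = [ip_network(p) for p in owned_prefixes]
    failures = []
    for src, arrived in probes:
        addr = ip_address(src)
        legit = any(addr in net for net in owned)
        # A packet with a source we do not own should never reach the collector.
        if arrived and not legit:
            failures.append(src)
    return failures
```

Any non-empty result means a spoofed packet left your network, which is the one outcome this check exists to catch.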

The thirteenth is the controversial one

BCP38 is in the list because every network operator claims to implement it and a measurable fraction don't actually verify their implementation works. Synthetic-source-IP testing reveals failure rates that don't match the "yes we do it" survey responses.

It's the thirteenth check because if you have to drop one from the list for resource reasons, it's the one that gets dropped first. It probably shouldn't.

The implementation pattern that works

These checks belong in a tool that runs them per-AS on a schedule (hourly for most, daily for a few), persists historical state, alerts only on transitions or thresholds (not on every snapshot), and integrates with whatever monitoring stack you already use.
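
The "alert on transitions, not snapshots" part is the piece most home-grown scripts skip. A minimal sketch of the idea, with an in-memory dict standing in for whatever persistent store you would really use; the class and the status strings are illustrative assumptions:

```python
from typing import Optional


class TransitionAlerter:
    """Remember the last status per (ASN, check) and report only changes."""

    def __init__(self):
        self._last = {}

    def observe(self, asn: int, check: str, status: str) -> Optional[str]:
        """Record a new result; return an alert message only if the status changed."""
        key = (asn, check)
        # A check we have never seen is assumed to have been "ok",
        # so a first-run failure still produces an alert.
        previous = self._last.get(key, "ok")
        self._last[key] = status
        if previous == status:
            return None
        return f"AS{asn} {check}: {previous} -> {status}"
```

A check that fails every hour for a day produces one message when it breaks and one when it recovers, instead of twenty-four.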

Most networks don't have a tool that does this. Most networks have a partial implementation across three different vendors and a fourth check buried in somebody's shell script. The consolidation work is unglamorous and worth doing.

The 47-percent-of-RPKI-coverage problem isn't because RPKI is hard. It's because the operational discipline of running checks like these isn't universal. Treat them as defaults, not extras, and the network you run gets quieter.

Quieter is the goal.