IoT · 4 min read

IoT telemetry pipelines that don't fall over

A practical architecture for device data that stays reliable under pressure.

There’s a moment in every IoT project where someone realizes the telemetry pipeline isn’t actually reliable. Usually it’s during a traffic spike, or when someone asks “why is this data missing?”

The answer is almost always the same: the pipeline was built for the happy path. It works great when devices send perfect data at predictable rates. It falls apart the moment reality gets messy.

The core problem

IoT telemetry has characteristics that break naive architectures:

  • Bursty traffic. Devices reconnect after outages and dump hours of buffered data at once.
  • Unreliable sources. Devices lie, send duplicates, have clock drift, and occasionally send garbage.
  • Scale asymmetry. You might have 10,000 devices sending data, but only 10 engineers to debug problems.

If you can’t replay events, quarantine bad data, and trace what happened, you don’t have a pipeline. You have a prayer.

The architecture that works

Ingestion layer: Stateless, horizontally scalable, validates payloads at the edge. Bad data gets rejected or quarantined here, not downstream. Use explicit schema versioning — devices in the field will be running old firmware for years.
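
To make that concrete, here's a minimal validation sketch in Python. The required fields, the supported version strings, and the idea of returning a reason so the payload can be quarantined are all illustrative, not a prescribed format:

    from typing import Any

    REQUIRED_FIELDS = {"device_id", "schema_version", "timestamp", "readings"}
    SUPPORTED_VERSIONS = {"1.1", "2.0", "2.3"}  # illustrative version strings

    def validate_payload(payload: dict[str, Any]) -> tuple[bool, str]:
        """Accept the payload, or return a reason so it can be quarantined rather than silently dropped."""
        missing = REQUIRED_FIELDS - payload.keys()
        if missing:
            return False, f"missing fields: {sorted(missing)}"
        if str(payload["schema_version"]) not in SUPPORTED_VERSIONS:
            return False, f"unsupported schema_version: {payload['schema_version']}"
        if not isinstance(payload["readings"], dict):
            return False, "readings must be an object"
        return True, "ok"

The rejection reason travels with the quarantined payload, so whoever debugs the old firmware isn't guessing.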

Durable queue: Between ingestion and processing. This is your shock absorber. When processing slows down, the queue buffers. When you need to replay events, they’re still there. Kafka, SQS, whatever — the point is durability and replay.
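
If Kafka is the queue, the producer side can be as small as this sketch (confluent-kafka client assumed; the broker address and topic name are placeholders):

    import json
    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "kafka:9092",   # placeholder broker
        "enable.idempotence": True,          # avoid broker-side duplicates on retry
        "acks": "all",                       # don't ack until the write is durable
    })

    def enqueue(payload: dict) -> None:
        # Key by device_id so one device's events stay ordered within a partition.
        producer.produce(
            "telemetry.raw",
            key=str(payload["device_id"]),
            value=json.dumps(payload).encode("utf-8"),
        )

    # Call producer.flush() on shutdown so buffered messages aren't lost.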

Processing layer: Enriches events with device metadata, normalizes to a stable schema, writes to storage. Must be idempotent — you will process events multiple times due to retries.
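
A processing step, sketched with an in-memory lookup standing in for whatever device registry you actually have (field names are illustrative):

    from datetime import datetime, timezone

    DEVICE_METADATA = {"dev-42": {"site": "plant-3", "model": "rev-b"}}  # stand-in registry

    def process(event: dict) -> dict:
        meta = DEVICE_METADATA.get(event["device_id"], {})
        return {
            "event_id": event["event_id"],   # stable ID, reused unchanged on retries
            "device_id": event["device_id"],
            "site": meta.get("site"),
            "model": meta.get("model"),
            # Normalize whatever the firmware sent into one canonical timestamp.
            "observed_at": datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).isoformat(),
            "readings": event["readings"],
        }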

Storage: Time-series store for metrics, relational store for device state. Keep these separate. Mixing operational queries with analytics queries creates contention that bites you at the worst time.

API layer: Stable endpoints for downstream consumers. Pagination, rate limits, clear errors. Never expose raw telemetry directly — always go through a defined interface.
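
As a sketch of what "a defined interface" means in practice, here's a paginated read endpoint (FastAPI assumed; the path, the cap, and the query_storage helper are illustrative):

    from fastapi import FastAPI, Query

    app = FastAPI()

    def query_storage(device_id: str, limit: int, cursor: str | None):
        # Stand-in for the real storage query; returns (rows, next_cursor).
        return [], None

    @app.get("/devices/{device_id}/telemetry")
    def get_telemetry(
        device_id: str,
        limit: int = Query(100, le=1000),   # hard cap so one caller can't dump the table
        cursor: str | None = None,          # opaque cursor rather than offset pagination
    ):
        rows, next_cursor = query_storage(device_id, limit, cursor)
        return {"items": rows, "next_cursor": next_cursor}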

Schema versioning isn’t optional

Your newest devices run firmware 2.3. Your oldest devices run firmware 1.1 from three years ago. They both send telemetry.

If you don’t version your schemas, you’re playing whack-a-mole with parsing errors forever. Version in the payload. Route by version. Transform old formats to current format in a single place.

When you change the schema, old versions keep working. That’s the whole point.
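
In code, "a single place" can be a small table of upgrade functions. The versions and field changes below are made up for illustration:

    CURRENT_VERSION = "2.3"

    def upgrade_1_1(p: dict) -> dict:
        # Firmware 1.1 sent a flat "temp" field; fold it into the readings object.
        p = dict(p)
        p["readings"] = {"temperature_c": p.pop("temp")}
        p["schema_version"] = "2.0"
        return p

    def upgrade_2_0(p: dict) -> dict:
        p = dict(p)
        p.setdefault("readings", {}).setdefault("battery_pct", None)  # field added in 2.3
        p["schema_version"] = "2.3"
        return p

    UPGRADES = {"1.1": upgrade_1_1, "2.0": upgrade_2_0}

    def to_current(payload: dict) -> dict:
        # Walk the upgrade chain until the payload is at the current version.
        while payload["schema_version"] != CURRENT_VERSION:
            payload = UPGRADES[payload["schema_version"]](payload)
        return payload

New firmware means adding one function to the table; nothing downstream has to change.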

Backpressure saves you

When processing can’t keep up, you have two choices: drop data silently, or apply backpressure.

Silent data loss is never acceptable. You won’t notice until someone asks why the graphs have gaps.

Backpressure means the queue grows, ingestion slows down, and maybe devices buffer locally. It’s visible. It’s debuggable. It recovers automatically when the bottleneck clears.
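
The mechanics can be as simple as a bounded buffer between ingestion and processing. A minimal sketch, assuming ingestion can return a "slow down" signal (an HTTP 429, say) to the device or load balancer:

    import queue

    buffer: queue.Queue = queue.Queue(maxsize=10_000)   # the shock absorber, bounded on purpose

    def ingest(event: dict) -> bool:
        try:
            buffer.put(event, timeout=2.0)   # blocks while the buffer is full
            return True
        except queue.Full:
            # Visible and debuggable: the caller is told to back off
            # instead of the event being accepted and then lost.
            return False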

Idempotency is non-negotiable

Devices retry. Networks duplicate. Processing restarts. You will see the same event multiple times.

If your pipeline creates duplicates in storage, your data is wrong. Use stable event IDs. Make writes idempotent. Test with duplicate sends — it’s the most common real-world failure mode.
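
One way to make writes idempotent is to let the stable event ID be the primary key, so a replay is a no-op. A sketch with SQLite standing in for the real store:

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, device_id TEXT, body TEXT)")

    def write_event(event: dict) -> None:
        conn.execute(
            "INSERT INTO events (event_id, device_id, body) VALUES (?, ?, ?) "
            "ON CONFLICT(event_id) DO NOTHING",
            (event["event_id"], event["device_id"], json.dumps(event)),
        )
        conn.commit()

    # Processing the same event twice leaves exactly one row.
    write_event({"event_id": "abc-1", "device_id": "dev-42", "temp": 21.5})
    write_event({"event_id": "abc-1", "device_id": "dev-42", "temp": 21.5})
    assert conn.execute("SELECT COUNT(*) FROM events").fetchone()[0] == 1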

Monitoring for pipelines

The standard stuff applies: throughput, error rates, latency. But for telemetry pipelines specifically:

  • Ingestion lag: How far behind is the newest processed event from the newest received event?
  • Device coverage: Are all devices sending? Which ones went silent?
  • Schema errors: Which devices are sending unparseable payloads?
  • Queue depth: Is the buffer growing? Draining?

If you can’t answer these questions from a dashboard, you’ll be answering them during an incident.
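
The first two are cheap to compute once you track timestamps. A sketch, with thresholds and alerting left to you:

    from datetime import datetime, timedelta, timezone

    def ingestion_lag(newest_received: datetime, newest_processed: datetime) -> timedelta:
        # How far processing trails ingestion; alert when this keeps growing.
        return newest_received - newest_processed

    def silent_devices(last_seen: dict[str, datetime], max_silence: timedelta) -> list[str]:
        # Devices that haven't reported within the expected window.
        now = datetime.now(timezone.utc)
        return [dev for dev, ts in last_seen.items() if now - ts > max_silence]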

What breaks pipelines

  • No schema versioning — every firmware update causes parsing failures
  • Tight coupling between processing and storage — can’t replay without re-storing
  • No idempotency — duplicates everywhere
  • No backpressure — silent data loss under load
  • Raw payloads with sensitive data logged without scrubbing

Security note

Telemetry often contains identifiers, locations, usage patterns. Mask or hash where possible. Encrypt in transit and at rest. Limit access to raw payloads. Provide sanitized views for most users.

The team debugging a parsing error doesn’t need to see customer device IDs. Build for least privilege.
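
A keyed hash gives that team a stable pseudonym to correlate on without exposing the raw identifier. A sketch; where the key lives (an environment variable here) is an assumption:

    import hashlib
    import hmac
    import os

    PSEUDONYM_KEY = os.environ.get("TELEMETRY_PSEUDONYM_KEY", "dev-only-key").encode()

    def pseudonymize(device_id: str) -> str:
        # Same input always maps to the same token, so logs stay correlatable.
        digest = hmac.new(PSEUDONYM_KEY, device_id.encode(), hashlib.sha256).hexdigest()
        return f"dev_{digest[:12]}"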


XIThing builds telemetry infrastructure for IoT products. Get in touch if your pipeline needs work.
