Observability

When your software runs inside customer clouds, the usual self-hosting playbook breaks. There are no shared dashboards, no central log storage, and no way to ask "is this customer healthy?" without a support ticket. Alien is built so you can answer that question without leaving your control plane.

Observability in Alien covers three signal families across every deployment:

Health and lifecycle — release version, platform, region, deployment status.
Application telemetry — logs, metrics, and traces produced by your code.
Runtime signals — container health, HTTP traffic, queue depth, and other operational metrics gathered by Alien's data plane.

All three are correlated by deployment, so when something looks off you can move from "all deployments" down to "this customer, this release" without piecing it together by hand.

Deployment health

The manager already knows about every deployment, the release each is running, the cloud and region it's pinned to, and when it last checked in. That state is exposed on the dashboard and through the manager API, so support and engineering teams can:

See which customer environments are healthy, stale, updating, or pinned to an older release.
Spot deployments that haven't picked up the latest release and narrow down the cause (push versus pull update mode, network reachability, or configuration drift).
Roll back one deployment, or all of them, by pinning to an earlier release (see Releases).
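
As a rough illustration of how the states above could be derived from what the manager tracks, here is a minimal Python sketch. The record fields, the 15-minute staleness window, and the bucket names are assumptions made for this example, not Alien's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DeploymentRecord:
    # Hypothetical fields; the real manager API may name these differently.
    deployment_id: str
    release: str
    status: str  # assumed values: "running" or "updating"
    last_check_in: datetime


def classify(record: DeploymentRecord, latest_release: str, now: datetime,
             stale_after: timedelta = timedelta(minutes=15)) -> str:
    """Bucket a deployment as stale, updating, outdated, or healthy."""
    if now - record.last_check_in > stale_after:
        return "stale"      # has not checked in recently
    if record.status == "updating":
        return "updating"   # rollout in progress
    if record.release != latest_release:
        return "outdated"   # still on an older release
    return "healthy"
```

Checking staleness first is a deliberate choice in this sketch: a deployment that has stopped checking in cannot be trusted to report an accurate status or release.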

There's no need to scrape customer cloud consoles or wait for screenshots to know what's running.

Application telemetry

Inside customer environments your code produces normal logs, metrics, and traces. Alien forwards those signals out of the customer cloud over OpenTelemetry-compatible pipelines.

You configure the destination once in the manager configuration, pointing it at your existing observability backend (Datadog, Grafana, Honeycomb, S3, and so on). Every active deployment then tags its telemetry automatically with:

Deployment ID, customer label, and cloud/region
Release version and stack name
Resource and runtime metadata (worker name, container name, replica)
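
To make the shape of that context concrete, the following Python sketch attaches a deployment context to a telemetry record. The attribute keys and both helper functions are hypothetical and chosen only for illustration; the keys Alien actually emits may differ.

```python
def deployment_context(deployment_id, customer, cloud, region, release, stack):
    """Build the per-deployment context. Key names are illustrative assumptions."""
    return {
        "deployment.id": deployment_id,
        "customer.label": customer,
        "cloud.provider": cloud,
        "cloud.region": region,
        "release.version": release,
        "stack.name": stack,
    }


def tag(record: dict, context: dict) -> dict:
    """Return a copy of the record with the context merged into its attributes."""
    merged = {**record.get("attributes", {}), **context}
    return {**record, "attributes": merged}
```

Because every record carries the same keys, a backend query can group or filter by deployment, customer, or release without any per-customer setup.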

See Self-Hosting → Configuration for the exporter settings.

Log body normalization

Alien-collected log bodies are normalized before they are stored or forwarded through Alien's OpenTelemetry log pipeline. ANSI terminal escape sequences are removed, embedded line breaks are neutralized, and other terminal control characters are dropped. Printable text, tabs, timestamps, severity fields, resource metadata, and OpenTelemetry attributes are preserved.
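
The effect of this normalization can be approximated in a few lines of Python. This is a sketch of the behavior described above, not Alien's implementation: it handles only CSI-style escape sequences (colors, cursor movement), and it assumes line breaks are neutralized by replacing them with the literal two-character text `\n`, which the documentation does not specify.

```python
import re

# CSI escape sequences such as "\x1b[31m" (red) or "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
# Any style of embedded line break.
LINE_BREAKS = re.compile(r"\r\n|\r|\n")
# Remaining C0 control characters and DEL, excluding tab (\x09).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def normalize_log_body(body: str) -> str:
    """Strip ANSI sequences, neutralize line breaks, drop control characters."""
    body = ANSI_CSI.sub("", body)
    # Assumption: a line break becomes the visible text "\n" (backslash + n).
    body = LINE_BREAKS.sub(lambda _: "\\n", body)
    return CONTROL_CHARS.sub("", body)
```

For example, `normalize_log_body("\x1b[31mERROR\x1b[0m\tdisk full\nretrying\x07")` yields `ERROR`, a tab, `disk full`, the literal text `\n`, and `retrying`, with the color codes and the bell character gone.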

This keeps logs searchable, copyable, and safe to display in the dashboard. Treat captured stdout and stderr as operational telemetry, not as a byte-for-byte terminal recording. If color or formatting carries meaning in your application, emit structured JSON fields or OpenTelemetry attributes instead of relying on terminal color codes.
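
One way to carry meaning in fields rather than colors is a JSON log formatter. The sketch below uses only Python's standard `logging` and `json` modules; the `fields` attribute name and the output keys are conventions invented for this example, not something Alien requires.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        # "fields" is a hypothetical attribute supplied through `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)

# Instead of printing the message in yellow, record the state explicitly.
logger.warning("payment retry", extra={"fields": {"state": "degraded"}})
```

The state that a color would have signaled now survives normalization as an ordinary, searchable field.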

Provider-native logs, such as CloudWatch or container runtime files viewed outside Alien, may still show the exact bytes written by the process.

Runtime signals

You don't have to instrument everything by hand. Alien's data plane already produces a small set of useful signals out of the box:

Container and replica health, restarts, and last error.
HTTP-level metrics (request counts, status classes, latency) for ingress-facing containers.
Queue depth and consumer lag for managed queues.
Storage and KV usage where the underlying cloud service exposes it.

These travel through the same telemetry pipeline as your application data and arrive tagged with the same deployment context.
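
For a sense of what the HTTP-level metrics summarize, the Python sketch below rolls raw request samples up into request counts, status classes, and a latency percentile. The input shape and the nearest-rank p95 are assumptions for illustration; how Alien's data plane actually aggregates these signals is not described here.

```python
import math
from collections import Counter


def status_class(code: int) -> str:
    """Map an HTTP status code to its class, e.g. 404 -> "4xx"."""
    return f"{code // 100}xx"


def summarize_http(samples):
    """Summarize (status_code, latency_ms) pairs into counts and p95 latency."""
    samples = list(samples)
    counts = Counter(status_class(code) for code, _ in samples)
    latencies = sorted(latency for _, latency in samples)
    # Nearest-rank percentile: the value at position ceil(0.95 * n), 1-indexed.
    p95 = latencies[math.ceil(0.95 * len(latencies)) - 1] if latencies else None
    return {
        "requests": len(samples),
        "by_class": dict(counts),
        "p95_latency_ms": p95,
    }
```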

Operational, not customer data

Observability data is deliberately scoped to runtime, infrastructure, and product health rather than the customer's application data. The goal is to keep operational telemetry safe for your support engineers to look at, without dragging regulated customer records back to your control plane.

If you do need to inspect customer-side data on demand, the right tool is a remote command — it runs inside the customer environment, returns only what you ask for, and leaves an audit trail.

Putting it together

A typical investigation looks like this:

The dashboard shows a deployment marked unhealthy after a new release.
You filter telemetry by that deployment and that release — error rate spikes, latency rises.
Logs show the offending request pattern; a trace narrows it down to a single Worker.
You pin the deployment back to the previous release while you ship a fix.

No customer hand-holding, no asking "can you send us the logs", no version drift. Every deployment behaves like part of one operational surface.
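
The filtering step of that investigation can be sketched in Python as a comparison of error rates between two releases of the same deployment. The event keys (`deployment_id`, `release`, `status`) are hypothetical; in practice this would be a query in your observability backend using the deployment context attached to each signal.

```python
def error_rate(events, deployment_id, release):
    """Share of events with a 5xx status for one deployment and release.

    Returns None when no events match, so "no traffic" is not read as "no errors".
    """
    matching = [
        e for e in events
        if e["deployment_id"] == deployment_id and e["release"] == release
    ]
    if not matching:
        return None
    errors = sum(1 for e in matching if e["status"] >= 500)
    return errors / len(matching)


# Illustrative data only: the same deployment before and after a release.
events = [
    {"deployment_id": "dep-1", "release": "v1", "status": 200},
    {"deployment_id": "dep-1", "release": "v1", "status": 200},
    {"deployment_id": "dep-1", "release": "v2", "status": 500},
    {"deployment_id": "dep-1", "release": "v2", "status": 200},
]
before = error_rate(events, "dep-1", "v1")
after = error_rate(events, "dep-1", "v2")
```

A jump from `before` to `after` on a single deployment points at the release rather than at the customer's environment, which is what justifies pinning back to the previous release.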

What's next

Self-Hosting

Releases

Remote Commands
