Discussions about LLMs for code still revolve too often around the latest model, context size, and benchmark scores. Those matter, but they are not what separates a brittle demo from reliable production performance. If teams want better AI for code, the missing layer is observability: the ability to see, measure, and steer what the system actually does across your stack, not just what the model returns in a single chat turn.
Better AI for code starts with observability
A conversational answer is one hop. Coding agents and AI-assisted delivery are workflows: repository state, tools, sandboxes, CI, APIs, retries, human edits, and policy gates. Performance is defined by the end-to-end path: latency and reliability of the whole run, cost per useful outcome (a merged change, a green build, a fixed incident), quality of what ships, and behaviour when dependencies or prompts change.
None of that is visible from a provider dashboard that only shows tokens per request. You need correlated telemetry: traces, structured logs, and metrics tied to a stable request or session identity, with enough context to slice by repo, branch, tool, model version, and customer.
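As a minimal sketch of what "correlated telemetry" means in practice (the field names here are illustrative, not a standard), every event carries one stable run identity plus the dimensions you want to slice by:

```python
import json
import time
import uuid

def make_event(run_id: str, service: str, message: str, **dims) -> str:
    """Build one structured log event that carries the shared run id
    and the slicing dimensions (repo, branch, tool, model, customer)."""
    event = {
        "ts": time.time(),   # epoch seconds; use one clock source everywhere
        "run_id": run_id,    # stable identity for the whole agent run
        "service": service,
        "message": message,
        **dims,              # repo, branch, tool, model_version, customer, ...
    }
    return json.dumps(event, sort_keys=True)

run_id = str(uuid.uuid4())
line = make_event(run_id, "gateway", "model call started",
                  repo="acme/api", branch="main",
                  tool="test_runner", model_version="m-2024-05")
print(line)
```

Because every layer emits the same `run_id` and the same dimension names, "show me everything this run did, per tool, per model version" becomes a filter instead of an archaeology project.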
The log aggregation problem
A single agent run can touch a gateway, authentication, vector search, object storage, the model API, a sandbox, linters, test runners, and webhooks. Each layer has its own format, clocks, retention, and sampling rules. Without a shared correlation or trace id propagated everywhere, you cannot reconstruct one logical run. PII and secrets force redaction or split streams, which fragments the story unless you design for it. Aggressive sampling on noisy services often hides the tail that drives cost and latency.
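Propagating one id everywhere is mostly plumbing, but it has to be deliberate. A hedged sketch using Python's `contextvars` (the header name is an assumption; OpenTelemetry defines its own standard headers):

```python
import contextvars
import uuid

# One context variable holds the correlation id for the current logical run.
correlation_id = contextvars.ContextVar("correlation_id", default=None)

def start_run() -> str:
    """Mint a correlation id at the entry point (e.g. the gateway) and bind it."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

def log(service: str, message: str) -> dict:
    """Every layer logs with the same id, so one run can be reassembled later."""
    return {"correlation_id": correlation_id.get(),
            "service": service, "message": message}

def outgoing_headers() -> dict:
    """Carry the id across process boundaries, e.g. as an HTTP header."""
    return {"X-Correlation-Id": correlation_id.get()}

cid = start_run()
events = [log("vector_search", "query issued"),
          log("sandbox", "tests executed")]
```

The key property is that no layer mints its own id: the sandbox, the linter wrapper, and the webhook handler all inherit the one bound at the entry point.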
Small projects still generate massive log volume
Teams often underestimate how much data arrives even in a "small" setup. A single product with one API, one background worker, one queue, one database, and one frontend can already produce high daily volume:
- API and edge access logs: 300k to 1.5M events per day
- Application logs (backend + worker): 100k to 600k events per day
- Queue and job lifecycle logs: 50k to 300k events per day
- Database slow query and connection logs: 20k to 150k events per day
- Frontend client errors and telemetry: 10k to 120k events per day
That puts many small projects in the range of 500k to 2.5M events per day. With 14 days of retention, that is often 7M to 35M events. At a rough average of 800 bytes to 2 KB per indexed event (payload + metadata + index overhead), teams quickly land in the neighborhood of 10 GB to 70+ GB of searchable data for a short retention window.
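The back-of-envelope arithmetic above is easy to reproduce. The per-stream rates below are the illustrative ranges from this article, not measurements; the sums land near the rounded figures quoted in the text:

```python
# Illustrative daily event ranges from the estimates above: (low, high).
streams = {
    "api_edge":   (300_000, 1_500_000),
    "app_logs":   (100_000,   600_000),
    "queue_jobs":  (50_000,   300_000),
    "db_logs":     (20_000,   150_000),
    "frontend":    (10_000,   120_000),
}

low_per_day = sum(lo for lo, _ in streams.values())
high_per_day = sum(hi for _, hi in streams.values())

RETENTION_DAYS = 14
low_retained = low_per_day * RETENTION_DAYS    # events kept at once
high_retained = high_per_day * RETENTION_DAYS

# Average indexed size per event: 800 bytes to 2 KB.
low_gb = low_retained * 800 / 1e9
high_gb = high_retained * 2_000 / 1e9

print(f"{low_per_day:,} to {high_per_day:,} events/day")
print(f"{low_retained:,} to {high_retained:,} events retained")
print(f"{low_gb:.1f} GB to {high_gb:.1f} GB indexed")
```

Even with conservative inputs, the retained index is tens of gigabytes for a two-week window, which is why retention and sampling policy are design decisions rather than defaults.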
Why searching this data is hard
At this scale, search becomes a separate reliability and cost problem. Plain text grep workflows break down quickly. Centralized log tools help, but real investigations still hit:
- very wide result sets for broad filters ("errors last 30 minutes")
- missing signal when filters are too narrow or fields are inconsistent
- slow queries on high-cardinality dimensions (user id, request id, dynamic tags)
- expensive scans when indices or partitioning are not aligned with query patterns
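One recurring cause of "missing signal" is inconsistent field names across services: the same concept ships as `requestId` in one service and `req_id` in another, so a narrow filter silently misses half the path. A small normalization step at ingest fixes this; the alias table below is hypothetical:

```python
# Hypothetical aliases: different services emit the same concept
# under different field names; map them all to one canonical key at ingest.
ALIASES = {
    "requestId": "request_id",
    "req_id": "request_id",
    "lvl": "level",
    "severity": "level",
}

def normalize(event: dict) -> dict:
    """Rename known aliases so queries can filter on one canonical field name."""
    return {ALIASES.get(k, k): v for k, v in event.items()}

raw = [
    {"requestId": "abc", "lvl": "error", "service": "gateway"},
    {"req_id": "abc", "severity": "error", "service": "worker"},
]
normalized = [normalize(e) for e in raw]

# A narrow filter now finds the whole failing path, not a fragment of it.
matches = [e for e in normalized
           if e.get("request_id") == "abc" and e.get("level") == "error"]
```

Without the normalization step, the same filter over `raw` would match nothing, which is exactly the "too narrow" failure mode listed above.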
The practical issue is not only volume. It is query precision under pressure. During an incident, if correlation ids are missing, field names differ between services, or timestamps are misaligned, finding one failing path can take much longer than fixing the bug itself.
Centralizing logs in one index is only the start. The hard part is making signals composable: comparable units (wall time, dollars, tokens), consistent identifiers, and views that teams can trust for decisions, not another wall of unstructured text.
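Making signals composable mostly means agreeing on units up front. A sketch of collapsing tokens and sandbox time into dollars per useful outcome; the pricing constants are placeholders, not real rates:

```python
# Placeholder rates; real numbers come from your provider's price sheet.
USD_PER_1K_INPUT_TOKENS = 0.003
USD_PER_1K_OUTPUT_TOKENS = 0.015

def run_cost_usd(input_tokens: int, output_tokens: int, compute_seconds: float,
                 usd_per_compute_hour: float = 0.50) -> float:
    """Collapse tokens and sandbox time into one comparable unit: dollars."""
    model = (input_tokens / 1000) * USD_PER_1K_INPUT_TOKENS \
          + (output_tokens / 1000) * USD_PER_1K_OUTPUT_TOKENS
    compute = (compute_seconds / 3600) * usd_per_compute_hour
    return model + compute

def cost_per_merged_change(runs: list) -> float:
    """Cost per useful outcome: total dollars divided by runs that merged."""
    total = sum(run_cost_usd(r["in_tok"], r["out_tok"], r["secs"]) for r in runs)
    merged = sum(1 for r in runs if r["merged"])
    return total / merged if merged else float("inf")

runs = [
    {"in_tok": 40_000, "out_tok": 8_000, "secs": 300, "merged": True},
    {"in_tok": 25_000, "out_tok": 5_000, "secs": 180, "merged": False},
]
print(f"${cost_per_merged_change(runs):.2f} per merged change")
```

Note that the failed run still counts toward the numerator: the metric charges wasted attempts against the outcomes that actually shipped, which is the behaviour you want when optimizing.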
Full observability was already hard before LLMs
Needing a clear picture of production is not a problem invented by language models. It is why ecosystems around Prometheus, Grafana, OpenTelemetry, and hosted APM products grew so large: teams had to stitch together metrics, logs, and traces across many moving parts and still answer simple questions under pressure.
A serious stack is rarely one service. You get databases, replicas and migrations, caches, message queues and workers, one or more backends, API gateways, auth, object storage, scheduled jobs, and client-side errors from browsers or mobile apps, often plus edge CDNs and Kubernetes noise on top. Each layer emits its own signals, owners, and failure modes. Standing up collectors, retention, alerting, and dashboards is real engineering work, and keeping cardinality and cost under control is an ongoing discipline.
Software does not replace that discipline. A good observability platform gives you storage, query languages, and panels. It does not decide your service boundaries, SLIs, sampling policy, or which fields make an incident debuggable. Expertise still matters: people who know how production fails, what to correlate, and how to change the system when the charts move.
Lidless: observability for code agents at AI2H
At AI2H we work with clients who run LLM-powered coding workflows in production. Lidless is our internal observability capability aimed specifically at code agents. It is the tool we use to assemble the richest context we can for those workflows: what ran, in what order, at what cost, and with what outcome, so people are not stitching the story together across five different UIs. From that base you can visualize and measure production behaviour, then optimize costs, latencies, output quality, and how agents behave under real load, rather than guessing from isolated vendor metrics. We describe it alongside our broader automation and tooling work on our automation and tools page.
Lidless does not remove the need for expertise. It narrows the gap between raw telemetry and actionable understanding for agent-heavy code paths. Models are still one component among many; delivery and operations need instrumentation and judgment. If you are also wrestling with token spend and security around AI-assisted coding, our articles on hidden costs and security risks in vibe coding and regaining control when AI-driven coding loops go wrong complement this picture from adjacent angles.
Conclusion
The next leap in LLM performance for code is less about which model tops a leaderboard and more about whether you can observe and steer the full agent lifecycle. Until every dependency contributes to a coherent, correlated view, you leave speed, quality, and margin on the table. Observability is not a nice-to-have; it is the performance layer.