Memory profilers, call graphs, exception reports, and telemetry: here’s what happened when we gave AI coding tools system-wide context

February 3, 2025

To meaningfully solve problems, developers aggregate context from disparate sources that live both within and outside the code. This context is often scattered across a daunting array of tools that developers navigate daily: from the AWS Console's labyrinth of services to Kubernetes' multi-layered configs, and from Datadog's dense metric displays to distributed traces spanning dozens of services. Each tool demands its own expertise, creating constant context-switching that interrupts flow and drains productivity [1].

Enter AI coding assistants: they have become remarkably good at understanding and generating code, but they often lack crucial context about the broader system environment. This missing context doesn't necessarily lead to hallucinations (although it might!), but it does mean developers spend more time helping the AI understand the full picture, manually bridging the gap between the code and the many information sources that make up the complex system it lives within.

The context challenge

Consider a typical debugging scenario: a developer notices production alerts about timeouts. What seems like a simple query issue quickly reveals a complex system-wide problem spanning multiple services:

  • Exception monitoring shows intermittent timeouts
  • Server logs reveal patterns in affected requests
  • Infrastructure metrics show resource constraints
  • Performance dashboards indicate systemic bottlenecks
  • Recent deployments suggest potential triggers
  • Architecture diagrams expose impacted dependencies

Here's the challenge: even "simple" timeouts require context from half a dozen different systems. Traditional AI coding assistants, limited to analyzing code files, can't synthesize this crucial operational context.

Early experiments with context ingestion

We decided to test what happens when we give AI coding assistants access to the same operational context developers use. We experimented with several types of system data:

  • Call graphs generated from production trace data
  • Memory profiler outputs revealing resource bottlenecks
  • Exception reports from monitoring systems
  • Telemetry information showing actual usage patterns

Our goal was to test how different types of context affect debugging accuracy when given to AI coding assistants.

Experiment 1: Giving call graph information to Aider

In our first experiment, we tasked Aider with fixing a reproducible bug in the Polars library (#20042). We provided two key inputs: the relevant test files, and function call graphs generated using pycg.

The results were intriguing: with the call graph in context, Aider's navigation went from vague requests for "DataFrame initialization code" to precise identification of the sequence_to_pydf function in polars/_utils/construction.py.

However, we also discovered clear limitations. When we provided the 14,000-line call graph of a file containing a dependent function, it overwhelmed the context window and made LLM queries impossible. This highlighted a crucial insight: context must be not just relevant but also efficiently represented to be useful for AI tools.
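
To make that insight concrete, here's a minimal sketch of the kind of pruning that helps: pycg emits a JSON call graph mapping each function to the functions it calls, and we can keep only the neighborhood around the functions a failing test touches before handing the graph to an assistant. The file names and entry point below are placeholders; this illustrates the approach, not our exact tooling.

    import json
    from collections import deque

    def load_call_graph(path):
        """Load a pycg-style call graph: {"pkg.module.func": ["pkg.other.callee", ...]}."""
        with open(path) as f:
            return json.load(f)

    def prune_call_graph(graph, roots, max_depth=3):
        """Keep only functions reachable from `roots` within `max_depth` hops,
        so the result stays small enough for an LLM context window."""
        kept = {}
        queue = deque((root, 0) for root in roots)
        seen = set(roots)
        while queue:
            func, depth = queue.popleft()
            callees = graph.get(func, [])
            kept[func] = callees
            if depth >= max_depth:
                continue
            for callee in callees:
                if callee not in seen:
                    seen.add(callee)
                    queue.append((callee, depth + 1))
        return kept

    if __name__ == "__main__":
        # e.g. generated beforehand with: pycg --package polars <entry files> -o callgraph.json
        graph = load_call_graph("callgraph.json")
        # Hypothetical root: the function the failing test exercises.
        pruned = prune_call_graph(graph, roots=["polars._utils.construction.sequence_to_pydf"])
        print(json.dumps(pruned, indent=2))  # paste or pipe this into the assistant's context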

Experiment 2: Datadog dashboards

We built a proof of concept that captured Datadog dashboard screenshots using headless Chrome, fed these visualizations to AI coding assistants, and asked the AI to analyze specific metrics and trends.

While capturing dashboard screenshots seems straightforward, we encountered several technical hurdles:

  1. Detecting when a dashboard had fully rendered isn't trivial. We needed to (a) wait for specific DOM elements, (b) verify that all graphs had rendered, and (c) ensure the underlying data had finished loading.
  2. Dashboards often extend beyond a single screen, so getting full coverage required programmatic scrolling, multiple captures, or stitching screenshots together (a capture sketch follows below).
  3. We had to convert visual metrics into actionable insights, which proved challenging.

[Figure: screenshot of a Datadog dashboard captured by the proof of concept]
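
As a rough illustration of the capture step, here is a minimal sketch using Playwright's headless Chromium, one of several ways to drive headless Chrome. The dashboard URL is a placeholder, and a real Datadog dashboard would also need authentication and more careful readiness checks than the heuristics shown here.

    from playwright.sync_api import sync_playwright

    DASHBOARD_URL = "https://app.datadoghq.com/dashboard/abc-123"  # placeholder URL

    def capture_dashboard(url: str, out_path: str = "dashboard.png") -> None:
        """Open a dashboard in headless Chromium, wait for it to settle,
        and take a full-page screenshot (covers content below the fold)."""
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page(viewport={"width": 1920, "height": 1080})
            page.goto(url, wait_until="networkidle")   # crude "data finished loading" heuristic
            page.wait_for_timeout(5000)                # extra grace period for chart rendering
            page.screenshot(path=out_path, full_page=True)
            browser.close()

    if __name__ == "__main__":
        capture_dashboard(DASHBOARD_URL)

Even with full-page capture, dynamically rendered charts and lazy-loaded widgets made heuristics like these fragile in practice.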

What we learned is that the LLM's analysis of dashboard metrics was overly verbose, focused on surface-level observations, and missed critical correlations and patterns. This experiment revealed that effective metrics analysis requires the ability to navigate interactive, time-based data visualizations, not just inspect static screenshots of them.

Experiment 3: Memory profiling outputs

We tested Aider's ability to interpret memory profiling data by providing it with formatted output from Python's memory_profiler tool. The profiler output was structured in a table format showing line numbers, total memory usage, memory increment per line, occurrence count, and the actual code contents. When asked to identify memory-intensive code, Aider correctly analyzed this tabular data, identifying the line number that caused the peak memory allocation of 152.625 MiB. It also properly contextualized this against smaller allocations, like the 7.676 MiB used by the list creation in another line. The AI coding assistant was able to track the memory usage pattern through the program's execution, including the subsequent memory release of -152.488 MiB when the list was deleted.
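
For reference, the table described above is what memory_profiler's line-by-line mode produces. A script in the style of its canonical example (not necessarily the exact one we profiled) generates output with precisely these columns:

    # Run with: python -m memory_profiler example.py
    # memory_profiler prints, per line: Line #, Mem usage, Increment, Occurrences, Line Contents.

    from memory_profiler import profile

    @profile
    def my_func():
        a = [1] * (10 ** 6)        # small allocation, a few MiB
        b = [2] * (2 * 10 ** 7)    # large allocation, on the order of 150 MiB
        del b                      # memory released, shows up as a negative increment
        return a

    if __name__ == "__main__":
        my_func()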

This suggests that AI assistants can effectively interpret structured performance data when provided in their context window, opening possibilities for memory optimization assistance.

Experiment 4: Exception reports

We built a tool to test how exception report context affects AI coding assistant performance. The tool ingests exception reports via Sentry's JSON API and turns them into a form coding assistants can understand and reason about. To evaluate its effectiveness, we used a real-world example: a complex database timeout issue in an academic citation system.
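
To give a feel for that ingestion step, here is a minimal sketch that condenses a Sentry event JSON payload into a compact, prompt-friendly summary. It assumes the standard Sentry event shape (an exception entry with values and stack frames); the exact field names can vary by SDK and API version, and the fetch from Sentry's REST API is omitted.

    import json

    def summarize_sentry_event(event: dict, max_frames: int = 5) -> str:
        """Turn a Sentry event JSON payload into a short text summary
        suitable for pasting into a coding assistant's context."""
        lines = [
            f"Error: {event.get('title', 'unknown')}",
            f"Timestamp: {event.get('dateCreated') or event.get('timestamp', 'unknown')}",
        ]
        for exc in event.get("exception", {}).get("values", []):
            lines.append(f"Exception: {exc.get('type')}: {exc.get('value')}")
            frames = (exc.get("stacktrace") or {}).get("frames", [])
            # Sentry lists frames oldest-first; the last ones are closest to the error.
            for frame in frames[-max_frames:]:
                lines.append(
                    f"  at {frame.get('function')} "
                    f"({frame.get('filename')}:{frame.get('lineno')})"
                )
        return "\n".join(lines)

    if __name__ == "__main__":
        with open("sentry_event.json") as f:   # e.g. an event saved from Sentry's JSON API
            print(summarize_sentry_event(json.load(f)))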

We provided Aider with a JSON Sentry error report for the issue. When given just the error report, Aider was able to extract basic information: the file location, error type, and timing. However, when we added the source file to the context, Aider provided significantly deeper analysis, identifying the specific recursive SQL queries likely causing the timeout and explaining how their 50,000-row limits interact with complex citation networks.

This experiment demonstrates how combining structured error reports with source code enables AI tools to perform more sophisticated debugging, moving from simple error reporting to understanding systemic issues and their root causes.

The Nuanced Graph: starting simple, thinking big

Today, we're starting with a focused foundation: a code intelligence graph enhanced with static analysis annotations. This gives AI assistants basic structural understanding of your codebase through dependencies, types, and control flow.
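
Purely as an illustration of what "static analysis annotations" can mean in practice (this is not Nuanced's actual schema), a node in such a graph might carry something like:

    from dataclasses import dataclass, field

    @dataclass
    class FunctionNode:
        """Hypothetical node in a code intelligence graph: one function,
        annotated with facts recoverable from static analysis."""
        qualified_name: str                      # e.g. "polars._utils.construction.sequence_to_pydf"
        file_path: str
        callers: list[str] = field(default_factory=list)    # dependency edges in
        callees: list[str] = field(default_factory=list)    # dependency edges out
        param_types: dict[str, str] = field(default_factory=dict)
        return_type: str | None = None
        raises: list[str] = field(default_factory=list)     # exceptions reachable via control flow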

But our vision extends far beyond static analysis. We're building towards a rich knowledge graph that will integrate:

  • Production behavior from logs and error reports
  • System performance metrics and resource usage
  • Service interaction patterns
  • Deployment and configuration context

By transforming this operational data into structured context for AI assistants, we'll enable them to reason about your system holistically—understanding not just what the code says, but how it actually behaves in production.

Help shape our direction

We believe the future of AI development lies in better context, not just bigger models. We're starting with Python support and gradually expanding our capabilities. If you're interested in following our progress or getting early access, join our waitlist, or email me, the founder, at ayman@nuanced.dev.