Autocomplete

DevGPT starts as a thin wrapper — a code autocomplete layer plugged into engineers' IDEs inside J.P. Morgan's Asset & Wealth Management division. The problem it's solving is modest: reduce context-switching, surface code suggestions without leaving the editor, make developers a little faster.

The model is GPT-3.5. The infrastructure is minimal. There's no caching, no memory, no orchestration. Requests go in, completions come out.

But the usage data tells a different story. Engineers aren't just using it for autocomplete. They're asking it questions. Pasting in entire functions. Requesting refactors. The completions box is becoming a conversation.

The insight: developers don't want autocomplete. They want a collaborator.

From Completion to Conversation

The first major architectural shift: DevGPT evolves from a completion endpoint into a multi-turn assistant. We add session context, conversation history, and expand the IDE integration to VS Code and IntelliJ. We introduce Claude alongside GPT-4 — initially for longer context tasks where GPT-4's window was a ceiling.

User count grows from dozens to hundreds. The feedback loop tightens. Engineers want the tool to understand their codebase, not just their current file. They want it to remember what they worked on yesterday. They want it to know what their team is building.

The platform is outgrowing its original design. We begin rebuilding the backend — not iterating on what exists, but re-architecting for what's coming.

Stack at this point: FastAPI backend, PostgreSQL, basic Redis session caching, single-tenant EKS deployment, GPT-4 and Claude 3 via direct API.

First Signs of Scale Pressure

By mid-2024, DevGPT has several hundred active users. Every request is a live inference call. Every conversation turn sends the full message history. There's no prompt caching, no token budget management, no awareness of what's being re-computed on every turn.

It works. But the cost curve is climbing faster than the user curve.

We also begin integrating AWS Bedrock as the primary model layer — routing Anthropic models through Bedrock for compliance, auditability, and AWS-native IAM controls. This is the right call architecturally, but it introduces a new constraint: Bedrock service quotas. We're on default quota limits that were never designed for a platform at this trajectory.

We don't hit the wall yet. But it's visible in the distance.

Platform Ambitions

The decision is made to invest DevGPT as a real platform — not a productivity tool, but infrastructure. That means:

  • Single-tenant architecture with per-LOB isolation — true multi-tenancy is blocked by JPMC's Route53 routing restrictions, so isolation happens at the LOB layer within a single tenant
  • A proper context assembly system (not just dumping history into the prompt)
  • Memory that persists across sessions — the four-layer model: Org / Team / User / Session
  • A Kubernetes API gateway on EKS using Cilium for per-LOB auth, rate limiting, and load shedding
  • Hybrid model routing — Anthropic via AWS Bedrock, OpenAI via Azure OpenAI, with Bedrock inference profiles designed across 3 regions by default via Geo CRIS
  • IDE expansion: Android Studio and Xcode support on the roadmap — wanted, but never built; not enough time or headcount to take it on alongside everything else

The platform begins onboarding product managers and senior leadership alongside engineers. The user base is no longer just developers. DevGPT is becoming infrastructure for how AWM thinks.

User count approaches 2,000. TPM is growing fast.

The user base stays AWM-only — which is why, even at scale, DevGPT tops out around 6,500 users rather than firm-wide.

The Wall

4 million tokens per minute. AWS throttles us. Users start seeing errors.

This is the moment everything breaks.

Total platform TPM reaches approximately 4 million tokens per minute. AWS Bedrock begins throttling requests. Engineers, product managers, and senior leaders across AWM start seeing errors mid-session. Requests fail. Completions time out. The complaints come fast.

The instinct is to blame the infrastructure. The load balancer. The EKS node groups. The API gateway. We dig through every layer.

The root cause is simpler and more fundamental:

We have prompt caching disabled. Every single token in every single request is hitting live inference.

Every conversation turn. Every system prompt. Every shared context block that's identical across thousands of requests. All of it being re-computed, re-charged, re-throttled — every time.

With ~2,000+ active users sending multi-turn conversations, the math was brutal. We weren't just using our quota — we were burning it on work the model had already done.

Stack at this point: Golang endpoints using InvokeModelWithResponseStream to Bedrock, running across 20 pods on EKS, with Karpenter for node provisioning and Cilium for networking.

The Fix

Enabling prompt caching. Working with AWS on quota. Building the telemetry to see what's actually happening.

1. Prompt caching — enabled, properly architected

This isn't just flipping a flag. Effective prompt caching requires deliberate cache point placement — the system prompt, shared context blocks, and conversation prefixes need to be structured so the cache boundary sits in the right place. We redesign the context assembly layer around cache-friendliness: stable content above the fold, dynamic content below it, cache points placed at the boundaries that matter.

We also implement rolling window management so long conversations don't evict their own cache entries, and build org-level shared caching for content that's identical across users — team context, codebase summaries, shared system prompts.

2. AWS service quota engineering

We partner directly with AWS to right-size our Bedrock quotas — not just for current load, but for the trajectory we're on. This means building a quota management framework: automated monitoring against limits, alerting at threshold percentages, and a capacity planning model tied to user growth projections. We design for the entirety of J.P. Morgan Chase, not just AWM.

3. Observability

We instrument every cache interaction — hit rate, miss rate, eviction patterns, per-request token savings. We integrate Phoenix (Arize) for LLM tracing alongside Datadog and OpenTelemetry. For the first time, we can see exactly what's happening at the token level on every request.

In October 2025, we build a full telemetry gateway — Golang, OTEL end-to-end, with SQS separating staging from a federated production schema. It collects events from both the IDE plugins and the Bedrock proxies, giving us a single pipeline for everything flowing through the platform.

This gateway becomes the cornerstone for AI attribution across all of AWM — every model call, every cache hit, every agent action, traceable back to a team, a user, and a cost center.

4. The Numbers — December 2025

By December, the results are not incremental. They're structural — 90% prompt cache hit rate, errors gone, users stop complaining.

MetricBeforeAfter
Prompt cache hit rate0%~90%
Live inference TPM (peak)~4M (and climbing)~6M
Cache read TPM (peak)0~20M
TPM budget utilized>90% (throttled)~20% of 30M budget
User-facing errorsFrequentEliminated

Cache reads on Anthropic models peak at 20 million tokens per minute. Live inference consumption drops to approximately 6 million TPM against a 30 million TPM budget. The same platform, serving more users, consuming a fraction of the quota.

Prompt caching alone brings AWS costs down 70% — the same infrastructure, serving a growing user base, at a fraction of the spend.

The remaining edge cases are honest ones: inexperienced users sending requests that exceed Bedrock's maximum input token limits. These are operational constraints — not bugs, not architectural failures. We handle them with clear error messaging, input validation, and user education on context window limits.

AI4Tech

The platform earns its own organization.

DevGPT's growth — 6,500 users, firm-wide tooling impact, infrastructure that had outgrown AWM's shared environments — leads to a structural decision: spin out a dedicated organization.

AI4Tech is formed within AWM with its own SEAL, its own AWS accounts, and its own cost center. My manager and I move over in a two-week migration — a full AWS org cutover with a 30-minute planned maintenance window, zero data loss, no user-facing failures.

The platform now has the organizational independence to operate and scale on its own terms.

Agentic Runtime

Anthropic SRT. Bubblewrap. Sub-2-second sandbox dispatch.

With the cost and quota problems solved, the platform can grow. And the next frontier isn't conversation — it's autonomous execution.

We architect the Anthropic Sandbox Runtime (SRT) on EKS: a pool-managed, session-aware Kubernetes pod system where agents execute untrusted code in sandboxed subprocesses using bubblewrap for kernel-level process isolation. No privileged containers. No separate runtime service.

Key design decisions:

  • Warm pod pools eliminate cold-start latency — pods sit ready behind Karpenter NodePools, dispatched in under 2 seconds
  • Istio Ambient Mode replaces the sidecar model — sidecar-free mTLS and L4 policy enforced at the node level, keeping the data plane completely out of the agent container
  • KEDA ScaledObjects handle burst autoscaling without over-provisioning
  • SOCKS5/socat network isolation gives each agent session a contained network surface
  • VDI-to-AWS-cloud auth flow bridges JPMC's on-premise VDI environment to the cloud-native runtime securely

The SRT turns DevGPT from a chat interface with code awareness into a platform where agents can actually run code, test it, and iterate — inside the firm's security boundary.

From Plugin to Platform

The agent loop was never meant to run without an IDE attached.

I'm tasked with building cloud agents — long-running, asynchronous agents that operate without a developer sitting in front of an editor. The obvious starting point is DevGPT's existing agent loop, the same one driving the VS Code and IntelliJ plugins.

It doesn't work. The agent loop is built assuming an IDE on the other end of the connection — plugin-side state, editor-bound context capture, UI callbacks baked into the control flow. Cloud agents need none of that, and the coupling makes it impossible to run the loop headless.

Rather than bolt a cloud mode onto a plugin-shaped core, I rewrite the agent loop from the ground up as a CLI-first, headless runtime — the plugins become thin clients on top of it, not the other way around. The same binary that powers a developer's IDE session now runs unattended in a pod, dispatched by the SRT, driven by Temporal.

This rewrite becomes the foundation for cloud agents across DevGPT — and for TLM's maintenance-branch agents that operate with no human in the loop until the patch is ready for review.

It also lands at the right moment: 2026 is the year JPMC obtains Claude Code licenses firm-wide. With a headless, cloud-capable agent loop already in place, DevGPT becomes the internal answer to Claude Code — the same agentic coding experience, built to run inside JPMC's infrastructure and security perimeter.

Technology Lifecycle Management

DevGPT goes firm-wide — under the hood.

The most significant expansion of DevGPT's scope isn't a new UI feature or a new model. It's a new use case entirely: automated software maintenance at enterprise scale.

Technology Lifecycle Management (TLM) uses DevGPT's agentic infrastructure to automate dependency updates, security patches, and framework migrations across JPMC's full technology estate — Terraform, Python, Java, and Golang codebases.

The architecture:

The agents propose. The humans decide. The firm moves faster.
  • Agents operate on a dedicated maintenance branch per repository
  • A supervisor agent reviews proposed changes — diffs, test results, risk assessment — before surfacing them to the owning team
  • Teams retain full merge authority to release/master
  • The system operates continuously, across the entire firm, while the platform org remains AI4Tech-chartered within AWM

6,500 users. Firm-wide reach. Sub-2s sandbox dispatch. 90% cache hit rate.

DevGPT is no longer a tool. It's infrastructure — for how engineers write code, how teams manage their technology estate, and how J.P. Morgan Chase thinks about the relationship between AI and software development.

The journey from a simple autocomplete wrapper in 2023 to an agentic platform powering firm-wide software maintenance in 2026 is not a straight line. It's a series of constraints hit, root causes diagnosed, and architectural decisions made under real pressure with real users depending on the outcome.

That's what production looks like.