Editors’ note: For the third installment of The Next Horizon of System Intelligence series, we invited the Glia team from MIT to share their work on developing human-inspired AI for system design and optimization. The previous blog article defined the ladder of System Intelligence based on the learning experience of PhD students, and Glia is such a PhD-level AI!
The evolution of AI today is nothing short of revolutionary, with Large Language Models (LLMs) demonstrating reasoning and creative capabilities that continue to defy expectations. This relentless progress, however, has exposed a shortcoming: the complexity of the underlying computer systems infrastructure.
AI models are becoming increasingly sophisticated and diverse. We’re currently witnessing both an explosion of LLMs and a wide range of non-LLM “core ML” models. With this growth, we are finding that the systems that support them lag behind. These systems, which are critical for delivering AI applications to users, are complex due to massive scale, expensive (and evolving) hardware such as GPUs and TPUs, dynamic workloads, and the need for stringent performance guarantees.
To manage this complexity, companies dedicate squads of expert systems engineers. Yet developing and optimizing a truly novel system design remains a slow process, often taking weeks or months. This innovation is ultimately capped by human time and intuition, making traditional, human-centric R&D too slow for rapid progress.
Past attempts to solve this problem using AI/ML for systems have not been overwhelmingly successful. Previous efforts, often relying on Reinforcement Learning (RL), produced fragile, opaque black-box policies that were difficult to analyze, verify, or adapt to new workloads. These policies lacked the simplicity, clarity, and robustness that characterize good, human-engineered designs.
This problem motivates our creation of Glia, a “PhD-level” AI designed to autonomously architect and continuously optimize complex systems infrastructure. As society becomes increasingly digitized, and as AI proliferates through our digital systems, such solutions are essential. Our mission with Glia is to build a new form of autonomous infrastructure for all computing systems, and especially those that deliver AI-based applications to users.

Glia: A PhD-Level AI for System Design & Optimization
Glia, named after the non-neuronal brain cells that support and enhance the function of neurons, is an AI architecture that reimagines how systems are designed and optimized. Its core innovation is a “white-box” approach rooted in deep systems reasoning, in contrast with prior black-box approaches.
The system uses a human-inspired, multi-agent workflow to elevate exploration from low-level code modifications to higher-level design concepts and ideas. Glia’s methodology is generalizable and designed to be robust, avoiding the many false exploration paths common to brute-force or code-mutation methods.
Glia has three main components: a front-end for human input, a multi-agent AI that embodies the systems thinking of expert human researchers and engineers and mimics their workflow, and an evaluation playground (a simulator or testbed) to experiment with new ideas. In our current implementation, the multi-agent AI has two types of agents:
- Researcher: This agent produces ideas, performs hands-on engineering to implement them in code, and runs experiments with that code on the evaluation playground. It has been taught systems principles, the use of tools like shell commands and scripts, and how to perform deep analyses of experimental data.
- Supervisor: This agent acts as the guide. It steers the Researcher by asking questions, providing feedback, recalling previous findings, and approving or suggesting revisions. Its goal is to ensure that the research process does not lose focus, terminate prematurely, or continue fruitlessly. It does not have access to the codebase; instead, it keeps the big picture, the key ideas discovered thus far, and the current state of the experiments in mind.
This process involves a continuous loop depicted below:
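In code, the interplay between the two agents and the evaluation playground looks roughly like the sketch below. This is a minimal illustration; the agent classes, method names, and stopping logic are assumptions for exposition, not Glia’s actual interfaces.

```python
# A minimal sketch of the Researcher/Supervisor loop described above.
# All classes and method names here are hypothetical, not Glia's real API.

def research_loop(researcher, supervisor, playground, max_rounds=50):
    """Iterate hypothesis -> experiment -> analysis under Supervisor guidance."""
    best_design = None
    for _ in range(max_rounds):
        # The Researcher proposes an idea and implements it as runnable code.
        hypothesis = researcher.propose_hypothesis()
        candidate = researcher.implement(hypothesis)

        # The evaluation playground (simulator or testbed) runs the experiment.
        telemetry = playground.run(candidate)
        findings = researcher.analyze(telemetry)

        # The Supervisor sees only ideas and findings (not the codebase) and
        # decides whether to approve, redirect, or stop the exploration.
        decision = supervisor.review(hypothesis, findings)
        if decision.approved:
            best_design = candidate
        if decision.stop:
            break
        researcher.incorporate_feedback(decision.feedback)
    return best_design
```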
Pure LLM Prompting Doesn’t Work
Our first key finding is the inadequacy of using LLMs as-is or in simple evolutionary loops for system design. When an LLM is given a detailed prompt describing the problem and objectives, the resulting algorithms are generally not competitive out of the box.
For example, we prompted different LLMs to design an efficient request router for an LLM serving system, with the objective of minimizing mean response latency for a given workload. Most of the generated routers barely beat the least-loaded queue (LLQ) router, a simple human-designed standard baseline. We gave the same task to a human expert with 20 years of experience working on such problems and found that his solution was significantly better than both the simple baselines (LLQ and round-robin) and the LLM-generated solutions. The picture below shows the distribution of the mean response latency obtained over 100 attempts, where each attempt is a different run at prompting an LLM.

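For reference, the LLQ baseline mentioned above can be expressed in a few lines. The sketch below is a generic illustration of how such a router is typically implemented (tracking outstanding requests per GPU), not the exact baseline code from our experiments.

```python
# A minimal sketch of a least-loaded queue (LLQ) request router: each incoming
# request goes to the GPU replica with the fewest outstanding requests.
# The notion of "load" used here (queue depth) is an illustrative assumption.

class LeastLoadedQueueRouter:
    def __init__(self, num_gpus):
        self.outstanding = [0] * num_gpus  # in-flight requests per GPU

    def route(self, request):
        gpu = min(range(len(self.outstanding)), key=lambda i: self.outstanding[i])
        self.outstanding[gpu] += 1
        return gpu

    def on_complete(self, gpu):
        self.outstanding[gpu] -= 1
```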
What About LLM-in-the-Loop Search?
A more sophisticated approach places LLMs within a black-box search loop. One or more LLMs generate or modify code candidates, an evaluator executes each candidate on a benchmark and returns a performance score (e.g., latency or throughput), and the LLM refines subsequent candidates based on that feedback. This is the method used by approaches such as FunSearch and AlphaEvolve, which rely on score feedback to mutate code, operating more like a “code monkey” that tries out endless variants without explicit reasoning about why a solution works or fails. This approach also does not work particularly well, as we show below.
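The structure of such a search loop is roughly the following. This is a schematic sketch, not the actual FunSearch or AlphaEvolve implementation; the helper objects (`llm`, `evaluate`) are hypothetical placeholders.

```python
# A rough sketch of black-box, LLM-in-the-loop code search. The only feedback
# flowing back to the LLM is a scalar score; there is no structured reasoning
# about *why* a candidate works or fails. `llm.mutate` and `evaluate` are
# hypothetical placeholders, not any real system's API.

def llm_in_the_loop_search(llm, evaluate, seed_program, budget=100):
    best_program, best_score = seed_program, evaluate(seed_program)
    for _ in range(budget):
        candidate = llm.mutate(best_program, feedback=f"score={best_score:.2f}")
        score = evaluate(candidate)  # e.g., negative mean latency on a benchmark
        if score > best_score:
            best_program, best_score = candidate, score
    return best_program
```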
Glia’s Approach
Glia’s reasoning-centered method produces both higher performance and more interpretable designs. Rather than using an LLM as a black-box optimizer, Glia uses it to elicit systems reasoning. We designed an agentic framework that mirrors how teams of human experts approach system design problems. It consists of a workflow loop that forms hypotheses, conducts experiments, analyzes results, and refines ideas.
The philosophy behind our approach is the observation that solving systems problems requires four interdependent skills:
- developing a system model,
- formulating hypotheses and designing experiments,
- analyzing experimental data in the form of telemetry, and
- synthesizing insights into improved designs.
Standard LLM prompting misses the feedback loop required to integrate these skills.
Case Study: From Weeks to Hours in LLM Serving
Link to the demo video: Glia Demo.mov
As a representative example, we applied Glia to an important challenge in today’s AI infrastructure: efficiently serving LLM inference requests across a distributed GPU cluster. The task was to optimize the Request Router, which dynamically decides which GPU should serve an incoming request. The goal is to minimize the mean response time (RT) across all requests sent to the system.
Here is a breakdown of the multi-agent path to the new router that Glia invented. This router incorporates a novel Head-Room Admission (HRA) global scheduler combined with the Shortest-Prefill-First (SPF) idea (inspired by the classical “shortest job first” scheduler common in systems that seek to minimize mean response latency). It achieves performance comparable to the human expert’s design, but in only two hours rather than the two weeks the expert needed.
1. Initial Exploration and Bottleneck Identification (Researcher)
Starting from a baseline like Least-Loaded Queue (LLQ), the Researcher began by testing simple variants (e.g., LLQ with memory pressure checks). After the first ten experiments yielded no significant improvement, the Researcher autonomously realized this path was unproductive and decided to step back:
Rather than continuing to iterate blindly, I am going to step back, profile the baseline more rigorously and derive a quantitative model of where time is actually spent…
This rigorous profiling led to the crucial, evidence-based insight: roughly 26% of requests were experiencing restarts, averaging 48.1 seconds of end-to-end latency. This was 11 seconds slower than non-restarted requests.
The Researcher concluded:
Hypothesis: memory imbalance → more restarts → wasted execution time.
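The kind of telemetry analysis behind this insight can be sketched in a few lines. The per-request record fields below are illustrative assumptions rather than the actual trace schema, and the commented numbers simply echo the findings above.

```python
# A hedged sketch of restart-focused telemetry analysis. The per-request
# record fields are illustrative assumptions, not the real trace schema.

def restart_breakdown(records):
    """records: iterable of dicts with 'latency_s' and 'num_restarts' per request."""
    restarted = [r["latency_s"] for r in records if r["num_restarts"] > 0]
    clean = [r["latency_s"] for r in records if r["num_restarts"] == 0]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {
        "restart_fraction": len(restarted) / len(records),  # ~0.26 in the study above
        "mean_latency_restarted_s": mean(restarted),         # ~48.1 s in the study above
        "mean_latency_clean_s": mean(clean),                 # ~11 s faster in the study above
    }
```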
2. The Plateau and the Supervisor’s Intervention
The Researcher then tried several techniques from the literature to mitigate restarts (e.g., memory-headroom LB, elephant-house heuristic, and more). The best solution found improved latency by only 3%. The Researcher hit a plateau, leading the Supervisor to intervene and prevent the Researcher from continuing along a fruitless path:
It seems that we have been unable to reduce the number of restarts. Why is that? Is there something structurally wrong about our approach that is making it difficult to limit the number of restarts?
3. The Breakthrough Reasoning (Researcher)
The Supervisor’s targeted questioning triggered a chain-of-thought analysis in the Researcher, leading to the core design insights:
Restarts stubbornly persist because: 1) The scheduling mechanism (vLLM’s pre-empt-on-growth) evicts the youngest request when an extra block is needed and none is free; 2) The global scheduler cannot foresee the full decode length, leading to “safe” admissions that later trigger pre-emptions; 3) The high average utilization (>90%) of LLQ leaves no memory head-room, causing a domino effect of restarts.
4. Idea Synthesis and Refinement
Guided by this white-box analysis, Glia synthesized a novel Head-Room Admission (HRA) global scheduler. This strategy explicitly forecasts memory usage and defers admission of a request until the target GPU would retain sufficient memory headroom.
The initial HRA implementation successfully eliminated most restarts (from 26% to 0.001%) but resulted in high latency due to deferred admission (e.g., 50 seconds vs. the 40-second LLQ baseline on one workload). Glia automatically recognized this as a parameter-tuning opportunity and performed a rapid search over the HRA algorithm parameters, quickly finding a configuration that broke the 40-second latency barrier.
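Such a parameter search can be as simple as sweeping a handful of candidate values in the simulator and keeping the best one. The parameter name, the candidate values, and the `simulate_mean_latency` function below are illustrative assumptions, not Glia’s actual tuning procedure.

```python
# A hedged sketch of a rapid parameter sweep over an admission threshold.
# `simulate_mean_latency` is a hypothetical hook into the evaluation playground.

def tune_headroom_threshold(simulate_mean_latency, candidates=(16, 32, 64, 128, 256)):
    results = {c: simulate_mean_latency(min_headroom_blocks=c) for c in candidates}
    best = min(results, key=results.get)  # lowest mean latency wins
    return best, results
```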
Finally, the Supervisor encouraged idea composition, recalling that the Researcher had previously tested a Shortest-Prefill-First (SPF) scheduler. The Researcher combined it with HRA. The combined HRA + SPF design achieved a mean end-to-end latency under 23 seconds on this workload, a 42.5% improvement over the baseline LLQ algorithm, reached in only 20 simulations and under two hours, compared with the 100+ simulations over two weeks required by the human expert.
Discovery 1: The Head-Room Admission (HRA) Router
The Head-Room Admission (HRA) request router is interpretable and simple.
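To give a concrete sense of the design (without claiming this is Glia’s exact pseudocode), a simplified head-room-based admission router, with deferred requests drained shortest-prefill-first, might look like the sketch below. The thresholds, the KV-cache forecast, and the field names are illustrative assumptions.

```python
# A simplified, illustrative sketch of head-room-based admission routing with
# shortest-prefill-first draining of deferred requests. Thresholds, the KV-cache
# forecast, and all field names are assumptions, not Glia's exact design.

from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    expected_decode_tokens: int

@dataclass
class GpuState:
    kv_blocks_total: int
    kv_blocks_used: int = 0

    def headroom_after(self, blocks_needed):
        """KV-cache blocks that would remain free if this request were admitted."""
        return self.kv_blocks_total - self.kv_blocks_used - blocks_needed

class HeadRoomAdmissionRouter:
    def __init__(self, gpus, block_size=16, min_headroom_blocks=64):
        self.gpus = gpus
        self.block_size = block_size
        self.min_headroom = min_headroom_blocks
        self.pending = []  # requests deferred until enough headroom exists

    def estimate_blocks(self, req):
        # Forecast KV-cache usage from the prompt plus an expected decode length.
        total_tokens = req.prompt_tokens + req.expected_decode_tokens
        return -(-total_tokens // self.block_size)  # ceiling division

    def route(self, req):
        need = self.estimate_blocks(req)
        # Admit only to a GPU that would keep enough free head-room, so that
        # growing decodes do not trigger a domino effect of restarts.
        candidates = [g for g in self.gpus if g.headroom_after(need) >= self.min_headroom]
        if not candidates:
            self.pending.append(req)  # defer admission instead of overloading
            return None
        target = max(candidates, key=lambda g: g.headroom_after(need))
        target.kv_blocks_used += need
        return target

    def drain_pending(self):
        # Re-attempt admission shortest-prefill-first (the SPF idea) as headroom frees up.
        waiting = sorted(self.pending, key=lambda r: r.prompt_tokens)
        self.pending = []
        for req in waiting:
            self.route(req)  # route() re-defers the request if headroom is still insufficient
```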

We confirmed this performance breakthrough in a real-world system with 4 NVIDIA A10 GPUs.
Unlike humans, an AI like Glia never sleeps! We tasked Glia with developing better, workload-specific adaptive solutions for other components of the LLM inference stack, including the batch scheduler and the autoscaler.
Discovery 2: The Optimized Batch Scheduler
When asked to improve the vLLM batch scheduler, Glia autonomously discovered that ordering requests by prefill length instead of arrival time reduces end-to-end delay by an additional 25%. This is because prioritizing shorter prefills minimizes head-of-line blocking for short prompts.
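The core of this change can be illustrated in a few lines. This is a generic sketch of the idea, not vLLM’s actual scheduler code, and the request representation is an assumption for illustration.

```python
# A hedged sketch of shortest-prefill-first ordering for a batch scheduler's
# waiting queue. The request representation is illustrative, not vLLM's.

def order_waiting_queue(waiting_requests):
    """Sort waiting requests by prefill (prompt) length instead of arrival time,
    reducing head-of-line blocking for short prompts."""
    return sorted(waiting_requests, key=lambda r: r["prompt_tokens"])

# Example: the long prompt no longer blocks the two short ones that arrived after it.
queue = [
    {"id": "a", "prompt_tokens": 4096},
    {"id": "b", "prompt_tokens": 128},
    {"id": "c", "prompt_tokens": 512},
]
print([r["id"] for r in order_waiting_queue(queue)])  # ['b', 'c', 'a']
```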
Discovery 3: The Cost-Saving Autoscaler
To process time-varying loads in a cost-effective manner, autoscaling adjusts the number of compute instances to meet latency targets while minimizing compute cost. We asked Glia to design an autoscaler that minimizes compute cost while keeping the p95 slowdown below 5x that of a system with no resource constraints. Glia proposed a proportional control loop that adjusts the number of instances based on the current volume of in-flight requests per instance. Glia then tuned the controller thresholds for this specific model and workload, finding an optimal configuration that minimizes cost while satisfying the latency constraint.
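A proportional controller of this form can be written in a few lines. The target level, the bounds, and the rounding behavior below are illustrative assumptions that would need tuning per model and workload, as Glia did.

```python
# A hedged sketch of a proportional autoscaling step driven by in-flight
# requests per instance. All constants are illustrative assumptions.

def autoscale_step(current_instances, inflight_requests,
                   target_inflight_per_instance=8.0,
                   min_instances=1, max_instances=64):
    """Scale the replica count toward a target in-flight-requests-per-instance level."""
    load_per_instance = inflight_requests / max(current_instances, 1)
    # Proportional adjustment: scale the instance count by the ratio of
    # observed load to target load, then clamp to the allowed range.
    desired = round(current_instances * load_per_instance / target_inflight_per_instance)
    return max(min_instances, min(max_instances, desired))

# Example: 3 instances with 48 requests in flight scales toward 6 instances.
print(autoscale_step(current_instances=3, inflight_requests=48))  # 6
```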
A Glia-optimized stack with the HRA+SPF Request Router, Batch Scheduler, and Autoscaler cut total GPU-hours by 40% on this variable workload while achieving the same performance as the baseline defaults in a vLLM cluster. Each Glia innovation contributed to this result, as shown below.

Continuous Adaptation and Future Directions
A fixed, human-designed system only works optimally for the specific workload it was tuned for. Glia’s capacity for continuous adaptation is promising. In an experiment using a Llama-3.3-70B-Instruct model on 8 NVIDIA H100 GPUs and a prefill-heavy workload, Glia was tasked with a constrained optimization: maximize throughput (QPS) while keeping the tail latency (P90 TTFT) below 1500 ms.
Glia automatically adapted to the new operating conditions. It discovered a routing algorithm that met the TTFT constraint in all ten trials while achieving higher QPS than heuristics that violated the constraint. The Glia-discovered algorithm sustained the TTFT target at up to 7 QPS, a 4.6x improvement over the expert’s algorithm. This demonstrates Glia’s versatility and its ability to perform constrained optimization, ensuring SLO compliance without human intervention.
We believe that Glia is helping to pioneer a new era of autonomous infrastructure where systems are adaptive, self-improving, and, to cope with complexity, designed by AI itself. Just as Computer-Aided Design (CAD) ushered in an era of exponential growth for hardware design by enabling the development of faster computers that could be used to design even faster hardware, we believe that tools like Glia will unlock the next generation of performance, efficiency, and adaptability for the systems that implement and deliver AI advancements to users.
Glia is progressing toward its goal of becoming an AI capable of PhD-level systems design and optimization for real-world problems. The next great systems breakthrough may well be discovered by AI.
For more details on the Glia project, please check out glia.mit.edu
This article was edited by Haoran Qiu, Chieh-Jan Mike Liang, Francis Yan, and Tianyin Xu.