Editor’s note: The authors open a blog series on the timely topic of system intelligence and the future of systems research, with intelligence as a new system capability. They are actively looking for contributors to share ideas, viewpoints, and experiences.
Why This Blog Series?
Generative AI, as represented by Large Language Models (LLMs), has had an unprecedented impact on problem domains such as code generation and bug fixing, and has reshaped the research landscape across many fields in computing and data science. However, its role in tackling complex, real-world system challenges remains controversial. Anecdotally, some doubt LLMs’ capability to manage system intricacies, while others raise concerns over their safety and interpretability.
This controversy raises hard but urgent questions, which remain unanswered and even subject to drastically different opinions:
- Can AI models go beyond coding and bug fixing, towards the design and implementation of innovative and complex systems?
- What are the inherent limitations of existing AI models, and what are the best practices and tools available for systems researchers?
- How can we effectively “marry” AI research and systems research? Can this partnership ultimately prepare the SIGOPS community for “grand challenges”?
These questions set the stage for a new paradigm of computing systems. We envision:
The fundamental machinery of computing systems is no longer bounded by human ingenuity. Instead, it is realized by self-evolving artifacts, enabled by system intelligence that instantiates high-level, declarative goals while obeying the safety and security principles that protect modern systems against adversaries.
With this blog series, we aim to spark active discussions in our community. We welcome scientific debates and principled thinking on the questions above, and more generally on how systems research could be fostered, scaled, and accelerated in the era of generative AI. Subsequent episodes will also share practices and lessons through the community’s real-world stories and experiences.
A New Paradigm of Computing System Evolution
The systems community has always thrived on inflection points: times when new hardware, new workloads, or new scales force us to revisit long-held principles and assumptions. Generative AI represents another such moment.
Today’s conventional attempts to bring AI “into the loop” remain mostly exogenous: intelligence is injected from the outside by humans. From architecting components, defining interfaces, crafting policies, and tuning parameters to troubleshooting failures, engineers have been the ones in the driver’s seat. For example, auto-tuners and ML-based optimizers tweak system configurations, but only after engineers carefully translate system properties into a learning space: selecting features, specifying knobs, writing objective functions, and defining constraints. The system itself does not own its evolution; it remains confined to the limits of the human expertise and insight that define the problem space.
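To make the contrast concrete, here is a minimal sketch of such an exogenous tuning loop, assuming a hypothetical storage system whose knobs, value ranges, and objective an engineer has already chosen. The optimizer can only search within that hand-crafted space; everything outside it is invisible to the tuner.

```python
# A minimal sketch of today's *exogenous* approach: engineers hand-define the
# learning space (knobs, ranges, objective) before any optimizer can act.
# The knobs and benchmark below are hypothetical, for illustration only.
import random

# Engineers enumerate the tunable knobs and their allowed values up front.
KNOBS = {
    "cache_size_mb": [64, 128, 256, 512],
    "io_threads": [2, 4, 8, 16],
    "compaction_style": ["level", "universal"],
}

def run_benchmark(config):
    """Stand-in for running the real system under a workload and measuring
    the engineer-chosen objective (e.g., p99 latency, lower is better)."""
    return random.uniform(0.0, 1.0)

def tune(trials=20):
    """Random search over the engineer-defined space: the optimizer never
    looks beyond the knobs a human chose to expose."""
    best_config, best_score = None, float("inf")
    for _ in range(trials):
        config = {knob: random.choice(values) for knob, values in KNOBS.items()}
        score = run_benchmark(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

if __name__ == "__main__":
    print(tune())
```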
What if systems could self-evolve from within? Our vision is a leap towards endogenous system intelligence: given high-level specifications (e.g., optimization objectives, correctness properties), system intelligence can drive continuous discovery of new opportunities and autonomously act on them. For example, this new breed of systems could use instrumentation tools to monitor their own states and behaviors, reason about the collected observations, and formulate problems to solve. Furthermore, system intelligence could ideate and generate executable policies and mechanisms.
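Below is a deliberately simplified sketch of what such an endogenous loop might look like. Every function and the `system` and `llm` objects are hypothetical placeholders rather than existing APIs; the point is the shape of the loop (observe, reason, validate, act), not any particular implementation.

```python
# A minimal sketch of the *endogenous* loop we envision, assuming an
# LLM-backed "reason" step is available as a plain callable.
# All names here are hypothetical placeholders.

def observe(system):
    """Collect live signals (metrics, logs, traces) from the running system."""
    return system.collect_metrics()

def reason(llm, observations, goal):
    """Ask the model to spot an opportunity and propose an executable change
    (e.g., a new policy or mechanism) toward the high-level goal."""
    return llm(f"Goal: {goal}\nObservations: {observations}\nPropose a change.")

def validate(system, proposal, goal):
    """Check the proposal against correctness and safety properties, e.g.,
    in a sandbox or via tests, before it ever touches production."""
    return system.sandbox_check(proposal, goal)

def self_evolve(system, llm, goal, max_rounds=10):
    """Continuous discovery: observe, reason, validate, act, and repeat."""
    for _ in range(max_rounds):
        observations = observe(system)
        proposal = reason(llm, observations, goal)
        if validate(system, proposal, goal):
            system.apply(proposal)
```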
This paradigm shift goes far beyond plugging AI components into systems. It calls for a rethinking of how we build, evaluate, and trust complex systems in an era where intelligence is not layered on top of the infrastructure, but embedded within it.
Unlocking System Intelligence
This vision of endogenous system intelligence may sound bold, but it is not speculative. Recent advances in AI already demonstrate capabilities that point in this direction and open the door to engaging deeply with system properties, performance behaviors, resource trade-offs, and architectural decisions.
Knowledge of system fundamentals. Pretraining equips AI with broad system fundamentals and concepts, from cache coherence and consensus protocols to job scheduling policies. Furthermore, in-context learning offers a way for AI to quickly adapt to specific system scenarios, enabling the synthesis of custom logic and heuristics.
Reasoning. Understanding system behavior is notoriously hard, due to factors such as multi-level structural complexity, the scale of component interactions, and the abundance of logs and signals. AI has demonstrated potential in reasoning about complex problems, and ongoing advances from the community continue to enhance these reasoning capabilities.
System tool use. AI can actively interface with systems by learning to use various tools such as profilers, command-line tools, GUIs, and APIs. This allows AI to observe the live state of a system and ground its reasoning in dynamic system behavior and properties, rather than relying solely on static codebases.
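As a small illustration, a system tool can be exposed to an AI agent as an ordinary callable. The sketch below wraps a standard `ps` invocation; the agent framework that would register and invoke the function is assumed rather than named, and the same pattern extends to profilers, tracers, and cloud APIs.

```python
# A minimal sketch of exposing a system tool to an AI agent. The subprocess
# call is real; the agent framework that registers it is assumed.
import subprocess

def read_cpu_and_memory(pid: int) -> str:
    """Return current CPU and memory usage for a process, so the agent can
    ground its reasoning in live state rather than a static codebase.
    Raises CalledProcessError if the process does not exist."""
    result = subprocess.run(
        ["ps", "-o", "%cpu,%mem", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# A hypothetical agent would register read_cpu_and_memory as a callable tool
# and decide when to invoke it, e.g., while diagnosing a performance problem.
```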
Code generation. AI can program and implement candidate designs across a wide range of programming languages and interfaces. It can also generate configurations and tests that follow specified formats.
Learning from success and failure. After experiments or changes, AI can summarize the observed impact in natural language, enabling humans to provide high-level feedback while the system iterates on lower-level implementation details.
Open-Ended Questions
Looking ahead, endogenous system intelligence represents an opportunity to revisit long-standing system-building principles, assumptions, and practices. It is also an opportunity for the SIGOPS community to define the frontier of system intelligence. In particular, we see three conversations worth starting:
Crafting a new breed of self-evolving systems
- If AI can change everything in a system, what system abstractions will replace today’s interfaces, knobs, and configuration files?
- In a moving target like self-evolving systems, how should we think about principles for system correctness, performance, and reliability?
- Can AI discover new system principles or insights that human ingenuity has previously overlooked? Can AI predict the dynamics of complex systems?
Shaping future system researchers and engineers
- Can AI act as a mentor, offering interactive curricula, experiments, and critique that accelerate how we train system thinkers?
- If AI is capable of reasoning about the chaos of extremely complex and large-scale systems, will it be able to train humans to develop comparable intuition and judgment?
- If AI is capable of taking over much of debugging, optimization, and low-level engineering, what skills or tools will humans need to remain effective system builders?
Grand challenges
- Can AI autonomously design and synthesize complex systems of unprecedented scale, surpassing humanity’s greatest engineering feats such as the Linux kernel or the Internet?
- Can AI not only construct such massive systems but also formally prove their correctness and safety?
- Can AI efficiently synthesize, provision, optimize, and repair such systems, coordinating resources at a planetary scale?
This blog series is an open invitation to the community to chart the future of system intelligence. We welcome your ideas, critiques, and stories to shape subsequent episodes together.
About the Authors
Chieh-Jan Mike Liang is a Principal Researcher at Microsoft Research. He embraces learned intelligence to optimize the performance and user experience of cloud and computing systems.
Haoran Qiu is a systems researcher at Azure Research. He focuses on AI efficiency and on learned cloud systems that are efficient, robust, and reliable.
Francis Y. Yan is an assistant professor of computer science at the University of Illinois Urbana-Champaign. He develops and optimizes intelligent networked systems, with emphasis on safety, robustness, and real-world deployability.
Tianyin Xu is an associate professor at the University of Illinois Urbana-Champaign, working on reliable and secure systems. He is confused by AI and wants to stay sane to enjoy systems research.
Lidong Zhou is a Corporate Vice President at Microsoft and the managing director of Microsoft Research Asia. He develops scalable, reliable, and trustworthy distributed systems.
This article was edited by Dong Du from Shanghai Jiao Tong University.