Weishu Zhang

About

I like systems where the interesting part is below the API surface: schedulers, memory managers, serving control planes, data pipelines, and the benchmarks that make performance claims honest. My current focus is AI infrastructure for LLM inference.

Recently I built mini-vLLM, a small LLM serving engine simulator for learning and testing paged KV cache, continuous batching, preemption, and prefix caching. I am also working on clinical LLM extraction at UCSF and backend infrastructure for a health-agent product at BalanX-BIO.

LLM serving: KV cache management, request scheduling, batching, preemption, prefix reuse, inference metrics.
AI systems: Reliable extraction, evaluation, schema fidelity, long-context pipelines, production constraints.
Backend infra: APIs, event pipelines, sync engines, databases, observability, deployment.

News

2026.07 Published The Shell Is a Protocol Now, on agent-native CLIs as execution and governance boundaries for enterprise workflows.
2026.05 Published Infra Learning Path, a textbook-style AI infrastructure path anchored on vLLM and mini-vLLM.
2026.05 Published Cloud Computing, from the machine up, a Chinese learning site for OS concepts, cloud architecture, Kubernetes, serverless, and GPU inference.
2026.05 Published Data Systems Illustrated, a CS186-style visual companion for databases from SQL down to storage and recovery.
2026.05 Released mini-vLLM: a reproducible, CPU-only simulator of modern LLM serving internals.
2026.05 Updated benchmark suite for continuous batching, prefix caching, preemption, and serving latency tradeoffs.
2026.01 Started backend infrastructure work at BalanX-BIO for a health-agent system.
2025.08 Started UCSF research on LLM-based clinical note structure extraction.

Selected Systems

mini-vLLM

code benchmarks

LLM serving control plane, from scratch

A compact implementation of the control-plane ideas behind high-throughput LLM inference. It keeps compute simulated on purpose so memory management and scheduling behavior are easy to inspect, test, and benchmark on a laptop.

Implemented paged KV-cache allocation, block reuse, preemption, prefix caching, and continuous batching.
Added deterministic workload benchmarks and unit tests for scheduler, cache, engine, and prefix behavior.
Used the project as a concrete way to study vLLM-style serving internals instead of only reading papers.

7x more active sequences, 1.53x sustained throughput, 54x lower p99 TTFT

Python, PagedAttention, KV cache, Scheduler, Benchmarking
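
As a rough illustration of the paged KV-cache allocation listed for mini-vLLM above, here is a minimal sketch. It is not mini-vLLM's actual code: `BlockAllocator`, `grow`, `release`, and the 16-token block size are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a free list (hypothetical API)."""

    num_blocks: int
    block_size: int = 16  # tokens per block; an assumed value
    free: list[int] = field(default_factory=list)
    tables: dict[str, list[int]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every block starts out free.
        self.free = list(range(self.num_blocks))

    def blocks_needed(self, num_tokens: int) -> int:
        # Ceiling division: a partially filled block still occupies a whole block.
        return -(-num_tokens // self.block_size)

    def grow(self, seq_id: str, num_tokens: int) -> bool:
        """Ensure seq_id owns enough blocks for num_tokens; False if memory is short."""
        table = self.tables.get(seq_id, [])
        extra = self.blocks_needed(num_tokens) - len(table)
        if extra > len(self.free):
            return False  # the caller may preempt another sequence and retry
        for _ in range(max(extra, 0)):
            table.append(self.free.pop())
        self.tables[seq_id] = table
        return True

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished or preempted sequence to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
```

In this sketch a scheduler would call `grow` before each decode step and treat a `False` result as the signal to preempt: calling `release` on a victim sequence frees its blocks so the request can be retried.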

Clinical note structure extraction

private

UCSF research, clinical NLP and LLM inference

A pipeline for turning long, unstructured clinical records into standardized structured outputs under privacy and schema constraints.

Decomposed monolithic extraction into smaller schema-specific subtasks to reduce context waste and improve debuggability.
Built domain datasets and fine-tuning experiments with Qwen, QLoRA, self-consistency checks, and schema validation.
Optimized long-context inference with vLLM/PagedAttention and quantization to improve throughput and memory use.

vLLM, PyTorch, QLoRA, MedSpacy, Structured extraction
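
To make the idea of schema-specific subtasks with schema validation concrete, here is a minimal sketch using only the standard library. The project is private, so everything here is an assumption for illustration: the subtask names, the field names, and the `validate` helper are hypothetical and do not reflect the real schemas or pipeline.

```python
import json

# Hypothetical sub-schemas: field name -> expected Python type.
SUBTASK_SCHEMAS = {
    "medications": {"name": str, "dose": str},
    "vitals": {"heart_rate": int, "systolic_bp": int},
}


def validate(subtask: str, raw_output: str) -> list[str]:
    """Return a list of schema errors for one subtask's JSON output."""
    schema = SUBTASK_SCHEMAS[subtask]
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"{subtask}: invalid JSON ({exc.msg})"]
    if not isinstance(record, dict):
        return [f"{subtask}: expected a JSON object"]

    errors = []
    # Check required fields and their types in schema order.
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"{subtask}: missing field '{name}'")
        elif not isinstance(record[name], expected):
            errors.append(f"{subtask}: field '{name}' is not {expected.__name__}")
    # Flag anything the model produced that the schema does not define.
    for name in sorted(record.keys() - schema.keys()):
        errors.append(f"{subtask}: unexpected field '{name}'")
    return errors
```

Because each subtask is validated on its own, a failure points at one small schema and one short model output rather than at a single monolithic extraction.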

Health-agent backend infrastructure

private

BalanX-BIO, backend and AI orchestration

Backend systems for a personalized health-agent product, with emphasis on server-authoritative state, context assembly, and reliable data flows.

Built Express.js services and an AI orchestrator proxy for LLM API calls, context injection, and health summary retrieval.
Designed sync and conflict-resolution logic across mobile health data sources.
Used AWS EventBridge, Lambda, Cognito, and JWT-based auth for daily metric aggregation and secure access.

TypeScript, Express, AWS, Sync engines, LLM APIs

GradeSync / GradeView

live

UC Berkeley course infrastructure

A distributed course-management system for normalizing student score data across Gradescope, PrairieLearn, iClicker, and internal course workflows.

Migrated fragmented course data into a normalized PostgreSQL-backed service with role-aware access control.
Added Redis-backed caching and pre-rendering paths to reduce repeated query load.
Maintained production-facing course tooling used by students and staff.

PostgreSQL, Redis, React, Node.js, RBAC

Learning

Infra Learning Path

open source

AI infrastructure textbook anchored on vLLM

A six-month path for learning LLM serving systems through vLLM: request lifecycle, PagedAttention, scheduler design, kernels, mini-vLLM, and frontier serving systems.

Connects operating systems concepts to real LLM serving internals instead of teaching them as isolated theory.
Includes diagrams, source-reading anchors, self-checks, paper reading order, and a mini-vLLM project arc.
Adds a frontier serving chapter on prefill/decode disaggregation, KV transfer, prefix-aware routing, speculative decoding, and FP8/KV quantization.

vLLM, PagedAttention, KV cache, LLM serving, Systems

Cloud Computing, from the machine up

open

Cloud / OS illustrated companion in Chinese

A learning site that follows one cloud request down through CPU scheduling, virtual memory, I/O, virtualization, containers, networking, Kubernetes, serverless, and GPU inference.

Organizes 14 chapters around the path from a single machine to cloud-scale infrastructure.
Connects CS162-style operating systems concepts to virtual machines, containers, orchestration, and AI serving.
Published as a static learning artifact under the site’s Learning section.

Cloud, Operating systems, Kubernetes, Serverless, GPU inference

Data Systems Illustrated

open

CS186-style visual companion for database systems

A local-first illustrated textbook that walks from SQL and relational algebra down to buffer pools, B+ trees, query execution, locking, recovery, distributed commit, and modern data systems.

Organizes 12 database systems chapters around diagrams, invariants, and I/O cost intuition.
Includes a B+ tree sandbox plus static visual chapters for sorting, joins, optimization, recovery, and distributed transactions.
Built as a personal learning artifact rather than a copy of course notes.

Databases, CS186, Storage, Transactions, Recovery

Writing

The Shell Is a Protocol Now: Agent-native CLIs as executable contracts for reliable enterprise workflows.
Infra Learning Path: Textbook-style path for learning AI infrastructure through vLLM and mini-vLLM.
Anatomy of a Recommender System as a Graph: Networks, recommendation, sparsity, and fairness.
Why Networks Matter: Graph structure, degree distributions, and connected systems.
Swarm Coder Setup: Notes on local agent workflows and coding infrastructure.

Experience

Summer 2026
Software Engineering in Test, ASML
San Jose, CA.
2026 - present
CS Intern - Backend, BalanX-BIO
Backend infrastructure, AI orchestration, health data sync, and AWS event pipelines.
2025 - present
Undergraduate Research Assistant, UCSF
Clinical NLP, LLM fine-tuning, long-context inference, and schema-constrained extraction.
2025 - present
Lead Systems Architect, UC Berkeley
Course infrastructure for GradeSync and GradeView under Prof. Dan Garcia.