Berkeley · CA

Weishu Zhang

Currently: building mini-vLLM to learn inference internals · optimizing clinical extraction @ UCSF · engineering backend for an AGI Health Coach @ BalanX-BIO · incoming SWE intern @ ASML (Summer 2026).
★ Latest project
mini-vLLM · paged KV-cache + continuous batching
7× more sequences · 1.53× throughput · 54× lower p99 TTFT

Experience

UC Berkeley · BS EECS · 4.0/4.0 · May 2027
Software Engineering in Test
ASML
San Jose, CA · May 2026 – Aug 2026
Incoming
CS Intern — Backend
BalanX-BIO
Remote · Jan 2026 – Present
  • Architecting high-performance Express.js backend for an AGI Health Coach with server-authoritative synchronization engine for real-time state consistency
  • Engineered AI Orchestrator Proxy interfacing with LLM APIs, managing dynamic context injection (personality vectors, health summaries) for personalized health insights
  • Designed offline-first conflict resolution logic for seamless data sync across HealthKit and Google Fit devices
  • Built automated data pipelines using AWS EventBridge and Lambda for daily health metric aggregation, secured via Cognito and JWT
Express.js · AWS · LangChain
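
To illustrate the offline-first conflict resolution mentioned above, here is a minimal sketch of one common approach: last-writer-wins per record, with the server copy winning ties. It is not the production logic. The `Sample` fields, the `merge` function, and the tie-break rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One health metric reading. Every field here is hypothetical."""
    metric: str       # e.g. "steps"
    day: str          # ISO date the reading covers
    value: float
    updated_at: int   # time of the last edit, epoch seconds
    source: str       # e.g. "healthkit" or "googlefit"


def merge(server: list[Sample], offline: list[Sample]) -> list[Sample]:
    """Last-writer-wins per (metric, day); the server copy wins ties."""
    merged = {(s.metric, s.day): s for s in server}
    for s in offline:
        key = (s.metric, s.day)
        current = merged.get(key)
        # Accept the offline edit only if it is strictly newer.
        if current is None or s.updated_at > current.updated_at:
            merged[key] = s
    return sorted(merged.values(), key=lambda s: (s.metric, s.day))
```

Keeping the comparison strict (`>`) makes the server authoritative whenever two edits carry the same timestamp, so replaying the same offline batch twice leaves the result unchanged.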
Undergraduate Research Assistant
UCSF
San Francisco, CA · Aug 2025 – Present
  • Engineered high-throughput clinical extraction pipeline by decomposing monolithic tasks into modular sub-tasks; used in-context label propagation to reduce token consumption by 90%
  • Constructed 4,000+ entry domain-specific dataset and fine-tuned Qwen LLMs via QLoRA; implemented Self-Consistency checks ensuring 100% schema fidelity for medical compliance
  • Optimized inference infrastructure using vLLM and PagedAttention — 3× throughput increase for long-context medical records, ~80% VRAM reduction via quantization
PyTorch · QLoRA · vLLM · MedSpacy
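
The Self-Consistency check mentioned above can be sketched as a majority vote over several sampled outputs, counting only those that parse and match the expected schema. This is an illustrative sketch under assumptions, not the UCSF pipeline: the `REQUIRED` schema, the field names, and both functions are hypothetical.

```python
import json
from collections import Counter

# Hypothetical target schema: field name -> expected Python type.
REQUIRED = {"diagnosis": str, "medications": list}


def is_valid(record) -> bool:
    """True only if the record has exactly the required fields and types."""
    return (
        isinstance(record, dict)
        and set(record) == set(REQUIRED)
        and all(isinstance(record[k], t) for k, t in REQUIRED.items())
    )


def self_consistent(samples: list[str]):
    """Return the most common schema-valid answer, or None if none is valid."""
    votes = Counter()
    for text in samples:
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed output gets no vote
        if is_valid(record):
            # Canonical key order so equivalent answers count together.
            votes[json.dumps(record, sort_keys=True)] += 1
    if not votes:
        return None
    winner, _count = votes.most_common(1)[0]
    return json.loads(winner)
```

Because invalid outputs never receive a vote, anything the function returns conforms to the schema; a `None` result signals that the caller should resample or escalate.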
Lead Systems Architect
UC Berkeley — GradeView / GradeSync
Berkeley, CA · Aug 2025 – Present
  • Engineered distributed course management ecosystem, migrating fragmented legacy data into high-concurrency PostgreSQL cluster with normalized schema
  • Optimized system-wide read performance with Redis-backed caching layer and Python pre-rendering service — 95% query latency reduction
  • Hardened infrastructure with Row-Level Security and granular RBAC for 100+ concurrent students and staff
  • Resolved 200+ N+1 query patterns; deployed Nginx reverse proxy for 70% reduction in deployment overhead
React · Node.js · PostgreSQL · Redis
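
For readers unfamiliar with the N+1 query pattern named above, the sketch below contrasts it with a single aggregated query. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the `students` / `scores` tables are hypothetical, not the GradeView schema.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Build a tiny in-memory database with made-up rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE scores (student_id INTEGER, assignment TEXT, points REAL);
        INSERT INTO students VALUES (1, 'Ada'), (2, 'Lin');
        INSERT INTO scores VALUES (1, 'hw1', 9.0), (1, 'hw2', 8.0), (2, 'hw1', 10.0);
    """)
    return db


def totals_n_plus_one(db: sqlite3.Connection) -> dict:
    """N+1: one query for the students, then one more query per student."""
    totals = {}
    students = db.execute("SELECT id, name FROM students").fetchall()
    for sid, name in students:
        row = db.execute(
            "SELECT COALESCE(SUM(points), 0) FROM scores WHERE student_id = ?",
            (sid,),
        ).fetchone()
        totals[name] = row[0]
    return totals


def totals_single_query(db: sqlite3.Connection) -> dict:
    """Same result in one round trip, using a join and GROUP BY."""
    rows = db.execute("""
        SELECT s.name, COALESCE(SUM(sc.points), 0)
        FROM students AS s
        LEFT JOIN scores AS sc ON sc.student_id = s.id
        GROUP BY s.id, s.name
    """).fetchall()
    return dict(rows)
```

Both functions return the same totals; the difference is the number of round trips, which grows with the number of students in the first version and stays at one in the second.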

Selected Work

four projects · systems, AI infra, and shipped product work
[Figure: mini-vLLM continuous batching throughput benchmark]
mini-vLLM source ↗
From-scratch systems model of a modern LLM serving engine: paged KV-cache, continuous batching, preemption, prefix caching, and deterministic benchmarks that run on a laptop with no GPU. The control plane is real and unit-tested; compute is deliberately simulated so the memory and scheduling behavior can be isolated.
AI Infra · PagedAttention · KV Cache · Scheduler · Prefix Caching · Python
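
The paged KV-cache idea described above can be sketched in a few lines: memory is split into fixed-size blocks, each sequence keeps a block table, and a new block is taken only when the last one fills up. This is a simplified illustration, not code from the mini-vLLM repository; `BlockAllocator` and its method names are hypothetical.

```python
class BlockAllocator:
    """Hands out fixed-size KV-cache blocks on demand (sizes are hypothetical)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))       # physical block ids not in use
        self.tables: dict[str, list[int]] = {}    # sequence id -> block table
        self.lengths: dict[str, int] = {}         # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; False means the caller must preempt."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:         # no block yet, or last block full
            if not self.free:
                return False
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return True

    def release(self, seq_id: str) -> None:
        """Return a finished or preempted sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

A scheduler doing continuous batching would call `append_token` once per running sequence per decode step and treat a `False` return as the signal to preempt or delay a sequence. Since blocks are allocated only as tokens arrive, the memory wasted per sequence is bounded by one partially filled block rather than a full maximum-length reservation.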
GradeView live ↗
Mastery learning dashboard under Prof. Garcia. Aggregates and normalizes scores from Gradescope, PrairieLearn, and iClicker into a unified Student View.
React · Node.js · PostgreSQL · Redis
DA Chat source ↗
AI educational assistant for De Anza students. RAG over college policies + gamified transfer planning + intelligent course counselor agent backed by OpenAI function calling.
Spring Boot · Vue.js · OpenAI · MySQL
NLP Pipeline · Medical Data Structuring · UCSF
Clinical Note Structure Extraction private (privacy policy)
High-precision NLP pipeline converting unstructured clinical notes into standardized structured outputs. LoRA fine-tuning + MedSpacy entity recognition + vLLM-served inference for long-context records.
Python · PyTorch · MedSpacy · LoRA · vLLM

Learn

Question-driven study notes for going deep into a field · self-imposed curricula
Below are the guides I'm building while I learn. Each follows the same loop: driving question → predict → fill the gap → read real code → self-check.

About

Who

EECS at UC Berkeley, graduating May 2027. I build systems at the intersection of infrastructure engineering and applied AI — LLM inference control planes, clinical NLP pipelines, distributed course management platforms, and backend systems for health tech. Recently built mini-vLLM to make paged KV-cache, continuous batching, and prefix caching tangible on a laptop. Currently researching LLM optimization for clinical data extraction at UCSF and engineering backend infrastructure for an AGI health platform at BalanX-BIO. Incoming SWE intern at ASML (Summer 2026). Biased toward things that ship, scale, and matter.

Skills
Languages
Python · Java · C/C++ · SQL · TypeScript · JavaScript
Frameworks
FastAPI · React · Node.js · Spring Boot · Next.js · LangChain
Cloud & DevOps
AWS · Kubernetes · Docker · Terraform · CI/CD
AI & Data
PyTorch · vLLM · LoRA/QLoRA · MedSpacy · HuggingFace