Hossein Entezari Zarch
Hello! I’m a third-year Ph.D. student in Computer Science at the University of Southern California, advised by Prof. Murali Annavaram in the SCIP Lab at the USC Meta Research Center. I also earned my M.Sc. in Computer Science from USC and my B.Sc. in Computer Engineering from the University of Tehran.
My research focuses on efficient and scalable machine-learning systems, with an emphasis on improving the serving and inference efficiency of large language models (LLMs). I work on topics such as I/O-aware computation, KV-cache optimization, speculative decoding, and sparse attention mechanisms, aiming to make LLMs more scalable, memory-efficient, and deployable in real-world environments. Here is a copy of my CV.
News
- 10/10/2025: Our preprint DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning is now available on [arXiv].
- 07/07/2025: Our paper DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding has been accepted to COLM 2025.
- 05/22/2025: Our preprint MARché: Fast Masked Autoregressive Image Generation with Cache-Aware Attention is now available on [arXiv].
- 05/15/2025: Our paper KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation has been accepted to ACL Findings 2025.
- 12/10/2024: Our paper Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation has been accepted to the AAAI SEAS Workshop 2025.
- 09/20/2024: Our paper CADC: Encoding User-Item Interactions for Compressing Recommendation Model Training Data has been accepted to the LargeRecSys 2024 Workshop at ACM RecSys 2024.