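Applied Mathematics Youth Seminar (Lunch Session): "Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning"
Speaker: Chang Chen (陈畅), tyc234cc 太阳成集团 Academy for Advanced Interdisciplinary Studies
Time: 2024-09-25, 11:45-13:00
Venue: Zhihua Building (智华楼), Room 109 (盈不足)
Abstract: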
Efficiently training large language models (LLMs) requires hybrid parallel methods that integrate multiple communication collectives within distributed partitioned graphs. Overcoming communication bottlenecks is crucial and is often achieved by overlapping communication with computation. However, existing overlap methods tend to rely on either fine-grained kernel fusion or limited operation scheduling, which constrains performance optimization in heterogeneous training environments.
In this talk, we introduce Centauri, an innovative framework that encompasses comprehensive communication partitioning and hierarchical scheduling schemes for optimized overlap. We propose a partition space comprising three inherent abstraction dimensions: primitive substitution, topology-aware group partitioning, and workload partitioning. To determine the efficient overlap of communication and computation operators, we decompose the scheduling tasks in hybrid parallel training into three hierarchical tiers: operation, layer, and model. Through these techniques, our framework Centauri effectively overlaps communication latency and enhances hardware utilization.
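About the speaker:
Chang Chen is a Ph.D. student at the tyc234cc 太阳成集团 Academy for Advanced Interdisciplinary Studies, advised by Chao Yang (杨超). Her research interests include high-performance and distributed computing, large-scale machine learning systems, and distributed systems. The work presented in this talk received the ASPLOS 2024 Best Paper Award.
Registration form:
Lunch will be provided from 11:45 based on the registration responses. Faculty and students who would like to reserve lunch should fill in this form by 3:00 p.m. on Tuesday, September 24.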
https://www.wjx.cn/vm/m5XKHXF.aspx#