Ottino, Alessandro; Benjamin, Joshua; Zervas, Georgios; (2023) RAMP: A flat nanosecond optical network and MPI operations for distributed deep learning systems. Optical Switching and Networking, 51, Article 100761. 10.1016/j.osn.2023.100761.

Text: RAMP_OSN_Arxiv.pdf - Accepted Version (2MB)

Abstract
Distributed deep learning (DDL) systems strongly depend on network performance. Current electronic packet switched (EPS) network architectures and technologies suffer from variable-diameter topologies, low bisection bandwidth and over-subscription, all of which affect the completion time of communication and collective operations. We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP, which supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder are proposed to run MPI collective operations across the optical circuit switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves a 7.6-171× speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It can also deliver a 1.3-16× and 7.8-58× reduction in Megatron and DLRM training time respectively, while offering a 38-47× and 6.4-26.5× improvement in energy consumption and cost respectively.

| Type: | Article | 
|---|---|
| Title: | RAMP: A flat nanosecond optical network and MPI operations for distributed deep learning systems | 
| Open access status: | An open access version is available from UCL Discovery | 
| DOI: | 10.1016/j.osn.2023.100761 | 
| Publisher version: | https://doi.org/10.1016/j.osn.2023.100761 | 
| Language: | English | 
| Additional information: | This version is the author-accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions. | 
| Keywords: | Distributed deep learning systems, Optical circuit switched network architecture, MPI operations. | 
| UCL classification: | UCL; UCL > Provost and Vice Provost Offices > UCL BEAMS; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Electronic and Electrical Eng |
| URI: | https://discovery.ucl.ac.uk/id/eprint/10180660 |
