IBM at PyTorch 2024
- San Francisco, California, United States
About
IBM is a proud sponsor of the 2024 PyTorch Conference, held September 18-19 in San Francisco. Join us to learn how we're contributing new ways to optimize AI training, tuning, and inference with PyTorch. Here's how:
- Visit the IBM booth on September 18 and 19 to learn how IBM is bringing transparent, trustworthy AI to communities worldwide through the recently open-sourced IBM Granite family of enterprise AI models and InstructLab, an open-source LLM project with Red Hat.
- Attend keynotes by IBM researchers Kush Varshney on trustworthy AI and Mudhakar Srivatsa on inference optimizations for LLMs.
- Hear from IBM and Red Hat in poster presentations and lightning talks on topics such as large-scale training, flexible data loaders, performant Triton kernels, and supply chain security.
We look forward to seeing you in San Francisco!
*All times shown below represent your local time zone.
Career opportunities
If you are interested in joining our team, have a look at the following job opening:
Agenda
Panel:
- Sara Hooker, Vice President of Research, Cohere For AI
- Aleksander Mądry, Member of Technical Staff, OpenAI
- Kush Varshney, Fellow, IBM
- Rishi Bommasani, Ph.D. Candidate in Computer Science, Stanford University & Society Lead, Stanford Center for Research on Foundation Models
torch.compile is a graph compilation technique that improves GPU utilization. A key challenge in getting torch.compile to perform well is minimizing (or eliminating) graph breaks. This isn't trivial: even the Llama implementation provided by Meta has many graph breaks, which reduce training throughput. In this talk we discuss (1) how we addressed these challenges in order to train a model using torch.compile, (2) how we combined torch.compile with FSDP and selective activation checkpointing to achieve maximum training throughput, (3) a model-quality comparison between models trained with and without compile, and lastly (4) the best setups we have found for different model sizes in the Llama family to maximize throughput and MFU (e.g., 68% MFU for the 7B model on A100 GPUs!)
Speakers:
- Antoni Viros i Martin, Staff Research Scientist, IBM Research
- Linsong Chu, Senior Technical Staff Member, IBM Research
- Brian Vaughan, Senior Technical Staff Member, IBM
Since the dawn of the divergence between proprietary and open source software, there has been a debate over the security implications of these two approaches to software development. Proponents of proprietary software have argued that because the code is not public, it is harder to exploit. Open source advocates have countered that because the code is open, it invites more scrutiny, which strengthens its overall security posture. After much research and publication, the argument that open source was more secure was supported and the debates subsided. In recent years the conversation has begun again, yet nothing fundamental has changed in either development approach. The conversation should not simply reignite the same question but rather focus on what has changed: open source itself is not less secure, but supply chain attacks have exploited its practices. The focus should be on securing the software supply chain. I will give a short history of the debate, present statistics on supply chain attacks, and explain that open source is not insecure but that securing its supply chain is crucial. Attendees will get actionable steps to secure their supply chain.
Speaker:
- Kathleen Goeschel, Principal Portfolio Architect for Product Security, Red Hat
Large-scale model pretraining crucially relies on specialized and dedicated dataloaders that can, for example, partition and stream data asynchronously across multiple processes and physical nodes. In this talk we discuss one of the torch-native dataloaders we built and use at IBM Research to address these needs. Intended for large-scale model pretraining, particularly in research settings where rapid iteration between datasets may be required, our dataloader is distributed, stateful, checkpointable, composable, and rescalable, while remaining a simple extension of the existing PyTorch dataloading framework. It automatically and invisibly handles data sharding, shuffling, subdataset weighting, checkpoint saving and loading, and custom user-defined preprocessing functions, with minimal overhead and high throughput. We discuss these properties and how we achieved them, such as reducing overhead by implementing a custom LCG random number generator, and demonstrate a proof of concept on production-scale training of a 7B-parameter Llama model over 4 trillion tokens.
Speakers:
- Davis Wertheimer, Staff Research Scientist, IBM
- Linsong Chu, Senior Technical Staff Member, IBM Research