* Pre-train a GPT-2 (~124M-parameter) language model using PyTorch and Hugging Face Transformers.
* Distribute training across multiple GPUs using Ray Train, with minimal code changes.
* Stream training ...
flash-attention-with-sink implements an attention variant, used in GPT-OSS 20B, that integrates a "sink" step into FlashAttention. This repo focuses on the forward pass and provides an experimental ...
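For intuition, the sink step can be written as a reference (non-Flash) softmax in NumPy: a per-head sink logit contributes `exp(sink)` to the softmax denominator but receives no value vector, so the attention weights sum to less than 1 and a head can effectively attend to nothing. This is a sketch of the math only, assuming a scalar sink logit per head; the repo's contribution is fusing this into FlashAttention's online softmax, which is not shown here.

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    # scores: (seq_len,) attention logits for one query of one head.
    # The sink adds exp(sink_logit) to the denominator only, so the
    # returned weights sum to strictly less than 1.
    m = max(scores.max(), sink_logit)  # subtract max for numerical stability
    exp_scores = np.exp(scores - m)
    denom = exp_scores.sum() + np.exp(sink_logit - m)
    return exp_scores / denom

def sink_attention(q, K, V, sink_logit):
    # Single-query scaled dot-product attention with a sink term.
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)
    weights = softmax_with_sink(scores, sink_logit)
    return weights @ V, weights
```

With `sink_logit` driven toward negative infinity this reduces to ordinary softmax attention; a large positive `sink_logit` shrinks all attention weights toward zero.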