Note: This repository is to be archived. Tutorial content will be moved to the project repository that it relates to. To find the new location of an existing tutorial, refer to the following table: ...
First kernel added: a Triton fused softmax. It currently matches PyTorch's softmax in both speed and numerics in end-to-end (E2E) training; with better tuning it may become faster than PyTorch. Next up: RMSNorm. Forward pass working, ...
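
As a point of reference for the "same numerics" claim, the per-row computation that a fused softmax kernel has to reproduce can be sketched in plain Python. This is not the Triton kernel from this repository, only a minimal sketch of the numerically stable formulation (subtract the row maximum, exponentiate, normalize). The names `softmax_row`, `softmax_rows`, and `max_abs_diff` are hypothetical helpers introduced here for illustration.

```python
import math


def softmax_row(row):
    """Numerically stable softmax over a single row (a list of floats)."""
    # Subtracting the row maximum keeps exp() from overflowing on large inputs.
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    denom = sum(exps)
    return [e / denom for e in exps]


def softmax_rows(matrix):
    """Apply softmax independently to each row of a 2D list."""
    return [softmax_row(row) for row in matrix]


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two 2D lists of equal shape."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


if __name__ == "__main__":
    out = softmax_rows([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
    for row in out:
        print(row, "sum =", sum(row))
```

A helper like `max_abs_diff` is one simple way to compare a kernel's output against a reference implementation; the tolerance to accept would depend on the dtype in use.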