A from-scratch implementation of GPT-2 (124M parameters) following Andrej Karpathy's "Zero to Hero" playlist. The model was trained on 10B tokens from the FineWeb-Edu dataset on 2× H100 GPUs for ...
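As a sketch of where the "124M parameters" figure comes from, the snippet below enumerates the standard GPT-2 small hyperparameters and sums the weights analytically (with the language-model head weight-tied to the token embedding, as in GPT-2). The `GPTConfig` name and the counting helper are illustrative, not necessarily the identifiers used in this repository.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max context length
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding / hidden dimension

def num_params(cfg: GPTConfig) -> int:
    # token + position embeddings (lm_head shares weights with wte)
    n = cfg.vocab_size * cfg.n_embd + cfg.block_size * cfg.n_embd
    per_block = (
        # attention: fused QKV projection + output projection (weights and biases)
        cfg.n_embd * 3 * cfg.n_embd + 3 * cfg.n_embd
        + cfg.n_embd * cfg.n_embd + cfg.n_embd
        # MLP: 4x expansion then contraction (weights and biases)
        + cfg.n_embd * 4 * cfg.n_embd + 4 * cfg.n_embd
        + 4 * cfg.n_embd * cfg.n_embd + cfg.n_embd
        # two LayerNorms (scale and shift each)
        + 4 * cfg.n_embd
    )
    # all blocks plus the final LayerNorm
    return n + cfg.n_layer * per_block + 2 * cfg.n_embd

print(num_params(GPTConfig()))  # → 124439808, i.e. ~124M
```

This matches the published GPT-2 small checkpoint size; note that roughly 39M of those parameters sit in the embedding tables alone.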