Nearly all scripts needed to reproduce the results and figures presented in the manuscript are provided in this repository. The scripts provided in the folders from 01 to 10 and data or example input ...
This repository contains all code for reproducing experiments from the paper "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?" Given a BPE tokenizer, our attack infers ...