Evaluation tasks for LLMs: Perplexity, BPW, log nats...
I've recently been implementing a few evaluations for language models. It isn't anything groundbreaking, but if you need to run some quick offline metrics for your language model, such as perplexity or bits per word (BPW), you can use this as a simple template.
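For orientation, here is a minimal sketch of how these metrics relate to each other. It assumes you already have per-token log-probabilities in nats (natural log) from your model and a word count for the same text. The function name `summarize_nll` and its arguments are made up for illustration and are not part of any library.

```python
import math


def summarize_nll(token_logprobs, num_words):
    """Turn per-token natural-log probabilities into common offline metrics.

    token_logprobs: list of log p(token | context) values, in nats (assumed input).
    num_words: number of words in the same text, e.g. from whitespace splitting.
    """
    total_nats = -sum(token_logprobs)      # total negative log-likelihood
    num_tokens = len(token_logprobs)
    nats_per_token = total_nats / num_tokens
    return {
        "nats_per_token": nats_per_token,
        # Perplexity is the exponential of the mean per-token NLL in nats.
        "perplexity": math.exp(nats_per_token),
        # Convert nats to bits (divide by ln 2), then normalize by word count.
        "bits_per_word": total_nats / math.log(2) / num_words,
    }


# Toy example: four tokens, each assigned probability 0.5, covering two words.
logprobs = [math.log(0.5)] * 4
metrics = summarize_nll(logprobs, num_words=2)
print(metrics)
```

Normalizing by words rather than tokens keeps BPW comparable across models with different tokenizers, whereas per-token perplexity depends on the tokenizer.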