![Tests](https://github.com/tiktoken-go/tokenizer/actions/workflows/go.yml/badge.svg)

# Tokenizer

This is a pure Go port of OpenAI's tokenizer.

## Usage

```go
package main

import (
	"fmt"

	"github.com/tiktoken-go/tokenizer"
)

func main() {
	enc, err := tokenizer.Get(tokenizer.Cl100kBase)
	if err != nil {
		panic(err)
	}

	// this should print a list of token ids
	ids, _, _ := enc.Encode("supercalifragilistic")
	fmt.Println(ids)

	// this should print the original string back
	text, _ := enc.Decode(ids)
	fmt.Println(text)
}
```

Alternatively, you can use the included command-line tool:

```sh
> tokenizer -h
Usage of tokenizer:
  -decode string
        tokens to decode
  -encode string
        text to encode
  -token string
        text to calculate token

> tokenizer -encode supercalifragilistic
```

## Todo

- ✅ port code
- ✅ o200k_base encoding
- ✅ cl100k_base encoding
- ✅ r50k_base encoding
- ✅ p50k_base encoding
- ✅ p50k_edit encoding
- ✅ tests
- ❌ handle special tokens
- ❌ gpt-2 model

## Caveats

This library embeds OpenAI's vocabularies, which are not small (~4 MB), as Go maps. This differs from the Python version of tiktoken, which downloads the dictionaries and stores them in a cache folder.

However, since the dictionaries are compiled into the binary during the Go build, performance and start-up times should be better than downloading and loading them at runtime.

## Alternatives

Here is a list of other libraries that do something similar.

- [https://github.com/sugarme/tokenizer](https://github.com/sugarme/tokenizer) (a different tokenizer algorithm than OpenAI's)
- [https://github.com/pandodao/tokenizer-go](https://github.com/pandodao/tokenizer-go) (deprecated, calls into JavaScript)
- [https://github.com/pkoukk/tiktoken-go](https://github.com/pkoukk/tiktoken-go)