Introduction
A few weeks ago, I was talking with the CFO of a business partner company about implementing watsonx.ai capabilities inside their own solution. While we were discussing costs, I said the word “token” and all of a sudden there was panic 😱
After I explained what tokens are, the questions came: “How do I count the tokens we send and receive? How much does it cost us?”
The answer was quite easy. We went to the watsonx.ai studio Prompt Lab, went back and forth with some simple prompts, and saw the number of tokens there. I also showed them some very nice websites that report how many tokens a simple input sends to an LLM.
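Once the token counts are known, the cost question becomes simple arithmetic. Here is a minimal sketch, assuming prices are quoted per 1,000 tokens with separate input and output rates. The `Pricing` type, the `EstimateCost` function and the figures are all hypothetical placeholders, so check your provider's actual price list.

```go
package main

import "fmt"

// Pricing holds per-1,000-token prices. The figures used in main are
// placeholders for illustration, not real prices of any provider.
type Pricing struct {
	InputPer1K  float64
	OutputPer1K float64
}

// EstimateCost returns the cost of one request given its token counts.
func EstimateCost(p Pricing, inputTokens, outputTokens int) float64 {
	return float64(inputTokens)/1000*p.InputPer1K +
		float64(outputTokens)/1000*p.OutputPer1K
}

func main() {
	p := Pricing{InputPer1K: 0.50, OutputPer1K: 1.50} // hypothetical prices
	cost := EstimateCost(p, 2000, 1000)
	fmt.Printf("Estimated cost: %.4f\n", cost)
}
```

With these made-up rates, 2,000 input tokens and 1,000 output tokens come to 2.5 currency units.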
Later on I said to myself: why don’t I make my own token counter application (and write it in Go, since it has been a long time since I last used Golang!)? Well, I figured out it’s a bit more complicated than that 🤣
First attempt — Using Regex
My first thought was to use a regex, which I expected to give more or less acceptable results.
I set up the following Go app.
package main

import (
	"bufio"
	"errors"
	"fmt"
	"log"
	"os"
	"regexp"
	"strings"

	"github.com/sqweek/dialog"
)

// countTokens approximates the number of tokens in a text based on whitespace and punctuation.
func countTokens(text string) int {
	// A simple regex to split text into words and punctuation
	tokenizer := regexp.MustCompile(`\w+|[^\w\s]`)
	tokens := tokenizer.FindAllString(text, -1)
	return len(tokens)
}

func main() {
	// Open a file dialog box and let the user select a text file
	filePath, err := dialog.File().Filter("Text Files", "txt").Load()
	if err != nil {
		// Compare against the library's sentinel error instead of its message
		if errors.Is(err, dialog.ErrCancelled) {
			fmt.Println("File selection was cancelled.")
			return
		}
		log.Fatalf("Error selecting file: %v", err)
	}
	// Output the selected file name
	fmt.Printf("Selected file: %s\n", filePath)

	// Open the file
	file, err := os.Open(filePath)
	if err != nil {
		fmt.Printf("Error opening file: %v\n", err)
		return
	}
	defer file.Close()

	// Read the file line by line
	var content strings.Builder
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		content.WriteString(scanner.Text())
		content.WriteString("\n")
	}
	if err := scanner.Err(); err != nil {
		fmt.Printf("Error reading file: %v\n", err)
		return
	}

	// Get the text content
	text := content.String()
	// Count the tokens
	tokenCount := countTokens(text)
	// Output the result
	fmt.Printf("The file contains approximately %d tokens.\n", tokenCount)
}
As you can tell, I’m a fan of GUIs and dialog boxes, so I added a dialog box to select the input text file.
And here is the text file (some random text I found 😅).
The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C, and cannot be realistically rewritten by hand. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages.
We instead explore a different path, and explore what it would take to translate C to safe Rust; that is, to produce code that is trivially memory safe, because it abides by Rust's type system without caveats. Our work sports several original contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on "split trees" that allows expressing C's pointer arithmetic using Rust's slices and splitting operations; an analysis that infers exactly which borrows need to be mutable; and a compilation strategy for C's struct types that is compatible with Rust's distinction between non-owned and owned allocations.
We apply our methodology to existing formally verified C codebases: the HACL* cryptographic library, and binary parsers and serializers from EverParse, and show that the subset of C we support is sufficient to translate both applications to safe Rust. Our evaluation shows that for the few places that do violate Rust's aliasing discipline, automated, surgical rewrites suffice; and that the few strategic copies we insert have a negligible performance impact. Of particular note, the application of our approach to HACL* results in a 80,000 line verified cryptographic library, written in pure Rust, that implements all modern algorithms - the first of its kind.
After running my code I get the following output:
The file contains approximately 359 tokens.
It seems fine, but… against which model 🤔? Also, there are different ways to write the regex, so this number cannot be taken as a reference 🤔!
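To see where the ambiguity comes from, here is a small sketch that applies the same pattern to two short strings and prints the pieces it counts. The sample strings and the `splitTokens` helper are my own additions. Note that in Go's `regexp` package `\w` only matches ASCII letters, digits and underscore, so an accented letter falls into the punctuation branch and is counted on its own; another regex engine may well behave differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as in countTokens above.
var tokenizer = regexp.MustCompile(`\w+|[^\w\s]`)

// splitTokens returns the pieces the regex counts as tokens.
func splitTokens(text string) []string {
	return tokenizer.FindAllString(text, -1)
}

func main() {
	for _, s := range []string{"Rust's type system.", "café"} {
		parts := splitTokens(s)
		fmt.Printf("%q -> %d tokens %q\n", s, len(parts), parts)
	}
}
```

The first string comes out as 6 tokens (the apostrophe and the final period each count on their own), and the second as 2, because "é" is split away from "caf".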
Second attempt — running against a specific model
What I figured out is that unless we use the specific “tokenizer” of a given LLM, the count is not accurate. So I began to look at how to obtain accurate results against a model such as GPT-3.5, which has been on the market for a while now. After some research on the net, here is the app I came up with.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"log"
	"os"
	"os/exec"

	"github.com/sqweek/dialog"
)

func main() {
	// Open a file dialog box and let the user select a text file
	filePath, err := dialog.File().Filter("Text Files", "txt").Load()
	if err != nil {
		if err.Error() == "Cancelled" {
			fmt.Println("File selection was cancelled.")
			return
		}
		log.Fatalf("Error selecting file: %v", err)
	}
	// Output the selected file name
	fmt.Printf("Selected file: %s\n", filePath)

	// Open the file
	file, err := os.Open(filePath)
	if err != nil {
		fmt.Printf("Error opening file: %v\n", err)
		return
	}
	defer file.Close()

	// Read the file content
	var content bytes.Buffer
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		content.WriteString(scanner.Text())
		content.WriteString("\n")
	}
	if err := scanner.Err(); err != nil {
		fmt.Printf("Error reading file: %v\n", err)
		return
	}

	// Specify the model
	model := "gpt-3.5-turbo"

	// Execute the Python script, passing the text on stdin
	cmd := exec.Command("python3", "tokenizer.py", model)
	cmd.Stdin = bytes.NewReader(content.Bytes())
	output, err := cmd.Output()
	if err != nil {
		fmt.Printf("Error running tokenizer script: %v\n", err)
		return
	}

	// Print the token count
	fmt.Printf("Token count: %s", output)
}
As we can see in the code above, there is a call to a Python script, which I found on an OpenAI site. It uses the “tiktoken” library to determine the number of tokens for GPT or other OpenAI models. Note that the model name is hard-coded in the Go program.
import sys

from tiktoken import encoding_for_model


def count_tokens(model, text):
    enc = encoding_for_model(model)
    tokens = enc.encode(text)
    return len(tokens)


if __name__ == "__main__":
    # Read the model name from the first argument and the text from stdin
    model = sys.argv[1]  # E.g., "gpt-3.5-turbo"
    text = sys.stdin.read()
    print(count_tokens(model, text))
This works fine. For the same text given earlier, I now get a count of 366 tokens, which matches all the websites I found when the model is set to GPT-3.5.
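Why do the two counts differ (359 versus 366)? A model tokenizer splits text using its own vocabulary of sub-word pieces rather than word boundaries. The sketch below is a deliberately simplified toy, not how tiktoken works internally (real tokenizers such as tiktoken use byte-pair encoding with a learned vocabulary). It applies greedy longest-match against two made-up vocabularies to show that the same word can yield different token counts; `tokenize`, `vocabA` and `vocabB` are hypothetical names.

```go
package main

import "fmt"

// tokenize splits text by greedy longest match against vocab.
// Characters not covered by a multi-character vocabulary entry become
// single-character tokens. This is a toy scheme for illustration,
// not byte-pair encoding.
func tokenize(text string, vocab map[string]bool) []string {
	runes := []rune(text)
	var tokens []string
	for i := 0; i < len(runes); {
		end := i + 1 // fall back to a single character
		for j := len(runes); j > i+1; j-- {
			if vocab[string(runes[i:j])] {
				end = j
				break
			}
		}
		tokens = append(tokens, string(runes[i:end]))
		i = end
	}
	return tokens
}

func main() {
	// Two made-up vocabularies standing in for two different models.
	vocabA := map[string]bool{"token": true, "izer": true, "s": true}
	vocabB := map[string]bool{"to": true, "ken": true, "iz": true, "ers": true}

	word := "tokenizers"
	fmt.Println(len(tokenize(word, vocabA)), tokenize(word, vocabA))
	fmt.Println(len(tokenize(word, vocabB)), tokenize(word, vocabB))
}
```

The first vocabulary splits "tokenizers" into 3 pieces and the second into 4, which is why a count only means something relative to a given model's tokenizer.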
What I really want to write is code fully in “Golang”… and I want to be able to run it for all (or almost all) the models I can find on Hugging Face (such as ibm-granite/granite-3.1-8b-instruct) 🤓
That will be part 2 of this article (WIP).
So far I’m exploring the following (great 👌) GitHub repos:
- Tokenizer: https://github.com/sugarme/tokenizer
- tokenizers: https://github.com/daulet/tokenizers
- And last but not least -> go-huggingface: https://github.com/gomlx/go-huggingface?tab=readme-ov-file
Conclusion
Thanks for reading; I’m open to comments.
And until the 2nd app is out, stay tuned… 🧪