I've inspected the latest release from Mistral: Mistral-Small-24B-Instruct. It is bigger and slower than deepseek-ai/deepseek-r1-distill-qwen-7b, but it also shows how it is thinking and doesn't send your sensitive data to Chinese soil :)
So let's start.
This project provides an interactive chat interface for the mistralai/Mistral-Small-24B-Instruct-2501 model using PyTorch and the Hugging Face Transformers library.
Requirements
- Python 3.8+
- PyTorch
- Transformers
- Accelerate (needed for the device_map argument used in the script)
- An Apple Silicon device (optional, for MPS support)
Setup
Clone the repository:
git clone https://github.com/alexander-uspenskiy/mistral.git
cd mistral
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
Install the required packages:
pip install torch transformers accelerate
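Optionally, you can confirm that PyTorch was installed with MPS support before going further. A minimal check (not part of the original script):

# Sanity check: confirm PyTorch sees the Apple Silicon GPU (MPS backend)
import torch

print(torch.__version__)
print("MPS available:", torch.backends.mps.is_available())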
Set your Hugging Face Hub token:
export HUGGINGFACE_HUB_TOKEN=your_token_here
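You can also verify the token works before the model download starts. A small sketch using the huggingface_hub package, which Transformers installs as a dependency:

# Verify the Hugging Face token from the environment is valid
import os
from huggingface_hub import whoami

print(whoami(token=os.getenv("HUGGINGFACE_HUB_TOKEN")))  # prints your account info if the token is accepted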
Usage
Run the chat interface:
python mistral.py
Features
- Interactive chat interface with the Mistral-Small-24B-Instruct-2501 model.
- Progress indicator while generating responses.
- Supports Apple Silicon GPU (MPS) for faster inference.
Code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import time
import threading

# Use the Apple Silicon GPU (MPS) when available, otherwise fall back to the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Load the Mistral-Small-24B-Instruct-2501 model
model_name = "mistralai/Mistral-Small-24B-Instruct-2501"
token = os.getenv("HUGGINGFACE_HUB_TOKEN")
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": device},
    torch_dtype=torch.float16,  # Half precision to fit the model in memory on Apple Silicon
    token=token
)

def show_progress():
    # Simple spinner shown while the model is generating
    while not stop_event.is_set():
        for char in "|/-\\":
            print(f"\rGenerating response {char}", end="", flush=True)
            time.sleep(0.1)

# Interactive terminal loop
print("mistralai/Mistral-Small-24B-Instruct-2501 Chat Interface (type 'exit' to quit)")
interaction_count = 0  # Interaction counter (not used in this version)

while True:
    user_input = input("You: ")
    if user_input.lower() in ["exit", "quit"]:
        break

    inputs = tokenizer(user_input, return_tensors="pt").to(device)

    # Start the spinner in a background thread
    stop_event = threading.Event()
    progress_thread = threading.Thread(target=show_progress)
    progress_thread.start()

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=200,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id  # Added to avoid a padding warning
        )

    # Stop the spinner and clear the progress line
    stop_event.set()
    progress_thread.join()
    print("\r", end="")

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Mistral: {response}\n")
Example:
You: The bat and the ball costs $1.10 in total, bat is expensive than a ball costs $1.00 more than a ball, How much the ball costs
Mistral: The bat and the ball costs $1.10 in total, bat is expensive than a ball costs $1.00 more than a ball, How much the ball costs?
Let's denote the cost of the ball as \( B \) and the cost of the bat as \( B + 1.00 \).
According to the problem, the total cost of the bat and the ball is $1.10. Therefore, we can write the equation:
\[ B + (B + 1.00) = 1.10 \]
Simplifying the equation:
\[ 2B + 1.00 = 1.10 \]
Subtract 1.00 from both sides:
\[ 2B = 0.10 \]
Divide both sides by 2:
\[ B = 0.05 \]
So, the ball costs $0.05.
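As a quick check outside the model, the numbers do add up:

# Verify the puzzle answer: the bat costs $1.00 more than the ball, total is $1.10
ball = 0.05
bat = ball + 1.00
print(bat + ball)  # 1.1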
Summary:
So, as you can see, modern models are able to run locally and solve logical tasks with excellent performance.
Happy coding!