SSanshray Chada | May 11 · 5 min read

Orion1: a local AI assistant trained on a Mac

A practical write-up: what Orion1 is, what it can do, and how it was trained locally.

Local-firstLoRA fine-tuneApple Silicon (MPS)Gradio + Ollama

What Orion1 is

A small model, carefully tuned for real conversations

Orion1 is a ChatGPT-style assistant you can run entirely on your own machine. It starts from a small open instruction model (primarily Qwen2.5-1.5B-Instruct) and learns new behavior through a lightweight fine-tuning method called LoRA.

The goal wasn't "make it bigger." It was to make it more useful: clearer writing, better step-by-step reasoning, stronger coding help, and fewer made-up answers when the question includes source text.

How it works

The same chat loop as larger assistants, just local

Orion1 follows a standard chat loop:

Build a conversation (system prompt + your messages + history)
Convert the conversation into the model's exact chat format
Generate the reply token-by-token (with safe stop rules)
Stream the answer back into a local UI (Gradio) or into Ollama

Practical lesson: the Ollama Modelfile must match Qwen's chat template and stop tokens. When it doesn't, quality drops (style drift, weird identity claims, and inconsistent formatting).

Base model size

~1.5B params

Qwen2.5-1.5B-Instruct (primary)

Fine-tuning method

LoRA adapters

Train ~1% params instead of all weights

Training device

MacBook Air (M5)

Apple Silicon via PyTorch MPS

Training data mix

Configured cap across datasets: 680,927 examples (maximum)

Training data by category

General chat274,500 (40.3%)
Coding/agentic165,000 (24.2%)
Reasoning157,000 (23.1%)
Preference/safety50,000 (7.3%)
Grounded QA34,427 (5.1%)

Source: local dataset configuration. Caps are sampling ceilings, not guaranteed counts.

Configured examples by category

General chat274,500

Coding/agentic165,000

Reasoning157,000

Preference/safety50,000

Grounded QA34,427

Caps keep local training predictable on a laptop.

What we trained for

Practical targets and why they help

Goal	How we trained it	Expected impact
Speak properly	High-quality instruction/chat + preference/safety datasets	Cleaner structure, less rambling, better tone
Answer complex questions	Reasoning + math/science mixtures (Stratos, Tulu, MetaMath, Nemotron)	More step-by-step problem solving and synthesis
Coding + agentic behavior	Code instruction + function-calling datasets	Better code generation and tool-style responses
Know info correctly	Add grounded QA (BoolQ/SQuAD) + use "be honest" system prompts	Better at using provided context; still imperfect for open-world facts without retrieval

Tip: training helps style and habits. For up-to-date factual answers, pair the model with retrieval (search/RAG) or tools.

Milestones (project timeline)

What we built, in order

Vision model

Chat LoRA (Qwen2.5-1.5B)

UI + inference hardening

More coding/agentic data

Reasoning mix + grounded QA

Ollama export (GGUF + Modelfile)

Source: project history in the transcript.

Why training took days

Fanless laptop + large dataset

0.5h

Small smoke test

1.5h

20k examples

100k+ examples

500k+ examples

LoRA keeps memory manageable, but long runs on a MacBook Air can still be slow due to fanless thermals and limited GPU throughput.

Illustrative scaling curve (order-of-magnitude).

How Orion1 was trained (nutshell)

Training loop

We streamed multiple Hugging Face datasets and converted them into a single, consistent chat format. Then we applied Qwen's chat template and fine-tuned with LoRA on Apple Silicon (MPS), using gradient accumulation and checkpointing to fit within laptop memory.

Deployment

After training, we exported the model for local use: either via a Gradio UI in Python, or by converting to GGUF (llama.cpp) and importing into Ollama with a Modelfile that preserves the chat template.

Training hardware + constraints

Machine: MacBook Air (M5), running PyTorch with Apple's MPS backend.

Key constraints: fanless thermals, limited GPU throughput vs desktop CUDA, and unified memory pressure for long context lengths.

Why LoRA: trains a small set of adapter weights (millions) rather than all model weights (billions), making local fine-tuning feasible.

One-line summary: "Orion1 is a LoRA-adapted Qwen2.5-1.5B chat model trained locally on an M5 MacBook Air using a curated multi-dataset mixture for reasoning, coding, and grounded QA."