Luminarc.ai
Orion 1
SSanshray Chada | May 11 · 5 min read

Orion1: a local AI assistant trained on a Mac

A practical write-up: what Orion1 is, what it can do, and how it was trained locally.

Local-firstLoRA fine-tuneApple Silicon (MPS)Gradio + Ollama

What Orion1 is

A small model, carefully tuned for real conversations

Orion1 is a ChatGPT-style assistant you can run entirely on your own machine. It starts from a small open instruction model (primarily Qwen2.5-1.5B-Instruct) and learns new behavior through a lightweight fine-tuning method called LoRA.

The goal wasn't "make it bigger." It was to make it more useful: clearer writing, better step-by-step reasoning, stronger coding help, and fewer made-up answers when the question includes source text.

How it works

The same chat loop as larger assistants, just local

Orion1 follows a standard chat loop:

  1. Build a conversation (system prompt + your messages + history)
  2. Convert the conversation into the model's exact chat format
  3. Generate the reply token-by-token (with safe stop rules)
  4. Stream the answer back into a local UI (Gradio) or into Ollama

Practical lesson: the Ollama Modelfile must match Qwen's chat template and stop tokens. When it doesn't, quality drops (style drift, weird identity claims, and inconsistent formatting).

Base model size

~1.5B params

Qwen2.5-1.5B-Instruct (primary)

Fine-tuning method

LoRA adapters

Train ~1% params instead of all weights

Training device

MacBook Air (M5)

Apple Silicon via PyTorch MPS

Training data mix

Configured cap across datasets: 680,927 examples (maximum)

Training data by category

  • General chat274,500 (40.3%)
  • Coding/agentic165,000 (24.2%)
  • Reasoning157,000 (23.1%)
  • Preference/safety50,000 (7.3%)
  • Grounded QA34,427 (5.1%)

Source: local dataset configuration. Caps are sampling ceilings, not guaranteed counts.

Configured examples by category

General chat274,500
Coding/agentic165,000
Reasoning157,000
Preference/safety50,000
Grounded QA34,427

Caps keep local training predictable on a laptop.

What we trained for

Practical targets and why they help

GoalHow we trained itExpected impact
Speak properlyHigh-quality instruction/chat + preference/safety datasetsCleaner structure, less rambling, better tone
Answer complex questionsReasoning + math/science mixtures (Stratos, Tulu, MetaMath, Nemotron)More step-by-step problem solving and synthesis
Coding + agentic behaviorCode instruction + function-calling datasetsBetter code generation and tool-style responses
Know info correctlyAdd grounded QA (BoolQ/SQuAD) + use "be honest" system promptsBetter at using provided context; still imperfect for open-world facts without retrieval

Tip: training helps style and habits. For up-to-date factual answers, pair the model with retrieval (search/RAG) or tools.

Milestones (project timeline)

What we built, in order

1

Vision model

2

Chat LoRA (Qwen2.5-1.5B)

3

UI + inference hardening

4

More coding/agentic data

5

Reasoning mix + grounded QA

6

Ollama export (GGUF + Modelfile)

Source: project history in the transcript.

Why training took days

Fanless laptop + large dataset

0.5h

Small smoke test

1.5h

20k examples

6h

100k+ examples

9h

500k+ examples

LoRA keeps memory manageable, but long runs on a MacBook Air can still be slow due to fanless thermals and limited GPU throughput.

Illustrative scaling curve (order-of-magnitude).

How Orion1 was trained (nutshell)

Training loop

We streamed multiple Hugging Face datasets and converted them into a single, consistent chat format. Then we applied Qwen's chat template and fine-tuned with LoRA on Apple Silicon (MPS), using gradient accumulation and checkpointing to fit within laptop memory.

Deployment

After training, we exported the model for local use: either via a Gradio UI in Python, or by converting to GGUF (llama.cpp) and importing into Ollama with a Modelfile that preserves the chat template.

Training hardware + constraints

Machine: MacBook Air (M5), running PyTorch with Apple's MPS backend.

Key constraints: fanless thermals, limited GPU throughput vs desktop CUDA, and unified memory pressure for long context lengths.

Why LoRA: trains a small set of adapter weights (millions) rather than all model weights (billions), making local fine-tuning feasible.

One-line summary: "Orion1 is a LoRA-adapted Qwen2.5-1.5B chat model trained locally on an M5 MacBook Air using a curated multi-dataset mixture for reasoning, coding, and grounded QA."