Costing Only 300,000? Building a Personal AI Supercomputer on 4 Units of 512GB Mac Studio, Local Deployment Guide for Trillion-Parameter Kimi-K2.5
In this era of large-model frenzy, we all share a dream: running a trillion-parameter model locally that rivals GPT-5. But reality is harsh; even at 4-bit quantization, a trillion-parameter model needs hundreds of gigabytes of memory, and H100s and B200s are far too expensive. So what can you do if you can't afford them?
Today, JamePeng will guide you to use 4 fully equipped M3 Ultra Mac Studios, through EXO+MLX and Thunderbolt 5, to create a local AI supercomputer with 2TB of unified memory! The goal is simple: to run the Kimi-K2.5 trillion-parameter large model locally.
Why Go Through All This?
Not just for the cool factor, but also for data privacy and ultimate local control.
The core weapon is EXO (GitHub: exo-explore/exo), which supports RDMA (Remote Direct Memory Access) and can merge the unified memory of the 4 Macs into one massive memory pool.
Hardware list: 4 Mac Studios (M3 Ultra, 512GB memory version), roughly 2TB of unified memory in total, connected over Thunderbolt 5 (up to 120 Gbps), running macOS Tahoe 26.2 or later.
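A quick back-of-the-envelope check shows why the 2TB pool is the whole point. The parameter count and the overhead factor below are illustrative assumptions, not measured figures:

```python
# Rough memory budget for a ~1-trillion-parameter model at 4-bit quantization.
# All figures are illustrative estimates, not measured values.

params = 1.0e12          # ~1T parameters (assumed)
bytes_per_param = 0.5    # 4-bit quantization = 0.5 bytes per parameter
overhead = 1.2           # assumed ~20% extra for KV cache, activations, runtime

weights_gb = params * bytes_per_param / 1e9      # ~500 GB of weights
total_gb = weights_gb * overhead                 # ~600 GB all-in

single_gb = 512          # one Mac Studio
pool_gb = 4 * single_gb  # the four-node cluster

print(f"weights: {weights_gb:.0f} GB, with overhead: {total_gb:.0f} GB")
print(f"fits on one Mac: {total_gb <= single_gb}")   # False
print(f"fits in the pool: {total_gb <= pool_gb}")    # True
```

Even quantized to 4 bits, the model overflows a single 512GB machine, but sits comfortably inside the pooled 2048 GB.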
Step 1: Enable RDMA Support
On each Mac, perform the following:
- Shut down the Mac and enter recovery mode (hold the power button, select "Options" > "Continue")
- Open Terminal and run: bputil -a rdma
- Restart the Mac
- Verify: run system_profiler SPThunderboltDataType and confirm RDMA is enabled
Step 2: Install EXO
macOS app installation: download EXO-version.dmg from GitHub, install and run it, then open the Dashboard and add the IP addresses of the other Macs.
Source code installation:
- Install Homebrew
- git clone https://github.com/exo-explore/exo.git
- cd exo
- pip install -e .
- exo start
Step 3: Physical Connection and Topology
Do not network the machines over Wi-Fi; not even Wi-Fi 7 is enough. Trillion-parameter inference is extremely sensitive to interconnect bandwidth and latency. Use Thunderbolt 5 cables, designate one Mac as the master node and the other three as worker nodes, in either a star or a daisy-chain topology.
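To see why the wired link matters, here is a sketch of what each interconnect adds to every generated token in a 4-node pipeline. The hidden size, link latencies, effective bandwidths, and per-token compute time are all assumptions for illustration, not benchmarks:

```python
# Back-of-the-envelope: per-token network cost of each interconnect.
# All latency/bandwidth/compute figures below are assumptions.

hidden = 7168                 # assumed activation width (Kimi-K2-class model)
act_bytes = hidden * 2        # fp16 activations crossing each link per token
hops = 3                      # 4 nodes in a pipeline = 3 link crossings

links = {                     # name: (assumed one-way latency s, effective bit/s)
    "Thunderbolt 5 RDMA": (10e-6, 80e9),
    "Wi-Fi 7 (best case)": (3e-3, 5e9),
}

compute_s = 0.030             # assumed pure compute time per token (~33 tok/s)

for name, (lat, bw) in links.items():
    net_s = hops * (lat + act_bytes * 8 / bw)
    tok_s = 1.0 / (compute_s + net_s)
    print(f"{name}: +{net_s * 1e3:.2f} ms/token -> {tok_s:.1f} tok/s")
```

Even with these optimistic Wi-Fi numbers the throughput drops noticeably, and real Wi-Fi suffers latency jitter and retransmits far worse than the fixed 3 ms assumed here, which is why the wired Thunderbolt mesh is non-negotiable.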
In the EXO Dashboard, you should see all 4 devices online, with the total memory pool displayed as 2048 GB.
Step 4: Download and Run MLX Community Version Kimi-K2.5
pip install huggingface_hub
huggingface-cli download mlx-community/Kimi-K2.5 --local-dir ./models/mlx-community/Kimi-K2.5
exo run --model ./models/mlx-community/Kimi-K2.5 --quant 4 --shards auto --engine mlx
Command explanation:
- --model: points to the model directory
- --quant 4: uses 4-bit quantization to reduce memory usage
- --shards auto: EXO automatically intelligently splits the model
- --engine mlx: runs inference through MLX on the M3 Ultra's 80-core GPU
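To build intuition for what a flag like --shards auto has to do, here is a hypothetical sketch of memory-proportional layer sharding. This is not EXO's actual algorithm, and the layer count is an assumption; it only illustrates the idea of splitting contiguous layer ranges across nodes by available memory:

```python
# Hypothetical memory-proportional layer sharding (NOT EXO's real algorithm).

def shard_layers(n_layers: int, mem_gb: list[float]) -> list[range]:
    """Assign contiguous layer ranges to nodes, proportional to node memory."""
    total = sum(mem_gb)
    bounds, acc = [0], 0.0
    for m in mem_gb[:-1]:
        acc += m
        bounds.append(round(n_layers * acc / total))
    bounds.append(n_layers)
    return [range(a, b) for a, b in zip(bounds, bounds[1:])]

# Four identical 512 GB nodes, 61 transformer layers (layer count assumed):
shards = shard_layers(61, [512, 512, 512, 512])
print([(s.start, s.stop) for s in shards])
```

With equal-memory nodes this degenerates to a near-even split; with mixed hardware, the node with more free memory would simply receive a longer layer range.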
Final Effect and Testing
When the terminal shows Ready, you have your own AI supercomputer.
Prefill stage: The fans of the 4 Macs start to accelerate slightly (thanks to the energy efficiency of M3 Ultra, they won't take off).
Generation stage: Tokens pop out one after another.
Speed: it cannot match an H100 cluster, but thanks to Thunderbolt 5 RDMA the token generation speed reaches 17-28 tokens/s. For a trillion-parameter model, that is fully interactive!
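A rough roofline check suggests the 17-28 tokens/s figure is in the right ballpark. Decode speed of a mixture-of-experts model is approximately bounded by how fast the active weights stream out of memory; the active-parameter count below is an assumption for a Kimi-K2-class MoE, and the ~819 GB/s figure is the M3 Ultra's advertised unified-memory bandwidth:

```python
# Roofline sanity check on the observed 17-28 tok/s.
# Active-parameter count is an assumption; bandwidth is Apple's spec figure.

active_params = 32e9      # assumed ~32B active parameters per token (MoE)
bytes_per_param = 0.5     # 4-bit quantization
mem_bw = 819e9            # M3 Ultra unified memory bandwidth, ~819 GB/s

bytes_per_token = active_params * bytes_per_param   # ~16 GB read per token
roofline = mem_bw / bytes_per_token                 # upper bound, tok/s

print(f"theoretical ceiling: {roofline:.0f} tok/s")
print(f"observed 17-28 tok/s = {17/roofline:.0%}-{28/roofline:.0%} of ceiling")
```

Landing at roughly a third to a half of the single-node roofline is plausible once interconnect hops, scheduling, and prefill overhead are paid.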
Conclusion
This solution is definitely not cheap, but it proves that with Apple Silicon and the efforts of the open-source community, the future of decentralized AI is coming. We do not need to send data to cloud giants; we can build powerful private inference clusters using the devices at hand.

