Only ¥300,000? Building a Personal AI Supercomputer from Four 512GB Mac Studios: A Local Deployment Guide for the Trillion-Parameter Kimi-K2.5

2/26/2026


In this era of large-model frenzy, we all share a dream: running a trillion-parameter model locally that rivals GPT-5. The reality is harsh: even at 4-bit quantization, a trillion-parameter model needs roughly 500 GB of memory for the weights alone (10^12 parameters × 0.5 bytes). H100s and B200s are prohibitively expensive, so what do you do if you can't afford them?

Today, JamePeng will show you how to use four maxed-out M3 Ultra Mac Studios, connected over Thunderbolt 5 and running EXO + MLX, to build a local AI supercomputer with 2TB of unified memory. The goal is simple: run the trillion-parameter Kimi-K2.5 model locally.

Why Go Through All This?

Not just for the cool factor, but also for data privacy and ultimate local control.

The core weapon is EXO (GitHub: exo-explore/exo), which supports RDMA (Remote Direct Memory Access) and can merge the unified memory of the four Macs into one massive memory pool.

Hardware list:

  • 4 × Mac Studio (M3 Ultra, 512GB unified memory), roughly 2TB in total
  • Thunderbolt 5 cables (120Gbps per link)
  • macOS Tahoe 26.2 or later on every node

Step 1: Enable RDMA Support

On each Mac, perform the following:

  • Shut down the Mac, then hold the power button and choose "Options" > "Continue" to enter recovery mode
  • Open Terminal and run: bputil -a rdma
  • Restart the Mac
  • Verify with system_profiler SPThunderboltDataType and check that RDMA is enabled
Thunderbolt 5's 120Gbps of bandwidth leaves comfortable headroom for the inter-node transfers this setup needs.
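A quick back-of-envelope check shows why 120Gbps is enough for pipeline-style inference: per token, only a small activation vector has to cross the link between shards. The hidden width below is a placeholder for illustration, not a published Kimi-K2.5 figure.

```shell
# Rough link-budget check (illustrative numbers, not measurements).
GBPS=120
BYTES_PER_SEC=$(( GBPS * 1000 * 1000 * 1000 / 8 ))
echo "link throughput: ${BYTES_PER_SEC} bytes/s"   # 15000000000

# Assumed per-token handoff between shards: a 7168-wide hidden state
# in fp16 (2 bytes/element) -- a placeholder width, not a model spec.
ACT_BYTES=$(( 7168 * 2 ))
echo "per-token handoff: ${ACT_BYTES} bytes"       # 14336
```

At ~15 GB/s, a handoff of a few kilobytes per token is nowhere near the bottleneck; latency and compute dominate instead.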

Step 2: Install EXO

macOS app installation: download EXO-version.dmg from GitHub, install, and launch it. Open the Dashboard and add the IP addresses of the other Macs.

Source code installation:

  • Install Homebrew
  • git clone https://github.com/exo-explore/exo.git
  • cd exo
  • pip install -e .
  • exo start

Step 3: Physical Connection and Topology

Do not use Wi-Fi for networking, not even Wi-Fi 7. Trillion-parameter inference is extremely sensitive to interconnect bandwidth and latency. Use Thunderbolt 5 cables, designate one Mac as the master node and the other three as workers, and wire them in a star or chain topology.

In the EXO Dashboard, you should see all 4 devices online, with the total memory pool displayed as 2048 GB.
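As an optional cross-check on the Dashboard's number, you can sum the physical memory each node reports. The hostnames below are hypothetical; substitute your own, and note that `hw.memsize` is macOS's sysctl key for physical memory in bytes.

```shell
# Sum physical memory across the cluster (hypothetical hostnames).
NODES="mac-a.local mac-b.local mac-c.local mac-d.local"
TOTAL=0
for n in $NODES; do
  MEM=$(ssh "$n" sysctl -n hw.memsize)   # physical memory in bytes
  TOTAL=$(( TOTAL + MEM ))
done
echo "pool: $(( TOTAL / 1024 / 1024 / 1024 )) GiB"  # expect 2048 for 4 x 512GB
```

If the total falls short of 2048 GiB, one node is likely offline or misconfigured.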

Step 4: Download and Run MLX Community Version Kimi-K2.5

  • Download the model:
  • pip install huggingface_hub
  • huggingface-cli download mlx-community/Kimi-K2.5 --local-dir ./models/mlx-community/Kimi-K2.5

  • Start the inference engine:
  • exo run --model ./models/mlx-community/Kimi-K2.5 --quant 4 --shards auto --engine mlx

  Command explanation:

    • --model: points to the model directory
    • --quant 4: uses 4-bit quantization to reduce memory usage
    • --shards auto: EXO automatically intelligently splits the model
    • --engine mlx: runs inference through Apple's MLX framework on the M3 Ultra's 80-core GPU
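Once the cluster is serving, you can smoke-test it from any node. EXO's documentation describes an OpenAI-compatible chat endpoint on port 52415; treat that port as an assumption and adjust if your install differs.

```shell
# Smoke test against exo's OpenAI-style chat endpoint.
# Port 52415 is the default listed in exo's docs; change if yours differs.
curl -s http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mlx-community/Kimi-K2.5",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```

A JSON response with a `choices` array means the full pipeline, sharding and all, is working.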

Final Effect and Testing

When the terminal shows Ready, you have your own AI supercomputer.

Prefill stage: the fans on all four Macs spin up slightly (thanks to the M3 Ultra's efficiency, they never take off).

Generation stage: tokens stream out one after another.

Speed: it cannot match an H100 cluster, but thanks to Thunderbolt 5's RDMA support, generation reaches 17-28 tokens/s. That is fully interactive for a trillion-parameter model!
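To put 17-28 tokens/s in perspective, here is the wait time for a typical long answer (integer seconds, rounded down; the 500-token length is just an illustrative choice):

```shell
# Wall-clock wait for a ~500-token reply at the measured range.
TOKENS=500
for TPS in 17 28; do
  echo "${TPS} tok/s: $(( TOKENS / TPS )) s"
done
# 17 tok/s: 29 s
# 28 tok/s: 17 s
```

Under half a minute for a full-length answer is comfortably interactive for chat-style use.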

Conclusion

This setup is by no means cheap, but it proves that with Apple Silicon and the efforts of the open-source community, decentralized AI is within reach. We do not need to ship our data to cloud giants; we can build powerful private inference clusters from the hardware at hand.

Published in Technology
