Running Phi models on iOS with Apple MLX Framework

· 1651 words · 8 minutes to read

As I previously blogged a few times, I have been working on the Strathweb Phi Engine, a cross-platform library for running Phi model inference via a simple, high-level API from C#, Swift, Kotlin and Python. This of course includes the capability of running Phi models on iOS devices, and the sample repo contains a demo SwiftUI application that demonstrates how to do this.

Today I wanted to show an alternative way of running Phi models on iOS devices, using Apple’s MLX framework. I previously blogged about fine-tuning Phi models on iOS using MLX, so that post is a good read if you want to learn more about the MLX framework and how to use it.

Prerequisites

To follow along with this implementation, you’ll need:

  • macOS with Xcode 16 (or higher)
  • iOS 18+ device with Apple Silicon (iPhone or iPad meeting Apple Intelligence requirements, so with 8GB RAM or more)
  • basic knowledge of Swift and SwiftUI

Project Setup and Dependencies

We’ll start by creating a new iOS project and adding the MLX Swift Examples package:

// In Xcode: File > Add Package Dependencies
// URL: https://github.com/ml-explore/mlx-swift-examples
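
The rest of this post assumes the package is added through Xcode, but if you manage dependencies in a Package.swift manifest instead, the equivalent declaration could look roughly like the sketch below, pinned to the 2.21.2 release mentioned later in this post (the MLXLLM and MLXLMCommon product names and the target name are assumptions on my part):

// Sketch of the relevant Package.swift entries - product and target names are assumptions
dependencies: [
    .package(url: "https://github.com/ml-explore/mlx-swift-examples.git", from: "2.21.2")
],
targets: [
    .target(
        name: "YourApp",   // hypothetical target name
        dependencies: [
            .product(name: "MLXLLM", package: "mlx-swift-examples"),
            .product(name: "MLXLMCommon", package: "mlx-swift-examples")
        ]
    )
]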

While the base MLX Swift package provides the core tensor operations we will be relying on, the examples package extends it with additional components that are useful for working with LLMs:

  • model loading utilities for downloading from Hugging Face
  • tokenizer integration for text processing
  • inference helpers for text generation
  • pre-configured model definitions for popular architectures

This makes the actual implementation work very straightforward, as we can rely on the pre-defined model configurations and utilities for inference.
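
To illustrate just how compact this can be, here is a condensed sketch of the flow we will flesh out in the ViewModel later in this post; it uses the same LLMModelFactory, ModelRegistry and MLXLMCommon.generate calls as the full implementation, just stripped down to a single request (the quickInference function name is mine):

import MLXLLM
import MLXLMCommon

// Condensed sketch of the full flow: download/load a pre-configured Phi model,
// tokenize a single question and generate a response. The ViewModel below adds
// streaming UI updates, chat history and error handling on top of this.
func quickInference(question: String) async throws -> String {
    let container = try await LLMModelFactory.shared.loadContainer(
        configuration: ModelRegistry.phi3_5_4bit
    ) { progress in
        print("Download progress: \(Int(progress.fractionCompleted * 100))%")
    }

    return try await container.perform { context in
        // Tokenize the chat-formatted input
        let input = try await context.processor.prepare(
            input: .init(messages: [["role": "user", "content": question]]))

        // Generate up to 1024 tokens
        let result = try MLXLMCommon.generate(
            input: input,
            parameters: GenerateParameters(temperature: 0.6),
            context: context
        ) { tokens in
            tokens.count >= 1024 ? .stop : .more
        }

        return context.tokenizer.decode(tokens: result.tokens)
    }
}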

Memory Configuration

We’ll need to add appropriate entitlements to our application. Firstly, we need network access to be able to pull the model from Hugging Face on demand. Secondly, the increased-memory-limit entitlement is particularly important for accommodating larger models, as LLMs require substantial memory resources.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>com.apple.security.app-sandbox</key>
    <true/>
    <key>com.apple.security.network.client</key>
    <true/>
    <key>com.apple.developer.kernel.increased-memory-limit</key>
    <true/>
</dict>
</plist>

Core Implementation

Let’s examine the ViewModel that handles model loading and inference:

import MLX
import MLXLLM
import MLXLMCommon
import SwiftUI

@MainActor
class PhiViewModel: ObservableObject {
    @Published var isLoading: Bool = false
    @Published var isLoadingEngine: Bool = false
    @Published var messages: [ChatMessage] = []
    @Published var prompt: String = ""
    @Published var isReady: Bool = false
    
    private let maxTokens = 1024
    private var modelContainer: ModelContainer?
    
    func loadModel() async {
        DispatchQueue.main.async {
            self.isLoadingEngine = true
        }
        
        do {
            MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)
            
            let modelConfig = ModelRegistry.phi3_5_4bit
            
            print("Loading \(modelConfig.name)...")
            self.modelContainer = try await LLMModelFactory.shared.loadContainer(
                configuration: modelConfig
            ) { progress in
                print("Download progress: \(Int(progress.fractionCompleted * 100))%")
            }
            
            if let container = self.modelContainer {
                let numParams = await container.perform { context in
                    context.model.numParameters()
                }
                print("Model loaded. Parameters: \(numParams / (1024*1024))M")
            }
            
            DispatchQueue.main.async {
                self.isLoadingEngine = false
                self.isReady = true
            }
        } catch {
            print("Failed to load model: \(error)")
            
            DispatchQueue.main.async {
                self.isLoadingEngine = false
            }
        }
    }
    
    func fetchAIResponse() async {
        guard !isLoading, let container = self.modelContainer else {
            print("Cannot generate: model not loaded or already processing")
            return
        }
        
        let userQuestion = prompt
        let currentMessages = self.messages
        
        DispatchQueue.main.async {
            self.isLoading = true
            self.prompt = ""
            self.messages.append(ChatMessage(text: userQuestion, isUser: true, state: .ok))
            self.messages.append(ChatMessage(text: "", isUser: false, state: .waiting))
        }
        
        do {
            let _ = try await container.perform { context in
                // Format message history for the model
                var messageHistory: [[String: String]] = [
                    ["role": "system", "content": "You are a helpful assistant."]
                ]
                
                for message in currentMessages {
                    let role = message.isUser ? "user" : "assistant"
                    messageHistory.append(["role": role, "content": message.text])
                }
                
                messageHistory.append(["role": "user", "content": userQuestion])
                
                let input = try await context.processor.prepare(
                    input: .init(messages: messageHistory))
                let startTime = Date()
                
                let result = try MLXLMCommon.generate(
                    input: input,
                    parameters: GenerateParameters(temperature: 0.6),
                    context: context
                ) { tokens in
                    let output = context.tokenizer.decode(tokens: tokens)
                    
                    Task { @MainActor in
                        if let index = self.messages.lastIndex(where: { !$0.isUser }) {
                            self.messages[index] = ChatMessage(
                                text: output,
                                isUser: false,
                                state: .ok
                            )
                        }
                    }
                    
                    if tokens.count >= self.maxTokens {
                        return .stop
                    } else {
                        return .more
                    }
                }
                
                let finalOutput = context.tokenizer.decode(tokens: result.tokens)
                Task { @MainActor in
                    if let index = self.messages.lastIndex(where: { !$0.isUser }) {
                        self.messages[index] = ChatMessage(
                            text: finalOutput,
                            isUser: false,
                            state: .ok
                        )
                    }
                    
                    self.isLoading = false
                    
                    print("Inference complete:")
                    print("Tokens: \(result.tokens.count)")
                    print("Tokens/second: \(result.tokensPerSecond)")
                    print("Time: \(Date().timeIntervalSince(startTime))s")
                }
                
                return result
            }
        } catch {
            print("Inference error: \(error)")
            
            DispatchQueue.main.async {
                if let index = self.messages.lastIndex(where: { !$0.isUser }) {
                    self.messages[index] = ChatMessage(
                        text: "Sorry, an error occurred: \(error.localizedDescription)",
                        isUser: false,
                        state: .ok
                    )
                }
                self.isLoading = false
            }
        }
    }
}

The above code demonstrates how to load a Phi-3.5 model using the MLX framework and perform inference with token-by-token output, relying on the framework to handle model loading, tokenization, and text generation in a straightforward manner.

The loadModel method initializes the model and sets up the GPU cache limit. The fetchAIResponse method processes user input, generates a response using the model, and updates the UI with the generated text. We will display the token-by-token output in the UI, providing the common streaming experience that users expect from chat applications.
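
One thing the listing above does not show is the ChatMessage type that the ViewModel publishes to the view layer; its actual definition lives in the sample repo, but a minimal sketch compatible with how it is used could look like this (the exact shape of MessageState is an assumption):

import Foundation

// Minimal chat message model (sketch) - the real definition lives in the sample repo
enum MessageState {
    case ok        // message text is complete
    case waiting   // assistant response has not arrived yet
}

struct ChatMessage: Identifiable {
    // `id` has a default value, so the synthesized initializer is ChatMessage(text:isUser:state:)
    let id = UUID()
    let text: String
    let isUser: Bool
    let state: MessageState
}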

The ViewModel demonstrates the key MLX integration points:

  • setting GPU cache limits with MLX.GPU.set(cacheLimit:) to optimize memory usage on mobile devices
  • using LLMModelFactory to download the model on-demand and initialize the MLX-optimized model
  • accessing the model’s parameters and structure through the ModelContainer
  • leveraging MLX’s token-by-token generation through the MLXLMCommon.generate method
  • managing the inference process with appropriate temperature settings and token limits

Model Considerations

MLX only supports models previously converted to the MLX format, which can be done locally using the MLX CLI but is also handled by the MLX community. You can find pre-converted models at huggingface.co/mlx-community.

Our code so far uses Phi-3.5-mini, quantized to 4 bits. This model is a good choice for most iOS devices, as it balances performance and memory usage, and it is also predefined in the MLX library, making it easy to use. It is also possible to use Phi-4 mini (at the same quantization level) instead, but this is not supported in the latest MLX Swift Examples release (2.21.2, from December 2024). The support has already been added to the main branch of the MLX Swift Examples repository, so you can use that version to access the latest features and models. This means referencing the package directly from the main branch in our Package.swift or via Xcode’s package manager interface:

// In your Package.swift or via Xcode's package manager interface
 .package(url: "https://github.com/ml-explore/mlx-swift-examples.git", branch: "main")

Phi-4 mini can then be referenced from Hugging Face as follows:

let phi4_mini_4bit = ModelConfiguration(
    id: "mlx-community/Phi-4-mini-instruct-4bit",
    defaultPrompt: "Explain quantum computing in simple terms.",
    extraEOSTokens: ["<|end|>"]
)

// Then use this configuration when loading the model
self.modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: phi4_mini_4bit
) { progress in
    print("Download progress: \(Int(progress.fractionCompleted * 100))%")
}

This gives us access to the latest model configurations, such as Phi-4, before they’re included in an official Swift MLX release. You can use this approach to try different versions of Phi models, or even other models that have been converted to the MLX format.

User Interface Implementation

The user interface then builds on the ViewModel to create a simple chat experience:

import SwiftUI

struct ContentView: View {
    @StateObject private var viewModel = PhiViewModel()

    var body: some View {
        NavigationStack {
            if !viewModel.isReady {
                Spacer()
                if viewModel.isLoadingEngine {
                    ProgressView()
                    Text("Loading Phi model...")
                        .padding()
                } else {
                    VStack {
                        Text("On-device AI ready to use")
                            .font(.headline)
                            .padding()
                        
                        Button("Load model") {
                            Task {
                                await viewModel.loadModel()
                            }
                        }
                        .buttonStyle(.borderedProminent)
                    }
                }
                Spacer()
            } else {
                VStack(spacing: 0) {
                    ScrollViewReader { proxy in
                        ScrollView {
                            VStack(alignment: .leading, spacing: 8) {
                                ForEach(viewModel.messages) { message in
                                    MessageView(message: message)
                                        .padding(.bottom)
                                }
                            }
                            .id("wrapper")
                            .padding()
                        }
                        .onChange(of: viewModel.messages.last?.id) { value in
                            withAnimation {
                                proxy.scrollTo("wrapper", anchor: .bottom)
                            }
                        }
                    }
                    
                    HStack {
                        TextField("Type a question...", text: $viewModel.prompt, onCommit: {
                            Task {
                                await viewModel.fetchAIResponse()
                            }
                        })
                        .padding(10)
                        .background(Color.gray.opacity(0.2))
                        .cornerRadius(20)
                        .padding(.horizontal)
                        
                        Button(action: {
                            Task {
                                await viewModel.fetchAIResponse()
                            }
                        }) {
                            Image(systemName: "paperplane.fill")
                                .font(.system(size: 24))
                                .foregroundColor(.blue)
                        }
                        .padding(.trailing)
                        .disabled(viewModel.prompt.isEmpty || viewModel.isLoading)
                    }
                    .padding(.bottom)
                }
            }
        }
        .navigationTitle("Phi On-Device")
    }
}

The UI consists of three main components that work together to create a basic chat interface. ContentView creates a two-state interface that shows either a loading button or the chat interface depending on model readiness. MessageView renders individual chat messages differently based on whether they are user messages (right-aligned, blue background) or Phi model responses (left-aligned, gray background). TypingIndicatorView provides a simple animated indicator to show that the AI is processing.
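
Neither MessageView nor TypingIndicatorView is shown in the listing above; both live in the sample repo. A minimal sketch matching the description, and using the ChatMessage type sketched earlier (the exact styling and animation are my assumptions), could look like this:

import SwiftUI

// Sketch of the message bubble: user messages on the right in blue,
// model responses on the left in gray, with a typing indicator while waiting
struct MessageView: View {
    let message: ChatMessage

    var body: some View {
        HStack {
            if message.isUser { Spacer() }

            if !message.isUser && message.state == .waiting {
                TypingIndicatorView()
            } else {
                Text(message.text)
                    .padding(10)
                    .background(message.isUser ? Color.blue : Color.gray.opacity(0.2))
                    .foregroundColor(message.isUser ? .white : .primary)
                    .cornerRadius(16)
            }

            if !message.isUser { Spacer() }
        }
    }
}

// Sketch of a simple animated "typing" indicator
struct TypingIndicatorView: View {
    @State private var animating = false

    var body: some View {
        HStack(spacing: 4) {
            ForEach(0..<3) { index in
                Circle()
                    .frame(width: 8, height: 8)
                    .foregroundColor(.gray)
                    .opacity(animating ? 1.0 : 0.3)
                    .animation(
                        .easeInOut(duration: 0.6)
                            .repeatForever()
                            .delay(Double(index) * 0.2),
                        value: animating
                    )
            }
        }
        .padding(10)
        .onAppear { animating = true }
    }
}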

Trying it out

We are now ready to build and run the application.

MLX does not support the simulator! You must run the app on a physical device with an Apple Silicon chip. See here for more information. If you’d like to debug locally on a Mac, a good option is to create a multi-targeted project with a macOS target and an iOS target.

When the app launches, tap the “Load model” button to download and initialize the Phi-3.5 (or, depending on your configuration, Phi-4 mini) model. This process may take some time depending on your internet connection, as it involves downloading the model from Hugging Face. Our implementation includes only a spinner to indicate loading, but you can see the actual progress in the Xcode console.

Once loaded, you can interact with the model by typing questions in the text field and tapping the send button.

Here is how our application should behave, running on an iPad Air M1:

Conclusion

And that’s it! Running Phi models directly on iOS devices turns out to be really easy and very, very cool. Apple’s MLX framework provides a powerful foundation for implementing these models efficiently, opening up new possibilities for building intelligent applications that provide first-class user privacy and work reliably regardless of network conditions. MLX also performs really well and can take advantage of Metal.

The source code for this post is available, as always, on GitHub. A variant of this post was also contributed to the official Phi Cookbook.
