Running Microsoft's Phi-3 Model in an iOS app with Rust

· 3430 words · 17 minutes to read

Last month, Microsoft released Phi-3 mini, an exciting new small language model. It’s a 3.8B-parameter model that can outperform many larger models, while still being small enough to run on a phone. In this post, we’ll explore how to run the Phi-3 model inside a SwiftUI iOS application using candle, the minimalist ML framework for Rust built by the nice folks at HuggingFace.

Running Rust on iOS

I have previously blogged about using Mozilla’s uniFFI as a way to run Rust code on iOS and integrate it into a Swift application. Today, we’ll use the same approach to run the Phi-3 model on an iPad.

I recommend you read the previous post to get a better understanding of how uniFFI works. In short, it allows you to define a Rust interface that can be called from Swift. UniFFI takes care of the FFI (Foreign Function Interface) part and generates strongly typed bindings for Swift, which can then be used to call the Rust code.

The compiled Rust library, as well as those bindings, can then be packaged as a framework and included in the iOS app rather easily. This is a great way to leverage Rust’s performance and safety in an iOS app, and it’s especially useful for writing code that needs to run on many different platforms. In fact, today’s approach, while focusing on iOS, would work just as well on Android.

Running the Phi-3 model with Candle

At the moment, candle has a limitation where it does not support running on Metal (GPU) on iOS. I expect this limitation to be addressed soon; however, even running on the CPU is still a decent starting point to get the model running locally on an iPad - though it also means we will need to go for a quantized version of the model. I tested this on my M1 Air. It will probably also work on an iPhone with 6GB of RAM (the 4GB ones are probably too constrained), though I do not own a high-end iPhone to test it on.

I previously tried this approach for both Phi-2 (the predecessor of Phi-3) and for the base (non-quantized) Phi-3, and it was just too slow to be usable. However, the Q4_K_M quantized version of Phi-3 is much faster and should be usable on a phone.

Candle already contains a basic example for running the quantized Phi-3 model, which is a great baseline to get started from. I will use this example as a starting point and modify it to run on iOS.

Rust orchestration

My cargo dependencies look like this:

[package]
name = "strathweb-phi-engine"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["lib", "cdylib", "staticlib"]

[dependencies]
thiserror = "1.0"
uniffi = { version = "0.27.1", features=["build"] }
anyhow = "1.0.81"
candle-core = { version = "0.5.0" }
candle-nn = { version = "0.5.0" }
candle-transformers = { version = "0.5.0" }
hf-hub = { version = "0.3.2", features = ["tokio"] }
tokenizers = "0.15.2"

[build-dependencies]
uniffi = { version = "0.27.1", features=["build"] }
uniffi_build = "0.27.1"
uniffi_bindgen = "0.27.1"

Next, I am going to define the uniFFI interface that my iOS app will use to interact with the Rust code. This is done in a separate file, which I call strathweb-phi-engine.udl:

namespace strathweb_phi_engine {
};

dictionary InferenceOptions {
    u16 token_count;
    f64? temperature;
    f64? top_p;
    f32 repeat_penalty;
    u16 repeat_last_n;
    u64 seed;
};

dictionary EngineOptions {
    string cache_dir;
    string? system_instruction;
    string? tokenizer_repo;
    string? model_repo;
    string? model_file_name;
    string? model_revision;
};

dictionary InferenceResult {
    string result_text;
    u16 token_count;
    f64 duration;
    f64 tokens_per_second;
};

interface PhiEngine {
    [Throws=PhiError]
    constructor(EngineOptions engine_options, PhiEventHandler event_handler);

    [Throws=PhiError]
    InferenceResult run_inference(string prompt_text, [ByRef]InferenceOptions inference_options);
    
    [Throws=PhiError]
    void clear_history();
};

[Trait, WithForeign]
interface PhiEventHandler {
    [Throws=PhiError]
    void on_model_loaded();

    [Throws=PhiError]
    void on_inference_token(string token);
};

[Error]
interface PhiError {
    InitalizationError(string error_text);
    InferenceError(string error_text);
    HistoryError(string error_text);
};

This file defines the interface that the iOS app will use to interact with the Rust code - a PhiEngine interface with a run_inference method that takes a prompt text and inference options, and returns the generated text. It also declares a PhiEventHandler trait that the Rust code will use to notify the iOS app about certain events.

Let’s now write the corresponding Rust code, with placeholders for the actual Phi-3 model inference logic. The imports include all the necessary dependencies that we will need for the rest of the code later, and the generated scaffolding from the uniFFI file:

uniffi::include_scaffolding!("strathweb-phi-engine");

use anyhow::{Error as E, Result};
use candle_core::quantized::gguf_file;
use candle_core::{Device, Tensor};
use candle_transformers::generation::LogitsProcessor;
use candle_transformers::models::quantized_llama::ModelWeights as QPhi3;
use hf_hub::api::sync::ApiBuilder;
use hf_hub::Repo;
use std::collections::VecDeque;
use std::fs::File;
use std::io::Write;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use thiserror::Error;
use tokenizers::Tokenizer;

struct PhiEngine {
    pub model: QPhi3,
    pub tokenizer: Tokenizer,
    pub device: Device,
    pub history: Mutex<VecDeque<String>>,
    pub system_instruction: String,
    pub event_handler: Arc<dyn PhiEventHandler>,
}

#[derive(Debug, Clone)]
pub struct InferenceOptions {
    pub token_count: u16,
    pub temperature: Option<f64>,
    pub top_p: Option<f64>,
    pub repeat_penalty: f32,
    pub repeat_last_n: u16,
    pub seed: u64,
}

#[derive(Debug, Clone)]
pub struct InferenceResult {
    pub token_count: u16,
    pub result_text: String,
    pub duration: f64,
    pub tokens_per_second: f64,
}

#[derive(Debug, Clone)]
pub struct EngineOptions {
    pub cache_dir: String,
    pub model_repo: Option<String>,
    pub tokenizer_repo: Option<String>,
    pub model_file_name: Option<String>,
    pub model_revision: Option<String>,
    pub system_instruction: Option<String>,
}

pub trait PhiEventHandler: Send + Sync {
    fn on_model_loaded(&self) -> Result<(), PhiError>;
    fn on_inference_token(&self, token: String) -> Result<(), PhiError>;
}

impl PhiEngine {
    pub fn new(engine_options: EngineOptions, event_handler: Arc<dyn PhiEventHandler>) -> Result<Self, PhiError> {
        todo!() // filled in below
    }

    pub fn run_inference(&self, prompt_text: String, inference_options: &InferenceOptions) -> Result<InferenceResult, PhiError> {
        todo!() // filled in below
    }

    pub fn clear_history(&self) -> Result<(), PhiError> {
        todo!() // filled in below
    }
}

#[derive(Error, Debug)]
pub enum PhiError {
    #[error("InitalizationError with message: `{error_text}`")]
    InitalizationError { error_text: String },

    #[error("InferenceError with message: `{error_text}`")]
    InferenceError { error_text: String },

    #[error("History with message: `{error_text}`")]
    HistoryError { error_text: String }
}

The presence of the PhiEventHandler trait is important, as it will allow the Rust code to call back into the iOS code at certain points - such as when the model has been loaded, or when a new token has been generated during inference. This, in turn, will enable us to do streaming, greatly improving the user experience. This is not trivial to do manually over FFI, and is a great example of where uniFFI shines.
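To illustrate the shape of the contract, here is a trivial Rust-side implementation of the trait - a hypothetical handler, not part of the app, that can be handy for exercising the engine from a plain Rust test before wiring up Swift:

struct StdoutEventHandler;

impl PhiEventHandler for StdoutEventHandler {
    fn on_model_loaded(&self) -> Result<(), PhiError> {
        println!("model loaded");
        Ok(())
    }

    fn on_inference_token(&self, token: String) -> Result<(), PhiError> {
        // print each token as it arrives, mimicking what the Swift handler will later do with the UI
        print!("{}", token);
        let _ = std::io::stdout().flush(); // flush relies on std::io::Write, imported above
        Ok(())
    }
}

On the Swift side, uniFFI turns this trait into a protocol that the app implements instead - we will see that later in the view model.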

With that in place, we can start filling in the gaps. First, the constructor, where we will fetch the model and the tokenizer from HuggingFace, and load them into memory. This may not be the most efficient approach for production, but it’s decent enough for a demo:

    pub fn new(
        engine_options: EngineOptions,
        event_handler: Arc<dyn PhiEventHandler>,
    ) -> Result<Self, PhiError> {
        let start = std::time::Instant::now();
        // candle does not support Metal on iOS yet
        // this also requires building with features = ["metal"]
        //let device = Device::new_metal(0).unwrap();
        let device = Device::Cpu;

        // defaults
        let tokenizer_repo = engine_options
            .tokenizer_repo
            .unwrap_or("microsoft/Phi-3-mini-4k-instruct".to_string());
        let model_repo = engine_options
            .model_repo
            .unwrap_or("microsoft/Phi-3-mini-4k-instruct-gguf".to_string());
        let model_file_name = engine_options
            .model_file_name
            .unwrap_or("Phi-3-mini-4k-instruct-q4.gguf".to_string());
        let model_revision = engine_options
            .model_revision
            .unwrap_or("main".to_string());
        let system_instruction = engine_options.system_instruction.unwrap_or("You are a helpful assistant that answers user questions. Be short and direct in your answers.".to_string());

        let api_builder =
            ApiBuilder::new().with_cache_dir(PathBuf::from(engine_options.cache_dir.clone()));
        let api = api_builder
            .build()
            .map_err(|e| PhiError::InitalizationError {
                error_text: e.to_string(),
            })?;

        let repo = Repo::with_revision(
            model_repo.to_string(),
            hf_hub::RepoType::Model,
            model_revision,
        );
        let api = api.repo(repo);
        let model_path =
            api.get(model_file_name.as_str())
                .map_err(|e| PhiError::InitalizationError {
                    error_text: e.to_string(),
                })?;
        print!("Downloaded model to {:?}...", model_path);

        let api_builder =
            ApiBuilder::new().with_cache_dir(PathBuf::from(engine_options.cache_dir.clone()));
        let api = api_builder
            .build()
            .map_err(|e| PhiError::InitalizationError {
                error_text: e.to_string(),
            })?;
        let repo = Repo::with_revision(
            tokenizer_repo.to_string(),
            hf_hub::RepoType::Model,
            "main".to_string(),
        );
        let api = api.repo(repo);
        let tokenizer_path =
            api.get("tokenizer.json")
                .map_err(|e| PhiError::InitalizationError {
                    error_text: e.to_string(),
                })?;
        print!("Downloaded tokenizer to {:?}...", tokenizer_path);

        let mut file = File::open(&model_path).map_err(|e| PhiError::InitalizationError {
            error_text: e.to_string(),
        })?;
        let model =
            gguf_file::Content::read(&mut file).map_err(|e| PhiError::InitalizationError {
                error_text: e.to_string(),
            })?;
        let model = QPhi3::from_gguf(model, &mut file, &device).map_err(|e| {
            PhiError::InitalizationError {
                error_text: e.to_string(),
            }
        })?;
        let tokenizer =
            Tokenizer::from_file(tokenizer_path).map_err(|e| PhiError::InitalizationError {
                error_text: e.to_string(),
            })?;

        println!("Loaded the model in {:?}", start.elapsed());
        event_handler
            .on_model_loaded()
            .map_err(|e| PhiError::InitalizationError {
                error_text: e.to_string(),
            })?;

        Ok(Self {
            model,
            tokenizer,
            device,
            history: Mutex::new(VecDeque::with_capacity(6)),
            system_instruction,
            event_handler,
        })
    }

For this demo, we will use the Phi-3-mini-4k-instruct-q4.gguf model file from the microsoft/Phi-3-mini-4k-instruct-gguf HuggingFace repository. We will also need the tokenizer, which is not available there but can be fetched from the microsoft/Phi-3-mini-4k-instruct repository.

The hf-hub crate has excellent built-in functionality for loading models and tokenizers from HuggingFace, which makes this process very straightforward. The initial load might take a while because of the size of the model file (about 2.4 GB), but it is cached after the first download.

We explicitly set the device to CPU, as we are not yet able to run on Metal (GPU) on iOS.
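For reference, once candle gains Metal support on iOS, switching over should be a one-line change - trying the Metal device first and falling back to the CPU. This is a sketch only; as the commented-out line in the constructor indicates, it also requires building candle with the metal feature:

// try the first Metal device, fall back to the CPU if it is not available
let device = Device::new_metal(0).unwrap_or(Device::Cpu);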

Next, let’s implement the run_inference and clear_history methods:

pub fn run_inference(&self, prompt_text: String, inference_options: &InferenceOptions) -> Result<InferenceResult,PhiError> {
    let mut history = self.history.lock().map_err(|e| PhiError::HistoryError {
        error_text: e.to_string(),
    })?;

    // todo: this is a hack to keep the history length short so that we don't overflow the token limit
    // under normal circumstances we should count the tokens
    if history.len() == 10 {
        history.pop_front();
        history.pop_front();
    }

    history.push_back(prompt_text.clone());

    let history_prompt = history
        .iter()
        .enumerate()
        .map(|(i, text)| {
            if i % 2 == 0 {
                format!("\n<|user|>\n{}<|end|>", text)
            } else {
                format!("\n<|assistant|>\n{}<|end|>", text)
            }
        })
        .collect::<String>();

    // Phi-3 has no system prompt so we inject it as a user prompt
    let prompt_with_history = format!("<|user|>\nYour overall instructions are: {}<|end|>\n<|assistant|>Understood, I will adhere to these instructions<|end|>{}\n<|assistant|>\n", self.system_instruction, history_prompt);

    let mut pipeline = TextGeneration::new(
        &self.model,
        self.tokenizer.clone(),
        inference_options,
        &self.device,
        self.event_handler.clone(),
    );

    let response = pipeline
        .run(&prompt_with_history, inference_options.token_count)
        .map_err(|e: E| PhiError::InferenceError {
            error_text: e.to_string(),
        })?;
    history.push_back(response.result_text.clone());
    Ok(response)
}

pub fn clear_history(&self) -> Result<(), PhiError> {
    let mut history = self.history.lock().map_err(|e| PhiError::HistoryError { error_text: e.to_string() })?;
    history.clear();
    Ok(())
}

Inside the run_inference method, we first prepare the history prompt, a concatenation of the most recent prompts and responses, capped at 10 entries (5 question-answer exchanges). This is a crude mechanism to ensure we do not exceed the context window size (4k tokens for this model). Normally we should count the actual tokens here to make it accurate, but for the demo we keep it simple.
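A more robust version would measure the actual token count of the history with the tokenizer and drop the oldest exchanges until the prompt fits a budget. A minimal sketch of such a helper (hypothetical, not part of the demo code) could look like this:

// Hypothetical helper: trim the oldest prompt/response pairs until the history fits a token budget.
fn trim_history(
    history: &mut VecDeque<String>,
    tokenizer: &Tokenizer,
    max_tokens: usize,
) -> anyhow::Result<()> {
    loop {
        let text = history.iter().map(|s| s.as_str()).collect::<Vec<_>>().join("\n");
        let token_count = tokenizer
            .encode(text, true)
            .map_err(anyhow::Error::msg)?
            .get_ids()
            .len();
        if token_count <= max_tokens || history.len() < 2 {
            return Ok(());
        }
        // drop the oldest prompt/response pair
        history.pop_front();
        history.pop_front();
    }
}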

We then inject the system instruction as a user prompt and pass it to the Phi-3 model for inference - Phi-3 does not support a dedicated system prompt, so we work around that by having an initial fixed user prompt act as a sort of overarching instruction. The response is then added to the history.

Each turn of the conversation is wrapped in the <|user|> and <|assistant|> markers and terminated with <|end|>, which the model uses to distinguish between user and assistant input. This is a common pattern in conversational models and follows the model’s documentation. It’s also necessary because, at this level of working with the model, we don’t have the luxury of a high-level API like OpenAI’s to handle the chat formatting for us.
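To make the format concrete: with the default system instruction and a single example user question in the history, the assembled prompt passed to the model looks like this:

<|user|>
Your overall instructions are: You are a helpful assistant that answers user questions. Be short and direct in your answers.<|end|>
<|assistant|>Understood, I will adhere to these instructions<|end|>
<|user|>
What is the capital of Switzerland?<|end|>
<|assistant|>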

The inference itself is encapsulated in a TextGeneration struct, a convenience wrapper around the model and tokenizer that is responsible for running the generation loop. It is derived from the candle quantized example. Note that it uses the TokenOutputStream helper, which comes from the candle examples rather than from the candle crates listed in the dependencies, so it needs to be brought into the crate separately (for example, by copying it from the candle repository).

struct TextGeneration {
    model: QPhi3,
    device: Device,
    tokenizer: Tokenizer,
    logits_processor: LogitsProcessor,
    inference_options: InferenceOptions,
    event_handler: Arc<dyn PhiEventHandler>,
}

impl TextGeneration {
    fn new(
        model: &QPhi3,
        tokenizer: Tokenizer,
        inference_options: &InferenceOptions,
        device: &Device,
        event_handler: Arc<dyn PhiEventHandler>,
    ) -> Self {
        let logits_processor = LogitsProcessor::new(inference_options.seed, inference_options.temperature, inference_options.top_p);
        Self {
            model: model.clone(),
            tokenizer,
            logits_processor,
            inference_options: inference_options.clone(),
            device: device.clone(),
            event_handler: event_handler,
        }
    }

    // inference code adapted from https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized/main.rs
    fn run(&mut self, prompt: &str, sample_len: u16) -> Result<InferenceResult> {
        println!("{}", prompt);

        let mut tos = TokenOutputStream::new(self.tokenizer.clone());
        let tokens = tos
            .tokenizer()
            .encode(prompt, true).map_err(E::msg)?;
        let tokens = tokens.get_ids();

        let mut all_tokens = vec![];
        let mut next_token = {
            let mut next_token = 0;
            for (pos, token) in tokens.iter().enumerate() {
                let input = Tensor::new(&[*token], &self.device)?.unsqueeze(0)?;
                let logits = self.model.forward(&input, pos)?;
                let logits = logits.squeeze(0)?;
                next_token = self.logits_processor.sample(&logits)?;
            }
            next_token
        };

        all_tokens.push(next_token);
        if let Some(t) = tos.next_token(next_token)? {
            print!("{t}");
            std::io::stdout().flush()?;
            self.event_handler.on_inference_token(t).map_err(|e| PhiError::InferenceError { error_text: e.to_string() })?;
        }

        let binding = self.tokenizer.get_vocab(true);
        let endoftext_token = binding
            .get("<|endoftext|>")
            .ok_or_else(|| anyhow::Error::msg("No <|endoftext|> found"))?;
        let end_token = binding
            .get("<|end|>")
            .ok_or_else(|| anyhow::Error::msg("No <|end|> found"))?;
        let assistant_token = binding
            .get("<|assistant|>")
            .ok_or_else(|| anyhow::Error::msg("No <|assistant|> found"))?;

        let start_post_prompt = std::time::Instant::now();
        let mut sampled = 0;
        let to_sample = sample_len.saturating_sub(1) as usize;
        for index in 0..to_sample {
            let input = Tensor::new(&[next_token], &self.device)?.unsqueeze(0)?;
            let logits = self.model.forward(&input, tokens.len() + index)?;
            let logits = logits.squeeze(0)?;
            let logits = if self.inference_options.repeat_penalty == 1.0 {
                logits
            } else {
                let start_at = all_tokens
                    .len()
                    .saturating_sub(self.inference_options.repeat_last_n.into());
                candle_transformers::utils::apply_repeat_penalty(
                    &logits,
                    self.inference_options.repeat_penalty,
                    &all_tokens[start_at..],
                )?
            };

            next_token = self.logits_processor.sample(&logits)?;
            all_tokens.push(next_token);

            if &next_token == endoftext_token
                || &next_token == end_token
                || &next_token == assistant_token
            {
                println!("Breaking due to eos: ${:?}$", next_token);
                std::io::stdout().flush()?;
                break;
            }

            if let Some(t) = tos.next_token(next_token)? {
                self.event_handler
                    .on_inference_token(t)
                    .map_err(|e| PhiError::InferenceError {
                        error_text: e.to_string(),
                    })?;
            }
            sampled += 1;
        }

        let dt = start_post_prompt.elapsed();
        let inference_result = InferenceResult {
            token_count: sampled,
            result_text: tos.decode_all().map_err(E::msg)?,
            duration: dt.as_secs_f64(),
            tokens_per_second: sampled as f64 / dt.as_secs_f64(),
        };
        Ok(inference_result)
    }
}

The usage of our event handler is demonstrated in the run method, where we notify the iOS app about each token that is generated during inference. This is how we can support streaming of the generated text back to the iOS app, which can then display it in real-time.

Building and packaging the Rust code for iOS πŸ”—

With this in place, we have all the necessary pieces to run the Phi-3 model on an iPad. We can now build the Rust code into an xcframework, package it together with the generated Swift bindings, and include it in the iOS app. This is done using the build script below. Note that it assumes a certain directory structure - make sure to check out the repo for the full code:

#!/bin/bash

NAME="strathweb_phi_engine"
HEADERPATH="strathweb-phi-engine/bindings/strathweb_phi_engineFFI.h"
TARGETDIR="strathweb-phi-engine/target"
OUTDIR="phi.engine.sample/phi.engine.sample"
RELDIR="release"
STATIC_LIB_NAME="lib${NAME}.a"
NEW_HEADER_DIR="strathweb-phi-engine/bindings/include"

cargo build --manifest-path strathweb-phi-engine/Cargo.toml --target aarch64-apple-ios --release
cargo build --manifest-path strathweb-phi-engine/Cargo.toml --target aarch64-apple-ios-sim --release

mkdir -p "${NEW_HEADER_DIR}"
cp "${HEADERPATH}" "${NEW_HEADER_DIR}/"
cp "strathweb-phi-engine/bindings/strathweb_phi_engineFFI.modulemap" "${NEW_HEADER_DIR}/module.modulemap"

rm -rf "${OUTDIR}/${NAME}_framework.xcframework"

xcodebuild -create-xcframework \
    -library "${TARGETDIR}/aarch64-apple-ios/${RELDIR}/${STATIC_LIB_NAME}" \
    -headers "${NEW_HEADER_DIR}" \
    -library "${TARGETDIR}/aarch64-apple-ios-sim/${RELDIR}/${STATIC_LIB_NAME}" \
    -headers "${NEW_HEADER_DIR}" \
    -output "${OUTDIR}/${NAME}_framework.xcframework"

The key aspect is that we are building the Rust code for both the device and the simulator, and then packaging the resulting static libraries into an xcframework. This is necessary because an iOS app needs to run both on physical devices and in the simulator. To run the script, make sure the aarch64-apple-ios and aarch64-apple-ios-sim targets are available in your Rust toolchain; you can add them with rustup:
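rustup target add aarch64-apple-ios aarch64-apple-ios-sim

One more note on the script: it copies the C header, module map and Swift bindings from strathweb-phi-engine/bindings, so those need to have been generated with uniffi-bindgen beforehand. Assuming a uniffi-bindgen binary target is set up in the crate with the uniffi cli feature enabled (neither is shown in the Cargo.toml above, but this is the standard uniFFI setup), the generation step would look roughly like this:

cargo run --bin uniffi-bindgen -- generate src/strathweb-phi-engine.udl --language swift --out-dir bindings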

Putting it all together in an iOS app

The final piece of the puzzle is the iOS app that will act as the user interface for the Phi-3 model. This app will use the generated Swift bindings to call the Rust code, and provide the basic conversational UI for the user to interact with the model.

First we need to import the generated xcframework into the iOS app. This is done by dragging the xcframework into the Xcode project, and adding it to the Frameworks, Libraries, and Embedded Content section of the app target. The nice thing about this is that we do not need to worry about the Swift bindings or the C headers, as they are already included in the xcframework.

With the framework in place, let’s look at the view model, which interacts with the Rust bindings. For the sake of simplicity, we ignore error handling here:

class Phi3ViewModel: ObservableObject {
    var engine: PhiEngine?
    let inferenceOptions: InferenceOptions = InferenceOptions(tokenCount: 100, temperature: 0.0, topP: 1.0, repeatPenalty: 1.0, repeatLastN: 64, seed: 146628346)
    @Published var isLoading: Bool = false
    @Published var isLoadingEngine: Bool = false
    @Published var messages: [ChatMessage] = []
    @Published var prompt: String = ""
    @Published var isReady: Bool = false
    
    func loadModel() async {
        DispatchQueue.main.async {
            self.isLoadingEngine = true
        }
        // EngineOptions requires a cache directory for the downloaded model and tokenizer files;
        // the app's Caches directory is a reasonable choice here
        let cacheDir = FileManager.default.urls(for: .cachesDirectory, in: .userDomainMask).first!.path
        self.engine = try! PhiEngine(engineOptions: EngineOptions(cacheDir: cacheDir, systemInstruction: nil, tokenizerRepo: nil, modelRepo: nil, modelFileName: nil, modelRevision: nil), eventHandler: ModelEventsHandler(parent: self))
        DispatchQueue.main.async {
            self.isLoadingEngine = false
            self.isReady = true
        }
    }
    
    func fetchAIResponse() async {
        if let engine = self.engine {
            let question = self.prompt
            DispatchQueue.main.async {
                self.isLoading = true
                self.prompt = ""
                self.messages.append(ChatMessage(text: question, isUser: true, state: .ok))
                self.messages.append(ChatMessage(text: "", isUser: false, state: .waiting))
            }
            
            let inferenceResult = try! engine.runInference(promptText: question, inferenceOptions: self.inferenceOptions)
            print("\nTokens Generated: \(inferenceResult.tokenCount), Tokens per second: \(inferenceResult.tokensPerSecond), Duration: \(inferenceResult.duration)s")
            
            DispatchQueue.main.async {
                self.isLoading = false
            }
        }
    }
    
    class ModelEventsHandler : PhiEventHandler {
        unowned let parent: Phi3ViewModel
        
        init(parent: Phi3ViewModel) {
            self.parent = parent
        }
        
        func onInferenceToken(token: String) throws {
            DispatchQueue.main.async {
                if let lastMessage = self.parent.messages.last {
                    let updatedText = lastMessage.text + token
                    if let index = self.parent.messages.firstIndex(where: { $0.id == lastMessage.id }) {
                        self.parent.messages[index] = ChatMessage(text: updatedText, isUser: false, state: .ok)
                    }
                }
            }
        }
        
        func onModelLoaded() throws {
            print("MODEL LOADED")
        }
    }
}

The view model makes use of streaming by implementing the PhiEventHandler protocol (the Swift counterpart that uniFFI generates for the Rust trait) and updating the UI in real time as tokens are generated. The loadModel method loads the model, and the fetchAIResponse method generates a response to a user prompt. The messages are stored in an array of ChatMessage objects, each of which has a text, a flag indicating whether it’s a user message, and a state.

enum MessageState {
    case ok
    case waiting
}

struct ChatMessage: Identifiable {
    let id = UUID()
    let text: String
    let isUser: Bool
    let state: MessageState
}

The final piece is the SwiftUI view that will display the messages and allow the user to interact with the Phi-3 model. Yet again, let me emphasize that this is a very basic implementation, and there are many improvements that could be made, such as better error handling or a more sophisticated UI. While we are not going to storm to the top of the App Store with this, it’s a good starting point for getting the model running on an iPad.

struct ContentView: View {
    @StateObject var viewModel = Phi3ViewModel()

    var body: some View {
        NavigationStack {
            if !viewModel.isReady {
                Spacer()
                if viewModel.isLoadingEngine {
                    ProgressView()
                } else {
                    Button("Load model") {
                        Task {
                            await viewModel.loadModel()
                        }
                    }
                }
                Spacer()
            } else {
                VStack(spacing: 0) {
                    ScrollViewReader { proxy in
                        ScrollView {
                            VStack(alignment: .leading, spacing: 8) {
                                ForEach(viewModel.messages) { message in
                                    MessageView(message: message).padding(.bottom)
                                }
                            }
                            .id("wrapper").padding()
                            .padding()
                        }
                        .onChange(of: viewModel.messages.last?.id, perform: { value in
                            if viewModel.isLoading {
                                proxy.scrollTo("wrapper", anchor: .bottom)
                            } else if let lastMessage = viewModel.messages.last {
                                proxy.scrollTo(lastMessage.id, anchor: .bottom)
                            }
                            
                        })
                    }
                    
                    HStack {
                        TextField("Type a question...", text: $viewModel.prompt, onCommit: {
                            Task {
                                await viewModel.fetchAIResponse()
                            }
                        })
                        .padding(10)
                        .background(Color.gray.opacity(0.2))
                        .cornerRadius(20)
                        .padding(.horizontal)
                        
                        Button(action: {
                            Task {
                                await viewModel.fetchAIResponse()
                            }
                        }) {
                            Image(systemName: "paperplane.fill")
                                .font(.system(size: 24))
                                .foregroundColor(.blue)
                        }
                        .padding(.trailing)
                    }
                    .padding(.bottom)
                }
            }
        }.navigationTitle("Phi-3 Assistant")
    }
}

struct MessageView: View {
    let message: ChatMessage

    var body: some View {
        HStack {
            if message.isUser {
                Spacer()
                Text(message.text)
                    .padding()
                    .background(Color.blue)
                    .foregroundColor(.white)
                    .cornerRadius(10)
            } else {
                if message.state == .waiting {
                    TypingIndicatorView()
                } else {
                    VStack {
                        Text(message.text)
                            .padding()
                    }
                    .background(Color.gray.opacity(0.1))
                    .cornerRadius(10)
                    Spacer()
                }
            }
        }
        .padding(.horizontal)
    }
}

struct TypingIndicatorView: View {
    @State private var shouldAnimate = false

    var body: some View {
        HStack {
            ForEach(0..<3) { index in
                Circle()
                    .frame(width: 10, height: 10)
                    .foregroundColor(.gray)
                    .offset(y: shouldAnimate ? -5 : 0)
                    .animation(
                        Animation.easeInOut(duration: 0.5)
                            .repeatForever()
                            .delay(Double(index) * 0.2)
                    )
            }
        }
        .onAppear { shouldAnimate = true }
        .onDisappear { shouldAnimate = false }
    }
}

And here is the end result, showing me interacting with the Phi-3 model from the app. This is captured in the simulator, but I also tested it on an iPad and it works just as well.

This is after the model and all its 2.4 GB were already downloaded from Hugging Face, so on what we’d refer to as a “subsequent use” of the app. The recording is not sped up, and you can see the text being streamed back to the UI. Looking at the metrics, we are getting about 12 tokens per second from the CPU. Given that we are also streaming, the user experience is not bad at all.

Summary

In this post, we implemented a proof of concept for running the Phi-3 model on iOS using Rust and uniFFI. We used the generated Swift bindings to call the Rust code, and displayed the generated text in real time in a SwiftUI view. While this is a very basic implementation, it demonstrates the potential of running language models locally on portable devices, even if we had to make some compromises to get it to work, such as explicitly running on the CPU.

If you would like to explore the code further, you can find the full implementation in the strathweb-phi-engine repository on GitHub.

I have not tested it on Android, but in principle, the exact same code should work there. The only difference would be in the build script, where you would need to target the aarch64-linux-android and x86_64-linux-android targets instead, and of course the compiled Rust static library would have to be packaged not into an xcframework but into an AAR (Android Archive) file.
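For reference, and as a rough sketch only, the Android cross-compilation step could look something like this, assuming the cargo-ndk helper tool is used to take care of the NDK toolchain configuration:

rustup target add aarch64-linux-android x86_64-linux-android
cargo ndk -t arm64-v8a -t x86_64 -o ./jniLibs build --release

The resulting shared libraries, together with Kotlin bindings generated by uniffi-bindgen (--language kotlin), would then be packaged into the AAR.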
