Announcing Strathweb Phi Engine - a cross-platform library for running Phi-3 anywhere


I recently wrote a blog post about using Rust to run the Phi-3 model on iOS. The post received an overwhelmingly positive response, and I got a lot of questions about running Phi-3 with a similar approach on other platforms, such as Android, Windows, macOS or Linux. Today, I’m excited to announce the project I have been working on recently - Strathweb Phi Engine, a cross-platform library for running Phi-3 (almost) anywhere.

Overview

The value proposition of the library is simple - abstract away all the low-level details of running a Phi-3 model, and provide a simple API to run it on any platform, using any language. The library is built on top of the excellent candle library from Hugging Face, a minimalist ML framework for Rust.

Strathweb Phi Engine provides a simple API to interact with, without the need to write any Rust code or deal with FFI directly. Instead, each supported language - Swift, Kotlin, C# - gets language-specific bindings that act as the API surface of the library. The library also hides all the complexities of running the model, such as fetching and loading it, tokenizing the input, running inference and decoding the output. It is designed to be lightweight, with minimal overhead, and should be easy to include in your project.

You can find the source code, documentation and samples of using the library in C#, Swift and Kotlin apps on GitHub.

Getting started

Imagine you would like to bootstrap a Phi-3 model capable of writing a little haiku about hockey. Here are annotated code samples showing how you could do it in each of the supported languages.

C#:

using uniffi.strathweb_phi_engine;

// define inference options
var inferenceOptionsBuilder = new InferenceOptionsBuilder();
inferenceOptionsBuilder.WithTemperature(0.9);
inferenceOptionsBuilder.WithTokenCount(100);
var inferenceOptions = inferenceOptionsBuilder.Build();

// set up cache directory for the model
var cacheDir = Path.Combine(Directory.GetCurrentDirectory(), ".cache");

var modelBuilder = new PhiEngineBuilder();

// attach an optional event handler which will be called during inference
// this allows streaming tokens as they are generated
modelBuilder.WithEventHandler(new BoxedPhiEventHandler(new ModelEventsHandler()));

// build the model with a system instruction
var model = modelBuilder.BuildStateful(cacheDir, "You are a hockey poet");

// run inference
var result = model.RunInference("Write a haiku about ice hockey", inferenceOptions);

// inference result and stats are available in the result object;
// when streaming is used, the text has already been emitted via the event handler
Console.WriteLine($"{Environment.NewLine}Tokens Generated: {result.tokenCount}{Environment.NewLine}Tokens per second: {result.tokensPerSecond}{Environment.NewLine}Duration: {result.duration}s");

class ModelEventsHandler : PhiEventHandler
{
    public void OnInferenceToken(string token)
    {
        Console.Write(token);
    }

    public void OnModelLoaded()
    {
        Console.WriteLine("Model loaded!");
    }
}

Swift:

import Foundation

// define inference options
let inferenceOptionsBuilder = InferenceOptionsBuilder()
try! inferenceOptionsBuilder.withTemperature(temperature: 0.9)
try! inferenceOptionsBuilder.withSeed(seed: 146628346)
let inferenceOptions = try! inferenceOptionsBuilder.build()

// set up cache directory for the model
let cacheDir = FileManager.default.currentDirectoryPath.appending("/.cache")

let modelBuilder = PhiEngineBuilder()

// attach an optional event handler which will be called during inference
// this allows streaming tokens as they are generated
try! modelBuilder.withEventHandler(eventHandler: BoxedPhiEventHandler(handler: ModelEventsHandler()))

// try enabling GPU - Metal is supported on macOS
let gpuEnabled = try! modelBuilder.tryUseGpu()

// build the model with a system instruction
let model = try! modelBuilder.buildStateful(cacheDir: cacheDir, systemInstruction: "You are a hockey poet")

// run inference
let result = try! model.runInference(promptText: "Write a haiku about ice hockey", inferenceOptions: inferenceOptions)

// inference result and stats are available in the result object;
// when streaming is used, the text has already been emitted via the event handler
print("""

****************************************
 πŸ“ Tokens Generated: \(result.tokenCount)
 πŸ–₯️ Tokens per second: \(result.tokensPerSecond)
 ⏱️ Duration: \(result.duration)s
 🏎️ GPU enabled: \(gpuEnabled)
""")

class ModelEventsHandler: PhiEventHandler {
    func onInferenceToken(token: String) {
        print(token, terminator: "")
    }

    func onModelLoaded() {
        print("""
 🧠 Model loaded!
****************************************
""")
    }
}

Kotlin:

// the Kotlin bindings are generated by uniffi (the package name may differ in your setup)
import uniffi.strathweb_phi_engine.*
import java.io.File

// define inference options
val inferenceOptionsBuilder = InferenceOptionsBuilder()
inferenceOptionsBuilder.withTemperature(0.9)
inferenceOptionsBuilder.withSeed(146628346.toULong())
val inferenceOptions = inferenceOptionsBuilder.build()

// set up cache directory for the model
val cacheDir = File(System.getProperty("user.dir"), ".cache").absolutePath

val modelBuilder = PhiEngineBuilder()

// attach an optional event handler which will be called during inference
// this allows streaming tokens as they are generated
modelBuilder.withEventHandler(BoxedPhiEventHandler(ModelEventsHandler()))
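
// try enabling GPU acceleration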
val gpuEnabled = modelBuilder.tryUseGpu()

// build the model with a system instruction
val model = modelBuilder.buildStateful(cacheDir, "You are a hockey poet")

// run inference
val result = model.runInference("Write a haiku about ice hockey", inferenceOptions)

// inference result and stats are available in the result object;
// when streaming is used, the text has already been emitted via the event handler
println(
    """
    
    ****************************************
    πŸ“ Tokens Generated: ${result.tokenCount}
    πŸ–₯️ Tokens per second: ${result.tokensPerSecond}
    ⏱️ Duration: ${result.duration}s
    🏎️ GPU enabled: $gpuEnabled
    """.trimIndent()
)

class ModelEventsHandler : PhiEventHandler {
    override fun onInferenceToken(token: String) {
        print(token)
    }

    override fun onModelLoaded() {
        println(
            """
            🧠 Model loaded!
            ****************************************
            """.trimIndent()
        )
    }
}

The examples above show how to run a Phi-3 model in stateful mode, with tokens streamed as they are generated. In this mode the library manages the conversation history implicitly and trims the context window as needed, which makes it the lowest-ceremony way to use the engine.
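
For instance, continuing the Swift sample above, a second call to runInference on the same model instance carries the earlier exchange along automatically - a minimal sketch reusing only the names already defined there:

// follow-up prompt on the same stateful model - the engine keeps the
// conversation history, so the prompt can refer to the previous exchange
let followUp = try! model.runInference(promptText: "Now write one about the goalie", inferenceOptions: inferenceOptions)
print("\nFollow-up tokens generated: \(followUp.tokenCount)")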

The library downloads the model from Hugging Face, caches it locally, and manages the model lifecycle. By default it uses the quantized Phi-3-mini-4k-instruct-gguf, but you can specify any other model from Hugging Face by providing the model repo, file name and revision, or point to a model on the local filesystem. This is shown below (in Swift).

let engineBuilder = PhiEngineBuilder()
engineBuilder.withModelProvider(modelProvider: 
    PhiModelProvider.huggingFace(
        modelRepo: "...repo...", 
        modelFileName: "...model name...", 
        modelRevision: "...revision..."))

// or

let engineBuilder = PhiEngineBuilder()
engineBuilder.withModelProvider(modelProvider: 
    PhiModelProvider.fileSystem(
        modelPath: "...path to GGUF..."))

It is also possible to use the stateless mode, which has a different API - the conversation history has to be supplied explicitly with every call. This mode resembles the typical APIs we are used to when interacting with LLMs in the cloud, such as OpenAI's, and is useful when you want to manage the conversation history yourself, or when you want to run multiple inferences in parallel. It also allows changing the system instruction per interaction.
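
To give a rough idea of the shape of that API, here is a sketch in Swift, reusing cacheDir and inferenceOptions from the earlier sample. The names buildStateless, ConversationContext and ConversationMessage are illustrative assumptions only - refer to the repository documentation and samples for the exact signatures.

// illustrative sketch - buildStateless, ConversationContext and ConversationMessage
// are assumed names; check the repository docs for the real API surface
let statelessModel = try! PhiEngineBuilder().buildStateless(cacheDir: cacheDir)

// the caller owns the conversation history and passes it in on every call,
// together with the system instruction for this particular interaction
let history = ConversationContext(
    systemInstruction: "You are a hockey poet",
    messages: [
        ConversationMessage(role: .user, text: "Write a haiku about ice hockey"),
        ConversationMessage(role: .assistant, text: "Blades carve frozen glass")
    ])

let reply = try! statelessModel.runInference(
    promptText: "Now write one about the goalie",
    conversationContext: history,
    inferenceOptions: inferenceOptions)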

Integration

Since Strathweb Phi Engine is a cross-platform library, it also has a cross-platform integration story. This will be different for every platform and language, as each has its own way of integrating native libraries.

For .NET (C#), the library is distributed as a NuGet package, which you can include in your project by adding a reference to it in your project file. It is not yet published to NuGet, but it will be soon; at the moment you can fetch it from the build artifacts on the GitHub repository. The package works on Windows x64, macOS arm64 and Linux x64, but other platforms can also work if you build the project from source using the instructions in the repository.

For Swift, the library is distributed as an XCFramework, which you can include in your project by adding it to the Xcode project. You will also need the Swift language bindings, which provide the API surface to call. Both can be found in the build artifacts on the GitHub repository. Soon there will also be a Swift Package, which will make it even easier to include the library in your project.

For Kotlin, there is currently no packaging, but I plan to provide an AAR soon. For now, you can include the library in your project by building it from source, using an approach similar to the one in the Kotlin sample in the samples section of the repo.

Next steps

At the moment, the library is in the early stages of development, and the APIs are subject to change. It is open to contributions, so if you have an idea for a feature or a use case that is not covered yet, feel free to open an issue or a pull request.

I hope that this library will make it easier for developers to run Phi-3 models on any platform, and I’m excited to see what you will build with it!
