
ExecuTorch Mobile Kit

Privacy-first financial AI, fully on-device


Overview

ExecuTorch Mobile Kit is an Android app that runs a TinyLlama model entirely on-device using Meta's ExecuTorch runtime. It reads your financial emails, pulls out transaction details like amounts, merchants, and categories, and stores everything in a local SQLite database. No data ever leaves the phone.

The app uses Qualcomm's QNN backend for hardware-accelerated inference, so the model actually runs at usable speeds on mobile. It tokenizes email text locally, runs it through ExecuTorch, and produces structured financial data that feeds into a spending analytics dashboard with category breakdowns and subscription tracking.

I built this at a hackathon held at GitHub headquarters in San Francisco, focused on edge deployments and on-device AI, hosted by Meta and Cerebral Valley with support from Qualcomm. As the only solo participant, I made it to the top-6 finals, and I plan to keep working on the project over the coming months.

Key Features

- Fully on-device inference: no email content or financial data ever leaves the phone
- Automatic extraction of amounts, merchants, dates, and spending categories from financial emails
- Local SQLite storage for every extracted transaction
- Spending analytics dashboard with category breakdowns and subscription tracking
- Hardware-accelerated inference on Snapdragon devices via Qualcomm's QNN backend

Tech Stack

ExecuTorch

Meta's on-device ML runtime for running PyTorch models on mobile. The core inference engine for TinyLlama on Android; a minimal loading sketch follows the stack entries below.

TinyLlama

A compact 1.1B-parameter language model, quantized for mobile deployment. It handles email parsing and structured financial data extraction.

Android / Kotlin

Native app platform for the UI, email integration, and system-level access for on-device inference.

Qualcomm QNN

Qualcomm's AI Engine Direct backend. Offloads inference to dedicated hardware accelerators on Snapdragon chipsets.

SQLite

Local database for all extracted financial records. Supports fast queries for the analytics dashboard.
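
To make the runtime piece concrete, here is a minimal Kotlin sketch of loading an exported .pte program and running one forward pass through ExecuTorch's Android Module API. The file path and token ids are placeholders, and exact signatures vary across ExecuTorch releases, so treat this as illustrative rather than the app's actual code.

```kotlin
import org.pytorch.executorch.EValue
import org.pytorch.executorch.Module
import org.pytorch.executorch.Tensor

// Load the exported TinyLlama program; ExecuTorch dispatches ops to the
// backend (QNN here) that the .pte file was lowered for. Path is illustrative.
val module = Module.load("/data/data/com.example.app/files/tinyllama_qnn.pte")

// One decode step: current token ids in, logits over the vocabulary out.
val tokenIds = longArrayOf(1, 15043) // produced by the bundled tokenizer
val input = Tensor.fromBlob(tokenIds, longArrayOf(1, tokenIds.size.toLong()))
val output = module.forward(EValue.from(input))
val logits = output[0].toTensor().dataAsFloatArray
```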

Architecture

The user provides email content to the app. The text gets tokenized on-device using a bundled tokenizer matching the TinyLlama vocabulary, then fed into the model through ExecuTorch. On supported devices, computation is offloaded to Qualcomm's QNN backend for hardware-accelerated inference.
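
A hedged sketch of what that loop looks like in Kotlin, building on the module loaded above. The `tokenizer` handle is an assumption standing in for the app's actual component; the point is the shape of a greedy decode loop over repeated ExecuTorch forward calls.

```kotlin
// Greedy decode over the loaded module. `tokenizer` is a hypothetical wrapper
// around the bundled TinyLlama vocabulary (encode / decode / eosId).
fun runModel(prompt: String, maxNewTokens: Int = 128): String {
    val ids = tokenizer.encode(prompt).toMutableList()
    repeat(maxNewTokens) {
        val input = Tensor.fromBlob(ids.toLongArray(), longArrayOf(1, ids.size.toLong()))
        val logits = module.forward(EValue.from(input))[0].toTensor().dataAsFloatArray
        // Slice out the last position's logits and pick the argmax token.
        val vocab = logits.size / ids.size
        val last = logits.copyOfRange((ids.size - 1) * vocab, ids.size * vocab)
        val next = last.indices.maxBy { last[it] }.toLong()
        if (next == tokenizer.eosId) return tokenizer.decode(ids)
        ids.add(next)
    }
    return tokenizer.decode(ids)
}
```

Note that this naive loop re-runs the whole sequence on every step; the caching of intermediate representations mentioned under Challenges exists precisely to avoid that.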

The model output goes through a structured extraction layer that pulls out transaction amounts, merchant names, dates, and spending categories. That data gets written to a local SQLite database, indexed for fast queries. The analytics UI reads from this database to render spending dashboards, category breakdowns, and trend charts.
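
A sketch of that extraction-and-persist step, assuming the model is prompted to emit JSON. The record fields and table schema here are illustrative assumptions, not the app's actual schema.

```kotlin
import android.database.sqlite.SQLiteDatabase
import org.json.JSONObject

// Hypothetical record shape for one extracted transaction.
data class Txn(val amount: Double, val merchant: String, val date: String, val category: String)

// Pull the first {...} block out of the model output and parse it.
fun parseTxn(modelOutput: String): Txn? = runCatching {
    val start = modelOutput.indexOf('{')
    val end = modelOutput.lastIndexOf('}')
    val json = JSONObject(modelOutput.substring(start, end + 1))
    Txn(
        amount = json.getDouble("amount"),
        merchant = json.getString("merchant"),
        date = json.getString("date"),
        category = json.getString("category"),
    )
}.getOrNull()

// Persist to the local store; `db` is an open SQLiteDatabase.
fun insertTxn(db: SQLiteDatabase, t: Txn) {
    db.execSQL(
        "INSERT INTO transactions (amount, merchant, date, category) VALUES (?, ?, ?, ?)",
        arrayOf(t.amount, t.merchant, t.date, t.category),
    )
}
```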

Challenges & Solutions

Fitting an LLM in mobile memory

Mobile devices have a fraction of the RAM and storage available on servers. Even TinyLlama at 1.1B parameters needed aggressive quantization (INT8 and INT4) to fit in memory while leaving room for the Android OS and app runtime. Finding the right quantization level meant profiling the tradeoff between model accuracy and memory usage until extraction quality was good enough without blowing past device limits.
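
One way that tradeoff can surface at runtime, sketched below: pick the quantized variant to load based on how much RAM the device reports as available. The asset names and the 2.5 GB threshold are illustrative assumptions, not the app's actual policy.

```kotlin
import android.app.ActivityManager
import android.content.Context

// Illustrative heuristic: choose the INT4 or INT8 .pte variant by available RAM.
fun pickModelAsset(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
    val availGb = info.availMem / (1024.0 * 1024.0 * 1024.0)
    // INT8 weights for ~1.1B params need roughly 1.1 GB plus activations;
    // INT4 halves the weight footprint at some accuracy cost.
    return if (availGb > 2.5) "tinyllama_int8.pte" else "tinyllama_int4.pte"
}
```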

Getting inference speed to something usable

Without optimization, inference was taking multiple seconds per email. That's too slow for daily use. The Qualcomm QNN backend made the biggest difference by offloading matrix operations to dedicated neural processing hardware. Beyond that, batching token processing, caching intermediate representations, and tuning the ExecuTorch execution plan to minimize memory copies between CPU and NPU all helped bring latency down.
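
Measuring that progress is straightforward: a minimal wall-clock harness like the one below (reusing the hypothetical runModel from the Architecture sketch) is enough to compare CPU-only against QNN-offloaded runs.

```kotlin
import android.util.Log
import kotlin.system.measureTimeMillis

// Run the same email several times and log per-pass latency; the first pass
// typically includes backend warm-up, so later passes are the fairer number.
fun benchmark(prompt: String, passes: Int = 5) {
    repeat(passes) { i ->
        val ms = measureTimeMillis { runModel(prompt) }
        Log.d("LlmBench", "pass $i: ${ms} ms")
    }
}
```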

Extraction accuracy with a small model

Smaller models give up reasoning capability. Financial emails come in all kinds of formats: bank alerts, receipts, subscription confirmations, investment statements. The model needed to reliably extract structured data from all of them. Prompt engineering did a lot of the heavy lifting, with instruction templates guiding TinyLlama toward consistent structured output. Edge cases like multi-currency transactions, partial refunds, and bundled charges required iterating on both prompts and post-processing heuristics.
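
For a flavor of what that prompt engineering looks like, here is an illustrative instruction template (not the exact prompt the app ships). It pins TinyLlama to a fixed JSON schema so the post-processing layer can parse the output deterministically, which matters more for a small model than for a large one.

```kotlin
// Illustrative template; the real prompt went through many iterations.
val PROMPT_TEMPLATE = """
    You are a financial email parser. Respond with ONLY one JSON object with
    keys: amount (number), currency (string), merchant (string),
    date (YYYY-MM-DD), category (string). Use null for missing fields.
    Email:
""".trimIndent()

fun buildPrompt(email: String) = "$PROMPT_TEMPLATE\n$email\nJSON:"
```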