Privacy-first financial AI, fully on-device
AlphaExecuTorch Mobile Kit is an Android app that runs a TinyLlama model entirely on-device using Meta's ExecuTorch runtime. It reads your financial emails, pulls out transaction details like amounts, merchants, and categories, and stores everything in a local SQLite database. No data ever leaves the phone.
The app uses Qualcomm's QNN backend for hardware-accelerated inference, so the model actually runs at usable speeds on mobile. It tokenizes email text locally, runs it through ExecuTorch, and produces structured financial data that feeds into a spending analytics dashboard with category breakdowns and subscription tracking.
I built this at a hackathon at GitHub headquarters in San Francisco, focused on edge deployments and on-device AI, hosted by Meta and Cerebral Valley with support from Qualcomm. I was the only solo participant and made it to the top-6 finals. I plan to keep working on this project in the coming months.
ExecuTorch: Meta's on-device ML runtime for running PyTorch models on mobile. The core inference engine for TinyLlama on Android.
TinyLlama (1.1B): Compact 1.1B-parameter language model, quantized for mobile deployment. Handles email parsing and structured financial data extraction.
Android: Native app platform for the UI, email integration, and system-level access for on-device inference.
Qualcomm QNN: Qualcomm's AI Engine Direct backend. Offloads inference to dedicated hardware accelerators on Snapdragon chipsets.
SQLite: Local database for all extracted financial records. Supports fast queries for the analytics dashboard.
The user provides email content to the app. The text gets tokenized on-device using a bundled tokenizer matching the TinyLlama vocabulary, then fed into the model through ExecuTorch. On supported devices, computation is offloaded to Qualcomm's QNN backend for hardware-accelerated inference.
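The tokenization step can be sketched with a toy greedy longest-match tokenizer. This is an illustrative stand-in only: the real app bundles the TinyLlama (SentencePiece) tokenizer, and the tiny vocabulary below is made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy greedy longest-match tokenizer illustrating on-device tokenization.
// The vocabulary and the greedy strategy are illustrative stand-ins for the
// bundled TinyLlama (SentencePiece) tokenizer.
public class ToyTokenizer {
    static final Map<String, Integer> VOCAB = new LinkedHashMap<>();
    static {
        VOCAB.put("<unk>", 0); VOCAB.put("charge", 1); VOCAB.put("charged", 2);
        VOCAB.put("$", 3); VOCAB.put("12", 4); VOCAB.put(".", 5);
        VOCAB.put("99", 6); VOCAB.put(" ", 7);
    }

    static List<Integer> tokenize(String text) {
        List<Integer> ids = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Greedily take the longest vocabulary entry matching at position i.
            String best = null;
            for (String tok : VOCAB.keySet()) {
                if (text.startsWith(tok, i) && (best == null || tok.length() > best.length())) {
                    best = tok;
                }
            }
            if (best == null) {
                ids.add(VOCAB.get("<unk>"));  // unknown character fallback
                i += 1;
            } else {
                ids.add(VOCAB.get(best));
                i += best.length();
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("charged $12.99")); // [2, 7, 3, 4, 5, 6]
    }
}
```

Note how "charged" wins over the shorter match "charge"; real subword tokenizers resolve ambiguity with learned merge rules rather than plain longest-match.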
The model output goes through a structured extraction layer that pulls out transaction amounts, merchant names, dates, and spending categories. That data gets written to a local SQLite database, indexed for fast queries. The analytics UI reads from this database to render spending dashboards, category breakdowns, and trend charts.
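As a minimal sketch of that extraction layer, the snippet below parses a key/value response from the model into a typed record ready to be inserted into SQLite. The "amount:/merchant:/date:/category:" output convention and the field defaults are hypothetical, not the app's exact schema.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the structured-extraction layer: parse a key/value
// model response into a typed record. The field names and output format
// are a hypothetical convention for illustration.
public class TransactionParser {
    public record Transaction(long amountCents, String merchant, String date, String category) {}

    public static Transaction parse(String modelOutput) {
        Map<String, String> fields = new HashMap<>();
        for (String line : modelOutput.split("\n")) {
            int sep = line.indexOf(':');
            if (sep > 0) {
                fields.put(line.substring(0, sep).trim().toLowerCase(),
                           line.substring(sep + 1).trim());
            }
        }
        String amount = fields.get("amount");
        String merchant = fields.get("merchant");
        String date = fields.get("date");
        if (amount == null || merchant == null || date == null) return null;
        if (amount.startsWith("$")) amount = amount.substring(1);
        double dollars;
        try { dollars = Double.parseDouble(amount); }
        catch (NumberFormatException e) { return null; }  // reject malformed model output
        // Store cents as a long: avoids floating-point drift in SQLite sums.
        return new Transaction(Math.round(dollars * 100), merchant, date,
                               fields.getOrDefault("category", "uncategorized"));
    }

    public static void main(String[] args) {
        String out = "amount: $12.99\nmerchant: Netflix\ndate: 2024-06-01\ncategory: subscriptions";
        System.out.println(parse(out));
    }
}
```

Storing amounts as integer cents rather than doubles is a deliberate choice here: it keeps category sums exact when the dashboard aggregates rows.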
Mobile devices have a fraction of the RAM and storage available on servers. Even TinyLlama at 1.1B parameters needed aggressive quantization (INT8 and INT4) to fit in memory while leaving room for the Android OS and app runtime. Finding the right quantization level meant profiling the tradeoff between model accuracy and memory usage until extraction quality was good enough without blowing past device limits.
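A back-of-the-envelope calculation makes the tradeoff concrete. This counts weights only; the real on-device footprint is larger because of activations, the KV cache, and runtime overhead.

```java
// Approximate weight-only memory footprint of a 1.1B-parameter model at
// different quantization levels. Weights only: activations, KV cache, and
// runtime overhead add to the real footprint.
public class QuantFootprint {
    static double weightFootprintMiB(long params, int bitsPerParam) {
        return params * (double) bitsPerParam / 8.0 / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long params = 1_100_000_000L;
        System.out.printf("FP16: %.0f MiB%n", weightFootprintMiB(params, 16)); // ~2098 MiB
        System.out.printf("INT8: %.0f MiB%n", weightFootprintMiB(params, 8));  // ~1049 MiB
        System.out.printf("INT4: %.0f MiB%n", weightFootprintMiB(params, 4));  // ~525 MiB
    }
}
```

At FP16 the weights alone approach 2 GiB; INT8 roughly halves that and INT4 halves it again, which is why aggressive quantization was the difference between fitting alongside the OS and not fitting at all.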
Without optimization, inference was taking multiple seconds per email. That's too slow for daily use. The Qualcomm QNN backend made the biggest difference by offloading matrix operations to dedicated neural processing hardware. Beyond that, batching token processing, caching intermediate representations, and tuning the ExecuTorch execution plan to minimize memory copies between CPU and NPU all helped bring latency down.
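One cheap caching layer in that spirit is memoizing per-email results so re-opening the dashboard never re-runs inference on an email already processed. Below is an illustrative LRU cache built on `LinkedHashMap`'s access-order mode; the capacity and key scheme are hypothetical, not the app's actual tuning.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache for per-email extraction results, so redrawing the
// dashboard never re-runs inference on an already-processed email. Capacity
// and key scheme are hypothetical.
public class ExtractionCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public ExtractionCache(int maxEntries) {
        super(16, 0.75f, true);  // accessOrder=true: iteration order is LRU order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        ExtractionCache<Integer, String> cache = new ExtractionCache<>(2); // key: email hash
        cache.put(1, "txn-a");
        cache.put(2, "txn-b");
        cache.get(1);           // touch 1: now most recently used
        cache.put(3, "txn-c");  // evicts 2, the least recently used entry
        System.out.println(cache.get(2)); // null (evicted)
        System.out.println(cache.get(1)); // txn-a (retained)
    }
}
```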
Smaller models give up reasoning capability. Financial emails come in all kinds of formats: bank alerts, receipts, subscription confirmations, investment statements. The model needed to reliably extract structured data from all of them. Prompt engineering did a lot of the heavy lifting, with instruction templates guiding TinyLlama toward consistent structured output. Edge cases like multi-currency transactions, partial refunds, and bundled charges required iterating on both prompts and post-processing heuristics.
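An instruction template of the kind described might look like the sketch below. The exact wording and field list are hypothetical, and TinyLlama-Chat's own chat template markers are elided for brevity; the point is constraining the model to a fixed line format that the post-processing layer can parse deterministically.

```java
// Sketch of an instruction template nudging a small model toward consistent
// structured output. Wording, fields, and category list are hypothetical.
public class PromptBuilder {
    static String buildExtractionPrompt(String emailBody) {
        return String.join("\n",
            "Extract the transaction from the email below.",
            "Respond with exactly these lines and nothing else:",
            "amount: <number, no currency symbol>",
            "currency: <ISO 4217 code>",
            "merchant: <name>",
            "date: <YYYY-MM-DD>",
            "category: <one of: groceries, dining, subscriptions, travel, other>",
            "",
            "Email:",
            emailBody);
    }

    public static void main(String[] args) {
        System.out.println(buildExtractionPrompt(
            "Your Netflix payment of 12.99 USD was processed on 2024-06-01."));
    }
}
```

Splitting currency into its own ISO 4217 field is one way to handle the multi-currency edge case mentioned above: the model never has to normalize symbols, and post-processing can convert or flag unfamiliar codes.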