YONGJAE YOO

CONTACT    PROJECTS    CARDNEWS    PUBLICATIONS     ANY IDEAS?


Personal Voice Keyboard

A private AI dictation system for Windows and Android.

Personal Voice Keyboard is a private AI dictation system built for Windows and Android. It records short speech, routes the audio through a serverless proxy, transcribes it with OpenAI, cleans the result into paste-ready text, and copies it to the clipboard.

The project focuses on Korean-English mixed dictation, technical terminology preservation, low-friction mobile UX, and secure API key isolation. The core productivity gain is not only transcription, but speech-to-clean-text conversion: speak naturally, receive cleaned text, and paste it anywhere.

Overview

The main workflow is intentionally simple:

Speak -> Stop -> Transcribe -> Clean up -> Clipboard -> Paste

The system is designed as a completed personal tool rather than a public product or open-source release. It is currently used as a private workflow for faster, cleaner communication with LLMs and other writing surfaces.

Why I Built It

Typing long messages to LLMs is tedious. I wanted to have real conversations with LLMs, not slow finger-typed exchanges.

ChatGPT-style voice input was useful, but it remained tied to a specific app or interface. I wanted a clipboard-first workflow that worked across tools:

Speak -> get cleaned text -> paste anywhere

I also needed Korean-first dictation that could preserve intentional English terms, model names, shell commands, and technical phrases. For my use case, the important problem was not only speech recognition. It was turning speech into reliable, paste-ready technical text.

User Experience

On Windows, the tool behaves like a lightweight keyboard-side utility:

Tray app -> Ctrl+F11 start -> Ctrl+F10 stop/send -> Clipboard

On Android, the primary flow is designed around the Quick Settings tile:

Quick Settings tile -> Compact capture -> Stop & Send -> Copied -> Return

The goal is to minimize context switching. I can start recording, stop when finished, wait for processing, and paste the result into whichever app or conversation I was already using.

System Architecture

Windows Client / Android App
        |
        | audio file + app token
        v
Cloudflare Worker Proxy
        |
        | protected OpenAI API key
        v
OpenAI transcription
        |
        v
GPT-based cleanup
        |
        v
clean_text -> clipboard

The architecture keeps the OpenAI API key out of the Windows and Android clients. Clients only know the proxy URL and an app token. The Cloudflare Worker proxy centralizes model selection, cleanup policy, glossary handling, and safety boundaries.

This was an important design choice because both clients can improve when the proxy prompt, glossary, or model policy is updated. A server-side terminology glossary helps preserve technical terms such as sudo, systemctl, journalctl, Jetpack Compose, and Cloudflare Workers.

Web search is off by default to control cost and latency. It may be explored later only as a fallback for uncertain proper nouns or emerging terms. Exact internal prompts and deployment details remain private.

Model and Prompt Policy

The model policy prioritizes accuracy over raw speed. The cleanup step is designed to preserve meaning, avoid summarization, remove meaningless fillers, and produce text that is ready to paste.

The cleanup policy follows several constraints:

In public terms, the pipeline uses OpenAI transcription followed by GPT-based cleanup. The detailed private prompts, exact deployment configuration, and operational secrets are not part of this public page.

Windows Client

The Windows client is a packaged executable with a taskbar- and tray-friendly workflow. It supports global hotkeys, microphone selection, an input level meter, recording and processing timers, clipboard copy, and AppData-based persistence.

Default hotkeys:

Status: complete and usable.

Android Client

The Android app is also named Personal Voice Keyboard.

Implementation notes:

The primary Android flow is tile-first: open the Quick Settings tile, launch the compact capture activity, auto-start recording when configuration and permission are available, stop and send, copy the result to the clipboard, and return to the previous task.

Status: Android v1 is stable enough for daily personal use.

Security and Privacy

This is currently a personal-use tool, not a public service or open-source release.

Security and privacy boundaries:

Development Workflow

I treated AI-assisted development as a structured product loop rather than a one-shot code generation task. I set product direction and tested the tool in real use; ChatGPT helped with architecture and planning; Codex handled implementation; and daily usage determined the next patch.

The loop was effective because the roles were explicit:

What I Learned

Current Status

Personal Voice Keyboard is complete and usable as a private personal workflow.

The initial CLI/reference project is frozen as a quality baseline. The serverless proxy is active as the shared transcription and cleanup layer. The Windows client is packaged and usable. The Android v1 app is stable enough for daily personal use.

Future Work

Possible future directions include:

Personal Note

I want to leave a small note to myself here: congratulations on finishing this one.

This project began with a very practical frustration. Even typing a message to an LLM often felt like friction, especially on a phone. I wanted to talk to LLMs more like I think: freely, at length, with corrections handled after the thought was spoken rather than before it was written.

The clipboard-first design became the key. The tool does not immediately send my words into one specific app. It turns speech into cleaned text, puts it on the clipboard, and lets me decide where it goes. That small detail changed the feeling of the whole workflow.

Ironically, this started because I was impressed by ChatGPT-style voice input. After building this, I often prefer my own tool: I can speak for a long time, preserve technical terms, review the cleaned result, edit if needed, and paste it anywhere. It feels less like “dictation” and more like giving my thoughts directly to the computer.