Branching Entropy Subword Tokenizer for Korean Finance and Economics
A domain-adaptive subword tokenizer for Korean financial and economic text, based on branching entropy and augmented with domain-specific dictionaries, designed to outperform standard tokenization methods on in-domain text.
Overview
Tokenization, the process of splitting text into meaningful units before model training, is one of the most consequential preprocessing decisions in NLP. Korean makes this especially challenging: spacing between words is inconsistent in practice and does not reveal morpheme boundaries, morphemes carry rich grammatical information, and the finance and economics domain adds further complexity through technical jargon, English loanwords, and institution names that standard tokenizers consistently mishandle.
This project develops a subword tokenizer specifically designed for Korean financial and economic text, using branching entropy as the segmentation principle and augmenting it with curated domain dictionaries.
Technical Approach
Branching entropy measures the uncertainty of the next character given the characters seen so far. High entropy after a substring means the continuation is hard to predict, which marks a natural segmentation point. By training a branching entropy model on a combination of domain-specific Korean financial texts and a general Korean corpus, the tokenizer learns to identify subword boundaries that respect both morphological structure and domain-specific term boundaries.
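As a concrete illustration (a minimal sketch, not the project's implementation), the right branching entropy of a context string x can be estimated from character n-gram counts as H(x) = -Σ P(c | x) · log2 P(c | x), where c ranges over the characters observed immediately after x:

```python
import math
from collections import Counter, defaultdict

def collect_successor_counts(corpus, max_context=4):
    """Count which characters follow each context of length 1..max_context."""
    successors = defaultdict(Counter)
    for sentence in corpus:
        for i in range(len(sentence)):
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break
                successors[sentence[i - n:i]][sentence[i]] += 1
    return successors

def branching_entropy(context, successors):
    """Shannon entropy of the next-character distribution after `context`."""
    counts = successors.get(context)
    if not counts:
        return 0.0
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy corpus: inside a frequent term the next character is predictable
# (entropy near 0); right after the term, continuations vary (entropy rises).
corpus = ["포트폴리오 다변화 전략", "포트폴리오 수익률", "헤지펀드 포트폴리오 운용"]
succ = collect_successor_counts(corpus)
print(branching_entropy("포트폴리", succ))  # ~0: '오' always follows
print(branching_entropy("리오 ", succ))    # ~1.58: three distinct continuations
```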
The development pipeline involves four steps:
- Data collection: A diverse corpus of Korean financial texts (news articles, research reports, regulatory filings, corporate disclosures) is assembled alongside a general Korean corpus to balance domain specificity with generalizability.
- Domain dictionary compilation: Finance and economics terminology is extracted from the corpus and supplemented with commonly used English-origin technical terms rendered in Korean (e.g., 포트폴리오 "portfolio", 헤지펀드 "hedge fund"). This dictionary guides the tokenizer toward correct segmentation of terms that general models fragment incorrectly.
- Tokenizer training: The branching entropy model is trained with dictionary constraints, producing a tokenizer that handles domain vocabulary correctly while retaining general-purpose Korean tokenization capability. A sketch of how dictionary constraints and entropy scores might interact follows this list.
- Evaluation: The tokenizer is benchmarked on downstream NLP tasks (machine translation, text summarization, and natural language understanding) using financial-corpus test sets, with comparisons against Byte Pair Encoding (BPE), the unigram language model, and Morfessor. A sketch of complementary intrinsic metrics also appears after this list.
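To make the dictionary and training steps concrete, here is one plausible way dictionary constraints can interact with entropy scores during segmentation: prefer the longest matching dictionary term, and otherwise cut where branching entropy crosses a threshold. The greedy longest-match strategy, the threshold value, and the toy DOMAIN_DICT are illustrative assumptions rather than the project's exact algorithm; `branching_entropy` and `succ` come from the sketch above.

```python
DOMAIN_DICT = {"포트폴리오", "헤지펀드", "금리인상"}  # toy domain dictionary

def segment(word, successors, threshold=1.0, max_term=10):
    """Segment `word` using dictionary matches first, entropy cuts second."""
    tokens, i = [], 0
    while i < len(word):
        # 1) Dictionary constraint: take the longest known domain term.
        for j in range(min(len(word), i + max_term), i, -1):
            if word[i:j] in DOMAIN_DICT:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # 2) Entropy fallback: grow the token while the next character
            #    stays predictable; cut once entropy crosses the threshold.
            j = i + 1
            while j < len(word) and branching_entropy(word[i:j], successors) < threshold:
                j += 1
            tokens.append(word[i:j])
            i = j
    return tokens

# Example: a compound containing two domain terms splits at the term boundary.
# segment("헤지펀드포트폴리오", succ) -> ['헤지펀드', '포트폴리오']
```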
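Alongside the downstream benchmarks, cheap intrinsic metrics give a quick first comparison across tokenizers. The helper below is a hedged sketch that assumes each candidate (BPE, unigram, Morfessor, or the entropy tokenizer) is wrapped as a tokenize(text) -> list-of-tokens callable; fertility (subwords per whitespace-separated word) and the fraction of dictionary terms kept intact are common proxies for segmentation quality on domain text.

```python
def intrinsic_report(tokenize, sentences, domain_terms):
    """Report fertility and domain-term preservation for one tokenizer."""
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    # A term is "preserved" if, tokenized alone, it stays a single token.
    preserved = sum(tokenize(term) == [term] for term in domain_terms)
    return {
        "fertility": n_tokens / max(n_words, 1),             # subwords per word
        "term_preservation": preserved / len(domain_terms),  # terms kept whole
    }
```

Running the same report over each baseline yields a side-by-side view (lower fertility and higher term preservation generally indicate better segmentation of domain text) before committing to the heavier downstream benchmarks.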
Significance
A tokenizer that correctly handles financial Korean terminology reduces error propagation through the entire NLP pipeline. When a general tokenizer splits a technical term incorrectly, every downstream model — whether a classifier, summarizer, or machine translation system — is working with corrupted input. The domain-adapted branching entropy approach addresses this at the source.
The practical implications extend to any application that processes Korean financial text at scale: automated earnings report analysis, regulatory document parsing, financial news summarization, and cross-lingual investment research tools. By demonstrating that branching entropy combined with domain dictionaries outperforms widely used general tokenization methods in this domain, the project provides a replicable pattern for building domain-specific tokenizers for other specialized Korean NLP applications.