Overview
Pure Lexicon AI addresses a deeply cultural problem: as Japanese absorbs more English loanwords (カタカナ語, or katakana-go), parts of the native lexicon risk atrophy. What if words like "computer" could be expressed purely through native Japanese roots — the way the Meiji reformers coined 電話 (telephone) from 電 (electric) + 話 (speech)?
This system automates that process: given an English loanword, it searches for the combination of up to three native Japanese words that best approximates the loanword's meaning, ranking candidates by cosine similarity in a shared semantic embedding space.
Technical Architecture
The system is built on three core pillars:
- Semantic Embedding Mapping: Sentence-Transformers embed both the English input and the entire native Japanese vocabulary into the same high-dimensional vector space
- Vocabulary Filtering: We remove katakana loanwords (gairaigo) from the candidate pool, keeping only native yamato kotoba and Sino-Japanese kango
- Combinatorial Optimization: FAISS-powered vector similarity search finds single-word candidates, then a greedy search evaluates 2-word and 3-word combinations, scored by cosine similarity plus a brevity preference
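As a rough illustration of the vocabulary filtering pillar, the sketch below drops any word that contains a katakana character. The names `is_native_candidate` and `filter_vocabulary` are hypothetical, and the script-based rule is an assumption rather than the project's actual filter: it would wrongly discard native words that happen to be written in katakana (for example ネコ) and would let through loanwords written in kanji or hiragana.

```python
import re

# Katakana block (U+30A0-U+30FF) plus half-width katakana (U+FF66-U+FF9F).
# Assumption: script alone is used as a proxy for "loanword".
KATAKANA = re.compile(r"[\u30A0-\u30FF\uFF66-\uFF9F]")


def is_native_candidate(word: str) -> bool:
    """Return True if the word is non-empty and contains no katakana."""
    return bool(word) and not KATAKANA.search(word)


def filter_vocabulary(words):
    """Keep only words that pass the katakana check, preserving order."""
    return [w for w in words if is_native_candidate(w)]


vocab = ["電話", "コンピュータ", "山", "考える", "サボる", "計算"]
print(filter_vocabulary(vocab))  # → ['電話', '山', '考える', '計算']
```

A production filter would more likely combine this check with dictionary metadata on word origin, since script is only a heuristic.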
The system prioritizes shorter combinations (1 word > 2 words > 3 words) to produce natural-sounding neologisms rather than awkward compound strings.
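A minimal sketch of how a greedy search with a brevity preference might look is shown here. Several things are assumptions made for illustration only: the toy 3-dimensional vectors replace real Sentence-Transformers embeddings, a brute-force loop stands in for the FAISS lookup, a combination is represented by the mean of its word vectors, and the `length_penalty` value and the function name `greedy_combination` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Element-wise mean; an assumed way to embed a word combination."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def greedy_combination(target, vocab, max_words=3, length_penalty=0.05):
    """Grow a combination one word at a time, stopping when the
    penalised score no longer improves."""
    chosen, best_score = [], float("-inf")
    for size in range(1, max_words + 1):
        step_best = None
        for word, vec in vocab.items():
            if word in chosen:
                continue
            combo_vec = mean_vector([vocab[w] for w in chosen] + [vec])
            # Brevity preference: each extra word costs length_penalty.
            score = cosine(target, combo_vec) - length_penalty * (size - 1)
            if step_best is None or score > step_best[0]:
                step_best = (score, word)
        if step_best is None or step_best[0] <= best_score:
            break  # a longer combination is not worth its penalty
        best_score = step_best[0]
        chosen.append(step_best[1])
    return chosen, best_score


# Toy embeddings, invented for this example.
toy_vocab = {
    "電": [1.0, 0.1, 0.0],
    "話": [0.0, 1.0, 0.0],
    "山": [0.0, 0.0, 1.0],
    "機": [0.5, 0.0, 0.5],
}
telephone = [1.0, 1.0, 0.0]

words, score = greedy_combination(telephone, toy_vocab)
print("".join(words))  # → 電話
```

With these toy vectors the search stops at two words because adding a third lowers the penalised score, which mirrors the 1 word > 2 words > 3 words preference. A greedy search can miss the globally best combination, so an exhaustive pass over a small shortlist is a reasonable alternative.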
Research & Results
Initial tests ran against a vocabulary of roughly 18,000 native Japanese words. The system's suggestions often echoed historical coinage patterns, which is consistent with the idea that the semantic space captures something genuine about conceptual structure.
A secondary LLM-based verification module was prototyped to filter morphologically unnatural combinations and rank outputs by grammatical fluency.
This project is ongoing research: we are exploring fine-tuning on classical Japanese texts to improve the embedding model's coverage of archaic vocabulary.