[UNICODE] Base64 for UTF-8 and Unicode
btoa('héllo') throws. btoa('日本語') throws. This is not a bug — it is a 1990s API colliding with 2020s text. Here is the right way to Base64-encode Unicode, byte by byte.
// THE TRAP
btoa() in the browser looks innocent. You pass it a string, you get Base64. Then one day a user submits their name — Renée — and your code crashes with InvalidCharacterError: The string to be encoded contains characters outside of the Latin1 range.
This is not a quirk, and it is not JavaScript's fault in isolation. btoa() was specified in an era when "string" and "bytes" were treated as the same thing, as long as you stayed below U+00FF. Unicode ate the world, but btoa()'s contract stayed frozen at Latin-1. To Base64-encode anything beyond Western European characters, you need to take a detour through UTF-8 first.
// THE 30-SECOND FIX
// Browser — encode any Unicode string to Base64
function encodeUtf8Base64(str) {
  const bytes = new TextEncoder().encode(str); // UTF-8 bytes
  let bin = '';
  for (const b of bytes) bin += String.fromCharCode(b); // Latin-1 wrapper
  return btoa(bin); // now btoa can handle it
}
// Browser — decode Base64 back to a Unicode string
function decodeUtf8Base64(b64) {
  const bin = atob(b64);
  const bytes = Uint8Array.from(bin, c => c.charCodeAt(0));
  return new TextDecoder().decode(bytes); // UTF-8 → string
}
// Usage
encodeUtf8Base64('Renée'); // 'UmVuw6ll'
encodeUtf8Base64('日本語'); // '5pel5pys6Kqe'
encodeUtf8Base64('👋 hello'); // '8J+RiyBoZWxsbw=='
decodeUtf8Base64('5pel5pys6Kqe'); // '日本語'
// WHY THIS WORKS — A SHORT DETOUR THROUGH UTF-8
JavaScript strings are sequences of UTF-16 code units. The character 日 is U+65E5, which is a single 16-bit code unit. The emoji 👋 (U+1F44B) is outside the Basic Multilingual Plane and is represented by a surrogate pair of two 16-bit code units.
btoa() needs bytes, not code units. And not just any bytes — it wants Latin-1 bytes in the range 0–255. UTF-16 code units routinely exceed 255 (for example, 日 = 0x65E5 = 26085), so btoa() throws.
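The mismatch is easy to see in a console. A short sketch (any modern browser, or Node 16+ where btoa is a global):

```javascript
// UTF-16 code units above 255 are exactly what btoa() rejects.
console.log('A'.charCodeAt(0));  // 65, within Latin-1, fine
console.log('é'.charCodeAt(0));  // 233, still within Latin-1
console.log('日'.charCodeAt(0)); // 26085, out of btoa()'s range

try {
  btoa('日');
} catch (e) {
  console.log('btoa threw:', e.name); // 'InvalidCharacterError' in browsers
}
```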
UTF-8 is an 8-bit-clean encoding of the full Unicode character set. Every Unicode character becomes 1–4 bytes, and every byte is guaranteed to be in the 0–255 range. If we convert our string to UTF-8 bytes first, then trick btoa() into seeing those bytes as a Latin-1 string (one char per byte), everything lines up: btoa() sees bytes it can handle, and the receiver decodes the same bytes back to UTF-8.
// WHAT TextEncoder DOES UNDER THE HOOD
TextEncoder is the standards-based way to turn a JavaScript string into UTF-8 bytes. It has been in every browser since ~2017 and is a global in Node.js 11+.
A quick peek at the byte sequences it produces:
new TextEncoder().encode('A');
// Uint8Array [ 0x41 ] (1 byte)
new TextEncoder().encode('é');
// Uint8Array [ 0xC3, 0xA9 ] (2 bytes)
new TextEncoder().encode('日');
// Uint8Array [ 0xE6, 0x97, 0xA5 ] (3 bytes)
new TextEncoder().encode('👋');
// Uint8Array [ 0xF0, 0x9F, 0x91, 0x8B ] (4 bytes)
new TextEncoder().encode('Renée');
// Uint8Array [ 0x52, 0x65, 0x6E, 0xC3, 0xA9, 0x65 ] (6 bytes for 5 chars)
// NODE.JS: Buffer IS UTF-8-AWARE BY DEFAULT
Node's Buffer class does the UTF-8 conversion internally. You never need TextEncoder; passing a string to Buffer.from() encodes it as UTF-8 unless you explicitly pick another encoding. This is why server-side Base64 in Node is a one-liner:
// Encode
Buffer.from('日本語').toString('base64');
// '5pel5pys6Kqe'
// Decode
Buffer.from('5pel5pys6Kqe', 'base64').toString();
// '日本語'
// Explicit UTF-8 (the default, but clearer)
Buffer.from('Renée', 'utf8').toString('base64');
// 'UmVuw6ll'
// UNICODE BASE64 IN EVERY LANGUAGE
// Python
import base64
base64.b64encode('日本語'.encode('utf-8'))
# b'5pel5pys6Kqe'
base64.b64decode('5pel5pys6Kqe').decode('utf-8')
# '日本語'
// PHP
base64_encode('日本語'); // '5pel5pys6Kqe'
base64_decode('5pel5pys6Kqe'); // '日本語'
// (PHP strings are byte strings, so UTF-8 is handled implicitly
// if your source file is saved as UTF-8.)
// Java
import java.util.Base64;
import java.nio.charset.StandardCharsets;
Base64.getEncoder().encodeToString("日本語".getBytes(StandardCharsets.UTF_8));
// '5pel5pys6Kqe'
new String(Base64.getDecoder().decode("5pel5pys6Kqe"), StandardCharsets.UTF_8);
// '日本語'
// Go
import "encoding/base64"
base64.StdEncoding.EncodeToString([]byte("日本語"))
// '5pel5pys6Kqe'
b, _ := base64.StdEncoding.DecodeString("5pel5pys6Kqe")
string(b)
// '日本語'
// Ruby (source file must be UTF-8)
require 'base64'
Base64.strict_encode64('日本語') # '5pel5pys6Kqe'
Base64.decode64('5pel5pys6Kqe') # '日本語'
// Rust
use base64::{Engine, engine::general_purpose::STANDARD};
STANDARD.encode("日本語"); // "5pel5pys6Kqe"
let b = STANDARD.decode("5pel5pys6Kqe").unwrap();
String::from_utf8(b).unwrap(); // "日本語"
// Shell (bash/zsh with GNU coreutils)
echo -n '日本語' | base64 # '5pel5pys6Kqe'
echo -n '5pel5pys6Kqe' | base64 -d # '日本語'
// COMMON UNICODE FAILURES
• Mojibake on decode — you decoded with atob() directly and the receiver sees garbage. Fix: wrap with TextDecoder.
• Surrogate-pair errors — your code splits strings by character count and hits the middle of a 👋 surrogate pair. Fix: use TextEncoder on the whole string, not char-by-char.
• BOM issues — your source text starts with a UTF-8 BOM (EF BB BF). It gets encoded along with everything else, and the receiver sees a stray \uFEFF. Strip BOMs before encoding.
• Wrong charset metadata — you put Base64 of UTF-8 bytes into a field labelled charset=iso-8859-1. The decoder uses the label to interpret the bytes. Always align the metadata with the actual encoding.
• URL-safe Base64 confusion — you used base64url for storage but forgot to restore the standard alphabet before decoding. Fix: normalise with replace(/-/g, '+').replace(/_/g, '/') and pad to a multiple of 4.
• Locale-dependent encoding — in Python 2, '日本語'.encode() defaulted to ASCII and crashed. In any modern runtime, always specify utf-8 explicitly.
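The base64url and BOM fixes above can be sketched as two small helpers. The function names here are my own, not a standard API:

```javascript
// Restore the standard Base64 alphabet and padding before decoding.
function base64UrlToStandard(b64url) {
  let b64 = b64url.replace(/-/g, '+').replace(/_/g, '/');
  while (b64.length % 4 !== 0) b64 += '='; // pad to a multiple of 4
  return b64;
}

// Drop a leading U+FEFF before encoding.
function stripBom(str) {
  return str.charCodeAt(0) === 0xFEFF ? str.slice(1) : str;
}

base64UrlToStandard('8J-RiyBoZWxsbw'); // '8J+RiyBoZWxsbw=='
stripBom('\uFEFFhello');               // 'hello'
```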
// ROUND-TRIP TEST — THE DEFINITIVE PROOF
The single best way to verify your Base64 pipeline handles Unicode correctly is a round-trip test. Encode a known-good string, decode it, and compare to the original byte-for-byte.
// Vitest / Jest / any test runner
const samples = [
  'hello',
  'héllo',
  '日本語',
  '한국어',
  '👋 Renée — 日本語 — مرحبا',
  '\n\t\0 edge cases',
  '𐀀 Linear B syllable (U+10000)',
];

for (const s of samples) {
  const encoded = encodeUtf8Base64(s);
  const decoded = decodeUtf8Base64(encoded);
  expect(decoded).toBe(s);
  expect([...new TextEncoder().encode(s)])
    .toEqual([...new TextEncoder().encode(decoded)]);
}
// BASE64 OF BINARY vs BASE64 OF TEXT
A subtle but important distinction. When you Base64 a PNG file, the input is already bytes — there's no Unicode question to answer. Just feed those bytes straight to the encoder.
When you Base64 a string, the input is a sequence of Unicode code points, not bytes. You must commit to a byte encoding (effectively always UTF-8 in 2026) before the encoder can see anything. The Base64 string you produce is valid only if the receiver reverses the same byte encoding.
This is the source of most "why is my Base64 wrong" tickets: two systems Base64-encoded the same logical string using different underlying byte encodings (one UTF-8, one UTF-16, or one Latin-1), and the receiver's decoder happily reproduces the wrong bytes.
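The mismatch is easy to reproduce in Node: encode the same string under three different byte encodings and compare the Base64 outputs.

```javascript
// One logical string, three byte encodings, three different Base64 strings.
const s = 'héllo';

console.log(Buffer.from(s, 'utf8').toString('base64'));    // 'aMOpbGxv'
console.log(Buffer.from(s, 'latin1').toString('base64'));  // 'aOlsbG8='
console.log(Buffer.from(s, 'utf16le').toString('base64')); // 'aADpAGwAbABvAA=='
```

A receiver that assumes UTF-8 will round-trip only the first of these correctly.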
// SHOULD I ALWAYS USE UTF-8?
Yes, unless you are working inside a legacy system that specifies otherwise. UTF-8 is:
• The dominant encoding on the web (> 97% of websites in 2026)
• The default in every modern programming language's standard library
• ASCII-compatible for the first 128 code points, so it never breaks English
• Well-defined and round-trippable for every Unicode character
• The IETF's recommended charset for new protocols (RFC 2277)
The old alternatives (UTF-16, UTF-32, Latin-1, GBK, Shift-JIS) still show up in legacy databases, Windows APIs, and some telecoms protocols. If your data came from one of those systems, convert to UTF-8 at the boundary, Base64 the UTF-8, and document the encoding decision in the payload format.
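Here is a sketch of that boundary conversion, assuming the incoming bytes are Latin-1 (TextDecoder accepts 'latin1' as a label for windows-1252) and using Node for the Base64 step:

```javascript
// Five Latin-1 bytes spelling 'Renée' arrive from a legacy system.
const legacyBytes = new Uint8Array([0x52, 0x65, 0x6E, 0xE9, 0x65]);

// Decode with the correct legacy label at the boundary...
const text = new TextDecoder('latin1').decode(legacyBytes); // 'Renée'

// ...then Base64 the UTF-8 re-encoding.
const b64 = Buffer.from(text, 'utf8').toString('base64');   // 'UmVuw6ll'
```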
// THE TextEncoder-FREE FALLBACK (LEGACY BROWSERS)
If you need to support a very old browser without TextEncoder (IE11 and earlier), the classic trick is to use encodeURIComponent as a UTF-8 encoder, then unescape back to a byte string. This pattern appears in every StackOverflow answer about Base64 and Unicode from 2008 to 2020.
// Legacy browser fallback (pre-2017)
function encodeUtf8Base64Legacy(str) {
  return btoa(unescape(encodeURIComponent(str)));
}

function decodeUtf8Base64Legacy(b64) {
  return decodeURIComponent(escape(atob(b64)));
}
encodeUtf8Base64Legacy('日本語'); // '5pel5pys6Kqe'
decodeUtf8Base64Legacy('5pel5pys6Kqe'); // '日本語'
// Note: unescape() and escape() are deprecated.
// Only use this pattern if TextEncoder is genuinely unavailable.
// It also mishandles some edge-case surrogates.
// WHAT TO REMEMBER
• Base64 encodes bytes, not characters. Pick a byte encoding first — almost always UTF-8.
• In the browser, one direction: TextEncoder → String.fromCharCode → btoa.
• In the browser, the other direction: atob → Uint8Array.from → TextDecoder.
• In Node.js: Buffer.from(str).toString('base64') and Buffer.from(b64, 'base64').toString() — both are UTF-8 by default.
• In Python / Go / Java: always pass an explicit 'utf-8' to .encode() / .getBytes().
• Always round-trip test with real Unicode (emoji, CJK, Arabic, combining marks) before shipping.
// RELATED
• What is Base64 and how does it work? — the 6-bit grouping that underlies everything here.
• Base64 in JavaScript (browser + Node) — full cross-runtime helper.
• URL-safe vs standard Base64 — an orthogonal choice you may also need to make.
• Try the live Base64 encoder — the same Unicode-safe pipeline, in your browser.