utf8dump

Description

utf8dump interprets stdin as a UTF-8 stream and dumps the codepoint and, optionally, the name of each character to stdout.

For example, the title of Pike & Thompson's UTF-8 paper would be displayed as:

00048 H  00065 e  0006c l  0006c l  0006f o  00020    00057 W  0006f o  
00072 r  0006c l  00064 d  0000a .  0006f o  00072 r  0000a .  0039a Κ  
003b1 α  003bb λ  003b7 η  003bc μ  003ad έ  003c1 ρ  003b1 α  00020    
003ba κ  003cc ό  003c3 σ  003bc μ  003b5 ε  0000a .  0006f o  00072 r  
0000a .  03053 こ  03093 ん  0306b に  03061 ち  0306f は  00020    04e16 世  
0754c 界  0000a .  

or — when invoked with the -c option — as:

00048 H LATIN CAPITAL LETTER H
00065 e LATIN SMALL LETTER E
0006c l LATIN SMALL LETTER L
0006c l LATIN SMALL LETTER L
0006f o LATIN SMALL LETTER O
00020   SPACE
00057 W LATIN CAPITAL LETTER W
0006f o LATIN SMALL LETTER O
00072 r LATIN SMALL LETTER R
0006c l LATIN SMALL LETTER L
00064 d LATIN SMALL LETTER D
0000a . LINE FEED
0006f o LATIN SMALL LETTER O
00072 r LATIN SMALL LETTER R
0000a . LINE FEED
0039a Κ GREEK CAPITAL LETTER KAPPA
003b1 α GREEK SMALL LETTER ALPHA
003bb λ GREEK SMALL LETTER LAMDA
003b7 η GREEK SMALL LETTER ETA
003bc μ GREEK SMALL LETTER MU
003ad έ GREEK SMALL LETTER EPSILON WITH TONOS
003c1 ρ GREEK SMALL LETTER RHO
003b1 α GREEK SMALL LETTER ALPHA
00020   SPACE
003ba κ GREEK SMALL LETTER KAPPA
003cc ό GREEK SMALL LETTER OMICRON WITH TONOS
003c3 σ GREEK SMALL LETTER SIGMA
003bc μ GREEK SMALL LETTER MU
003b5 ε GREEK SMALL LETTER EPSILON
0000a . LINE FEED
0006f o LATIN SMALL LETTER O
00072 r LATIN SMALL LETTER R
0000a . LINE FEED
03053 こ HIRAGANA LETTER KO
03093 ん HIRAGANA LETTER N
0306b に HIRAGANA LETTER NI
03061 ち HIRAGANA LETTER TI
0306f は HIRAGANA LETTER HA
00020   SPACE
04e16 世 CJK UNIFIED IDEOGRAPH-4E16
0754c 界 CJK UNIFIED IDEOGRAPH-754C
0000a . LINE FEED
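
The two output formats shown above can be approximated with a short sketch. This is not the Perl script itself but a Python illustration inferred from the sample output: the function names `dump_compact` and `dump_named` are made up for this example, and the layout (five hex digits, a dot for control characters, eight cells per row) is assumed from the listings rather than taken from the source. Python's `unicodedata` module has no names for control characters, so the sketch prints a placeholder where the real tool prints LINE FEED.

```python
import unicodedata


def printable(ch):
    """Return ch itself, or '.' for control characters, as in the sample output."""
    return "." if unicodedata.category(ch) == "Cc" else ch


def dump_compact(text, per_line=8):
    """Format each character as a '0006c l  ' cell, per_line cells to a row."""
    cells = [f"{ord(ch):05x} {printable(ch)}  " for ch in text]
    rows = ["".join(cells[i:i + per_line]) for i in range(0, len(cells), per_line)]
    return "".join(row + "\n" for row in rows)


def dump_named(text):
    """Format one character per line, followed by its Unicode character name."""
    lines = []
    for ch in text:
        # unicodedata has no names for control characters (or unassigned
        # codepoints), so fall back to a placeholder. Assumption: the real
        # tool gets names such as LINE FEED from Perl's own name tables.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"{ord(ch):05x} {printable(ch)} {name}\n")
    return "".join(lines)
```

To use this as a filter, one would decode the bytes read from `sys.stdin.buffer` as UTF-8 and write the result of either function to stdout, choosing `dump_named` when `-c` is given.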

Source-Code and History

2023-06-26 utf8dump (Perl source code) 561B The current version: uses 5 digits for codepoints (I'm finally giving in to the existence of emojis).
2007-10-30 utf8dump (Perl source code) 561B The first version: used 4-digit codepoints.

Binary distributions

None at the moment. It's just a single tiny Perl script.
