Overview
A GEDCOM file includes a CHAR tag in the HEAD section that indicates the character encoding used in the file. The valid options as of GEDCOM v5.5.1 are ANSEL, ASCII, UNICODE, and UTF-8. Unfortunately, some programs specify the wrong character encoding, some use an invalid option such as ANSI, and some files are edited by text editors that change the character encoding of the file without changing the CHAR value. For those reasons, GedSite detects the character encoding using both the CHAR value and character encoding detection techniques.
GedSite opens the GEDCOM file once to detect whether the file has a Unicode "byte order mark" (BOM). If GedSite detects a Unicode BOM, it ignores the CHAR value and uses the encoding associated with the BOM to read the file.
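The BOM check described above can be sketched in Python. This is only an illustration of the general technique, not GedSite's actual code; the function name `detect_bom` and the set of BOMs checked are assumptions made for the example.

```python
import codecs

# Longer BOMs are listed first so that a UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for a UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present.

    `detect_bom` is a hypothetical helper written for this example.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would pass the first few bytes of the file, and, when a BOM is found, skip `bom_length` bytes and decode the rest with the returned encoding.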
GedSite re-opens the GEDCOM file to read the CHAR tag and several other GEDCOM tags from the HEAD section. If GedSite did not find a Unicode BOM, GedSite uses an ASCII encoding to read the file for the tag preview.
After reading the CHAR tag, if GedSite did not find a Unicode BOM, GedSite chooses an encoding based on the CHAR tag value.
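As a rough illustration of the tag preview step, the sketch below reads the CHAR value from the HEAD record using a lenient ASCII decode. The function name `read_char_value` is hypothetical, and decoding the whole buffer (rather than only the start of the file) is a simplification for the example; how GedSite actually parses the header is not described here.

```python
def read_char_value(data: bytes):
    """Return the value of the level-1 CHAR tag inside HEAD, or None.

    Hypothetical helper: bytes outside ASCII become U+FFFD, which is
    harmless here because GEDCOM levels and tags are plain ASCII.
    """
    text = data.decode("ascii", errors="replace")
    in_head = False
    for line in text.splitlines():
        parts = line.strip().split(None, 2)  # level, tag, optional value
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0":
            if in_head:
                break  # the next level-0 record ends the HEAD section
            in_head = tag == "HEAD"
        elif in_head and level == "1" and tag == "CHAR":
            return parts[2].strip() if len(parts) > 2 else None
    return None
```

For a file whose header contains `1 CHAR ANSEL`, the function returns the string `ANSEL`, which the caller can then use to pick a decoder.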
If your genealogy program supports writing a GEDCOM file using the UTF-8 encoding, choose that option for the best results with GedSite.
When the character encoding in the GEDCOM file is set to "ASCII", GedSite will accept characters in the Windows-1252 encoding. Windows-1252 is a superset of ASCII.
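The selection rules in this overview can be summarized in a small sketch. The table `CHAR_TO_CODEC` and the function `choose_codec` are hypothetical names; the label `"ansel"` is a placeholder (Python has no built-in ANSEL codec), and the fallback to Windows-1252 for unrecognized values such as ANSI is an assumption of this example, not documented GedSite behavior.

```python
# Map CHAR values to Python codec names.
CHAR_TO_CODEC = {
    "UTF-8": "utf-8",
    "UNICODE": "utf-16",
    "ASCII": "cp1252",  # per the text: ASCII files are read as Windows-1252
    "ANSEL": "ansel",   # placeholder label; requires a custom decoder
}


def choose_codec(bom_encoding, char_value):
    """Pick a codec name: a detected BOM wins, otherwise use the CHAR value."""
    if bom_encoding:
        return bom_encoding
    key = (char_value or "").strip().upper()
    # Assumption: unknown or invalid values (for example "ANSI") fall back
    # to Windows-1252. The source does not say what GedSite does here.
    return CHAR_TO_CODEC.get(key, "cp1252")
```

Because Windows-1252 is a superset of ASCII, decoding a pure-ASCII file with `cp1252` gives the same text, while a byte such as 0xE9 decodes to "é" instead of causing an error.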
Challenges
Unfortunately, there are character encoding issues that make it difficult or impossible to detect the encoding automatically.
GEDCOM file specifies UTF-8, but file is not UTF-8
If a GEDCOM file self-identifies as UTF-8 by including a 1 CHAR UTF-8 record, GedSite may or may not be able to detect if the file is actually in some other format. For example, if the file is actually a Windows text file with encoding "Windows-1252", a common text file format on PCs running MS Windows, then GedSite cannot tell that the file is not in UTF-8 format. Non-accented characters will display correctly in the resulting site, but many accented characters will not. The solution is to change the Database.CHAR Value property to "ASCII".
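The symptom can be reproduced with a short Python sketch. It encodes a sample line as Windows-1252 and then decodes it both ways; the function name `decode_both` and the sample name are made up for the example, and the use of `errors="replace"` is an assumption about how a lenient reader might behave.

```python
def decode_both(raw: bytes):
    """Decode the same bytes as UTF-8 (leniently) and as Windows-1252."""
    as_utf8 = raw.decode("utf-8", errors="replace")
    as_cp1252 = raw.decode("cp1252")
    return as_utf8, as_cp1252


# A line written by a program that used Windows-1252 but labeled it UTF-8.
original = "1 NAME Andr\u00e9 /M\u00fcller/"
raw = original.encode("cp1252")

wrong, right = decode_both(raw)
print(wrong)  # unaccented letters survive; accented ones become U+FFFD
print(right)  # the Windows-1252 reading restores the original text
```

Reading the bytes as Windows-1252, which is what setting the CHAR value to "ASCII" selects, recovers the accented characters.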
ANSEL
ANSEL was an ANSI standard used to encode text, but as of 14 February 2013, the standard has been withdrawn. The Family History Department of the Church of Jesus Christ of Latter-day Saints recommended an extended version of ANSEL for use in GEDCOM files.
Fortunately, modern genealogy software programs are not limited to writing ANSEL-encoded GEDCOM files, and you should configure your software to write in another format, preferably UTF-8. Still, to process files that are encoded in ANSEL, GedSite includes support for that encoding.
The tables below describe how certain ANSEL code points map to Unicode. These tables were constructed based on information from these sources:
- The GEDCOM Standard, Draft Release 5.5.1, "Appendix C, ANSEL Character Set"
- The GEDCOM Standard, Release 5.5, "Appendix D, ANSEL Character Set"
- The GEDCOM Standard, Release 4.0, "Chapter 6, Specification for GEDCOM Character Sets"
- The Character Name Index on the Unicode Consortium website
- GEDCOM ANSEL Table by Tamura Jones, especially for corrections and examples
The GEDCOM standards listed above were prepared by the Family History Department of the Church of Jesus Christ of Latter-day Saints.
Spacing Characters
Hex | Decimal | Unicode | Graphic | Name | Example |
---|---|---|---|---|---|
A1 | 161 | U+0141 | Ł | capital L with stroke | Łódź |
A2 | 162 | U+00D8 | Ø | capital O with stroke | Øst |
A3 | 163 | U+0110 | Đ | capital D with stroke | Đuro |
A4 | 164 | U+00DE | Þ | capital thorn | Þann |
A5 | 165 | U+00C6 | Æ | capital AE | Ægir |
A6 | 166 | U+0152 | Œ | capital ligature OE | Œuvre |
A7 | 167 | U+02B9 | ʹ | modifier letter prime | fakulʹtet |
A8 | 168 | U+00B7 | · | middle dot | novel·la |
A9 | 169 | U+266D | ♭ | music flat sign | B♭ |
AA | 170 | U+00AE | ® | registered sign | Kleenex ® |
AB | 171 | U+00B1 | ± | plus-minus sign | 1910±2 |
AC | 172 | U+01A0 | Ơ | hook O, uppercase | BƠ |
AD | 173 | U+01AF | Ư | hook U, uppercase | XƯA |
AE | 174 | U+02BE | ◌ʾ | right half ring (alif) | Unʾyusho |
B0 | 176 | U+02BF | ◌ʿ | left half ring (ayn) | faʿil |
B1 | 177 | U+0142 | ł | small l with stroke | rozbił |
B2 | 178 | U+00F8 | ø | small o with stroke | høj |
B3 | 179 | U+0111 | đ | small d with stroke | đavola |
B4 | 180 | U+00FE | þ | small thorn | þann |
B5 | 181 | U+00E6 | æ | small ae | skæg |
B6 | 182 | U+0153 | œ | small ligature oe | œuvre |
B7 | 183 | U+02BA | ʺ | modifier letter double prime | obʺi︠a︡vlenie |
B8 | 184 | U+0131 | ı | small dotless i | masalı |
B9 | 185 | U+00A3 | £ | pound sign | £5.00 |
BA | 186 | U+00F0 | ð | small eth | verður |
BC | 188 | U+01A1 | ơ | hook o, lowercase | Sơ |
BD | 189 | U+01B0 | ư | hook u, lowercase | Tư |
BE | 190 | U+25A1 | □ | white square (LDS Extension) | □ |
BF | 191 | U+25A0 | ■ | black square (LDS Extension) | ■ |
C0 | 192 | U+00B0 | ° | degree sign | 98.6° |
C1 | 193 | U+2113 | ℓ | script small L | 2.0ℓ |
C2 | 194 | U+2117 | ℗ | sound recording copyright | Parlophone℗ |
C3 | 195 | U+00A9 | © | copyright sign | ©1993 |
C4 | 196 | U+266F | ♯ | music sharp sign | D♯ |
C5 | 197 | U+00BF | ¿ | inverted question mark | ¿Qué? |
C6 | 198 | U+00A1 | ¡ | inverted exclamation mark | ¡Esta! |
CD | 205 | | e | e in middle of line (LDS Extension) | e |
CE | 206 | | o | o in middle of line (LDS Extension) | o |
CF | 207 | U+00DF | ß | small sharp s | Preußen |
GedSite does not support LDS extensions "e in middle of line" or "o in middle of line". They are converted to "e" and "o", respectively.
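A table-driven lookup is one way to apply the spacing-character mappings. The sketch below copies only a few rows from the table above, including the two LDS extensions that are converted to plain letters; the names `ANSEL_SPACING` and `decode_spacing` are hypothetical, combining characters are not handled, and replacing unmapped bytes with U+FFFD is an assumption of the example.

```python
# A subset of the spacing-character table (ANSEL byte -> Unicode text).
ANSEL_SPACING = {
    0xA1: "\u0141",  # capital L with stroke
    0xA2: "\u00d8",  # capital O with stroke
    0xB1: "\u0142",  # small l with stroke
    0xB2: "\u00f8",  # small o with stroke
    0xB5: "\u00e6",  # small ae
    0xCF: "\u00df",  # small sharp s
    0xCD: "e",       # "e in middle of line" (LDS extension) -> plain e
    0xCE: "o",       # "o in middle of line" (LDS extension) -> plain o
}


def decode_spacing(data: bytes) -> str:
    """Decode ASCII plus the mapped spacing characters; others become U+FFFD."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            out.append(ANSEL_SPACING.get(b, "\ufffd"))
    return "".join(out)
```

With this subset, the bytes `b"Preu\xcfen"` decode to "Preußen" and `b"h\xb2j"` to "høj".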
Combining (non-spacing) Characters
ANSEL includes combining characters that modify the following character (see Notes). In the table below, the Graphic column shows the combining character modifying a dotted circle ◌ (U+25CC).
Hex | Decimal | Unicode | Graphic | Name | Example |
---|---|---|---|---|---|
E0 | 224 | U+0309 | ◌̉ | hook above | củi |
E1 | 225 | U+0300 | ◌̀ | grave accent | règle |
E2 | 226 | U+0301 | ◌́ | acute accent | está |
E3 | 227 | U+0302 | ◌̂ | circumflex accent | même |
E4 | 228 | U+0303 | ◌̃ | tilde | niño |
E5 | 229 | U+0304 | ◌̄ | macron | gājājs |
E6 | 230 | U+0306 | ◌̆ | breve | altă |
E7 | 231 | U+0307 | ◌̇ | dot above | żaba |
E8 | 232 | U+0308 | ◌̈ | diaeresis (umlaut) | öppna |
E9 | 233 | U+030C | ◌̌ | caron (hacek) | vždy |
EA | 234 | U+030A | ◌̊ | ring above (angstrom) | hår |
EB | 235 | U+FE20 | ◌︠ | ligature, left-half | akademii︠a︡ |
EC | 236 | U+FE21 | ◌︡ | ligature, right-half | akademii︠a︡ |
ED | 237 | U+0315 | ◌̕ | comma above right | rozdel̕ ovac |
EE | 238 | U+030B | ◌̋ | double acute accent | időszaki |
EF | 239 | U+0310 | ◌̐ | candrabindu | Alii̐ev |
F0 | 240 | U+0327 | ◌̧ | cedilla | ça |
F1 | 241 | U+0328 | ◌̨ | ogonek (nasal hook) | vietą |
F2 | 242 | U+0323 | ◌̣ | dot below | teḍa |
F3 | 243 | U+0324 | ◌̤ | double dot below | k̲h̲ut̤bah |
F4 | 244 | U+0325 | ◌̥ | circle below | Samskr̥ta |
F5 | 245 | U+0333 | ◌̳ | double underscore | G̳hulam |
F6 | 246 | U+0332 | ◌̲ | underscore | s̲amar |
F7 | 247 | U+0326 | ◌̦ | left hook | dārzin̦a |
F8 | 248 | U+031C | ◌̜ | right cedilla | kho̜ng |
F9 | 249 | U+032E | ◌̮ | breve below | ḫumantus̆ |
FA | 250 | U+FE22 | ◌︢ | double tilde, left half | n︢g︣alan |
FB | 251 | U+FE23 | ◌︣ | double tilde, right half | n︢g︣alan |
FC | 252 | U+0338 | ◌̸ | long solidus (slash) overlay (LDS Extension) | 0̸ |
FE | 254 | U+0313 | ◌̓ | comma above | ge̓otermika |
Please note that per the Unicode Standard, Version 7.0, Chapter 3, D52, "The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width non-joiner."
In the table above, the position of the cedilla may be different when it applies to the dotted circle ◌̧ compared to where it appears under a "c" in "ça". In this document, the behavior will vary based on your browser's font choices for "serif" and "sans-serif", and also on your browser's text layout software.
The slash overlay character does not seem to be positioned properly when applied to the dotted circle or the digit zero. I tried multiple fonts and multiple base characters and all combinations produced similar results.
Notes
- In the ANSEL encoding, combining characters precede the character they modify. In Unicode, combining characters follow the character they modify.
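The ordering difference means a converter has to hold each ANSEL combining character until it has seen the base character, then emit the base first. The sketch below shows one way to do that for a handful of marks from the table; `COMBINING` and `ansel_to_unicode` are hypothetical names, spacing characters above 0x7F are deliberately left out (they become U+FFFD here), and the final NFC normalization is an optional step chosen for the example.

```python
import unicodedata

# A subset of the combining-character table (ANSEL byte -> Unicode mark).
COMBINING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex accent
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis (umlaut)
    0xF0: "\u0327",  # cedilla
}


def ansel_to_unicode(data: bytes) -> str:
    """Move ANSEL combining marks from before the base character to after it."""
    out = []
    pending = []  # marks waiting for the base character that follows them
    for b in data:
        if b in COMBINING:
            pending.append(COMBINING[b])
            continue
        out.append(chr(b) if b < 0x80 else "\ufffd")
        out.extend(pending)  # Unicode order: base first, then its marks
        pending.clear()
    out.extend(pending)  # stray marks at the end of the input
    # Compose base + mark pairs into single code points where possible.
    return unicodedata.normalize("NFC", "".join(out))
```

For example, the ANSEL bytes `b"ni\xe4no"` (tilde before the second "n") become "niño", and `b"\xf0ca"` becomes "ça".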