GEDCOM Data Analysis
A random selection of 50 user-contributed GEDCOM files was downloaded from Ancestry.com and analyzed. The results are summarized below; the analysis reports for all files may also be downloaded (original data not included): download - 40KB
 

File Size
The smallest file contained 402 lines. The largest file contained 7,104,328 lines.
 

Product
Name No. of files
Family Tree Maker for Windows 20
Online Family Tree 18
Ancestry Family Tree 4
Personal Ancestral File 3
Legacy 1
Family Trees Quick & Easy 1
Roots Magic 1
EasyTree 1
gwb2ged 1
 
This breakdown does not necessarily represent market share. The figures may well be skewed because Ancestry.com produces or promotes the most prevalent applications in this sample.
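The report does not say how the producing application was identified. A plausible approach is to read the SOUR line (and its NAME subordinate) in the HEAD record. The following is a minimal sketch under that assumption; the function name header_source and the sample lines are hypothetical and not taken from the analyzed files.

```python
def header_source(lines):
    """Return (system_id, product_name) from the HEAD record.

    Either value is None if the corresponding line is missing.
    Assumes `lines` is an iterable of already-decoded GEDCOM lines.
    """
    system_id = None
    name = None
    in_head = False
    in_sour = False
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 0:
            if in_head:
                break  # the next level-0 record ends the header
            in_head = (tag == "HEAD")
            continue
        if not in_head:
            continue
        if level == 1:
            # "1 SOUR <id>" opens the source-system block
            in_sour = (tag == "SOUR")
            if in_sour:
                system_id = value
        elif level == 2 and in_sour and tag == "NAME":
            name = value
    return system_id, name


# Made-up header for illustration only.
sample = [
    "0 HEAD",
    "1 SOUR FTW",
    "2 VERS 10.0",
    "2 NAME Family Tree Maker for Windows",
    "1 GEDC",
    "2 VERS 5.5",
    "1 CHAR ANSI",
    "0 @I1@ INDI",
    "1 NAME John /Doe/",
    "0 TRLR",
]
print(header_source(sample))
```

Tallying the second element of the result across all files (for example with collections.Counter) would give a table like the one above.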
 

GEDCOM Version
Version No. of files
5.5 31
4.0 18
unspecified 1

 


File Character Encoding

Encoding No. of files
ANSI 39
ANSEL 6
UTF-8 3
ASCII 1
IBM-WINDOWS 1
 
This result is somewhat surprising. The GEDCOM specification supports only four encodings: ANSEL, ASCII, UNICODE, and UTF-8. Nevertheless, the majority of files (39 of 50) declared the non-standard encoding ANSI, which is not the same as ASCII. One file specified IBM-WINDOWS encoding, which is expressly forbidden by the GEDCOM specification.
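A check of this kind can be sketched as follows: read the CHAR line from the HEAD record and compare it with the four encodings listed above. The function names and sample lines are hypothetical, and the sketch assumes the header lines have already been decoded to text.

```python
# The four encodings named in the text above.
STANDARD_ENCODINGS = {"ANSEL", "ASCII", "UNICODE", "UTF-8"}


def declared_encoding(lines):
    """Return the value of the header's CHAR line, or None if absent."""
    in_head = False
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level, tag = int(parts[0]), parts[1]
        if level == 0:
            if in_head:
                break  # header finished without a CHAR line
            in_head = (tag == "HEAD")
        elif in_head and level == 1 and tag == "CHAR":
            return parts[2].strip() if len(parts) > 2 else None
    return None


def is_standard(encoding):
    """True if the declared encoding is one of the four standard values."""
    return encoding is not None and encoding.upper() in STANDARD_ENCODINGS


# Made-up header for illustration only.
header = ["0 HEAD", "1 SOUR FTW", "1 CHAR ANSI", "0 TRLR"]
enc = declared_encoding(header)
print(enc, is_standard(enc))
```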

 

Non-Standard (Custom) Tags
Custom tags are permitted but discouraged by the GEDCOM specification. Even so, a total of 53 custom tags were used in the sample. Family Tree Maker used the most custom tags, whereas Online Family Tree used none. Every custom tag began with the underscore character as required by the spec. The longest custom tag was _ALT_BIRTH.
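Because every custom tag starts with an underscore, collecting them is straightforward. The sketch below is one possible way to do it, not necessarily the method used for this analysis; the function name and the sample lines are hypothetical, and apart from _ALT_BIRTH the tags shown are invented for illustration.

```python
from collections import Counter


def custom_tags(lines):
    """Count occurrences of each tag that begins with an underscore."""
    counts = Counter()
    for raw in lines:
        parts = raw.strip().split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        tag = parts[1]
        # Lines such as "0 @I1@ INDI" carry a cross-reference ID before the tag.
        if tag.startswith("@") and tag.endswith("@") and len(parts) > 2:
            tag = parts[2]
        if tag.startswith("_"):
            counts[tag] += 1
    return counts


# Made-up record for illustration only.
record = [
    "0 @I1@ INDI",
    "1 NAME John /Doe/",
    "1 _ALT_BIRTH",
    "2 DATE 1 JAN 1900",
    "1 _FA1 first note",
    "1 _FA1 second note",
    "0 @X1@ _MREL",
]
tags = custom_tags(record)
longest = max(tags, key=len)
print(dict(tags), longest)
```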

 

Non-ASCII and Non-Standard Characters
Fourteen of the 50 files contained at least one non-ASCII character, and 34 of the 50 contained at least one non-standard character. Non-ASCII characters are perfectly valid; their presence gives a rough indication of how many non-English words a file contains. Non-standard characters are values that represent neither ASCII nor ANSEL characters, and may therefore be improper. One explanation for their presence may be the use of ANSI character encoding. The reason for non-standard characters in ANSEL-encoded files is unknown.
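Both checks can be run on the raw bytes of a file. The sketch below is illustrative only: it counts bytes above 0x7F as non-ASCII, and it leaves the definition of "standard" to the caller through a hypothetical allowed parameter, because the ANSEL code table is not reproduced here. The ASCII_TEXT set used in the example is likewise an assumption.

```python
def non_ascii_count(data: bytes) -> int:
    """Number of bytes outside the 7-bit ASCII range."""
    return sum(1 for b in data if b > 0x7F)


def outside_allowed(data: bytes, allowed) -> set:
    """Set of byte values in `data` that are not in `allowed`.

    `allowed` is a caller-supplied set of permitted byte values
    (for example ASCII plus the ANSEL code points).
    """
    return {b for b in data if b not in allowed}


# Assumption for the example: printable ASCII plus tab, LF, and CR.
ASCII_TEXT = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

line = b"1 NAME Jos\xe9 /Garcia/\n"  # 0xE9 is e-acute in Windows-1252
print(non_ascii_count(line), outside_allowed(line, ASCII_TEXT))
```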
 

Citations
A common complaint is that user-contributed GEDCOM files do not contain source citations. Nevertheless, 27 of the 50 files in the sample do contain citations, although their extent and consistency vary. As a rule, the citations probably do not meet professional-quality standards, but the results suggest that many users may be more conscientious than presumed.
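The report does not state how a file was judged to contain citations. One reasonable assumption is to count SOUR lines below level 0 that lie outside the header, since the header's own SOUR line names the producing application rather than a source. A minimal sketch under that assumption, with a hypothetical function name and made-up sample lines:

```python
def count_citations(lines):
    """Count SOUR lines at level 1 or deeper, excluding the HEAD record."""
    count = 0
    in_head = False
    for raw in lines:
        parts = raw.strip().split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level = int(parts[0])
        if level == 0:
            # Level-0 lines start a new record, including "0 @S1@ SOUR"
            # source records, which are not citations themselves.
            in_head = (parts[1] == "HEAD")
            continue
        if not in_head and parts[1] == "SOUR":
            count += 1
    return count


# Made-up file for illustration only.
gedcom = [
    "0 HEAD",
    "1 SOUR FTW",
    "0 @I1@ INDI",
    "1 BIRT",
    "2 DATE 1 JAN 1900",
    "2 SOUR @S1@",
    "1 SOUR Family bible",
    "0 @S1@ SOUR",
    "1 TITL Parish register",
    "0 TRLR",
]
print(count_citations(gedcom))
```

A file would then count as "containing citations" when the result is greater than zero.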
 

 


© 2003-2004 Software Renovation Corporation. All rights reserved.
http://www.igenie.org