A formal grammar for parsing GEDCOM 5.5x files

The following LALR(1) grammar can be used to parse GEDCOM 5.5x files generated by a wide variety of programs. The POINTER token includes its delimiters, so the leading and trailing '@' characters must be trimmed manually from any pointer string. Likewise, escape sequences embedded in value strings must be extracted manually.

ALNUM:   // an alphanumeric character
[a-z_A-Z0-9]
 
WHITE:   // a whitespace or ignorable control character (space, tab, vertical tab, backspace, form feed, bell)
[ \t\v\b\f\a]
 
ANY:    // any character except newline
[^\n]
 
POINTER:   // a pointer string
@ALNUM[^@]*@
 
BOM8:   // a UTF-8 byte order mark
\xEF\xBB\xBF
 
UBOM:   // a Unicode (UTF-16, little-endian) byte order mark
\xFF\xFE
 
NEWLINE:
\n
 
SPACE:
' '   // single space character (0x20)
 
TAG:
ALNUM+
 
LEVEL:
0|([1-9][0-9]*)
 
VALUE:
ANY+
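
The POINTER token keeps its delimiting '@' characters and VALUE is returned as raw text, so both need post-processing after parsing. A minimal Python sketch of that step is given here. The function names are hypothetical, and the escape pattern assumes the GEDCOM 5.5 form '@#...@' (for example '@#DJULIAN@'), which this grammar does not itself define.

```python
import re
from typing import List, Tuple

# Assumption: escape sequences look like "@#TEXT@", optionally followed by
# one space. This pattern is illustrative and not part of the grammar.
ESCAPE_RE = re.compile(r"@#([^@\n]+)@ ?")


def trim_pointer(pointer: str) -> str:
    """Remove the single leading and trailing '@' from a POINTER token."""
    if len(pointer) >= 3 and pointer.startswith("@") and pointer.endswith("@"):
        return pointer[1:-1]
    return pointer  # not a pointer; return it unchanged


def extract_escapes(value: str) -> Tuple[List[str], str]:
    """Return the escape texts found in value, and value without them."""
    escapes = ESCAPE_RE.findall(value)
    return escapes, ESCAPE_RE.sub("", value)
```

For instance, trim_pointer("@I1@") yields "I1", and a date value carrying a calendar escape is split into the escape text and the remaining date string.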


%ignore%:
BOM8    // UTF-8 byte order mark
UBOM    // Unicode (UTF-16, little-endian) byte order mark
WHITE* // leading whitespace
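
Where no lexer generator is available, the %ignore% rules can be applied by hand before tokenizing. The Python sketch below assumes that a byte order mark matters only at the very start of the file and that WHITE* is skipped at the start of each line. The helper names are hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"   # BOM8 in the grammar
UTF16_LE_BOM = b"\xff\xfe"   # UBOM in the grammar

# The characters of the WHITE class: space, tab, vertical tab,
# backspace, form feed, bell.
WHITE = " \t\v\b\f\a"


def strip_bom(data: bytes) -> bytes:
    """Drop a leading byte order mark, if one is present."""
    for bom in (UTF8_BOM, UTF16_LE_BOM):
        if data.startswith(bom):
            return data[len(bom):]
    return data


def strip_leading_white(line: str) -> str:
    """Drop leading WHITE characters from a single decoded line."""
    return line.lstrip(WHITE)
```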
 
file:
line
file line

line:
LEVEL SPACE TAG
LEVEL SPACE TAG line_value NEWLINE
LEVEL SPACE POINTER SPACE TAG
LEVEL SPACE POINTER SPACE TAG line_value NEWLINE

line_value:
VALUE
POINTER
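
For quick experiments, the line rule can be approximated with a regular expression. The Python sketch below is not an LALR(1) parser. It assumes that a single SPACE separates TAG from line_value (the grammar does not show a separator there), tolerates a trailing carriage return, and uses hypothetical names throughout.

```python
import re
from typing import List, NamedTuple, Optional

# LEVEL SPACE [POINTER SPACE] TAG [SPACE line_value]
LINE_RE = re.compile(
    r"(?P<level>0|[1-9][0-9]*)"                 # LEVEL
    r" (?:(?P<xref>@[a-zA-Z_0-9][^@\n]*@) )?"   # optional POINTER SPACE
    r"(?P<tag>[a-zA-Z_0-9]+)"                   # TAG
    r"(?: (?P<value>[^\n]+))?"                  # optional line_value
)


class Line(NamedTuple):
    level: int
    xref: Optional[str]   # POINTER before the tag, '@' delimiters kept
    tag: str
    value: Optional[str]  # VALUE or POINTER after the tag, untrimmed


def parse_line(text: str) -> Line:
    """Parse one physical line; raise ValueError if it does not match."""
    # Skip the line ending and leading WHITE before matching.
    body = text.rstrip("\r\n").lstrip(" \t\v\b\f\a")
    m = LINE_RE.fullmatch(body)
    if m is None:
        raise ValueError(f"not a valid GEDCOM line: {text!r}")
    return Line(int(m["level"]), m["xref"], m["tag"], m["value"])


def parse_file(text: str) -> List[Line]:
    """Parse every non-blank line of an already decoded file."""
    return [parse_line(raw) for raw in text.split("\n") if raw.strip()]
```

The xref and value fields are returned exactly as matched, so pointers still carry their '@' delimiters and values still contain any escape sequences.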
 
 
 


© 2003-2004 Software Renovation Corporation. All rights reserved.
http://www.igenie.org