A GEDCOM Grammar

The following LALR(1) grammar can be used to parse GEDCOM 5.5 files generated by a variety of programs. The grammar returns pointer strings with their delimiters intact; if the bare identifier is needed, the leading and trailing '@' characters must be trimmed manually. Escape sequences embedded in value strings, if used, must likewise be extracted manually.

ALNUM: // an alphanumeric character
[a-z_A-Z0-9]
WHITE: // a whitespace or similar control character (newline excluded)
[ \t\v\b\f\a]
ANY: // any character except newline
[^\n]
POINTER: // a pointer string
@ALNUM[^@]*@
BOM8: // a UTF-8 byte order mark
\xEF\xBB\xBF
UBOM: // a Unicode (UTF-16, little-endian) byte order mark
\xFF\xFE
NEWLINE:
\n
SPACE:
' ' // single space character (0x20)
TAG:
ALNUM+
LEVEL:
0|([1-9][0-9]*)
VALUE:
ANY+
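
As a rough illustration, the token definitions above can be approximated with Python regular expressions. This is only a sketch: the names TOKEN_PATTERNS and match_token are hypothetical, and the patterns are assumed translations of the character classes listed above, not part of the grammar itself.

```python
import re

# Assumed Python translations of the token definitions above.
ALNUM = r"[a-z_A-Z0-9]"
TOKEN_PATTERNS = {
    "POINTER": re.compile(rf"@{ALNUM}[^@]*@"),
    "TAG": re.compile(rf"{ALNUM}+"),
    "LEVEL": re.compile(r"0|[1-9][0-9]*"),
    "VALUE": re.compile(r"[^\n]+"),
}


def match_token(name: str, text: str) -> bool:
    """Return True if the whole of `text` matches the named token."""
    return TOKEN_PATTERNS[name].fullmatch(text) is not None
```

For example, match_token("LEVEL", "10") accepts the text, while match_token("LEVEL", "01") rejects it because LEVEL does not allow a leading zero.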


%ignore%:
BOM8 // UTF-8 byte order mark
UBOM // Unicode (UTF-16, little-endian) byte order mark
WHITE* // leading whitespace
file:
line
file line
line:
LEVEL SPACE TAG NEWLINE
LEVEL SPACE TAG line_value NEWLINE
LEVEL SPACE POINTER SPACE TAG NEWLINE
LEVEL SPACE POINTER SPACE TAG line_value NEWLINE
line_value:
VALUE
POINTER
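
The line rule and the manual pointer trimming mentioned in the introduction might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a generated LALR(1) parser: LINE_RE, Line, trim_pointer and parse_line are hypothetical names, the input is assumed to be already decoded text (so a UTF-8 byte order mark appears as U+FEFF), and a single space is assumed to separate the tag from its value.

```python
import re
from typing import NamedTuple, Optional

# One GEDCOM line: LEVEL, optional POINTER, TAG, optional line_value.
# Assumption: a single space separates TAG from line_value.
LINE_RE = re.compile(
    r"(?P<level>0|[1-9][0-9]*) "
    r"(?:(?P<pointer>@[a-z_A-Z0-9][^@]*@) )?"
    r"(?P<tag>[a-z_A-Z0-9]+)"
    r"(?: (?P<value>[^\n]+))?"
)


class Line(NamedTuple):
    level: int
    pointer: Optional[str]  # stored without the surrounding '@'
    tag: str
    value: Optional[str]    # returned untrimmed, even if it is a pointer


def trim_pointer(text: str) -> str:
    """Remove the '@' prefix and suffix from a pointer string."""
    return text[1:-1]


def parse_line(raw: str) -> Line:
    # Mirror %ignore%: skip a byte order mark and leading whitespace,
    # then drop the terminating newline.
    raw = raw.lstrip("\ufeff").lstrip(" \t\v\b\f\a").rstrip("\n")
    m = LINE_RE.fullmatch(raw)
    if m is None:
        raise ValueError(f"not a GEDCOM line: {raw!r}")
    pointer = m.group("pointer")
    return Line(
        int(m.group("level")),
        trim_pointer(pointer) if pointer else None,
        m.group("tag"),
        m.group("value"),
    )
```

With this sketch, parse_line("0 @I1@ INDI\n") yields level 0, pointer "I1" and tag "INDI". A value that is itself a pointer, as in "1 HUSB @I1@", is returned unchanged, so the caller decides whether to pass it through trim_pointer.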
 
 
© 2001-2002 Software Renovation Corporation. All rights reserved.