A formal grammar for parsing GEDCOM 5.5x files
The following LALR(1) grammar can be used to parse GEDCOM 5.5x files generated by a wide variety of programs. The grammar returns pointer strings with their delimiters intact, so wherever pointers are used the caller must trim the leading and trailing '@' characters. Likewise, the grammar does not recognize escape sequences embedded in value strings; the caller must extract them.
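The two manual steps described above can be sketched in Python as follows. This is a minimal illustration, not part of the grammar. The function names are hypothetical, and the escape form is assumed to be "@#" followed by text and a closing "@" (for example the calendar escape "@#DJULIAN@"), optionally followed by one space.

```python
import re
from typing import List, Tuple

# Assumed escape form: "@#" + text + "@", optionally followed by one space.
ESCAPE_RE = re.compile(r"@#([^@\n]+)@ ?")


def trim_pointer(pointer: str) -> str:
    """Remove the leading and trailing '@' from a pointer such as '@I1@'."""
    if len(pointer) >= 3 and pointer.startswith("@") and pointer.endswith("@"):
        return pointer[1:-1]
    raise ValueError(f"not a pointer string: {pointer!r}")


def extract_escapes(value: str) -> Tuple[List[str], str]:
    """Return (escape texts, value with the escape sequences removed)."""
    escapes = ESCAPE_RE.findall(value)
    remainder = ESCAPE_RE.sub("", value)
    return escapes, remainder
```

For instance, trim_pointer("@I1@") yields "I1", and extract_escapes separates a calendar escape from the date text that follows it.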
- ALNUM: // an alphanumeric character
-
- [a-z_A-Z0-9]
-
- WHITE: // any whitespace character
-
- [ \t\v\b\f\a]
-
- ANY: // any character except newline
-
- [^\n]
-
- POINTER: // a pointer string
-
- @ALNUM[^@]*@
-
- BOM8: // a UTF-8 byte order mark
-
- \xEF\xBB\xBF
-
- UBOM: // a UTF-16 little-endian byte order mark
-
- \xFF\xFE
-
- NEWLINE:
-
- \n
-
- SPACE:
-
- ' ' // single space character (0x20)
-
- TAG:
-
- ALNUM+
-
- LEVEL:
-
- 0|([1-9][0-9]*)
-
- VALUE:
-
- ANY+
-
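As a quick way to experiment with the terminals listed above, the character classes can be transcribed into Python regular expressions. This is only a sketch: the dictionary and the helper name are hypothetical, and the patterns copy the definitions as written without resolving the overlap between VALUE and POINTER that a real lexer would have to handle.

```python
import re

# Each pattern mirrors one terminal from the listing above.
TOKEN_PATTERNS = {
    "LEVEL": r"0|[1-9][0-9]*",
    "TAG": r"[a-z_A-Z0-9]+",
    "POINTER": r"@[a-z_A-Z0-9][^@]*@",
    "VALUE": r"[^\n]+",
}


def matches(token: str, text: str) -> bool:
    """Return True if the whole of `text` matches the named terminal."""
    return re.fullmatch(TOKEN_PATTERNS[token], text) is not None
```

Note that LEVEL rejects numbers with a leading zero such as "01", and POINTER requires at least one alphanumeric character between the '@' delimiters.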
- %ignore%:
-
- BOM8 // UTF-8 byte order mark
- UBOM // UTF-16 little-endian byte order mark
- WHITE* // leading whitespace
-
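When lines are pre-processed by hand rather than by a generated lexer, the ignore rule above can be approximated as shown below. This is a sketch under stated assumptions: the function name is hypothetical, the input is taken to be a single line of raw bytes, and no decoding is attempted after the byte order mark is removed.

```python
UTF8_BOM = b"\xef\xbb\xbf"
UTF16_LE_BOM = b"\xff\xfe"
WHITE = b" \t\v\b\f\a"  # same characters as the WHITE terminal


def strip_ignored(raw: bytes) -> bytes:
    """Drop a leading byte order mark, then any leading WHITE characters."""
    for bom in (UTF8_BOM, UTF16_LE_BOM):
        if raw.startswith(bom):
            raw = raw[len(bom):]
            break
    return raw.lstrip(WHITE)
```

Only the start of the line is touched; trailing whitespace is left in place because it may belong to the line value.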
- file:
-
- line
- file line
- line:
- LEVEL SPACE TAG NEWLINE
- LEVEL SPACE TAG line_value NEWLINE
- LEVEL SPACE POINTER SPACE TAG NEWLINE
- LEVEL SPACE POINTER SPACE TAG line_value NEWLINE
- line_value:
- VALUE
- POINTER
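The four shapes of the line rule can also be illustrated without a parser generator. The sketch below uses a single regular expression per line instead of an LALR(1) parser, so it is a stand-in for the grammar rather than an implementation of it. The names Line, parse_line and parse_file are hypothetical, and a single space is assumed to separate the tag from the line value.

```python
import re
from typing import List, NamedTuple, Optional

# One GEDCOM line: LEVEL, optional POINTER, TAG, optional value.
# Assumption: exactly one space separates the tag from the value.
LINE_RE = re.compile(
    r"(?P<level>0|[1-9][0-9]*) "
    r"(?:(?P<pointer>@[a-z_A-Z0-9][^@\n]*@) )?"
    r"(?P<tag>[a-z_A-Z0-9]+)"
    r"(?: (?P<value>[^\n]+))?"
)


class Line(NamedTuple):
    level: int
    pointer: Optional[str]  # still wrapped in '@', as the grammar returns it
    tag: str
    value: Optional[str]


def parse_line(text: str) -> Line:
    """Parse one line (without its newline) into a Line record."""
    m = LINE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"malformed GEDCOM line: {text!r}")
    return Line(int(m["level"]), m["pointer"], m["tag"], m["value"])


def parse_file(text: str) -> List[Line]:
    """Parse newline-separated lines, skipping empty ones."""
    return [parse_line(raw) for raw in text.split("\n") if raw]
```

A value that is itself a pointer, as in "1 FAMS @F1@", is returned in the value field with its '@' delimiters intact, so the caller still has to trim it.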