A GEDCOM Grammar
The following LALR(1) grammar can be used to parse GEDCOM 5.5 files generated by a variety of programs. The grammar returns pointer strings with their delimiters intact, so if the bare identifier is needed, the leading and trailing '@' characters must be trimmed manually. Escape sequences embedded in value strings, if used, must likewise be extracted manually.
- ALNUM: // an alphanumeric character
-
- [a-z_A-Z0-9]
- WHITE: // any whitespace character
-
- [ \t\v\b\f\a]
- ANY: // any character except newline
-
- [^\n]
- POINTER: // a pointer string
-
- @ALNUM[^@]*@
- BOM8: // a UTF-8 byte order mark
-
- \xEF\xBB\xBF
- UBOM: // a UTF-16 little-endian byte order mark
-
- \xFF\xFE
- NEWLINE:
-
- \n
- SPACE:
-
- ' ' // single space character (0x20)
- TAG:
-
- ALNUM+
- LEVEL:
-
- 0|([1-9][0-9]*)
- VALUE:
-
- ANY+
- %ignore%:
-
- BOM8 // UTF-8 byte order mark
- UBOM // UTF-16 little-endian byte order mark
- WHITE* // leading whitespace
- file:
-
- line
- file line
- line:
- LEVEL SPACE TAG
- LEVEL SPACE TAG line_value NEWLINE
- LEVEL SPACE POINTER SPACE TAG
- LEVEL SPACE POINTER SPACE TAG line_value NEWLINE
- line_value:
- VALUE
- POINTER
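As a rough illustration of the rules above, the following Python sketch matches a single line with a regular expression instead of an LALR(1) parser, then performs the manual pointer trimming and escape extraction mentioned in the introduction. The names `GedcomLine`, `parse_line`, `trim_pointer`, and `extract_escapes` are hypothetical and not part of the grammar. The sketch assumes the input has already been decoded to text, that a single space separates the tag from its value, that a trailing carriage return may be discarded, and that escape sequences take the GEDCOM 5.5 form `@#...@`.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class GedcomLine(NamedTuple):
    level: int
    xref: Optional[str]   # trimmed pointer that labels the record, if any
    tag: str
    value: Optional[str]  # raw line value (may itself be a pointer)


# POINTER token: '@', one ALNUM character, any run of non-'@', then '@'.
POINTER_RE = re.compile(r"@[A-Za-z0-9_][^@]*@")

# Regex counterpart of the `line` rule. The leading character class mirrors
# WHITE, which %ignore% skips at the start of a line.
LINE_RE = re.compile(
    r"[ \t\v\b\f\a]*"
    r"(?P<level>0|[1-9][0-9]*) "              # LEVEL SPACE
    r"(?:(?P<xref>@[A-Za-z0-9_][^@]*@) )?"    # optional POINTER SPACE
    r"(?P<tag>[A-Za-z0-9_]+)"                 # TAG
    r"(?: (?P<value>.+))?"                    # assumed: SPACE before line_value
)

# Assumed escape form: '@#' escape text '@', optionally followed by one space.
ESCAPE_RE = re.compile(r"@#([^@\n]*)@ ?")


def trim_pointer(text: str) -> str:
    """Strip the surrounding '@' characters if text is a POINTER token."""
    return text[1:-1] if POINTER_RE.fullmatch(text) else text


def extract_escapes(value: str) -> Tuple[List[str], str]:
    """Return the escape texts found in value and the value without them."""
    return ESCAPE_RE.findall(value), ESCAPE_RE.sub("", value)


def parse_line(raw: str) -> GedcomLine:
    """Parse one GEDCOM line, dropping a leading BOM and the line ending."""
    m = LINE_RE.fullmatch(raw.lstrip("\ufeff").rstrip("\r\n"))
    if m is None:
        raise ValueError(f"not a GEDCOM line: {raw!r}")
    xref = m.group("xref")
    return GedcomLine(
        level=int(m.group("level")),
        xref=trim_pointer(xref) if xref else None,
        tag=m.group("tag"),
        value=m.group("value"),
    )
```

For example, `parse_line("0 @I1@ INDI\n")` yields level `0`, cross-reference `I1`, tag `INDI`, and no value, while the value of `parse_line("1 HUSB @I1@")` is left as `@I1@` until `trim_pointer` is applied to it. Lines that do not fit the pattern raise `ValueError` rather than being skipped, which keeps malformed input visible.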