Generic document structure objects
Two parts:
- data structures representing the document structure
- text formatting
It's important to keep the distinction between the two. Representations for the first are independent of representations for the second.
I'd like the first (data structures) to be something that can easily and directly translate between languages, container formats, databases, etc. The second (formatting) should primarily be unified across all data structure types and formats, but there might be alternative representations for XHTML.
We can address the formatting first:
- bold text
tags for XHTML
- italic text
tags for XHTML
- underline text
tags for XHTML
- strikethrough text
tags for XHTML
- monospaced text
<m> tags for XHTML?
- enumerated lists
- tags in XHTML
- itemized lists
- tags in XHTML
- description lists
- tags in XHTML
- TeX inline math
$ "tag" in TeX
- references (both internal and external)
Ensure that we allow Unicode characters and maybe XHTML character entities?
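One possible way to support both, sketched in Python: store text as plain Unicode, decode character entities on the way in, and escape only the markup-significant characters on the way out. The function names normalize_text and to_xhtml_text are hypothetical, and treating entities as an input/output convenience rather than a storage format is an assumption, not something these notes decide.

```python
import html


def normalize_text(raw: str) -> str:
    """Decode character entities (e.g. &eacute;) into plain Unicode."""
    return html.unescape(raw)


def to_xhtml_text(text: str) -> str:
    """Escape only &, < and >, which are significant in XHTML text."""
    return html.escape(text, quote=False)


# Entities in the input become ordinary Unicode characters in storage.
stored = normalize_text("caf&eacute; &amp; cr&egrave;me")

# On output, non-ASCII characters pass through; only markup is escaped.
rendered = to_xhtml_text(stored)
```

With this split, the stored text never needs to know which output format it will end up in.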
The data structure should be easy to put into YAML, JSON, XML. It should be directly representable by nested hashes and arrays in JavaScript and Perl.
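As a rough illustration of the "nested hashes and arrays" idea, here is a minimal sketch using Python dicts and lists with a JSON round trip. The key names kind, body, content, and text are placeholders chosen for this example, not a settled schema; kind is used for the object type so that it does not collide with a type field describing data.

```python
import json

# A document as nothing but nested dicts (hashes) and lists (arrays).
document = {
    "kind": "Document",
    "title": "Example",
    "body": [
        {
            "kind": "Section",
            "id": "intro",
            "title": "Introduction",
            "content": [
                {"kind": "Paragraph", "text": "Hello, w\u00f6rld."},
            ],
        },
    ],
}

# ensure_ascii=False keeps Unicode characters readable in the output.
encoded = json.dumps(document, ensure_ascii=False, indent=2)

# Decoding gives back an equal structure, so nothing is lost in transit.
decoded = json.loads(encoded)
```

Because the structure uses only strings, lists, and string-keyed maps, the same data should map onto YAML or XML without needing language-specific types.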
Object types:
- Document: contains information about the entire document and points to contained Sections (as pre-matter, main body, and appendices).
- Section: Contains content in the form of Paragraphs, Figures, Tables, Code Fragments, Equations.
- Paragraph: Contains text content.
- Figure: Contains image content.
- Table: Contains tabular format (data).
- Code: Contains code listings.
- Equation: Contains non-inline TeX equations (or MathML I suppose).
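The object types above could be sketched as Python dataclasses that flatten into plain nested dicts and lists. Only a few types are shown, and every field name here (pre_matter, main_body, appendices, content, text, url) is an assumption made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Union


@dataclass
class Paragraph:
    text: str


@dataclass
class Figure:
    url: str


@dataclass
class Section:
    title: str
    # A Section holds an ordered mix of content objects.
    content: List[Union[Paragraph, Figure]] = field(default_factory=list)


@dataclass
class Document:
    title: str
    # The three groups of Sections named in the Document description.
    pre_matter: List[Section] = field(default_factory=list)
    main_body: List[Section] = field(default_factory=list)
    appendices: List[Section] = field(default_factory=list)


doc = Document(
    title="Example",
    main_body=[Section("Intro", [Paragraph("Hello")])],
)

# asdict() recursively converts dataclasses into dicts and lists.
plain = asdict(doc)
```

Note that asdict() drops the class names, so a Paragraph and a Figure can only be told apart by their fields. A real encoding would probably need an explicit tag on each object to say which kind it is.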
For each item, it could contain verbatim text, formatted text, data, image blobs, URIs to find the information, etc., as appropriate for the type.
- Most of these objects have id fields, specifying a way to refer to the given section, figure, etc.
- Most have title fields, which would be used to identify the section, etc., in tables of contents, figures, bibliographies, etc., as well as titles in the section text itself.
- A url field may be present which would be used to get the data for the section.
- Alternatively, a blob field could be present, where the section's data would be directly included in the data structure.
- If either the url or blob field is present, a type field would specify how to interpret the data at the URL, or the data in the blob. The default type depends on the kind of object and on which field holds the data.
- Some objects may have caption fields, to be shown alongside the figure, table, etc.
- Some objects may contain other fields to represent the data in a different way.
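The url/blob/type rules above might be resolved along these lines. This is only a sketch: the DEFAULT_TYPES table, the kind key, the fallback type, and the choice to check url before blob are all assumptions, since the notes do not fix actual defaults or a precedence.

```python
from typing import Optional, Tuple

# Hypothetical defaults, keyed by (object kind, field holding the data).
DEFAULT_TYPES = {
    ("Figure", "url"): "image/png",
    ("Figure", "blob"): "image/png",
    ("Code", "blob"): "text/plain",
    ("Equation", "blob"): "application/x-tex",
}

# Assumed fallback when no default is listed for a kind/field pair.
FALLBACK_TYPE = "application/octet-stream"


def content_source(obj: dict) -> Optional[Tuple[str, str, str]]:
    """Return (field name, value, data type), or None if no url or blob."""
    for field_name in ("url", "blob"):  # assumed precedence: url first
        if field_name in obj:
            # An explicit type field wins; otherwise use the default.
            default = DEFAULT_TYPES.get((obj.get("kind"), field_name), FALLBACK_TYPE)
            return field_name, obj[field_name], obj.get("type", default)
    return None


figure = {
    "kind": "Figure",
    "id": "fig-1",
    "title": "Overview",
    "caption": "An overview diagram.",
    "url": "figures/overview.png",
}
listing = {"kind": "Code", "blob": "print('hi')", "type": "text/x-python"}

figure_source = content_source(figure)
listing_source = content_source(listing)
```

Here the figure falls back to the default type for a Figure url, while the code listing's explicit type overrides the default.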