I tend to agree with John.
We need to factor in the likely SDLC for our schemas and modeling.
In particular, if you are a multinational company or a company that will likely merge with other companies or buy in data, you can be pretty sure that the structure you derive from your document analysis will be revealed as incorrect/inappropriate/bogus as soon as it is faced with these new kinds of data.
Therefore you need to be much clearer about what is vocabulary (e.g. element names), what is semantically necessary structure (e.g. that a table cell occurs only in a row, and a row only in a table), and what are document constraints (e.g. that a section starts with one title, or that an address has one ZIP code at the end).
Failure to model these separately (e.g. with an open and loose base schema for the first two, and derived schemas or Schematrons for the last) causes extra work during later integration. Sometimes this is compounded by management who, embarrassed that the new documents could not be shoe-horned into the standard schemas that so much modeling effort had been spent on, blame the poor suckers who have to do the shoe-horning rather than attributing the problem to a failure of awareness of the SDLC.
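To make the layering concrete, here is a rough sketch in Python rather than in a real schema language. It is only an illustration: the element names, the VOCABULARY and REQUIRED_PARENT tables, and both check functions are made up for this example, and the second function merely stands in for what a derived schema or a Schematron would do.

```python
import xml.etree.ElementTree as ET

# Layer 1: vocabulary. Hypothetical element names for this sketch.
VOCABULARY = {"doc", "section", "title", "para", "table", "row", "cell"}

# Layer 2: semantically necessary structure (child -> allowed parents).
REQUIRED_PARENT = {"cell": {"row"}, "row": {"table"}}


def check_base(root):
    """Open, loose base layer: known names and necessary nesting only."""
    problems = []
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in root.iter():
        if elem.tag not in VOCABULARY:
            problems.append(f"unknown element <{elem.tag}>")
        allowed = REQUIRED_PARENT.get(elem.tag)
        if allowed is not None:
            parent = parent_of.get(elem)
            if parent is None or parent.tag not in allowed:
                names = ", ".join(sorted(allowed))
                problems.append(f"<{elem.tag}> must be inside {names}")
    return problems


def check_house_rules(root):
    """Layer 3: document constraints, kept apart so they can vary by source."""
    problems = []
    for section in root.iter("section"):
        children = list(section)
        titles = [c for c in children if c.tag == "title"]
        # Exactly one title, and it must be the first child.
        if len(titles) != 1 or children[0].tag != "title":
            problems.append("section must start with exactly one title")
    return problems


# A document from a hypothetical acquired company: title comes after the text.
merged = ET.fromstring(
    "<doc><section><para>text</para><title>Late title</title>"
    "<table><row><cell>1</cell></row></table></section></doc>"
)
print(check_base(merged))         # base layer findings
print(check_house_rules(merged))  # house-rule findings
```

In this sketch the incoming document passes the base layer and fails only the house rules, so only the third layer would need to be revisited or replaced for the new source; the vocabulary and the necessary structure stay as they are.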
Regards
Rick