Conversion to XML: Should I stay with SGML?
Mike Gross, Chief Technical Officer at DCL, answers the big question of whether you should move from SGML to XML.
Mike Gross is responsible for solution engineering at DCL. He has been solving digital publishing conversion problems at DCL for almost 20 years and has overseen thousands of legacy conversion projects. One of the questions he is asked most often is whether you should move from SGML to XML. People understandably want to know whether it is necessary (i.e., will SGML go out of use) and whether it will benefit them.
DCLnews caught up with Mike during a quiet moment in his busy schedule and put these and other frequently asked questions to him.
Q: I'm already in SGML. Do I need to start moving over to XML?
A: If you are already in SGML with production systems in place, you've already gotten past the hard part, and there may be no pressing reason to leave SGML. Typically, you would need to contemplate such a move if you are considering new technologies that support XML to the exclusion of SGML, or if the vendors that support your current system will be dropping SGML support. In general, the old adage "If it ain't broke, don't fix it" may apply here.
Q: I'm just starting out. Do I go to SGML or XML?
A: Although there are useful features in SGML that were left out of XML, the reality is that XML has become a standard accepted worldwide. Many new, powerful technologies have been built around it, such as XSLT, XLink, XSL-FO, XML Schema, and MathML, and there are many XML support tools. Popular mainstream software titles (such as Microsoft Office) also continue to enhance their XML support. Most applications can work around the features that were left out in the transition from SGML to XML. So if you are starting out today (and don't need to work with a preexisting SGML DTD), our recommendation in most cases would be to go for XML.
Q: If I've already put my data into SGML, and I want to tap some of XML's power, have I lost my investment?
A: Your investment in having built a database in SGML is certainly not lost, and as we said above, it may simply be unnecessary to stop existing SGML projects in order to retool in XML. In general terms, XML is a subset of SGML, and both allow you to represent your information using structured markup (and taking unstructured documents and transforming them into structured XML/SGML markup really is the hard part). Having said that, for large legacy document systems built around SGML, a complete migration to XML can be complex and involve substantial expense.
Migrating a document set from SGML to XML involves two main tasks: porting the DTD to XML, and then porting the actual documents from SGML-compliant markup to XML-compliant markup. Porting the documents so that they are XML compliant, though nontrivial, can often be done with publicly available tools that transform SGML documents into XML-compliant ones (by doing tasks such as making tag case consistent, restoring tags that were omitted through tag minimization, and adding the extra slash (/) at the end of empty tags).
Occasionally you may run into problems trying to make an SGML document XML compliant, but typically this is straightforward. In some cases, you might even have SGML documentation sets that are already XML compliant, and nothing needs to be done other than fixing empty tags.
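As a rough illustration of this kind of mechanical cleanup, the following Python sketch lowercases tag and attribute names, quotes unquoted attribute values, and adds the closing slash to empty tags. It is a toy under stated assumptions, not one of the publicly available conversion tools: the `EMPTY_ELEMENTS` set and the function name `sgml_tags_to_xml` are hypothetical, lowercase is an arbitrary choice of target case, and omitted end tags are not restored, since that requires parsing the document against its DTD.

```python
import re

# Hypothetical list of elements declared EMPTY in the DTD (an assumption).
EMPTY_ELEMENTS = {"br", "img"}

# Start or end tag: optional slash, element name, then raw attribute text.
# Simplification: assumes attribute values never contain '<' or '>'.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^<>]*)>")

# name=value pairs where the value is quoted or a bare token.
ATTR_RE = re.compile(r"""([A-Za-z_][\w.-]*)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")


def _fix_attrs(raw):
    """Lowercase attribute names and quote any unquoted values."""
    parts = []
    for name, value in ATTR_RE.findall(raw):
        if value[0] not in "\"'":
            value = '"' + value + '"'
        parts.append(f" {name.lower()}={value}")
    return "".join(parts)


def _fix_tag(match):
    slash, name, raw_attrs = match.groups()
    name = name.lower()
    if slash:
        return f"</{name}>"
    attrs = _fix_attrs(raw_attrs)
    if name in EMPTY_ELEMENTS:
        # XML requires empty elements to be explicitly closed.
        return f"<{name}{attrs}/>"
    return f"<{name}{attrs}>"


def sgml_tags_to_xml(text):
    """Apply the tag-level fixes to every start and end tag in text."""
    return TAG_RE.sub(_fix_tag, text)


print(sgml_tags_to_xml("<PARA ID=p1>Line<BR>two</Para>"))
```

Comments and declarations (which begin with `<!`) are left untouched because the tag pattern requires a letter after the opening bracket. A real migration would normally use an SGML parser rather than regular expressions.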
Converting SGML DTDs to XML
The more difficult issue is migrating the legacy SGML DTD(s) to conform to the restricted set of DTD features that the XML standard allows. In order to simplify XML, certain features that were allowed in building SGML DTDs, such as inclusions, exclusions, and the "&" content model connector, were removed from XML DTD support (partly to make XML parsers easier to build). Inclusions allow an element to occur anywhere inside another element. Because XML does not support them, a DTD that made heavy use of this feature will need a partial overhaul to provide equivalent functionality in XML.
Exclusions (the ability to exclude particular tags in specific contexts) are a bigger problem. It may not be possible to rewrite a DTD in XML and specify tagging requirements in exactly the same way as they were specified in SGML, and the rewrite may even require new tagging to be inserted in the XML marked-up documents.
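To make the inclusion problem concrete, here is a minimal Python sketch of one way an inclusion could be expanded by hand: the included element is added to the content model of the declaring element and of every element reachable from it. This is an illustration under simplifying assumptions, not a general DTD converter. Every content model is treated as mixed content, the element names are invented, the included element is not added to itself (mimicking the exclusion that SGML DTDs often pair with an inclusion), and the helper names `expand_inclusion` and `to_declaration` are hypothetical.

```python
# Simplified DTD: element name -> child elements allowed in mixed content.
# In SGML, "footnote" might be declared once as an inclusion on "chapter":
#   <!ELEMENT chapter - - (#PCDATA | title | para)* +(footnote)>
models = {
    "chapter": ["title", "para"],
    "title": [],
    "para": ["emph"],
    "emph": [],
    "footnote": ["emph"],
}


def expand_inclusion(models, root, included):
    """Return a copy of models with `included` added to the content model
    of `root` and of every element reachable from `root`."""
    expanded = {name: list(children) for name, children in models.items()}
    seen, stack = set(), [root]
    while stack:
        name = stack.pop()
        if name in seen or name not in models:
            continue
        seen.add(name)
        stack.extend(models[name])  # walk the original children
        if name != included and included not in expanded[name]:
            expanded[name].append(included)
    return expanded


def to_declaration(name, children):
    """Render one element as an XML DTD mixed-content declaration."""
    if not children:
        return f"<!ELEMENT {name} (#PCDATA)>"
    return f"<!ELEMENT {name} (#PCDATA | {' | '.join(children)})*>"


expanded = expand_inclusion(models, "chapter", "footnote")
for name, children in expanded.items():
    print(to_declaration(name, children))
```

Even in this small example, one inclusion on `chapter` turns into changes to four separate declarations, which is why a DTD that leans heavily on inclusions needs an overhaul rather than a search-and-replace. Element-content models (sequences such as `(title, para+)`) are harder still and are deliberately left out of the sketch.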
The other DTD features left out of XML could be problematic as well, and can require DTD changes that then require changes to the actual XML document tagging.
What this means in practice is that for many complex DTDs (including some industry-standard ones), the DTD overhaul may force you to modify some document tagging. That in itself is a "can of worms," because if you've got approved documents, you may have to go through a whole re-approval process. If you've got thousands of documents, you could be looking at an expensive migration process.
Are there any shortcuts?
So it is probably prudent to ask yourself why you are doing the migration. If management has mandated XML, then you may have no choice. If you've got publishing tools that are no longer supported by the vendors (or the vendor is pulling SGML support), then you may also have no choice. If you want to make use of certain technologies that are XML only, then it may also be wise to do a full content migration.
But if you need XML simply to make use of XML document publishing tools, a compromise solution may be workable. You can continue to author in SGML (assuming your authoring/editing environment still includes SGML support, as many still do), validate the documents against the SGML DTD, and then use migration tools to convert the SGML document instance to a well-formed XML document. (Please remember that XML only requires that a document be well-formed; it doesn't actually require a DTD.)
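The point that well-formedness does not depend on a DTD can be checked directly. The short Python sketch below uses the standard library's `xml.etree.ElementTree` parser, which checks well-formedness only and does not validate against a DTD. The helper name `is_well_formed` and the sample documents are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML (no DTD needed)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Converted output with every tag closed parses cleanly.
print(is_well_formed("<doc><p>Line one<br/>line two</p></doc>"))

# SGML-style markup with an unclosed empty tag is rejected.
print(is_well_formed("<doc><p>Line one<br>line two</p></doc>"))
```

A check like this could run as the last step of the hybrid workflow, after the SGML-to-XML conversion and before the document is handed to the XML publishing tools.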
This XML version of the document can then be used with readily available publishing tools. It could, for instance, be rendered over the web in Internet Explorer (as well as other XML-enabled web browsers), or be published as a PDF document using XSL-FO rendering tools. Thus, this hybrid solution may allow you to avoid the pain of migrating complex DTDs to XML while still making use of some of XML's publishing power. This solution will not work for everybody, but it could allow you to tap into the world of XML while minimizing time and expense.
For more information on the specific details of migrating from SGML to XML, we refer you to an excellent article by Norman Walsh which can be found at http://www.xml.com/pub/a/98/07/dtd/index.html.
Automated transition
The transition from SGML to XML, while not trivial, is not a complex one. Most of what needs to be done can be done automatically through software filters, the DTD can be made XML compliant, and much existing SGML software has already been ported to support XML. The main cleanup tasks come from SGML features that XML does not support: putting quotes around attribute values, making the case of tags consistent, and removing tag minimization. More details may be found in "SGML to XML Conversion Strategies" by Richard Lander.
DCLnews editorial