ABSTRACT: Getting an encyclopedia on-line is a monumental task. All that
information has to be converted to a structured format like SGML. DCL
specializes in such large complex conversions. The key is a well-defined process
that includes careful design and plenty of customer feedback. From detailed
conversion specifications to a review pass with pre-composition software,
converting the world's knowledge becomes a possibility. This article details the
steps in that process through the use of a composite case study.

Putting together an encyclopedia is a monumental task: articles are
collected from experts in hundreds of different fields, which then have to be
laid out in a consistent format for volume after volume, and then there's the
general index, filled with references to this gigantic opus. It takes years to
develop an encyclopedia, and the process of updating, refining and expanding
never ends.

So how can something of this magnitude get on-line? It's not easy, but
clearly it's happening. Not just encyclopedias, but dictionaries, almanacs and
most other kinds of reference books, ranging in size from bulky to colossal, are
available over the Internet or on CD-ROM.

It's not hard to understand why someone in the Board Room would demand
on-line access to their encyclopedia, but what does Joe Vice President do when
he gets assigned the task?

Choosing SGML

After a good deal of research and asking around, Joe chooses SGML (Standard
Generalized Markup Language) from the many options available for electronic
publishing. SGML is designed for structured data, a necessary element of
electronic publishing. After all, this structure enables multiple uses for your
data (CD-ROM, Web pages, FTP, e-mail), hyperlinking and sophisticated search
and retrieval. SGML achieves this structure with a Document Type Definition (DTD),
which is tailored to a specific kind of document (encyclopedias, dictionaries,
repair manuals, insurance policies and so on). Perhaps most
importantly, SGML is a standard that is easily converted to most other formats
(like HTML for the Web).

Now that Joe has made this decision, the problem becomes getting the
encyclopedia into SGML. Conversion to a highly structured format is difficult
because it means adding structure to the data. Initially, Joe thinks his SGML
team (be they in-house experts or consultants) will convert the data, but he
soon realizes that the scope of such an effort would require a team of workers
dedicated to the conversion alone. Such a team would drain his much-needed SGML
people, face a significant learning curve and leave him with extra staff who
would become redundant the moment the conversion was over.

Bring in the Conversion Pros

Joe decides to hire a conversion specialist. As the following example will
demonstrate, document conversion has its own challenges that go well outside the
scope of SGML knowledge alone. (By the way, Joe's conversion is really a
composite case study of the work Data Conversion Laboratory has done for various
encyclopedia publishers.)

Design & Setup

To ease the conversion, DCL has established a well-defined process for
planning and designing the project with customer feedback. First, Joe sends DCL
the encyclopedia itself, along with typesetting tapes. Fortunately, since much
of the typesetting was already done electronically, most of the material will
not have to be keyed in.

Our data analysts carefully study this material and compare it to previous
encyclopedia projects. They also study a "mark-up" that Joe sent. This
consists of some photocopied encyclopedia pages with tag names written in
the margins, so we know how the customer wants the data tagged. You might think
this can be determined from the DTD, but there are always ambiguities. To
further assure customer agreement, DCL prepares a small sample, called a
"proof of concept" sample. Once Joe approves this sample, the data
analyst writes up conversion specifications.

These specifications form the road map of the conversion. They list all of
the relevant structures in the encyclopedia (articles, bylines, reference lists,
headings, lifetimes for the biographical entries, etc.), how they can be
recognized in the source data and how they will be tagged in the SGML. Trying to
do an SGML conversion (or any complex conversion) without conversion
specifications is like building a house without a blueprint, yet many
conversions are attempted without such a document.

These specifications are approved by Joe. DCL has found customer feedback
essential for a successful conversion. One of the realities of such a complex
project is that one cannot completely visualize it at the start. The customer
must be given the opportunity to adjust their vision throughout the project,
until the newly implemented system finally takes shape. Conversion is therefore
best conceived of as a collaborative effort between the customer and the vendor.

A larger sample is prepared, which is called the production sample, because
it emulates the full production process. DCL's software is configured to tag the
data in the same way as this sample, so that there are no surprises.

Production

This is the phase that we did not rush into, but the wait is more than made
up for by our careful preparation, which minimizes the amount of manual work and
rework. Weekly deliveries are sent to Joe, which he can plug into his CD
software to make sure the data works. When it doesn't, adjustments can be
made. For instance, let's say the chemistry doesn't look right. It's tagged
properly, but his CD application can't handle it in this format (it couldn't be
tested earlier because the chemistry viewer wasn't ready until after production
began). Because this discrepancy was caught as early as possible, Joe has
options: DCL can change the conversion process to retag formulae, or Joe can
have the chemistry viewer modified.

The automated process more than makes up for the effort spent on setup. The
bulk of the work is done through software, which not only saves up-front labor
costs, but also reduces the effort needed to check and correct the converted material.

Format Review

Quality control should be part of any production process. DCL's primary
quality check is called a "format review," because the SGML is loaded
into precomposition software, which formats the documents to visually
demonstrate how they are tagged. This sort of specialized composition is more
effective for review than a publishing-oriented full composition package.

DCL's format review phase is unique in the industry, but we feel it's
critical to the quality of the finished product. Other vendors promise parseable
SGML, but we found that parsing isn't enough. If a magazine editor sent all of
his articles to a copy editor for spell checking and grammar checking, but never
looked at them himself, he would soon be fired. We feel that a conversion
service should have the same responsibility to make sure the data is correct,
not just parseable.

Final Review

As mentioned above, conversion is not simply a matter of dropping off your
old data and picking up finished documents. Conversion is a team effort. Even
after the converted material is received, there is additional work for the
client. As a publisher, Joe is very demanding about his documentation. A final,
thorough review is done by his own staff. Because of our feedback process and
quality control, his editors are able to focus on high-level subject-matter
issues (e.g., is this link to John I connected to the right John I?).

Without a format review, customer cleanup can be a long and costly process.
And unless you provide feedback throughout the process, you may find that your
data is perfectly valid, but does not meet your needs. The subjective nature of
SGML makes this mishap a likely, if not inevitable, occurrence if nothing is
done to avoid it.

Moving Mountains

And Joe can hardly wait for the next Board meeting.
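
To make the idea of a conversion specification more concrete, here is a minimal sketch in Python of the three things the Design & Setup section says a specification records: the structure, how it is recognized in the source data, and how it is tagged in the SGML. This is not DCL's software. The typesetting codes (<H1>, <BY>, <TX>), the SGML element names and the helpers Rule, convert_line and convert are all hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    structure: str  # name of the structure listed in the specification
    pattern: str    # how it is recognized in the source (hypothetical codes)
    tag: str        # SGML element used to tag it (hypothetical name)


# Hypothetical typesetting codes: <H1> title, <BY> byline, <TX> body text.
SPEC = [
    Rule("article title", r"^<H1>(.*)$", "title"),
    Rule("byline", r"^<BY>(.*)$", "byline"),
    Rule("paragraph", r"^<TX>(.*)$", "p"),
]


def convert_line(line, spec=SPEC):
    """Return (text, matched): the tagged text and whether a rule applied."""
    for rule in spec:
        m = re.match(rule.pattern, line)
        if m:
            return f"<{rule.tag}>{m.group(1)}</{rule.tag}>", True
    return line, False


def convert(lines, spec=SPEC):
    """Tag every line and collect unrecognized lines for manual review."""
    out, unmatched = [], []
    for line in lines:
        text, ok = convert_line(line, spec)
        out.append(text)
        if not ok:
            unmatched.append(line)
    return out, unmatched
```

Calling convert(["<H1>Aardvark", "<ZZ>unknown"]) tags the first line as a title and returns the second in the unmatched list. Reporting unrecognized input instead of guessing reflects the article's point that ambiguities are settled with the customer, not silently by the software.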
Data Conversion Laboratory, Inc. 61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365 718-357-8700 convert@dclab.com Copyright © 1997-2008 Data Conversion Laboratory, Inc. All rights reserved.