On July 8, 2004, I visited Drexel University to discuss how to preserve the materials that will be created by their plan to digitize their archival collections on Women in Medicine.
We generally agreed on a preservation plan involving three basic steps:
1. Use of standard non-proprietary formats for information, to ensure availability of software to interpret these formats in the future.
Specifically, TIFF and JPG are used for images; TIFF for retention of full data and JPG for faster presentation. Metadata are created in accordance with Dublin Core and stored in XML files.
2. Maintenance of current files in accordance with normal computer center techniques, including daily backups of disks onto magnetic tapes. The Drexel Library's information technology staff is responsible for this process. The total amount of material is estimated at hundreds of gigabytes, probably less than one terabyte. The exact amount depends on how many pages are scanned in color as opposed to greyscale, and whether lossless compression is used on the archival images.
3. Anticipation of loss of Drexel service by arranging alternate sites.
Drexel support might be endangered either by a geographical disaster, such as fire or flood, or by the administrative closure of some key unit, noting that the medical school has changed owners twice in the last few decades. Drexel will approach at least two major libraries with interests in medicine or women's history and ask them to keep copies of the material generated. Two suggested institutions were the National Library of Medicine (Bethesda, Maryland) and the Arthur and Elizabeth Schlesinger Library on the History of Women in America (Cambridge, Massachusetts).
DIGITAL FORMATS FOR PRESERVATION
The use of widely adopted standard formats for digital material is
important for preservation. Standard formats with public descriptions
enable future users to find software that will read the material and display it. If the format is widely used today, software to read it is likely to be easy to obtain in the future.
The most common format for raw image storage in libraries is TIFF (ISO 12639). Of the variations available in the standard, Drexel is using lossless TIFF. This places minimal demands on the software needed to read it, at the cost of extra storage.
Drexel is considering doing the majority of the scanning in color, even though the material is primarily either handwriting or printing with no original intent to include color content. However, many of the pages are discolored through age, and the capture of color content may enable more accurate separation of the writing or printing from the background
in later image processing. The librarians selecting the material
for scanning will be examining each item and making the choice between color and greyscale scanning.
A 400 dpi uncompressed color image of an 8.5x11 page is approximately 15 megabytes. Thus, even if all 27,000 images were created in color, the original images would need only about 400 gigabytes; with disk space currently at 50 cents a gigabyte, reducing a $200 total cost for disk is not a priority.
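The arithmetic behind these figures can be checked with a short sketch. The page size, resolution, image count and disk price are taken from the text. The bytes-per-pixel values are assumptions made for illustration: the 15-megabyte figure corresponds to one byte per pixel, while 24-bit color would use three.

```python
# Rough storage estimate for uncompressed page scans.
# Page size, resolution, image count and disk price come from the text;
# the bytes-per-pixel values are assumptions made for illustration.

def image_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Size in bytes of one uncompressed scan, ignoring file headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel

def collection_gigabytes(n_images, bytes_per_image):
    """Total size in decimal gigabytes (10**9 bytes)."""
    return n_images * bytes_per_image / 1e9

N_IMAGES = 27_000
DOLLARS_PER_GB = 0.50

for label, bpp in (("1 byte/pixel", 1), ("3 bytes/pixel (24-bit color)", 3)):
    per_image = image_bytes(8.5, 11, 400, bpp)
    total_gb = collection_gigabytes(N_IMAGES, per_image)
    print(f"{label}: {per_image / 1e6:.1f} MB per page, "
          f"{total_gb:.0f} GB total, about ${total_gb * DOLLARS_PER_GB:.0f} of disk")
```

At one byte per pixel this reproduces the estimate above (about 15 megabytes per page and roughly 400 gigabytes in total). At three bytes per pixel the totals are three times larger, so the choice of color depth is worth settling before disks are bought, although the cost remains modest at the quoted price.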
If it were felt important to save space, lossless compression methods such as LZW are a better choice than using an unsuitable scanning method.
An alternative to TIFF would be the PNG (Portable Network Graphics) format, but this format is much less widely used than TIFF and is not recommended.
For image display, compression is needed to reduce transmission time.
The project proposes to use JPEG (ISO 10918 and 15444). This is the most common format for image distribution today and is the obvious choice.
For preservation purposes, this format is less important, since the compressed images could always be recomputed from the original versions.
JPEG is extremely widely used and seems likely to remain so despite the current flurry of lawsuits from Forgent Networks. Whatever happens in these cases, the Forgent patents expire in October 2006.
Alternatives are GIF, suitable but less efficient on general images, and DjVu, which might be significantly more compact but is currently a proprietary product. The project should stick with JPEG.
The cataloging is being done in accordance with the Dublin Core Metadata Initiative (ISO 15836:2003, ANSI/NISO Z39.85-2001). This is an internationally used set of metadata fields which is relatively simple (at least compared with
MARC) and easily understood. Crosswalk methods exist to translate
Dublin Core into various versions of MARC, EAD, TEI, and others.
The Dublin Core data will be formatted and presented in XML (officially a
subset of SGML, ISO 8879:1986). XML is if anything too popular today,
and the hype will die down at some point, but the SGML-XML syntax is widely accepted in many contexts, including the entire web, and software to read these formats will be widely available.
Dublin Core expressed in XML is fairly simple ASCII text, and the combination is quite popular. Finding software in the future to process this content should be easy.
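To illustrate how little machinery such a record needs, here is a minimal sketch that writes and re-reads one simple Dublin Core record using only the Python standard library. The field values and the wrapping "record" element are hypothetical and are not taken from the project's actual files; only the Dublin Core element names and namespace are standard.

```python
# Minimal sketch: write one simple Dublin Core record as XML.
# The field values and the wrapping <record> element are hypothetical;
# only the Dublin Core element names and namespace are standard.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields):
    """Build a <record> element with one dc:* child per (name, value) pair."""
    record = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record

example = dublin_core_record([
    ("title", "Letter to the Dean (hypothetical example)"),
    ("date", "1895"),
    ("type", "Text"),
    ("format", "image/tiff"),
    ("identifier", "example-0001"),
])
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)

# Reading it back needs nothing beyond a generic XML parser.
parsed = ET.fromstring(xml_text)
titles = [e.text for e in parsed.findall(f"{{{DC_NS}}}title")]
print(titles)
```

Because the record is plain text with a published namespace, any general-purpose XML parser can recover the fields without project-specific software.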
Alternatives would be bibliographic description standards such as EAD (Encoded Archival Description) or USMARC, but either would raise the cost of creating metadata, and neither is recommended.
Drexel should review the file formats in use at least every five years and decide whether the material should be migrated to new formats.
CURRENT FILE MAINTENANCE PROCEDURES
The Drexel University libraries, like most others, maintain a variety of computer systems that must be highly available, such as the online catalog
and the circulation system. They perform daily tape backups and recognize
the importance of off-site copies.
Drexel's central Office of Information Resources and Technology provides the campus-wide backup systems and is also available to help with planning for long-term storage of active materials.
Both organizations also use anti-virus software, and can be expected to maintain awareness of current network-based threats to system integrity and implement appropriate countermeasures.
When the web site for the project is operating, the current plan is to include this site in the library's core information services and rely
on their daily procedures to keep it running. The system does not rely
on any unique technology; in the event of the loss of a computer or a disk, many compatible machines are available and the information is preserved through daily backups.
The interface to the system is via the Greenstone Digital Library software. Continued availability of Greenstone is not essential for preservation of the information, since the basic files could be transported to an alternative digital library platform (such as Fedora or DSpace). However, since Greenstone is used in hundreds of projects around the world, and is available as open source, its continued availability seems assured. Should the existing support group in New Zealand disappear and changes in computing platforms require changes to Greenstone, groups such as Cygnus would almost certainly be prepared to support the project into the future.
Although Philadelphia is not in an earthquake zone, the possibility of a massive disaster cannot be ruled out. The library should maintain a policy of keeping some degree of backup offsite.
So long as the site is in operation at Drexel, there should be a yearly review of the regular backup procedure to be sure that it is operating smoothly. At least once a year the project management should pretend to lose a file and verify that the retrieval process works. This yearly review should also consider the state of the project, which is likely to become inactive at some point but leave a persistent website that is used by scholars. At some time in the future, for example, it may be worth considering whether additional material should be digitized, or whether the project should be combined with other, related projects that may have appeared at Drexel, Penn or other institutions.
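The yearly "lost file" drill can be made objective by comparing checksums rather than inspecting the retrieved file by eye. The following is a minimal sketch under stated assumptions: `restore_from_backup` is a hypothetical stand-in for whatever retrieval procedure the computer center actually uses, and the file name is invented for the demonstration.

```python
# Sketch of a yearly restore drill: record a checksum, retrieve the file
# from backup, and confirm the retrieved copy is identical.
# `restore_from_backup` is a hypothetical stand-in for the real retrieval step.
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def restore_drill(live_file, restore_from_backup):
    """Return True if the restored copy matches the live file byte for byte."""
    expected = sha256_of(live_file)
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / Path(live_file).name
        restore_from_backup(Path(live_file).name, restored)
        return restored.exists() and sha256_of(restored) == expected

# Self-contained demonstration using a temporary directory as the "backup".
with tempfile.TemporaryDirectory() as work:
    live = Path(work) / "page0001.tif"
    live.write_bytes(b"pretend image data")
    backup_dir = Path(work) / "backup"
    backup_dir.mkdir()
    shutil.copy2(live, backup_dir / live.name)

    def restore_from_backup(name, destination):
        shutil.copy2(backup_dir / name, destination)

    drill_passed = restore_drill(live, restore_from_backup)
    print("restore drill passed:", drill_passed)
```

In practice the file would be retrieved from tape into a scratch area, never over the live copy, so that a failed drill cannot itself damage the collection.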
ALTERNATIVE INSTITUTIONAL HOSTS
Perhaps the greatest long-term risk to the information would be an
inability to continue hosting the project at Drexel. Aside from
the potential loss of funding for this particular project, the recent history of the College of Medicine is not one of stability. In both 1993 and 1998 the medical school came under new ownership, and in 2002 it was merged into Drexel under a 20-year agreement with a local for-profit hospital chain.
The long-term preservation plan thus also involves transfer of a copy of the material digitized to alternate host institutions, which could at a minimum preserve the material against disaster or abandonment in Philadelphia.
The amount of material involved is less than a terabyte, and would fit
on three disk drives today. For simplicity, we agreed that standard
USB or Firewire external disks would be easiest. Neither a pile of
over 1,000 CDs nor one of 100 DVDs seemed practical. Either would involve too much manual work to be sure that no individual item had been lost, and there is not yet agreement on the most popular format for writeable DVDs (DVD-R, DVD+R, DVD-RW and DVD-RAM are still competing).
By the time the project is done, it might well fit on one disk drive.
Transfer to another institution, however, whether on one disk or three, is technically straightforward.
More important is obtaining the cooperation of a suitable institution.
The institutions involved should be interested in the history of women or the history of medicine, have a long term commitment to preservation, and be institutionally stable.
Two organizations suggested at the meeting were the National Library
of Medicine and the Schlesinger Library. Both seem admirably suited
to the task. Other possibilities would be libraries such as the Sophia Smith Collection at Smith College or the Mabel Smith Douglass Library at Rutgers. Ideally, Drexel will be able to arrange a swap with these libraries to watch some of their bits in exchange. Two alternate hosts would seem sufficient, although there is no objection to more.
This could be done more systematically by joining the LOCKSS project but that is realistically a decision to be made at the Drexel level, not the project level. I would encourage Drexel to join, but would understand if the library administration felt that they did not have enough digital activity to warrant it.
Commercial or semi-commercial services are also possible, such as the OCLC Digital Archive or the San Diego Supercomputer Center, but these would probably charge more than could be considered for the long term.
It will be necessary to consider the details of the administrative arrangements. Just in case the collection becomes highly used, Drexel will have to decide whether it would prefer to see the alternate hosts actually place the material online to help share the access load, or would prefer not to see it used except in case of disaster, in order to preserve Drexel's visibility as the source of the original materials.
The agreements would, of course, have to ensure that if Drexel no longer continued to provide simple access the alternative hosts would be free to do so, and to provide that Drexel could get a copy of its material upon future request.
Most important will be to ensure that whatever copyright licenses are obtained permit the transfer of copies to other institutions. It would be administratively unworkable to have to excise a few documents in the process of duplicating the collection.
Drexel should, two years after the initial delivery of a copy of the files, and every five years thereafter, request that the sharing institutions make a copy of the information and return it to Drexel. Drexel should expect to pay a charge for this copying, and should verify that it has been done
correctly. Drexel should also, at least every five years, consider the
list of institutions holding copies of the material and decide whether any new institutions should be approached to hold additional copies. Drexel should also do this in the event that one of the copy-holding institutions approaches Drexel in the future to say that they have changed their mind and would rather not continue to hold a copy.
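Verifying that a returned copy is complete and unaltered can be largely automated. The following is a minimal sketch, assuming Drexel keeps a checksum manifest of what it originally sent; the directory layout, file names and function names are hypothetical.

```python
# Sketch: compare a checksum manifest of the files Drexel sent with a
# manifest of the copy returned by a partner institution.
# The directory layout and file names are hypothetical.
import hashlib
import tempfile
from pathlib import Path

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; very large images would be
            # better hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest

def compare_manifests(sent, returned):
    """Return (missing, altered, unexpected) lists of relative paths."""
    missing = sorted(set(sent) - set(returned))
    unexpected = sorted(set(returned) - set(sent))
    altered = sorted(p for p in set(sent) & set(returned)
                     if sent[p] != returned[p])
    return missing, altered, unexpected

# Demonstration with two small temporary directory trees.
with tempfile.TemporaryDirectory() as work:
    original = Path(work) / "original"
    copy = Path(work) / "returned"
    for folder in (original, copy):
        (folder / "images").mkdir(parents=True)
    (original / "images" / "p1.tif").write_bytes(b"page one")
    (original / "images" / "p2.tif").write_bytes(b"page two")
    (original / "metadata.xml").write_bytes(b"<record/>")
    (copy / "images" / "p1.tif").write_bytes(b"page one")
    (copy / "images" / "p2.tif").write_bytes(b"page tw0")  # simulated damage
    # metadata.xml is deliberately absent from the returned copy.
    report = compare_manifests(build_manifest(original), build_manifest(copy))
    print(report)
```

An empty report in all three lists would indicate that the returned copy matches what was sent; anything else identifies the specific files to re-send or investigate.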
If this material were more popular, it would be worth considering whether or not to reserve the right to transfer the file to a commercial online service and delete the free public service, but I do not think the content of this file is such that Drexel is likely to gain by trying this. The administrative process will be simpler if everyone agrees that the file is not suitable for commercial exploitation.
I believe that the project is taking appropriate steps for both long and short term preservation. The most important steps are to use standard formats and to keep duplicate copies. Both of these are being done. What remains to be done is to find alternate hosts for off-site copies and develop a minimal administrative structure that will persist into the future, at least long enough to do a yearly review and determine whether the material is still being cared for properly.