worldcat – In the Library with the Lead Pipe

Why isn’t a picture worth a thousand words?

Kristine Alpi — Wed, 16 Sep 2009 11:30:39 +0000

In the Library with the Lead Pipe is pleased to welcome another guest author, Kristine Alpi! Kris is the Director of the William Rand Kenan, Jr. Library of Veterinary Medicine at North Carolina State University Libraries.

Why do document delivery technologies limit information transfer?

Modified from the original — permission for the use of this derivative work has been requested from the publisher of Histology and Histopathology.

By Kristine Alpi

The technologies that libraries use for interlibrary loan and document delivery frequently reduce the value of the information available to be delivered. In the past, color was used sparingly by publishers concerned with printing costs, and readers could assume that most images were not available in color unless dealing with visual arts publications. Although entire books have been written about the value of color as communication, color has always been a special request for interlibrary loan copies. Now, color is much more common: in situations where color is crucial and in cases, such as graphs, where well-presented shades of gray could convey the message. In 2001, the Journal of Histochemistry and Cytochemistry began offering one full page of color figures per article at no cost to authors since the majority of their content required color images [1]. Scholarly disciplines that need color to convey meaning are not having their needs met by interlibrary loan/document delivery (ILL/DD). Growth in the frequency and quality of image reproduction in pathology, molecular biology, microsurgery, and other highly visual aspects of science has changed the amount of content for which color is absolutely essential to shared understanding. The 275,000+ papers on the subject of gene expression covered by PubMed provide just one example.

Standards?

Neither color nor image quality is mentioned in the American Library Association Interlibrary Loan Code for the United States (Revised 2008, http://www.ala.org/ala/mgrps/divs/rusa/resources/guidelines/interlibrary.cfm) or in the sample ALA Interlibrary Loan Request Forms. Most standard library forms and processes assume that a readable black and white (B&W) scan is sufficient to meet user needs. Library staff in academic, public, and special libraries, large and small, have suggested to me that the images don’t matter because users just skip over the pictures or data in favor of the text. That doesn’t fit the browsing patterns of the many users who go straight for the data tables or images. I would argue that readers might undervalue images in their interlibrary loan articles because the image quality has typically been too poor to convey the message of the original publication. Warner (2004) compared the quality of print original journals, custom supply photocopies from the Canada Institute for Scientific and Technical Information (CISTI), and the online and printed quality of Ariel-transmitted files, and found the Ariel copies lacking [2]. Ariel has gone through several upgrades since his 2003 exercise, but it is not clear how many libraries have upgraded their Ariel software, or whether the upgrades to a technology centered on TIFF file transmission have taken advantage of global improvements in non-library imaging devices and software. That the corporate website (http://corporate.infotrieve.com/ariel) favorably compares Ariel’s transmission to fax quality suggests a need to aim higher.

Why aren’t we pushing the envelope to provide a more accurate and usable facsimile of the original article?

If pushing our ILL/DD partners to scan in color or grayscale isn’t feasible, purchasing the original article is a viable option from some publishers. Image and data technologies have made tremendous advances, but if you ask document delivery staff why color is not more widely supplied, the answer will almost always come back to the technology as a limitation. File size challenges, difficulty with email attachments and file transfer software, old versions of scanning software, or the scanners themselves are cited as the barriers. Lack of color printing in the borrowing library was often a concern back when all articles were printed and mailed or faxed. Now, the borrowing library does not need to offer color printing of the final document received in order for the acquisition of a document in color to be useful. If the item is to be delivered electronically, the user can view it in color or may have affordable access to color printing at home or elsewhere. Also, a black and white printout of a color scan will have more contrast and distinction than printing a B&W scanned document.

Even when the color technologies are available, our ILL/DD requesting systems do not facilitate color requests. The requesting library staff may not have time to consider whether the material carries content in color based on the citation, but requesters probably have some idea after reading the abstract. Users could use the Notes field to make this request, as many have, but asking the question about color up front could save the time of the user and library staff and allow the color request to be made in an automated fashion. It would be better to ask for and use this information on the initial request than to acquire a B&W copy and then hear from the requester that what was received is unsatisfactory. One of our anatomic pathology trainees is learning the hard way to request color or grayscale after having to wait on replacement color copies for several poor-quality B&W documents received via Ariel.

Automating the ordering of color increases its usage.

The National Library of Medicine’s DOCLINE interlibrary loan request system added color copy requesting in December 2003 in response to user demand for biomedical literature whose images need to be seen in color for the reader to fully understand the message. Color requests have grown as a share of overall DOCLINE requests, from 0.02% in FY2004 to 0.14% of the 1.5+ million total requests in FY2009. While 2,217 color requests may seem paltry, this figure reflects only requests for which library staff indicate a color request using the system select box, not those that use the Comments field. Because so few lending libraries indicate that they provide color copies, some borrowers will not select the color request checkbox, but will add a comment to the lender indicating they prefer color if available and at no extra charge. In these cases, getting the article content is more important than getting that article in color.

DOCLINE is primarily a tool of biomedical libraries. What about academic basic scientists and clinicians using public libraries who rely on OCLC Resource Sharing? Do these users realize that color is a choice either when ordering direct via WorldCat or using library forms? How are we limiting the range of possibilities and why? Is it accidental or intentional? Right now, a borrowing library asking for a color copy in OCLC must entertain several possible steps of additional effort—you can pre-identify lenders that provide color and route requests to them or you can make it a note for the lending library staff to receive and respond—where the resulting conditionals can add time to the request. Some libraries warn users that color copies can take longer:

Color copies are available through MINITEX for articles with color charts and graphs. If you need a color copy please make a note of that in the “Comment” field when sending your request. Color copy can take up to two days longer to obtain. http://www.morris.umn.edu/library/ill.php

Asking for color shouldn’t have to slow down the process, but it does when the request forms and shared systems don’t match the right user need with sufficiently detailed information about the lending libraries. Warning users creates more realistic expectations, but it can also dissuade users from requesting color if they need the article in a timely fashion. In a system like DOCLINE, where color capacity and requesting are automated, the turnaround times for color are frequently the same as for B&W. Users may also be hesitant if they aren’t sure whether an article is actually in color, especially if there are color-associated charges. If the lending library is not able to fill the request in color, should it share information about which pages are in color with the requesting library as a conditional response, so that the library or user can make a fully informed request?

What about Document Delivery?

How do libraries providing document delivery handle images for their own clients? CISTI offers custom supply service to meet the needs of researchers who require high-quality color or grayscale images. In Warner’s report, these documents were supplied as high-quality photocopies; I can find no information about this service on the CISTI website. The British Library Articles Direct request form does not ask about color, so the requester will need to complete either the “Additional details” or the “Specify special requirements” field. A naive user might assume that color articles come in color and that articles with images will be scanned with the best available photo imaging technology, and never realize whether the original article was in color or not. The Linda Hall Library addresses this issue in their Email Delivery Frequently Asked Questions:

Although the typical file size delivered will be less than 2MB, grayscale and color images will create files of a far greater size. Linda Hall Library will not scan in color for electronic delivery unless specifically requested to do so. Please do not request color scanning for electronic delivery unless your email is able to accept files of at least 10MB.
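The jump in file size that this FAQ warns about follows from bit depth. As a rough illustration only, the sketch below estimates the uncompressed size of one scanned page, assuming a letter-size page at 300 dpi and the common depths of 1 bit per pixel for B&W, 8 for grayscale, and 24 for color. The page size, resolution, and function name are assumptions for the example, and real delivered files are compressed, so actual sizes will be much smaller.

```python
# Assumed bit depths: 1-bit bitonal, 8-bit grayscale, 24-bit RGB color.
BITS_PER_PIXEL = {"bw": 1, "grayscale": 8, "color": 24}


def uncompressed_page_bytes(width_in, height_in, dpi, mode):
    """Estimate the raw (uncompressed) size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


if __name__ == "__main__":
    # Hypothetical letter-size page (8.5 x 11 inches) scanned at 300 dpi.
    for mode in ("bw", "grayscale", "color"):
        size_mb = uncompressed_page_bytes(8.5, 11, 300, mode) / 1_000_000
        print(f"{mode:>9}: {size_mb:.1f} MB before compression")
```

Before compression, the same page is 8 times larger in grayscale and 24 times larger in color than in B&W, which is why a lender might ask borrowers to confirm their attachment limit before requesting color.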

Asking for color isn’t all rosy.

The fill rate for color requests is lower. Per the institution records in DOCLINE, only 243 libraries report providing color copies, with 32 of those libraries charging extra for those color copies. For example, the National Library of Medicine charges $2.00 more per item in color and the Linda Hall Library charges an additional $1.00 per page for color copies. What would our users say about the value of color or grayscale images if we asked? Would they pay differential rates? Why should they? Why do libraries charge more for color when it is now mostly scanning? It could be that they only have one color price option in the software and still need to deliver paper copies. It is true that a paper copy in color costs more in toner, though that difference in cost is decreasing. But what is the extra cost in the case of scanning? Is it a matter of staff time, since it takes a few seconds longer with many scanners to acquire a page of images in color or grayscale? It may also reflect an attempt to spread out the cost of more expensive color scanning equipment. While low-volume flatbed scanners are inexpensive and offer B&W, color, and grayscale, there are significant price differences between color and B&W versions of the large overhead scanners used for tightly bound and duplex page scanning. Are libraries that pay for ILL/DD trying to avoid the extra cost for color? More likely it is just that they haven’t revisited these options as their technology and workload have changed.

Providing color can create the blues as well.

At the William Rand Kenan, Jr. Library of Veterinary Medicine, we want to provide the most informative materials possible. We often scan color plates in color or detailed images in grayscale, but we run into all kinds of problems in delivering these large files to other libraries and directly to our users. Our processing choices result in very different file sizes and image quality, though the readability of the text remains about the same. Below is a table showing the five possibilities available in the Veterinary Medicine Library’s operation. Our example was a selection of three pages (613-5) from the paper “The Notch pathway: hair graying and pigment cell homeostasis” in the journal Histology and Histopathology [3]. We accessed the article online as the original PDF and scanned the print copy with all the available options using Ariel 4.1.1.99 with our two scanners: a black and white Minolta PS 7000 overhead scanner and a color HP ScanJet 8290. We also looked at printing an online article to a TIFF file using the Microsoft Document Image Writer, which turns the color images to grayscale and pixelates them, degrading image quality. The image quality is still much better than all of the B&W scans, and this is our only option to securely deliver online-only content without printing and rescanning. The opening image in this article shows a side-by-side comparison of an original image in the online PDF article with the output from B&W text scanning.

The results of our scanning experiment with 3 pages of an article with many images.

The size limit for an email attachment at North Carolina State University is 15 MB including the encoding, which increases the file size by about 30%. This is a fairly typical limit, with many organizations being restricted to even smaller attachments. It is clear from the email delivery addresses used by many Interlibrary Loan departments in DOCLINE that they have created free email accounts on external services in order to send and receive materials. The file attachment size limits of 25 megabytes per message for gmail.com and yahoo.com are more generous than university or hospital IT policies. Another strategy that has been espoused on discussion lists is using the free levels of services such as YouSendIt (http://www.yousendit.com/). In order to deliver to non-Ariel libraries and individuals with these email limitations, we have posted their scanned documents online and emailed them the URL for download. In some cases, however, people still have trouble opening, viewing, downloading, and printing the files from their computers, and it is very difficult to help troubleshoot these issues remotely during the very busy workflow of the interlibrary services function. Other ILL departments have reported that they cannot receive, and therefore cannot disseminate, color documents electronically via their version of Ariel software because it is attached to a B&W scanner, which is not something the lending library can tell from the sending end. Odyssey software has been reported to work with black and white, grayscale, color, or any combination of these scanned formats, albeit slowly. Perhaps its widespread dissemination will address some of these file size transmission issues as more libraries have delivery software. It is clear from the ILL/DD community discussion list questions that many more improvements to speed and functionality are needed in all of these products.
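To make the attachment arithmetic concrete, here is a minimal sketch of how a lending library might check whether a scan will fit under a recipient's limit. It uses the roughly 30% encoding overhead cited above as its default (Base64, the usual email attachment encoding, expands data by about one third). The function names are hypothetical and not part of any ILL software.

```python
def max_raw_attachment_mb(limit_mb, overhead=0.30):
    """Largest file size (MB) that still fits under limit_mb once encoded.

    overhead is the fractional growth from email encoding; 0.30 follows the
    "about 30%" figure in the text and is an approximation.
    """
    return limit_mb / (1.0 + overhead)


def fits_in_email(raw_mb, limit_mb, overhead=0.30):
    """Return True if a raw_mb file, after encoding, is within limit_mb."""
    return raw_mb * (1.0 + overhead) <= limit_mb


if __name__ == "__main__":
    # Limits mentioned in the text: 15 MB (NCSU) and 25 MB (free webmail).
    for limit in (15, 25):
        print(f"{limit} MB limit -> about {max_raw_attachment_mb(limit):.1f} MB of scan")
```

Under these assumptions a 15 MB limit leaves room for a scan of roughly 11.5 MB, so a 12 MB color file that looks safe on disk would still be rejected.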

Breaking the Color Barrier

Library procedures and technology really shouldn’t be a barrier to sharing color information. All partners in the borrowing and lending chain have a role in providing the highest quality information. Ideally color scanning of color images at no additional charge would be the default practice. Absent that sea change, borrowing libraries should get users thinking about whether color is needed and explicitly ask them on request forms whether color is preferred. Lending libraries should indicate whether they provide color or grayscale scanning or copying services and any associated charges. Lenders can also look out for materials where the typical scan doesn’t provide sufficient information and use the options in the technology at their disposal to optimize the images. Resource sharing systems should provide an automated way to match the user’s request for color materials with lending libraries’ capacities for filling requests in color. Resource sharing software should provide options to deliver better compressed versions of files that reduce the file size burdens for file transfer. Institutional information technology departments should be more flexible in allowing large file size attachments or providing easy-to-use, secure file transfer services. Lastly, funding agencies can work with libraries to help them obtain faster and more effective scanning technologies and software as prices and functionality improve.
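As one illustration of the automated matching recommended above, the sketch below filters a list of lenders by whether they provide color and by how much surcharge the borrower will accept. The Lender record, its fields, and the sample libraries are all hypothetical. This is not how DOCLINE or OCLC Resource Sharing route requests, only a picture of the kind of rule such systems could apply.

```python
from dataclasses import dataclass


@dataclass
class Lender:
    """Hypothetical lender profile; field names are assumptions."""
    name: str
    provides_color: bool
    color_surcharge: float = 0.0  # extra charge per item, in dollars


def route_request(lenders, wants_color, max_surcharge=0.0):
    """Return lenders able to fill the request, lowest color surcharge first.

    Requests that do not need color can go to any lender, in the given order.
    """
    if not wants_color:
        return list(lenders)
    eligible = [
        lender for lender in lenders
        if lender.provides_color and lender.color_surcharge <= max_surcharge
    ]
    return sorted(eligible, key=lambda lender: lender.color_surcharge)


if __name__ == "__main__":
    # Made-up lenders for illustration only.
    lenders = [
        Lender("Library A", provides_color=False),
        Lender("Library B", provides_color=True, color_surcharge=2.00),
        Lender("Library C", provides_color=True),
    ]
    for lender in route_request(lenders, wants_color=True, max_surcharge=2.00):
        print(lender.name, lender.color_surcharge)
```

Capturing the color preference and the acceptable surcharge on the initial request is what makes this kind of routing possible without a round of notes between library staff.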

Acknowledgements

Thanks to Maria Collins, National Library of Medicine, for providing data about color requesting in DOCLINE and to Beth Westcott of the National Network of Libraries of Medicine, Southeastern/Atlantic Region for discussing this article proposal with me. Discussions with James Harper, Librarian for Interlibrary and Document Delivery Services at North Carolina State University, greatly affected this piece and broadened my point of view. Thanks to Lead Pipe reviewer Derik Badman for his comments and edits and to Kimberly Burke Sweetman at New York University for her review and thoughtful questions. Lastly, the ILL/DD staff at NCSU deserve recognition for the care they give to the images in each item they provide.

References

1. Baskin DG. Free color pages. Journal of Histochemistry and Cytochemistry. 2001; 49:551-2.

2. Warner P. CISTI Source and journal use at Memorial University of Newfoundland. Interlending and Document Supply. 2004;32(4):215-8.

3. Schouwey K, Beermann F. The Notch pathway: hair graying and pigment cell homeostasis. Histology and Histopathology. 2008;23(5):609-19.

A Useful Amplification of Records That Are Unavoidably Needed Anyway

Brett Bonfield — Wed, 19 Nov 2008 11:00:56 +0000

By Brett Bonfield

Depending on books can feel like relying on snail mail. “Now that I’ve showed you how to find some articles,” I say to people at the reference desk, “I’ll show you how to use our website to find some books you might want to check out. And after that, wouldn’t it make your grandmother’s day if you wrote her a letter?”

For anyone accustomed to the Internet, books can lack the immediacy of articles or websites. Books generally have slower developing narratives, and often have longer paragraphs, sentences, and words, which means they don’t lend themselves to skimming. Compared to digital material, relevant passages can be hard to find, and even finding the right book can be challenging.

Although library websites are improving, keyword searching doesn’t work well at most libraries and faceted browsing—the links down the left side of the page on Amazon—is still a rarity. More importantly, with one notable exception, there is a good chance that nothing on the shelf that is “printed on paper and constructed on the model of the codex” includes the exact information you have in mind.

This is where universal catalogs come into play. If there’s nothing on the shelf that meets your needs, the next step is to figure out if such a book exists. There are five websites that provide relatively complete and easily accessible lists of books: Amazon, Google, LibraryThing, WorldCat, and Open Library. In order to make the best use of these websites, it can be useful to learn how each of them started, what keeps them going, and how their business models and practices affect the data they collect and how they go about sharing it.

Amazon

It’s tempting to think of Amazon as a technology company. That’s how Werner Vogels sees it, which is understandable: he’s their Chief Technology Officer, and he seems to have done a very good job of it, because Amazon’s technological initiatives have taken a leap forward since Amazon hired him away from Cornell in 2004. Over the last couple of years, Amazon has made its mark as a service supplier, rewriting the rules for online hosting with its Amazon Web Services; it has developed a successful consumer electronics product (the demand for its Kindle e-book reader consistently exceeds supply, and it seems to be extraordinarily popular with publishers as well: they have made almost 200,000 titles available); and it has also made use of its infrastructure with offerings as diverse as its Mechanical Turk and Fulfillment services.

But if you look at its revenue stream, it’s pretty clear that Amazon has very little in common with a traditional technology company, such as Microsoft, its Seattle-area neighbor. Instead, Amazon is probably most like a different neighbor: Costco.

Amazon’s founder, Jeffrey Bezos, seems to have a firm grasp of three important aspects of retailing:

Look for items that can be sold in near limitless quantities (such as “books, music and videos”);
Figure out how to sell them profitably but with minimal markup (“He said he would ‘relentlessly slash prices,’ even if it cut into incremental profits, because he was convinced that it was the right thing to do”); and
Focus your energy on building customer loyalty (“Satisfaction surveys show that Amazon enjoys a golden reputation among most of its 49 million active customers”).

Similarly, Costco’s founders, James Sinegal and Jeffrey Brotman, stock their retail outlets to the rafters, refuse to mark up items more than 15%, and, in their most recent report to shareholders, they note, “This past year we also enjoyed the highest membership renewal rate in our history at 87%, attesting, we believe, to the high level of satisfaction our members have in our products and services.” Think about the things you typically shop for at Amazon: are they more like what you buy from Microsoft or are they more like what you buy from Costco?

Because of Amazon’s size, breadth, and ubiquity, it can be easy to forget that its original business model was pretty basic: it resold books it bought from Ingram and Baker & Taylor. As Tim O’Reilly points out in an apologia on Web 2.0, Amazon purchased a database of book information from R.R. Bowker, put it on the still new World Wide Web, and encouraged its customers to share reviews and bibliographies, and even to correct any mistakes or omissions in its data. Two years later, when Amazon went public, it carried more than 2.5 million titles, “including most of the estimated 1.5 million English-language books believed to be in print, more than one million out-of-print titles believed likely to be in circulation and a smaller number of CDs, videotapes and audiotapes.” Out-of-print titles were generally available within two to six months.

Amazon’s original formula hasn’t changed all that drastically. In 2007, books and other media accounted for 62% of its net sales, down from 66% in 2006 and 70% in 2005. The trend may be downward, but media sales are actually improving—it’s just that other sales are improving even faster.

Despite investments in other areas, Amazon knows that it is still primarily a retailer of books and other media, and it continues to invest in complementary initiatives and businesses that fortify its ability to sell these items. Its recent acquisitions, including Audible, Shelfari, and AbeBooks (which brings with it a 40% stake in LibraryThing), join other Amazon businesses, including the Internet Movie Database (IMDB), Alexa, and BookSurge. It also developed its own search subsidiary, A9, it was an important participant in creating ONIX, “the international standard for representing and communicating book industry product information in electronic form,” and it published a hugely successful API (now a part of its Associates program) through which it makes book jackets and summaries available to affiliates (including libraries), and also shares a percentage of sales, inspiring creative programmers to develop websites like BigBookSearch and Zoomii.

Amazon does all this so it can sell more goods and, in general, it seems to be working. Consumers are getting deeper discounts on a broader range of books and other media than ever before, and they have an easy time finding the items they want thanks to Amazon’s faceted browsing interface, its active user community, and its search engine which, in many cases, makes it easy to search within the text of published items.

While Amazon does everything it can to provide you with as much information as possible about the items it has in stock, there’s no motivation for it to share information about items it can’t sell in volume, such as out-of-print material. If the information you’re seeking is likely to be included in new, commercially available books, then Amazon is an excellent resource. If not, you’re best served looking elsewhere.

Google

Amazon is one of two major corporate alternatives to libraries; Google is the other.

Amazon followed one of the two traditional paths for forming a giant corporation: it was founded by an entrepreneur who had a good idea for a company and then hired talented people to build its technological infrastructure. Google followed the other path: its founders, Larry Page and Sergey Brin, created something the world wanted and then hired people to turn their idea into a profitable corporation.

While still graduate students at Stanford, Page and Brin took Eugene Garfield’s work on citation indexing and adapted it for the World Wide Web. Garfield, who marketed information products through his company, the Institute for Scientific Information (now a Thomson Reuters subsidiary), records how often scholarly papers are cited by subsequent scholarly papers, which is useful because citation frequency is a reasonable proxy for importance. Similarly, Google’s PageRank algorithm is primarily a scheme for measuring and weighting links between Web pages: the more links to a page or website, the more likely it is to be important, especially if those links come from other important sites. PageRank is intended to determine which Web pages are likely to be perceived by Google’s users as relevant.
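The link-weighting idea can be sketched in a few lines. What follows is the simplified textbook form of PageRank computed by repeated iteration, not Google's production system. The damping factor of 0.85, the even handling of pages with no outgoing links, and the toy link graph are assumptions made for the illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to. Returns a dict of
    page -> score, with the scores summing to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy web: pages a, b, and d all link to c; c links back to a.
    toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    scores = pagerank(toy_web)
    for page in sorted(scores, key=scores.get, reverse=True):
        print(page, round(scores[page], 3))
```

In this toy graph, page c collects the most links and ends up with the highest score, and page a ranks second because its single inbound link comes from the important page c, which mirrors the "links from other important sites" point above.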

It was soon apparent that Google worked—users found what they were looking for—but no one saw any money in it. Page and Brin tried to sell their technology for $1 million to the big players in the Web market. After everyone turned them down, they decided to start their own company, focusing their attention on attracting as many users as possible.

Where Amazon is a retailer that can be thought of as a virtual Costco, Google is an entertainment company like News Corp or Viacom: it generates 99% of its revenue from advertisements. Just as Amazon is primarily a reseller of products others make, Google is primarily a portal into content others create. Its mission is to “organize the world’s information and make it universally accessible and useful.” Note the absence of the word “Web” in that mission statement: Google’s goal is to organize every bit of information. For instance, Google created its free telephone directory assistance project, GOOG-411, in order to develop speech recognition software. In turning spoken words into text, Google opens up the possibility of searching audio and video files through the same Google search box that is currently used to search websites.

Though the Web has become many people’s primary information source, a great deal of the world’s information is still found in books. In order to harvest that data, in December 2004, Google announced that five libraries (the University of Michigan, Harvard, Stanford, Oxford, and the New York Public Library) had agreed to let Google begin scanning their collections; several more have since joined the project. Multiple elements of this arrangement remained secret, including the terms of these agreements and the rate at which books were being scanned. It was also unclear how Google would deal with potential copyright issues, especially after the Association of American Publishers and the Authors Guild almost immediately filed a joint lawsuit.

This copyright lawsuit mirrors another: Viacom’s suit against Google acquisition YouTube for copyright infringement. There was some speculation that Google bought YouTube specifically to make sure YouTube didn’t lose its lawsuit, establishing a precedent that Google would have to overcome if it were ever sued for hosting video files. When Google reached a settlement in its book scanning lawsuit this past October, Viacom saw a potential concession in its own suit.

The book-scanning settlement has raised concerns about preservation and access for Google-scanned materials. Harvard has expressed its reservations publicly, and Peter Brantley has been doing an extraordinarily good job of identifying and summarizing the issues involved. How all this will affect people who want to read books online has yet to be determined.

What does seem settled, at least for now, is that Google has archived an unparalleled number of books (and also scholarly articles) whose entire text could be as easy to search as the Web. With the success of GOOG-411, it seems likely that Google will soon be able to offer text-based searching within audio and video files as well.

What’s not clear is whether advertising will make these ventures profitable or if Google can successfully transition to alternative business models for subsets of its data. Right now, it resells access to scholarly articles and newspaper stories for several publishers, and it appears that it will soon be selling access to the books it has digitally archived. It’s also not clear if Google sees any point in developing an active user community around books. While Google allows users to add reviews at its book website, user-contributed content is not a focus in the same way it is at Amazon or at LibraryThing.

LibraryThing

Founder Tim Spalding’s LibraryThing is a new kind of Internet-enabled organization, the small company that operates on a large scale. This method for doing business has been best documented by programmer, essayist, and venture capitalist Paul Graham, one of Spalding’s inspirations, though LibraryThing probably resembles Craigslist more than it resembles any of the Y Combinator companies Graham has helped to shepherd into existence.

Like Craigslist, LibraryThing has an evangelical faith in its users, maintains a simple and easy to understand interface, is satisfied with steady and modest profitability, and competes for attention in a field with significantly larger entities (Craigslist is often cited as a cause of the newspaper industry’s financial difficulties, even though it employs fewer than 30 people).

LibraryThing gets its data from Amazon, from libraries that make their catalogs available through the Z39.50 protocol, and from its users, who supplement the data by providing reviews, cataloging information, adding tags, and disambiguating records. These last two seem to be particularly successful even though they vary from standard library practice.

The tagging concept, popularized by Joshua Schachter’s group bookmarking website, del.icio.us, allows users to catalog items using whatever keywords they wish. This enables works like Bridget Jones’s Diary to be tagged “chicklit” or Neuromancer to be tagged “cyberpunk,” subject terms that differ greatly from Library of Congress designations for these works by Fielding and Gibson.

Disambiguation allows users to clarify records by taking actions such as combining entries for works that are identical but released under different titles, or aggregating works under a single author heading even though that person has released work under multiple names. These tasks are difficult for a small group of staff members to take on manually, and it has proved tricky to teach computers to disambiguate records programmatically. For instance, author Cyril Northcote Parkinson’s name is subject to multiple permutations (C.N., Cyril N., C. Northcote, etc.), and his most famous work, Parkinson’s Law (which expands on his belief that “work expands so as to fill the time available for its completion”), has been released with multiple title variations and in numerous editions. Amazon struggles to make it clear which edition of Parkinson’s Law a potential customer might wish to purchase, and Google offers a few different options that are not readily distinguishable from one another. LibraryThing, while representing more options than either of the other two, also makes it clear which title its users believe should be considered definitive.
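A small sketch shows why programmatic disambiguation is tricky. The deliberately naive rule below groups author names by surname and first initial. It is a made-up heuristic for illustration, not the method used by LibraryThing or any library system, and the second example shows where it goes wrong.

```python
import re
from collections import defaultdict


def author_key(name):
    """Naive matching key: (lowercased last word, first initial)."""
    tokens = re.findall(r"[A-Za-z]+", name)
    return tokens[-1].lower(), tokens[0][0].lower()


def group_by_author(names):
    """Group name strings that share the same naive key."""
    groups = defaultdict(list)
    for name in names:
        groups[author_key(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    variants = [
        "C.N. Parkinson",
        "Cyril N. Parkinson",
        "C. Northcote Parkinson",
        "Cyril Northcote Parkinson",
    ]
    # All four permutations from the text collapse into one group.
    print(group_by_author(variants))
    # But a different (hypothetical) author collapses into that group too.
    print(author_key("Charles Parkinson") == author_key("C.N. Parkinson"))
```

The rule merges the four Parkinson permutations correctly, but it would also merge any other Parkinson whose first name starts with C. Telling those cases apart takes knowledge of the works themselves, which is the kind of judgment a large community of readers can supply.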

It’s worth noting that Amazon, Google, and LibraryThing are operating on comparable scales when it comes to the number of books they’re cataloging. LibraryThing, which launched on August 29, 2005, has catalog entries for over 32 million books. While open cataloging has its limitations, LibraryThing’s website regularly demonstrates the power of crowdsourcing big tasks to a large, devoted community.

That community is the key to LibraryThing’s success. Just as del.icio.us users socialize around shared bookmarks and tags, LibraryThing users socialize around the books in their collections. Users can add 200 books for free, but to add more they have to pay either $10 per year or $25 for a lifetime membership.

That’s one way LibraryThing makes money. The other is LibraryThing for Libraries, a service that allows libraries to integrate LibraryThing’s tag database and, as of September 2008, its user reviews, into participating libraries’ websites. This service is offered on a sliding scale, with the smallest libraries paying $1,000 per year.

While Amazon’s business model does not target libraries in any discernible way (either as customers or competitors), and Google appears to be interested only in the largest libraries as partners, LibraryThing seems to be actively interested in selling its services to pretty much every kind of library—dozens have already signed up for LibraryThing for Libraries—and in digesting Z39.50 feeds (or getting records in other formats) from any library willing to share. In a pinch, it appears that LibraryThing will even take care of your cataloging.

WorldCat

OCLC is a nonprofit consortium that includes almost 70,000 libraries as members. It was founded in 1967 as the Ohio College Library Center. In 1977, it began allowing libraries outside Ohio to become members, and in 1981 it changed its name to the Online Computer Library Center. It has made multiple acquisitions as it has grown, including the Dewey Decimal Classification System and its only competitor, the Research Libraries Group, which operated from 1974 until 2006. This sort of activity, and OCLC’s business model, led to an investigation of its nonprofit status, which was ultimately upheld. Understandably, OCLC uses its tax status to its advantage, just as some nonprofit hospitals take advantage of their status and IKEA makes use of its unusual structure.

OCLC’s most widely visible product is an amazingly good website, WorldCat.org, which provides free access to over 110 million library catalog records. Most of these records are for books, but because member libraries provide access to their entire collections, the records also cover articles, audio, and video. Right now, WorldCat.org is the best free website that lets visitors use keywords to conduct serious research across all media types, a feature which all on its own would make it valuable. On top of that, OCLC has integrated its work on FRBR and xISBN (projects that make it easier to find what you’re looking for), helping to turn WorldCat.org into an invaluable resource.

One of the two major problems with WorldCat.org is what it doesn’t include: the long tail of library records. With 70,000 libraries contributing records, it’s tempting to assume that just about every book is included in the WorldCat.org database, but that’s probably far from true. OCLC’s Karen Calhoun has written about its efforts to position its pricing and services so smaller libraries can participate, and OCLC is making inroads, but it still serves far fewer than half of the smaller libraries in the United States. This won’t affect most of the popular material—big libraries have just about every major work held by a smaller library, so the small libraries’ records are redundant in these instances—but it does mean that more obscure works collected by smaller libraries, representing local authors and regional historical resources, may not be included.

This sort of limitation affects everyone from amateur genealogists to academic researchers. For instance, I have a friend who is writing her doctoral thesis on the history of illness in the counties surrounding Philadelphia. Almost none of the libraries, archives, and historical societies she is relying on have shared their catalogs with OCLC. This means she must make use of each of these collections individually, usually in person, and spend time learning how each collection is organized. This is the research equivalent of using a manual typewriter instead of a MacBook Pro to type her dissertation, and represents a failure to make the best possible use of available technology. These collections’ records should be included in WorldCat.org.

This kind of wasted opportunity to assist researchers is one major disadvantage of WorldCat.org’s omission of smaller libraries’ holdings. The other major problem arises when researchers try to make use of one of WorldCat.org’s signature features. When users search for an item in WorldCat.org, they can select a tab labeled “Libraries,” which takes them to a list of local libraries that have that item in their collection. However, only libraries that share their records with OCLC are listed. For example, search for Daemon: a novel by Leinad Zeraus and select the Libraries tab. WorldCat.org displays ten libraries where you can find this book, ordered from nearest to farthest. It would be natural for WorldCat.org visitors to infer that these are the ten closest libraries that have this book. Unfortunately, that’s probably not the case. Instead, WorldCat.org is displaying the ten closest libraries that share their records with WorldCat. Users who believe that WorldCat.org is helping them search their nearby libraries may be led to believe that their local libraries don’t have any books at all or, at least, none of the books they’re hoping to find.
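The gap between "closest libraries" and "closest participating libraries" can be sketched in a few lines. Everything here is hypothetical: the library names, coordinates, and the `nearest_holders` function are invented for illustration and say nothing about how WorldCat.org is actually implemented.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Library:
    name: str
    lat: float
    lon: float
    shares_records: bool  # does it contribute records to the union catalog?
    holds_item: bool      # does it actually own the book?


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def nearest_holders(libraries, lat, lon, only_sharing, limit=10):
    """Return up to `limit` libraries holding the item, nearest first."""
    candidates = [
        lib for lib in libraries
        if lib.holds_item and (lib.shares_records or not only_sharing)
    ]
    return sorted(candidates, key=lambda lib: distance_km(lat, lon, lib.lat, lib.lon))[:limit]


# Hypothetical data: the nearest holder does not share its records.
libraries = [
    Library("Town Library", 39.92, -75.07, shares_records=False, holds_item=True),
    Library("University Library", 39.95, -75.19, shares_records=True, holds_item=True),
    Library("Distant Library", 40.71, -74.00, shares_records=True, holds_item=True),
]
here = (39.92, -75.07)
listed = nearest_holders(libraries, *here, only_sharing=True)
actual = nearest_holders(libraries, *here, only_sharing=False)
print([lib.name for lib in listed])
print([lib.name for lib in actual])
```

In this made-up example the town library owns the book and is closest to the searcher, yet it never appears in the list a visitor would see.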

Of course, it’s possible that some libraries may not want their records included in WorldCat.org. I’m not sure why they would feel that way, aside from the recent hullabaloo over licensing, which appears to be getting increasingly heated. However, the library where I work very much wants its records in WorldCat.org so that our neighbors in town can use it as an alternative way of looking for the books that are available in their local library.

OCLC markets WorldCat and other services through a network of regional service providers. The provider for our area is PALINET, so if we want to get our records into WorldCat, we have to go through PALINET. Unfortunately, between OCLC and PALINET, a sort of “if you have to ask, you can’t afford it” pricing structure seems to have emerged for getting records included in WorldCat.org.

I don’t think this is anyone’s fault. Everyone I’ve met at OCLC and PALINET is smart, dedicated, and helpful. My guess is that it’s more like Kate Sheehan’s post office story in which her attempt to pick up a package left her feeling “broken or inept.” That’s certainly how I felt after spending a month exchanging emails with PALINET. At the end I was so confused that it just didn’t seem worth bothering to get an accurate price to take to my board, because the one thing about which I was relatively certain was that we didn’t have enough money to share our records on the WorldCat.org website.

The folks at OCLC seem to be working hard to remedy this situation. I have faith that they’ll get there. But until they do, there will probably be a lot of libraries that would like to share their records in WorldCat.org and either can’t afford it or can’t figure out if they can. That means researchers are going to have to keep working harder than necessary, WorldCat.org users will keep being misled by its Libraries tab, and frustrated libraries may find themselves looking for more accommodating partners.

Open Library

Along with OCLC’s WorldCat.org, Open Library is one of two major nonprofit initiatives centered on creating a universal book catalog. Its goal is to provide a page for every book ever published and to enable those pages to be updated by users, just as LibraryThing or Wikipedia pages are edited by site visitors. Since its founding in July 2007, it has added over 30 million records to its book database.

For now, Open Library may be best known for its founder, Brewster Kahle, and its technical lead, Aaron Swartz. Both are Internet celebrities and serial entrepreneurs, though both specialize in nonprofit startups. Kahle has sold companies to AOL and Amazon, but he is best known for his work on the Internet Archive, home of the Wayback Machine, which attempts to archive the entire Web. Swartz was a founder of Reddit, which was sold to Condé Nast, and a developer of RSS, which enables websites, most notably blogs, to deliver content directly to readers. Open Library is currently funded by the Internet Archive and the California State Library and is committed to remaining entirely free, right down to the code that runs the site, which it makes available through an open source license.

Unlike our experience with OCLC, sharing our records in Open Library was dead simple: I emailed Aaron Swartz and he replied that receiving our records “was cause for much rejoicing.” (I also emailed Tim Spalding at LibraryThing to see if he might be interested in our records, and I found out he was as well.) Open Library is actively soliciting these contributions from libraries. However, it could, potentially, get these records directly from library websites. The technology involved is pretty simple and fairly well understood.

For example, the library where I work recently introduced a new website that’s powered by Casey Bisson’s fantastic Scriblio project. To import the Collingswood Library’s old records into our new website, we had Scriblio visit the web page for each record in the old catalog and import its data into the Scriblio database, turning blah into beautiful. We also use scrib_availability to show website visitors if the book is on the shelf.
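The general idea of reading a record off a catalog web page can be sketched as follows. This is not Scriblio's code (Scriblio is a WordPress plugin, and its importers work differently in detail); the page markup, the field names, and the `parse_record` helper are all assumptions invented for this example, which uses only Python's standard library.

```python
from html.parser import HTMLParser


class RecordParser(HTMLParser):
    """Collect the text of elements whose class names a field we want."""

    FIELDS = {"title", "author", "status"}  # hypothetical class names

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._current = css_class

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def parse_record(html):
    """Turn one catalog record page into a dictionary of fields."""
    parser = RecordParser()
    parser.feed(html)
    parser.close()
    return parser.record


# A made-up record page standing in for an old web catalog.
SAMPLE_PAGE = """
<html><body>
<h1 class="title">Neuromancer</h1>
<p>by <span class="author">Gibson, William</span></p>
<p>Availability: <span class="status">On shelf</span></p>
</body></html>
"""
record = parse_record(SAMPLE_PAGE)
print(record)
```

A real import would fetch each record's page over the network and cope with whatever markup the old catalog happens to produce, which is usually far messier than this sample.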

Open Library clearly has the technical knowledge to do something like this and, because just about every library has a web-based catalog, it could easily include every book from pretty much every library in its database, enabling site visitors to learn if their local library has the book they want. For now, Open Library’s book pages, LibraryThing’s book records, and Google’s About this book pages link to WorldCat.org. (Edit: I originally wrote that Google’s About this book pages did not link to WorldCat.org. In the future, I’ll try to remember to disable my Firefox extensions before making such claims.)

The issue isn’t technical; it’s legal and ethical. On behalf of the library where I work, I uploaded our records to archive.org, making it possible for Open Library to use them, and on behalf of my library I uploaded them into our Scriblio-based website. It seems unlikely that libraries will have their records aggregated without their permission, at least in the near future. However, it wouldn’t be surprising if Kahle or Swartz, instead of asking for our records, began asking for our permission: what if they came to us and asked if they could automatically index our catalogs, creating for free a service that costs libraries thousands of dollars through OCLC? Even non-OCLC libraries are used to sharing their records. Why wouldn’t they accept Open Library’s offer to create a universal catalog? For most libraries, there’s no downside, but there’s an enormous upside: a single website where the world could see their records, and a free hub they could use for sharing records with each other.

A Useful Amplification

In his 1992 Redesigning Library Services: A Manifesto, Michael Buckland writes that, “(f)rom an operational perspective the library catalog can be seen as a useful amplification of records that are unavoidably needed anyway. The information in a catalog can be useful in a variety of ways to library staff and library users. The difference between modern library catalogs and those before the late nineteenth century is essentially that the modern catalogs have a much larger bibliographical superstructure added to the locational information than had previously been the case.” In a nutshell, Buckland is saying, libraries decided that, since they had to keep a list of what they owned, they might as well describe each item and make sure they knew exactly where copies of it could be found. “With materials on paper, having copies stored locally is a necessary (though not a sufficient) condition for convenient access. With electronic materials, local storage may be desirable but is no longer necessary…. The answer is to shift from catalogs to union catalogs or linked catalogs…. Arguably the present day catalog… is more a product of the limitations of nineteenth century library technology than of present day opportunities.”

Between Amazon, Google, LibraryThing, WorldCat, and Open Library, we’re getting ever closer to setting aside nineteenth century models and to more fully taking advantage of present day opportunities. There is no technological reason preventing us from building a universal catalog that contains information on every book in existence and locates that book in every library that has a copy available for use.

We’re also closing in on having a digital scan of every book, making full-text searching possible, as well as concurrent, remote use of scarce resources (by which I mean, I can look at the text of a book on my screen while you’re looking at it on yours, a feature not available in a paper-based book, which is limited to being used in a single location and, generally, by a single user). It’s an exciting time to be a booklover, and it gives one hope that, with better resources available, books will begin to seem as accessible and vital as born-digital resources.

I like the alternatives that Amazon, Google, LibraryThing, WorldCat, and Open Library make available. I think each has made the others better, and I like having alternatives in researching books just as I like having FedEx, UPS, DHL, and the United States Postal Service available when I’m trying to send a package. I don’t think researchers are generally lazy, and I don’t think they want fewer options. What they want are a few really good choices, and they have them. It’s exciting for all of us that these good choices seem intent on becoming great ones.

Thanks to Tim Spalding and Aaron Swartz for reading an early draft of this article, and to my ItLwtLP colleague, Hilary Davis, for helping me with its final version.