DECEMBER 22, 2005
News Analysis

By Burt Helm


Google's Great Works in Progress

Academics point to content errors in the ongoing Book program, but the search giant says fixes are coming


From the start, the Google Book Library Project has met with mixed reviews. The program is designed to scan millions of books and make them searchable online. Some academics and librarians lauded it as a way to make the world's written knowledge accessible to anyone with Internet access. Upon learning that Google (GOOG) would also be scanning copyrighted books, certain authors and publishers (including BusinessWeek's parent, The McGraw-Hill Companies) were none too pleased -- and they've taken their concerns to court (see BW Online, 10/20/05, "Google's Escalating Book Battle").


Google persevered with parts of the project, including a plan to scan books from five of the biggest U.S. universities. Legal issues aside, Google's book project faces a host of technical challenges. And now that the scanning is well under way, it's still not pleasing to some -- including participating institutions.

EARLY HICCUPS.  Librarians and academics say some early scans include a variety of errors, such as blurred words, missing pages, and truncated text in public-domain texts from the library program (see our slide show). "If we were paying for this, if we were driving the [quality specifications], they would be different from what Google is offering -- that's a true fact," says Andrew Herkovic, director of communications and development at Stanford University Library, a participant.

Google is taking steps to correct mistakes. Adam Smith, product manager of Google Books, concedes that the public-domain content contained errors. But, he stresses, the primary goal right now is to put as much content online as possible, and address problems later. "We feel [this approach] creates value for users," says Smith. "We also acknowledge this is a long-term project. We realize it's a process of continued improvement."

Yet the early hiccups underscore the enormity of a task that may make Google's bailiwick -- finding relevant results among billions of Web pages -- look easy by comparison, search-engine experts say. Other attempts by Google to chart new territory -- from its commerce-search tool Froogle to its social-networking site Orkut -- also have failed to live up to what many observers consider an unparalleled ability to serve up search results (see BW Online, 11/30/05, "Google Tops the Charts").

SECRET SCANNING SYSTEM.  In the Google Books (formerly called "Google Print") program, Google gets books from two sources: publishers and libraries. Scanning books from publishers is fairly straightforward: The books are either stripped of their bindings and fed loose-leaf into a scanner, or existing digital files are used. When a user turns up a page from one of these books in a search, viewing access is determined explicitly by the publisher. Generally, users can only look at a few pages.

Scanning library books is a tougher challenge, and that's where glitches are more apt to emerge. Many books are out of print and irreplaceable, so they must be kept intact throughout the scanning process. To do this, Google engineered its own technology.

The company won't discuss the method, but according to people who have seen the Google machinery, the system works like this: Books are placed, one at a time, in a V-shaped cradle underneath two high-resolution digital cameras, each focusing on an individual page. After using metal clamps to secure the outer cover in place, a human operator turns a page of the book, then taps his foot on a pedal to trigger both cameras simultaneously, snapping pictures of the two pages. The operator does this, turning one page at a time, until the book is finished.

MISSING PAGES.  Manually scanning a 500-page book this way would take about 30 to 60 minutes, estimates Lotfi Belkhir, CEO of competing book-scanning technology company Kirtas Technologies, who discussed Google Books issues with reporters on Dec. 13. From there, a software program automatically formats the text to be read by a computer, and optimizes it so that the print shows up clearly.
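
To put Belkhir's estimate in perspective, the short calculation below works out the pace it implies for the operator. It assumes, based on the description of the two-camera cradle, that each pedal press captures two pages; the function and variable names are illustrative only.

```python
# Pace implied by the 30-to-60-minute estimate for a 500-page book.
PAGES = 500
PAGES_PER_SHOT = 2  # assumption: two cameras, one page each, per pedal press


def scan_rate(minutes: float) -> dict:
    """Return pages per minute and seconds available per page turn."""
    shots = PAGES / PAGES_PER_SHOT
    return {
        "pages_per_minute": PAGES / minutes,
        "seconds_per_page_turn": minutes * 60 / shots,
    }


fast = scan_rate(30)  # optimistic end of the estimate
slow = scan_rate(60)  # pessimistic end of the estimate

for label, rate in (("30 min", fast), ("60 min", slow)):
    print(
        f"{label}: {rate['pages_per_minute']:.1f} pages/min, "
        f"{rate['seconds_per_page_turn']:.1f} s per page turn"
    )
```

Under these assumptions the operator has roughly 7 to 14 seconds to turn each page and trigger the cameras, which leaves little margin for checking focus or framing.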

In the best cases, this process can take a decrepit, yellowed book and make text crisp and sharp-looking on the computer screen. But kinks still need to be worked out. Digital versions of books scanned using the Google system contain far more errors than pages scanned loose-leaf.

While, in theory, one should be able to read a public-domain book from start to finish on Google's service, it often doesn't play out that way. Among public-domain books available via the service, many have pages where text is blurry to the point of being illegible, or pages where text has been cropped off. And in some cases, several pages in a row are inexplicably missing.

RELEVANT RESULTS.  Another hurdle Google needs to clear: making the book-search engine work as well as the Web-search tool, says Gary Price, news editor of SearchEngineWatch.com. Displaying the most relevant passages from millions of pages of library books is much harder than searching billions of Web pages, he says.

The reason? Internet search today largely relies on how Web pages link to one another, in order to sort out the most relevant ones. That's not possible within the text of books. "Determining relevancy with little information in a full-text source is a huge challenge, especially when today's users are used to just typing three to four words into a [search-field] box at most" and expect to find what they need in the first page of results, Price says.
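
A toy example helps show what link-based ranking offers that book text lacks. The sketch below is a simplified PageRank-style iteration over a made-up four-page web; it is not Google's actual ranking system, and the graph, the damping value of 0.85, and the function name are all assumptions chosen for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by incoming links (simplified PageRank-style iteration).

    `links` maps each page to the list of pages it links to. Every link
    target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its score, split evenly, to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score to all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical web: pages "a", "b" and "d" all link to "c"; "c" links to "a".
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_rank(web)
best = max(scores, key=scores.get)
print(best, {page: round(score, 3) for page, score in scores.items()})
```

Page "c" comes out on top because three other pages point to it. The pages of a scanned book carry no comparable links, so a book-search engine has to judge relevance from the text alone.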

Currently, even finding a public-domain book in the first page of results is difficult. Often it requires sophisticated searches on the "Advanced Search" page -- something the average Google user isn't accustomed to doing. Google says this will get better, too. "We've got researchers working on this as we speak," says Smith. "Just like we are constantly improving the way we rank Web pages, [search technology] for books is going to be an ongoing area of improvement for us as well."

PARTNER SATISFACTION?  So far, Google's results have failed to wow library partners. While all of the partners interviewed made certain to note that they have been extremely impressed by how carefully and safely Google's staff handles the books, they concede the overall quality of the scans hasn't been great. "We at Harvard do a more careful and high-quality digitization when we do it for our own purposes, there's no question," says Sidney Verba, director of the Harvard University Library.

And the partners concede there's no beating Google on price. Google is scanning each library book for free, while giving each library its own digital copy to store. "Google has never pretended to knuckle under to quality demands that [preservationists] hope for," says Stanford's Herkovic. "But, overall, it's not out of line with what we expected."

Verba adds, "We want the quality to be good enough to read, but not necessarily as good as a printed book." Harvard's hope is that students will find relevant parts online, then check out the book in the library. And, blurry pages or not, if it succeeds in piquing wider interest in the world's great literature, Google's big book experiment may merit the ravest of reviews.

Helm is a reporter for BusinessWeek Online in New York.



Copyright 2000-2008 by The McGraw-Hill Companies Inc.
All rights reserved.
