My presentation [Geoff Nunberg @ Language Log] focussed on GB’s metadata — a feature absolutely necessary to doing most serious scholarly work with the corpus. It’s well and good to use the corpus [fulltext/body] just for finding information on a topic — entering some key words and barrelling in sideways. (That’s what “googling” means, isn’t it?) But for scholars looking for a particular edition of Leaves of Grass, say, it doesn’t do a lot of good just to enter “I contain multitudes” in the search box and hope for the best.
But whether it gets the BISAC [Book Industry Standards and Communications, the subject scheme maintained by the Book Industry Study Group] categories right or wrong, the question is why Google decided to use those headings in the first place. (Clancy denies that they were asked to do so by the publishers, though this might have to do with their own ambitions to compete with Amazon.) The BISAC scheme is well suited to organizing the shelves of a modern 35,000 [square] foot chain bookstore or a small public library where ordinary consumers or patrons are browsing for books on the shelves. But it’s not particularly helpful if you’re flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods.
VERY interesting article about the metadata used for Google’s book project. Brackets [ ] are my additions.