A conference report, with comment:
"The Google Books Settlement & the Future of Information Access",
a conference held at UC Berkeley, August 28, 2009
Many folks are following the GoogleBooks Saga -- which is, by turns, either the glorious democratization of all the world's "texts" -- or a globalization conspiracy to dominate same, both de jure at home in the US and de facto everywhere else...
Well, GoogleBooks came to a conference, today: at Berkeley, famous US West Coast conferencing hotspot -- and host to many highly-original presentations, from the Birth of the Internet to the Très Grande Bibliothèque [http://www.fyifrance.com/Fyarch/fy920414.htm].
Today it was GoogleBooks' chance, then, to face the keen minds, and sharp critical knives, and Left Coast scepticisms, of "Berkeley". And GoogleBooks did OK...
This conference was hosted by the university's school of information, the "iSchool" [http://www.ischool.berkeley.edu/], the earliest of 2+ dozen such postgraduate-level schools recently sprouted on academic campuses, across the US and the rest of the digitizing world [http://www.ischools.org]. UCB iSchool Dean AnnaLee Saxenian introduced her program:
"Data Mining and Non-Consumptive Use"
"Privacy and the Google Books Settlement"
"The Google Books Settlement and Information Quality"
"Public Access and the Google Books Settlement"
[A Concluding Note]
This is the 11th hour of the sound & fury leading up to the US federal court's decision, on the GoogleBooks Settlement:
But Dean Saxenian was concerned to broaden the debate, to address more than just US legal arcana and bibliographic detail, at her conference today.
So the first group of speakers talked about data-mining, and about "users other than consumers": these being two of the fundamental issues of the GoogleBooks project, underlying and far exceeding the details of the legal debate --
Eric Kansa, iSchool adjunct professor and director of their Information & Service Design program ["Information & Service Design / ISD: Teaching and research on the skills and concepts required by a service-led and information-powered economy", http://isd.ischool.berkeley.edu/] moderated and introduced this panel --
As of October 2008, he said, per GoogleBooks itself 7 million books had been mounted, and as of now less than a year later that figure has risen to over 15 million, or over double. For perspective, Kansa suggests, the world's largest book catalog lists only 23 million books [see http://orweblog.oclc.org/archives/002000.html], and per the national census only 2,345,000 book titles have been published in the US since 1880 [see http://www.springerlink.com/content/r2ux87167445382p/]. So GoogleBooks rapidly is covering a significant portion of the world's printed literature.
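For readers who like to check such comparisons, here is a minimal sketch of the arithmetic, using only the figures as Kansa reported them. The variable names are mine, and since "over 15 million" is a lower bound the results are approximate.

```python
# Figures as reported in Kansa's remarks; variable names are illustrative only.
books_oct_2008 = 7_000_000        # GoogleBooks count, October 2008
books_aug_2009 = 15_000_000       # "over 15 million", so treat as a lower bound
largest_catalog = 23_000_000      # books listed in the world's largest catalog
us_titles_since_1880 = 2_345_000  # US titles published since 1880

# How much the collection grew in under a year.
growth_factor = books_aug_2009 / books_oct_2008

# What share of the largest catalog the collection would represent.
share_of_catalog = books_aug_2009 / largest_catalog

# How many times larger it is than total US title output since 1880.
multiple_of_us_titles = books_aug_2009 / us_titles_since_1880

print(f"growth factor: {growth_factor:.2f}x")
print(f"share of largest catalog: {share_of_catalog:.0%}")
print(f"multiple of US titles since 1880: {multiple_of_us_titles:.1f}x")
```

On these numbers the collection has indeed more than doubled, and would amount to roughly two-thirds of the largest catalog's listings.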
Google's project may be about digitizing books, Kansa pointed out, but it also is a matter of, "What to do with their contents, once they've been scanned?"
There are commercial applications, for this -- those have received most attention, in the press and from the critics and in the GoogleBooks lawsuit and pending Settlement -- but also there are research applications.
For the latter purpose, the project becomes not a bunch of books but an enormous and very useful, and very valuable, "dataset" -- a potential database, to be "mined" for all sorts of interesting and not particularly consumer-oriented purposes.
There are commercial firms already interested in using the GoogleBooks "dataset" in this way: Kansa mentioned,
But also there is, simply and purely, "research": he mentioned the National Science Foundation / NSF - funded "Bibliographic Knowledge Network" of UC Berkeley Professor Jim Pitman, the statistician on his panel, which studies knowledge relationships for information retrieval [http://www.stat.berkeley.edu/~pitman/].
Another domain for such study, Kansa pointed out, might be the World Wide Web. Or the GoogleBooks "corpus" might be considered a "cultural genome": and very much as the Human Genome Project has been developed under public sponsorship, as a public trust, the development model for the GoogleBook Corpus similarly needs some fleshing out, Kansa said... No mention of Craig Venter and his very non-"public" human genome sponsorship and trust, there, but Venter's ghost -- or the memory of his noisy & successful human genome effort, as he himself is very much alive and active -- lurks on... [http://en.wikipedia.org/wiki/Craig_Venter].
Kansa's main point is that the GoogleBooks project's primary value may be as a tool for machines, not humans, to read -- for machine processing, not just human eyeballs, he says.
His greatest concern is that, not realizing this, the pending court Settlement unwittingly may restrict it: imposing restrictions requiring permissions from Google (he cites page 81 of the Settlement), or from rights holders (he cites p. 82).
Also, Kansa points out, researchers need to probe and understand and verify every aspect of their sources and their tools: which very definitely includes Google's famous, and famously-secret, "algorithms"... But Google cannot remain a trade-secrets "black box", to be truly useful to researchers, he says.
Nevertheless, already it is clear that the GoogleBooks Corpus has great research value: "literature as data" has a long-term future, Kansa hopes.
GoogleBooks "makes books more valuable than they ever have been", because of the tremendous opportunities presented now from mining their data, he believes.
Brantley cites Ivor Richards' 1924 notion that, "a book is a machine to think with" [cit. Duguid in Nunberg ed., http://books.google.com/books?id=O8xg8EfQnnAC&pg=PA78&lpg=PA78&dq=the+book+is+a+machine+to+think+with&source=bl&ots=DIMLV-vgWI&sig=mtOxfB_GtUG5uMkvfjZZbztf20U&hl=en] -- or, for a tinier URL, please see http://tinyurl.com/nf2chz.
The book is an object of maximized design, Brantley says, one developed historically through long industrial, cultural, and social and political processes.
Now the challenge is to do the same with "books as data": these will be new kinds of the old books -- there is no dichotomy between old and new here, he feels -- we are entering a stage of Gemeinschaft becoming Gesellschaft, Brantley suggests, a socialization of books.
Brantley expressed several concerns about this:
The third scheduled panel member, Colin Evans of Freebase [http://www.freebase.com], was unable to make it to the conference because of the flu, Kansa then announced to the audience... A nervous "swine flu" titter went rippling through the capacity crowd -- a few folks who had been leaning over near their neighbors suddenly sat up a little more straight -- two people left their seats to go sit on the floor in the back of the hall... well apart from one another...
Panel member #4, Professor of Statistics James Pitman, then spoke, indicating what the "academic" community might be able to do, and should have an opportunity to do, he feels, with such a GoogleBooks Corpus:
-- Pitman's concern, though, is that regulation and restriction done now not inhibit the scholarly community's pursuit of such research and applications improvements. He himself would like to look at data structures, for example, likening these to chemical compounds or math theorems: but these are complex, many-layered investigations -- anyone who has tried to negotiate a GoogleEarth permission, with their many-layered maps, each requiring a separate authorization, will sympathize... -- and he greatly fears being required to seek multiple "permissions".
Pitman would like to use, he says, "The GoogleBooks Corpus" -- this is how the Conference increasingly came to refer, to the academic community interest in Google's project, as "The GoogleBooks Corpus" -- to identify people and places and events, to study machine learning...
I could not help thinking, however, as I listened to Professor Pitman's worries about academic research community access, that this was putting the customer cart before the marketing horse: surely this instance of "the future of information access" illustrates the need for users like Pitman to rest a little easier...
Google institutionally knows all about "viral marketing", it's the secret to their commercial success and to that of any other firm which has "made it" in hitech.
So Pitman and the other potential customers in academia need only recast themselves / pitch themselves as "viral research", an enormous potential pool of users for the product -- to attract, and greatly, the attentions of Google Inc.'s marketers... and from thence all permissions and many other good things will flow... Iron Law of Marketing...
A fifth, unscheduled, participant then bounded ebulliently to the stage... I think he was announced: the Dean, I believe, said something to him about, "You might as well get up there, Dan, most of the comment and discussion have been directed at you, after all..."
Dan Clancy, of GoogleBooks, could not have contrasted more radically with the assembled crowd of careful academics -- shock of radiant golden hair, deep suntan, enormous and athletic physique, brilliant smile, the guy is youthful-looking and energetic, billowing-bright-color-shirted and booming-voiced and, yes, halfway through his initial comments he actually did raise both great arms and "virtually embrace", revival-tent style, his somewhat older and considerably grayer and far more "careful" audience below -- academics and librarians in abundance, at this event, and few of those ever "bound" up to a podium...
Immediately, memories of Craig Venter in the "Genome Wars" came back again... The Government Research Bureaucracy meets The Surfer...
In this case, though, Clancy appears to have done his political homework better than Venter did, or cared to: folks in the audience already knew him, some already are working with GoogleBooks, there seemed to be real respect for Clancy and for his project -- not the competition which Venter created and relished, here, but more a collaboration and cooperation -- still The Corporate Approach meets The Academic, but here with much shared vision... or at least the appearance thereof...
"We at Google always have considered Non-Consumptive Use to be 'Fair Use'", Clancy began, seemingly-innocently and placatingly. There were quite a few bitter critics and outright enemies in his audience, he must have known -- these became evident as the day wore on -- but from this initial comment on he defused most of them completely. "We want university-run research centers to be the Host Centers for this GoogleBooks Corpus -- to be clear that its use is for academic applications -- this is to be a 'Research' Corpus."
Clancy sees two different Host Centers developing, he said: which universities will be chosen for this has not been decided -- this UC Berkeley audience seemed agreed that UC Berkeley ought to be one of the two...
This way, Clancy said, responding to Professor Pitman's worries, yes there would need to be "procedures" for gaining academic access to use the GoogleBooks Corpus, but that access would be provided by universities themselves, following procedures already familiar to academic researchers such as Pitman.
Google recognizes the immense value, to research, of the Corpus their Books project has been assembling, Clancy said: they want to see it used, by academics -- so long as the usage is non-commercial, Google will do everything they can to keep permissions procedures etc. from "chilling" research. Their upcoming Settlement is one step in precisely this research-oriented direction, he said.
The second panel of the Conference looked a little more at the law. Nothing too specific, this being academia... But the law at least specifies minimal rules of how the game gets played. And in the brawling bleeding-edge world of GoogleBooks, "privacy" is a leading legal concern.
Chris Hoofnagle started off his panel's discussion by observing that if we are to have "privacy by design", in this, we are going to need "early intervention". He is a lawyer, at the law school here, an expert on "information privacy law".
His first speaker, however, was not a lawyer but a librarian: Angela Maycock, from the greatly-respected Office for Intellectual Freedom of the American Library Association [http://www.ala.org/ala/aboutala/offices/oif/] -- in the US, significant social actors do not always bear names and labels which precisely signify their activities -- the librarians long have been leaders in US "privacy" fights, defending patrons and librarians and others against censorship and snooping by government and various groups.
Maycock said her organization is not opposed to the GoogleBooks Settlement which Google has negotiated. ALA favors open access to information, she said, and it is their hope that the GoogleBooks project will provide that.
The librarians are very concerned, however -- they have been since 1939, she reminded us -- about "privacy", which she defines as "open inquiry, unrelated to its subject-matter", and "confidentiality", which they believe must conform to users' "reasonable expectations" regarding the use of information about themselves.
Above all, she said, ALA believes strongly that users must not feel "chilled" by restrictions on their use of information -- "felt reluctance" is the enemy here, she suggested, it discourages readers from using the resources. Librarians do like to see their books get read...
Then a practicing librarian, the director of the UC Berkeley Library, Tom Leonard, took the podium. First he honored his preceding speaker, declaring that US "public" librarians, largely represented by Maycock's ALA, are in the "front lines" of the privacy debates. There are more "public libraries" in the US than there are McDonalds Restaurants!, he said, dramatizing the pervasive influence of public libraries here.
The professional role in privacy, then, is to protect the library user's anonymity. For example, user data gets purged once it is used, so as not to be available for snooping later by officials or others interested in the user's interests or opinions or reading habits.
Leonard acknowledges, however, that there may be some trade-offs ahead, as new digital libraries may provide better "access" through relying upon methods once shunned for their possible threats to "privacy". Nevertheless he is very hopeful for the results of the GoogleBooks project, he says.
Leonard mentioned the recent alarmed letter sent to the Settlement court by 21 University of California faculty members, linked above here [UCLetter]. We already are able to monitor information users in libraries, he said, but we don't: because this would violate current policy -- but also because equipping libraries with sufficient personnel and networking capacity would be wildly impractical. I thought of the wild impracticality of John Poindexter's now-infamous "total information awareness" schemes [http://en.wikipedia.org/wiki/Information_Awareness_Office] for equipping the US federal government with such capacities, in the name of anti-terrorism: for large-scale reading and indexing of all citizens' email for instance... tasks of the type the National Security Agency / NSA [http://en.wikipedia.org/wiki/NSA], established 1952, exists to perform...
More generally, Leonard reminded this UC Berkeley audience of two of the lessons the university garnered from the recent experience of another large business corporation on-campus:
Jason Schultz, a glib and entertaining attorney with long experience in these matters, now director of the Samuelson Law, Technology & Public Policy Clinic [http://www.law.berkeley.edu/clinics/samuelsonclinic/] at the law school, spoke next.
He reminded us of the famous-if-frightening Stanford "marshmallow" psychology test [http://www.newyorker.com/reporting/2009/05/18/090518fa_fact_lehrer], where researchers gave children a choice, between a marshmallow candy now or forgoing that to earn another marshmallow later -- saying he felt, with the current GoogleBooks Settlement, that it too is a matter of "a marshmallow now" or "a better marshmallow later"...
The project as-constituted now provides very valuable information access, Schultz said. Concerning privacy he would have four concerns:
Two other Google projects might serve as models for handling privacy concerns in this GoogleBooks situation, Schultz suggested:
"What does Google need to do, to reassure us?" ought to be the general critical position on privacy concerns, Schultz suggested.
This panel's next speaker was Michael Zimmer, assistant professor of Information Studies at the University of Wisconsin. His primary academic interest, he says, is the interaction between ethics and technology. He sees two leading issues in the GoogleBooks debates:
Zimmer mentioned the "aesthetic" burden of adding a "privacy link": referring to the tiny link at the very bottom of the main Google search page [http://www.google.com, http://www.google.com/intl/en/privacy.html], which now leads to corporate declarations of their commitment to privacy, and offers to address user worries about it -- installed there only after a fight or at least pressure from outside, he said.
Vastly minimizing or even underestimating the difficulty Google must have had, in loosening up their precious "one-box" interface design philosophy and approach -- key to their commercial Web success, in the opinion of many -- to accommodate the privacy-concern there...
My guess is Google was very willing to put the issue somewhere else on the site -- they're officially committed to "Don't be evil!" as their corporate motto, after all [http://www.google.com/intl/en/corporate/ux.html] -- but the privacy-zealots insisted on the home page, space sacred to the anti-clutter "One-box" design ideal.
Ideals, sacred, zealots, "Don't be evil", philosophy, ethics... This digital info business gets very religious, at times. It's never just "aesthetics", however -- never just "the money", either -- not on either side.
Zimmer mentioned GoogleStreetview's adventures and misadventures, as an example of Google's insensitivity to the privacy issue --
Street View is the Google project [http://en.wikipedia.org/wiki/Google_Street_View] to link actual onsite photographs to the various geo-location services, such as GoogleEarth and GoogleMaps, which the firm offers.
Little GoogleCars, equipped with wifi GoogleMaps and cameras and enthusiastic young Googlers, have been zipping around urban and other neighborhoods for several years, now, in the US and overseas, generating snapshots of every possible street scene or neighboring facade or road-crossing, taken from every conceivable angle, for inclusion in photo-panoramas to which one now can click from within GoogleMaps.
To get an idea of the immense power of the insight, here, in GoogleMaps try finding a street address in Montpellier, along the ancient rue de l'université, then click on the photos and take a "virtual" walk-up-the-street -- Rabelais at your side, or Petrarch ambling along up ahead... you practically can smell the croissants, and feel the hot southern summer sun on the rooftops high above as you shelter in the cool shadows beneath...
And you can begin to imagine further developments of the Street View project: as photography and image-splicing and throughput all evolve and improve, "virtual Montpellier" will grow more and more "real", here.
But it works a little better in Cupertino than it does in Languedoc-Roussillon, or in Buckinghamshire... There have been the inevitable "dogbite" problems: standard risks in any street -- a fist shaken here, a face hidden there, a few angry confrontations. In Bucks, UK, one village threw the Googlers out, a high-dudgeon resident there irately insisting that they needed his permission to photograph his house [http://www.guardian.co.uk/technology/2009/apr/03/google-street-view-broughton] -- I've had San Francisco neighbors tell me this too, but here it's more rare I think.
Street View Googlers sheltering behind US legal protections and "open-ness" cultural norms are guilty of "too-legal" thinking, Zimmer points out. Even in the US: as a local example he offered an apparently-now-notorious Streetview photo of an Electronic Frontier Foundation / EFF attorney, taken not far from his EFF office -- first rule of dogbite, never bite a lawyer... one of the most Orwellian "chilling" police-state measures might be snooping on such people at their place of work, or in their own homes...
If Google encourages such practices they run afoul of even very-loose US civil liberties concerns -- how much more difficult overseas, where far more restrictive legal and cultural civil liberties practices may exist, against backgrounds of recent very real police state practices, painful wars, long-held cultural insecurities, ancient traditions.
And everywhere, after all, there are cultural taboos against "capturing" or even "depicting" the human image: whether it is camera-shy gypsies hiding their faces, in Europe, or elderly hilltribe peasants in Southeast Asia worried that their children's "souls" might be "taken", the fear of photographers remains widespread, globally. Sure, maybe all that "should change"... like a lot of things... But that is just youth talking -- in the meantime, until it does change, Google needs to approach its street-photography projects knowledgeably, and very diplomatically.
All of this is part of ethics, ethicist Zimmer says: he believes all such projects need "ethicists" on their staffs, in addition to "engineers" and "lawyers"...
I felt the audience's many "lawyers" squirm a little, though, hearing this last: engineers may be unsurprised at being distinguished from ethicists, perhaps, but lawyers consider themselves somewhat expert on the "ethics" subject -- whatever others may think, sometimes, of their claim...
Zimmer might do better with "business" people, too, if he were to pitch his ethics concerns to them in terms of marketing -- i.e., "If it seems unethical to the customer it won't sell..." -- crass though this might sound to a philosopher, a businessperson has a hard time allocating for an "ethicist" in a budget. But this was an academic discussion...
Zimmer mentioned some "design options", which he believes might protect privacy:
Zimmer mentioned TrackMeNot software, a Firefox extension [http://mrl.nyu.edu/~dhowe/trackmenot/] -- also other work being done at NYU by developers Daniel C. Howe and Helen Nissenbaum, on the problem of "polluting your own data cloud" [http://mrl.nyu.edu/~dhowe/, http://www.nyu.edu/projects/nissenbaum/] -- saying that all these are becoming privacy-protecting options, now, for any Internet user, and GoogleBooks both needs to be aware of them and to adapt to them.
This Berkeley conference, by now, clearly had become "constructive" -- rather than the "confrontational" which so many of us had feared, and which so often happens. Berkeley "democracy" often has degenerated into shouting matches, in the past -- a necessary freedom, in a democracy, but one subject to abuse just as too much gentility can be.
Zimmer's suggestions, like the ideas of those who preceded him, contained barbs but those were not their true point, I felt -- he, and the others, sincerely did believe that Google's good Books project could and would be improved by their input, and the proof not only was sitting in the audience in the person of Dan Clancy, but now could be seen on Google screens' "privacy" buttons, and in various GoogleBooks changes, the upcoming Settlement perhaps included.
A few fireworks were upcoming in the conference too, however...
An audience question was asked, of this panel: "What about the digital watermark, will that invade user privacy?" Jason Schultz answered: "We don't know, yet..." -- he then made the good tactical point that now is the time to discuss it, and other privacy concerns, to get "early enforcement" written-in to the GoogleBooks Settlement if we can.
Tom Leonard observed that although librarian professional values are very strongly protective of privacy, there have been "no cases in 20 years" at his institution. He suggested Cal's famous Mark Twain Papers as an example: not much need there for worry, he believes -- patron records are kept, as with any archival or historical material of value, but he doesn't feel too many folks are that interested in such records, unless & until some piece in the collection turns out to have been damaged and they need to find out who did it.
This comment by Leonard was curious, to me: surely the same could not be said for library holdings of, say, House Un-American Activities Committee / HUAC - period archives, or of "Loyalty Oath Controversy" materials -- or, more recently than the old Red Scares, post-9/11 Terrorism Scare demands by government for reader records and particularly for library Internet terminal security, which were much in the news [http://www.ala.org/ala/alonline/currentnews/newsarchive/2001/september2001/fbitargetslibrary.cfm].
Another audience question asked, "What paradigm, now, for privacy?" Zimmer suggested "personalization services": that there has been a shift in this direction, and that now libraries among others may change.
He did not elaborate, but I believe he refers here to the increasing appearance online of "My Account" services -- whether these are for commercial shopping, or book purchasing, or for "My Lists" or "My Favorite Searches" library service, all can be tabulated and analysed now to very specifically identify and precisely "profile" a user.
Schultz offered, "We need binding commitments: not 'trust' -- that changes..." He's certainly correct: even in the US -- and definitely in Europe, and elsewhere -- the great fears regarding 'US commercial corporation' sponsorship, of this GoogleBooks undertaking, center for many around the sheer changeability of US corporations:
Schultz qua lawyer believes we should require Google to require "compelling need" before releasing user information to anyone, government included, he says -- that way a judge will be able to get in there to enforce privacy protections, when user information queries get made.
And then up jumped Dan... To be fair, he had kept demurely quiet throughout the long disquisitions on his work delivered by others -- but you could see from his "bounce" that he was dying to speak out again...
"We are in a transition from libraries to private corporations for information access!", Mr. GoogleBooks declared...
Well no, we're not. But then reasonable minds can differ.
"The GoogleBooks Settlement is a first step, not a last step!"...
On that we're agreed.
The conference broke for lunch promptly at 12:30, as-scheduled -- Dean Saxenian runs a tight ship -- and we all adjourned to enjoy a very good box lunch out on the new CITRIS building terrace, enjoying the hot Berkeley August sunshine for an hour.
At 1:30, again promptly, we reconvened,
This panel was fun -- not that the previous two had not been, but this one offered some added panache.
Paul Duguid, of the iSchool, gave some animated and entertaining introductions, both to his topic and of his panelists.
iSchool colleague Pam Samuelson had invited him, he said -- he has taught and written about the subjects addressed by the GoogleBooks project for some time -- he wants to share with us three general notions,
-- so that when discussing or more likely debating GoogleBooks, it is possible to choose among its various facets, or to select only one, or to conflate them all.
- also, per Duguid, we are,
I am very much with him, about the flood: "sipping from a firehose" was one famous metaphor for information overload's inundation, a while back -- it deserves resuscitation -- for a time Google's infamous algorithm has changed its course, if not exactly stemmed the tide, and the Semantic Web does hold some promise -- but now we have entered the Age of the iPhone, and the flood is showing strength again, this time greatly renewed.
Mark Liberman, of the University of Pennsylvania, was this panel's first speaker. He is a linguist, Duguid had said. Liberman says he studies the English language, which is now about 500 years old and contains roughly 1 million words -- by comparison the much newer GoogleBooks Corpus contains five to six times that, enough to "strangle cultural history", he says.
Liberman believes two points are fundamental:
The GoogleBooks Corpus is only about 100 gigabytes now, Liberman says... In an era of inexpensive 32 gigabyte thumbdrives, that is not very much... I suppose he means text-only... [The number seems small, too, compared to the Library of Congress' estimated 20 terabytes of "text" data among its 130 million items -- http://www.businessintelligencelowdown.com/2007/02/top_10_largest_.html]... But, if he's correct, perhaps there could be wide distribution of at least the non-copyrighted portion of the GoogleBooks Corpus, Liberman suggests.
The iSchool's, and NPR's, Geoff Nunberg then had some fun. "This is The Last Library!", he declared, "it won't be scanned again. There is no Moore's Law for data capture."
So we have to be careful with this one, Nunberg suggests -- above all, its data must be good quality. All this is being done in whose interests, he then asked rhetorically,
-- but all of those require good quality data, and already GoogleBooks does not necessarily provide that, he says --
Nunberg's slides then showed the results of his casual personal Googling through the current GoogleBooks, looking for authors' publication dates which predate their birth dates!... plus other comparable anomalies...
And turning up hundreds... Nunberg has wonderful lists, showing books Charles Dickens supposedly published a century before he was born -- and works Mark Twain would have been surprised to find, exaggeratedly, that he himself had written two centuries before his own birth date...
Garbage In / Garbage Out, perhaps -- as Googlebooks' Clancy was to suggest later, blaming faulty library and publisher cataloging data which they had imported, but there are other problems.
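The date anomalies Nunberg turned up are the sort of thing a very simple consistency check can surface, wherever the bad data originally came from. What follows is only a sketch under stated assumptions: the record layout, the function names, and the sample records are all hypothetical, invented for illustration, and say nothing about how GoogleBooks actually stores or validates its metadata. Only the two authors' birth years are real.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BookRecord:
    """Hypothetical catalog record; not GoogleBooks' actual schema."""
    title: str
    author: str
    author_birth_year: Optional[int]
    publication_year: Optional[int]


def predates_author(record: BookRecord) -> bool:
    """True when a book is dated earlier than its author's birth."""
    if record.author_birth_year is None or record.publication_year is None:
        return False  # missing data is a separate problem, not this anomaly
    return record.publication_year < record.author_birth_year


def find_date_anomalies(records: List[BookRecord]) -> List[BookRecord]:
    """Return every record whose publication date predates the author."""
    return [r for r in records if predates_author(r)]


# Invented sample records; the birth years (Dickens 1812, Twain 1835) are real.
sample = [
    BookRecord("Hypothetical record A", "Charles Dickens", 1812, 1712),
    BookRecord("Hypothetical record B", "Charles Dickens", 1812, 1850),
    BookRecord("Hypothetical record C", "Mark Twain", 1835, 1635),
    BookRecord("Hypothetical record D", "Mark Twain", 1835, None),
]

anomalies = find_date_anomalies(sample)
for r in anomalies:
    print(f"{r.author}: '{r.title}' dated {r.publication_year}, "
          f"author born {r.author_birth_year}")
```

A check like this only flags the impossible cases; a book dated plausibly but wrongly would pass straight through, which is part of why the quality question does not end with Garbage In / Garbage Out.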
Nunberg also listed out some "subject classification" howlers he'd found... Google apparently lost patience with the library community's "format integration" and other longstanding professional battlegrounds -- they lost this patience long ago, in GoogleTerms, aka a few years ago, while the battles have been going on since Dewey tackled the Système Clément during the late 1800s, and even long before.
But impatient GoogleBooks abandoned the cumbersome Library of Congress Subject Headings / LCSH, and the for-them-tedious task of maintaining those, favoring instead the to-them-more-efficient commercial publishing Book Industry Standards and Communications / BISAC subject headings list.
But to Nunberg, and some vocal commentators from the audience, the latter list presents among other problems only 3000 subheadings, versus the over 200,000 subheadings of LCSH, fundamental tool of US academic research. Not enough...
In librarian terms, that is far fewer "access points": BISAC, designed for shelving in a retail store, offers far fewer indexing terms, for figuring out what a book is "about", than LCSH does -- academic research requires / demands the greater access.
Subject classification is an old and thorny problem of librarianship. Since Aristotle, maybe... Reception of Melvil Dewey's 19th c. US approach, with its many classifications for American "Protestant" religions, into Catholic Europe, only illustrates the more general point.
Datamining has done much to enhance the search for meaning and relevance in information retrieval, and it may do more. Still, though, there is the problem of bias: "relevance" is not a computation but an epistemological concept, subject to the 2000+ years of puzzling over "meaning" which that field of enquiry has engendered. Nothing about relevance is going to be solved overnight, by datamining.
In the meantime the intercession of a human librarian armed with a tool like LCSH still offers our best guess as to what an author, and an information searcher, really "mean". This process is being aided, greatly, by datamining now, but not replaced: where retrieval does occur, now, searchers simply grab from the top of the Google search heap -- and for serious searchers that is a beginning but not enough, not for movie tickets and not for research.
One analog, to datamining's position with regard to librarianship, is chemistry's with regard to medicine: sure, some day we may have "pills" to cure anything, but until then we still need doctors...
Clifford Lynch spoke next. He currently is Executive Director of the Coalition for Networked Information / CNI, and he teaches at the iSchool; before that he served years in the UC Office of the President as chief of UC's "Melvyl" systemwide library catalog. And with his remarkable personal recall & precision qualities, he always can be relied upon to both remember and express the most salient points, in any "info-tech" conversation -- tough hombre in a fight, Cliff.
Today he was in a fighting mood:
"I think a broader frame for this conversation would be useful," he started off, "This feels bigger, than a simple legal settlement."
Music to academic ears... although "death" to lawyers and to businesspeople, relentless pragmatists usually furiously and too often blindly bent on getting a thing "done"...
But Lynch was very firm: "If this truly is to be The Last Library," he said, "it seems to me important to get that right." He posed "a few specific questions":
It is important to both sides to understand that this is not a "public/private" situation: there, very often, a well-delineated "project" is involved, with a well-if-not-always-meticulously-defined beginning and process and ending -- so much so that everyone can retain, and cherish, enormous shares of extracurricular activity, and "secrets", held rigidly off-limits to the other parties involved.
But that is not true of "academic" inquiry: based on curiosity, always boundary-spanning, and testing "limits", and breaking them -- academic inquiry is normally not well-delineated, almost by definition; it is more typically "fuzzy" in fact.
So maybe that is what Lynch has in mind... some of it, anyway... he didn't really elaborate, in his short time on this particular stand. But his comments tend generally to address the most salient points, as I said, and to have a great deal backing them up...
So, perhaps unsurprisingly, Lynch insists that there must be a "process": no fixed one-shot panaceas -- not even LCSH, and certainly not BISAC -- whatever "system" is used, Lynch's question is, "who will maintain and update it?"
A questioner agreed, saying that any database needs flexible rules about inclusion and exclusion -- researchers want to and need to know, of any database, how and why it grows or shrinks.
Lynch added that knowledge of all these dynamics becomes doubly important, now that self-publishing is growing, and now that increasingly we have no gatekeepers such as the traditional "publishers" left.
Lynch then made several "broader points":
"a need for a books database"
-- and it is, Lynch concluded, "unreasonable to beat up on a profit-making corporation for not spending money on things that don't make money" -- there was a twist and maybe even a barb, in there, which did not escape Google's ever-alert Clancy, see below.
Lynch also said there is a fundamental issue, in all of this, concerning the role of public investment: compared to the Human Genome project, he said, libraries here are seeing a failure of public investment, but public data needs separate treatment from private commercial and other forms of data.
At this point in the conference, Googlebooks' Clancy no longer was sitting out in the audience, bounding up excitedly to the podium when called upon to give a response -- now he actually was embedded up there on-stage, with this panel, seated right between Nunberg and Lynch as the knives were going in and twisting -- when his turn came he avidly grabbed the microphone --
GoogleBooks is not the only library, he retorted, and certainly it is not the last library... That is precisely why GoogleBooks has open APIs out, to promote other input, as things will be changing further, he said.
Cataloging errors did increase with fulltext searching, Clancy admitted: he pointed out though that the GoogleBooks sources for cataloging were records obtained from publishers, and from libraries!
As for the Human Genome Project's public v. private "race", those comparisons were inappropriate, Clancy said: GoogleBooks is an open effort, involving collaboration among many players, which was not the case with Craig Venter's firm and the government human genome effort, he contended -- "This is not about one player", he said, with GoogleBooks this sort of project is "moving to multiples".
Lynch then volunteered that OCLC is "not necessarily together with all libraries", which can make the cataloging effort difficult. I thought of France, where until very recently Dewey and several different MARC flavors, plus many local variations and eccentricities, all got mixed together, sometimes within the same institution.
A question was asked about an OCLC "Creative Commons" license... then an excited wrangle went on for a bit about whether, and to what extent, OCLC or its member library "owned" that library's records...
Duguid spoke up in favor of the Last Library idea, worrying about it, Clancy chiding him not to worry so much... Clancy assured everyone, again, that some books will be re-scanned -- we will "need more scans" he said...
Scanning surely will not stand still, any more than any digital technology might: I would think better resolution, and pixels with more information, and particularly OCR improvements, all quickly will be moving things forward considerably from where they are now...
Someone from EFF commented -- ideas and conversation were flowing hot 'n heavy, now... -- that scans might be placed in a 28-year escrow? with royalties?... Someone else asserted that all this stems from The Copyright Act... "The Statute of Anne, actually," Duguid remarked...
Clancy reasserted his control of center-stage: reality-check time, on some issues --
Pamela Samuelson, then -- and the day's final panel...
Pam Samuelson -- of the UC Berkeley iSchool, and the law school, and the Samuelson Law, Technology & Public Policy Clinic there -- has become a leading light, and inspiration, for many in the US and elsewhere concerned about the public policy of our new digital information world.
Her crisp articles on arcane issues have clarified these for many -- she also holds strong opinions, and expresses them forcefully -- all well-demonstrated by her two specifically-GoogleBooks articles already linked initially here [Samuelson], and also in her general bibliography [http://people.ischool.berkeley.edu/~pam/papers.html]. I am not going to try to summarize Samuelson's positions or even her conference contributions here, though -- she's a good lawyer, too, and you don't mess with those -- instead read what she says herself...
But... Dan Greenstein, her panel's first speaker: he is Provost for UC Libraries, and he was instrumental in negotiating his institution's inclusion in the GoogleBooks project --
There has been a transition in libraries, Greenstein says, to "access" from "the care and feeding of books". In both cases, though, these activities and the resources they require have been public -- "a sacred public trust", he calls them.
Initially no one was willing to commit to GoogleBooks access, Greenstein says, the project simply was unimaginable to most. Yet now that might be said of a post-Settlement world without GoogleBooks: that would be a failure of imagination, he believes.
The project currently is in open access: post-Settlement, Greenstein favors a separate Alternative Registry, for public access items -- that plus an Orphans Registry, perhaps, for items for which authors and publishers cannot be located.
Whatever structure eventually evolves for its administration, it is the democratic impulses involved in the project which are compelling, Greenstein said.
Carla Hesse, this panel's next speaker, is a figure of particular interest to French readers here. She has had a distinguished career as an historian of France, writing about French intellectual history and particularly about its publishing [http://history.berkeley.edu/faculty/Hesse/] -- and recently she was appointed Dean of Social Sciences in the College of Letters and Science, one of the university's most important positions [http://berkeley.edu/news/media/releases/2009/07/16_hesse.shtml].
For Hesse this conference must have been fun, a welcome break from the daily maelstrom of university administration... She showed us some of the qualities for which the university appointed her as Dean, perhaps:
Geez -- as they do not exclaim in academia... -- I hope not! "Commercial" covers one important and interesting aspect of my existence, but only one: "profit", as we in California have learned the hard way since Reagan, is not all things to all people...
Hesse, however, was careful -- a good attribute, in addition to her boldness, for both academia and administration --
"All three will continue," she said, right away, but the new dominant mode will be "commercial"... Well, what's a "dominant mode", I wonder...
Hesse believes that the relationship of public libraries, to this process, will be one of "regulating" access: gatekeeper, similar to situations regarding other public resources -- transport, energy, news, administration itself -- all these employ commercial services, but in each there is some regulation from the public sector.
I expect there was some worry in the Berkeley audience, over this -- Berkeley has for a long time been a haven for devotees of the public sector.
Hesse went on: there is a "Too Big To Fail" problem with GoogleBooks, she says -- she listed three scenarios,
This last might be labeled a "Periodicals Scenario", Hesse suggests.
I thought then of the Très Grande Bibliothèque, in Paris, a place Hesse knows well: particularly of its colossus at Tolbiac, with its several layers of "access" corresponding directly to the structure of the new building itself -- initial / superficial uppermost layer for the "grand public" tourists -- additional layers below of space and permissions, for accredited readers and students only, progressing downward -- eventually reaching the inaccessible preserves of the research scholars, hidden far below -- and, also hidden away down there, or up in the towers somewhere among some of the books, a layer better-hidden than Umberto Eco's "Aedificium", reserved exclusively for "staff", equipped with all sorts of arcane and highly-secret & very special "access"...
Hesse points out that we are being asked to have faith in GoogleBooks.
So that perhaps is what all these "Too Big To Fail" scenarios and other issues come down to, matters of faith -- we are talking not about the past but about the future, here, and the latter game always has been fraught with great difficulties, very often relying upon just "faith".
Hesse concluded --
We have a process of commercialization under way, here, she says. This is not something new however, she points out, in the long history of book publishing. Libraries themselves can, for some purposes, be thought of as "large books", just like this new GoogleBooks Corpus, she mused. She mused, also, how interesting it is that the true value-added here may be in the para-text.
Hesse included a warning for university libraries, though: she said,
James Love spoke then. Among the day's panelists he seemed to be GoogleBooks' harshest critic -- although as we have seen several of the other speakers slipped knives in too, perhaps more diplomatically, in the time-honored tradition of academic "constructive criticism".
Love is Director of Knowledge Ecology International, a non-profit and Non-Governmental Organization / NGO which,
"...searches for better outcomes, including new solutions, to the management of knowledge resources. KEI is focused on social justice, particularly for the most vulnerable populations, including low-income persons and marginalized groups. There are probably 5 billion people who live in the margins of the global economy, and an entire planet that depends upon knowledge for economic and personal development, education and health, political power and freedom, culture and fun...
"KEI undertakes and publishes research and new ideas, engages in global public interest advocacy, provides technical advice to governments, NGOs and firms...
"KEI is particularly drawn to areas where current business models and practices by businesses, governments or other actors fail to address social needs, and where there are opportunities for sustainable improvements. KEI was created as an independent legal organization in 2006..."
The KEI Board of Advisors includes Joseph Stiglitz and Amartya Sen...
Interesting-sounding organization, then, with interesting advisors -- here is what he had to say --
This is a situation involving, "private sector competition with government agencies", as Love sees it: Google Inc. plus The Authors' Guild plus commercial publishers...
Framing the problem this way presents it more the way it is being, and will be, seen overseas, I believe. In the US we view the situation differently and somewhat uniquely: here private corporations compete on a level playing field with government, in many sectors -- government here has "checks & balances" and other restrictions -- while overseas, by contrast, "government" may be both relatively-unitary and "sovereign", internationally but also at home.
The oddball, overseas, in many cases is the private corporation: in most cases that entity, far more than government, is viewed with suspicion -- in some cases, still, and after many years of trying, private corporations are not even allowed to exist, or at least to operate freely.
So the US situation of Google Inc., in this GoogleBooks project and generally, reverses the situation in most places overseas: there, "government" does this sort of "cultural" thing, normally, particularly if a project is as large as this one, and there the private sector has literally no say.
Love suggested a reference for general reading on the GoogleBooks subject: "How the Settlement Will Work", by Isabel Howe [http://www.authorsguild.org/advocacy/articles/settlement-resources.attachment/how-the-settlement-will/How%20the%20Settlement%20Will%20Work.pdf], tinier URL http://tinyurl.com/lwtvu8...
Three major issues are involved in such massive book-digitization, Love believes:
-- shades, in all that James Love said, of the much broader and very different "world out there" -- international and trans-national -- inadequately addressed if addressed at all by the other speakers. Not surprising, for the director of an NGO with Stiglitz and Sen as its advisors...
And all stark reminder, that in a globalized world such as we have now -- particularly one so completely dominated, in hi-tech and other areas, by the US -- every word uttered or heard in this room today will have direct impacts outside US "legal" borders, consciously and intentionally or not.
The panel's and the day's final speaker was Molly Van Houweling, of the law school --
"Prices will go up," she said... Reality-check, and a useful wake-up call and reminder that the GoogleBooks project has a dynamic -- to any of us snoozing through the afternoon who had been thinking that all this simply was a problem that might be "fixed".
Van Houweling sits on the UC Faculty Committee on Libraries, she told us, where she has learned about the "bundling" that goes on when vendors want to raise journal subscription prices for libraries -- both for that reason and as a barrier to entry for competitors.
In GoogleBooks they face a very big "bundle", true -- and one more unique, in the resource and service it provides, than the most unique journal subscription offer.
With her "legal" hat on, then, Van Houweling also observed that the journal subscription contracts offer vendors a means of circumventing "fair use" and other copyright protections for the user: the contract can sign that away, between the parties.
There is a need, she believes, for institutional curation of a GoogleBooks registry, just the way universities now maintain institutional subscriptions databases.
Then in came Dan, again... I do not hold this against him, myself, although some in this audience did: Clancy's interventions, on behalf of his firm's project, were invariably informative, and they often were entertaining -- even if they were in large part sales-pitch, which after all is his job.
On this occasion -- this latest intervention -- he embraced the topic others had raised, of the "Broader Evolution" of GoogleBooks. First of all he asserted this does not involve all the world's books... There are librarians in France who fear that it does, in practice and popularity to a Web-besotted modern audience, if not theoretically -- see Jean-Noël Jeanneney's articles, linked initially here [Jeanneney], and he is not the only one.
GoogleBooks puts a spotlight on the last 2000 years of the book as physical goods, Clancy said -- I wasn't sure what he meant, right there, and neither were some audience-friends afterwards... And the real focus so far, he added, has been on out-of-print books only, which all agreed was true, but most of us were curious to discover what other foci might be out there...
Clancy mentioned, too, that 97% of the "book market" involves in-print books -- and so only 3% involves out-of-print books -- and GoogleBooks is about out-of-print books... and, so, there should be no antitrust or competition considerations involved here... he said...
I am not sure Pam Samuelson agrees with him, on this -- but read her own words and reasoning in her two articles linked initially here [Samuelson], as I've suggested...
As for prices -- Van Houweling's point that "prices will go up", echoing other panelists, and thought about repeatedly by everyone in the room -- broad Internet access will drive prices down... Clancy said...
An audience question, then -- one I had planned to ask myself, but this guy was a Ukrainian-looking and -sounding student, and so he was far better-suited to ask it, I suppose -- "What about us, the 'foreigners'?", he wanted to know, "What about the 'Third World' of 'information' -- how is all of this going to impact the Ukraine?"
Google's Clancy was immediately-careful: clearly a veteran -- others might have blundered into this one -- I myself might have, with my personal fascination for things-international, charging into foundation-less generalizations about "globalization" and so on. But Google is preparing Dan -- or Dan is preparing Google, perhaps, or perhaps both -- for several "next steps", in scaling-up this GoogleBooks project of theirs, I am sure. One such step being in-print publications, another being e-publications maybe, another being "international / trans-national"...
What Clancy replied was,
"International users -- Because this agreement resolves a United States lawsuit, it directly affects only those users who access Book Search
in the U.S.; anywhere else, the Book Search experience won't change. Going forward, we hope to work with international industry groups and
individual rightsholders to expand the benefits of this agreement to users around the world."
-- not a court document, the above, but clearly lawyer-drafted which is the next best thing -- careful to the point of myopia, though -- both Clancy and his corporate counsel clearly realize that there is an enormous Global Market lurking out there, that Ukrainian student's "Third World of Information", a Global Market thirsting heavily -- panting -- for access to all of this...
One attorney in the audience tackled Clancy head-on during the Q&A: "Was there 'diversity of interest' in determining the 'class representatives'?", she demanded, over and over again... Clancy said he didn't know, wasn't in the room, that was court-decided... Clearly "how representative" the class is, to the lawyers present, is one of the central issues of the upcoming class-action Settlement, just judging from the neutral smiles and intense focus of many faces in the room, as this clash was going on.
Other questioners and panelists were wondering about the price-setting mechanism, whether anything at all definite enough, or legally enforceable, is being established here.
Others, though, saw and were interested in issues extending far beyond that: Amy Kapczynski of the law school faculty, for example, followed up both the Ukrainian student's plea and Clancy's "pricing" explanation with a good question about Google's "foreign market" intentions -- there are some small and powerless nations overseas, she pointed out, representing small and even insignificant commercial markets -- so how will these gain access and be treated, as GoogleBooks scales up internationally, she wondered.
Clancy did not really have a reply for her. He said lamely that the plan is to go country-by-country, trying Google's best to satisfy the often-very-different legal and regulatory regimes and cultural demands in each place.
If there is no Settlement, he said, that task will be made immeasurably more complicated: the cost of clarifying rights is enormous -- legally-proper "letters of reversion" being a rare practice in the past generally and particularly on low press-run works -- and particularly in academic publishing, he added, looking out into the academic audience and wondering out loud how many of those present had such letters in their files. Nervous smiles... very few, I'm guessing -- he struck a nerve there...
The Gordian Knot was not in fact Orphan Works -- cases where good-faith users of copyrighted content wish to license a use but are unable to locate the copyright owner after a diligent search -- Clancy said. Certainly not in out-of-print publications: most old books are tough to trace, he said -- international access is going to depend on clarifying rights, that is one reason why the Database and the Registry foreseen in the Settlement are so important, he said.
Panelist Jamie Love waded in with an example from his talk, repeating that there might be two Registries: currently the average of book royalties contemplated is $8.50 -- what if there were a "hi" registry and a "lo", he asked. Love worries that what GoogleBooks represents now is "authors' & publishers' price-fixing collusion via Google", he says.
Clancy countered with other similar pricing examples, already in operation on the Nets. He suggested the current GoogleBooks pricing-by-algorithm approach, which he believes avoids the collusive price-fixing accusation -- pricing based upon usage data -- shades, here, of the value-free objectivity implied by the "popularity" element in Google's famous search-algorithm...
And, yes, we finally did adjourn. The redoubtable Dean Saxenian bravely closed us down, promptly at 5:15pm, even though the questions then still were flying...
It's been a wonderfully-American academic meeting: very open, brawling, no-holds-barred -- well-structured but with bendable rules, subliminally hard-hitting but ultimately polite -- precisely the kind of meeting so hard to hold in other cultures, where either the subliminal gets too much emphasis, or far too little, or where the ultimate politeness too often does not get emphasized at all.
All of which indicates, however, the uniqueness of the US experience generally: nothing at all typical, about the US approach and experience -- or easily applied overseas in non-US situations. Which makes one wonder how any of this will "scale up", to non-US applications...
Not at all like recent US Health Policy "town meetings", either: those have been untypical even by US "brawling" standards...
And now there are rumbles coming in from a Broader Picture, from US "research" developments of which non-US researchers may find it difficult to imagine ever becoming a part --
"The Radical Future of R&D: it's a new world of collaboration across corporate and national boundaries", cover story for BusinessWeek, September 7, 2009
All-in-all, then, it was a fascinating and enjoyable and stimulating conference -- part friendly collaborative confab, and part gladiatorial contest. It is one of life's great pleasures, being a journalistic fly-on-the-wall at an event such as this -- many thanks and much-indebted, iSchool (I'm an alum) -- although pretty exhausting, by end-of-day, leaving all who attended and really paid attention pretty tired but with much to think about... which pretty well describes the fundamental attraction of academia...
It has been a good meeting for Google, too -- latest chapter, perhaps, in their ongoing struggle with their leading question, down at the Googleplex: Can a Good Thing be "monetized"?
The US answer being "yes, always" -- the continental European being "no, never" -- the British seemingly straddling the fence, splitting differences, as they do with their currencies-basket, and muddling-through -- and other places mounting various home-brewed amalgams.
Qua for-profit-corporation, Google faces this daily: brainchild famously and typically of two Stanford graduate students with birthplace the garage, following the Silicon Valley paradigm -- and the midwife just as famously, and typically, being John Doerr, venture capitalist to so many successful Silicon Valley startups, who follows a founder around at meetings whispering in his ear, on behalf of shareholders like himself, and me, "monetize! monetize! monetize!"
No better advice ever has been given, to youthful Silly Valley engineers, too many of whom recite pi values with ease but cannot balance their own checkbooks -- too many startup firms have foundered because they simply forgot, or never knew, that a firm needs income.
So, GoogleBooks, and its GoogleBooks Corpus... Globalization, and democratization, of the world's printed texts? Plus a chance at last for preservation of the printed remainder, as the texts they contain are "freed" for Internet access and digital use?
Or merely a bone tossed, to perennially-curious and hungry and gullible academia, part of an ongoing cynical campaign to dominate digital information commercially, not so much democratically as very definitely globally? [See George Russell's cartoon, in the September 8 San Francisco Chronicle, depicting Google as just that, a giant vacuum cleaner scooping up all the world's books and publishers, definitely-evilly proclaiming "Mine! It's all mine!" -- http://www.sfgate.com/columns/]
A little bit of both, maybe... Maybe it takes a little bit of both -- both idealism and realism -- to get a job like this put together, pushed through, and done.
I still am a great fan, I admit. I exult in the performance of my Google shares -- I admire the energy and imagination showing up continuously on the many Google sites which I use online daily -- I visit parks here in San Francisco neighborhoods where the Googlers mostly live, watch them busily telecommuting from lawns and benches and nearby cafés, remind myself how very young all of them appear to be.
And tomorrow morning, thanks to GoogleBooks, I will be reading J-J Rousseau's "Discours sur l'Inégalité", in its 1755 Geneva edition -- on my iPhone, sitting outside in the California sunshine, on just such a park bench -- an exemplaire in the collection of Oxford's excellent Taylor Institution, now easily-compared online to other editions and variants and pirates and so on, and I don't have to travel to England and negotiate a Taylorian card to get at it...
GoogleBooks is a glorious addition to scholarship, and to reading, and to the democratization of globalization, I believe. I hope that it gets to where it's going intact, and that it expands greatly, and I hope that others like Project Gutenberg and europeana.eu and Gallica and Amazon ebooks all will get to wherever they are headed, too. Readers' market, then, and I'm a reader...
They won't get there without a lot of work, though -- Google's, and that of many others. The stakes are high, so they'll all persevere I am sure. The current posturing and irate journal articles and angry lawsuits and EC hearings, and the high-level intellectual wrangling at this excellent UC Berkeley conference, all are important parts of the process.
[bibliography link to follow, here]
[updates, to follow too]
FYI France (sm)(tm) e-journal ISSN 1071-5916
FYI France (sm)(tm) is a monthly electronic journal published since 1992 as a small-scale, personal experiment, in the creation of large-scale "information overload", by Jack Kessler. Any material written by me which appears in FYI France may be copied and used by anyone for any good purpose, so long as, a) they give me credit and show my email address, and, b) it isn't going to make them money: if it is going to make them money, they must get my permission in advance, and share some of the money which they get with me. Use of material written by others requires their permission. FYI France archives may be found at http://email@example.com/ (BIBLIO-FR archive), or http://listserv.uh.edu/archives/pacs-l.html (PACS-L archive), or http://www.lib.berkeley.edu/Collections/FYIFrance/ or http://www.fyifrance.com . Suggestions, reactions, criticisms, praise, and poison-pen letters all gratefully received at firstname.lastname@example.org . Copyright 1992- , by Jack Kessler, all rights reserved except as indicated above.