Libraries, Data, and Collectionism

So I’ve been thinking about a way to describe the enthusiasm about big data that has captured the business, government, and academic worlds over the past decade. I will admit up front that the rapid advances in deep learning technology that large businesses like Google, Microsoft, Facebook, and Amazon have accomplished in the past few years should be celebrated. The efforts by these companies, and the fact that they have open-sourced many of the artificial intelligence frameworks they are creating, should be applauded. So as a computer scientist manqué I salute them and all the others back to the early days of MapReduce. Good job.

But as a soi-disant philosopher-social scientist who spends most of his time working with librarians and other academic knowledge producers, I find that the drive toward big data has more than a few queasy aspects. The ethical problems of surveillance, whether by government, business, or researchers, are fraught issues that need to be engaged thoughtfully. Nor is the view, shared by much of Silicon Valley, that technology will solve our problems at all likely to prove true. Evgeny Morozov is mostly correct when he complains about solutionism in the pronouncements of our mighty technology overlords.

What I’m particularly intrigued by is a default attitude I’ve encountered a handful of times in the library and research data management world: the attitude that keeping, preserving, and collecting data is an unalloyed good. I’m starting to think of this as a form of collectionism, a bias toward collecting whatever one can regardless of any immediate need or use for the data.

An example from my own experience may clarify, or muddy, the waters on this issue.

I’ve spent the past two years working on web archiving. Archiving the web is a massive challenge; there’s just so much of it. Researchers, such as the digital library groups at Old Dominion and Virginia Tech, have been doing incredible work on the technical challenge of harvesting data from social media and the World Wide Web. Non-profits, like the Internet Archive, have been bootstrapping their own data preservation systems since the early days of the web. National libraries, such as the British Library and many others, have been collecting online artifacts because of legal requirements, e.g. national deposit laws, and to preserve national heritage. Without all of this work we would know even less than we do about recent cultural history. Whoever ends up writing the history of our current era will depend heavily on the often shoestring efforts of these people to preserve an ephemeral medium.

The University of Alberta libraries have been helping out in this effort by building their own collections based on their judgment of material that is important now or may be in the future. These collections have been built using Archive-It, a subscription service provided by the Internet Archive, and the library is now wrestling with the question of how to preserve these files. On the one hand, the library wants to copy the data into its own preservation system for the technical reason that more copies are better, and for the legal reason that ownership of the data is potentially valuable intellectual property. On the other hand, the amount of data is probably bigger than the library can accommodate given its current storage capacity. That capacity is growing, so tomorrow this limit may not be so salient.

An alternative would be to preserve only a subset of the data with the library, say just the full text of the web pages or the link graph between pages in a collection. This data would be much smaller in size and is likely to be much more interesting to researchers than the “raw” data. But saving a subset of the data is contrary to one of the principles of digital preservation and archiving, namely, keep the original data if at all possible. Having the original data would enable any future researcher to recreate the derived data and perform additional analyses which current users may not envision.

I see the logic of keeping the original data if possible, but I’m even more interested in the effect of this logic. At an extreme this makes the library into a data warehouse, admittedly with better standards for metadata than the typical researcher’s, but basically a commodity service. If the library is just another cloud-based storage facility then there is little reason for a researcher to choose the library over a commercial service such as Dropbox. A commercial service is likely to be easier to use even if its long-term viability is impossible to predict. Researchers are driven by their immediate needs, and often ease of use today outweighs unknown and unforeseeable uses tomorrow. Likewise for the library, the preference for original data makes the immediate goal of the library easy to understand - just provide enough storage - but it also leaves the library with not much else to do in the research process.