The Problem with the Internet Archive Book Library

The Internet Archive’s massive library of tens of millions of free scanned books was built over many years with thousands of hours of work by volunteers. It has incredible potential as a free library that anyone can access from a PC or phone. Why do the organization’s own policies and procedures cripple its usefulness?

The Internet Archive (IA) Text Library offers an extraordinary collection of books and documents, with tens of millions of items freely available to the public. Just as impressive is the steady work of volunteers who have improved the site’s interface over time. Given all this, it is difficult to understand why IA’s own management policies are allowed to undermine the usefulness of this collection as a serious research tool.

A single example illustrates the problem.

Using the basic search tool, enter subject:(china – history) and select “Books/Documents.” The result is a collection of more than 12,000 items. This should be a tremendous resource. However, to make sense of it, a user must rely on the “Subject” filter in the left column—and this is where the system breaks down.

Click “More…” at the bottom of the Subject filter. You are presented with 91 pages and over 3,000 subject entries. A reasonable expectation would be that sorting by “Count” would highlight the topics reflected in the most books, and therefore of most interest. In practice, it does not.

Instead, the top results are filled with subjects that have little or no connection to “China—History”. Why do entries such as “Carl Schmitt” or “Vita Activa” appear prominently, each attached to roughly 100 books, while clearly relevant topics like “Foreign Relations” or “Cultural Revolution” are not equally visible? Skim through the subjects on the first 30+ pages, and you find more of the same: subjects that are irrelevant.

If you select “Carl Schmitt” and apply the filter, the reason becomes clear. The results are not organically related works, but rather a collection uploaded by a single archive.org contributor. Open any of these books and you will find hundreds of subject tags attached to a single, usually short, item. In this case, the contributor seems to have copied the same huge group of irrelevant subject tags to each of the roughly 100 documents they submitted.

These excessive and indiscriminate subject tags are polluting the entire IA subject system. Returning to the “China—History” list and sampling other entries reveals similar patterns: individual contributors assigning dozens or even hundreds of subjects to each book. This is not how subject classification is meant to function, and it severely degrades the value of the catalog.
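
The effect on a count-sorted subject list can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the sample records, the uploader names, the helper functions subject_counts and drop_overtagged, and the threshold of 20 subjects per item are all hypothetical, and none of it reflects how IA actually stores or ranks its metadata.

```python
from collections import Counter

# Hypothetical sample records: identifiers, uploaders, and subjects are
# invented for illustration and are not real archive.org metadata.
SAMPLE_ITEMS = [
    {"identifier": "book-a", "uploader": "library-1",
     "subjects": ["China -- History", "Foreign Relations"]},
    {"identifier": "book-b", "uploader": "library-2",
     "subjects": ["China -- History", "Cultural Revolution"]},
    {"identifier": "book-c", "uploader": "library-1",
     "subjects": ["China -- History", "Foreign Relations"]},
]

# One contributor pastes the same long tag list onto every upload.
BULK_TAGS = ["China -- History", "Carl Schmitt", "Vita Activa"] + [
    f"Unrelated topic {n}" for n in range(50)
]
SAMPLE_ITEMS += [
    {"identifier": f"bulk-{n}", "uploader": "contributor-x",
     "subjects": list(BULK_TAGS)}
    for n in range(5)
]


def subject_counts(items):
    """Count how many items carry each subject (what a "Count" sort ranks by)."""
    counts = Counter()
    for item in items:
        # set() so a subject repeated within one item is counted once.
        counts.update(set(item["subjects"]))
    return counts


def drop_overtagged(items, max_subjects=20):
    """Keep only items at or below an assumed subjects-per-item threshold."""
    return [item for item in items if len(item["subjects"]) <= max_subjects]


raw = subject_counts(SAMPLE_ITEMS)
cleaned = subject_counts(drop_overtagged(SAMPLE_ITEMS))

# Before filtering, the pasted tag outranks a genuinely relevant subject.
print(raw["Carl Schmitt"], raw["Foreign Relations"])          # 5 2
# After filtering, the pasted tag disappears from the counts.
print(cleaned["Carl Schmitt"], cleaned["Foreign Relations"])  # 0 2
```

In this toy data, five bulk-tagged uploads are enough to push an irrelevant subject above a relevant one. A simple cap on subjects per item is only one possible heuristic; it would also exclude any legitimately well-tagged items that exceed the threshold.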

The root issue appears to be a lack of oversight at IA. Individuals are allowed to upload materials with their own metadata, including subject headings, without meaningful quality control. Evidence suggests this has been occurring for over a decade—since at least 2014—and it continues today.

The result is deeply unfortunate. A remarkable collection of millions of books is being compromised by a search system that is badly broken. A huge number of valuable books will not be found through subject searches, because the search tool has been rendered practically unusable.

The Internet Archive has built something invaluable. If it fixed this problem with its search tool, it could fully realize its potential as a world-class research library.