Loving the Alien: Machine Learning and Publishing

I’m relaunching this blog with the publication of my opinion piece “Loving the Alien: Machine Learning and Publishing” on the Digital Book World site. (Fellow music junkies will recognize the nod to David Bowie.)

It’s impossible to escape machine learning; the technology is quickly becoming pervasive. According to Google, more than 700,000 articles that mention “machine learning” have already been published in 2016. So far this week, Microsoft announced the integration of new machine-learning-driven research and editing technologies in Microsoft Office, and Google announced “Smart Bidding,” a new machine-learning-driven pricing system for AdWords and DoubleClick Search. I’ve also read an article on how scientists are using machine learning to track Ebola-carrying fruit bats and another on how machine learning is being used to dramatically enhance mapping of the human brain, more than doubling the number of distinctly identified areas. This feels like one of those “Oh wow!” moments that come along every ten years or so, when a group of closely related technologies all reach viability at the same time, enabling a big idea to finally fly from the nest. This is fun!

(March 8, 2017: F+W stopped updating the DBW site in late January. I've added the text of the opinion piece mentioned above in case the DBW site disappears.)

Loving the Alien: Machine Learning and Publishing

By: Cliff Guren | July 28, 2016

Over the past few weeks, Mike Shatzkin, Neil Balthaser and Ali Albazaz have debated whether machine learning systems will be able to predict a bestseller (Mike’s initial blog post is “Full text examination by computer is very unlikely to predict bestsellers”; Neil’s response is “Yes, Machine Learning Can Help Predict a Bestseller”; and Ali’s article is “Artificial intelligence and the art of reader-driven publishing”). It’s been an interesting exchange about the value provided by today’s publishing organizations and the future of predictive analysis. That said, the growing deployment of machine learning systems raises larger questions for publishers that must be addressed soon, before publishers lose control over their intellectual property.

Publishing is a technology-driven business, a byproduct of the printing press, that has evolved in lockstep with advances in print and distribution technologies. While ebooks have grabbed most of the attention over the last decade, search has been and will continue to be the most significant technology driving publishing. The evolution of search, in turn, is being driven by innovations in machine-learning-based discovery and recommendation.

Our concept of “knowledge” is in large part derived from our ability to classify information. The Dewey Decimal Classification system introduced the notion of relative location. Before this system’s introduction, books were stored on library shelves in the order in which they were acquired. The Dewey Decimal system made it easy to browse the shelves, discover new books in a given subject area, and form connections that would not have been visible while browsing a chronologically organized collection.

A similar revolution in how content is organized and accessed is underway in the world of computing. To date, our access to content online has (for the most part) been dictated by taxonomies and tags created by humans: BISAC codes, keyword systems and user-generated tags. These ways of identifying content look backward. We file a book based on how it fits into our historically derived categorization system. In this way, we are using an essentially static (or “solid”) form of classification in a world where knowledge is increasingly dynamic (or “liquid”).

Machine learning systems combine recent advances in computing platform technologies such as networking, data storage, and processing, with advances in fields such as computational statistics, natural language processing, and sentiment analysis (to name just a few of the related areas of relevant research). These systems now have access to huge collections of content (“big data”) that go well beyond what any one person or team can process in a lifetime. They have the ability to analyze content, dynamically derive tags and keywords, and discover conceptual relationships between content elements in the data collection that may not have been evident when the source content was first published. In addition, the quality of the computer-generated results improves over time. Put another way, these systems now have the ability to learn.
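To make "dynamically derive tags and keywords" a little more concrete, here is a minimal sketch in Python. It is not how any particular vendor's system works; it uses TF-IDF weighting and cosine similarity, two long-established text-analysis techniques, as a stand-in. The three-book catalog, its blurbs, and every function name are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical three-book catalog; titles and blurbs are invented.
CATALOG = {
    "Bat Trackers": "scientists track fruit bats to study how ebola spreads",
    "Mapping the Mind": "researchers map the human brain and identify new areas",
    "Virus Hunters": "scientists study how ebola and other viruses spread",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Weight each word by how frequent it is in one text and how rare across all texts."""
    tokens = {title: tokenize(text) for title, text in docs.items()}
    doc_freq = Counter(word for words in tokens.values() for word in set(words))
    total_docs = len(docs)
    vectors = {}
    for title, words in tokens.items():
        counts = Counter(words)
        vectors[title] = {
            word: (count / len(words)) * math.log(total_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return vectors


def top_keywords(vector, k=3):
    """Return the k highest-weighted words, breaking ties alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(title, vectors):
    """Find the other title whose weighted vocabulary overlaps most with this one."""
    others = (other for other in vectors if other != title)
    return max(others, key=lambda other: cosine(vectors[title], vectors[other]))


if __name__ == "__main__":
    vectors = tfidf_vectors(CATALOG)
    for title, vector in vectors.items():
        print(title, "| keywords:", top_keywords(vector),
              "| closest:", most_similar(title, vectors))
```

Nobody tagged these books by hand: the keywords and the "closest title" link fall out of the text itself, and both would shift as new titles join the catalog. Production systems are far more sophisticated, but the basic move is the same.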

Once again, publishing is a technology-driven business. Digital technologies—including e-commerce, ebooks and audiobooks—have created new business opportunities. Machine learning is no different. Computers that can learn can help sustain publishing. Enhanced discovery and recommendation engines can help sustain a diversified retail ecosystem by democratizing access, giving independent retailers (and publishers who sell direct) the ability to personalize recommendations. Computers that can learn can also expose relevant content that’s not easily found using the algorithms currently employed by the industry giants.

The major players in software development, search and retail, including Google, Microsoft, Apple, IBM and Amazon, along with a large number of start-ups, are investing in machine learning. These companies will be knocking on the doors of publishers to ask for access to their assets. They will try to sell content owners on the value of enhanced discovery—and some of them will deliver. However, there are also long-term implications of the growth of machine learning that require careful consideration.

The same systems that deliver book recommendations today will be able to deliver highly targeted answers to specific queries tomorrow. These systems will quickly evolve from recommending books, to delivering excerpts, to delivering machine-authored responses that synthesize information from a wide variety of sources.

Publishers need to understand the applications of machine learning and how these systems may evolve. They need to determine whether they can afford to let others decide the direction these systems take, or make their own investments (in-house or through partnerships) that give them more of a say. They need to determine the market value of helping teach computers to author content and determine how their own authors will be compensated for these new uses of their work.

For more than 500 years, the publishing community has taken advantage of advances in technology to enhance its ability to produce and disseminate information, knowledge and creative content. Machine learning has the potential to drive a significant expansion of our notion of publishing. While the field is in its infancy, it’s growing up quickly. Publishers need to understand how machine learning is transforming content discovery, how to effectively evaluate the quickly evolving range of partners and platforms, and how to craft deals that appropriately reward them for this new use of their intellectual property.