While the dustup over Bing’s possible appropriation of Google’s long-tail search results is presently occupying the attention of the world of search, I thought I’d take a step back and offer a longer-term historical perspective about an aspect of search that fascinates me: namely the evolution of search algorithms to adopt ever greater amounts of human-generated input into their calculation of relevancy.
Last September, Facebook began including heavily “Liked” items in its search results, and Bing followed suit in December. While the news itself is now a few months old, it got me thinking about how the methods used to determine relevance have changed since the era of web search began. The inclusion of “Likes” as a relevance signal represents another chapter in the evolution of the techniques employed to rank search results.
The arc of relevancy’s story can be traced along one dimension: the amount of human input incorporated into the algorithms that determine the relevance ranking of search results.
Early search engines relied primarily on the words in each page (and some fancy math) to determine a page’s relevance to a query. In this case, there is one human (the author of that particular web page) “involved” in determining the page’s relevance to a search.
When we launched the Excite.com web search engine in October 1995, we had an index that contained a whopping 1.5 million web pages, a number that seemed staggering at the time, though the number of pages Google now indexes is at least five orders of magnitude larger.
Excite’s method for determining which search results were relevant was based entirely upon the words in each web page. We used some fairly sophisticated mathematics to stack-rank documents by their relevancy to a particular search. This method worked fairly well for a time, when searching just a few tens of millions of pages, but as the size of our index grew, the quality of our search results began to suffer. The other first-gen search engines like Lycos, Infoseek and AltaVista suffered similar problems. Too much chaff, not enough wheat.
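To make the word-only approach concrete, here is a toy sketch of the general idea using TF-IDF weighting, the classic technique for scoring documents by the text they contain; the tiny corpus and the scoring function are purely illustrative assumptions, not Excite’s actual formula.

```python
# Minimal sketch of purely word-based ranking (generic TF-IDF, not Excite's
# actual math): each page is scored using nothing but its own words.
import math
from collections import Counter

docs = {
    "page1": "excite web search engine launches web index",
    "page2": "recipes for sourdough bread and pizza dough",
    "page3": "search engine relevance ranking and web pages",
}

tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
doc_freq = Counter()                         # how many pages contain each term
for tokens in tokenized.values():
    doc_freq.update(set(tokens))

def score(query, doc_id):
    """TF-IDF score of one page for a query; only the page's own words matter."""
    tokens = tokenized[doc_id]
    tf = Counter(tokens)
    total = 0.0
    for term in query.split():
        if term in tf:
            idf = math.log(len(docs) / doc_freq[term])   # rarer terms weigh more
            total += (tf[term] / len(tokens)) * idf
    return total

query = "web search relevance"
ranking = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranking)   # pages ordered using only the words on each page
```

The point of the sketch is what it leaves out: no human signal enters the calculation beyond the words the page’s author happened to type.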
Enter Google. Google’s key insight was that the words in a web page weren’t sufficient for determining the relevance of search results. Google’s PageRank algorithm tracked the links in each web page, recognized that each of those links represented a vote for another page, and measured those votes to determine the relevance of search results, doing so dramatically better than cranking out complex math based on the words in the document alone.
Simply put, Google allowed the author of any web page to “like” any other web page simply by linking to it. So instead of a single page’s author being the sole human involved in determining relevancy, all of a sudden everyone authoring web pages got to vote. Overlaying a human filter on top of the basic inverted-index search algorithm created a sea-change in delivering relevant information to the users seeking it. And this insight (coupled with the adoption of pay-per-click advertising) turned Google into the juggernaut it became.
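For the curious, here is a toy power-iteration version of the link-as-vote idea, run over a handful of made-up pages; the link graph, damping factor, and iteration count are illustrative assumptions, not a description of Google’s production system.

```python
# Toy power-iteration PageRank over a hypothetical link graph: each outbound
# link acts as a vote whose weight depends on the voter's own rank.
links = {
    "a": ["b", "c"],   # page "a" links to (votes for) "b" and "c"
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)      # split this page's vote
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

print(pagerank(links))   # "c" collects the most votes, so it ranks highest
```

The pages that many well-regarded pages point to float to the top, which is the whole trick: authors’ links, not just authors’ words, decide what is relevant.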
While Google’s algorithm expanded the universe of humans contributing to the relevancy calculation dramatically from a single author of a single web page to all the authors of web pages, it hadn’t fully democratized the web. Only content publishers (who had the technical resources and know-how) had the means to vote. The 90%+ of users online who were not creating content still had no say in relevancy.
Fast forward several years to the meteoric rise of Facebook. Arguably, Facebook’s rise is largely attributable to the launch of the newsfeed as well as the Facebook API, which opened the floodgates for third-party developers and brought a rich ecosystem of applications and new functionality to Facebook. After reaching well over half a billion users, Facebook unleashed a powerful new feature that may ultimately challenge Google in its ability to deliver relevant data to users: the “Like” button.
With over two million sites having installed the Like button as of September 2010, billions of items on and off Facebook have been Liked. In the early Google era, only those able to author a web page (a relatively small club in the late ’90s) could “like” other pages, by linking to them.
Facebook’s Like button today enfranchises over half a billion people to vote for pages simply by clicking. This reduces the voting/liking barrier rather dramatically and brings the wisdom of the crowd to bear on an unprecedented scale. And beyond simple volume, it enables the “right” people to vote. Having your friends’ votes count juices relevancy to a whole new level.
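As a back-of-the-envelope illustration of why friends’ votes matter more, imagine a scorer that weights a Like from someone in the searcher’s own circle more heavily than a stranger’s; the 3x weight and the sample data below are invented for the example, not any platform’s actual ranking formula.

```python
# Illustrative friend-weighted "Like" scoring: a Like from someone in the
# searcher's own circle counts more than a Like from a stranger.
FRIEND_WEIGHT = 3.0     # arbitrary assumption, not a real platform's weight
STRANGER_WEIGHT = 1.0

def like_score(likers, my_friends):
    """Score a page by who Liked it, boosting Likes from the searcher's friends."""
    return sum(
        FRIEND_WEIGHT if liker in my_friends else STRANGER_WEIGHT
        for liker in likers
    )

my_friends = {"alice", "bob"}
pages = {
    "article-1": {"alice", "carol"},          # one friend, one stranger
    "article-2": {"dave", "erin", "frank"},   # three strangers
}

ranked = sorted(pages, key=lambda p: like_score(pages[p], my_friends), reverse=True)
print(ranked)   # article-1 wins: fewer Likes overall, but one comes from a friend
```

The same crowd-vote machinery is at work as in link counting; the difference is whose hands are on the levers.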
A related behavior to clicking the Like button is content sharing, which is prevalent on both Facebook and Twitter. Social-network “content referral” traffic, in the form of URLs in shares and tweets in users’ newsfeeds, now exceeds Google search as a traffic source for many major sites. Newsfeeds are now on equal footing with SERPs in their importance as a traffic source.
Not only are destination sites seeing link shares become a first-class source of traffic, but users themselves are clearly spending much more time in their Facebook and Twitter newsfeeds than in the search box and on search-results pages. Social networks’ sharing and liking gestures have produced an unexpected emergent property: users’ newsfeeds have become highly personalized content filters that are in some sense crowdsourced, but perhaps more accurately described as “clansourced” or “cohortsourced,” since the crowd doing the sourcing for each user is hand-picked.
Further along the spectrum of human involvement, beyond liking and sharing, lies a more labor-intensive gesture: curation. Human-curated search results (aided, of course, by algorithms) are the premise behind Blekko, a new search engine focused on enhancing search results through curation. Making a dent in Google’s search hegemony is a tall order indeed, but my guess is that if anyone succeeds, it will be through a fundamentally new approach to search, and likely a more people-centric one. And Google certainly faces a challenge as content farms and the like fill up the index with spam that is hard to root out algorithmically. For a cogent description of this problem, just ask Paul Kedrosky about dishwashers and the ouroboros.
One thing seems clear: the web’s ability to deliver relevant content to users relies on increasingly sophisticated algorithms that leverage not only raw computational power but also ever richer forms of feedback from a growing share of the humans creating and consuming digital media online.