The semantic web – the future of search or a dead end?

The other night I heard Rick Rashid, head of Microsoft Research worldwide, speak on the future of technology. His talk was so good that I’m going to try to persuade him to speak at Business of Software 2008 (only 24 hours until the early bird discount expires by the way, so book now).

A couple of the highlights of Rick’s talk were:

  • His demonstration of Photosynth. This app takes thousands of photos of an object or a scene and then stitches them together to produce a three-dimensional view that you can fly around and zoom in to. They can’t do it yet, but one day you’ll be able to upload your own photos and construct your own 3D model. You’ll also be able to take a photo of an object – the Seattle Space Needle, for example – and the software will recognise where you are and tell you more about what you’ve photographed. For now, there are a bunch of 3D scenes you can look at, including St Mark’s Square in Venice and the space shuttle Endeavour.
  • The World Wide Telescope. Microsoft have constructed a digital map of the sky from terabytes of data from the Hubble Space Telescope, the Sloan Digital Sky Survey and many other sources. You can use this virtual telescope to explore the heavens, panning and zooming across objects thousands of light years away.

The best part, for me, was Rick’s answer to Hermann Hauser’s question about the future of the semantic web. Rick claimed little knowledge on the topic, but still managed to talk eloquently for several minutes. He said (I paraphrase, and any errors are mine) that the semantic web reminded him of research into natural language processing. For several decades, researchers have tried to work out ever-more complex grammars and rules to understand human language, but most researchers – the ones that are making progress, anyway – have abandoned this avenue and are focussing on statistical machine learning. This essentially involves dumping terabytes of data into complex algorithms and then using the results. Nobody understands the detailed internal connections that the models make, but the outputs seem promising.
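To make the contrast concrete, here’s a toy sketch of what the statistical approach looks like in miniature: instead of hand-written grammar rules, you count what actually follows what in a corpus and let the counts make predictions. The corpus, function names and scale here are entirely my own illustration – real systems use vastly more data and far more sophisticated models.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count word-pair frequencies -- no grammar rules, just data."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the word most often seen after `word`, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

# A toy "corpus" standing in for the terabytes Rick describes.
corpus = [
    "the telescope scans the sky",
    "the telescope maps the heavens",
    "researchers feed the model data",
]
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # the word most often following "the" in the corpus
```

No one wrote a rule saying what can follow “the”; the prediction falls out of the counts, and it improves simply by adding more data.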

Similarly, the semantic web relies on humans defining schemas for different objects. For example, Freebase has volunteers trawling Wikipedia’s unstructured data and structuring it, turning the free text of film stars’ biographies into structured tables of names, dates of birth and film titles [UPDATED: Freebase uses statistical methods as well as the community]. The problem with this approach is that the schemas, and the links between them, are man-made. Rick Rashid’s point is that this just gives us another set of flawed, human-made data, only now wrapped in a data structure. We may well find that computational, statistical models cope much better with understanding data than any fixed structure that a human can come up with.
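To show why the hand-made schema is brittle, here’s a toy sketch of schema-driven extraction: a human-written pattern that pulls structured fields out of a free-text biography line. The pattern, field names and example text are my own invention, not Freebase’s actual pipeline – the point is that anything not matching the template is simply lost.

```python
import re

# A hand-made "schema": name, birth year, one film title.
# It only works for text that happens to fit the template --
# exactly the brittleness Rick is pointing at.
BIO_PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) \(born (?P<born>\d{4})\)"
    r" starred in (?P<film>[^.]+)\."
)

def extract_record(text):
    """Return a structured dict if the text matches the schema, else None."""
    match = BIO_PATTERN.search(text)
    return match.groupdict() if match else None

print(extract_record("Grace Kelly (born 1929) starred in Rear Window."))
# {'name': 'Grace Kelly', 'born': '1929', 'film': 'Rear Window'}
print(extract_record("Her best-known work defies any simple template."))
# None
```

Every biography phrased even slightly differently falls through the schema, which is why a statistical model trained on all the text, messy phrasing included, may ultimately cope better.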

Altogether it was an excellent talk: a strong mix of fantastic content and good presentation. Sign up to my RSS feed and I’ll let you know if I persuade Rick to speak in Boston.