New Technology Seeks To Let Startups Build Their Own Googles

Open source search projects like Hadoop, Lucene, and Nutch, combined with affordable, on-demand computing through Amazon Web Services, are putting scalable search infrastructure within the reach of most startups.

One of the first questions online startups typically face these days from potential investors is “Why couldn’t Google build this?” Entrepreneurs are beginning to respond, “Why couldn’t we build Google?”

The slow but steady maturation of open source search projects like Hadoop, Lucene, and Nutch, combined with the availability of affordable, on-demand computing through Amazon Web Services, suggests that scalable search infrastructure is well within the reach of most startups.

Hadoop is a framework for running applications on clusters of commodity hardware. It duplicates the functions of the distributed Google File System and of MapReduce, Google’s programming model for processing large data sets. Lucene is a Java-based search and indexing library. Nutch builds on Lucene by adding Web crawling and additional search capabilities.
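To make the division of labor concrete, here is a minimal sketch of the MapReduce pattern applied to building a search index, written in plain Python on a single machine. It is not Hadoop's or Lucene's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`, `search`) and the sample documents are assumptions made for illustration. In a real cluster the framework would run the map and reduce steps on many machines and handle the grouping between them.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct word in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, doc_ids):
    """Collapse all document IDs for a term into one sorted posting list."""
    return term, sorted(set(doc_ids))

def build_index(docs):
    """Run map, shuffle, and reduce in sequence to build an inverted index."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

def search(index, query):
    """Return the IDs of documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "open source search",
    2: "search infrastructure for startups",
    3: "open source ad platform",
}
index = build_index(docs)
```

With this index, `search(index, "open source")` looks up each term's posting list and intersects them. Because each map call touches only one document and each reduce call touches only one term, both steps can be spread across machines, which is the property that lets this style of indexing scale on commodity hardware.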

These open source search projects are already in use at companies and organizations like Krugle, Powerset, Wikipedia, and Zimbra.

Krugle, a search engine for programmers that helps users find code and technical information online, is built on Nutch and Lucene. “It would have been impossible for us to create the capability that we have and go live in the speed that we did without Nutch and Lucene,” says Krugle CEO Steve Larsen. “They were extremely important to us being able to solve the technical problems that we did in a short amount of time.”

Access to the code was also important, says CTO Ken Krugler, “so we had the flexibility for the things that we needed for a vertical solution. The commercial solutions are much more restrictive. It’s harder to tweak it and form it to what you need.”

Krugle maintains about 100 servers at a colocation facility. Krugler says Amazon’s Elastic Compute Cloud (EC2) looks promising, but he sees it as better suited to companies that are just getting started. EC2 is simply virtual processing power that can be paid for as needed.

“It scales better than doing a co-host setup,” says Krugler, though he still considers it too new to rely on. “Technically it ought to scale, but you just don’t know.”

Search startup Powerset is using EC2 to power its forthcoming natural language search site, apparently without any such reservations.

In announcing Powerset’s use of EC2 at the Web 2.0 Summit earlier this month, founder and CEO Barney Pell said his company’s use of Amazon’s technology “represents an important shift in the competitive dynamics within the search industry” because Powerset doesn’t have to put up the capital “to build out a datacenter big enough to scour the entire Web and serve queries for millions of users” in order to compete with Google and Yahoo.

Pell neglected to mention that his company is also using Hadoop to cache search results before storing them to its local network. In an e-mail sent to the Hadoop developer mailing list, Powerset CTO Lorenzo Thione describes how Hadoop and EC2 can be used in a fault-tolerant search system. “A nice feature of Hadoop as measured against our use of EC2 has been the capability of fluidly changing the number of instances that are part of the cluster,” wrote Thione. “Our instances are set up to join the cluster and the [Hadoop Distributed File System] as soon as they are activated and when — for any reason — we lose those machines, the overall process doesn’t suffer.”
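The behavior Thione describes, where machines join a cluster when activated and the job survives when they disappear, can be illustrated with a toy model. The sketch below is not Hadoop code and does not reflect Powerset's actual setup; the `ElasticCluster` class, its methods, and the worker and task names are all hypothetical. It shows only the core idea: work held by a lost machine goes back into the queue for another machine to pick up.

```python
from collections import deque

class ElasticCluster:
    """Toy model of a job that tolerates workers joining and leaving."""

    def __init__(self, tasks):
        self.pending = deque(tasks)  # tasks not yet assigned
        self.in_flight = {}          # worker name -> current task, or None if idle
        self.done = []               # tasks completed, in completion order

    def join(self, worker):
        """A newly activated instance registers itself as idle."""
        self.in_flight.setdefault(worker, None)

    def lose(self, worker):
        """A worker vanishes; any task it held returns to the front of the queue."""
        task = self.in_flight.pop(worker, None)
        if task is not None:
            self.pending.appendleft(task)

    def assign(self):
        """Hand one pending task to each idle worker."""
        for worker, task in self.in_flight.items():
            if task is None and self.pending:
                self.in_flight[worker] = self.pending.popleft()

    def complete(self, worker):
        """Record the worker's current task as finished and mark it idle."""
        task = self.in_flight.get(worker)
        if task is not None:
            self.done.append(task)
            self.in_flight[worker] = None

# Two workers start three tasks, then one worker is lost mid-task.
cluster = ElasticCluster(["crawl", "parse", "index"])
cluster.join("w1")
cluster.join("w2")
cluster.assign()
cluster.lose("w2")          # w2's task is re-queued, not dropped
while cluster.pending or cluster.in_flight["w1"] is not None:
    cluster.complete("w1")
    cluster.assign()
```

Even though `w2` disappears while holding a task, every task ends up in `cluster.done`. A production system would also need to detect failures (for example through missed heartbeats) and replicate data, which this sketch leaves out.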

Of course, there’s a lot more to Google than search infrastructure. Even if rivals reach some measure of technological parity, Google will still have a formidable user base and brand, barring some AOL-style data disaster. And that’s to say nothing of making search work as a business. At the moment, there’s no open source ad platform to rival what Google, Microsoft, and Yahoo have built, not to mention Amazon and eBay.

But as open source projects get used more frequently in commercially successful products, the companies using that software drive its development. Krugle’s Larsen says his company has helped drive the development of Nutch and notes that Yahoo continues to push Hadoop forward. That kind of work will end up giving future startups even more of a leg up.
