4 billion daily page views (up from about 65 million in 1996). Integration is a nightmare across the different “properties” at Yahoo!, especially because of the acquisitions (like del.icio.us), but also with partners and within Yahoo! over time: the infrastructure deployed in 1998 needs to be integrated with the stuff running today. Yahoo! scale is another problem: a link on the Yahoo! home page will trigger their own version of the Slashdot effect.
Mark is part of the Media Group, which handles things related to content. Each media property was reinventing the wheel all of the time, until some recognized it made sense to reuse components. Initial architecture: independent frontend boxes, including databases, with a master database in the backend. Problems: large datasets don’t push well. Adding capacity is expensive: if the News property needs to be extended by 50 machines and the data has to be pushed to the new frontend machines, this takes too long. Another problem: what to do after a push failure. Another one: user-generated content. Then there is cross-property integration, e.g. Yahoo! Tech, which integrates products with answers from Yahoo! Answers. Another new property: Yahoo! Pipes. The old architecture had severe problems here: one frontend box ends up requesting data from another frontend box.
Requirements for the new architecture: massive scalability, flexible deployment, highly dynamic, separation of concerns. Result: a move towards a service-oriented architecture. PHP is much more common at Yahoo! than Java. With scalability, simplicity, reuse and interoperability in mind, the decision: use HTTP, not WS-*.
Frontend boxes make requests through a cache to backend API servers, all via HTTP. Attributes: single source of truth. The cache replicates data on demand (a pull model instead of the old push model). Question from the audience: is this REST? Mark doesn’t want to talk about REST because he views this as a philosophical discussion, but yes, they are RESTful. User-generated content (UGC) pushes through. Adding capacity is easy. HTTP-based backend.
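To make the pull model a bit more concrete, here is a minimal sketch in Python. It is not Yahoo!'s setup (they use a real HTTP cache); the class `PullThroughCache`, the `fetch` callback standing in for the backend API server, and the fixed `ttl_seconds` freshness lifetime are all assumptions made up for illustration. The point is only that a frontend asks the cache, and the cache pulls from the single source of truth when it has no fresh copy.

```python
import time
from typing import Callable, Dict, Tuple


class PullThroughCache:
    """Hypothetical in-memory stand-in for a caching HTTP intermediary."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # assumed: performs the request to the backend API server
        self._ttl = ttl_seconds    # assumed: fixed freshness lifetime for every entry
        self._clock = clock        # injectable so the sketch can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            # Fresh copy available: the backend is not contacted at all.
            self.hits += 1
            return entry[1]
        # Missing or expired: pull from the backend and remember the result.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body
```

Adding a frontend box in this model means pointing one more client at the cache; nothing has to be pushed to the new machine first. The `hits` and `misses` counters hint at why the cache is also a natural place to collect metrics.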
One example of the value of using HTTP, instead of doing a big REST pitch: caching intermediaries. Advantages: freshness and validation (asking “has this changed?” is a quick ETag-based question to the backend). Precalculated results can be validated against the ETag of the calculation input. Next advantage: metrics from the cache. Mark shows the output from an internal tool, with histograms of cache hits and misses. (Based on Squid?) More benefits: load balancing and cache peering; multi-GB memory caches are not at all uncommon, and there are 3-4 common cache peering protocols. Negative caching: if there’s an error out of the API server, the cache will cache the error, reducing the load on the backend. Collapsed forwarding: multiple requests from the frontend for the same resource can be collapsed into a single backend request, another great way to mitigate traffic from the frontend. Stale-while-revalidate: while the cache is refreshing something from the backend, it can return a stale copy. Stale-if-error: if there’s a problem on the backend box, the cache can serve a stale copy, too. Invalidation channel: an out-of-band mechanism to tell the cache something has become stale (an arrow from the backend to the cache).
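Two of these ideas, ETag-based validation and stale-if-error, can be sketched in a few lines of Python. This is a toy simulation, not how Squid or Yahoo!'s caches are implemented: the `Backend` and `ValidatingCache` classes, the way the ETag is derived from a hash of the body, and the simplification that the cache revalidates on every request are all assumptions for illustration.

```python
import hashlib
from typing import Dict, Optional, Tuple

Response = Tuple[int, Optional[str], Optional[str]]  # (status, etag, body)


class Backend:
    """Hypothetical API server that answers conditional requests."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.healthy = True
        self.full_responses = 0  # how often a complete body had to be sent

    def get(self, url: str, if_none_match: Optional[str] = None) -> Response:
        if not self.healthy:
            return 503, None, None
        body = self.data[url]
        # Assumption: the ETag is simply a hash of the current representation.
        etag = '"' + hashlib.sha256(body.encode()).hexdigest()[:16] + '"'
        if if_none_match == etag:
            return 304, etag, None  # "has this changed?" -> no, nothing to resend
        self.full_responses += 1
        return 200, etag, body


class ValidatingCache:
    """Hypothetical cache that revalidates every request (a simplification)."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend
        self._store: Dict[str, Tuple[str, str]] = {}  # url -> (etag, body)

    def get(self, url: str) -> str:
        cached = self._store.get(url)
        status, etag, body = self._backend.get(url, cached[0] if cached else None)
        if status == 304 and cached:
            return cached[1]          # validated: reuse the stored copy
        if status == 200 and etag is not None and body is not None:
            self._store[url] = (etag, body)
            return body
        if cached:
            return cached[1]          # stale-if-error: backend failed, serve old copy
        raise RuntimeError("backend error and nothing cached for " + url)
```

A real cache would only revalidate once its copy is no longer fresh; the sketch skips freshness tracking to keep the two behaviours visible.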
Statistics: 16,000 requests per second per core when using caching. HTTP is massively scalable, to 10,000 parallel connections. Squid is being used. Caching is a commodity: right now they’re using Squid, but they could easily take it out and replace it with something else.
Pitfalls: REST vs. WS-* wars (Mark wants to stay out of these things), theory vs. practice, human-intuitive, but not programmer-intuitive (very hard to explain REST to programmers, easy to describe it to his wife), different deployment/operational concerns (people don’t really know how to handle this — they do for single applications, but that’s not very useful), formats are hard (just like in the WS-* world), format/interface proliferation (if you give developers a new protocol construction toolkit, they’ll build protocols), authentication isn’t there yet (HTTP authentication mechanisms are unbelievably primitive), tools have a way to go (tools optimized for the browsing case, not the service case; same true for intermediaries).
What’s needed: tools, web description language (Mark likes WADL), data-oriented schema language (instead of something that describes markup), investment in the Atom stack (80% case use Atom/RSS to mitigate interface and format proliferation), HTTP test suite (should be standardized).
Unfortunately, my laptop’s battery was empty at this point in time — too bad, since there were a lot of good questions (and answers). Some that I can remember: Someone at Yahoo! Sports (who are always on the bleeding edge) seems to be doing some transactional stuff over HTTP. It would be great if the caching intermediaries actually could use the guarantees about PUT and DELETE and the fact that both they and POST invalidate a cached resource and take advantage — they don’t (this includes Squid). I tried to tempt Mark to admit that they are actually doing REST by asking about the way they build their URIs; his answer was that they don’t care about how the URIs are designed (which is fine with me), but obviously have to make sure they refer to something cacheable, i.e. “Bob’s preferences” instead of just “Preferences”.
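The point about PUT, DELETE and POST invalidating a cached resource can be illustrated with a small Python sketch. This is what a cache could do, not what Squid did at the time; the `InvalidatingCache` class and the `origin` callback are hypothetical names, and the per-user URI in the usage note just mirrors the “Bob’s preferences” example.

```python
from typing import Callable, Dict, Optional

UNSAFE_METHODS = {"POST", "PUT", "DELETE"}


class InvalidatingCache:
    """Hypothetical cache that drops a stored response after an unsafe request."""

    def __init__(self, origin: Callable[[str, str, Optional[str]], str]) -> None:
        self._origin = origin              # assumed: forwards the request to the backend
        self._store: Dict[str, str] = {}   # url -> cached GET response body

    def request(self, method: str, url: str, body: Optional[str] = None) -> str:
        if method == "GET":
            if url in self._store:
                return self._store[url]    # served from cache
            response = self._origin(method, url, body)
            self._store[url] = response
            return response
        # Anything else is passed straight through to the backend.
        response = self._origin(method, url, body)
        if method in UNSAFE_METHODS:
            # The resource may have changed, so the stored copy is discarded.
            self._store.pop(url, None)
        return response
```

With a URI such as `/preferences/bob`, a PUT by Bob invalidates only his own cached entry; a shared `/preferences` URI whose content depends on who is asking would not be safely cacheable in the first place.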