
Notes from Werner Vogels's Keynote

Stefan Tilkov,

In what has become a tradition when I'm in the mood, these are my unedited notes from Werner Vogels's keynote talk "Web-scale Computing: Compete on Ideas, not Resources" at IIR's (German) Web 2.0/SOA/EAM conference in Wiesbaden.

  • everybody in the audience except two guys is an Amazon customer
  • when you put something in your shopping cart, you don't want to care about the technical details
  • now: take off your Amazon customer hat, think of Amazon as a technology provider
  • shows example - subscription model for toilet paper!
  • "buy box" (the blue area) shows the best product for the customer - even if it's not sold by Amazon.com
  • being a platform provider means you have to be absolutely neutral
  • many other examples of websites powered by Amazon.com
  • some statistical data - 80M customers, 1.3M active resellers ...
  • retail, ecommerce (associates), infrastructure ws, enterprise customers
  • shows Amazon.com from 1995 - key idea back then: do something on the Web that you couldn't do otherwise (have all of the world's books in stock)
  • history: app server/database (1995-2001) --> service orientation --> massively scalable services
  • for one year, Amazon ran a mainframe DB
  • in 2001, the Web servers hit a performance/scalability wall
  • 2001-2004: services
  • now, everything is massive scale
  • the secret sauce of Amazon.com: not its recommendations, but its capability to do anything at scale
  • 1st step: modularization - co-locate data and the logic depending on it, no direct DB access anymore
  • now: ~1000 different services
  • a page will hit 250-350 services - even single lines, such as "sales rank", call a service
  • large services at the bottom (customer, product, offer) serve as indices to additional services
  • each service has a small team associated with it, responsible for building and running it - no separate operations dept
  • no better motivation to fix a bug than your beeper going off at 4 in the morning
  • software as bits as opposed to software as a service
  • one bug/one fix approach
  • the whole saas thing is a big lie! there's a big elephant in the room nobody's talking about
  • reason: between test and operate, you need to handle all the non-functionals - load balancing, scaling, utilization, ...
  • vendors have no idea how to handle these things b/c traditionally, the customers did it
  • most of the engineers' time was spent configuring routers and managing load balancers - 70% of their time went to undifferentiated heavy lifting
  • example: picture of AT&T data center built near a trailer park - which of course was destroyed by a tornado
  • 365 Main in downtown SF runs 8 generators in their data center - three months ago 6 of 8 generators failed despite being tested --> most of Web 2.0 offline
  • Google study: 10% of disks will fail per year - w/ 80000 disks in a typical data center means 8000 disks fail per year -> you'll have people employed who only change disks
  • graph of target.com and walmart.com -> holiday peaks 2-3 times the rest of the year's average
  • lessons learned - offer 1000 Wiis, 100000 people will show up
  • pitch for 37signals' "Getting Real"
  • Amazon.com web services: s3, sqs, ec2, simpledb, fps
  • was used internally for 2 years before it was offered externally
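
The disk failure figures above are easy to turn into a daily workload. The following is a minimal back-of-the-envelope sketch: the 10% failure rate and the 80000 disks come from the notes, while the 15 minutes per disk swap is purely an assumption for illustration.

```python
# Back-of-the-envelope arithmetic for the disk failure figures in the notes.
# The 10% annual failure rate and 80,000 disks come from the talk notes;
# the 15-minute swap time is a made-up assumption for illustration.

def expected_disk_failures(disks: int, annual_failure_rate: float) -> float:
    """Expected number of failed disks per year."""
    return disks * annual_failure_rate

per_year = expected_disk_failures(80_000, 0.10)
per_day = per_year / 365
swap_hours_per_day = per_day * 15 / 60  # assumed 15 minutes per swap

print(f"{per_year:.0f} failures per year, about {per_day:.1f} per day")
print(f"roughly {swap_hours_per_day:.1f} hours of disk swapping per day")
```

Under these assumptions, somebody is replacing around 22 disks every single day, which is why the notes mention people employed only to change disks.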

Scalability

  • growth by good customer experience -> traffic -> sellers -> selection -> lower prices -> customer experience
  • incremental scalability is key
  • being able to grow systems one step at a time
  • infrastructure needs to move from capital investment to variable cost
  • elastic: capable of growing and shrinking on demand
  • minimal disruption to customer performance
  • addresses: different growth paths, fault-tolerance, heterogeneity, operational efficiency
  • you can't assume your infrastructure is homogeneous

Availability

  • everything fails, all the time
  • somebody cuts a cable in the Suez canal - the rest of the world thinks India is gone
  • failures are highly correlated
  • things fail in groups
  • things don't fail by stopping - instead, systems fail by sending out large amounts of garbage
  • a load balancer sending to a machine returning very fast responses -- all 500s
  • let go of control - take a probabilistic approach: determinism doesn't exist in real life
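
The load balancer example above can be sketched in a few lines. This is not how Amazon's load balancers work; it is a hypothetical illustration (the `Backend` class, the function names, and the 50% error threshold are all assumptions) of why picking the fastest machine is dangerous when a broken machine answers very quickly with nothing but 500s.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A hypothetical backend with a measured latency and response counters."""
    name: str
    avg_latency_ms: float
    requests: int = 0
    errors: int = 0

    def record(self, status: int) -> None:
        """Count a response; any 5xx status counts as an error."""
        self.requests += 1
        if status >= 500:
            self.errors += 1

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def pick_fastest(backends):
    """Naive strategy: route to whichever backend answers fastest."""
    return min(backends, key=lambda b: b.avg_latency_ms)


def pick_healthy_fastest(backends, max_error_rate=0.5):
    """Ignore backends above the (assumed) error threshold, then pick the fastest.

    Falls back to the full pool if every backend looks unhealthy.
    """
    healthy = [b for b in backends if b.error_rate <= max_error_rate]
    return min(healthy or backends, key=lambda b: b.avg_latency_ms)


good = Backend("good", avg_latency_ms=120.0)
broken = Backend("broken", avg_latency_ms=2.0)  # fast because it only fails
for _ in range(100):
    good.record(200)
    broken.record(500)

pool = [good, broken]
print(pick_fastest(pool).name)          # broken
print(pick_healthy_fastest(pool).name)  # good
```

The naive strategy sends all traffic to the machine producing garbage, which is the failure mode described in the notes.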

Performance

  • engineer for performance at the 99.9th percentile
  • averages are irrelevant
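
A small sketch of why averages hide the problem. The latency numbers are invented for illustration, and the nearest-rank `percentile` helper is just one common way to compute a percentile, not something taken from the talk.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 9,980 fast requests and 20 very slow ones.
latencies = [20] * 9_980 + [3_000] * 20

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p999 = percentile(latencies, 99.9)

print(f"mean: {mean:.2f} ms, median: {p50} ms, 99.9th percentile: {p999} ms")
```

With this made-up data the mean is about 26 ms and looks healthy, while the 99.9th percentile shows that some customers wait three seconds.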

Cost effectiveness

  • uncertainty
  • acquire resources on demand - you can't predict anything
  • release resources when no longer needed
  • the new economy is all about much intensified competition
  • don't rely on resources
  • the power of your success is now no longer in your hand
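
The capital-investment versus variable-cost argument can be illustrated with a toy calculation. Everything here is an assumption: the demand profile (a holiday peak of 3x the baseline, in line with the 2-3x peaks mentioned earlier in the notes), the unit price, and the simplification that a server costs the same per month whether it is owned or rented on demand.

```python
# Hypothetical demand: servers needed per month, with one peak month at 3x baseline.
monthly_demand = [100] * 11 + [300]
COST_PER_SERVER_MONTH = 1.0  # arbitrary unit price, an assumption


def fixed_cost(demand, unit_cost):
    """Provision for the peak and pay for that capacity all year."""
    return max(demand) * len(demand) * unit_cost


def elastic_cost(demand, unit_cost):
    """Acquire capacity on demand and release it when no longer needed."""
    return sum(demand) * unit_cost


fixed = fixed_cost(monthly_demand, COST_PER_SERVER_MONTH)
elastic = elastic_cost(monthly_demand, COST_PER_SERVER_MONTH)
utilization = elastic / fixed  # share of the fixed capacity actually used

print(f"fixed: {fixed:.0f}, elastic: {elastic:.0f}, utilization: {utilization:.0%}")
```

In this toy model the fixed setup pays for 3600 server-months but uses only 1400 of them. Real on-demand pricing is usually higher per unit than owned hardware, so the actual break-even depends on how spiky the demand is.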

  • these four non-functional properties of large systems are dominated by state management

  • categorization of data access patterns

    • primary key access (high read volume, always writable)
    • query-based access (relationless + relational)
  • two large services: S3 and SimpleDB
  • EC2 with persistent storage for dedicated purposes
  • billions of objects in Amazon S3
  • the traffic out of Amazon's web services is larger than the traffic of all retail properties combined
  • availability zones
  • explanation of persistent storage for EC2
  • the big deal is: any type of legacy system can be run within the cloud
  • the only thing needed to get started: a credit card and http://aws.amazon.com (no contract, negotiations, ...)
  • (Question by yours truly: does Amazon.com use the services internally?)
  • Yes, extensively, given it's . If S3 ever failed, you'd notice it in Amazon.com
  • (Question: is Amazon impacted by the peak loads it has to handle?)
  • Amazon.com's scale is basically dwarfed by the platform it offers to others; it profits just as much.

Great talk, too bad it was this short.

On April 17, 2008 7:46 PM, mnot.net said:

Thanks for the writeup — looks like they’re doing even cooler stuff than their products imply!

On April 23, 2008 10:32 PM, Armin Auth said:

Stefan, great summary. Are slides of his talk available somewhere?