
QCon SF 2009: Michael Nygard, Software Architecture for Cloud Applications

Stefan Tilkov, Nov 19, 2009

These are my unedited notes from Michael Nygard's talk about Software Architecture for Cloud Applications at QCon SF 2009.

Anything as vaguely defined as Cloud Computing requires some up-front definition at the beginning of each talk
There are risks in moving to cloud computing, despite what the cheerleaders will tell you
Definition of Cloud, Grid, Utility – grid as parallelized, distributed, heterogeneous services, utility as a pay-per-use model; IaaS, PaaS, SaaS
Grid computing - most visible example SETI@home; getting hold of a large number of systems: moving data to the computation
Another example LHC: After filtering, each collision creates 15 TB of data
If CERN processed the data themselves, it would be a choice between heating the town or running the datacenter
More than 100,000 CPUs, 10 PB of image data each year
Software platform abstracts away the hardware differences - using GLOBUS
Grid Computing is really about these very large-scale problems
Utility Computing: Not Cloud Computing, not IaaS - just a billing method
Financial flexibility, not a lot of engineering
Good example: holiday volumes at a retailer - overspending 10 out of 12 months
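The retailer example can be made concrete with a little arithmetic. This is a minimal sketch with made-up numbers (20 servers needed in ten ordinary months, 100 in the two holiday months); the demand figures and function names are assumptions for illustration, not data from the talk.

```python
def server_months_owned(monthly_demand):
    # Owning hardware means sizing for the peak month, all year round.
    return max(monthly_demand) * len(monthly_demand)


def server_months_pay_per_use(monthly_demand):
    # With utility billing you pay only for what each month actually needs.
    return sum(monthly_demand)


# Hypothetical demand: 10 ordinary months, 2 holiday months.
demand = [20] * 10 + [100] * 2

owned = server_months_owned(demand)        # 1200 server-months
used = server_months_pay_per_use(demand)   # 400 server-months
idle = owned - used                        # 800 server-months paid for but unused

print(f"owned: {owned}, pay-per-use: {used}, idle: {idle}")
```

Under these assumed numbers, two thirds of the owned capacity sits idle, which is the "overspending 10 out of 12 months" effect.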
Focus of the talk: IaaS/Cloud Computing
4 trends: virtualization, commoditization of hardware, horizontally scalable architecture - created the problem of rapid provisioning
Virtualization vendors tend to underplay the difference to Cloud computing
Differences: 4 key questions: Who allocates resources? Who deploys virtual machines? How quickly can new resources be allocated? Is provisioning under human or programmatic control? With virtualization, the answers are administrators, administrators, depends on the approval process, human; in the cloud, they are users, users, minutes, programmatic
Cloud is a generic computing platform, zero lead time, hardware appears homogeneous; specialized hardware is abstracted away and turned into declarative aspects
Cloud computing is like corn farming in Iowa … tiny margin, large fixed costs, scale is essential; small providers won't be around long
Doesn't get why remoteness is supposed to be a key aspect of Cloud Computing - e.g. Private Clouds are just fine
Easiest way to prepare your app for CC: Don't do anything
Advantages to gain: scalability, bundling, ephemerality; risks to mitigate: availability, geography, ephemerality
Amazon EC2 examples; Quirks: "Clean boot" is really clean, local storage not persistent, IP addresses assigned randomly
Advantage: Bundling, i.e. installing stuff on a pre-configured image, then bundling and storing the new image in S3
Can be part of the build process - turn out fully bootable machine images
Half of all failures arise within 24h after deployment of a change
The plus side of ephemerality: Do testing, after successful UAT, switch production IP to new system - keep old one running to be able to switch back quickly
No more deployments to production servers
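The switch-and-keep-the-old-one idea can be sketched as a pointer that remembers where it pointed before. This is an illustrative model only: `ProductionPointer` and the instance names are hypothetical, and a real setup would reassign a public IP address or DNS entry through the provider's API rather than a Python attribute.

```python
class ProductionPointer:
    """Tracks which instance currently receives production traffic."""

    def __init__(self, live):
        self.live = live
        self.previous = None  # nothing to roll back to yet

    def switch_to(self, new_instance):
        # The old instance keeps running; we only remember where it is.
        self.previous, self.live = self.live, new_instance

    def roll_back(self):
        if self.previous is None:
            raise RuntimeError("no previous instance to roll back to")
        # Swap again, so a rollback can itself be undone.
        self.live, self.previous = self.previous, self.live


pointer = ProductionPointer(live="instance-v1")
pointer.switch_to("instance-v2")   # after successful UAT
pointer.roll_back()                # trouble in the first 24h: back to v1
```

Because nothing is deployed onto the running production machine, rolling back is a second pointer swap rather than a reinstall.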
Schema version conflicts: Old and new version may talk to the same DB
Multiple ways to handle concurrent versions - advice: Don't accept downtime
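One possible way to let old and new versions share a database is to make schema changes purely additive, so old code never sees anything it cannot handle. The notes do not say which approach the talk recommended; the sketch below is an assumption, using an in-memory SQLite database and a hypothetical `customers` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)


def old_insert(conn, name):
    # Old application version: knows nothing about the email column.
    conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))


# Additive migration: a new nullable column does not break old writers.
conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")


def new_insert(conn, name, email):
    # New application version: writes the extra column.
    conn.execute(
        "INSERT INTO customers (name, email) VALUES (?, ?)", (name, email)
    )


def new_read(conn):
    # New readers must tolerate rows written by the old version (NULL email).
    return conn.execute(
        "SELECT name, COALESCE(email, '') FROM customers ORDER BY id"
    ).fetchall()


old_insert(conn, "alice")                   # old version still running
new_insert(conn, "bob", "bob@example.com")  # new version running alongside
rows = new_read(conn)
```

Both versions keep working against the same table, so the switch-over needs no downtime; removing or renaming columns would have to wait until the old version is retired.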
Scalability patterns: the master/worker pattern
Work is enqueued, master looks at average completion time and creates more workers as needed
Examples: New York Times, Animoto (hoping for new examples soon)
Getting a few hundred servers to use over a weekend …
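The master's scaling decision described above can be sketched as follows. This is a minimal model under stated assumptions: the target drain time, the worker cap, and all names (`workers_needed`, `Master`) are invented for illustration, and actually starting or stopping cloud instances is left out.

```python
import math
from collections import deque


def workers_needed(queue_length, avg_completion_seconds, target_drain_seconds):
    """Workers required to empty the queue within the target time."""
    total_work = queue_length * avg_completion_seconds
    return math.ceil(total_work / target_drain_seconds)


class Master:
    def __init__(self, target_drain_seconds=60.0, max_workers=20):
        self.queue = deque()
        self.recent_times = deque(maxlen=50)  # sliding window of job durations
        self.workers = 1
        self.target_drain_seconds = target_drain_seconds
        self.max_workers = max_workers

    def enqueue(self, job):
        self.queue.append(job)

    def record_completion(self, seconds):
        self.recent_times.append(seconds)

    def rebalance(self):
        """Recompute the worker count from queue length and average job time."""
        if not self.recent_times:
            return self.workers  # no measurements yet, keep the current size
        avg = sum(self.recent_times) / len(self.recent_times)
        wanted = workers_needed(len(self.queue), avg, self.target_drain_seconds)
        # Keep at least one worker and never exceed the budget cap.
        self.workers = max(1, min(self.max_workers, wanted))
        return self.workers


master = Master(target_drain_seconds=60.0, max_workers=20)
for job_id in range(100):
    master.enqueue(job_id)
master.record_completion(6.0)
print(master.rebalance())  # 100 jobs * 6 s / 60 s -> 10 workers
```

The cap matters in practice: with programmatic provisioning, a runaway queue would otherwise translate directly into a runaway bill.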
Explanation of Map/Reduce, Hadoop, HDFS - blend between Cloud and Grid
Amazon Elastic Map Reduce abstracts away from topology issues of using pre-configured Hadoop images
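For readers new to Map/Reduce, the classic word count shows the shape of the model. This is a single-process sketch in plain Python, not Hadoop code: in a real cluster the map and reduce calls run on many machines and the shuffle step moves data between them. The function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (key, value) pair for every word.
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would do
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["cloud grid cloud", "grid utility"])
print(counts)
```

Because each map call sees only one document and each reduce call only one key, both phases can be spread over as many machines as are available.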
Horizontal, read-only replication as a scalability approach - works with most traffic
DB and Web Server can run on the same machine if it's just a read-only copy
Problem: Load balancer requires configuration, needs to know which nodes are available - changes often require a reboot
"Autoscaled load balanced group" addresses this - new nodes are automatically added to the pool of machines load is distributed to
If Oprah mentions you on Twitter, you're still dead
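The difference between a statically configured load balancer and an autoscaled group is that nodes join and leave the pool at runtime. The sketch below models only that idea with round-robin routing; `LoadBalancedGroup` and the node names are hypothetical and do not reflect the API of any AWS service.

```python
class LoadBalancedGroup:
    """Round-robin pool whose membership can change while it is running."""

    def __init__(self):
        self._nodes = []
        self._counter = 0

    def register(self, node):
        # Called when a freshly booted node reports in; no restart needed.
        if node not in self._nodes:
            self._nodes.append(node)

    def deregister(self, node):
        # Called when a node is terminated or fails its health check.
        if node in self._nodes:
            self._nodes.remove(node)

    def route(self):
        """Return the node that should handle the next request."""
        if not self._nodes:
            raise RuntimeError("no nodes available")
        node = self._nodes[self._counter % len(self._nodes)]
        self._counter += 1
        return node


group = LoadBalancedGroup()
group.register("node-a")
group.register("node-b")
first = [group.route() for _ in range(2)]

group.register("node-c")  # scale out: the new node gets traffic immediately
after_scale_out = {group.route() for _ in range(6)}
```

Scaling still takes minutes, which is why a sudden spike from a celebrity mention can overwhelm the pool before new nodes are ready.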
Natural affinity between Cloud Computing and Open Source
What about NoSQL? Makes a lot of sense, but relational DBs work, too
Scaling with caching servers such as memcached, GigaSpaces, Oracle Coherence, Terracotta
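A common way to use such a caching tier is the cache-aside pattern: look in the cache first and fall back to the database only on a miss. In this sketch a plain dict stands in for memcached or one of the other products, and `CacheAside` and `slow_db_lookup` are hypothetical names; the notes do not say which pattern the talk had in mind.

```python
class CacheAside:
    """Read-through helper: cache first, database only on a miss."""

    def __init__(self, load_from_db):
        self._cache = {}           # stand-in for a remote cache server
        self._load_from_db = load_from_db
        self.db_reads = 0          # counts how often the database was hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        value = self._load_from_db(key)
        self.db_reads += 1
        self._cache[key] = value
        return value

    def invalidate(self, key):
        # Call after a write so the next read fetches fresh data.
        self._cache.pop(key, None)


def slow_db_lookup(key):
    # Hypothetical database query.
    return f"row-for-{key}"


cache = CacheAside(slow_db_lookup)
cache.get("user:1")
cache.get("user:1")   # served from the cache, no second database read
```

Every read answered by the cache is load the database never sees, which is what makes this a scaling tool rather than only a latency optimization.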
Every time you use virtualization, I/O throughput suffers
SLA of Amazon EC2 is defined in terms of the entire zone
Guaranteed EBS availability is higher than that of a single disk, but still the risk needs to be mitigated
Snapshotting to S3 is one level of mitigation; transferring the snapshot to another availability zone addresses it further
Load balancing group across availability zones
Right now running a number of servers on the departmental secretary's credit card; after all, computing resources are just office supplies
Perception of risk of using the Cloud - real security professionals talk about the degree of risk, not about absolutes
What's missing is data about threats, their frequency, and their impact
Wouldn't it be horrible if … "WIBHI" reaction doesn't help
Cloud Security Alliance works on standards regarding this
Real risks: Control plane threats, patent shutdowns, lack of risk management info
Security facilities to apply: transport level encryption, e.g. SSL; storage level encryption; access control (security groups in AWS); control plane/multi-tenancy: outside of your control
Cloud Computing offers significant cost and time-to-market advantages
From Q&A: Main advantage of IaaS as opposed to PaaS: Adoption is much easier