This is a single archived entry from Stefan Tilkov’s blog.

QCon SF 2009: Max Ross, Mapping Relational Data Patterns to the App Engine Datastore

Stefan Tilkov, Nov 19, 2009

These are my unedited notes from Max Ross's talk about Mapping Relational Data Patterns to the App Engine Datastore at QCon SF 2009.

Datastore is transactional, natively partitioned, hierarchical, schema-less, based on BigTable – not a relational database
Goals: Simplify storage by simplifying development, management
Even though Datastore is based on the ridiculously scalable BigTable, you don't need to have scalability problems to benefit from it
Scale always matters - the problem is not in the second step, it's the first step
Free to get started (not only for the first 30 days), pay only for what you need
Let someone else manage upgrades, redundancy, connectivity
Let someone else handle problems
Detailed post-mortem of GAE downtime available somewhere
Scale automatically to any point on the scale curve
Trying to get people out of the business of managing their database in production
Basic entity: Kind, Entity group, key, age, + any number of properties
Datastore is schemaless - soft schema model. Much of the stuff available in the DB (constraints, type checking, schema) needs to move up to the app layer (but is usually replicated there anyway)
primary benefit of the schemaless datastore: much faster iterations
soft schemas can give you type safety despite using a simple key/value store underneath
JPA annotations provide soft schema - even though targeted at creating DB information, GAE can benefit from it
JPA annotations are a data definition language (proof: relational DB schema can be created from annotations)
Primary keys in the datastore contain the kind and are hierarchical, e.g. /Person:13/Pet:Ernie
Analogy: Hierarchical datastore keys are similar to composite primary keys
Surrogate keys are harder to move - dropping is often not an option. Mapping options: 1) make surrogate part of the key a property 2) make surrogate key primary key, put rest into property
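
To make the key structure concrete, here is a minimal Python sketch of a hierarchical key path such as /Person:13/Pet:Ernie. It is not the App Engine API: `parse_key` and `root_element` are hypothetical helper names, and the rule that an all-digit identifier is a numeric id while anything else is a name is an assumption made for this illustration. The resulting list of (kind, identifier) pairs behaves much like a composite primary key.

```python
from typing import List, Tuple, Union

# One (kind, id-or-name) pair per level of the hierarchy.
KeyPath = List[Tuple[str, Union[int, str]]]


def parse_key(path: str) -> KeyPath:
    """Split '/Person:13/Pet:Ernie' into [(kind, id_or_name), ...]."""
    elements: KeyPath = []
    for part in path.strip("/").split("/"):
        kind, _, ident = part.partition(":")
        # Assumption: all-digit identifiers are numeric ids, others are names.
        elements.append((kind, int(ident) if ident.isdigit() else ident))
    return elements


def root_element(path: str) -> Tuple[str, Union[int, str]]:
    """Return the first element of the key path (the top of the hierarchy)."""
    return parse_key(path)[0]


pet_key = "/Person:13/Pet:Ernie"
print(parse_key(pet_key))     # [('Person', 13), ('Pet', 'Ernie')]
print(root_element(pet_key))  # ('Person', 13)
```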

Transactions

transactions in the Datastore apply to a single Entity Group
Entities in the same Entity Group share the same root part of the key
This makes Entity Group selection a critical design choice, with obvious effects on transactions
Too coarse hurts throughput, too fine limits usefulness of transactions
Datastore does optimistic concurrency checks at the Entity Group level
[Strong relationship between data modeling and transaction processing – reminds me of the old debate on EJB 2.0 pre-final entity beans and dependent objects]
Unreleased new feature: Transactional tasks can update multiple entity groups, a task in a queue can participate in a DB transaction
Example: Deferred, transactional, async balance update (eventual consistency) as well as synchronous
Two-phase commit protocol algorithm developed at Berkeley, implemented by a Google developer (Erick Armbrust)
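
The interplay between entity groups and optimistic concurrency can be sketched with a toy in-memory store. Everything here is an assumption for illustration (`ToyDatastore`, `Transaction`, and the per-group version counter are hypothetical and are not how the talk described the real implementation); the point is only that a transaction is confined to one entity group and that a commit fails if another commit touched that group in the meantime.

```python
class ConcurrentModification(Exception):
    """Raised when another commit touched the entity group first."""


class ToyDatastore:
    """In-memory stand-in for illustration; NOT the App Engine API."""

    def __init__(self):
        self.entities = {}        # key path string -> dict of properties
        self.group_versions = {}  # root key element -> commit counter

    @staticmethod
    def group_of(key):
        # Entities sharing the first path element are in one entity group.
        return key.strip("/").split("/")[0]

    def begin(self, group):
        return Transaction(self, group)


class Transaction:
    def __init__(self, store, group):
        self.store = store
        self.group = group
        # Remember the group's version as of transaction start.
        self.start_version = store.group_versions.get(group, 0)
        self.writes = {}

    def put(self, key, props):
        if ToyDatastore.group_of(key) != self.group:
            raise ValueError("transaction is limited to group " + self.group)
        self.writes[key] = dict(props)

    def commit(self):
        current = self.store.group_versions.get(self.group, 0)
        if current != self.start_version:
            # Someone else committed to this group since we started.
            raise ConcurrentModification(self.group)
        self.store.entities.update(self.writes)
        self.store.group_versions[self.group] = current + 1


store = ToyDatastore()
t1 = store.begin("Person:13")
t2 = store.begin("Person:13")
t1.put("/Person:13/Pet:Ernie", {"species": "cat"})
t2.put("/Person:13/Pet:Bert", {"species": "dog"})
t1.commit()
try:
    t2.commit()
    conflict = False
except ConcurrentModification:
    conflict = True
print(conflict)  # True
```

Note that the second transaction fails even though it wrote a different entity: both entities live in the same group, which is the throughput cost of a group that is too coarse.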

Relationships

Letting a framework manage relationships can simplify code for RDBMS, but especially for App Engine Datastore
Goal: make handling relationships with JPA as easy as possible
Google's JPA implementation has some sensible defaults: Ownership implies entities are placed in the same Entity Group
E.g. Person with a @OneToMany to Pet (with a back reference of @ManyToOne) makes both part of the same Entity Group

Queries

Testing set membership – requires a join table with an RDBMS, can use a multi-value property in the GAE datastore (select from User where hobbies = 'yoga')
Other than that, no joins supported
Conflict: Google promises that query performance scales linearly with the size of the result set; not possible when cross products are needed to fulfill queries
Making good progress on support for a subset of joins, not released yet - nowhere near ready for production
RDBMSs encourage cheap writes and expensive reads; the datastore encourages expensive writes and cheap reads. Denormalization encouraged where it makes sense
Obvious problems with denormalized data
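
The set-membership query and the "expensive writes, cheap reads" trade-off can be illustrated together with a toy store that maintains an index at write time. This is a simplified sketch under my own assumptions, not the App Engine API: `ToyIndexedStore` and `query_eq` are hypothetical names, and the index layout (one row per value of a multi-value property) is a deliberately simplified model.

```python
from collections import defaultdict


class ToyIndexedStore:
    """In-memory illustration only; not the App Engine API."""

    def __init__(self):
        self.entities = {}
        # (kind, property, value) -> set of entity keys
        self.index = defaultdict(set)

    @staticmethod
    def _rows(props):
        # A multi-value property yields one index row per value.
        for prop, value in props.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                yield prop, v

    def put(self, kind, key, props):
        # Remove index rows left over from a previous version of the entity.
        old = self.entities.get((kind, key), {})
        for prop, value in self._rows(old):
            self.index[(kind, prop, value)].discard(key)
        # The write pays for one index row per value of every property.
        for prop, value in self._rows(props):
            self.index[(kind, prop, value)].add(key)
        self.entities[(kind, key)] = props

    def query_eq(self, kind, prop, value):
        # The read is an index lookup; work grows with the result size only.
        return sorted(self.index.get((kind, prop, value), set()))


store = ToyIndexedStore()
store.put("User", "alice", {"hobbies": ["yoga", "chess"]})
store.put("User", "bob", {"hobbies": ["chess"]})
# Roughly: select from User where hobbies = 'yoga'
print(store.query_eq("User", "hobbies", "yoga"))   # ['alice']
print(store.query_eq("User", "hobbies", "chess"))  # ['alice', 'bob']
```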

Taking code somewhere else

App Engine is in general more restrictive
Suggestion: Decide early whether or not portability matters to you
Shows examples of portable code - somewhat ugly
Congratulations, you have already sharded your data model

Key takeaways

App Engine datastore simplifies persistence
JPA adds typical RDBMS features to the datastore
Important to understand how the datastore is different
Easier to move apps off than on
If portability is important, plan for it
http://gae-java-persistence.blogspot.com

Q&A

Q. Does the shown transaction example really solve the problem? A. No, not to the full extent. A lot of Google's billing software is built without multi-row transactions
Q. Is JPA a good model when starting from scratch? A. Many people like the low-level API, then start building an ORM on top of it … possibly better to start using an existing one.
Q. What kind of apps are on GAE? A. Not really known, many backend applications for iPhone apps, Facebook, … Obama virtual town hall meeting peaked at 700 req/s
Q. Export features? A. Some bulk import/export, but there should be more
Q. Caching? A. No direct support for JPA caching using memcached, but should be pluggable
Q. Is Python going to be replaced by Java? A. Absolutely not, the Java team rather has to fight to be accepted as an equal citizen
Q. Restrictions on some JDK features relevant? A. No.
Q. Staging area? A. No, not yet.
Q. JDO? A. GAE supports both, datanucleus supports both; JPA was chosen randomly for this talk today.
Q. Can apps be run offline? A. You can run the app SDK locally, but it won't scale; the stub implementations are pluggable, though, and could be replaced.