Friday, September 17, 2010

Oracle Outage at JPMorgan and Formal Proof of DB Validity

Andrew Wiles, Prover of Fermat's Last Theorem
There's an interesting post over at DBMS2 about the JPMorgan Oracle DB outage.  As it happens, I worked on a similarly massive enterprise-level Oracle DB running on Solaris, and we had a hard outage that required a manual restart.  It took the whole server down, and we had corruption that required a restore from backup.  I'm not sure the two incidents are related, but these huge databases are not as reliable as they are made out to be.

A code error can always cause some corruption: a monolithic DB kernel running on a monolithic Unix kernel offers endless opportunities for things to go wrong.  Recently, though, a way out of this snafu has come to light, a mathematical light at the end of the tunnel: formal mathematical verification.  It has long been used for microprocessors, and it is just starting to be applied to operating systems.

In normal programming tasks, formal verification is too restrictive and unwieldy for regular use.  However, for infrequently changed but extremely critical software, it is becoming more than just a toy.  An entire microkernel, seL4 of the L4 family, has recently been formally proven correct.  The increasing speed of computers and the growing sophistication of automated proof checkers are making this possible.  One can picture this as only the groundwork for formal proofs of other basic pieces: kernel libraries, I/O, filesystems, network stacks.  And eventually applications, or at least the critical parts of them.
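
To make the idea concrete, here is a minimal sketch in Lean of what machine-checked correctness looks like.  The names (slowRev, fastRev) are my own illustration, not anyone's real kernel code; seL4's actual proof was done in Isabelle/HOL and relates a C implementation to an abstract specification.  The shape of the argument is the same, though: an optimized implementation is proven to agree with a simple specification.

```lean
-- Specification: reverse a list the obvious (slow) way.
def slowRev : List Nat → List Nat
  | []      => []
  | x :: xs => slowRev xs ++ [x]

-- Implementation: a faster, tail-recursive reversal with an accumulator.
def fastRev (xs : List Nat) : List Nat := go xs []
where
  go : List Nat → List Nat → List Nat
  | [],      acc => acc
  | x :: xs, acc => go xs (x :: acc)

-- Helper lemma: the accumulator version computes the spec plus a suffix.
theorem go_spec (xs acc : List Nat) :
    fastRev.go xs acc = slowRev xs ++ acc := by
  induction xs generalizing acc with
  | nil => simp [fastRev.go, slowRev]
  | cons x xs ih => simp [fastRev.go, slowRev, ih]

-- The machine-checked result: implementation agrees with spec, always.
theorem fastRev_correct (xs : List Nat) : fastRev xs = slowRev xs := by
  simp [fastRev, go_spec]
```

The point is the quantifier: the theorem holds for every input list, not just the inputs a test suite happened to exercise, and the proof checker refuses to accept it otherwise.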

And this brings us to the DB.  On top of a formally verified microkernel and a verified filesystem, I want a verified database microkernel.  I'm not talking about verifying that every SQL statement is 100% accurately implemented; what I'm after is formal verification of ACID semantics, so that committed operations are guaranteed to take effect and file corruption is guaranteed never to occur.
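
As a sketch of what a verified ACID property could look like, here is a toy model in Lean.  Every name here (Accounts, total, transfer) is hypothetical and purely illustrative, nothing like a real database kernel.  A transfer either commits in full or leaves the state untouched, and a consistency invariant, conservation of the total balance, is proven for every state and every amount:

```lean
-- Toy model: two account balances; the consistency invariant is the
-- total amount of money in the system.
structure Accounts where
  a : Nat
  b : Nat

def total (s : Accounts) : Nat := s.a + s.b

-- A transfer commits only if funds are sufficient; otherwise the state
-- is left untouched (atomicity: all or nothing).
def transfer (amt : Nat) (s : Accounts) : Accounts :=
  if amt ≤ s.a then { a := s.a - amt, b := s.b + amt } else s

-- The machine-checked invariant (consistency): no transfer, committed
-- or aborted, can ever change the total.
theorem transfer_preserves_total (amt : Nat) (s : Accounts) :
    total (transfer amt s) = total s := by
  by_cases h : amt ≤ s.a
  · simp only [transfer, total, if_pos h]
    omega
  · simp only [transfer, total, if_neg h]
```

Durability would be the harder half to verify, since it means modeling the disk and crash behavior, and that is part of what makes a verified database kernel such an ambitious target.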

We're not just talking about Oracle / DB2 / Sybase / SQL Server here; you want the same consistency guarantees from NoSQL solutions as well.  I may be able to tolerate my data being updated at different times in different places, but I don't want my account balance to be randomly altered because of some corrupted file.

This formal verification approach at the DB level would not only remove a huge chunk of potential bugs, it would give us certainty on the basics, so we could focus on the larger tasks: distributed horizontal key/value stores such as BigTable, efficient spatial functions, social-graph solvers.  And it might let DBAs sleep better at night.

Formal verification of an operating system security kernel (Computer science)

1 comment:

  1. Formal verification, as done by the l4.verified team, is indeed a great step forward in software correctness, and is likely to bring the number of system outages caused by hard-to-test software corner cases close to zero. I suspect, however, that database and filesystem outages are just as likely to be caused by unrecoverable hardware faults.

    Being resilient to that kind of event requires additional techniques used in the safety-critical community: identifying the hardware (and environmental) failure modes that could interfere with operation, and building in redundancy so the system can keep running. In the safety-critical world, it's not uncommon to have 3-5 redundant systems and to have the independent components vote on the correct answer, as in the sketch below.
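
    (A hypothetical sketch of that voting idea in Lean, with illustrative names of my own choosing: a majority voter over three replicas, and a machine-checked proof that it masks any single faulty replica.)

```lean
-- Hypothetical sketch: triple modular redundancy. Three replicas
-- compute a value independently; the voter returns the majority.
def vote (x y z : Nat) : Nat :=
  if x = y then x
  else if x = z then x
  else y  -- x disagrees with both, so y and z hold the majority

-- Machine-checked fault masking: if any two replicas agree on the
-- true value v, the voter returns v no matter what the third says.
theorem vote_masks_single_fault (x y z v : Nat)
    (h : (x = v ∧ y = v) ∨ (x = v ∧ z = v) ∨ (y = v ∧ z = v)) :
    vote x y z = v := by
  unfold vote
  split
  · omega      -- case x = y: that pair carries the majority
  · split
    · omega    -- case x = z
    · omega    -- x matches neither, so y and z must carry v
```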

    If, in practice, most Oracle outages are due to software failures, a verified software implementation would indeed help, though it's a Herculean task -- the l4.verified team said it took something like 7 PhD-years to verify 7,000 lines of l4 code. Modern SQL databases are much larger and more complicated pieces of code.
