The Trend of Managed Schemas: A Database is Not a Messaging System

This thread on the Boston MySQL User Group Board is getting interesting:
http://mysql.meetup.com/137/boards/view/viewthread?thread=2280640

(From the original poster:)

I think that nonequivalence comes from the past when the data sharing was a
rare situation. The data changes were always initiated by application and it
always knew about those changes. Now the situation is different. When the data
are shared between multiple remote applications you have to notify other
interested parties about those changes.

Currently databases are mostly used as "pull" components. If they had standard
"push" functionality they could compete with messaging systems with the advantages
of automatic data persistence and powerful query language.

(my response:)
Well, and that's the problem -- the paradigm *has* changed. MySQL is fast and reliable because it does NOT put things like messaging into their database, which Oracle and SQL Server do. A database is not a messaging system, it's a database.

What effect would notification that there have been changes have on MVCC? I do wish there was a "pull" way to check if the data has changed.

The paradigm change of the application managing the schema causes this. I do not believe messaging is the correct way to handle this problem.

Consider the parallel to source code version control. Much like MVCC, you check out code, change it, and commit the code. Unlike many source code version control systems, though, MVCC ("data version control") does not have the equivalent of an "update" command, except for doing another pull from the database. It would be great if there was an easy way to do a "diff" of what's in the database versus what the application is changing, but that seems like it would be a programmatic thing (function or method), not a database thing.

And consider the database overhead and bandwidth....instead of just running queries, MySQL would have to somehow keep track of which thread has what data, and then notify every single thread that has that data, that it's changed. The applications will have to be written to keep threads open longer, which will consume lots of resources. That's lots more overhead for the database, and much more bandwidth, because there may be instances of the application that are using data that they do not care if it changed....so the messaging system would be wasting bandwidth, sending messages to instances that do not care. Although that could be mitigated by the application keeping a thread open when it cares about whether or not the data has changed.

Then again, I'm not fond of managed schema in the application...or at least, when the developers write that code. Seems to me it should be the DBA writing that code. It's *very* useful for data consistency and integrity, which is a function of the DBA, not a developer.

What effects do you see the managed schema having on databases? Who should be responsible for writing a managed schema? Should a managed schema be used for database consistency within an application? Where is the line drawn between the application putting the required information into the database, and the database's job of maintaining consistency and integrity?

It's somewhat ironic, since for a long time MySQL advocated using the application to ensure the consistency and integrity (ie, before MySQL had a storage engine with foreign keys and transactions).

I often say that the biggest reason MySQL is a widely used database is because it is fast. A fast database can be complemented by an application that adds the features the database is missing; but a slow database that is full-featured cannot be made faster by an application. So it worries me when folks request very specialized systems such as a messaging server (or ANY "push" system) into the database, because that could be easily done with a "pull" mechanism, only using the bandwidth needed by the instances of the applications that care. Otherwise, it will end up adding Microsoft-level bloat to a really nice and fast program.

Indeed, Jim Starkey.

Indeed, Jim Starkey. Acronyms are easy. Names are hard;)

Sheeri, what exactly is a

Sheeri, what exactly is a managed schema? Maybe this is a good topic for another post.

Matthew, I think you mean Jim Starkey, not John Starsky.

>MySQL is fast and reliable

>MySQL is fast and reliable because it does NOT put things like messaging into their >database.

Back in the 3.23 and 4.0 days a lot of folks said MySQL was fast and reliable since it wasn't bogged down with unnecessary stuff like stored procedures, views, and triggers. Now we have 5.0 which is less fast *and* less reliable.

Perhaps messaging is scheduled for 6.0?

Hey Sheeri, I agree that

Hey Sheeri, I agree that MySQL's speed has been important to its success, but I believe there is more to it. It grew in popularity, because it was widely used within the growing niche of web publishing. For that purpose, MySQL has always been feature complete and damn fast. As that niche exploded in size, so did the entire market's awareness of MySQL, its speed, and its scalability.

In the era of Web 2.0, web publishing is no longer the paradigm. The new web is about collboration and event driven applications where the events may originate either within the client (e.g. browser) or on the server. We both agree server originating events may propagate events to the client either by pushing notifications or in response to client polling.

Web 2.0 developers are looking for speed and functionality. If polling scales well and delivers timely events to clients, then all is good, but whether push based systems can scalably deliver timely information is unclear. It is becoming increasingly popular to combine COMET programming techniques with the now old school AJAX tricks, e.g. google chat. The Apache foundation has responded by designing changes into Apache 2.2 to make the long running connections used for pushlets more efficient.

Apache has recognized that the web is changing. Dedicated messaging and push servers like iPush Server have appeared to meet this demand. WHATWG even has a spec for server sent DOM events. Apache has changed in response to these pressures. Whatever stack of technologies is needed for Web 2.0+, whether LAMP, or something else, the stack will continue to contain an "A", and that "A" will continue to stand for the Apache Web Server. Will there still be an "M"?

When giving his talk at the Boston MySQL Meetup, John Starsky pointed out that the web was changing, and the database must be designed to solve the problems on the horizon. Of course, in his genious, he figure out how to solve the problems to come and make things faster for the common usage at the same time.

So, while I understand your fear of bloat, I believe the question should be "Is polling data fast enough for tomorrow's Web App needs, and can a push based solution be implemented efficiently?"

Remember Starsky has shown that what may at first appear like an engineering tradeoff, e.g. his bitmaps in falcon, can end up having unforseen uses in the rest of the system. For example, can knowledge about which data clients are registered to receive events give hints as to what information should or should not be kept in cache?

I also understand that MySQL is used outside of the Web. My own needs have led me to use it for many purposes, and though I would not want to see bloated features adversely impact the performance for those uses, I am not convinced we can't have our cake and eat it too. But then then again, maybe I'm just too fond of cake.