Wednesday, July 21, 2010

Five open source Projects I wish I could fund

I've always said to myself that if I ever become independently wealthy, I'm going to bankroll some things I've always wanted that the open source community hasn't felt a need to provide. Mind you, I'm not independently wealthy, so don't expect to see much from me.

Anyway, here's my current "wishlist":

OpenWire Ruby drivers for ActiveMQ.

For that matter, I'd love wire level drivers for a bunch of stuff. In the case of ActiveMQ, it's nice that the non-OpenWire options are all plaintext, but they don't support some of the same semantics as the OpenWire drivers and, quite honestly, weren't very reliable in the testing I did. Say it with me folks: stateless protocols are not the way to talk to queue servers, and ESPECIALLY not over HTTP. REST semantics don't map properly to core message queue concepts.

Non-Win32/DLL Ruby drivers for MSMQ and other Microsoft products

This really bit me in the ass at the AJC. It would have made my life a whole lot easier if we'd had a way to talk to MSMQ from a non-Windows platform. Sure, Microsoft documents the protocols for the most part, but unless I'm planning on learning C and implementing a native extension, I don't see myself doing it.

An open source ETL/DW/BI suite built on NoSQL. Bonus points for supporting rolling warehouse loads.

It may sound silly, but of everything NoSQL promises, I've always thought the warehouse is a great fit because warehouse data is already denormalized. I also think Map/Reduce is a much more logical construct for BI reporting. There are a few headaches, though, which is why, even as a self-contained suite, it would take effort to gain traction:

  • ETL vendors would need to support the NoSQL engine on the Load side
  • BI/Reporting tools would need to support the NoSQL engine
  • Report creators (often non-technical employees from each business unit stakeholder) would need to learn Map/Reduce concepts for scheduled reports
  • Map/Reduce is a poor, if not impossible, choice for ad-hoc queries, at least as far as the current crop of NoSQL engines is concerned

Essentially, you would HAVE to create your own suite, soup to nuts, and provide a way to move people away from thinking in SQL for report generation. Maybe a hybrid approach makes more sense. Assuming I were king for a day, the warehouse side would be a hierarchical design: all data is dumped denormalized into a NoSQL engine. Scheduled reporting is done via Map/Reduce against that data. Additionally, a second load phase, either concurrent with or after the NoSQL load (does that make it ETLL?), dumps a business-rule-defined amount of data into a traditional RDBMS store for ad-hoc purposes. I dunno. I could be over-engineering it ;)
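To make the scheduled-report half of that concrete, here is a minimal Map/Reduce sketch in plain Python. Everything in it is an assumption for illustration: the `SALES` rows, the field names, and the `run_report` helper are made up, and a real NoSQL engine would run the map and reduce phases across many nodes rather than in one process.

```python
from collections import defaultdict

# Hypothetical denormalized warehouse rows: every fact carries its own
# dimensions, so no joins are needed at report time.
SALES = [
    {"region": "south", "product": "widget", "amount": 120},
    {"region": "south", "product": "gadget", "amount": 80},
    {"region": "north", "product": "widget", "amount": 50},
]


def map_sale(row):
    """Map phase: emit one (key, value) pair per warehouse row."""
    yield row["region"], row["amount"]


def reduce_total(key, values):
    """Reduce phase: collapse all values emitted for one key."""
    return sum(values)


def run_report(rows, mapper, reducer):
    """Single-process stand-in for a Map/Reduce job (illustrative only)."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}
```

Calling `run_report(SALES, map_sale, reduce_total)` gives total sales per region. Swapping in a different mapper (say, keyed on `product`) gives a different scheduled report without touching the stored data, which is the appeal of keeping the warehouse denormalized.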

PostgreSQL and MySQL move to a pluggable replication architecture based on message queues.

I'm not sure if this is still the case, but many years ago DB2 was using MQ Series for geographical replication. Message queues are message agnostic and implement the features replication requires, such as guaranteed delivery and ordered delivery. Imagine how easy it would be to scale out MySQL read slaves if they weren't all hitting the master server. Message queues are perfect for this. I might implement it something like this with ActiveMQ:

  • Replication messages are pushed to a queue for known slaves. One queue per slave.
  • Said messages are duplicated into a Topic
  • New slaves subscribe to the Topic and come current
  • Each new slave is then moved over to its own queue

Slaves never talk to the master server directly. You can spin up slaves at any time even without a backup. Just bring the slave up, point to the topic and get current on your own time. At some given point, you're converted to your own queue and unsub from the topic.
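A rough in-memory sketch of that flow, in Python rather than against a real broker. The `Broker` and `Slave` classes are hypothetical, and the sketch assumes the topic keeps its full history so a brand-new subscriber can replay it; a real ActiveMQ setup would need retention configured to behave that way.

```python
from collections import deque


class Broker:
    """Toy stand-in for the message broker (not a real ActiveMQ API)."""

    def __init__(self):
        self.topic_log = []  # assumed: the topic retains every message
        self.queues = {}     # one queue per known slave

    def publish(self, statement):
        """Master side: push to the topic and to every known slave's queue."""
        self.topic_log.append(statement)
        for queue in self.queues.values():
            queue.append(statement)

    def promote(self, slave_name, applied_count):
        """Give a slave its own queue, seeded with whatever it has not applied."""
        self.queues[slave_name] = deque(self.topic_log[applied_count:])


class Slave:
    def __init__(self, name):
        self.name = name
        self.applied = []  # replication messages applied so far, in order

    def catch_up(self, broker):
        """New slave: replay the topic from wherever we left off."""
        for statement in broker.topic_log[len(self.applied):]:
            self.applied.append(statement)

    def drain(self, broker):
        """Established slave: consume everything waiting in our own queue."""
        queue = broker.queues[self.name]
        while queue:
            self.applied.append(queue.popleft())


def demo():
    broker = Broker()
    old = Slave("old")
    broker.promote("old", 0)                # an existing slave with its queue
    broker.publish("INSERT 1")
    broker.publish("INSERT 2")
    new = Slave("new")                      # spun up later, no backup needed
    new.catch_up(broker)                    # comes current from the topic
    broker.publish("INSERT 3")              # arrives before the handoff
    broker.promote("new", len(new.applied)) # handoff: topic -> own queue
    broker.publish("INSERT 4")
    old.drain(broker)
    new.drain(broker)
    return old.applied, new.applied
```

The fiddly part is the handoff: the new queue has to be seeded with anything published between the slave's last topic read and its promotion, otherwise a message is lost. Here `promote` handles that by slicing the topic log at the slave's applied count.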

A DSL for implementing random binary protocols.

I thought this was what protobuf did, but as I look at it more, I realize I might have been mistaken. Imagine if you could take the MSDN docs that describe the MSMQ protocols, convert that information into said DSL, and execute 'foo' against the DSL. Blammo, you have a driver for that protocol. Is that even possible?
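As a very small taste of the idea, here is a Python sketch where a declarative field list is turned into an encoder and decoder using the standard `struct` module. The `protocol` helper, the type names, and the `DEMO_HEADER` layout are all invented for illustration; this is not the MSMQ wire format, and a real protocol would also need variable-length fields, conditionals, and state, which this does not attempt.

```python
import struct

# Hypothetical mini-DSL: a field type name maps to a struct format code.
FIELD_TYPES = {"u8": "B", "u16": "H", "u32": "I"}


def protocol(fields, byte_order=">"):
    """Build (encode, decode) functions from a list of (name, type) pairs."""
    names = [name for name, _ in fields]
    packer = struct.Struct(byte_order + "".join(FIELD_TYPES[t] for _, t in fields))

    def encode(**values):
        # Pack the named values in the order the description declares them.
        return packer.pack(*(values[name] for name in names))

    def decode(data):
        # Unpack raw bytes back into a dict keyed by field name.
        return dict(zip(names, packer.unpack(data)))

    return encode, decode


# A made-up fixed-size header, described as data instead of hand-written code.
DEMO_HEADER = [("version", "u8"), ("flags", "u8"), ("body_length", "u32")]
encode_header, decode_header = protocol(DEMO_HEADER)
```

With this, `encode_header(version=1, flags=0, body_length=256)` produces six big-endian bytes and `decode_header` turns them back into a dict. The "driver" is generated entirely from the field description, which is the part that would have to scale up to a full protocol spec.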

Anyway, there go my business ideas for the next century. I do hope someone runs off with them and does something fun. In seriousness, I can't be the only person who's ever thought of these things. Hell, look at the database replication one. I straight stole that from IBM.

Besides, there are probably patents on all of these ideas already =P

2 comments:

Robert Treat said...

If you look at Slony or Londiste, they are basically replication solutions built on top of queue-based messaging systems. They don't allow for a pluggable queue system, but I bet it wouldn't be too hard to make them work that way. (I'd guess switching the Slony triggers to use something like pg_amqp (http://lethargy.org/~jesus/writes/amqp-for-postgresql) would get you halfway there.)

lusis said...

Interesting on pg_amqp. Wasn't aware of it. Of course when I saw Theo's smiling mug, I figured it had to be pretty cool.

I'll give it a gander. Thanks for the tip!