
Tuesday, 24 June 2014

An Englishman in New York...

...or an idiot abroad; I'll let you decide.

Anyhoo, just got back from a whistle-stop tour of Toronto and New York, familiarising myself with the subtle differences between us English-speaking nations: "It's the little differences... Do you know what they call a Quarter Pounder with Cheese?" "Royale with Cheese!"


Despite the general lack of sleep, the overeating, and the drinking, I did manage (through excessive coffee consumption) to cram quite a lot into my 5 days.

Met some really smart guys from Apprenda, who have a very interesting offering in the PaaS space. In particular, their private PaaS solution addresses a lot of the underlying problems in IT organisations, i.e. being infrastructure-centric. They stitch the infrastructure together into a grid using their peer-to-peer fabric, which creates some interesting options for managing message flow.

Their approach enables development resources to focus on the job at hand, writing code, and pushes configuration and service-level management to the platform. One thing I really liked is that they've managed to use containers not only on Linux but also on Windows...

It looks very slick, and for web apps it's a no-brainer, but the challenge I face is a legacy of mid-range and mainframe apps, and all the [cultural] baggage that comes along with that... Any thoughts on how to start the transition?

I also spent some time talking to the Hadoop distro vendor MapR, bouncing around the IT Operations Analytics concept I'm trying to get some traction on. They too have a really interesting offering, though their marketing has let them down up to now. Basically, their approach is not to champion particular products within the Hadoop ecosystem; rather, they will support anything that runs on top of their MapR-FS file system.


With MapR-FS they've basically ripped out HDFS and replaced it with a file system that addresses some of HDFS's key issues, yet still supports the HDFS APIs:

  1. The NameNode is a bottleneck
  2. Recovery times after a NameNode failure are lengthy
  3. Lack of POSIX compliance
  4. No support for legacy UNIX/Linux apps
  5. Poor small-file support

You can find out more on the architecture at their website, but the inclusion of an NFS server really helps get data into Hadoop, and as we are finding, the sooner we can start storing data, the more we'll have once we work out what we're going to do with it ;-). They also have a tonne of NetApp-like snapshot and replication tools, which is no surprise given their CTO is ex-NetApp.
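
To make the "still supports the HDFS APIs" point concrete, here's a minimal sketch using the standard Hadoop FileSystem API, the same Java you'd run against vanilla HDFS. A caveat: I haven't run this against a MapR cluster yet, so the maprfs:/// default-filesystem URI and the MapR client libraries being on the classpath are assumptions on my part.

    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MapRFsWriteExample {
        public static void main(String[] args) throws Exception {
            // Standard Hadoop configuration; on a MapR client node the
            // default filesystem is assumed to be maprfs:/// not hdfs://
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "maprfs:///");

            FileSystem fs = FileSystem.get(conf);

            // Write a file exactly as you would against HDFS; code
            // written to the HDFS API should run unchanged.
            Path path = new Path("/tmp/hello.txt");
            try (OutputStream out = fs.create(path, true)) {
                out.write("hello from the HDFS API\n".getBytes("UTF-8"));
            }

            System.out.println("Exists: " + fs.exists(path));
            fs.close();
        }
    }

The appeal is exactly that: nothing MapR-specific in the application code, just a different filesystem underneath.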

I'm about to start kicking the tyres with MapR, so will report back once I have a bit more experience, but I'm impressed with what I've seen thus far.

Signing off.

Sting... sorry, Alex.

Tuesday, 25 March 2014

Can we really store everything forever?

There seems to be a growing expectation within organisations that we store every piece of data forever. Whenever I raise retention policies, particularly with legal or compliance teams, the response is that we need to keep data forever. Whilst I understand that some data is required to be kept under legal hold, do we really need to store everything forever? Sometimes there can even be a risk in keeping records beyond the date required by the regulators... On the flip side, in the world of Big Data we may not yet know whether a piece of data is valuable.



So the challenge is that archived data takes up valuable space in an organisation's data centre, and is typically not used for revenue-generating purposes. It may allow a business to operate (think regulatory data), but the business doesn't make money from it, or doesn't yet... So it's a pure bottom-line cost, a cost that needs to be minimised as much as possible to improve the P&L. I like to call this type of data Write-Once-Read-Rarely, or cold storage.



Historically, long-term archive meant magnetic tape. Tape is still the highest-density storage medium, but the challenge with tape is that it degrades, so ensuring data integrity is difficult, particularly for data with long retention periods. Managing the tape pool becomes a full-time job, and remember that this is a non-revenue-generating post. Whilst disk is not as dense as tape, data integrity can be provided in software: parity, checksums, and continual checking. A lot of organisations have moved to pure disk storage solutions for archive and backup. Another benefit of disk is faster data retrieval times.
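
To illustrate what I mean by providing integrity in software, here's a minimal sketch of a checksum scrub: record a digest when the file is archived, then periodically re-hash and compare, flagging silent corruption before the data is actually needed. Real archive platforms do this at scale with parity and erasure coding; the class and argument layout here are purely mine for illustration.

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    public class ScrubExample {
        // Hash a file with SHA-256; the digest would be recorded
        // alongside the archived data at write time.
        static String sha256(Path file) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            Path archived = Paths.get(args[0]);
            String expected = args[1]; // digest recorded at archive time

            // The "continual checking" part: re-hash and compare.
            String actual = sha256(archived);
            System.out.println(actual.equals(expected)
                    ? "OK: " + archived
                    : "CORRUPT: " + archived);
        }
    }
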



There is another type of storage that can slot into an information life-cycle management (ILM) strategy: cloud storage. The interesting thing with cloud is that the funding model typically changes from CapEx to OpEx, so it's pay-per-use, and it plays to my earlier point about non-revenue-generating infrastructure, i.e. outsource it to a utility provider. Obviously there are the usual cloud concerns (security, legal, mobility, etc.), but if you can put compensating controls in place, the value proposition is compelling.



Not all data types are suitable for storing in the cloud: there may be regulatory or jurisdictional constraints that mandate where data is at rest. So I think we need some form of hybrid cloud storage: a mix of on- and off-premises storage, where the requirements and constraints dictate where the data is placed as part of the life-cycle management. The end goal is to keep the cost back to the business as low as possible, so that a company's resources are used for driving up revenue.
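
To show what "requirements and constraints dictate where the data is placed" might look like in practice, here's a hypothetical placement rule. The tier names and thresholds are ones I've invented purely for illustration, not any product's API:

    import java.time.Duration;

    public class PlacementPolicy {
        enum Tier { ON_PREM_DISK, ON_PREM_TAPE, CLOUD_COLD }

        // Hypothetical rule: constraints first, then cost.
        static Tier placeFor(boolean jurisdictionBound, boolean legalHold,
                             Duration sinceLastRead, Duration retention) {
            // Regulatory or jurisdiction constraints trump cost: the data
            // stays on premises regardless of how cold it is.
            if (jurisdictionBound || legalHold) {
                return sinceLastRead.toDays() > 90 ? Tier.ON_PREM_TAPE
                                                   : Tier.ON_PREM_DISK;
            }
            // Write-Once-Read-Rarely data with a long retention period is
            // the natural candidate for pay-per-use cloud cold storage.
            if (retention.toDays() > 365 && sinceLastRead.toDays() > 90) {
                return Tier.CLOUD_COLD;
            }
            return Tier.ON_PREM_DISK;
        }

        public static void main(String[] args) {
            // Seven-year retention, untouched for over a year: off it goes.
            System.out.println(placeFor(false, false,
                    Duration.ofDays(400), Duration.ofDays(7 * 365))); // CLOUD_COLD
        }
    }

The point of encoding this as a rule rather than a manual decision is that the life-cycle tooling can re-evaluate placement as data ages, keeping the cost curve down without anyone having to touch it.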