Unless you have been living under a rock for the last couple of weeks, you will be aware of the significant outage that the RBS/NatWest/Ulster Bank group has experienced.
Just in case you do claim a rock as your humble abode, take a look at the links below for details:
http://www.theregister.co.uk/2012/06/25/rbs_natwest_what_went_wrong/
http://www.telegraph.co.uk/finance/newsbysector/banksandfinance/9355028/City-puts-cost-of-RBS-glitch-at-between-50m-and-100m.html
This outage has sent a ripple through the mainframe world, especially among clients who run CA's venerable scheduling tool CA-7 and those looking to use offshore operations resources. Without wishing to point the finger at my former employer or the quality of CA-7, a product I have sold in the past, it would be remiss of me not to mention that other mainframe scheduling products are available:
http://www-01.ibm.com/software/tivoli/products/scheduler/
But seriously, beyond a bit of professional gloating and opportunism, the fundamental question arises: should large organizations run core business-critical applications on the mainframe when a single outage at a bank can cost upwards of £100m?
Well, let's look at some history:
- The mainframe has been around for some 40 years and throughout this time has been supporting key clients' IT needs.
- Back in 1974, IBM committed to maintaining investment in this platform
- IBM provides 'integrity' commitments for the O/S
- System z mean time between failure stats run into decades
But beyond these IBM commitments, large organizations have been running their IT largely glitch-free on this platform for years. In fact, the very rarity of this type of outage is the reason it has been such big news. By way of proof: 96 of the top 100 banks globally run System z, and this is the first time I have heard of an outage of this magnitude.
Also, other platforms are not without their high-profile glitches:
http://gigaom.com/cloud/will-amazon-outage-ding-cloud-confidence/
http://www.computerworlduk.com/news/networking/3246942/london-stock-exchange-tight-lipped-on-network-outage/
http://www.itnews.com.au/News/306421,vodafone-suffers-near-nationwide-3g-outage.aspx
In summary, one of my clients runs 93% of their business logic on 4 servers that cost them 7% of their total IT budget, whilst their 4,000 distributed servers contribute 7% of the business logic and consume 93% of the total IT spend. Which begs the question: which looks the better bet?
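To put that in concrete terms, here is a quick back-of-the-envelope calculation in Python. The 93%/7% splits are the client figures above; the total budget is an assumed, purely illustrative number.

# Back-of-the-envelope comparison of cost per unit of business logic.
# The 93%/7% splits come from the client example above; the total budget
# figure is an assumption purely for illustration.
total_budget = 100_000_000  # assumed annual IT spend in GBP

mainframe_cost = 0.07 * total_budget      # 4 servers, 7% of spend
mainframe_logic = 0.93                    # 93% of business logic

distributed_cost = 0.93 * total_budget    # 4,000 servers, 93% of spend
distributed_logic = 0.07                  # 7% of business logic

# Cost to deliver 1% of the business logic on each platform.
mainframe_per_point = mainframe_cost / (mainframe_logic * 100)
distributed_per_point = distributed_cost / (distributed_logic * 100)

print(f"Mainframe:   £{mainframe_per_point:,.0f} per 1% of business logic")
print(f"Distributed: £{distributed_per_point:,.0f} per 1% of business logic")
print(f"Ratio: ~{distributed_per_point / mainframe_per_point:.0f}x")

On those numbers, every percentage point of business logic delivered on the distributed estate costs well over a hundred times what it costs on the mainframe.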
Answers on a postcard to @stevendickens3
CA-7 is a great product and runs hundreds of thousands of critical production jobs for major FIs around the world. The schedule is made up of a number of important queues, as the workload (batch jobs) dynamically gets passed from input to execution and finally completion. If you have a catastrophic failure in the CA-7 scheduling software due to an abend, in most cases you can restart with a "warm start" and easily pick up where you left off; however, if you lose any or all of those queues, a cold start is required. In essence you now need to know where you are and where you were, and conduct a restore of those critical queues. That is not easy when you're in the midst of a very busy dynamic workload with numerous applications running simultaneously; it requires the correct execution recovery procedures and, of course, the position data at the time of failure.

If you don't have a professional team with knowledge of the recovery and the data, precious time can be lost, and if you're near the end of your batch window you're out of luck, as the next day's processing or the online systems need to be brought up for customers. Now that's how you get behind, as you're into the next day's processing. The root cause can of course be any one of many incidents, but deletion or corruption of those queues is a major headache for any application team. Good old "memo post" systems have been with us for 40-plus years, but don't get behind the window... $$$$$$$

I suspect from what I have read that a "cold start" was most likely required and unavoidable due to the loss of one or more queues. There are recovery logs that get built, of course, to aid in recovery, but it is still very complex, and working under pressure to process 20 million transactions' worth of batch is not trivial. There are ways to mitigate your risk and speed recovery, but I will leave that for another time.
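To picture what that warm start versus cold start distinction means, here is a toy sketch of a queue-based scheduler in Python. It is purely illustrative: the queue names, recovery log and restart routines are hypothetical stand-ins and bear no relation to CA-7's actual internals or interfaces.

# Toy model of a queue-based batch scheduler: warm start vs cold start.
# All names here are hypothetical illustrations, not CA-7 internals.
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    INPUT = "input"        # job defined, waiting on dependencies or time
    READY = "ready"        # eligible to run
    ACTIVE = "active"      # currently executing
    COMPLETE = "complete"  # finished


@dataclass
class Scheduler:
    queues: dict = field(default_factory=dict)        # job name -> current stage
    recovery_log: list = field(default_factory=list)  # append-only transition log

    def move(self, job, stage):
        # Record every transition in both the live queues and the recovery log.
        self.queues[job] = stage
        self.recovery_log.append((job, stage))

    def warm_start(self):
        # Queues survived the abend: simply pick up where we left off.
        return self.queues

    def cold_start(self):
        # Queues lost or corrupted: replay the recovery log to work out where
        # each job was at the time of failure. Anything left ACTIVE still needs
        # manual restart and repositioning decisions, which is where time goes.
        rebuilt = {}
        for job, stage in self.recovery_log:
            rebuilt[job] = stage
        self.queues = rebuilt
        return rebuilt


if __name__ == "__main__":
    sched = Scheduler()
    sched.move("POSTING01", Stage.INPUT)
    sched.move("POSTING01", Stage.READY)
    sched.move("POSTING01", Stage.ACTIVE)
    sched.move("EXTRACT02", Stage.INPUT)

    sched.queues = {}            # simulate losing the queues in the failure
    print(sched.cold_start())    # rebuilt: POSTING01 was ACTIVE, EXTRACT02 at INPUT

The point of the sketch is simply that a warm start reuses the live queue state, whereas a cold start has to reconstruct that state from logs before any job can safely be repositioned, and that reconstruction is where the batch window gets eaten.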
Agree, I spent 3 years selling CA-7: a great and normally very stable product. Also agree that a highly skilled (on-shore) team is a must. Shame that this outage has made the mainframe front-page news for all the wrong reasons. If you want to discuss further, please contact me on Twitter at @StevenDickens3, as it seems you have a very deep understanding of the topic and it would be good to talk...