Computing Cpk statistics using SQL

Part 1 – Introduction

For the past several years I have worked on a myriad of projects for a favorite client of mine, a chemicals company. Recently, I was called in to help design and develop a Production Reporting application for them, to meet new requirements from their new German-based owners. The application reports on things like how much product is produced, what was consumed to produce it, production yields, etc. Several factories are involved, so we’re tying all the data into one database.

The first challenge involved converting all the English-based measurements into metric. Yes, I did say it was one of those metric-loving European countries, right? So, we’ve been going over different conversion equations – some of them are interesting, because they’ve involved variable factors like the energy content of natural gas and the specific gravity of oil.

Still, though, a piece of cake – just linear equations to convert from one unit of measure to another. I had more problems with the quality of the data (often missing from several plants) than with the equations themselves. So, we’re progressing fine on the application; then, one day I’m in a meeting with all the plant managers, and they start talking about CPK statistics. They talked as if of course everyone knows what this statistic is. I’m sitting there, though, going “huh?” I have a computer science undergraduate degree, with oodles of calculus, differential equations and statistics, and an economics graduate degree with its own heavy mathematical load in correlation analysis and statistics. Yet, I’d never heard of Cpk (properly, that’s how the statisticians spell it: C sub pk. I still don’t know what the initials mean).

Never one to feel stupid in front of a client, I finally felt free to ask, “What is Cpk?” They knew me well enough to know I wasn’t really stupid. “It’s one of the Six Sigma statistics; it measures the variability of a process.” Oh, I’m thinking … Deming. Yep, I used to work for a TQM (Total Quality Management) consulting company, so I’m quite familiar with Deming, kaizen, and such. Still, with all that knowledge, I hadn’t heard of Cpk – but knowing it was related to variability, I was set on the right track.

I started googling for the equation used to compute Cpk, and was surprised to find so many different answers! It started making me think this must be some sort of pseudo-statistic – one that hasn’t passed muster with real statisticians yet, with some sort of theorem proving its significance or some such.

The problem became even worse when I started examining the client’s software that was already computing Cpk from all the lab data. One was JavaScript code, built into a specialized document management application. The other was just some spreadsheet equations. Each used a different algorithm. That, plus all the different equations floating around on the ‘Net, had me really puzzled!

It turns out that both the JavaScript code and the spreadsheets were using just approximations of Cpk. I finally tripped across a good “Six Sigma” book that gave me a concise equation and an explanation of what’s going on. The confusion involves whether you’re talking about Cpk, Cp, Ppk, standard deviation (sigma), or estimated standard deviation (sigma hat), etc. Ok, my head was ready to hurt – despite all my years of math education, I hadn’t done much beyond basic accounting – i.e., addition and subtraction – all these years in the business world. Now, I was back in real statistics.

Next Chapter: Will the Real Cpk Please Stand Up … and … How to Cpk the SQL Way.

Published by

kevin