This post is based on an e-mail series I sent to a friend in industry. He was responsible for overseeing the design of a clinical trial, and he asked me as a professional to tell him in simple words how certain basic statistical ideas work. He’s intelligent and educated, but his own expertise doesn’t happen to be in statistics. Like many such people, he thinks of statistics as an impenetrable black art. As I mentioned, he’s a friend, so I tried to level with him and explain in simple, colloquial language what I wanted him to know. Other people might like to read it. I’ve tried to avoid technical or mathematical prerequisites and to explain as I go along, but it probably would help if you’ve previously heard at least some buzz about terms like p-value.
Clinical Trial Design — What do probability and statistics have to do with it?
People are familiar with the idea of random variability. When you flip a coin 10 times, you “expect” 5 heads and 5 tails—but you’re not at all surprised to get different numbers. Perhaps this time you might get 6 and 4 . . . or perhaps 4 and 6. You could get 7 and 3, and you wouldn’t be knocked out of your chair if you got 8 and 2.
The chance discussed above, of getting 8 heads, has 4.4% probability. Since the chance of 8 tails is the same, their combined probability is 4.4% + 4.4% = 8.8%. So, out of 100 people, we’d expect about 9 to get either 8 heads and 2 tails or else 8 tails and 2 heads. In practice, lopsided outcomes are a definite, if infrequent, occurrence: some observers will get them. The probability of all heads is only 0.1% so, in a hundred people, we likely wouldn’t see anybody getting that; but in a crowd of 1000 people there could be 1 with all heads, and another with all tails. They would get a very distorted picture of the shape of the universe but, still, if you happened to be one of those individuals, that’s what your data would tell you.
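For readers who like to check the arithmetic, here is a minimal Python sketch of the coin-flip numbers. The function name `prob_heads` is just an illustrative choice; the only thing assumed is a fair coin flipped 10 times.

```python
from math import comb

def prob_heads(k, n=10):
    """Probability of exactly k heads in n flips of a fair coin."""
    return comb(n, k) / 2**n

p_8_heads = prob_heads(8)                    # 45 ways out of 1024
p_8_either = prob_heads(8) + prob_heads(2)   # 8 heads, or 8 tails (= 2 heads)
p_all_heads = prob_heads(10)                 # 1 way out of 1024

print(f"8 heads and 2 tails:      {p_8_heads:.1%}")    # 4.4%
print(f"8 of one side, 2 of other: {p_8_either:.1%}")  # 8.8%
print(f"all 10 heads:             {p_all_heads:.1%}")  # 0.1%
```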
With coin flips, this variability is no real problem. You’re so certain about the true odds in a coin flip that observing a score of 8 and 2 wouldn’t make you question some randomly selected coin you’d picked up as change from a merchant. You wouldn’t seriously wonder if its “true” heads rate was 80%. You’d normally reject the “evidence” provided by the data and cling to the more reasonable hypothesis that the coin is fair. Problems only arise when we don’t know the “true” rate a priori. Then the data, even though randomly perturbed, are the best evidence we have available. We have to steer a course that allows us to be sensitive to what empirical evidence tells us but doesn’t make us superstitiously over-interpret the apparent actuality of data patterns that might, in fact, be meaningless random scatter.
In clinical trials the variation arises because the random selection of subjects and their random assignment to treatment could bring an atypically large number of “difficult” or of “easy” subjects to one treatment over the other. Treatment A, which has a true success rate of 50%, could easily show 3 successes in 10 subjects, while Treatment B, which has a true rate of only 40%, could show 5 successes in 10 subjects. Then, based on our total combined sample of 20, we could become wrongly, stubbornly convinced that Treatment B is better.
The squares in the gray-shaded diagonal represent outcomes in which an equal number of successes is observed for both treatments, even though Treatment A is superior. The sum total of the probabilities of these outcomes equals 16.0%. In practical terms, if this were an experiment assigned by a biology professor to a class of 100 students, it could be expected that 16 students would get data wrongly suggesting that the two treatments are equally effective. The black-shaded squares above the diagonal represent outcomes in which Treatment B is observed to have more successes than Treatment A. The combined probability of these outcomes equals 24.8%, so the biology professor can expect about 25 of the 100 students to submit lab reports concluding wrongly that Treatment B is superior. Only 59 students in the class of 100 will observe data (the unshaded white squares below the diagonal) that will lead them to the correct conclusion.
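The three percentages can be recomputed by adding up the probability of every possible pair of outcomes. The sketch below assumes, as in the example, 10 subjects per treatment with true success rates of 50% (Treatment A) and 40% (Treatment B), and treats the two groups as independent binomial samples. The function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n subjects with success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def compare_treatments(p_a, p_b, n):
    """Return P(tie), P(B shows more successes), P(A shows more successes)."""
    tie = b_ahead = a_ahead = 0.0
    for a in range(n + 1):          # successes observed on Treatment A
        for b in range(n + 1):      # successes observed on Treatment B
            joint = binom_pmf(a, n, p_a) * binom_pmf(b, n, p_b)
            if a == b:
                tie += joint
            elif b > a:
                b_ahead += joint
            else:
                a_ahead += joint
    return tie, b_ahead, a_ahead

tie, b_ahead, a_ahead = compare_treatments(p_a=0.5, p_b=0.4, n=10)
print(f"equal successes (diagonal): {tie:.1%}")
print(f"B looks better (wrong):     {b_ahead:.1%}")
print(f"A looks better (correct):   {a_ahead:.1%}")
```

Changing `n` in the last call shows how slowly the wrong answers fade as the groups get larger.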
In “small” samples testing new treatments in a few patients, the possibility of error is very real.
We will continue this example later.
The root problem is that the particular characteristics of our random sample may or may not match the reality of the world outside. The point of clinical trial design and interpretation is to develop rigorous, quantitative markers to control the risk of error. Even more, the object is to discover the truth.
People find statistics hard because they think it’s mystical (not to mention too mathematical). It’s not, though—not at all. It’s totally practical. It’s actually very much like accounting. If you lay your decision out in terms of cost/benefit, the answer isn’t usually hard. You have to decide what level of risk you can afford and rationally justify. Remember that a negative trial is likely in practice to kill that particular development program. This is not only costly to the sponsor but it has a terrible cost to society, which loses out on finding a useful treatment.
In other words, clinical trial design is, or should be, completely practical.
Clinical Trial Design — What kinds of error are there?
In simple terms, schematically, there are 4 possibilities depending on whether the data in our particular sample agree with the reality in the population at large.
In 2 of these 4 possibilities our idea corresponds to reality; in the other 2, we walk away with the wrong idea—we make an error.
The possibilities in the bottom row happen in the universe where the treatment really isn’t effective. This is the universe that regulatory agencies are worried about: namely, in that case they don’t want to approve it.
But possibilities in the top row happen in the universe where the new treatment is effective. This is the universe that you as the sponsor are concerned with: your concern is not to miss out on evidence that will convince the regulators.
In other words, you and the regulators have distinct, though overlapping, concerns. Type II error is the kind that you are trying hardest to avoid (the kind you would make if you failed to convince the regulators that an effective treatment works), while they’re worried most about Type I error.
As I mentioned, though, the sponsor’s real concerns do overlap with the regulator’s. Most regulators are well aware of the need for adequate protection against Type II error. Also, an intelligent sponsor wants to avoid a Type I error, which would only lead him or her into further expending resources on a development program that will later result in failure or possible legal repercussions.
Clinical Trial Design — How to quantify this?
If treatment isn’t effective then the possibilities in the bottom row are the only outcomes, and they are exclusive. On that assumption, their probabilities therefore must add up to 1. The probability of making an error is called α (“alpha”) and the probability of avoiding error is 1–α. Rather than specify alpha ahead of time, many people like to look afterwards and see what the risk of Type I error is for the data actually obtained. This is called the p-value. Roughly speaking, it’s the risk of a false positive conclusion: the probability that a treatment with no real effect would produce data at least as extreme as what you actually observed.
Similarly, for treatments that are effective: the probabilities in the top row must add up to 1. The probability of making an error is called β (“beta”) and the probability of avoiding error is 1–β, which is also called the “power.” Power is the probability of avoiding Type II error, for an effective treatment, and therefore of proving its effectiveness.
In clinical trials, what people will look at is whether p<.05. This emphasis is confusing. It makes us think that power is extraneous—bells and whistles that the college professors like. Nothing could be more wrong. The power is the probability (assuming all along that the new treatment actually works) that you’ll in fact be able to get p<.05 and therefore be able to satisfy everybody. In simple terms, power is your chance of success. Making sure you have adequate power is equivalent to making sure you succeed.
Of course there’s a drawback—namely, you have to pay for power with sample size. So you can’t usually have unlimited power. But you want as much as you can afford. Power is a form of insurance. You’d like to have $100 million in life insurance so your family is protected — but paying the premiums on that much would make them live in poverty now. And so you have to strike a practical balance: protection versus cost.
So, how much is “enough?” Many people test at 80% power. How does one interpret that number? Well, it’s actually hard to interpret—as long as you’re only looking at your one particular study. It’s even harder if you believe in a mystical concept called “luck.” I mean, 80% might sound great — because it’s so much more than the 50% you get on a fair coin.
But look at it from a statistician’s perspective. He or she consults for many people, and so usually has several studies going at once. Suppose he has 10 clients. (Also suppose that all of them are testing treatments that are effective.) If he lets them test at 80% power, then he can expect 8 clients to have successful trials. But he can also expect 2 of them to fail. I personally, as a professional, find these numbers unacceptable. If 2 out of 10 clients get hurt, it’s bad for my business and bad for my sense of responsibility to my clients and to society. I prefer numbers more like 1 in 20, or — at the very least — 1 in 10.
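The consultant's arithmetic is simple enough to put in a few lines. This sketch assumes, as the paragraph does, that every client is testing a treatment that really works, so the expected number of failed trials is just the number of clients times β (that is, 1 minus power). The function name is illustrative.

```python
def expected_failures(n_clients, power):
    """Expected number of failed trials when every treatment is truly effective."""
    beta = 1 - power            # probability of a Type II error
    return n_clients * beta

for power in (0.80, 0.90, 0.95):
    failures = expected_failures(10, power)
    print(f"power {power:.0%}: expect about {failures:.1f} of 10 effective treatments to fail")
```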
For yourself, on the other side of the table, you have to decide if you want to be in a 1-failure-in-5 situation — or if you prefer 1 in 20.
Clinical Trial Design — What are the steps to find such numbers to design a trial?
Background — an analogy
Let’s try to make the ideas concrete by using a simple analogy. Suppose I want to start here in Buffalo, NY, and drive to Portland, ME.
You can easily work out arithmetic equations to relate the following 4 quantities:
- number of miles to be driven
- miles driven per gallon
- cost of gas per gallon
- how much, in total, I have to pay for gas
If I know any three of these, I can calculate the fourth.
Which of these quantities are under our control and what can we do about them? My car gets about 30 mpg on the highway. Since I don’t contemplate buying a new car for the trip, we can consider this figure as a fixed constant of the universe. Next, gas is currently about $3.00 per gallon. Since there’s no real way to convince service station operators (the regulatory agencies) to give me a special gas price for the sake of my trip, we can consider this price to be a fixed constant: not within my control to change. So, it looks like we’re talking about 10 cents per mile.
There now remain only two adjustable quantities: how far do I want to drive — and how much do I spend total?
It seems like I could decide how much to spend and then use the formulas to figure out how far I can drive. . . . Or else I could do the reverse: first decide how far I need to drive and then the formulas would tell me how much I’d need to spend.
However, the first of these options is illusory. In fact, Google informs me that Portland is 557 miles from Buffalo. If I don’t buy enough gas, I won’t get there. Suppose I can only spend $37.00. Well, then, bad news: that much gas will leave my tank empty just this side of Springfield, MA — where I’ll then be faced with a serious situation. If I do want to make it to Portland, it will cost me a total of $55.70.
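Here is the gas arithmetic as a small Python sketch, using the figures quoted above (30 mpg, $3.00 per gallon, 557 miles). The function and constant names are illustrative only.

```python
MILES_PER_GALLON = 30      # treated as a fixed constant of the universe
PRICE_PER_GALLON = 3.00    # set by the service stations, not by me

def trip_cost(miles):
    """Dollars of gas needed to drive the given number of miles."""
    return miles / MILES_PER_GALLON * PRICE_PER_GALLON

def range_for_budget(dollars):
    """Miles I can drive before a given gas budget runs out."""
    return dollars / PRICE_PER_GALLON * MILES_PER_GALLON

print(f"Buffalo to Portland (557 miles): ${trip_cost(557):.2f}")
print(f"A $37.00 budget gets me {range_for_budget(37.00):.0f} miles")
```

The two functions are the same equation solved for different unknowns, which is the point of the analogy: fix three quantities and the fourth follows.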
Now, an executive might tell me that I’m an impractical idealist, that I need to get more in touch with the realities of finance and budgeting and to learn to do what I can afford. Spending a “more reasonable” $37.00 will at least get me a good distance on the road to Portland. My response is that it’s the executive who’s being impractical. Running out of gas in Springfield still leaves me with a heck of a long walk to Portland—unless I’m eager for the adventures and uncertainties of hitchhiking. And, in any case, I don’t feel like leaving my car behind in Springfield.
In reality, there are only two rational possibilities. Either somehow come up with $55.70 — or else postpone the trip, and save my $37.00, until some time when I’ve got more money or can find a rider to share the cost. But setting out without enough gas money is not practical: it’s just plain stupid.
The design variables: effect size, α (“alpha”), power, and sample size. How to calculate them.
In a clinical trial, just as in the gas mileage example, there are mathematical formulas (the values can often be looked up in tables or computed with software, so the formulas themselves don’t always need to be known) that relate 4 quantities:
- the effect size the trial should be able to detect
- α — the probability of Type I error
- power — the probability of avoiding Type II error
- sample size
If we know any 3 of these, we can work out the fourth.
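To make the "know three, get the fourth" idea concrete, here is a minimal sketch for a trial comparing two success rates. It uses the common normal approximation for a two-sided test with equal group sizes, which is my choice for illustration and not necessarily the method behind any particular calculator or table. The function names are hypothetical, and the approximation is rough for very small samples.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate subjects per group, given effect size, alpha and power."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)   # two-sided test
    z_power = _STD_NORMAL.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power, given effect size, alpha and sample size."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    std_error = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    return _STD_NORMAL.cdf(abs(p1 - p2) / std_error - z_alpha)

# Same effect size (50% vs 40%) and alpha; solve for each unknown in turn.
n_needed = sample_size_per_group(0.5, 0.4, alpha=0.05, power=0.80)
print(f"subjects per group for 80% power: {n_needed}")
print(f"power with that many subjects:    {approx_power(0.5, 0.4, n_needed):.2f}")
```

Solving for the detectable effect size instead would need a numerical search over `p1` or `p2`, which is left out to keep the sketch short.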
See, for example, the calculators at Statistical considerations for clinical trials and scientific experiments by David A. Schoenfeld at Massachusetts General Hospital.
Let’s illustrate how to use Dr. Schoenfeld’s page to calculate power. Earlier in this article, we discussed an example in which each member of a class of biology students did an experiment comparing 10 subjects on an active treatment with a success rate of 50% to 10 subjects on a placebo with a 40% spontaneous recovery rate. Suppose now each student is required to do a formal statistical test. Will they be able to correctly detect the difference between treatments?
Go to Dr. Schoenfeld’s page, where you will be asked to tell what kind of study you are doing. For this example, you have a “parallel study” (meaning some patients are on the active treatment while other patients, in parallel, are on control) rather than a “crossover study” (in which each patient would start on one of the treatments and then cross over to the other). Also, your outcome measurement is a yes/no indicator of success or failure rather than a numerical value or a time. So you would click the upper left link in the table marked “Find.”
This brings up a separate window with a Java applet. Be sure that “two sided” is selected. For “p-value,” enter .05. The applet requires that you refer to the better treatment as “Treatment A,” so enter .5 as the “Response rate of Rx A” and .4 as the “Response rate of Rx B.” Tell the applet that there will be 10 patients on each treatment. Then press the button marked “Compute power.” The answer will be 0.0200596. If you press the “Write paragraph” button, it will give you a wording to use that summarizes your situation.
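If you don't have the applet handy, the calculation can be approximated by brute force. The sketch below assumes a two-sided Fisher exact test at the .05 level, which is my assumption about the kind of test involved and is not stated by the applet's page as quoted here. It enumerates every possible pair of results for 10 patients per arm and adds up the probability of the pairs that would come out significant. Under that assumption, the chance of a significant result favoring Treatment A comes out at about 0.02, in line with the figure quoted above.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes among n patients with success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def fisher_two_sided_p(a, b, n):
    """Two-sided Fisher exact p-value for a of n successes versus b of n."""
    total = a + b
    denom = comb(2 * n, total)

    def prob(x):
        # Chance that arm A holds x of the `total` successes if treatments are equal.
        return comb(n, x) * comb(n, total - x) / denom

    observed = prob(a)
    lo, hi = max(0, total - n), min(n, total)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))

def exact_power(p_a, p_b, n, alpha=0.05):
    """Probability of a significant result favoring A, and favoring B."""
    favors_a = favors_b = 0.0
    for a in range(n + 1):
        for b in range(n + 1):
            if fisher_two_sided_p(a, b, n) <= alpha:
                weight = binom_pmf(a, n, p_a) * binom_pmf(b, n, p_b)
                if a > b:
                    favors_a += weight
                else:
                    favors_b += weight
    return favors_a, favors_b

favors_a, favors_b = exact_power(p_a=0.5, p_b=0.4, n=10)
print(f"significant, favoring A: {favors_a:.4f}")
print(f"significant, favoring B: {favors_b:.4f}")
```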
Returning to the example, we found above that only 59 students out of 100 will get data pointing to the qualitatively correct conclusion. We’ve now found that only 2 of them will have strong enough data to be able to “prove,” not just “suspect,” their result. Using small data sets to “get an approximate quick-and-dirty” idea whether or not to proceed to the next stage of development is an extremely bad idea that will likely lead you into the wrong decision. There are techniques that can be used in a small study, but they should only be used under the direct guidance of a professional statistician.
The design variables: effect size, α (“alpha”), power, and sample size. How to use them.
Which of these 4 variables are in our control?
Let’s postpone discussing the effect size for a few paragraphs. The regulatory agencies, for the reasons discussed above, are careful about retaining the right to set α, so it’s not in our control. This leaves two quantities free: power and sample size. If we know one, then the formulas tell us the other.
How do we use this to plan?
Power and sample size
Few executives understand power. What they see is that the regulatory agencies are interested in α — in the p-value — and therefore it must be what’s important. They tend to lock themselves into an unnecessarily adversarial relationship with the FDA, trying post hoc to chisel a few points of α. They regard β as a theoretical academic nicety that only pleases the statisticians. They should instead recognize that power is what ensures in advance that they’ll make the α hurdle that the regulators are setting. Power is what buys us the gas to get our car to Portland without having to get out at Springfield and walk. It’s the thing we’re most concerned about.
However, executives do immediately understand sample size—because it’s directly related to cost.
Thus, the usual thought process is the one that seems so dumb in the “driving to Portland” analogy. People figure out how much sample they can afford (in terms of budget, ability to recruit subjects, and administrative burden) and then they use that number to see how much power they end up with. They’re happy to get 80% because they don’t understand the basic idea anyway, and by now 80% is traditional so it seems prudent and cautious.
If you challenge them about this, they’ll reply that they’re doing as well as they can and that the trial will still give them “some idea of the right answer” — that it’s better than doing nothing. But the truth is that it’s worse than nothing. They’re paying money for a trial that has a very realistic potential for giving them (and the public) a misleading answer that will result in decisions that will hurt them. They’d be better off just not to do the trial and save their money for a better day—or use it for a different, more promising project.
So, how much power would be adequate? I feel like I’m guiding my clients into a gambling casino. At 80% power, they will each sit down at a table with 4 other clients and the croupier will distribute 5 identical envelopes, one of which will say “No” inside. I have zero control over which client gets the bad one—and, even if I did, I want them all to win. This leaves me with no choice but to try to guide them to a different table (95% power), one where 20 people will sit and each will have much less chance for the single “No” envelope.
(There are other ways I can improve the game for them, although I can’t completely remove the inherent element of chance. For example, there may be two distinct types of subjects and one may be more susceptible to benefit from the new treatment than the other. I can try to ensure that the client has a reasonable chance of proving efficacy, at least in the one subgroup, without having the effect diluted out by the other group. Generally, I regard this as my job: to get my clients to sit at a gambling table where they’ll have the best chance for some kind of decent win.)
Power and the effect size
There is, however, another aspect to power that has to be kept firmly in mind. Recall that I wrote above that there are 4 quantities in the equation and I deferred talking about the effect size. Let’s turn our attention to that now. Assume that α is fixed and that we’ve decided on a reasonable level of power. So, the sample size and the effect size are the 2 free variables, which have to be traded off against each other.
Think of sample size as the magnification control on a microscope: a larger sample allows us to discern smaller effects with the same degree of assurance. If we’re looking for a larger effect, then a smaller sample will be sufficient, but if we need to see smaller effects, we need a larger sample.
It’s a common mistake to use an unrealistic estimate of the effect size in order to falsely inflate the power and allow a smaller sample.
So, how do we know, or determine, the effect size?
Again, the usual method (estimating it according to pilot data) is not a very good one.
To illustrate, for a spinal injury trial, we had pilot data indicating 1 success out of 14 patients (7%) on placebo, but 7 successes out of 14 (50%) on the new treatment. This large apparent effect would seem to permit a modest sample size. But, wait a minute. Although our pilot data did show a significant difference between treatment and placebo (p=.03), the small sample size means that the confidence intervals around the effect size are very wide. The success rate on placebo might well be much higher than 7% and the success rate on the new treatment might be much lower than 50%. So the effect size might be smaller than we think. Anyway, we don’t need it to be so large. A treatment for spinal cord injury would not have to be so dramatic in order to be clinically important: even one that “only” raised the success rate (the proportion of patients avoiding paralysis) from 20% to 30% would have a major impact.
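To see how little 14 patients per arm pin down, here is a minimal sketch that recomputes a p-value for the pilot data and puts a rough 95% interval around each success rate. Two assumptions are mine and not from the trial itself: the p-value is computed with a two-sided Fisher exact test, and the intervals use the Wilson score method, which is one of several reasonable choices for small samples.

```python
from math import comb, sqrt
from statistics import NormalDist

def fisher_two_sided_p(successes_a, n_a, successes_b, n_b):
    """Two-sided Fisher exact p-value for two observed success counts."""
    total = successes_a + successes_b
    denom = comb(n_a + n_b, total)

    def prob(x):
        # Chance that arm A holds x of the `total` successes if treatments are equal.
        return comb(n_a, x) * comb(n_b, total - x) / denom

    observed = prob(successes_a)
    lo, hi = max(0, total - n_b), min(n_a, total)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a success rate (an illustrative choice)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

p_value = fisher_two_sided_p(1, 14, 7, 14)
placebo_lo, placebo_hi = wilson_interval(1, 14)
treated_lo, treated_hi = wilson_interval(7, 14)

print(f"pilot p-value: {p_value:.3f}")
print(f"placebo success rate: between {placebo_lo:.0%} and {placebo_hi:.0%}")
print(f"treated success rate: between {treated_lo:.0%} and {treated_hi:.0%}")
```

With these intervals the plausible ranges for the two arms come close to touching, which is why the apparent 7% versus 50% gap is a shaky basis for planning.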
Therefore the correct way to estimate the effect size has nothing to do with pilot data or mathematical calculations. Instead, one uses medical judgment to determine the smallest effect size that would be clinically significant: the smallest effect that we want our trial to have adequate power to detect. (The way to use pilot data is to do a reality check to see that this effect size is within the confidence limits allowed by the data. If not, then you probably shouldn’t do the trial.)
Clinical Trial Design — Summary of steps
Using our knowledge of the variables in the sample size equations (if not necessarily of the equations themselves), the steps are these. The significance level α is set by regulators and is not in our control. We should set power at 95%, which would make β equal to α, or .05 (or, at the very least, at 90%). Then we would get medical domain experts to help us decide the minimum clinically significant effect size. Finally, having established three of the four variables, we use the formulas (or tables or software) to compute the sample size.
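As a worked illustration of these steps, the sketch below plugs in the spinal injury example from earlier: α of .05, and a minimum clinically important improvement from a 20% to a 30% success rate. It then compares what 80%, 90% and 95% power would cost in subjects. The formula is the usual normal approximation for comparing two proportions with equal group sizes, used here as an assumption for illustration; a real design should be checked with proper software or a statistician.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_treated, alpha, power):
    """Approximate subjects per arm for a two-sided test of two proportions."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_treated - p_control) ** 2)

ALPHA = 0.05                        # step 1: set by the regulators
P_CONTROL, P_TREATED = 0.20, 0.30   # step 3: smallest clinically important effect

# Steps 2 and 4: choose the power, then compute the sample size it requires.
sizes = {power: sample_size_per_group(P_CONTROL, P_TREATED, ALPHA, power)
         for power in (0.80, 0.90, 0.95)}

for power, n in sizes.items():
    print(f"power {power:.0%}: about {n} subjects per arm")
```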
Clinical Trial Design — further reading in this blog
Readers of this post can continue in More on clinical trial design for beginners which offers a pdf book chapter download.
A further continuation is available as a pdf presentation download, geared to students in a graduate course in clinical trial management: Statistical Design: Clinical Development of Drugs and Biologics.
A different, related topic they may also be interested in: Medicine: Using probability to treat people versus using it to treat groups.