Posts filed under Probability (66)

April 25, 2013

Internet searches reveal drug interactions?

The New York Times has a story about finding interactions between common medications using internet search histories.  The research, published in the Journal of the American Medical Informatics Association, looks at search histories containing searches for two medication names and also for possible symptoms.  For example, their primary success was finding that people who searched for information on paroxetine (an antidepressant) and pravastatin (a cholesterol-lowering drug) were more likely to search for information on a set of symptoms that can be caused by high blood sugar.  These two drugs are now known to interact to cause high blood sugar in some people, although this wasn’t known at the time the internet searches took place.

This approach is promising, but like so many approaches to medication safety it is limited by the huge number of possibilities.  The researchers knew where to look: they knew which drugs to examine and which symptoms to follow. With thousands of different medications, leading to millions of possible interacting pairs, and dozens or hundreds of sets of symptoms, it becomes much harder to know what’s going on.

Drug safety is hard.

April 19, 2013

Are Adam and Steve waiting out there?

Graeme Edgeler says on Twitter

I know many gay couples will want to marry quickly, but there *must* be a couple named Adam & Steve and we should totally let them go first.

Should we expect an Adam and Stephen couple? This is an opportunity to use public data and simple probability to get a rough estimate.

StatsNZ reported just over 5000 cohabiting male couples in 2006. That’s an underestimate of male couples, but probably an overestimate of those planning to marry soon.

I remembered seeing Project Steve, from the National Center for Science Education.  They collect signatures supporting the teaching of evolution from scientists named Stephen (after Stephen J. Gould) — they are currently up to 1268 — and make the point that under 1% of US males are named Stephen.

It turns out that they get this information from the US Census.  The most recent data are from 1990 (and, of course, are US), so they’re not ideal, but they will give us a rough idea.  Stephen comes in at 0.54%, and when you add in Stephan, Esteban, and Stefano it still comes to no more than 0.6%.  Adam is 0.259%.

Under random assignment, then, there would be less than a 1 in 10 chance that there’s a couple called Adam and Steve living together in NZ, and even then they might well not be planning to get married.
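
For anyone who wants to redo the arithmetic, here is a minimal sketch in Python. It assumes first names are independent within a couple and uses the frequencies above (the 0.6% figure bundling the variants); the answer is sensitive to these choices, so treat it as order-of-magnitude only.

```python
# Rough sketch using the 1990 US Census name frequencies from the post.
p_adam = 0.00259      # P(a given man is named Adam)
p_steve = 0.006       # P(named Stephen or a variant), upper bound

# Either partner could be the Adam, hence the factor of 2.
p_couple = 2 * p_adam * p_steve

n_couples = 5000      # cohabiting male couples (StatsNZ, 2006)

# Chance at least one such couple exists, under independence
p_at_least_one = 1 - (1 - p_couple) ** n_couples
print(round(p_at_least_one, 2))
```

Small changes to the inputs (which spelling variants to count, whether to double for either partner) move the answer either side of one in ten, which is why only a rough figure is worth quoting.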


[Update: Brendon correctly points out that I missed ‘Steven’, which is actually the most common variant. Apart from demonstrating that I’m an idiot, this doesn’t change the basic message.]

April 12, 2013

Random numbers from a radioactive source

Here’s a fun link which talks about the difference between truly random numbers and pseudo-random numbers. When we teach this, we often mention generation of random numbers (or at least the random number seed) from a radioactive source as one way of getting truly random numbers. Here is someone actually doing it. The sequel is well worth a watch too if you have the time.
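
The distinction is easy to demonstrate: a pseudo-random generator is a deterministic algorithm, so starting it from the same seed reproduces exactly the same “random” sequence, which no truly random physical source would do. A minimal illustration:

```python
import random

# A pseudo-random generator is deterministic: the same seed
# produces exactly the same sequence of "random" dice rolls.
random.seed(2013)
first_run = [random.randint(1, 6) for _ in range(5)]

random.seed(2013)
second_run = [random.randint(1, 6) for _ in range(5)]

print(first_run == second_run)   # True: pseudo-randomness is reproducible
```

A radioactive source, by contrast, has no seed to reset, which is what makes it genuinely random.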

April 11, 2013

Power failure threatens neuroscience

A new research paper with the cheeky title “Power failure: why small sample size undermines the reliability of neuroscience” has come out in a neuroscience journal. The basic idea isn’t novel, but it’s one of those statistical points that makes your life more difficult (if more productive) when you understand it.  Small research studies, as everyone knows, are less likely to detect differences between groups.  What is less widely appreciated is that even when a small study does see a difference between groups, that difference is more likely not to be real.

The ‘power’ of a statistical test is the probability that you will detect a difference if there really is a difference of the size you are looking for.  If the power is 90%, say, then you are pretty sure to see a difference if there is one, and based on standard statistical techniques, pretty sure not to see a difference if there isn’t one. Either way, the results are informative.

Often you can’t afford to do a study with 90% power given the current funding system. If you do a study with low power, and the difference you are looking for really is there, you still have to be pretty lucky to see it: the data have to, by chance, be more favorable to your hypothesis than they should be.  But if you’re relying on the data being more favorable to your hypothesis than they should be, you can also see a difference when there isn’t one there.
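
A small simulation makes the point concrete. This is a sketch with made-up numbers (a true difference of half a standard deviation, 10 subjects per group, and a simple two-sample z-test with known variance), not a reanalysis of the paper:

```python
import random
import math

random.seed(1)

def one_study(effect, n):
    """Simulate a two-group study; return True if the difference
    is 'significant' by a two-sample z-test (known sd = 1)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(effect, 1) for _ in range(n)]
    z = (sum(b) / n - sum(a) / n) / math.sqrt(2 / n)
    return abs(z) > 1.96

n_sims = 2000
hits = sum(one_study(effect=0.5, n=10) for _ in range(n_sims))
power = hits / n_sims
print(round(power, 2))  # roughly 0.2: most real differences are missed
```

With power around 20%, a "significant" finding has had to be lucky, and a collection of such findings will overstate the true effects.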

Combine this with publication bias: if you find what you are looking for, you get enthusiastic and send it off to high-impact research journals.  If you don’t see anything, you won’t be as enthusiastic, and the results might well not be published.  After all, who is going to want to look at a study that couldn’t have found anything, and didn’t?  The result is that we get lots of exciting neuroscience news, often with very pretty pictures, that isn’t true.

The same is true for nutrition: I have a student doing an Honours project looking at replicability (in a large survey database) of the sort of nutrition and health stories that make it to the local papers. So far, as you’d expect, the associations are a lot weaker when you look in a separate data set.

Clinical trials went through this problem a while ago, and while they often have lower power than one would ideally like, there’s at least no way you’re going to run a clinical trial in the modern world without explicitly working out the power.

Other people’s reactions

March 9, 2013

The HarleMCMC Shake

I’m sure that many of our readers are familiar with the latest internet trend, the Harlem Shake. Recently, a statistical version appeared that demonstrates some properties of popular Markov Chain Monte Carlo (MCMC) algorithms. MCMC methods are computer algorithms that are used to draw random samples from probability distributions that might have complicated shapes and live in multi-dimensional spaces.

MCMC was originally invented by physicists (justifying my existence in a statistics department) and is particularly useful for doing a kind of statistics called “Bayesian Inference” where probabilities are used to describe degrees of certainty and uncertainty, rather than frequencies of occurrence (insert plug for STATS331, taught by me, here).

Anyway, onto the HarleMCMC shake. It begins by showing the Metropolis-Hastings method, which is very useful and quite simple to do, but can (in some problems) be very slow, which corresponds to the subdued mood at the beginning of a Harlem Shake. As the song switches into the intense phase, the method is replaced by the “Hamiltonian MCMC” method which can be much more efficient. The motion is much more intense and efficient after that!
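
For readers who haven’t met it, here is how simple Metropolis-Hastings can be: a minimal random-walk sampler targeting a standard normal distribution (a toy target, not the Harlem Shake distribution):

```python
import random
import math

random.seed(42)

def log_target(x):
    # Unnormalised log-density of a standard normal target.
    return -0.5 * x * x

x = 0.0
samples = []
for _ in range(50_000):
    proposal = x + random.gauss(0, 1)   # random-walk proposal
    # Accept with probability min(1, target(proposal) / target(x)).
    if math.log(random.random()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

mean = sum(samples) / len(samples)
print(round(mean, 2))   # close to 0, the target's mean
```

The random-walk proposal is what makes the “subdued” phase of the video apt: each step is small, so exploring a complicated distribution can take a very long time.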

Here is the original video by PhD students Tamara Broderick (UC Berkeley) and David Duvenaud (U Cambridge):

http://www.youtube.com/watch?v=Vv3f0QNWvWQ

Naturally, this inspired those of us who work on our own MCMC algorithms to create response videos showing that Hamiltonian MCMC isn’t the only efficient method! Within one day, NYU PhD student Daniel Foreman-Mackey had his own version that uses his emcee sampler. I also had a go using my DNest sampler, but it has not been set to music yet.

So, next time you read or hear about a great new MCMC method, you should ask the authors how well it performs on the “Harlem Shake Distribution”. Oh and thanks to Auckland PhD student Jared Tobin for linking me to the original video!

March 3, 2013

Keep calm and ignore tail risk

Consider a problem in statistical decision theory.

Suppose you can create a t-shirt slogan at zero cost, and submit it for market testing.  If it’s popular, you make money; if it isn’t popular, you pay nothing.  It’s easy to see that you should submit as many t-shirt designs as you can generate: there’s no downside, and the upside might be good.

The problem is that you might create slogans that are sufficiently offensive to get the whole world mad at you.  And if you create more than half a million t-shirt slogans, it’s not all that unlikely that some of them will be really, really, really bad. And it’s not a convincing defense to say that the computer did it, and you didn’t bother checking the results.

January 28, 2013

More lottery nonsense

From Stuff, on this week’s Lotto

The winner chose the six lucky numbers they played regularly but, in a clever move, chose to play the same numbers on 10 different lines with each Powerball number. That way they were ensured to win Powerball if their lucky numbers came up.

This strategy doesn’t increase the chance of winning Powerball, because you can’t increase the chance except by cheating or magic.   If they match the six numbers they are certain to win Powerball, rather than having only a 10% chance, but this is exactly compensated by their ten-fold lower chance of having a ticket that matches all six numbers.

The strategy does reduce the chance of winning the first division, by a factor of ten, but increases the fraction of the first-division prize that they snag (in this particular win, from 1/4 to 10/13). Overall, the strategy reduces expected winnings compared to ten random picks. On the other hand, if you cared primarily about average expected return you wouldn’t be playing Lotto.
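
You can check the arithmetic exactly. This sketch assumes the rules implied above: a line matches the six main numbers with some fixed probability (here taken as 6-from-40), and there are 10 possible Powerball numbers:

```python
from fractions import Fraction
from math import comb

# P(one line matches all six main numbers), assuming 6 numbers from 40
p_line = Fraction(1, comb(40, 6))

# Strategy A: the same six numbers on 10 lines, one per Powerball number.
# If the six numbers come up, the Powerball is certainly covered.
p_a = p_line * 1

# Strategy B: 10 different lines, each carrying one Powerball number.
# Each line wins Powerball with probability p_line * 1/10.
p_b = 10 * p_line * Fraction(1, 10)

print(p_a == p_b)   # True: the chance of winning Powerball is identical
```

The factor of 10 gained in Powerball coverage is exactly cancelled by the factor of 10 lost in distinct six-number combinations, just as the post says.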

January 14, 2013

More about Lotto numbers …

From today’s New Zealand Herald:

Five Lotto numbers prove very lucky

Two Saturdays in a row, five of the same numbers were drawn in Lotto.

But a statistician says the chances of that happening aren’t as high as they may seem – 1 in 5500 ….

Said statistician is our very own Russell Millar. The rest of the story is here


September 28, 2012

Coincidences

There’s been a lot of coverage around the world of the Oksnes family in Norway: three of them have won significant sums in the lottery, all three at times close to when one family member, Hege Jeanette, was giving birth.

We’ve been asked what the probability was.  This isn’t even really a well-defined question, because it’s hard to say what would count as the same event.  Presumably a different Norwegian family would still count, and probably a Chilean family.  What if the family members had won near their own 30th birthdays rather than near the time their child/niece/nephew was born? Or if they’d each won on the day they graduated college? If we can agree on what counts as the same event, it’s then hard to work out the probability, because we’d need data on the number of lottery players all around the world, and on how many of them had children.

There are some things we can compute.  Suppose that there is one lottery prize a week in Norway, and that about 1 million of the country’s 4.9 million people play.  Divide them up into groups of six people.  The chance that three prizes end up in the same group of six people over ten years would be about 1.5 in ten thousand. Extending this to the whole world, it’s pretty likely that three people in the same family have won. That doesn’t cover the pregnancy, which restricts us to three periods of a few weeks in the ten years. Suppose we say that it’s three three-week periods that would give a close enough match.  The chance of all the wins lining up with the births would be about five in a million.  So, if we specified that the coincidence had to be about giving birth, it’s pretty unlikely.
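
Here is one way to sketch calculations along these lines, as a Poisson approximation. The setup (one prize a week, a million players, fixed groups of six, three three-week windows) follows the assumptions above, but the answer is quite sensitive to exactly how you frame the problem, so don’t expect the figures to match precisely:

```python
import math

draws = 52 * 10          # one prize a week for ten years
players = 1_000_000
group_size = 6
n_groups = players // group_size

# Prizes landing in one particular group of six are approximately
# Poisson with this mean over the decade:
lam = draws * group_size / players

# P(a given group collects at least three prizes)
p_three = 1 - math.exp(-lam) * (1 + lam + lam ** 2 / 2)

# Expected number of groups with three or more wins: small
expected_groups = n_groups * p_three

# Given three wins, the chance they all land inside three
# three-week windows scattered over the 520 weeks:
p_timing = (3 * 3 / draws) ** 3   # about five in a million
```

The timing factor is the striking part: requiring the wins to line up with the births multiplies an already small probability by a factor of roughly five in a million.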

Alternatively, we could ask how likely is it that a coincidence remarkable enough to get reported around the world would happen in a lottery.  The probability of that is pretty high, and we can tell, because lots of unlikely lottery coincidences do get reported.

Finally, we could ask what is the probability that the coincidence was really just due to chance. The answer to this one is easy. 100%.

September 5, 2012

Lotto and abstract theory

There is a recurring argument in statistics departments around the world about how much abstract theory should be taught to students, and how much actual applied statistics. One of the arguments in favour of theory, even for students who are being trained to do applied data analysis, is that theory gives you a way to substitute calculation for thought. Thinking is hard, so we try to save it for problems where it is needed.

The current top Google hit for “big wednesday statistics” offers a nice illustration.  It’s a website selling strategies to increase your chance of winning, based on a simple message

If you play a pattern that occurs only five percent of the time, you can expect that pattern to lose 95 percent of the time, giving you no chance to win 95 percent of the time. So, don’t buck the probabilities.

For example,

When you select your lotto numbers, try to have a relatively even mix of odd and even numbers. All odd numbers or all even numbers are rarely drawn, occurring only one percent of the time. The best mix is to have 2/4, 4/2 or 3/3, which means two odd and four even, or four odd and two even, or three odd and three even. One of these three patterns will occur in 83 percent of the drawings.

Now, if you understand how the lottery is drawn and know some basic probability, you can tell that this advice can’t possibly work, without even reading it carefully. But if you had to explain the fallacy to someone, it might take a bit of thought to locate it.  If 99% of wins have a mixture of odd and even numbers (actually, more like 98%), why doesn’t that make it bad to choose all odd or all even?
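
You can verify the quoted percentages with a couple of lines of code (assuming a draw of six numbers from 1 to 40, so that exactly half the available numbers are odd):

```python
from math import comb

total = comb(40, 6)       # number of possible six-number draws
all_odd = comb(20, 6)     # draws using only the 20 odd numbers
all_even = comb(20, 6)    # draws using only the 20 even numbers

p_same_parity = (all_odd + all_even) / total
p_mixed = 1 - p_same_parity

print(round(p_same_parity, 3))   # about 0.02: all-odd or all-even is rare
print(round(p_mixed, 2))         # about 0.98, matching the figure above
```
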

When you have an answer (or have given up), click through for more:

(more…)