Being a guest of the Data Science Phil podcast

Not long ago, I was interviewed for the Data Science Phil podcast. I had prepared excitedly with three subjects for discussion: the German tank problem, capture-mark-recapture, and the Fermi problem. These are methods to estimate population sizes from very scarce data. They can be used in diverse settings, including business intelligence.

Data Science Phil is Philipp Packmohr’s alias for his podcast. In it, he covers subjects from data analysis, mathematical modelling and statistics. He interviews students and professors, and asks them to talk about their projects and areas of expertise.

I got to know Phil at programming and data science meetups. As he found out more about my career path, he became very keen to recruit me for the podcast. He said that he learnt something new by listening to a mathematician like me even for a couple of minutes, so the interview would be a walk in the park for me. Still, I wanted to come up with a clear structure and with topics that would be interesting and new for most listeners, and insightful without being deeply technical.

Being the contrarian I am, I wanted to provoke our listeners by bringing anti-data science to the podcast. The current popularity of machine learning is said to be due to three major factors: the availability of large datasets (e.g. online browsing behaviour, labelled images, sensor data), the availability of cheap computer power to process the data (via rental in the cloud), and new discoveries and accruing experience in the methodology of machine learning. (I should add the development and standardisation of machine learning software libraries.)

Just because machine learning and artificial intelligence have become possible and applicable in many enterprises does not mean that all data analysis problems and all useful inference methods must use machine learning. My philosophy is to be broad: to know about a wide range of methods, to be the expert who suggests and selects the relevant approaches, and finally to be technically capable of implementing them, both by understanding the mathematics and by being a skilled programmer.

There is a lot of value in selecting the right methodology (and preventing the implementation of a hopeless one). There is value in recognising the applicability of modelling where people did not see the scope for mathematical or computational modelling. And after all that, of course there is value in writing actual code.

With this motivation, I had researched three methods for making inferences based on very little data. This is the opposite of machine learning, which is basically trying to approximate some unknown function $f:\, x\mapsto y$ given a lot of pairs $(x_n, y_n)$ of observations (data). From little data you cannot make very reliable or accurate estimates, but let’s see if we can at least get the order of magnitude of an unknown right.

The podcast runs for a long hour and a half, but it conveniently comprises three blocks of about a half hour each.

The German tank problem

The first block is about the German tank problem. (Wikipedia was indeed one of my main sources.) During World War II, the Allies wanted to estimate the rate at which the Germans were manufacturing tanks. They used both classical intelligence and statistics for this. The statistical method applies to the setting where objects receive sequential serial numbers, and you capture (sample) some of the tanks (some of the objects) at random. From the serial numbers you observe, you then want to estimate the largest serial number in the sequence, that is, the total number of objects produced.
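To make this concrete, here is a minimal sketch of the textbook frequentist estimator for this setting, $\hat N = m + m/k - 1$, where $m$ is the largest serial number observed and $k$ is the sample size. It assumes serial numbers start at 1 and that the sample is drawn at random without replacement. The function name and the example serial numbers are my own illustration and not data discussed in the podcast.

```python
def estimate_total_from_serials(serials):
    """Estimate the largest serial number in the sequence.

    Uses N_hat = m + m/k - 1, with m the largest observed serial number
    and k the number of observed serial numbers.
    Assumes serials are numbered 1, 2, ..., N and sampled without replacement.
    """
    k = len(serials)
    if k == 0:
        raise ValueError("need at least one observed serial number")
    m = max(serials)
    # The largest observation plus the average gap between observations.
    return m + m / k - 1


# Hypothetical sample of four captured serial numbers.
print(estimate_total_from_serials([19, 40, 42, 60]))  # 74.0
```

The intuition is that the true maximum probably lies beyond the largest number seen, by about the average gap between the observed numbers.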

I tied this in with the Doomsday argument. You can formulate the Doomsday argument to claim that there are only one or two more generations of humans to ever exist. Then it’s game over.

Or is it?

I encourage you to read up on this fascinating argument and its rebuttals at the above link.

Mark and recapture

In the second block, I describe a method from ecology which has found applications in many other fields. Ecologists often want to estimate population sizes of plant or animal species. How do you do it with animals which always move around and also all look the same?

You capture some, mark them (e.g. by putting a ring on a bird’s leg), and release them. Then you come back a week later, giving enough time to the marked and unmarked animals to mix together. You capture some, and compute what proportion of the captured animals had a tag in your second sample. If you tagged eight birds originally, and in your second sample one fifth of the birds are tagged, then you can expect that there are roughly five times eight birds in total in this population. This technique has various names: mark and recapture, capture-mark-recapture, sight-resight, mark-release-recapture, Lincoln–Petersen method, and variants thereof.
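The arithmetic in the bird example can be sketched in a few lines. This is the simple Lincoln–Petersen estimate $\hat N = MC/R$, with $M$ animals marked on the first visit, $C$ caught on the second visit, and $R$ of those found to be marked. The function name and the second-sample size of ten birds are assumptions for illustration; the text only fixes the eight tagged birds and the one-fifth proportion.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Lincoln-Petersen population estimate N_hat = M * C / R.

    marked_first  (M): animals marked and released on the first visit
    caught_second (C): animals caught on the second visit
    recaptured    (R): animals in the second catch that carry a mark
    """
    if recaptured == 0:
        raise ValueError("no marked animals recaptured; the estimate is undefined")
    return marked_first * caught_second / recaptured


# Eight birds tagged; on the second visit 2 of 10 birds (one fifth) are tagged.
print(lincoln_petersen(8, 10, 2))  # 40.0
```

With only two recaptured birds the estimate is very noisy, which fits the spirit of getting the order of magnitude right rather than the exact number.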

The beauty of this method is its broad applicability. What you really need is that individuals in the population have some unique identifier (like the ring on a bird with a unique number). If we talk about people, it can be simply their names. You have to sample them twice in two independent (‘orthogonal’) sampling campaigns. Then you compare the two samples and focus particularly on how many individuals were present in both samples. The two samplings can take place simultaneously. The trick is not to separate them in time but to make them independent from one another.
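When the two samples are simply lists of identifiers, the comparison reduces to a set intersection. The sketch below assumes two independently collected lists of names; the names and the function are made up for illustration.

```python
def estimate_from_two_lists(list_a, list_b):
    """Estimate population size from two independent lists of unique identifiers.

    The overlap between the lists plays the role of the recaptured animals:
    N_hat = |A| * |B| / |A and B|.
    """
    a, b = set(list_a), set(list_b)
    in_both = a & b
    if not in_both:
        raise ValueError("the lists do not overlap; the estimate is undefined")
    return len(a) * len(b) / len(in_both)


# Two hypothetical, independently collected lists of names.
first_list = ["Ann", "Bob", "Cid", "Dee", "Eve", "Fay"]
second_list = ["Bob", "Eve", "Gus", "Hal"]
print(estimate_from_two_lists(first_list, second_list))  # 12.0
```

Nothing in the calculation depends on when the lists were collected, only on their independence.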

One remarkable example of its sophisticated application is described by Sir Bernard Silverman in an 11-minute TEDx talk, Modern slavery: the size of the problem. He was Chief Scientific Advisor to the UK Home Office from 2010 to 2017, and was formerly a professor of statistics at several universities, including my alma mater, the Department of Statistics at the University of Oxford. If you click through, you can still be one of the first thousand viewers of this video.

I also took this opportunity to talk about sampling biases: about selection bias and survivorship bias. (Click for the famous diagram of an aeroplane with hits by enemy fire or here for a popular science account. With that, we are back at statistical problems studied by the Allies during World War II!) It is important for any data analyst to be aware of such pitfalls in order to avoid them.

Fermi problem

If you go to a job interview, you might well face a question like ‘How many petrol stations are there in the capital?’, with the interviewer requiring you to estimate it on the spot using only information that you can recollect from memory. There is a standard way of solving this by breaking it down into factors. I explain this in the podcast, with a special focus on why it works and how well you can expect it to work.
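As a sketch of what breaking it down into factors looks like, the snippet below multiplies a handful of guessed quantities. Every number here is a placeholder I invented for illustration, not a figure from the podcast or from any real city, and the choice of factors is only one of many possible decompositions.

```python
import math

# All values are made-up guesses, used only to show the mechanics.
factors = {
    "people living in the capital": 2_000_000,
    "cars per person": 0.4,
    "refuelling stops per car per week": 1,
    "stations needed per weekly refuelling stop": 1 / 4_000,  # one station serves ~4000 stops a week
}


def fermi_estimate(factors):
    """Multiply the guessed factors to get an order-of-magnitude estimate."""
    return math.prod(factors.values())


estimate = fermi_estimate(factors)
print(f"roughly {estimate:.0f} petrol stations")
# Report only the order of magnitude, which is all the method promises.
print(f"order of magnitude: 10^{round(math.log10(estimate))}")
```

Writing the factors down by name makes it easy to see which guess you trust least and to revise it.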

Asking a mathematician to do products is like asking an artistic painter to whitewash a bathroom. If I were in that job interview, I would tell them all about the Fermi problem but I would try to weasel out of the risky business of multiplying actual numbers…

One famous application of this technique is the Drake equation, which serves as the foundation of a research programme to guesstimate whether there exists extraterrestrial intelligence. In the podcast I recommended a great summary of the Fermi paradox on Tim Urban’s Wait but Why blog, which, like the Doomsday argument, is a mind-boggling intellectual journey.

So that’s the story of how I became a radio star. You can send chocolate and flowers to my address. Thanks for listening!