I have been working with probability and machine learning lately, particularly with fitting distributions to datasets. Fitting data is covered in a conventional pre-college curriculum, but I had only ever done it when all of my data was complete. More recently I ran into a problem where that was not the case.
One fundamental feature in autonomous driving is the distance to the car in front of you. This tends to be a real number, hopefully a bit bigger than a few meters when travelling at high speed and potentially quite large when traffic is light. The sensors on your car, such as radar or lidar, can only pick up cars within a certain distance of you. What do you do if there is no car in front of you? Set the distance to infinity?
This is a good example of censored data: readings are forced to lie within a given range, and you know when a reading has been clipped to the boundary, but not what the true value was. The other types are truncated and missing data. Data is truncated when the only readings you have are those within a certain range, and you never learn of the occurrences that fall outside it. Data is missing when a reading was skipped or corrupted, or is for some other reason unavailable. A good example is the velocity of the car in front of you.
So how does one handle fitting distributions to such features?
The answer is a surprisingly straightforward application of Bayes’ theorem.
Consider first a toy problem:
Unstable particles are emitted from a source and decay at a distance \(x\), a real number that has an exponential probability distribution with rate \(\lambda\). We observe \(N\) decays at locations \(x_1, x_2, \ldots, x_N\). What is \(\lambda\)?
Adapted from “Information Theory, Inference, and Learning Algorithms” by David MacKay
Solving this first for the case with perfect data gives us some insight.
Fully Observed Data
The probability distribution for a single sample point, given \(\lambda\), is:
$$P(x\mid \lambda) = \lambda e^{-\lambda x}$$
from the definition for the exponential probability distribution.
Applying Bayes’ theorem and assuming independence:
$$P(\lambda \mid x_{1:N}) = \frac{P(x_{1:N}\mid \lambda)P(\lambda)}{P(x_{1:N})} \propto \lambda^N \exp \left( -\lambda \sum_{n=1}^N x_n \right) P(\lambda)$$
We can see that simply by conditioning on the available data and setting a prior, we can determine the posterior distribution over \(\lambda\). From here we can do what we wish, such as picking the most probable value of \(\lambda\) (the MAP estimate).
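As a quick sanity check, here is a minimal sketch in Python. The synthetic data, the true rate of 0.5, and the flat prior are assumptions of the sketch, not part of the original problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "fully observed" decay locations with true rate lambda = 0.5.
x = rng.exponential(scale=2.0, size=1000)  # scale = 1 / lambda

# With a flat prior P(lambda), the posterior lambda^N exp(-lambda sum(x))
# peaks at the familiar maximum-likelihood estimate N / sum(x).
lam_map = len(x) / x.sum()
print(f"MAP estimate: {lam_map:.3f}")  # should be close to 0.5
```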
Truncated Data
Suppose, however, that the data is truncated: we only get readings for particle decays between \(x_\min\) and \(x_\max\). Fitting in the same way will bias the estimate, because the model never accounts for the decays we cannot see. Let us start again, following the same steps.
The probability distribution for a single sample point, potentially truncated, given \(\lambda\), is:
$$P(x\mid \lambda) = \begin{cases} \lambda e^{- \lambda x} / Z(\lambda) & x \in [x_\min, x_\max] \\ 0 & \text{otherwise} \end{cases}$$
where $Z(\lambda)$ is a normalization factor:
$$Z(\lambda) = \int_{x_\min}^{x_\max} \lambda e^{- \lambda x} \> dx = \left( e^{-\lambda x_\min} - e^{-\lambda x_\max} \right)$$
We then apply Bayes’ theorem:
$$P(\lambda \mid x_{1:N}) = \frac{P(x_{1:N}\mid \lambda)P(\lambda)}{P(x_{1:N})} \propto \left( \frac{\lambda}{Z(\lambda)} \right)^N \exp\left( -\lambda \sum_{n=1}^N x_n \right) P(\lambda)$$
This is very similar to the fully observed case; the only real change is the normalization factor \(Z(\lambda)\), which now depends on \(\lambda\) and must be carried through the inference.
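The posterior no longer has a neat closed-form maximum, but it is only a few lines to optimize numerically. A minimal sketch, again assuming a flat prior; the window \([1, 20]\) and the true rate of 0.5 are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x_min, x_max = 1.0, 20.0

# Simulate truncation: decays outside the window are never recorded.
raw = rng.exponential(scale=2.0, size=20000)  # true rate lambda = 0.5
x = raw[(raw >= x_min) & (raw <= x_max)]

def neg_log_posterior(lam):
    # -log[ (lambda / Z(lambda))^N exp(-lambda sum(x)) ] under a flat prior
    z = np.exp(-lam * x_min) - np.exp(-lam * x_max)
    return -(len(x) * (np.log(lam) - np.log(z)) - lam * x.sum())

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 10.0), method="bounded")
print(f"MAP estimate: {res.x:.3f}")  # should be close to 0.5
```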
Censored Data
Without going into detail, we can derive the model for censored data. Suppose \(x\) is censored from above at \(x_\max\), which occurs if our particle detector has a finite length and particles that would have decayed beyond \(x_\max\) instead slam into the back and report as \(x_\max\):
$$P(x \mid \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x < x_\max \\ Z'(\lambda)\, \delta(x - x_\max) & \text{for } x = x_\max \end{cases}$$
where $Z(\lambda)$ is the probability of \(x\) being uncensored, $Z'(\lambda)$ is the probability of \(x\) being censored, and \(\delta\) is the Dirac distribution, which places the censored probability mass at \(x_\max\). Here, $Z(\lambda)$ is:
$$Z(\lambda) = \int_0^{x_\max} \lambda e^{-\lambda x} \> dx = (1 - e^{-\lambda x_\max})$$
$$Z'(\lambda) = 1 - Z(\lambda) = e^{-\lambda x_\max}$$
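Each uncensored point contributes \(\lambda e^{-\lambda x_n}\) to the likelihood and each censored point contributes \(Z'(\lambda) = e^{-\lambda x_\max}\); with a flat prior this even gives a closed-form maximum, \(\hat{\lambda} = N_\text{obs} / \left( \sum_\text{obs} x_n + N_\text{cens}\, x_\max \right)\). A minimal sketch on synthetic data, where the detector length and true rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x_max = 5.0

# Simulate censoring: decays past the end of the detector report as x_max.
raw = rng.exponential(scale=2.0, size=10000)  # true rate lambda = 0.5
x = np.minimum(raw, x_max)
cens = x == x_max

# log-likelihood: n_obs*log(lam) - lam*sum(x_obs) - lam*x_max*n_cens.
# Setting its derivative to zero gives the closed-form estimate below.
n_obs = (~cens).sum()
lam_map = n_obs / (x[~cens].sum() + cens.sum() * x_max)
print(f"MAP estimate: {lam_map:.3f}")  # should be close to 0.5
```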
Missing Data
The final case is missing data. Here, you know that a reading is missing, but you have no information about its value. Provided the readings are missing at random, the missing entries carry no information about \(\lambda\), and this sort of data can be fitted with the original method, using only the observed values.
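Concretely, assuming readings are missing at random and marked with NaN (the loss rate here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.exponential(scale=2.0, size=1000)  # true rate lambda = 0.5
x[rng.random(x.size) < 0.1] = np.nan       # ~10% of readings lost at random

# Missing-at-random entries carry no information about lambda,
# so we drop them and reuse the fully observed estimate.
obs = x[~np.isnan(x)]
lam_map = len(obs) / obs.sum()
print(f"MAP estimate: {lam_map:.3f}")  # should be close to 0.5
```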