How We Wrote a Textbook

My coauthor and I have finished a textbook! That’s right, a full-length, fully-featured academic textbook. Algorithms for Optimization has its own page on the MIT Press website and comes out in February 2019. And it’s available for free!

We started working on it a little over two years ago, in the fall of 2016. Now it is nearing the end of the fall of 2018 and it is becoming a reality.

What does it take to write a textbook? That’s a good question; one that I certainly did not know the answer to going in, and one for which I now need several hours of recollection and introspection to tease out the appropriate response. This blog post is my answer to that question. I hope that it is useful to you, whether it helps you understand the scope of the work involved, brings up aspects you hadn’t considered, or points out tools that might help you with your own book.

What We Made

Our book, Algorithms for Optimization, is a full-length college-level academic textbook based on material from Prof. Kochenderfer’s Engineering Design Optimization course at Stanford. Prof. Kochenderfer and I authored the book together.

The book provides an introduction to optimization with a focus on practical algorithms. There are many optimization books that cover specific disciplines (such as structural optimization), or specific techniques (such as convex optimization). Our book’s twenty-one chapters span the optimization literature, covering the mathematical and algorithmic fundamentals of optimization. Algorithms for Optimization is probably best compared to the popular Numerical Recipes books, which cover a wide range of algorithms encountered in computer science. Like Numerical Recipes, we provide real code, are not discipline-specific, and are not domain-specific.

Algorithms for Optimization is filled with easy-to-understand Julia implementations of all algorithms. This is real code that actually runs. In fact, we run these very implementations to generate many of the illustrations.

Did I mention illustrations? The book is filled with data-rich figures and side notes to explain concepts. In my opinion, these illustrations are fundamental in conveying understanding. Plus, they make the book look great.

This figure uses the method of small multiples to show how the Nelder Mead algorithm proceeds. The figure is generated from Julia code written directly in the LaTeX file and is executed via the pythontex package during compilation.

We made liberal use of the side margins to elaborate on concepts via side notes without interrupting flow. Figures can be placed in the margins, and all of our citations show up in the margins as well. This was inspired by Edward Tufte, the father of data visualization.

Everything was written in LaTeX with Bitbucket for source control. Everything is a text resource. That’s right. I’ll say it again. Everything is a text resource.

All figures are either defined in the LaTeX document directly using TikZ, or come from TikZ code that is generated during compile time from Julia code blocks defined in the LaTeX document. More on this later.

The Tufte style provided by the tufte-book class provides wide side margins for comments and figures. Image from the tufte-book documentation.

All in all, Algorithms for Optimization is unique in form and function. Our primary selling points are:

  • Easy-to-understand Julia implementations of all algorithms.
  • Data-rich illustrations and side notes to explain concepts without interrupting flow.
  • A chapter on multidisciplinary design optimization that deviates from the traditional academic treatment to greatly simplify the exposition.
  • Chapters on surrogate models and surrogate optimization methods.
  • A comprehensive overview of continuous numerical optimization with an undergraduate audience in mind.
  • Additional chapters on discrete and expression optimization.

Timeline

Prof. Kochenderfer and I started working on Algorithms for Optimization in the fall of 2016. I was his graduate student at the time and had just started my third year of grad school.

The timeline for Algorithms for Optimization. It only took us a little over two years! Copyediting takes a long time.

At the time, Prof. Kochenderfer was teaching AA222, a class called Multidisciplinary Design Optimization. He had inherited it from another professor, and had shifted its focus from MDO to a more general overview of engineering design optimization. The class was great and got great reviews; he had taught it twice. The problem was that there was no textbook for it, because there weren’t really any textbooks that covered so many optimization techniques.

Prof. Kochenderfer had already written a textbook, Decision Making under Uncertainty, so he knew how to go about it. The problem was, as a professor, Prof. Kochenderfer was extremely pressed on time. He had the course notes, but little time to devote to turning them into a textbook. Enter Tim Wheeler.

As a grad student, I had more free time on my hands. Maybe not far more (those who know Prof. Kochenderfer’s schedule could easily argue that yes, grad students have far more free time than he does), but more for sure. Furthermore, Prof. Kochenderfer was able to obtain some funding toward writing the textbook, which was a fantastic way to get the ball rolling.

A fortuitous happenstance helped get the book off the ground: I spent the fall of 2016 in Germany as part of a student exchange. This meant I was removed from my friends, family, and peers, so I had very little to do in the evenings. I started the book in earnest on my flight to Germany, and continued working on it for my entire time there, making first passthroughs of around 80% of the planned material.

I worked on the book in the evenings, trying to finish one chapter passthrough every week. Prof. Kochenderfer thoroughly reviewed and edited each passthrough thereafter. It worked out really well. In a nutshell, I created vast quantities of content and he edited every detail micro- and macroscopically.

There are several steps involved that I had not expected, or at least had not considered. The prospectus is a document that proposes the book to the publisher. In it one lists the contents of the book, its target audience, what it does differently than other textbooks, a long list of comparable textbooks and why they don’t serve the same audience or function, and so on. Thankfully Prof. Kochenderfer had done this before. We wrote the prospectus fairly early on and attached our first polished chapter for review.

The publisher takes the prospectus and the sample chapter and typically sends it out to other professors and experts in the field for review. They may recommend the book for publishing. I learned that most professors care about textbooks insofar as they can be used in class. This makes sense – professors teach classes.

We ran into the interesting problem that our textbook doesn’t really cover anything novel from the perspective of experts in optimization. Experts in optimization are not our target audience. Our book was written to summarize a large body of existing work that emerged within a variety of different communities, not to present any novel research results. (Novel research is typically presented in journal articles and conference papers.) Some of the reviewers had tepid responses largely because the book did not go into the kind of depth of prior books dedicated to particular subareas. Thankfully the general consensus was that the topics we cover are uniquely broad within a single text. MIT Press had worked with Prof. Kochenderfer before, knew that his previous book was a huge success, and intuited that this book would be too, despite not having graduate students in optimization as its target audience.

We have used the book in AA 222 for two quarters now. Getting direct feedback from students has been an immense help. So has hearing all of the praise. Not every textbook author has access to such a captive audience. I consider us extremely lucky.

The copyediting phase of the textbook is far longer than I expected. The copyeditors we have worked with are incredibly thorough. I have a huge amount of respect for them. They find every. single. issue. Who knew that North American writers should use toward instead of towards? One should use 1 when referring to the quantity and one when referring to the pronoun. The rules for which and that are now forever ingrained in my muscle memory. Commas are complicated. Every figure, example, and algorithm had to be referenced in the text. I am very thankful for the reviewers.

The final book is over 500 pages. Each page represents many hours of effort. We have written and rewritten every sentence, agonized over spacing, and put every figure through many rounds of tweaks and changes.

Our book will have taken about two and a half years from start to finish. Textbooks take a long time. To put this timeline into context, Sebastian Thrun told me Probabilistic Robotics took seven years. Prof. Kochenderfer and I had a fortuitous synergy in work ethic, division of labor, time availability, and grit.

Our Toolkit

What tools did we use to write the book? I have already mentioned some.

LaTeX

First and foremost, the book was written in LaTeX, a typesetting language that is the go-to tool for mathematical typesetting. Everything is written in text. The expression:

$$x^2$$

for instance, is written $x^2$. LaTeX documents are compiled to PDF.

LaTeX does a lot of work for us. Figures are numbered automatically. Figures are referenced from the text automatically. Layout is determined automatically. It works so well that we actually provide the final PDF to MIT Press rather than having them do any editing of the document themselves. From my understanding, this is very uncommon.
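
For example, a figure only needs a label, and every reference to it picks up the right number at compile time. An illustrative fragment (not a snippet from the book):

    \begin{figure}
      \begin{tikzpicture}
        \draw (0,0) -- (1,1); % stand-in for a real plot
      \end{tikzpicture}
      \caption{A placeholder figure.}
      \label{fig:placeholder}
    \end{figure}

    Figure~\ref{fig:placeholder} shows a placeholder plot. % number filled in automatically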

We use lualatex to compile our book. The primary reason is that it gives us access to some better fonts and font tools. We use the fontspec package. We are on the latest version of texlive.

Tufte Style

Our book uses the tufte-book style. This gives us access to a wide side margin in which we can place comments and figures. In-line comments (like this) interrupt the flow. Side margin comments allow us to provide additional information while minimizing interruptions. We can also include some more advanced commentary without overwhelming the reader.

Every citation in the book is placed in the side margin, so you can find the exact paper without having to flip to some other page in the book. Every figure has a caption and we reference every figure in the text.
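
In the tufte-book class, margin notes are as simple as a \sidenote command; a minimal illustrative fragment (not from the book):

    \documentclass{tufte-book}
    \begin{document}
    Gradient descent can converge slowly on poorly scaled
    problems.\sidenote{Scaling the design variables often helps.}
    % \cite{somekey} similarly places the full reference in the margin.
    \end{document}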

Beyond the explicit style, we also follow the Tufte philosophical principles. The method of small multiples is used throughout the book to show how an algorithm progresses. This involves multiple charts with the same scale and axes, allowing one to quickly see the meaningful differences between them.

We use a consistent color scheme throughout the book. In fact, we try to be as consistent as possible throughout the book in every aspect – notation, typography, how we phrase things, etc.

Pythontex

We use juliaplots.sty to make it easy to integrate Julia code and plots in our book. It is extremely easy to use and install. It uses pythontex under the hood.

The pythontex package allows one to display, execute, and execute & display code in LaTeX. It was originally written for Python but now supports quite a few languages. We use it for typesetting all Julia code blocks and for generating many of the figures in the book. All figures in the book are defined using text – either in TikZ or in Julia code via pythontex that generates TikZ.

The Julia code blocks are juliaverbatim environments containing our code. When compiled, these blocks are typeset and syntax highlighting is applied. We created our own lexer and syntax highlighting theme using the pygments package.

The figures we generate using pythontex are jlcode environments with the structure:
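
Roughly speaking, such a block holds ordinary Julia code whose output ends up as TikZ. A hedged sketch of the general shape (the plotting package and file name here are placeholders, not our exact setup):

    \begin{jlcode}
    using PGFPlots  # placeholder; any package that emits TikZ works
    p = Plots.Linear(x -> sin(x), (0, 2π))
    save("figures/sine.tex", p)  # written out as TikZ and pulled into the document
    \end{jlcode}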

If there is interest in diving deeper into how we structured this, with code examples and a minimum working example, let me know!

Custom Environments

Prof. Kochenderfer is a LaTeX wizard. He created several environments that we use throughout the book. Algorithms are in algorithm environments and examples are in example environments. Both have a light gray background and support automatic labeling with numbers based on the chapter. For example, the second algorithm environment in the fifth chapter is automatically labeled Alg 5.2.
I believe the two environments are identical under the hood – we just count algorithms and examples differently.
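
The actual definitions are more involved, but the core ingredients are a per-chapter counter and a shaded box. A minimal sketch (not our real implementation):

    \documentclass{book}
    \usepackage{xcolor}
    \usepackage[framemethod=default]{mdframed}

    % Counter that resets each chapter, printed as <chapter>.<number>
    \newcounter{algcounter}[chapter]
    \renewcommand{\thealgcounter}{\thechapter.\arabic{algcounter}}

    \newmdenv[backgroundcolor=black!5,linewidth=0pt]{graybox}
    \newenvironment{algorithm}
      {\refstepcounter{algcounter}\begin{graybox}\textbf{Algorithm \thealgcounter.}~}
      {\end{graybox}}

    \begin{document}
    \chapter{First-Order Methods}
    \begin{algorithm}
    Gradient descent repeatedly steps in the direction of $-\nabla f(x)$.
    \end{algorithm}
    \end{document}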

We also have question and solution environments. These allow us to have finer control over where the questions and solutions are placed. When writing the chapters it is nice to be able to have the questions and solutions be placed immediately after one another at the end of the chapter. In the final version of the book all solutions are moved to a solutions section at the end. Again, question numbers are automatically generated by LaTeX.

Bitbucket

We use git for source control, and Bitbucket to host it. Everything is defined in text files, so we can easily commit, push, pull, etc. to and from our repository. Having a project with only two people allowed us to work almost exclusively in the master branch directly. In some cases I branched, such as when I overhauled the book to use Julia 1.0 rather than Julia 0.6.

We have a server that uses git hooks to listen for commits. Any time we push a change to master, the server updates its git repo and compiles the full book as well as each individual chapter into a PDF. We have our own password-protected website with URLs to these PDFs that we release to potential reviewers. Thus they always have access to the latest version of the book. We used the vc package and a bash script to automatically generate a footer on every page of the PDF containing the version number, so you always know what version you are looking at.

We used Bitbucket to track issues. This was extremely valuable. In early forms of the book we left comments and todo notes in the text. Once we started releasing chapters to students we didn’t feel that was appropriate. We thus filed issues for every little (and big) thing that we had to remember to do. Issues were resolved as appropriate and we managed to whittle down the list over time. We have had 560 issues and 2555 commits.

Code testing is done using Travis-CI, a continuous integration framework that is also triggered on every commit. Travis runs test scripts we wrote to test the code in our algorithm environments. We track code coverage with coveralls to build confidence that we have tested everything thoroughly.
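
The test scripts themselves are ordinary Julia. A minimal sketch of the kind of check Travis runs (the file and function names here are hypothetical):

    using Test

    include("extracted_algorithms.jl")  # hypothetical file of code pulled from the book

    @testset "bracket_minimum" begin
        a, b = bracket_minimum(x -> (x - 2)^2)  # hypothetical algorithm from the book
        @test a < 2 < b   # the returned bracket should contain the minimizer
    end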

Zendesk

We used Zendesk to help us track external issues. Anyone with feedback to give could send an email to comments@alg4opt.net. Zendesk would receive the email and put it into its own handy issue system. We could easily assign messages to each other, respond to the writer, complete them, and have hidden discussions within each message chain.

Based on our email exchanges we had around 187 reviewers that were not AA222 students.

Mental Considerations

Perhaps the most challenging aspect of writing a textbook has been the mental aspect. Those who write textbooks need to be in it for the long haul. Every sentence is going to be written and rewritten, considered from different angles, potentially misinterpreted by students or reviewers, and needs to stay consistent with respect to the style of the rest of the text. You can go down endless rabbit holes in terms of tweaking different aspects. There is an ever-expanding list of new chapters or sections that could be written, new problems and solutions that could be explored, and new ways to cover material.

On top of that, there are all of the introductions and tasks that are not as fun to write, but someone has to write them. Oftentimes once one gets down to it, it isn’t that bad, and in many cases it is illuminating. Nevertheless one needs grit to overcome these moments.

An issue I ran into was criticism. No-one likes being critiqued, and when writing a 500-page book there are plenty of chances for people to point out dumb mistakes, not-so-dumb mistakes, sections that are unclear, and sections you should consider including. Don’t get me wrong, almost all of the criticism we got was constructive. Taking criticism early on always caused a hit to my ego, and learning to handle that better has helped me develop as a person.

Lastly, writing a textbook with a co-author requires working together with someone in an intense collaborative manner for a very long time. I find it hard to imagine a better co-author than Prof. Kochenderfer. His undying passion to complete the book, seemingly inexhaustible capacity for detail-oriented work, and on-point review of anything and everything I ever committed made this book successful. No corners were ever cut. No concept was ever explained out of order. We worked together incredibly well.

I highly, highly stress that if you do consider writing a book with a co-author, it is imperative that you find someone with whom you really click. Because you’ll be crafting your beautiful product together for a long time.

Conclusion

I am incredibly happy with how Algorithms for Optimization evolved into what it is today. It is an achievement for which I have immense pride. I hope that it is useful to someone, hopefully many someones, for that is the overall purpose of all this.

AlphaGo Zero – How and Why it Works

DeepMind’s AlphaGo made waves when it became the first AI to beat a top human Go player in March of 2016. This version of AlphaGo – AlphaGo Lee – used a large set of Go games from the best players in the world during its training process. A new paper was released a few days ago detailing a new neural net—AlphaGo Zero—that does not need humans to show it how to play Go. Not only does it outperform all previous Go players, human or machine, it does so after only three days of training time. This article will explain how and why it works.

Monte Carlo Tree Search

The go-to algorithm for writing bots to play discrete, deterministic games with perfect information is Monte Carlo tree search (MCTS). A bot playing a game like Go, chess, or checkers can figure out what move it should make by trying them all, then checking all possible responses by the opponent, all possible moves after that, etc. For a game like Go the number of moves to try grows really fast.
Monte Carlo tree search will selectively try moves based on how good it thinks they are, thereby focusing its effort on moves that are most likely to happen.

More technically, the algorithm works as follows. The game-in-progress is in an initial state \(s_0\), and it is the bot’s turn to play. The bot can choose from a set of actions \(A\). Monte Carlo tree search begins with a tree consisting of a single node for \(s_0\). This node is expanded by trying every action \(a \in A\) and constructing a corresponding child node for each action. Below we show this expansion for a game of tic-tac-toe:
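
To make this concrete, here is a minimal Julia sketch of a tree node and the expansion step (the game interface functions legal_actions and apply_action are assumptions, not code from this post):

    # A minimal search-tree node: accumulated value W and visit count N
    mutable struct Node
        state                      # game state s
        children::Vector{Node}
        W::Float64                 # accumulated value
        N::Int                     # visit count
    end
    Node(state) = Node(state, Node[], 0.0, 0)

    # Expansion: construct one child per legal action a ∈ A
    function expand!(node::Node)
        for a in legal_actions(node.state)              # assumed game API
            push!(node.children, Node(apply_action(node.state, a)))
        end
    end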

The value of each new child node must then be determined. The game in the child node is rolled out by randomly taking moves from the child state until a win, loss, or tie is reached. Wins are scored at \(+1\), losses at \(-1\), and ties at \(0\).

The random rollout for the first child given above estimates a value of \(+1\). This value may not represent optimal play; it can vary based on how the rollout progresses. One can run rollouts unintelligently, drawing moves uniformly at random. One can often do better by following a better (though still typically random) strategy, or by estimating the value of the state directly. More on that later.
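
A uniformly random rollout is only a few lines of Julia; this sketch assumes the same hypothetical game interface (is_terminal, legal_actions, apply_action, result):

    # Random rollout: play random moves to the end of the game,
    # then score the result as +1 (win), -1 (loss), or 0 (tie).
    function rollout(state)
        while !is_terminal(state)          # assumed game API
            a = rand(legal_actions(state))
            state = apply_action(state, a)
        end
        return result(state)               # +1, -1, or 0
    end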

Above we show the expanded tree with approximate values for each child node. Note that we store two properties: the accumulated value \(W\) and the number of times rollouts have been run at or below that node, \(N\). We have only visited each node once.

The information from the new child nodes is then propagated back up the tree by updating the parent’s visit count and value. The parent’s accumulated value is set to the total accumulated value of its children:
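
As a sketch, backpropagation just walks the path from the expanded node back to the root (the sign flip for the opposing player, discussed below, is omitted here):

    # Backpropagation: add the rollout value to every node on the path
    # from the expanded leaf back to the root, and bump the visit counts.
    function backup!(path::Vector{Node}, value::Real)
        for node in path
            node.N += 1
            node.W += value
        end
    end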

Monte Carlo tree search continues for multiple iterations consisting of selecting a node, expanding it, and propagating back up the new information. Expansion and propagation have already been covered.

Monte Carlo tree search does not expand all leaf nodes, as that would be very expensive. Instead, the selection process chooses nodes that strike a balance between being lucrative (having high estimated values) and being relatively unexplored (having low visit counts).

A leaf node is selected by traversing down the tree from the root node, always choosing the child \(i\) with the highest upper confidence tree (UCT) score:

$$U_i = \frac{W_i}{N_i} + c \sqrt{\frac{\ln N_p}{N_i}}$$

where \(W_i\) is the accumulated value of the \(i\)th child, \(N_i\) is the visit count for the \(i\)th child, and \(N_p\) is the visit count of the parent node. The parameter \(c \geq 0\) controls the tradeoff between choosing lucrative nodes (low \(c\)) and exploring nodes with low visit counts (high \(c\)). It is often set empirically.
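
In Julia, this score and the descent it drives might look like the following sketch (unvisited children would need special handling to avoid dividing by zero):

    # UCT score for a child with accumulated value W, visit count N,
    # and parent visit count N_p
    uct(W, N, N_p; c=1.0) = W / N + c * sqrt(log(N_p) / N)

    # Selection: walk down the tree, always taking the child with the best score
    function select(node::Node)
        while !isempty(node.children)
            scores = [uct(child.W, child.N, node.N) for child in node.children]
            node = node.children[argmax(scores)]
        end
        return node
    end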

The UCT scores \(U_i\) for the tic-tac-toe tree with \(c=1\) are:

In this case we pick the first node, \(s_{0,1}\). (In the event of a tie one can either randomly break the tie or just pick the first of the best nodes.) That node is expanded and the values are propagated back up:

Note that each accumulated value \(W\) reflects whether X’s won or lost. During selection, we keep track of whether it is X’s or O’s turn to move, and flip the sign of \(W\) whenever it is O’s turn.

We continue to run iterations of Monte Carlo tree search until we run out of time. The tree is gradually expanded and we (hopefully) explore the possible moves, identifying the best move to take. The bot then actually makes a move in the original, real game by picking the child with the highest visit count. For example, if the top of our tree looks like:

then the bot would choose the first action and proceed to \(s_{0,1}\).

Efficiency Through Expert Policies

Games like chess and Go have very large branching factors. In a given game state there are many possible actions to take, making it very difficult to adequately explore the future game states. To give a sense of scale, chess has an estimated \(10^{46}\) board states, and Go played on a traditional \(19 \times 19\) board has around \(10^{170}\) states (tic-tac-toe has only 5478 states).

Move evaluation with vanilla Monte Carlo tree search just isn’t efficient enough. We need a way to further focus our attention to worthwhile moves.

Suppose we have an expert policy \(\pi\) that, for a given state \(s\), tells us how likely an expert-level player is to make each possible action. For the tic-tac-toe example, this might look like:

where each \(P_i = \pi(a_i \mid s_0)\) is the probability of choosing the \(i\)th action \(a_i\) given the root state \(s_0\).

If the expert policy is really good then we can produce a strong bot by directly drawing our next action according to the probabilities produced by \(\pi\), or better yet, by taking the move with the highest probability. Unfortunately, getting an expert policy is difficult, and verifying that one’s policy is optimal is difficult as well.

Fortunately, one can improve on a policy by using a modified form of Monte Carlo tree search. This version will also store the probability of each node according to the policy, and this probability is used to adjust the node’s score during selection. The probabilistic upper confidence tree score used by DeepMind is:

$$U_i = \frac{W_i}{N_i} + c P_i \sqrt{\frac{\ln N_p}{1 + N_i}}$$

As before, the score trades off between nodes that consistently produce high scores and nodes that are unexplored. Now, node exploration is guided by the expert policy, biasing exploration towards moves the expert policy considers likely. If the expert policy truly is good, then Monte Carlo tree search efficiently focuses on good evolutions of the game state. If the expert policy is poor, then Monte Carlo tree search may focus on bad evolutions of the game state. Either way, in the limit as the number of samples gets large, the value of a node is dominated by the win/loss ratio \(W_i / N_i\), as before.
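
The only change to the selection code is the score function; a sketch, with the same caveat about unvisited children:

    # Probabilistic UCT: the prior P from the expert policy scales exploration.
    # Unvisited children (N = 0) are treated as having mean value 0 here.
    function puct(W, N, P, N_p; c=1.0)
        q = N == 0 ? 0.0 : W / N
        return q + c * P * sqrt(log(N_p) / (1 + N))
    end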

Efficiency Through Value Approximation

A second form of efficiency can be achieved by avoiding expensive and potentially inaccurate random rollouts. One option is to use the expert policy from the previous section to guide the random rollout. If the policy is good, then the rollout should reflect more realistic, expert-level game progressions and thus more reliably estimate a state’s value.

A second option is to avoid rollouts altogether, and directly approximate the value of a state with a value approximator function \(\hat{W}(s)\). This function takes a state and directly computes a value in \([-1,1]\), without conducting rollouts. Clearly, if \(\hat{W}\) is a good approximation of the true value and can be evaluated faster than a rollout, then execution time can be saved without sacrificing performance.

Value approximation can be used in tandem with an expert policy to speed up Monte Carlo tree search. A serious concern remains: how does one obtain an expert policy and a value function? Does an algorithm exist for training the expert policy and value function?

The Alpha Zero Neural Net

The Alpha Zero algorithm produces better and better expert policies and value functions over time by playing games against itself with accelerated Monte Carlo tree search. The expert policy \(\pi\) and the approximate value function \(\hat{W}\) are both represented by deep neural networks. In fact, to increase efficiency, Alpha Zero uses one neural network \(f\) that takes in the game state and produces both the probabilities over the next move and the approximate state value. (Technically, it takes in the previous eight game states and an indicator telling it whose turn it is.)

$$f(s) \rightarrow [\vec p, W]$$

Leaves in the search tree are expanded by evaluating them with the neural network. Each child is initialized with \(N = 0\), \(W = 0\), and with \(P\) corresponding to the prediction from the network. The value of the expanded node is set to the predicted value and this value is then backed up the tree.

Selection and backup are unchanged. Simply put, during backup each parent’s visit count is incremented and its accumulated value is increased according to \(W\).

The search tree following another selection, expansion, and backup step is:

The core idea of the Alpha Zero algorithm is that the predictions of the neural network can be improved, and the play generated by Monte Carlo tree search can be used to provide the training data. The policy portion of the neural network is improved by training the predicted probabilities \(P\) for \(s_0\) to match the improved probability \(\vec \pi\) obtained from running Monte Carlo tree search on \(s_0\). After running Monte Carlo tree search, the improved policy prediction is

$$\pi_i \propto N_i^{1/\tau}$$

for a constant scalar \(\tau\). Values of \(\tau\) close to zero produce policies that choose the best move according to the Monte Carlo tree search evaluation.
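
In code, the improved policy is simply the visit counts raised to \(1/\tau\) and then normalized; a sketch:

    # Improved policy from root visit counts: πᵢ ∝ Nᵢ^(1/τ).
    # Small τ concentrates the policy on the most-visited move.
    function improved_policy(visit_counts::Vector{Int}; τ=1.0)
        w = visit_counts .^ (1 / τ)
        return w ./ sum(w)
    end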

The value portion of the neural network is improved by training the predicted value to match the eventual win/loss/tie result of the game, \(Z\). Their loss function is:

$$(W-Z)^2 - \vec \pi^\top \ln \vec p + \lambda \|\vec \theta\|_2^2$$

where \((W-Z)^2\) is the value loss, \(-\vec \pi^\top \ln \vec p\) is the policy loss (a cross-entropy), and \(\lambda \|\vec \theta\|_2^2\) is an extra regularization term with parameter \(\lambda \geq 0\), where \(\vec \theta\) represents the parameters of the neural network.

Training is done entirely in self-play. One starts with a randomly initialized set of neural network parameters \(\vec \theta\), which are then improved using gradient descent (or any of the more sophisticated accelerated descent methods; Alpha Zero used stochastic gradient descent with momentum and learning rate annealing) on the loss function for a random selection of states played.

Closing Comments

And that’s it. The folks at DeepMind contributed a clean and stable learning algorithm that trains game-playing agents efficiently using only data from self-play. While the current Zero algorithm only works for discrete games, it will be interesting to see whether it is extended to MDPs or their partially observable counterparts in the future.

It is interesting to see how quickly the field of AI is progressing. Those who claim we will be able to see the robot overlords coming in time should take heed – these AIs will only be human-level for a brief instant before blasting past us into superhuman territories, never to look back.

Model Inference in the Presence of Truncated, Censored, or Missing Data

I have been working with probability and machine learning lately, particularly with fitting distributions to datasets. Fitting data is something covered in conventional pre-college curriculum, but I have only ever done it when all of my data has been complete. More recently I ran into a problem.

One fundamental feature in autonomous driving is the distance to the car in front of you. This tends to be a real number, something hopefully a bit bigger than a few meters when travelling at high speeds and potentially quite large when traffic is low. If you have a set of sensors on your car, like radars or lidar, you can only pick up cars up to a certain distance away from yourself. What do you do if there is no car in front of you? Set the distance to infinity?

This is a good example of censored data: readings outside a given range are not observed exactly; they are reported at the boundary (or simply flagged), but you know when this happens. The other types are truncated and missing data. Data is truncated when the only data you have is within a certain range, and you do not know of the occurrences when it falls outside of that range. Data is missing when you missed or corrupted a reading, or it is for some reason not available. A good example is the velocity of the car in front of you.

So how does one handle fitting distributions to such features?

The answer is a surprisingly straightforward application of Bayes’ theorem.

Consider first a toy problem:

Unstable particles are emitted from a source and decay at a distance \(x\), a real number that has an exponential probability distribution with characteristic length \(\lambda\). We observe \(N\) decays at locations \(x_1, x_2, \ldots, x_N\). What is \(\lambda\)?

“Information Theory, Inference, and Learning Algorithms” by David MacKay

Solving this for the case with perfect data provides us some insight.

Fully Observed Data

The probability distribution for a single sample point, given \(\lambda\), is:

$$P(x\mid \lambda) = \lambda e^{-\lambda x}$$

from the definition for the exponential probability distribution.

Applying Bayes’ theorem and assuming independence:

$$P(\lambda \mid x_{1:N}) = \frac{P(x_{1:N}\mid \lambda)P(\lambda)}{P(x_{1:N})} \propto \lambda^N \exp \left( -\sum_{1}^N \lambda x_n \right) P(\lambda)$$

We can see that simply by conditioning on the data available and setting a prior we can determine the likelihood distribution for the value of \(\lambda\). From here we can do what we wish, such as  picking the most likely value for \(\lambda\).
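
With a flat prior this maximization even has a closed form: setting the derivative of \(N \ln \lambda - \lambda \sum_n x_n\) to zero gives \(\hat\lambda = N / \sum_n x_n\). A tiny Julia sketch (the data is made up for illustration):

    # Flat-prior MAP / maximum-likelihood estimate of the exponential rate λ
    x = [2.1, 0.4, 1.7, 3.0, 0.9]      # made-up decay locations
    λ_hat = length(x) / sum(x)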

Truncated Data

Suppose, however, that the data is truncated. Suppose that we only get readings for particle decays between \(x_\min\) and \(x_\max\). Fitting in the same way is going to cause inconsistencies. Let us start again, following the same steps.

The probability distribution for a single sample point, potentially truncated, given \(\lambda\), is:

$$P(x\mid \lambda) = \begin{cases} \lambda e^{- \lambda x} / Z(\lambda) & x \in [x_\min, x_\max] \\ 0 & \text{otherwise} \end{cases}$$

where $Z(\lambda)$ is a normalization factor:

$$Z(\lambda) = \int_{x_\min}^{x_\max} \lambda e^{- \lambda x} \> dx = e^{-\lambda x_\min} - e^{-\lambda x_\max}$$

We then apply Bayes’ theorem:

$$P(\lambda \mid x_{1:N}) = \frac{P(x_{1:N}\mid \lambda)P(\lambda)}{P(x_{1:N})} \propto \left( \frac{\lambda}{Z(\lambda)} \right)^N \exp\left( -\sum_1^N \lambda x_n \right) P(\lambda)$$

This is very similar, and was quite easy to determine.
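
There is no longer a closed form because of the \(Z(\lambda)^N\) factor, but the log-posterior (with a flat prior) is easy to maximize numerically; a sketch with made-up data and window:

    # Truncated-data log-likelihood: N ln λ - λ Σxₙ - N ln Z(λ),
    # with Z(λ) = exp(-λ x_min) - exp(-λ x_max)
    x_min, x_max = 1.0, 20.0                  # made-up observation window
    x = [2.1, 3.4, 1.7, 5.0, 12.9]            # made-up (truncated) observations
    Z(λ) = exp(-λ * x_min) - exp(-λ * x_max)
    loglik(λ) = length(x) * log(λ) - λ * sum(x) - length(x) * log(Z(λ))
    λs = 0.001:0.001:2.0
    λ_hat = λs[argmax(loglik.(λs))]           # crude grid search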

Censored Data

Without going into detail, we can derive the model for censored data. Suppose \(x\) is censored to be less than \(x_\max\), which occurs if our particle detector has a finite length and particles that would have decayed after \(x_\max\) instead slam into the back and report as \(x_\max\):

$$P(x \mid \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x < x_\max \\ Z'(\lambda)\, \delta(x - x_\max) & \text{for } x = x_\max \end{cases}$$

where $Z(\lambda)$ is the probability of \(x\) being uncensored, $Z'(\lambda)$ is the probability of \(x\) being censored, and \(\delta\) is the Dirac distribution. Here, $Z(\lambda)$ is:

$$Z(\lambda) = \int_0^{x_\max} \lambda e^{-\lambda x} \> dx = 1 - e^{-\lambda x_\max}$$

$$Z'(\lambda) = 1 - Z(\lambda) = e^{-\lambda x_\max}$$
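
In a fit, each uncensored reading contributes \(\lambda e^{-\lambda x}\) to the likelihood and each censored reading contributes \(Z'(\lambda)\); a Julia sketch with made-up data:

    # Censored-data log-likelihood: uncensored readings use the exponential
    # density; readings pinned at x_max each contribute ln Z'(λ) = -λ x_max.
    x_max = 10.0                              # made-up detector length
    x_obs = [2.1, 3.4, 1.7, 5.0]              # made-up uncensored readings
    n_censored = 3                            # made-up count of readings at x_max
    loglik(λ) = sum(log(λ) - λ * xi for xi in x_obs) - n_censored * λ * x_max
    λs = 0.001:0.001:2.0
    λ_hat = λs[argmax(loglik.(λs))]           # crude grid search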

Missing Data

The final case is missing data. Here, you know when the data is missing but you have no information about where it was. This sort of data must be fitted using the original method using only the observed values.