Consistent reasoning

According to Richard Cox, the minimum requirement for expressing our relative beliefs in the truth of a set of propositions is to rank them in a transitive manner. For example, if we believe (proposition) \(A\) more than \(B\), and we believe \(B\) more than \(C\), then we must necessarily believe \(A\) more than \(C\). Such a transitive ranking can be obtained by assigning a real number to each proposition so that the larger the numerical value associated with a proposition, the more we believe it.

If degrees of plausibility are represented by real numbers, then there is a uniquely determined set of quantitative rules for conducting inference. These rules are uniquely valid principles of logic in general, without reference to "chance" or "random variables". Such a quantitative formulation can be used for general problems of scientific inference, all of which arise out of incomplete information rather than "randomness".

Cox made two assertions about the rules such numbers have to obey in order to satisfy simple requirements of logical consistency: 1. If we specify how much we believe that something is true, then we have implicitly specified how much we believe that it is false. 2. If we first specify how much we believe that (proposition) \(Y\) is true, and then state how much we believe that \(X\) is true given that \(Y\) is true, then we have implicitly specified how much we believe that both \(X\) and \(Y\) are true.

This consistency could be ensured if the real numbers attached to the beliefs in propositions obeyed the sum rule (Equation 1.1) and the product rule (Equation 1.2) of probability theory:

$$\tag{1.1} \text{Pr}(X|I) + \text{Pr}(\overline X|I) = 1$$ $$\tag{1.2} \text{Pr}(X,Y|I) = \text{Pr}(X|Y,I) \times \text{Pr}(Y|I) \, ,$$

where \(\text{Pr}(\text{false})=0\) and \(\text{Pr}(\text{true})=1\) define certainty; \(\overline X\) denotes the proposition that \(X\) is false, the vertical bar \(|\) means "given" (so that all items to the right of this conditioning symbol are taken as being true), and the comma is read as the conjunction "and".

The probabilities are conditional on \(I\) to denote the relevant background information at hand, because there is no such thing as absolute probability. Although the conditioning on \(I\) is often omitted in calculations to reduce algebraic cluttering, we must never forget its existence.
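
As a quick sanity check, here is a minimal numerical sketch (the joint distribution for two binary propositions is made up purely for illustration) verifying that the sum rule (1.1) and the product rule (1.2) hold once the joint probabilities are specified:

```python
import numpy as np

# Made-up joint distribution Pr(X,Y|I) for two binary propositions
# (rows: X true / X false, columns: Y true / Y false); purely illustrative.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

pr_X    = joint[0].sum()      # Pr(X|I), marginalising over Y
pr_notX = joint[1].sum()      # Pr(X-bar|I)
pr_Y    = joint[:, 0].sum()   # Pr(Y|I)

# Sum rule (1.1): Pr(X|I) + Pr(X-bar|I) = 1
assert np.isclose(pr_X + pr_notX, 1.0)

# Product rule (1.2): Pr(X,Y|I) = Pr(X|Y,I) * Pr(Y|I)
pr_X_given_Y = joint[0, 0] / pr_Y
assert np.isclose(joint[0, 0], pr_X_given_Y * pr_Y)
```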

Bayes' theorem and marginalization

If we write Equation (1.2) with \(X\) and \(Y\) interchanged:

$$\text{Pr}(Y,X|I) = \text{Pr}(Y|X,I) \times \text{Pr}(X|I) \, ,$$

and combine it with Equation (1.2), we obtain Bayes' theorem:

$$\tag{1.3} \text{Pr}(X|Y,I) = \frac{\text{Pr}(Y|X,I) \times \text{Pr}(X|I)}{\text{Pr}(Y|I)} \, ,$$

because the statement "\(Y\) and \(X\) are both true" is the same as "\(X\) and \(Y\) are both true". The importance of Bayes' theorem becomes apparent if we replace \(X\) and \(Y\) by \(hypothesis\) and \(data\):

$$\underbrace{\text{Pr}(hypothesis|data,I)}_{\text{posterior probability}} \propto \underbrace{\text{Pr}(data|hypothesis,I)}_{\text{likelihood}} \times \underbrace{\text{Pr}(hypothesis|I)}_{\text{prior probability}} \, .$$

The prior probability represents our initial state of belief in the hypothesis before considering the data. The likelihood represents the degree to which the experimental measurements (the data) support the hypothesis. The posterior probability represents our state of knowledge about the truth of the hypothesis in the light of the data. Bayes' theorem thus encapsulates the process of learning.
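
To make this learning process concrete, here is a small illustrative sketch (the coin-tossing numbers are invented) that applies Bayes' theorem on a grid of hypotheses: a flat prior over a coin's head probability \(\theta\) is updated by the likelihood of observing 9 heads in 12 tosses.

```python
import numpy as np

# Invented example: infer a coin's head probability theta from 9 heads in 12 tosses.
# The hypothesis is theta, discretised on a grid; the prior is taken to be flat.
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)                                  # Pr(theta|I)
heads, tosses = 9, 12
likelihood = theta**heads * (1.0 - theta)**(tosses - heads)  # Pr(data|theta,I)

unnormalised = likelihood * prior            # numerator of Bayes' theorem
evidence = (unnormalised * dtheta).sum()     # Pr(data|I), the omitted denominator
posterior = unnormalised / evidence          # Pr(theta|data,I)

print("posterior mean of theta:", (theta * posterior * dtheta).sum())
```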

The equality of Equation (1.3) has been replaced with proportionality because the term \(\text{Pr}(data|I)\), called the evidence (also the marginal likelihood or the normalizing constant), has been omitted. This is fine for parameter estimation problems, since the missing denominator does not depend explicitly on the hypothesis: it is simply a normalization constant that ensures the posterior is a valid probability distribution. In other situations, such as model selection, this term plays a crucial role. The evidence represents the probability of observing the data averaged over all possible values of the hypothesis, which is why it does not depend on any specific hypothesis being considered.
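
The role the evidence plays in model selection can be sketched with the same invented coin data: comparing a "fair coin" model against an "unknown bias" model reduces to a ratio of their evidences (a Bayes factor). This is only an illustration of the idea, not a full treatment.

```python
import numpy as np

# Invented data: 9 heads in 12 tosses. The binomial coefficient is omitted
# because it is common to both models and cancels in the ratio.
heads, tosses = 9, 12

# Model A: the coin is fair, theta fixed at 0.5 -> the evidence is just the likelihood.
evidence_A = 0.5**heads * 0.5**(tosses - heads)

# Model B: theta unknown with a flat prior on [0,1] -> the evidence is the
# likelihood averaged over the prior (numerical integration on a grid).
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]
likelihood = theta**heads * (1.0 - theta)**(tosses - heads)
evidence_B = (likelihood * 1.0 * dtheta).sum()   # prior density = 1 on [0,1]

print("Bayes factor (B over A):", evidence_B / evidence_A)
```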

The marginalization equation:

$$\tag{1.4} \text{Pr}(X|I) = \int_{-\infty}^{+\infty}{\text{Pr}(X,Y|I) \, \text{d}Y} \, ,$$

states that the probability of \(X\) alone is obtained by integrating the joint probability over all possible values of \(Y\). This is how nuisance parameters, which are needed to formulate a problem but are of no intrinsic interest, are removed from the final result.
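
As an illustrative sketch (the functional forms are made up), the marginal probability of a binary proposition \(X\) can be computed numerically by integrating out a continuous nuisance parameter \(Y\):

```python
import numpy as np

# Made-up sketch of Eq. (1.4): the probability of a binary proposition X
# depends on a continuous nuisance parameter Y, which has a standard normal
# density Pr(Y|I). Integrating out Y gives the marginal Pr(X|I).
y = np.linspace(-8.0, 8.0, 4001)
dy = y[1] - y[0]

pr_y = np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)   # Pr(Y|I)
pr_x_given_y = 1.0 / (1.0 + np.exp(-y))             # Pr(X|Y,I), an invented dependence

# Product rule gives the joint Pr(X,Y|I); integrating over Y gives the marginal.
pr_x = (pr_x_given_y * pr_y * dy).sum()
print("Pr(X|I) =", pr_x)   # comes out to 0.5 by symmetry of the chosen forms
```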

Randomness or lack of information

Whether randomness is inherent in the physical world is a debated topic in the field of philosophy of science and probability theory. Some argue that randomness is a fundamental feature of the universe, while others argue that what we perceive as randomness is simply a lack of information about underlying deterministic processes.

In some cases, events may appear to be random because we do not have complete knowledge of all the variables and factors that influence them. For example, the outcome of a coin flip may appear random because we cannot measure all the variables that determine the outcome, such as the initial position and velocity of the coin.

However, there are also phenomena that appear to be genuinely random, such as radioactive decay or quantum events. In these cases, the randomness cannot be explained by a lack of information, and it is thought to be an inherent feature of the physical world.

For more discussion on the subject, the paper Quantum probabilities as Bayesian probabilities (Caves, 2002) provides an insightful perspective on the foundations of quantum theory.

Caves argues that the traditional interpretation of quantum probabilities as objective probabilities is problematic and leads to various paradoxes and inconsistencies. Instead, he proposes that quantum probabilities should be interpreted as subjective probabilities that reflect an agent's beliefs about the system being measured. This interpretation rests on the following arguments.

Firstly, in the standard interpretation of quantum mechanics, probabilities are often seen as objective, or inherent in the system being measured. However, this interpretation faces several problems, including the measurement problem and the paradoxes of quantum entanglement. These problems suggest that the standard interpretation of quantum probabilities as objective is problematic and may not be able to provide a consistent and coherent account of quantum phenomena.

Secondly, Bayesian probability theory provides an alternative way of understanding probabilities that does not rely on the assumption of objective probabilities. In Bayesian probability theory, probabilities are seen as subjective degrees of belief, or the agent's willingness to bet on the outcome of an event given the available evidence. This approach allows probabilities to be updated as new evidence is obtained, providing a flexible and dynamic way of dealing with uncertainty.

Thirdly, the Bayesian interpretation of quantum probabilities has been shown to be compatible with quantum mechanics and provides a coherent framework for interpreting quantum phenomena. In this interpretation, the probabilities associated with quantum events are seen as reflecting an agent's subjective degrees of belief about the system being measured, rather than objective properties of the system itself.

Comparison of inference methods

Frequentist methods are not well suited to real problems because they assume independent repetitions of a random experiment and disregard the relevant prior information.

Bayesian methods correct the defects of the frequentist approach: they take prior information into account and make allowance for nuisance parameters, easily solving problems where frequentist methods fail. However, Bayesian methods have no apparatus for the initial exploratory phase of a scientific problem, in which the problem has not yet developed enough structure (a model, sample space, hypothesis space, prior probabilities, and sampling distribution).

The principle of maximum entropy only requires us to define a sample space; there is no need to specify a model or sampling distribution. The model is constructed directly from the data, and this proves effective for many different problems.

In contrast, we may choose a model for Bayesian analysis based on prior knowledge or working hypotheses which may be speculative (usually such hypotheses extend beyond what is directly observable in the data), whereas maximum entropy methods invoke no hypotheses beyond the sample space and the evidence available from the data.

Maximum entropy methods predict only observable facts (functions of future or past observations) rather than values of parameters that may exist only in our imagination, so they are more appropriate when we have little knowledge beyond the raw data. The difficulty with this method arises when the information is extremely vague, which makes it hard to define an appropriate sample space.
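
A classic illustration of the principle is a sketch of Jaynes's dice example (the mean value 4.5 is chosen purely for illustration): find the least-informative distribution over the faces of a die given only its mean. The maximum-entropy solution has an exponential form whose single parameter is fixed by the constraint.

```python
import numpy as np

# Sketch of a maximum-entropy problem: the sample space is the die faces 1..6
# and the only evidence is an assumed mean of 4.5. The maximum-entropy
# distribution has the form p_i proportional to exp(lam * i), with lam set by the constraint.
faces = np.arange(1, 7)
target_mean = 4.5

def mean_for(lam):
    w = np.exp(lam * faces)
    p = w / w.sum()
    return (faces * p).sum()

# The mean is monotonically increasing in lam, so simple bisection suffices.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean_for(mid) < target_mean:
        lo = mid
    else:
        hi = mid

lam = 0.5 * (lo + hi)
p = np.exp(lam * faces) / np.exp(lam * faces).sum()
print("maximum-entropy probabilities:", np.round(p, 3))
```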

Successful applications of maximum entropy methods include maximum entropy spectral analysis and image reconstruction algorithms. However, as we become more aware of the appropriate model and hypothesis space, these applications evolve into the Bayesian phase, which allows us to make use of more prior information.

  • Bayesian Spectrum Analysis and Parameter Estimation (Bretthorst, 1988) provides a detailed overview of Bayesian methods for spectral analysis, including the use of Bayesian inference to estimate power spectra and cross spectra. The author explores applications of Bayesian methods to parameter estimation, including the estimation of model parameters in nonlinear and non-Gaussian systems, with examples drawn from astronomical data, neural data, and signals in communication systems.
  • Data Analysis - A Bayesian Tutorial (Sivia, 1996) provides a detailed overview of the Bayesian approach to data analysis, including the specification of prior distributions, the computation of posterior distributions, and the use of Bayes factors for model comparison. The author explores the applications of Bayesian analysis in various fields, including the analysis of astronomical data, the analysis of chemical spectra, and the analysis of clinical trial data. In each field, the author presents case studies and examples of how Bayesian analysis can be used to solve practical problems.
  • Maximum Entropy in Action (Macaulay, 1991) presents a comprehensive introduction to the principle of maximum entropy, including its application to parameter estimation, hypothesis testing, and model selection in various fields, including physics, statistics, engineering, and economics.