```
# Simulate (x, y) pairs from the linear model y = m * x + q + noise
library(tibble)

m <- 0.1  # slope
q <- 0    # intercept

rxy <- function(n) {
  tibble(
    x = rnorm(n, sd = 1),
    y = m * x + q + rnorm(n, sd = 1)
  )
}
```

If $A$ has two distinct eigenvalues $\lambda_1 \neq \lambda_2$, and we can decompose it as follows:

$$A = \lambda_1 P_1 + \lambda_2 P_2, \tag{3}$$

with:

$$P_1 = \frac{A - \lambda_2}{\lambda_1 - \lambda_2}, \qquad P_2 = \frac{A - \lambda_1}{\lambda_2 - \lambda_1}, \tag{4}$$

that satisfy:

$$P_1^2 = P_1, \qquad P_2^2 = P_2, \qquad P_1 P_2 = P_2 P_1 = 0, \qquad P_1 + P_2 = 1.$$

Using Equations 3 and 4, one can immediately compute the exponential:

$$e^{tA} = e^{\lambda_1 t} P_1 + e^{\lambda_2 t} P_2. \tag{5}$$

We can obtain the degenerate case $\lambda_1 = \lambda_2 \equiv \lambda$ from the limit $\lambda_2 \to \lambda_1$ of Equation 5. Observing that:

$$\lim_{\lambda_2 \to \lambda_1} \frac{e^{\lambda_1 t} - e^{\lambda_2 t}}{\lambda_1 - \lambda_2} = t\, e^{\lambda_1 t},$$

from Equation 5 we find:

$$e^{tA} = e^{\lambda_2 t} + \frac{e^{\lambda_1 t} - e^{\lambda_2 t}}{\lambda_1 - \lambda_2}\,(A - \lambda_2),$$

that yields:

$$e^{tA} = e^{\lambda t}\left(1 + t\,(A - \lambda)\right).$$

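As a numerical sanity check, the sketch below verifies the spectral formula for a concrete matrix with distinct real eigenvalues; the test matrix and the names `l1`, `l2`, `P1`, `P2` are my own illustrative choices.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])   # upper triangular: eigenvalues l1 = 2, l2 = 3
l1, l2 = 2.0, 3.0
I = np.eye(2)

# Spectral projectors onto the two eigenspaces
P1 = (A - l2 * I) / (l1 - l2)
P2 = (A - l1 * I) / (l2 - l1)

t = 0.7
expm_tA = np.exp(l1 * t) * P1 + np.exp(l2 * t) * P2

# For this triangular A, exp(tA) is known in closed form
expected = np.array([[np.exp(l1 * t), np.exp(l2 * t) - np.exp(l1 * t)],
                     [0.0,            np.exp(l2 * t)]])
assert np.allclose(expm_tA, expected)
```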
BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Exponential of a 2x2 Real Matrix},
date = {2024-11-04},
url = {https://vgherard.github.io/posts/2024-11-04-exponential-of-a-2x2-matrix/exponential-of-a-2x2-real-matrix.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “Exponential of a 2x2 Real
Matrix.” November 4, 2024. https://vgherard.github.io/posts/2024-11-04-exponential-of-a-2x2-matrix/exponential-of-a-2x2-real-matrix.html.

The arguments start by operationalizing^{1} the (otherwise quantitatively vague) concept of “degree of belief”. The degree of belief $p$ of a subject in a given uncertain event^{2} $E$ is supposed to be a number that can be quantified as follows:

- The subject must be willing to pay an amount $pS$, with the condition that he will be rewarded an amount $S$ if event $E$ obtains.
- The subject must also be willing to accept an amount $pS$, with the condition that he will give up an amount $S$ if event $E$ obtains.

(the concrete unit of payment in these imaginary bets is irrelevant to the arguments, but is assumed to be a continuous quantity). The idea is that, for the subject not to incur a sure loss, the rule $E \mapsto p(E)$ must define a single-valued function that follows the laws of probability.

Let us start by showing that the function $p$ must be single valued (*i.e.* the degree of belief is operatively well-defined). Suppose that someone assigns two different degrees of belief $p_1$ and $p_2$ to the same event $E$, and suppose that $p_1 > p_2$, for instance. This means that this person must be willing to place a bet of $p_1 S$ on $E$ and, at the same time, accept a bet of $p_2 S$ on $E$. But this doesn’t make sense, because the net result of these combined bets is a sure loss (of an amount $(p_1 - p_2)S$). We can conclude that the degree of belief assigned by any individual to a given event must be a single number.

With a similar approach we can prove the three basic axioms of probability:

**Positivity.** Degrees of belief must be positive numbers, because no rational agent would be willing to pay a positive amount $\vert p \vert S$ (which is the only reasonable way to interpret accepting a negative amount $pS$ with $p < 0$), for later having to pay an additional amount $S$ if event $E$ obtains.

**Normalization.** Suppose that the event $E$ is certain, and let $p$ be your degree of belief in it. If $p < 1$, one can force you to accept an amount $pS$, and to give $S$ in return, with certainty since $E$ is always true. If $p > 1$, one can force you to pay an amount $pS$ to get $S$ in return. In both cases, you face a sure loss of an amount $\vert 1 - p \vert S$, unless $p = 1$.

**Additivity.** Suppose that events $A$ and $B$ are mutually exclusive, and let $p_A$, $p_B$ and $p_{A \vee B}$ be your degrees of belief in $A$, $B$ and $A \vee B$ ($A \vee B$ denoting $A$ *or* $B$). If $p_{A \vee B} < p_A + p_B$, one can force you to accept the following bets:

- You pay $p_A S$ and receive $S$ if $A$ obtains.
- You pay $p_B S$ and receive $S$ if $B$ obtains.
- You receive $p_{A \vee B} S$ and pay $S$ if $A \vee B$ obtains.

The fact that $A$ and $B$ are mutually exclusive implies that the cash flow from the returns of these three bets is always zero, and the net flow is entirely set by the initial payments. For you, this is $(p_{A \vee B} - p_A - p_B)S < 0$, which is a sure loss. If $p_{A \vee B} > p_A + p_B$, your opponent could also force you into a sure loss by reversing the bet directions in the previous argument.
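The additivity argument can be checked by tabulating the payoffs; the numerical values of the degrees of belief and of the stake below are arbitrary illustrative choices.

```python
# pA, pB, pAB are degrees of belief in A, B, and A-or-B; S is the stake.
# Values chosen so that pAB < pA + pB, the case treated in the argument.
pA, pB, pAB, S = 0.3, 0.4, 0.5, 1.0

def net_gain(a_obtains, b_obtains):
    gain = -pA * S - pB * S + pAB * S   # initial payments of the three bets
    if a_obtains:
        gain += S   # win the bet on A...
        gain -= S   # ...but pay out the bet on A-or-B
    if b_obtains:
        gain += S   # win the bet on B...
        gain -= S   # ...but pay out the bet on A-or-B
    return gain

# A and B are mutually exclusive, so there are only three possible outcomes
outcomes = [net_gain(False, False), net_gain(True, False), net_gain(False, True)]
assert all(abs(g - (pAB - pA - pB) * S) < 1e-12 for g in outcomes)
```

Every outcome yields the same net gain $(p_{A \vee B} - p_A - p_B)S$, negative by construction: a sure loss.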

There are also Dutch book arguments showing that when some rational agent has to update its degrees of belief based on new information, it must do so according to the standard conditioning rule:

$$p(A \vert B) = \frac{p(A \wedge B)}{p(B)}, \tag{1}$$

where $p(A \vert B)$ denotes the degree of belief on $A$ after finding out that $B$ is true, while $A \wedge B$ denotes the event that both $A$ *and* $B$ are true. It is worth noticing that the argument given below relies on the fact that the agent’s assessment of $p(A \vert B)$ is fixed and declared a priori, and does not change according to whether $B$ actually obtains or not.

The argument goes as follows: suppose that Equation 1 does not hold and, for instance, $p(A \vert B) < p(A \wedge B) / p(B)$. We can then force the agent to participate in the following bets:

- The agent accepts $p(B)\, p(A \vert B)\, S$, to give $p(A \vert B)\, S$ in return if $B$ obtains^{3}.
- The agent pays $p(A \wedge B)\, S$ to get $S$ in return if both $A$ and $B$ obtain.
- If $B$ obtains, the agent accepts $p(A \vert B)\, S$ to give $S$ in return if $A$ obtains.

It is easy to see that, irrespective of the outcomes of the $A$ and $B$ events, the agent always ends up losing an amount $\left( p(A \wedge B) - p(A \vert B)\, p(B) \right) S$. If $p(A \vert B) > p(A \wedge B) / p(B)$, reversing the direction of these bets also leads to a sure loss for the agent. Clearly, the only way out is to set conditional probabilities according to Equation 1.
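The payoff table for this construction can also be checked numerically; the shorthand `q`, `r`, `s` and the concrete values below are my own.

```python
# q = p(A|B), r = p(A and B), s = p(B); arbitrary values with q < r / s.
q, r, s, S = 0.5, 0.24, 0.4, 1.0

def net_gain(b_obtains, a_obtains):
    gain = s * q * S    # bet 1: accept s*q*S against B
    gain -= r * S       # bet 2: pay r*S for a bet on (A and B)
    if b_obtains:
        gain -= q * S   # bet 1 pays out, since B obtained
        gain += q * S   # bet 3: accept q*S against A, now that B obtained
        if a_obtains:
            gain += S   # bet 2 pays out
            gain -= S   # bet 3 pays out, since A obtained
    return gain

# If B does not obtain, the value of A is irrelevant: three outcomes in total
outcomes = [net_gain(False, False), net_gain(True, False), net_gain(True, True)]
assert all(abs(g - (s * q - r) * S) < 1e-12 for g in outcomes)
```

In every case the agent’s net gain is $(p(A \vert B)\,p(B) - p(A \wedge B))S$, which is negative by assumption.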

Which means “giving a practical way to measure”.↩︎

I use the word “events” in a conventional way. Strictly speaking, to reflect the wider generality of the subjectivist point of view, I should rather talk about “uncertain propositions” or “hypothetical facts”.↩︎

This is a slight generalization of the type of bet used in the operative definition of “degree of belief”, which assumes a unitary return amount. The reason why this is equivalent to the original definition is that, given bets $1, \dots, k$ with unitary return amounts, and positive numbers $c_1, \dots, c_k$, we can always find positive integers $n_1, \dots, n_k$, such that the total gain from repeating $n_i$ times bet $i$ for all $i$ always has the same sign as the total gain from the “generalized bets” that result by changing the return amount of bet $i$ to $c_i$.↩︎

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Dutch Book Arguments},
date = {2024-08-12},
url = {https://vgherard.github.io/posts/2024-08-12-dutch-book-arguments/dutch-book-arguments.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “Dutch Book Arguments.” August 12,
2024. https://vgherard.github.io/posts/2024-08-12-dutch-book-arguments/dutch-book-arguments.html.

Besides the important role that the $(U, V)$ parametrization plays in the axiomatic formulation of thermodynamics given in (Lieb and Yngvason 1999a), which makes UV diagrams interesting *per se*, there’s a practical advantage in these two variables, in the fact that they uniquely characterize the states of pure substances everywhere in the phase diagram. In particular, a triple “point” becomes a triangle in the UV diagram, as can be seen from the parametrization:

$$U = m_g u_g + m_l u_l + m_s u_s, \qquad V = m_g v_g + m_l v_l + m_s v_s,$$

where $m_i$, $u_i$ and $v_i$ (for $i = g, l, s$) are the masses, specific internal energies and specific volumes of the gaseous, liquid and solid phases, respectively (the total mass $M = m_g + m_l + m_s$ is assumed to be fixed). This triangle is projected onto a line in the $pV$ diagram, and onto a single point in the $pT$ diagram.

The subsets of the $(U, V)$ plane representing the fusion, sublimation and vaporization curves (or any other curve on the phase diagram representing the coexistence of two distinct phases) are still two-dimensional submanifolds, but the parametrization is more involved:

$$U = m_a u_a(T) + m_b u_b(T), \qquad V = m_a v_a(T) + m_b v_b(T),$$

where $a$ and $b$ are the two coexisting phases, and the specific energies and volumes vary with temperature along the coexistence curve. These sets are obtained by joining, for each value of $T$, the two corresponding points on the curves $(u_a(T), v_a(T))$ and $(u_b(T), v_b(T))$ in the $(u, v)$ plane.

By pure coincidence, an example of the diagrams I am referring to is shown in (Lieb and Yngvason 1999b), the erratum to the original reference (Lieb and Yngvason 1999a).

Lieb, Elliott H., and Jakob Yngvason. 1999a. “The Physics and Mathematics of the Second Law of Thermodynamics.” *Physics Reports* 310 (1): 1–96. https://doi.org/10.1016/S0370-1573(98)00082-9.

———. 1999b. “The Physics and Mathematics of the Second Law of Thermodynamics (Physics Reports 310 (1999) 1–96).” *Physics Reports* 314 (6): 669. https://doi.org/10.1016/S0370-1573(99)00029-0.

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {The {UV} Diagram for a Pure Substance},
date = {2024-07-01},
url = {https://vgherard.github.io/posts/2024-07-01-the-uv-diagram-for-a-pure-substance/the-uv-diagram-for-a-pure-substance.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “The UV Diagram for a Pure
Substance.” July 1, 2024. https://vgherard.github.io/posts/2024-07-01-the-uv-diagram-for-a-pure-substance/the-uv-diagram-for-a-pure-substance.html.

The essential postulates of classical thermodynamics are formulated, from which the second law is deduced as the principle of increase of entropy in irreversible adiabatic processes that take one equilibrium state to another. The entropy constructed here is defined only for equilibrium states and no attempt is made to define it otherwise. Statistical mechanics does not enter these considerations. One of the main concepts that makes everything work is the comparison principle (which, in essence, states that given any two states of the same chemical composition at least one is adiabatically accessible from the other) and we show that it can be derived from some assumptions about the pressure and thermal equilibrium. Temperature is derived from entropy, but at the start not even the concept of ‘hotness’ is assumed. Our formulation offers a certain clarity and rigor that goes beyond most textbook discussions of the second law.

The paper can be subdivided into two parts. In the first part, the authors deduce the existence of a universal entropy state function from a primitive notion of adiabatic accessibility. A key assumption in this derivation is the so-called Comparison Hypothesis, which postulates that for any two equilibrium states $X$ and $Y$ of a given thermodynamic system, either $X$ is adiabatically accessible from $Y$ or $Y$ is adiabatically accessible from $X$.

The second part is devoted to the derivation of the Comparison Hypothesis (which becomes thus a Comparison *Principle*) for an important class of thermodynamic systems. The main ingredients for this derivation are:

- The First Law of Thermodynamics and, in particular, the concept of internal energy.
- The notion of thermal equilibrium and its transitivity (sometimes called the “Zeroth Law”).
- An assumed convex structure in the state space of thermodynamic systems.

So, how does such a formal endeavor improve our understanding of Nature? On one side, entropy is clarified to represent nothing more than a numerical encoding of adiabatic accessibility. Its existence is independent of the concepts of energy, heat or temperature that appear in the usual formulations of Thermodynamics. On the other side, the classical statements of the Second Law of Kelvin-Planck and Clausius, which are explicitly formulated in terms of heat and temperature, are seen to be *theorems* on the adiabatic (in)accessibility of certain states, that follow from how the concepts of internal energy and thermal equilibrium interact with the primitive notion of adiabatic accessibility. Both of these are major advancements, because they clarify the physical content and universality of the Second Law of Thermodynamics.

The last Section of (Lieb and Yngvason 1999) provides a useful summary of the formalism and of the logical paths followed by the mathematical derivations. The rest of this post are some personal notes on the original reference.

Systems are described as collections of thermodynamic states, with two binary operations that have the operative interpretations of “composition” (of non-interacting systems) and “scaling”, respectively. In detail:

- **Systems.** A *system* is just a set $\Gamma$, whose elements correspond to physical (thermodynamic) states.
- **Composition.** A binary operation that takes as input two thermodynamic states $X \in \Gamma_1$ and $Y \in \Gamma_2$ of two systems $\Gamma_1$ and $\Gamma_2$, and returns a new state $(X, Y) \in (\Gamma_1, \Gamma_2)$, where the latter set can be informally thought of as the Cartesian product $\Gamma_1 \times \Gamma_2$. Since the ordering in the enumeration of systems is inconsequential from the physical point of view, we assume that compositions are equipped with an equivalence relation $\sim$, such that $(X, Y) \sim (Y, X)$ and that $((X, Y), Z) \sim (X, (Y, Z))$.
- **Scaling.** A binary operation that takes as input a positive number $\lambda$ and a thermodynamic state $X \in \Gamma$, and returns a new state $\lambda X \in \lambda \Gamma$ as output. The operation is assumed to satisfy $1X = X$, as well as $\mu(\lambda X) = (\mu \lambda) X$.

Some care is needed in the operative interpretation of the scaling operation for systems that are not scale invariant, such as a droplet of liquid with a non-negligible surface energy. This particular example is discussed in detail by the authors in Sec. 3.1, in connection with simple systems. In short, for such systems we should interpret the state space as a submanifold of an extended state space corresponding to a scale invariant system. It is for this latter system that the axioms of (Lieb and Yngvason 1999) are assumed to hold.

A central notion in (Lieb and Yngvason 1999) is that of *adiabatic accessibility*. From the operative point of view, this corresponds to the possibility of transforming a thermodynamic state into another without supplying heat to the system. Mathematically, adiabatic accessibility is a relation $\prec$ on the set of all thermodynamic states that satisfies (among other things) the axioms of a pre-order: reflexivity and transitivity.

More in detail, $\prec$ is assumed to satisfy the following axioms:

- **(A1) Reflexivity**. $X \prec X$ for any $X$.
- **(A2) Transitivity**. $X \prec Y$ and $Y \prec Z$ implies $X \prec Z$ for all $X$, $Y$ and $Z$.
- **(A3) Consistency**. $X \prec X'$ and $Y \prec Y'$ implies $(X, Y) \prec (X', Y')$ for all $X$, $X'$, $Y$ and $Y'$.
- **(A4) Scaling invariance**. $X \prec Y$ implies $\lambda X \prec \lambda Y$ for all $\lambda > 0$, $X$ and $Y$.
- **(A5) Splitting and recombination**. $X \sim (\lambda X, (1 - \lambda) X)$ for all $0 < \lambda < 1$ and $X$.
- **(A6) Stability**: if $(X, \varepsilon_n Z_0) \prec (Y, \varepsilon_n Z_1)$ holds for some sequence $\varepsilon_n \to 0$, then $X \prec Y$.

The first part of (Lieb and Yngvason 1999) is concerned with the existence and uniqueness of the entropy function. The authors prove that, under axioms A1-A6 characterizing adiabatic accessibility, the two following conditions are equivalent:

- *(i)* Given states $X_1, \dots, X_N$ and $Y_1, \dots, Y_M$ of a system $\Gamma$ and positive numbers $\lambda_i$ and $\lambda'_j$ with $\sum_i \lambda_i = \sum_j \lambda'_j$, the two states $X = (\lambda_1 X_1, \dots, \lambda_N X_N)$ and $Y = (\lambda'_1 Y_1, \dots, \lambda'_M Y_M)$ are always comparable, in the sense that either $X \prec Y$ or $Y \prec X$. This is expressed by saying that “the comparison hypothesis holds on multiple scaled copies of $\Gamma$”.
- *(ii)* There exists a function $S \colon \Gamma \to \mathbb{R}$ with the following property: if $X = (\lambda_1 X_1, \dots, \lambda_N X_N)$ and $Y = (\lambda'_1 Y_1, \dots, \lambda'_M Y_M)$, where $\sum_i \lambda_i = \sum_j \lambda'_j$, we have $X \prec Y$ if and only if^{1}:

$$\sum_{i = 1}^N \lambda_i S(X_i) \leq \sum_{j = 1}^M \lambda'_j S(Y_j).$$

Furthermore, the function is given by $S' = a S + b$ for some constants $a$, $b$ with $a > 0$, where^{2}:

$$S(X) = \sup \left\{ \lambda \,:\, \left( (1 - \lambda) X_0, \lambda X_1 \right) \prec X \right\},$$

where $X_0$ and $X_1$ are two reference states such that $X_0 \prec X_1$ but the reverse does not hold (the definition enforces $S(X_0) = 0$ and $S(X_1) = 1$).

**(A7) Convex combination**. The space $\Gamma$ is a locally convex subset of a vector space, meaning that $t X + (1 - t) Y \in \Gamma$ for all $X, Y \in \Gamma$ and $t \in [0, 1]$. Furthermore, we have that $\left( t X, (1 - t) Y \right) \prec t X + (1 - t) Y$.

It is easily seen that if an entropy function (in the sense of the previous section) exists, then the second part of the previous axiom (involving $\prec$) is equivalent to the requirement that entropy is a concave function, *i.e.* $S(t X + (1 - t) Y) \geq t S(X) + (1 - t) S(Y)$.

The convex combination of states is most easily interpreted for simple systems (see below), and corresponds to a state in which energy and work coordinates take the intermediate values $t (U_1, V_1) + (1 - t) (U_2, V_2)$. The possibility of forming convex combinations of states and their adiabatic accessibility is quite obvious for systems such as gases or liquids. For example, suppose we are given $t$ and $1 - t$ moles of gas with energies and volumes $(t U_1, t V_1)$ and $((1 - t) U_2, (1 - t) V_2)$, where $0 < t < 1$, and the two gases are separated by a rigid, adiabatic wall. The convex combination is the state of 1 mole of gas that is produced by simply removing the barrier. For solids or more complicated systems, the axiom looks much more questionable, but can be at least experimentally tested for any given system, by exploiting its equivalence with the concavity of entropy.

The discussion moves on to concrete systems whose state space $\Gamma$ is assumed to be an open convex subset of $\mathbb{R}^{n+1}$ for some $n$, in agreement with A7. The points of these systems are denoted by pairs $(U, V)$, where $U \in \mathbb{R}$ and $V \in \mathbb{R}^n$ have the physical interpretation of energy and extensive work coordinates, respectively. The possibility of a parametrization with internal energy as one coordinate is the content of the First Law of Thermodynamics.

Simple systems are characterized by the following axioms (the separation of S2 into S2a and S2b is mine):

- **(S1) Irreversibility**. For each $X \in \Gamma$ there is a point $Y \in \Gamma$ such that $X \prec Y$ but not $Y \prec X$. (This axiom is actually implied by the stronger axiom T4, see below.)
- **(S2a) Tangent Plane**. For each $X = (U_X, V_X) \in \Gamma$, the forward sector $A_X = \{ Y \in \Gamma \,:\, X \prec Y \}$ has a unique tangent plane at $X$, parametrized by an equation of the form:

$$U - U_X + \sum_{i = 1}^n P_i(X) \left( V_i - V_{X, i} \right) = 0.$$

- **(S2b) Lipschitz Continuity.** The functions $P_i$ in the previous axiom are locally Lipschitz continuous functions.
- **(S3) Connectedness of the boundary.** The boundary $\partial A_X$ of any forward sector is connected.

Using only axioms **(S1)** and **(S2a)**, the Kelvin-Planck statement of the Second Law of Thermodynamics can be proved, namely:

No process is possible, the sole result of which is a change in the energy of a simple system (without changing the work coordinates) and the raising of a weight.

Including also **(S2b)** and **(S3)**, the main theorem of this section can be proved, stating that *any two states of a simple system are comparable*.

- **(T1) Thermal Contact**. For any two simple systems with state spaces $\Gamma_1$ and $\Gamma_2$, there is another simple system, the *thermal join* of $\Gamma_1$ and $\Gamma_2$, with state space:

$$\Delta_{12} = \left\{ (U, V_1, V_2) \,:\, U = U_1 + U_2, \ (U_1, V_1) \in \Gamma_1, \ (U_2, V_2) \in \Gamma_2 \right\},$$

whose states satisfy:

$$\left( (U_1, V_1), (U_2, V_2) \right) \prec (U_1 + U_2, V_1, V_2).$$

- **(T2) Thermal Splitting**. For any state $(U, V_1, V_2)$ of the thermal join there exist states $(U_1, V_1) \in \Gamma_1$ and $(U_2, V_2) \in \Gamma_2$ with $U_1 + U_2 = U$, such that:

$$(U, V_1, V_2) \sim \left( (U_1, V_1), (U_2, V_2) \right).$$

In particular, if $(U, V)$ is a state of a simple system $\Gamma$ and $0 < \lambda < 1$, we have:

$$(U, \lambda V, (1 - \lambda) V) \sim (U, V).$$

The right-hand side here is an element of $\Gamma$, while the left-hand side is an element of the thermal join of the scaled copies $\lambda \Gamma$ and $(1 - \lambda) \Gamma$.

- **(T3) Zeroth law**. The relation $\sim_T$ defined in the following is an equivalence relation: $X \sim_T Y$ if $(X, Y) \sim (U_X + U_Y, V_X, V_Y)$.
- **(T4) Transversality**. If $\Gamma$ is a simple system and $X \in \Gamma$, there exist states $X_0 \sim_T X_1$ such that $X_0 \prec X$ and $X \prec X_1$, neither relation being reversible.
- **(T5) Universal temperature range**. If $\Gamma_1$ and $\Gamma_2$ are simple systems, and $V$ is in the projection of $\Gamma_1$ onto its work coordinates, for every $X \in \Gamma_2$ there is a $Y = (U, V) \in \Gamma_1$ such that $X \sim_T Y$.

The main result from these additional axioms is *comparability in the product spaces of simple systems*. Also, thermal equilibrium is found to be characterized by maximum entropy, in the sense that if $X_1 = (U_1, V_1)$ and $X_2 = (U_2, V_2)$ are states of simple systems with (consistent) entropies $S_1$ and $S_2$ respectively, then:

$$X_1 \sim_T X_2 \iff S_1(X_1) + S_2(X_2) = \max_{U_1' + U_2' = U_1 + U_2} \left[ S_1(U_1', V_1) + S_2(U_2', V_2) \right].$$

If (T3) is assumed, and if we postulate that for each state there is at least one state in thermal equilibrium with it - which is guaranteed by (T5) - then the second part of (T2) is equivalent to scaling invariance, $X \sim_T \lambda X$. In fact, transitivity and symmetry of $\sim_T$ implies $\lambda X \sim_T \mu X$ for any $\lambda, \mu > 0$. Then the second part of (T2) is equivalent to $\lambda X \sim_T \mu X$ for any $X$ and any $\lambda, \mu > 0$, which is in turn equivalent to $X \sim_T \lambda X$ for any $X$ and $\lambda > 0$.

Temperature is a derived concept in the formalism of (Lieb and Yngvason 1999), by means of the usual bridge law $1 / T = \partial S / \partial U$. The differentiability of entropy is shown to be a consequence of the axioms stated so far.

Using the derived concept of temperature, the authors prove the Clausius inequality for a Carnot cycle:

$$\frac{Q_1}{T_1} + \frac{Q_2}{T_2} \leq 0,$$

where $T_1$ and $T_2$ are the *final* temperatures of the reservoirs, which do not need to be assumed of infinite mass.

The last result of the paper concerns thermodynamic processes that connect states of different systems. The notion of connection of state spaces is introduced: two state spaces $\Gamma$ and $\Gamma'$ are said to be connected if there are state spaces $\Gamma_1$ and $\Gamma_2$ and states $Z_1 \in \Gamma_1$ and $Z_2 \in \Gamma_2$, and $X \in \Gamma$ and $Y \in \Gamma'$, such that:

$$(X, Z_1) \prec (Y, Z_2).$$

If $\Gamma$ is connected to $\Gamma'$ we write $\Gamma \prec \Gamma'$. The last axiom is:

**(M) Absence of sinks**. If $\Gamma$ is connected to $\Gamma'$, then $\Gamma'$ is connected to $\Gamma$.

With this additional axiom, one can prove a “weak” form of the entropy principle: that the multiplicative and additive constants in the definition of $S$ for each simple system can be chosen in such a way that $X \prec Y$ implies $S(X) \leq S(Y)$ for any two comparable states $X$ and $Y$ (not necessarily of the same system).

The strong form of the entropy principle consists in the statement that all states in connected spaces are comparable, and that adiabatic accessibility is completely characterized by entropy, in the sense that $X \prec Y$ if and only if $S(X) \leq S(Y)$. This result depends on a particular assumption (Eq. (6.34) of the reference) that the authors do not axiomatize, but that is subject to experimental verification.

Lieb, Elliott H., and Jakob Yngvason. 1999. “The Physics and Mathematics of the Second Law of Thermodynamics.” *Physics Reports* 310 (1): 1–96. https://doi.org/10.1016/S0370-1573(98)00082-9.

In particular, the states $X$ and $Y$ are always comparable if $\sum_i \lambda_i = \sum_j \lambda'_j$. This does not need to be the case if the two sums differ.↩︎

If $\lambda < 0$, we can interpret $\left( (1 - \lambda) X_0, \lambda X_1 \right) \prec X$ as $(1 - \lambda) X_0 \prec \left( X, (-\lambda) X_1 \right)$.↩︎

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {“{The} {Physics} and {Mathematics} of the {Second} {Law} of
{Thermodynamics}” by {E.H.} {Lieb} and {J.} {Yngvason}},
date = {2024-06-25},
url = {https://vgherard.github.io/posts/2024-06-25-the-physics-and-mathematics-of-the-second-law-of-thermodynamics-by-eh-lieb-and-j-yngvason/lieb-yngvason-second-law.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “‘The Physics and Mathematics of
the Second Law of Thermodynamics’ by E.H. Lieb and J.
Yngvason.” June 25, 2024. https://vgherard.github.io/posts/2024-06-25-the-physics-and-mathematics-of-the-second-law-of-thermodynamics-by-eh-lieb-and-j-yngvason/lieb-yngvason-second-law.html.

Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two pieces $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of $X$ are observed is data splitting, but Rasines and Young (2022) offers an alternative approach that uses additive Gaussian noise — this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this article, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on several prototypical applications, such as post-selection inference for trend filtering and other regression problems, and effect size estimation after interactive multiple testing. Supplementary materials for this article are available online.

The paper offers a clear review and systematization of older work, most prominently (Rasines and Young 2022) (cited in the abstract), with some useful generalizations.

The idea is cool, but I find the applications to practical regression cases given in the paper somewhat… impractical. For usual linear regression with a continuous response, the applicability of the method relies on (1) noise being homoskedastic and Gaussian, (2) the existence of a consistent estimator of the noise variance, and (3) samples being large enough (guarantees are only asymptotic). On the other hand, in the theoretically simpler case of logistic regression, there’s a technical complication in that, under the usual GLM assumption for the conditional mean, the relevant log-likelihood for maximization in the inferential stage is not a concave function of the regression coefficients, possibly hindering optimization. If I got it right, the authors suggest ignoring the conditional dependence of the inference-stage response on the selection-stage response to circumvent these complications (see Appendix E.4), which I honestly don’t understand.

A case in which planets align and results have a nice analytic form is that of Poisson regression, for which I will sketch the idea in some detail. Suppose that we are given data $(X_i, Y_i)$ independently drawn from a joint distribution, and we assume $Y_i \vert X_i \sim \text{Poisson}(\mu(X_i))$ for some unknown function $\mu$ we would like to model. The key observation is (*cf.* Appendix A of the reference) that if $Y \sim \text{Poisson}(\mu)$ and $Z \vert Y \sim \text{Binomial}(Y, p)$, then $Z \sim \text{Poisson}(p \mu)$ and $Y - Z \sim \text{Poisson}((1 - p)\mu)$, with $Z$ and $Y - Z$ unconditionally independent. Hence, if we randomly draw $Z_i \vert Y_i \sim \text{Binomial}(Y_i, p)$, and set $W_i = Y_i - Z_i$, the two datasets $(X_i, Z_i)$ and $(X_i, W_i)$ are conditionally independent given the observed covariates $X_i$. This allows one to decouple different aspects of modeling, such as model selection and inference, avoiding the usual biases associated with the intrinsic randomness of the selection step.
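The Poisson thinning fact underlying this construction is easy to verify by simulation; the values of the rate and of the thinning probability below are arbitrary test choices.

```python
# Monte Carlo check of Poisson thinning: if Y ~ Poisson(mu) and
# Z | Y ~ Binomial(Y, p), then Z ~ Poisson(p * mu) and Y - Z ~
# Poisson((1 - p) * mu), with Z and Y - Z independent.
import numpy as np

rng = np.random.default_rng(42)
mu, p, n = 5.0, 0.3, 200_000

Y = rng.poisson(mu, size=n)
Z = rng.binomial(Y, p)   # thinning: keep each unit count with probability p
W = Y - Z

assert abs(Z.mean() - p * mu) < 0.05          # E[Z] = p * mu
assert abs(W.mean() - (1 - p) * mu) < 0.05    # E[W] = (1 - p) * mu
assert abs(Z.var() - p * mu) < 0.1            # Poisson: variance = mean
assert abs(np.corrcoef(Z, W)[0, 1]) < 0.02    # Z and W are uncorrelated
```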

The authors focus on regression with fixed covariates, because in that setting the simpler option of data-splitting is less motivated, calling for alternatives. However, the method can be applied equally well to deal with selective inference in random covariates settings, since it leads - at least in principle - to inferences which are valid conditionally on the observed covariates and (in the general case) the randomized responses of the selection stage.

Leiner, James, Larry Wasserman, Boyan Duan, and Aaditya Ramdas. 2023. “Data Fission: Splitting a Single Data Point.” *Journal of the American Statistical Association* 0 (0): 1–12. https://doi.org/10.1080/01621459.2023.2270748.

Rasines, D García, and G A Young. 2022. “Splitting strategies for post-selection inference.” *Biometrika* 110 (3): 597–614. https://doi.org/10.1093/biomet/asac070.

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {“{Data} {Fission:} {Splitting} a {Single} {Data} {Point}” by
{Leiner} Et Al.},
date = {2024-06-03},
url = {https://vgherard.github.io/posts/2024-06-03-data-fission-splitting-a-single-data-point-by-leiner-et-al/data-fission-splitting-a-single-data-point-by-leiner-et-al.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “‘Data Fission: Splitting a Single
Data Point’ by Leiner Et Al.” June 3, 2024. https://vgherard.github.io/posts/2024-06-03-data-fission-splitting-a-single-data-point-by-leiner-et-al/data-fission-splitting-a-single-data-point-by-leiner-et-al.html.

**Kelvin’s Postulate.** It is impossible to construct an engine that, operating in a cycle, will produce no effect other than the extraction of heat from a reservoir and the performance of an equivalent amount of work.

**Clausius’ Postulate.** It is impossible to construct a refrigerator that, operating in a cycle, will produce no effect other than the transfer of heat from a lower-temperature reservoir to a higher-temperature reservoir.

Either formulation is equivalent to the other and leads to the fundamental *Clausius’ theorem*. This asserts the existence of a universal state function $T$, the *absolute temperature*, defined for any thermodynamic system, that satisfies the *Clausius inequality*. Concretely, if a system undergoes a cyclic process, during which it absorbs quantities of energy $Q_1, \dots, Q_N$ in the form of heat from reservoirs at absolute temperatures $T_1, \dots, T_N$, the inequality:

$$\sum_{i = 1}^N \frac{Q_i}{T_i} \leq 0 \tag{1}$$

always holds.

The derivation of Equation 1 from Kelvin’s and Clausius’ postulates, a clever argument that employs ideal Carnot engines, is standard textbook material; see for example (Fermi 1956). On the other hand, I’ve never seen the converse being stressed, that is, that Clausius theorem allows one to recover versions of Kelvin’s and Clausius’ postulates. Here are two (fairly obvious) arguments in this direction.

Consider a cyclic process of a thermodynamic system during which a quantity of heat $Q$ is absorbed from a single reservoir at constant temperature $T$. Equation 1 applied to this special process implies:

$$\frac{Q}{T} \leq 0. \tag{2}$$

The fact that $Q \leq 0$ means that the heat reservoir can only absorb energy during a cycle, which must be supplied by performing a positive work on the system. This is the content of Kelvin’s postulate.

Similarly, if the system performs a cycle exchanging amounts of heat $Q_1$ and $Q_2$ with two heat sources at temperatures $T_1$ and $T_2$ respectively, Equation 1 implies:

$$\frac{Q_1}{T_1} + \frac{Q_2}{T_2} \leq 0. \tag{3}$$

But $Q_1 + Q_2 = -W$, where $W$ is the external work performed on the system during a cycle. Hence:

$$W \geq Q_2\, \frac{T_1 - T_2}{T_2}. \tag{4}$$

Therefore, $Q_2 > 0$ with $T_1 > T_2$ requires $W > 0$. In other words, in order to perform a cycle in which a positive amount of heat is transferred from a low-temperature reservoir to a high-temperature one, we must necessarily perform some positive work^{2}. This is the content of Clausius’ postulate.

A subtle point that may require some elucidation is that, in the usual logical exposition of Thermodynamics, the temperature to which Kelvin’s and Clausius’ postulates make reference is the *empirical* temperature, call it $\theta$. This is the “quantity measured by a thermometer” (Fermi 1956), and is logically distinct from the absolute temperature $T$, whose existence is a consequence of the second law. What we actually proved here are versions of Kelvin’s and Clausius’ postulates *formulated in terms of the absolute temperature* $T$.

Now, if we take Kelvin’s or Clausius’ postulate (formulated in terms of $\theta$) as our logical starting point, we can actually prove that $T$ is an increasing function of $\theta$, in which case there is no point in specifying which temperature the postulates refer to. However, if our starting point is Clausius’ Theorem, there is no *a priori* logical reason for a relation between $T$ and $\theta$, which should be considered as an additional assumption.

Even though this goes a bit beyond the original scope of the post, I’d like to show here how Equation 1 leads to the existence of another state function, the *entropy* $S$, which satisfies a generalized version of Equation 1, namely:

$$S(B) - S(A) \geq \sum_{i = 1}^N \frac{Q_i}{T_i}, \tag{5}$$

where quantities have the same meaning as in Equation 1, but the process, which takes the system from state $A$ to state $B$, is not necessarily cyclic. One can additionally show that the differential of $S$ is given by:

$$\mathrm{d}S = \frac{\delta Q_{\text{rev}}}{T}, \tag{6}$$

where $\delta Q_{\text{rev}}$ is the differential heat absorbed by the system in a reversible process, and $T$ is the system’s temperature.

We start by observing that, for a reversible process, equality must hold in Equation 1. This is so because, for a reversible cycle, the inverse cycle, in which the system absorbs amounts of heat $-Q_i$ at temperatures $T_i$, must also be possible. Altogether, the Clausius inequalities for the direct and inverse cycles thus imply:

$$\sum_{i = 1}^N \frac{Q_i}{T_i} = 0 \qquad \text{(reversible cycle)}. \tag{7}$$

Imagining an ideal cyclic process, in which the system exchanges infinitesimal amounts of heat $\delta Q$ with a continuous distribution of sources at temperatures $T$, we should replace the sum in Equation 7 with an integral:

$$\oint \frac{\delta Q}{T} = 0 \qquad \text{(reversible cycle)}. \tag{8}$$

We now fix a reference state $O$ of our system, and define for any other state $A$:

$$S(A) = \int_O^A \frac{\delta Q}{T}, \tag{9}$$

where the integral is taken along *any* reversible path that connects $O$ and $A$, and $\delta Q$ is the amount of heat exchanged at temperature $T$ along this representative process. The fact that the integral in Equation 9 depends only upon the states $O$ and $A$ is guaranteed by Equation 8.

By construction, we see that Equation 6 must hold with $T$ being the temperature of a source that, if placed in thermal contact with the system, can produce a reversible exchange of heat. It remains to be shown that this temperature is nothing but the temperature of the system itself. Consider a reversible process in which two systems at temperatures $T_1$ and $T_2$ exchange an (infinitesimal) amount of heat. From what we have just said:

$$\mathrm{d}S_1 = \frac{\delta Q_1}{T_1}, \qquad \mathrm{d}S_2 = \frac{\delta Q_2}{T_2}, \tag{10}$$

where $\delta Q_i$ is the heat absorbed by system $i$, and $\mathrm{d}S_i$ is its corresponding entropy change. However, since the composite system is thermally insulated, we must have $\delta Q_2 = -\delta Q_1$ and $\mathrm{d}S_1 + \mathrm{d}S_2 \geq 0$^{3}. Equation 10 then implies that, if the process is reversible, we must necessarily have $T_1 = T_2$. This completes the proof of Equation 6.

Dittman, Richard H., and Mark W. Zemansky. 2021. *Heat and Thermodynamics*. 7th ed.

Fermi, E. 1956. *Thermodynamics*. Dover Books in Physics and Mathematical Physics. Dover Publications. https://books.google.es/books?id=VEZ1ljsT3IwC.

We may compare these with the corresponding formulations given in Enrico Fermi’s famous book (Fermi 1956). For instance, Kelvin’s postulate reads: *“A transformation whose only final result is to transform into work heat extracted from a source which is at the same temperature throughout is impossible.”* Even though I’m a big fan of Fermi’s book, I find the more modern formulations given in (Dittman and Zemansky 2021) clearer.↩︎

In fact, Equation 4 tells us a bit more than Clausius’ postulate, since it gives the maximum theoretical efficiency of a refrigerator operating between temperatures $T_2 < T_1$:

$$\frac{Q_2}{W} \leq \frac{T_2}{T_1 - T_2}.$$↩︎

The additivity of entropy is a consequence of the additivity of heat, which in turn would require a dedicated discussion. Such a requirement boils down to the additivity of external work, which holds generally if the interaction energies of the systems being composed are negligible. This is always assumed (more or less explicitly) whenever discussing the interaction of a system with a heat reservoir.↩︎

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Statements of the {Second} {Law} of {Thermodynamics}},
date = {2024-06-01},
url = {https://vgherard.github.io/posts/2024-06-01-statements-of-the-second-law-of-thermodynamics/statements-of-the-second-law-of-thermodynamics.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “Statements of the Second Law of
Thermodynamics.” June 1, 2024. https://vgherard.github.io/posts/2024-06-01-statements-of-the-second-law-of-thermodynamics/statements-of-the-second-law-of-thermodynamics.html.

From the paper’s abstract:

The songwriting duo of John Lennon and Paul McCartney, the two founding members of the Beatles, composed some of the most popular and memorable songs of the last century. Despite having authored songs under the joint credit agreement of Lennon-McCartney, it is well-documented that most of their songs or portions of songs were primarily written by exactly one of the two. Furthermore, the authorship of some Lennon-McCartney songs is in dispute, with the recollections of authorship based on previous interviews with Lennon and McCartney in conflict. For Lennon-McCartney songs of known and unknown authorship written and recorded over the period 1962-66, we extracted musical features from each song or song portion. These features consist of the occurrence of melodic notes, chords, melodic note pairs, chord change pairs, and four-note melody contours. We developed a prediction model based on variable screening followed by logistic regression with elastic net regularization. Out-of-sample classification accuracy for songs with known authorship was 76%, with a c-statistic from an ROC analysis of 83.7%. We applied our model to the prediction of songs and song portions with unknown or disputed authorship.

The modeling approach looks very sound to me, and appropriate to the small sample size available (, the statistical unit corresponding to a song of known authorship). Effective model selection and testing are achieved through three nested layers of cross-validation (😱): one for elastic net hyperparameter tuning, one for feature screening, and finally one for estimating the prediction error.

The discussion of feature importance is insightful, in that it identifies concrete aspects of McCartney’s compositions that make them distinguishable from Lennon’s. This type of interpretability is a big plus for authorship analysis. The general qualitative conclusion, that McCartney’s music tended to exhibit more complex and unusual patterns, kinda resonates with my perception of the Beatles’ songs.

Armed with the trained logistic regression model, together with a valid accuracy estimate (76%), the authors set out to apply their model to authorship prediction for controversial cases within the Beatles’ corpus (outside of the training sample). I don’t fully understand the authors’ approach in this part of the paper, and some points appear questionable, for the reasons I explain below.

One of the advantages of fitting a full probability model, such as logistic regression, rather than a conceptually simpler pure classification model (like a tree, for example), is that the output of the former is not a mere class (McCartney or Lennon), but rather a *probability* of belonging to that class. This allows one to make much more informative statements in the analysis of new cases, since the strength of evidence provided by the data towards the predicted class can be quantified on a case by case basis. All of this is true, of course, *provided that the fitted model gives a decent approximation to the true data generating process*.

With similar considerations in mind, I suppose, the authors produce probability estimates for each of the disputed cases considered, in the form of a point estimate and a confidence interval to represent uncertainty. I think there is room for improvement here, in two aspects.

My first objection is what I already pointed out above: nothing in the modeling process explained in the paper suggests that the final model provides a good approximation to the true class probability conditional on features. The model has, with reasonable confidence, a predictive performance close to the best achievable within the possibilities considered - quantified by 76% accuracy and 84% AUC - but this says nothing about its correct specification as a probability model. Without a careful specification study, it is impossible to conclude anything about the nature of the true estimation targets of the fitted “probabilities”: they may well have nothing to do with the actual class probabilities the authors are after. There is still value, I believe, in reporting fitted probabilities as qualitative measures of evidence, but these should not be conflated with the true (unknown) class probabilities… at least not without some serious attempt to detect differences between the two.

My second point is a technical one and concerns how the authors construct confidence intervals for fitted probabilities. The construction resembles that of bootstrap percentile confidence intervals but, rather than the usual bootstrap synthetic datasets, the delete-one datasets used in leave-one-out cross-validation are used to obtain replicas of the fitted probabilities. This is nothing but jackknife resampling in disguise, and it is well known that the resampling standard deviation of such jackknife replicas is roughly \(1/\sqrt{n-1}\) times the true standard deviation, see *e.g.* (Tibshirani and Efron 1993). Therefore, I have strong reasons to believe that the reported intervals severely underestimate the uncertainty associated with these probability estimates.
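To see the size of this effect concretely, here is a toy check (mine, not the paper's) with the sample mean as the statistic: the spread of the leave-one-out replicas is smaller than the usual standard error by a factor of roughly \(\sqrt{n-1}\).

```python
import random
import statistics as st

random.seed(0)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]

# Leave-one-out ("jackknife") replicas of the sample mean.
total = sum(x)
loo_means = [(total - xi) / (n - 1) for xi in x]

sd_loo = st.stdev(loo_means)       # spread of the delete-one replicas
se_hat = st.stdev(x) / n ** 0.5    # usual standard error of the mean

# For the mean, loo_mean_i - mean = (mean - x_i) / (n - 1), so the replica
# spread is sd(x) / (n - 1): about sqrt(n - 1) times smaller than se_hat.
ratio = se_hat / sd_loo            # close to (n - 1) / sqrt(n)
```

This is exactly why the jackknife standard error formula inflates the replica deviations by a factor of order \(\sqrt{n-1}\); percentile intervals read directly off the replicas skip that inflation.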

All in all, the attempt to go beyond reporting simple classes - backed up by an overall 76% accuracy estimate - is well-motivated in principle, but the final outcome is not very dependable.

As usual, I’m more eloquent when criticizing than when praising, but let me end on a very positive note. The authors do a *great* favor to the reader, by including a discussion of the informal steps performed prior and in parallel to the formal analysis presented in the paper. This kind of transparency - which is also present in the rest of the discussion - is, I believe, not so common as it should, and is what makes it eventually possible to think critically about someone else’s work.

Glickman, Mark, Jason Brown, and Ryan Song. 2019. “(A) Data in the Life: Authorship Attribution in Lennon-McCartney Songs.” *Harvard Data Science Review* 1 (1).

Tibshirani, Robert J., and Bradley Efron. 1993. *An Introduction to the Bootstrap*. Monographs on Statistics and Applied Probability 57. New York: Chapman & Hall.

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Authorship {Attribution} in {Lennon-McCartney} {Songs}},
date = {2024-05-23},
url = {https://vgherard.github.io/posts/2024-05-23-authorship-attribution-in-lennon-mccartney-songs/authorship-attribution-in-lennon-mccartney-songs.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “Authorship Attribution in
Lennon-McCartney Songs.” May 23, 2024. https://vgherard.github.io/posts/2024-05-23-authorship-attribution-in-lennon-mccartney-songs/authorship-attribution-in-lennon-mccartney-songs.html.

Suppose we wish to compare two hypotheses and , where is simple^{1}. We start collecting data until our sample meets some specific requirement, according to some given *stopping rule* . If this ever occurs, we compute the *Bayes factor*:

and reject if , for some . The theorem is that if is the true data generating process, the above procedure has a false rejection rate lower than , *independently of the stopping rule employed to end sampling*.

Notice that the rejection event is composed by two parts:

- Sampling has stopped at some point during data taking.
- When sampling stopped, held.

We also note that the stopping rule need not be deterministic, although this appears to be implicitly assumed in the original reference. In general, the data collected up to a certain point will only determine the *probability* that sampling stops at that time (and, to reinforce the previous point, the sum of these probabilities will not, in general, add up to ).

In order to prove this theorem, let us set up some notation. Let be some stochastic process representing “data”, where each is a data point. We denote by the probability distribution of under , which is completely defined since is simple. We further denote by the corresponding probability measure on for the set of the first observations .

We first consider the case in which is also simple, and denote by and the corresponding measures. The Bayes factor is defined as the Radon-Nikodym derivative:

(we assume regularity conditions so that such a derivative exists).

Also, we assume for the moment that the stopping rule is deterministic, embodied by binary functions of the first observations, with if sampling can stop at step .

Now fix . A rejection of at sampling step is represented by the event:

which, with abuse of notation, we may identify with a subset of . The overall rejection event (at any sampling step) is given by:

so that our theorem amounts to the bound:

In order to prove this, we first note that:

Hence, since the events and are clearly disjoint for , we have:

which, since , implies Equation 5.
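As a numerical sanity check (mine, not part of the original argument), the following Monte Carlo pits a simple null H0: N(0, 1) against a simple alternative H1: N(1, 1), with the most aggressive stopping rule available: stop, and reject, as soon as the Bayes factor exceeds the threshold k.

```python
import math
import random

random.seed(42)

k = 20.0          # rejection threshold for the Bayes factor
n_sim = 4000      # simulated experiments, all run under H0
max_steps = 200   # finite sampling horizon

rejections = 0
for _ in range(n_sim):
    log_bf = 0.0
    for _ in range(max_steps):
        x = random.gauss(0.0, 1.0)   # data generated under H0
        log_bf += x - 0.5            # log-likelihood ratio of N(1,1) vs N(0,1)
        if log_bf >= math.log(k):    # stop as soon as rejection is possible
            rejections += 1
            break

rate = rejections / n_sim            # false rejection rate; bounded by 1 / k
```

Despite the adversarial stopping rule, the empirical false rejection rate stays below 1/k = 0.05, as the theorem guarantees.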

We may relax the assumption that the alternative hypothesis is simple, by considering a parametric family of measures , where the parameter has some prior probability . The argument given above still applies to this case, if is replaced by the mixture (under appropriate regularity assumptions). In the notation of Equation 1, the denominator .

Finally, in order to lift the assumption that our stopping rule is deterministic, let us first consider the following special (deterministic) stopping rule:

In other words, we stop sampling whenever the sample would reject according to . The rejection event for this special stopping rule is simply:

Since we already proved the theorem for any deterministic stopping rule, Equation 5 implies:

But Equation 10 clearly implies the theorem for any stopping rule, deterministic or not, since in general:

(we need to hold for some in order to reject ).

Interestingly, the argument just given leads to a more accurate statement of our main result Equation 5:

where the leftmost quantity is the false rejection rate of a selective testing procedure, such as the one we have been considering so far, whereas the central quantity is the false rejection rate of a *simultaneous* testing procedure (that checks whether at each step of sampling). What’s happening here is analogous to a phenomenon observed in the context of parameter estimation following model selection (Berk et al. 2013), where one can show that, in order to guarantee marginal coverage for the selected parameters, if the selection rule is allowed to be completely arbitrary, one must actually require *simultaneous* coverage for all possible parameters.

To conclude the post, let us remark that Equation 5 was originally formulated in terms of the posterior probability of :

where and are the prior probabilities of the two competing models and , respectively. We may use , rather than , as the relevant criterion for rejecting . From the pure frequentist point of view, this doesn’t add anything to our formulation in terms of the Bayes ratio, as is equivalent to as long as . In particular, the bound analogous to Equation 5 reads:

Berk, Richard, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. 2013. “Valid Post-Selection Inference.” *The Annals of Statistics*, 802–37.

Kerridge, D. 1963. “Bounds for the Frequency of Misleading Bayes Inferences.” *The Annals of Mathematical Statistics* 34 (3): 1109–10.

This is a technical term, meaning that completely characterizes the probability distribution of data. An example of a non-simple hypothesis would be a parametric model depending on some unknown parameter .↩︎

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Frequentist Bounds for {Bayesian} Sequential Hypothesis
Testing},
date = {2024-05-22},
url = {https://vgherard.github.io/posts/2024-05-22-frequentist-bounds-for-bayesian-sequential-hypothesis-testing/frequentist-bounds-for-bayesian-sequential-hypothesis-testing.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “Frequentist Bounds for Bayesian
Sequential Hypothesis Testing.” May 22, 2024. https://vgherard.github.io/posts/2024-05-22-frequentist-bounds-for-bayesian-sequential-hypothesis-testing/frequentist-bounds-for-bayesian-sequential-hypothesis-testing.html.

Consider the AIC for the usual linear model :

where is the dimension of the covariate vector and is the ML estimate of the conditional variance. The expectation of Equation 1 under model assumptions can be found by using the fact that, for a random variable with degrees of freedom^{1}:

where:

and the second equality results from the Stirling approximation . We obtain:

where, according to standard assumptions, is assumed to be constant in .
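The chi-squared fact invoked above is presumably the following (my reconstruction, with \(X \sim \chi^2_k\) and \(\psi\) the digamma function):

\[
\mathbb{E}\left[\log\frac{X}{k}\right]
= \psi\!\left(\frac{k}{2}\right) + \log\frac{2}{k}
= -\frac{1}{k} + O\!\left(k^{-2}\right),
\]

where the expansion uses \(\psi(x) = \log x - \frac{1}{2x} + O(x^{-2})\), the asymptotic counterpart of the Stirling approximation mentioned in the text.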

Now consider two such models, with different covariate vectors and , of dimension and respectively, both assumed to be well specified. Denote, as before:

for . Equation 4 gives the unconditional expectation of for both models^{2}, so that:

Assuming, without loss of generality, that , we have:

To gain some intuition, suppose that the set of variables contained in is a subset of those contained in , so that the two corresponding models are nested. Equation 7 tells us that, for below a certain threshold, AIC will prefer the more “parsimonious” model involving only. In particular, if , we can make a first-order approximation in the RHS of Equation 7, that yields:

In parallel to AIC, we can consider the exact “information criterion” provided by the model in-sample cross-entropy under the true data generating process. For a single linear model, the in-sample cross-entropy is:

(“in-sample” refers to the fact that we fix, *i.e.* condition, on the covariate vector of the training sample, .) The conditional expectation of , again under model assumptions, can be computed by noticing two facts:

- The numerator and denominator are conditionally independent variables with and degrees of freedom respectively. This can be seen by rewriting these as , and , respectively, where as usual.
- For a random variable with degrees of freedom we have .
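For reference, the two normal-theory facts just listed can be spelled out as follows (my reconstruction in standard notation, with \(p\) covariates and \(\hat\sigma^2\) the ML variance estimate):

\[
\frac{n\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-p}
\quad\text{independently of } \hat\beta,
\qquad
\mathbb{E}\left[\frac{1}{X}\right] = \frac{1}{k-2}
\quad\text{for } X \sim \chi^2_k,\; k > 2.
\]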

Using these results, we can show that:

(an equation which is true by design of AIC).

Before rushing to the (wrong) conclusion that will correspondingly estimate a difference of expected cross-entropies, let us notice that the relevant in-sample cross-entropy to be considered for model evaluation is Equation 9 with *corresponding to the full covariate vector*: this is the target we should try to estimate (at least to the extent that our goal is predicting given ). For this reason, strictly speaking, Equation 10 is exact only if our model is well specified as a model of . Otherwise, in order to estimate consistently , we should use Takeuchi’s Information Criterion (TIC) rather than AIC.

A bit more pragmatically, in the real world we could assume the remainder of Equation 10 to be (rather than ), but generally small with respect to the leading order AIC correction (). This will be the case if the models being compared are approximately well specified.
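Whatever the exact form of the elided threshold, its practical content can be checked numerically. Under the reconstruction \(\mathbb{E}[\Delta \mathrm{AIC}] = -\tfrac12\log(1+m^2) + \tfrac{1}{2n}\) (an assumption of mine, consistent with the `expec` column computed in the simulation code below), the expected AIC difference changes sign at \(n^\ast = 1/\log(1+m^2) \approx 1/m^2\):

```python
import math

m = 0.1  # slope of the data generating process used in the simulations below

def expected_delta_aic(n):
    # Reconstructed expected AIC difference (per observation) between the
    # slope model and the intercept-only model; see the hedge in the lead-in.
    return -0.5 * math.log(1 + m ** 2) + 0.5 / n

# Crossover sample size: AIC prefers the slope model on average for n > n_star.
n_star = 1 / math.log(1 + m ** 2)   # approximately 1 / m^2 for small m
```

For m = 0.1 this gives a crossover around n = 100, matching the dotted vertical line at 1/m² drawn in the first plot below.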

We take the data generating process to be:

with:

```
library(tibble)  # rxy() below returns a tibble

m <- 0.1
q <- 0

rxy <- function(n) {
  tibble(
    x = rnorm(n, sd = 1),
    y = m * x + q + rnorm(n, sd = 1)
  )
}
```

We compare the model with *vs.* without slope term ( *vs.* ), which we will denote by suffixes and , respectively. The functions below compute AIC and in-sample cross-entropy from the corresponding `lm` objects. We also define a “Naive Information Criterion” .

```
nic <- function(fit) {
  p <- length(coef(fit))
  n <- nobs(fit)
  sigma_hat <- sigma(fit) * sqrt((n - p) / n)
  log(sigma_hat)
}

aic <- function(fit) {
  p <- length(coef(fit))
  n <- nobs(fit)
  sigma_hat <- sigma(fit) * sqrt((n - p) / n)
  log(sigma_hat) + (p + 1) / n + 0.5 * (1 + log(2 * pi))
}

ce <- function(fit, data) {
  p <- length(coef(fit))
  n <- nobs(fit)
  sigma_hat <- sigma(fit) * sqrt((n - p) / n)
  y_hat <- fitted(fit)
  mu <- data$x * m + q
  res <- 0
  res <- res + 0.5 / (sigma_hat^2)
  res <- res + log(sigma_hat)
  res <- res + mean(0.5 * (y_hat - mu)^2 / (sigma_hat^2))
  res <- res + 0.5 * log(2 * pi)
  return(res)
}
```

From our results above, we expect:

The expected in-sample cross-entropies cannot be computed explicitly, but for relatively small we expect (*cf.* Equation 10):

I will use tidyverse for plotting results.

```
library(dplyr)
library(ggplot2)
```

In order to make results reproducible let’s:

`set.seed(840)`

We simulate fitting models and at different sample sizes from the data generating process described above.

```
fits <- tidyr::expand_grid(
  n = 10 ^ seq(from = 1, to = 3, by = 0.5), b = 1:1e3
) |>
  mutate(data = lapply(n, rxy)) |>
  group_by(n, b, data) |>
  tidyr::expand(model = c(y ~ 1, y ~ x)) |>
  ungroup() |>
  mutate(
    fit = lapply(row_number(), \(i) lm(model[[i]], data = data[[i]])),
    ce = sapply(row_number(), \(i) ce(fit[[i]], data[[i]])),
    aic = sapply(fit, aic),
    nic = sapply(fit, nic),
    model = format(model)
  ) |>
  select(-c(fit, data))
```

The plots below show the dependence on sample size of and , as well as the AIC selection frequencies. Notice that for , even though , the selection frequency of the “complex” model is still below . This is because the distribution of is asymmetric, as seen in the second plot, and .

```
fits |>
  mutate(
    is_baseline = model == "y ~ 1",
    delta_ce = ce - ce[is_baseline],
    delta_aic = aic - aic[is_baseline],
    delta_nic = nic - nic[is_baseline],
    .by = c(n, b),
  ) |>
  filter(!is_baseline) |>
  summarise(
    `E( ΔCE )` = mean(delta_ce),
    `E( ΔAIC )` = mean(delta_aic),
    `E( ΔNIC )` = mean(delta_nic),
    .by = n
  ) |>
  tidyr::pivot_longer(
    -n, names_to = "metric", values_to = "value"
  ) |>
  ggplot(aes(x = n, y = value, color = metric)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(aes(xintercept = 1 / m^2), linetype = "dotted") +
  scale_x_log10("Sample Size") +
  coord_cartesian(ylim = c(-0.025, 0.025)) +
  ylab(expression(IC)) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ggtitle("AIC vs. in-sample cross-entropy", "Expected values") +
  NULL
```

```
fits |>
  filter(aic == min(aic), .by = c(n, b)) |>
  summarise(count = n(), .by = c(n, model)) |>
  ggplot(aes(fill = model, x = n, y = count)) +
  geom_col() +
  scale_x_log10("Sample Size") +
  ylab("Count") +
  theme(legend.position = "bottom") +
  ggtitle("AIC model selection frequencies")
```

```
fits |>
  filter(n %in% c(10, 100, 1000)) |>
  mutate(delta_aic = aic - aic[model == "y ~ 1"], .by = c(n, b)) |>
  filter(model != "y ~ 1") |>
  mutate(expec = -0.5 * log(1 + m^2) + 0.5 / n) |>
  ggplot(aes(x = delta_aic, color = as.factor(n))) +
  geom_density() +
  coord_cartesian(xlim = c(-0.1, NA)) +
  labs(x = "ΔAIC", y = "Density", color = "Sample Size") +
  ggtitle("ΔAIC probability density")
```

Finally, here is something I cannot explain. The plot below shows the scatterplot of in-sample cross-entropy differences *vs.* the AIC differences. It is well known that AIC only estimates the expectation of these differences, averaged over potential training samples. One may ask whether AIC has anything to say about the actual cross-entropy difference for the estimated models, conditional on the realized training sample.

Assuming I have made no errors here, the tilted-U shape of this scatterplot is a clear negative answer. What’s especially interesting is that, apparently, these differences are negatively correlated. I fail to see where the negative correlation and the U-shape come from.

```
fits |>
  filter(n == 100) |>
  mutate(
    is_baseline = model == "y ~ 1",
    delta_ce = ce - ce[is_baseline],
    delta_aic = aic - aic[is_baseline],
    .by = c(n, b),
  ) |>
  filter(!is_baseline) |>
  ggplot(aes(x = delta_aic, y = delta_ce)) +
  geom_point(size = 1, alpha = 0.2) +
  lims(x = c(-0.02, 0.01), y = c(-0.01, 0.03)) +
  labs(x = "ΔAIC", y = "ΔCE") +
  ggtitle("AIC vs. in-sample cross-entropy", "Point values for N = 100") +
  NULL
```

See *e.g.* arXiv:1503.06266.↩︎

The same equation actually gives the expectation of conditional on the in-sample covariate vector . Since this conditioning differs for the two different models involving and , in our comparison of expected values we must interpret these as unconditional expectations, in general.↩︎

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {AIC in the Well-Specified Linear Model: Theory and
Simulation},
date = {2024-05-17},
url = {https://vgherard.github.io/posts/2024-05-09-aic-in-the-well-specified-linear-model-theory-and-simulation/aic-in-the-well-specified-linear-model-theory-and-simulation.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “AIC in the Well-Specified Linear Model:
Theory and Simulation.” May 17, 2024. https://vgherard.github.io/posts/2024-05-09-aic-in-the-well-specified-linear-model-theory-and-simulation/aic-in-the-well-specified-linear-model-theory-and-simulation.html.

The main qualitative idea put forward by Ref. (Nini et al. 2024) is that *grammar is a fundamentally personal and unique trait of an individual*, therefore providing a sort of “behavioural biometric”. A first goal of this work was to put this general principle to the test, by applying it to the problem of Authorship Verification (AV): the process of validating whether a certain document was written by a claimed author. Concretely, we built an algorithm for AV that relies entirely on the grammatical features of the examined textual data, and compared it with state-of-the-art methods for AV.

The results were very encouraging. In fact, our method turned out to be generally superior to the previous state of the art on the benchmarks we examined. This is a notable result, especially taking into account that our method uses *less* textual information (only the grammar part) than other methods to perform its inferences.

I sketch here a pseudo-implementation of our method in R. For the fit of -gram models and perplexity computations, I use my package `{kgrams}`, which can be installed from CRAN. Model (hyper)parameters such as the number of impostors, the order of the -gram models, etc. are hardcoded; see (Nini et al. 2024) for details.

This is just for illustrating the essence of the method. For practical reasons, in the code chunk below I’m not reproducing the definition of the function `extract_grammar()`, which in our work is embodied by the POS-noise algorithm. This function should transform a regular sentence, such as “He wrote a sentence”, into its underlying grammatical structure, say “[Pronoun] [verb] a [noun]”.

```
#' @param q_doc character. Text document whose authorship is questioned.
#' @param auth_corpus character. Text corpus of claimed author.
#' @param imp_corpus character. Text corpus of impostors.
#' @param n_imp a positive number. Number of "impostor" simulations.
score <- function(q_doc, auth_corpus, imp_corpus, n_imp = 100)
{
  q_doc <- extract_grammar(q_doc)
  auth_corpus <- extract_grammar(auth_corpus)
  imp_corpus <- extract_grammar(imp_corpus)

  # Compute perplexity based on claimed author's language model.
  auth_mod <- train_language_model(auth_corpus)
  auth_perp <- kgrams::perplexity(q_doc, model = auth_mod)

  # Compute perplexity based on impostor language models.
  #
  # Each impostor is trained on a synthetic corpus obtained by sampling from
  # the impostor corpus the same number of sentences as the corpus of the
  # claimed author.
  n_sents_auth <- length(kgrams::tknz_sent(auth_corpus))
  imp_corpus_sentences <- kgrams::tknz_sent(imp_corpus)
  imp_mod <- replicate(n_imp, {
    sample(imp_corpus_sentences, n_sents_auth) |> train_language_model()
  })
  imp_perp <- sapply(imp_mod, \(m) kgrams::perplexity(q_doc, model = m))

  # Score is the fraction of impostor models that perform worse (higher
  # perplexity) than the claimed author's language model.
  score <- mean(auth_perp < imp_perp)
  return(score)
}

train_language_model <- function(text)
{
  text |>
    kgrams::kgram_freqs(N = 10, .tknz_sent = kgrams::tknz_sent) |>
    kgrams::language_model(smoother = "kn", D = 0.75)
}

extract_grammar <- identity # Just a placeholder - see above.
```

To be used as follows:

```
q_doc <- "a a b a. b a. c b a. b a b. a."
auth_corpus <- "a a b a b. b c b. a b c a. b a. b c a."
imp_corpus <- "a a. b. a. b a. b a. c. a b a. d. a b. a d. a b a b c b a."
set.seed(840)
score(q_doc, auth_corpus, imp_corpus)
```

`[1] 0.89`

The “score” computed by this algorithm turns out to be a good truthfulness predictor for the claimed authorship, higher scores being correlated with true attributions. *If* the impostor corpus is fixed once and for all, and *if* the pairs `q_doc` and `auth_corpus` are randomly sampled from a fixed joint distribution, we can set a threshold for the score in such a way that the attribution criterion `score > threshold` maximizes some objective such as accuracy. This is, more or less, what we studied quantitatively in our paper.
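The thresholding step described here is generic; a minimal sketch (function name and toy data are hypothetical, not from the paper):

```python
def best_threshold(scores, labels):
    """Pick the cutoff maximizing the accuracy of the rule `score > threshold`."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = sum((s > t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Toy calibration data: verification scores, with 1 marking a true attribution.
scores = [0.1, 0.4, 0.55, 0.8, 0.9, 0.6]
labels = [0,   0,   1,    1,   1,   1]
t, acc = best_threshold(scores, labels)
```

In practice the threshold must be chosen on held-out calibration data; tuning it on the evaluation set would leak information into the accuracy estimate.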


BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Grammar as a Biometric for {Authorship} {Verification}},
date = {2024-04-25},
url = {https://vgherard.github.io/posts/2024-04-25-grammar-as-a-biometric-for-authorship-verification/grammar-as-a-biometric-for-authorship-verification.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “Grammar as a Biometric for Authorship
Verification.” April 25, 2024. https://vgherard.github.io/posts/2024-04-25-grammar-as-a-biometric-for-authorship-verification/grammar-as-a-biometric-for-authorship-verification.html.

The classical or frequentist approach to statistics (in which inference is centered on significance testing), is associated with a philosophy in which science is deductive and follows Popper’s doctrine of falsification. In contrast, Bayesian inference is commonly associated with inductive reasoning and the idea that a model can be dethroned by a competing model but can never be directly falsified by a significance test. The purpose of this article is to break these associations, which I think are incorrect and have been detrimental to statistical practice, in that they have steered falsificationists away from the very useful tools of Bayesian inference and have discouraged Bayesians from checking the fit of their models. From my experience using and developing Bayesian methods in social and environmental science, I have found model checking and falsification to be central in the modeling process.

Comments:

I know next to nothing about applied Bayesian analysis, but I’m a bit surprised that the recommendation to check a model’s fit requires a whole paper in the 21st century. What is the supposed argument for why Bayesians should not worry about model fit?

I’m a bit confused about how one would actually interpret the posterior model checks discussed in the paper. If I understand correctly, the -value is the posterior probability of observing a statistic as extreme as in the original data. Should I interpret this as a measure of the strength of evidence against the model - similar to Fisherian significance testing? What is the philosophical basis for rejecting models with small -values? I guess these questions are answered in the technical references by the same author.
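For what it’s worth, here is a toy version of such a check as I understand it (my sketch, not Gelman’s code): data modeled as \(N(\mu, 1)\) with a flat prior on \(\mu\), discrepancy statistic \(T(y) = \max_i y_i\), and the p-value estimated by simulating replicate datasets from the posterior predictive distribution.

```python
import random

random.seed(1)

# Toy data, modeled as N(mu, 1); with a flat prior, mu | y ~ N(mean(y), 1/n).
y = [1.8, 0.3, 2.1, -0.4, 1.2, 0.9, 2.6, 0.1]
n = len(y)
ybar = sum(y) / n
T_obs = max(y)                      # discrepancy statistic

n_rep = 2000
extreme = 0
for _ in range(n_rep):
    mu = random.gauss(ybar, 1 / n ** 0.5)              # posterior draw
    y_rep = [random.gauss(mu, 1.0) for _ in range(n)]  # replicate dataset
    if max(y_rep) >= T_obs:
        extreme += 1

ppp = extreme / n_rep   # posterior predictive p-value; tiny values flag misfit
```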

Gelman, Andrew. 2011. “Induction and Deduction in Bayesian Data Analysis.” *Rationality, Markets and Morals* 2: 67–78.

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {“{Induction} and {Deduction} in {Bayesian} {Data} {Analysis}”
by {A.} {Gelman}},
date = {2024-04-25},
url = {https://vgherard.github.io/posts/2024-04-25-induction-and-deduction-in-bayesian-data-analysis-by-a-gelman/induction-and-deduction-in-bayesian-data-analysis-by-a-gelman.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “‘Induction and Deduction in
Bayesian Data Analysis’ by A. Gelman.” April 25, 2024. https://vgherard.github.io/posts/2024-04-25-induction-and-deduction-in-bayesian-data-analysis-by-a-gelman/induction-and-deduction-in-bayesian-data-analysis-by-a-gelman.html.

It is well known that statistical power calculations can be valuable in planning an experiment. There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result. Advocates of such post-experiment power calculations claim the calculations should be used to aid in the interpretation of the experimental results. This approach, which appears in various forms, is fundamentally flawed. We document that the problem is extensive and present arguments to demonstrate the flaw in the logic.

The point on observed power is very elementary: for a given hypothesis test at a fixed size , observed power is a function of the observed -value, and thus cannot add any information to that already contained in the latter.

Observed power will typically be small for non-significant () results, and high otherwise. The authors discuss the example of a one-tailed, one-sample -test for the null hypothesis that has mean . The -value and observed power are, respectively:

where is the significance threshold at level . Below significance, one always has , implying low observed power . Hence observed power cannot be used to tell whether a null result comes from a small effect or from a low detection capability.
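The functional dependence is easy to make explicit in the one-sided \(z\)-test case (my sketch, with \(\Phi\) computed from `math.erf`): observed power is a deterministic, decreasing function of the p-value, equal to exactly 1/2 when the result sits at the significance boundary p = α.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

alpha = 0.05
z_alpha = 1.6448536269514722   # upper-alpha quantile of N(0, 1)

def p_value(z_obs):
    return 1.0 - phi(z_obs)

def observed_power(z_obs):
    # Power of the level-alpha test if the true effect equaled the observed one.
    return 1.0 - phi(z_alpha - z_obs)

# A result exactly at the significance boundary has observed power 1/2:
p_boundary = p_value(z_alpha)            # equals alpha
pow_boundary = observed_power(z_alpha)   # equals 0.5
```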

The authors also criticize the notion of “detectable effect size”, but their arguments look less convincing to me in this case. In their example we have an i.i.d. sample , where , and we again test the null . For a fixed power , the detectable effect size is the value of that would yield a type II error rate of exactly , if is taken to be equal to the observed sample standard deviation. Their argument against this construct (Sec. 2.2) seems to rest on the premise that a smaller -value should always be interpreted as stronger evidence against the null hypothesis, even when the -value is not significant and the null hypothesis is accepted. But this is inconsistent, because -values are uniformly distributed under the null hypothesis, so their actual values should be given no meaning once the null is accepted.

In fact, I would argue that the detectable effect size provides a decent heuristic to quantify the uncertainty of a non-significant result on the scale of the parameter of interest. The true point against its usage is that (as also recognized in this work) there is a tool available which is much better suited for the same purpose ^{1}: the *confidence interval*, which has a clear probabilistic characterization in terms of its coverage probability.

Hoenig, John M, and Dennis M Heisey. 2001. “The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis.” *The American Statistician* 55 (1): 19–24.

In the previous example, assuming a large sample size so that we can approximate the distribution with a normal one, the -level lower limit on (appropriate for a one-sided test of the null hypothesis ) is given by: where and are the observed sample mean and standard deviation, respectively. The detectable effect size is the minimum value of such that , that is: ↩︎

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {“{The} {Abuse} of {Power}” by {J.} {M.} {Hoenig} and {D.}
{M.} {Heisey}},
date = {2024-04-18},
url = {https://vgherard.github.io/posts/2024-04-18-the-abuse-of-power-by-j-m-hoenig-and-d-m-heisey/the-abuse-of-power-by-j-m-hoenig-and-d-m-heisey.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “‘The Abuse of Power’ by J.
M. Hoenig and D. M. Heisey.” April 18, 2024. https://vgherard.github.io/posts/2024-04-18-the-abuse-of-power-by-j-m-hoenig-and-d-m-heisey/the-abuse-of-power-by-j-m-hoenig-and-d-m-heisey.html.

$$\mathrm{AIC} = n \log\left(2\pi\sigma^2\right) + \frac{\mathrm{RSS}}{\sigma^2} + 2p$$

if the noise variance $\sigma^2$ is known, and:

$$\mathrm{AIC} = n \log\left(2\pi\hat{\sigma}^2\right) + n + 2(p+1)$$

if $\sigma^2$ is unknown. Here $\mathrm{RSS}$ denotes the residual sum of squares at the maximum-likelihood estimate of the regression coefficients, and $\hat{\sigma}^2 = \mathrm{RSS}/n$ the corresponding estimate of $\sigma^2$ if the latter is unknown; $p$ is the dimension of the covariate vector.

One would expect knowledge of the variance to have little effect on model selection for the mean, at least in a limit in which the variance can be considered reasonably well estimated. In order to check that this is actually the case, we expand AIC differences to first order:

The approximation in the second line requires . Furthermore, the last term in the final expression is a small fraction of if .

Putting these two conditions together, we obtain:

which means that the two criteria lead to the same model selection, provided that the models involved in the AIC comparison estimate the true variance reasonably well.

Concluding remarks:

Although the maximum-likelihood estimates plugged into the AIC are derived from normal theory, the theorem about the equivalence of AIC selection in the known and unknown variance cases continues to hold irrespective of this assumption.

What happens in misspecified cases, in which the estimated variance does not consistently estimate the true one, either because of non-linearity or heteroskedasticity?

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {AIC for the Linear Model: Known Vs. Unknown Variance},
date = {2024-03-13},
url = {https://vgherard.github.io/posts/2024-03-13-aic-for-the-linear-model-known-vs-unknown-variance/akaike-criterion-for-the-gaussian-linear-model-known-vs-unknown-variance.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “AIC for the Linear Model: Known Vs.
Unknown Variance.” March 13, 2024. https://vgherard.github.io/posts/2024-03-13-aic-for-the-linear-model-known-vs-unknown-variance/akaike-criterion-for-the-gaussian-linear-model-known-vs-unknown-variance.html.

Prediction error and Kullback-Leibler distance provide a useful link between least squares and maximum likelihood estimation. This article is a summary of some existing results, with special reference to the deviance function popular in the GLIM literature.

Of particular interest:

- Clarifies the definition of a “saturated” model for i.i.d. samples.
- Highlights the parallels between squared-error and Kullback-Leibler loss. In particular, the conditional expectation is shown to be the optimal regression function for the general KL loss.
- Discusses optimism of the training error as an estimate of the in-sample (fixed predictors) error rate in terms of KL loss, within the context of Generalized Linear Models.

Hastie, Trevor. 1987. “A Closer Look at the Deviance.” *The American Statistician* 41 (1): 16–20. https://doi.org/10.1080/00031305.1987.10475434.

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {“{A} {Closer} {Look} at the {Deviance}” by {T.} {Hastie}},
date = {2024-03-07},
url = {https://vgherard.github.io/posts/2024-03-07-a-closer-look-at-the-deviance-by-t-hastie/a-closer-look-at-the-deviance-by-t-hastie.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “‘A Closer Look at the
Deviance’ by T. Hastie.” March 7, 2024. https://vgherard.github.io/posts/2024-03-07-a-closer-look-at-the-deviance-by-t-hastie/a-closer-look-at-the-deviance-by-t-hastie.html.

Let’s get to the point. Consider a binomial model:

For concreteness, imagine that we are studying survival in a population of animals, and Equation 1 is proposed to model the survival probability of an initial cohort of individuals (from one year to the next, say).

Clearly, we should not expect a single common parameter to represent the survival probability of each individual animal. In fact, there are a lot of factors that are only determined at the level of the individual and that could realistically affect survival: age, sex, weight, *etc.* If we are not including any individual variable in our analysis (because they were not measured, for example), we may model the survival probability of any individual by a probability distribution with the appropriate mean. If we further assume that the survival probabilities of the various individuals are independent and identically distributed (i.i.d.), we can derive explicitly the probability of a given number of survivals out of the initial cohort:

where we have denoted by the set of such that . Notice that the integrand in the first line of Equation 2 is the conditional probability of successes out of trials with probabilities for the individual trials given by .

In other words, assuming that the survival probabilities of individuals are i.i.d. according to , we see that the binomial distribution Equation 1 holds exactly with for the unconditional (on individual level covariates) distribution of survivals. *A fortiori*, no overdispersion with respect to the binomial variance, *i.e.* , is possible under these assumptions.

Let us examine the i.i.d. assumption in a bit more detail. First of all, we observe that “identically distributed” is not the same as “identical”, which would be the case if the distribution were degenerate. Quite the contrary, the purpose of the distribution is exactly to reflect the variability of survival probability in the overall population. On the other hand, assuming all individuals are sampled from the *same* population, the distribution is simply the result of such a sampling scheme, and it doesn’t really make sense to consider different distributions for different individuals. The only case in which we should use different distributions is if our experimental design involved systematically sampling individuals from distinct populations and putting them together into a single cohort (*e.g.* we always start with a fixed number of individuals from each of two populations). Finally, if the analysis included some individual covariate, such as *sex* or *age*, the whole discussion would remain valid, with unconditional survival probabilities replaced by conditional (on *sex* and *age*) probabilities.

The rather strong assumption is, instead, independence. How could independence be violated? Suppose there is some set of variables not included in the analysis which globally affect survival for all individuals - in our example these may include, for instance, things like *food availability* and *meteorological conditions*. Suppose, further, that the survival probabilities are actually i.i.d. *conditional on* these variables, with joint distribution:

Then, unconditionally:

where is the marginal distribution of . Crucially, this is in general not a product measure, and (looking back at our derivation, Equation 2) we see that this dependence can indeed change the form of the resulting distribution - and lead to overdispersion with respect to the binomial expectation, in particular.

I think the discussion above clearly shows that binomial overdispersion is not caused by inhomogeneities in the population, if these are understood as random (patternless) variations at the individual level. Quite the contrary, what can easily make data look non-binomial is the presence of unobserved global factors that can change randomly between experimental repetitions, and influence (or simply correlate with) survival probability at the population level.
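This can be checked with a quick simulation (a Python sketch; the Beta distribution and all parameter values are my own choices). Individual-level i.i.d. survival probabilities produce exactly binomial variance, whereas a single draw shared by the whole cohort - a stand-in for an unobserved global factor - produces overdispersion:

```python
import random

random.seed(42)
N, reps = 50, 20000          # cohort size, experimental repetitions
a = b = 2.0                  # Beta(2, 2): mean 0.5, Var(p) = 0.05

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Case A: each individual draws its own i.i.d. p -> counts stay binomial
surv_a = [sum(random.random() < random.betavariate(a, b) for _ in range(N))
          for _ in range(reps)]

# Case B: one p per repetition, shared by all individuals -> overdispersion
surv_b = []
for _ in range(reps):
    p = random.betavariate(a, b)
    surv_b.append(sum(random.random() < p for _ in range(N)))

binom_var = N * 0.5 * 0.5    # binomial variance at the mean probability
print(variance(surv_a) / binom_var)   # close to 1
print(variance(surv_b) / binom_var)   # much larger than 1
```

Case A is the individual-level variation discussed above; Case B is the global-factor mechanism, and only the latter inflates the variance beyond the binomial value.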

A few concluding remarks:

From a pure mathematical point of view, in the limit in which initial cohorts are sampled from a single, infinite population, the validity of the i.i.d. assumption is guaranteed by De Finetti’s theorem on infinite exchangeable sequences (in the finite case there are also guarantees of approximate validity). Clearly, if the experimental design involves sampling individuals from several populations in a systematic way, the resulting sequence of Bernoulli variables (alive/dead) is not exchangeable.

If the De Finetti measure in the previous point can change between different cohort releases, depending on some random and unmeasured parameter , this will effectively lead to the same kind of dependence between individual probability parameters illustrated above.

Another form in which dependence may arise is when survival of one individual may influence, or simply correlate with, survival of other individuals. Imagine, for instance, that we may only observe individuals in pairs. Mathematically, this will again manifest in the form of non-exchangeability.

**Further references:**

- (Cox and Snell 1989)

Cox, D. R., and E. J. Snell. 1989. *Analysis of Binary Data, Second Edition*. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {No Binomial Overdispersion from Variations at the Individual
Level},
date = {2024-03-06},
url = {https://vgherard.github.io/posts/2024-03-06-no-binomial-overdispersion-from-variations-at-the-individual-level/no-overdispersion-from-individual-variation.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “No Binomial Overdispersion from
Variations at the Individual Level.” March 6, 2024. https://vgherard.github.io/posts/2024-03-06-no-binomial-overdispersion-from-variations-at-the-individual-level/no-overdispersion-from-individual-variation.html.

where $U$ and $S$ denote the internal energy and entropy of the system, $T_0$ is the temperature of the heat source, and $Q$ and $W$ are the amounts of energy transferred to the system in the form of heat and work, respectively. For concreteness, I will focus on volume work and reversible processes, so that the equal sign in Equation 1 holds, with the temperature of the source being equal to $T$, the temperature of the system. Equation 1 then implies:

from which we can identify:

In the previous equation, $N$ represents the quantity of matter (say, the number of moles) in the system. The suffixes indicate that the partial derivatives are taken at constant $N$, which may look trivial for closed systems. In open systems, however, $N$ can vary due to matter exchanges with the surroundings, so that the internal energy is also a function of $N$, and this dependence defines the *chemical potential*:

Putting this together with Equation 3, we obtain a generalization of Equation 2 for an open system:

One should realize that this equation represents, at the present stage, no more than a *mathematical definition* of the partial derivatives of . In order to attach some physical content to it, we should connect the various terms appearing in Equation 5 with *operatively defined* quantities.
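Written out in the standard notation (with $\mu$ the chemical potential), Equation 5 is the familiar fundamental relation:

```latex
dU = T\,dS - p\,dV + \mu\,dN
```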

To appreciate this point, let us take a step back to the closed system case, and examine the physical content of the first two laws of thermodynamics. For a closed system we have the relations $\Delta U = Q + W$ and $\Delta S \geq Q / T_0$. It is important to realize that $Q$ and $W$ have independent operative definitions: $W$ is defined as the work performed by external forces during the process considered, whereas $Q$ is defined as the difference $\Delta U - W$. It is in light of these operative definitions that the laws of thermodynamics acquire physical content^{1}.

In order to establish a similar physical interpretation for the open system case, let us decompose the change in internal energy as follows:

Here the first two terms correspond to heat and work, as before, while the new term accounts for energy exchange due to matter transfer. Comparing Equation 6 and Equation 5, and given the expressions of and in the closed system case, it is tempting to conclude that:

but these identifications are easily seen to be *wrong*.

Before discussing the correct version of Equation 7, let us comment on why these identifications can be dangerously misleading. Consider a gas enclosed by rigid and thermally insulating walls, so that no energy exchange with the exterior in the form of heat is possible and, since volume is constant, no work is possible either (according to Equation 7). Suppose now that we have some mechanism to inject a number of additional particles into the system. Due to what was said above, we would be led to conclude that the corresponding variation of internal energy should be the chemical potential times the number of injected particles, which is incorrect (the correct answer is discussed below).

The physical reason why Equations 7 are wrong is that exchanged matter also carries an amount of entropy $s\,dN$ and occupies a volume $v\,dN$, where $s$ and $v$ are the entropy and volume per particle of the external source of matter (equal to those of the open system if the exchanged particles are in thermal equilibrium with it). These additional entropy and volume flows must be taken into account in the corresponding variations for open systems, which become:

Plugging this into Equation 5 and comparing with Equation 6, we obtain:

where is the internal energy per particle (the chemical potential can be shown to be equal to the Gibbs energy per particle).
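In formulas, the identifications of Equation 9 read (restored here in the convention of Knuiman, Barneveld, and Besseling 2012, with $u$, $s$, $v$ the energy, entropy and volume per particle):

```latex
\delta Q = T\,(dS - s\,dN), \qquad
\delta W = -p\,(dV - v\,dN), \qquad
\delta U_{\text{matter}} = u\,dN
```

One can check that summing the three terms, and using $\mu = u - Ts + pv$ (the Gibbs energy per particle), recovers $dU = T\,dS - p\,dV + \mu\,dN$.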

The meaning of Equation 9 is best clarified with a few examples. Consider first a process which simply consists in bringing together two quantities and of gas molecules kept at the same temperature and pressure . This amounts to a mere rescaling of the original system by a factor , so that all extensive quantities are simply scaled by this same factor:

Clearly, this process involves no energy exchange in the form of heat or work, and we see indeed from Equation 9 that:

Had we neglected the extra terms and in the equations for and , we would have concluded that the system has exchanged heat or performed work during such a null process.

As our second example, we consider a vapor-liquid phase transition. The vapor-liquid system, assumed to be in equilibrium, is enclosed in a cylinder with thermally conducting walls and surrounded by a medium at constant temperature. A piston at one extremity of the cylinder allows one to condense vapor by compression.

If we consider either the vapor or liquid phases as open systems, in an isobaric and isothermal transformation in which a quantity of vapor molecules is condensed, we have as in Equation 9:

where (denoting liquid or vapor), and . But, since specific quantities at the phase transition only depend on temperature, which is held constant, the overall changes in entropy and volume are simply equal to the amounts due to matter transfer and , which implies:

This is not unreasonable since, from the point of view of the open system, what is happening is simply that a quantity of vapor or liquid (that was produced before, somehow), is getting transferred to the system. This is entirely analogous to the previous example, in which two chunks of identical substance were simply joined together.

On the other hand, the conventional approach to the same problem treats the vapor-liquid system as a closed system. For this system, we can directly relate the changes in entropy and volume to heat and work:

which looks superficially different from the open system point of view.

There is no contradiction in the fact that and , since the open and closed system points of view are describing the condensation process from a very different angle. In order to see this, let’s imagine breaking down the process as follows. Starting with our system with internal energy :

- We separate an amount of vapor from the rest of the system, leaving the system with energy .
- We heat and compress the small portion while keeping it at constant temperature and pressure, until it condenses completely. The energy transferred to the small mass is , while the vapor-liquid system’s energy of course does not change, .
- We add the condensed liquid back to the vapor-liquid system, whose energy becomes

The crucial observation is that the open system point of view describes only steps 1 and 3, while the closed system point of view describes only step 2. The fact that the energy changes of the three steps add up implies that the two points of view lead to the same energy balance, as it should be. This can be easily verified:

where in the third equality we used the fact that at the phase transition, so that .

As a final note, let me mention that some references (especially from the engineering field) include the work required to transfer matter across the open system boundary in the definition of the energy carried by matter, which thus becomes proportional to the molar *enthalpy* (see *e.g.* Knuiman, Barneveld, and Besseling 2012). As a consequence, the external work performed on an open system simply reads $-p\,dV$, as in the closed system case. The reason why this could make sense is that if the “open system” is defined in terms of a (physical or imaginary) spatial boundary surface, which allows the flow of matter through some injection mechanism, one could be interested only in the work resulting from the expansion of the boundary - sometimes called *shaft work*, in contrast with the *flow work* included in the enthalpy. The way I see it, this leads to a cumbersome physical description in the context of the examples mentioned above.

Knuiman, Jan T, Peter A Barneveld, and Nicolaas AM Besseling. 2012. “On the Relation Between the Fundamental Equation of Thermodynamics and the Energy Balance Equation in the Context of Closed and Open Systems.” *Journal of Chemical Education* 89 (8): 968–72.

The first law, in its essence, postulates the existence of certain experimental conditions (thermal insulation) under which the work required by any transformation of a thermodynamic system depends only on the initial and final states - allowing the definition of a state function , the internal energy. Similarly, the second law postulates the existence of a state function that bounds the heat exchange with a source at temperature by .↩︎

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {On the First and Second Laws of Thermodynamics for Open
Systems},
date = {2024-03-04},
url = {https://vgherard.github.io/posts/2024-02-29-on-the-first-and-second-laws-of-thermodynamics-for-open-systems/on-the-first-and-second-laws-of-thermodynamics-for-open-systems.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “On the First and Second Laws of
Thermodynamics for Open Systems.” March 4, 2024. https://vgherard.github.io/posts/2024-02-29-on-the-first-and-second-laws-of-thermodynamics-for-open-systems/on-the-first-and-second-laws-of-thermodynamics-for-open-systems.html.

We compare two derivations of the stability conditions for hydrostatic equilibrium of an ideal fluid:

- A “parcel” argument, which follows the motion of a small particle of fluid, ignoring the dynamics of the surroundings.
- Standard linearization of the ideal fluid equations.

The two derivations turn out to give the same answer, but the intermediate steps in the parcel argument contain some hidden assumptions, which are clarified in the second approach.

We start with a fluid at rest in a constant gravitational field , and consider a small portion of fluid initially located at height . We imagine that this parcel is now vertically displaced to height , and that no heat is transferred between the parcel and the surroundings during this process. We further assume that the pressure inside the parcel rapidly equalizes with the pressure outside of it (on a time scale much shorter than the one involved in the displacement). Finally, *we assume that the whole process does not appreciably alter the pressure field* with respect to its equilibrium configuration, satisfying , where is the fluid’s density at rest. To anticipate, in the second derivation below, we will see that the last assumption may actually fail, giving rise to dynamics different from the one discussed in this Section.

The derivation below follows (Landau and Lifshitz 2013). The parcel’s acceleration in the vertical direction is given by Newton’s second law:

where the equilibrium assumption was used in the second equality; one density is the parcel’s own, while the other is the density of the surroundings evaluated at the parcel’s displaced height.

Using the thermodynamic state equation of the fluid, we can express densities in terms of pressure and specific entropy . For the fluid density, this means:

while for the parcel we have:

due to the fact that the process is adiabatic. Hence, expanding the right hand side of Equation 1 to first order in , we obtain:

where:

is called the *Brunt–Väisälä frequency*, or *buoyancy frequency* (all quantities in this equation can be evaluated at in the linear approximation we are considering). Equations Equation 2 imply that, in order for hydrostatic equilibrium to be stable, we must have , that is:

There are a few alternative ways to express Equation 4. First of all, using the Maxwell relation , we see that equilibrium requires:

Moreover, assuming , this simplifies to:

Considering as a function of and , we have:

where the Maxwell relation and the definition of the specific heat at constant pressure were used. Finally, using again the equilibrium condition , we obtain

where is the thermal expansion coefficient. For an ideal gas, the right hand side is just .
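Spelling this out (a reconstruction consistent with the standard Landau-Lifshitz treatment, with $\alpha$ the thermal expansion coefficient and $c_p$ the specific heat at constant pressure), the stability condition reads:

```latex
\frac{dT}{dz} > -\frac{g\,\alpha T}{c_p}
\qquad \longrightarrow \qquad
\frac{dT}{dz} > -\frac{g}{c_p}
\quad \text{(ideal gas, } \alpha = 1/T\text{)}
```

The right hand side of the ideal-gas form is minus the adiabatic lapse rate.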

The Brunt–Väisälä oscillation frequency (Equation 3) is actually correct only in a certain limit, which is best clarified by the more careful approach that proceeds from the ideal fluid equations. Nonetheless, the equilibrium condition turns out to be correct.

In fluid dynamics, our system would be described by the ideal fluid equations:

where denotes the material derivative. The last equation can be exchanged for^{1}:

where is the speed of sound. Denoting by and the pressure and density field of the hydrostatic solution, satisfying , we consider a perturbation of the form:

To linear order in the small quantities , and , the equations of motion read:

These equations take a rather simple form if we re-express them in terms of the mass flux density , that is to linear order. Before doing so, we notice that:

where is the buoyancy frequency defined above (*cf.* Equation 3) and we assume, consistent with cylindrical symmetry around the vertical axis, the wave vector to lie in a fixed vertical plane^{2}. Putting everything together, we obtain:

Strictly speaking, the quantities and appearing in this equation are scalar fields with a non-trivial spatial variation. However, assuming that the spatial scale of the perturbation is much smaller than the typical scale of variation of these fields, we can treat these two numbers as constants. For simplicity, we will work in natural units (this amounts to using the buoyancy frequency and the sound speed to set the units of time and length; the dependence on these quantities can be reintroduced in the final formulas through dimensional analysis).

The system then becomes a linear system with constant coefficients, which suggests searching for simple solutions of the form:

Plugging these into the linearized system, we obtain:

In order to solve these equations, we write:

From the first equation we obtain:

The second equation then yields:

Finally, from the third equation we obtain:

We now require to have an imaginary part , as we rigorously justify below. Under this assumption, the equation for has two real roots:

with:

( and have been reintroduced in these formulas as explained above).

Before proceeding further, we notice that the frequencies (those from the minus sign branch in Equation 16) are real if and only if , which is the same as the equilibrium condition derived from the parcel argument. The actual frequencies of oscillation are not given by , in general.

In order to understand the two branches of Equation 16, we start by noticing that, for the whole linearization approach to be valid, we must have (in natural units ):

This must be the case for the perturbation to be localized in the direction, which requires (notice that in natural units).

Assuming Equation 18, we can approximate the two roots in Eq. Equation 16 as follows:

We also notice that the fluid velocity field satisfies (without any approximation):

Let us first consider waves associated with , which are essentially sound waves and for which gravity plays very little role. These have both phase and group velocity aligned with and close to (the speed of sound), and the fluid velocity is also in the direction of (the waves are longitudinal):

In contrast, waves associated with , called *gravity waves*, have vanishing phase and group velocity in the limit , in general. The material velocity is perpendicular to :

The wave frequency depends on the angle between and , since . In particular, in the limit of plane waves in the direction, *i.e.* , we have , while plane waves orthogonal to gravity have frequency . From the physical point of view, these two limits correspond to the cases in which the horizontal spatial scale of the perturbation is much larger/smaller than the vertical scale, respectively.

We realize that the kind of perturbation analysed in the parcel argument implicitly refers to gravity waves of the second type (with small horizontal scales). From Equation 19, we see that the oscillation frequency coincides with the buoyancy frequency Equation 3 only if , that is, if the vertical spatial scale of the perturbation is much larger than its horizontal scale.

In order to justify the procedure used in the derivation of the plane waves solutions, consider a *localized* perturbation (say with compact support) at and let denote the vector of quantities , and . Since is localized, we can define:

and the inverse of the Fourier transform gives:

The Fourier components for a fixed satisfy a linear system of ordinary differential equations, for which we already found four independent solutions (two for each frequency) in the previous Section. The fifth solution can be easily verified to correspond to a static, divergence-less velocity perturbation, with the velocity field orthogonal to the axis:

In momentum space, in the basis provided by the four eigenvectors with eigenvalues given by Equation 16, plus this last (static) solution, time evolution is trivial.

As a parenthetical remark, we notice that if we drop the requirement of a *localized* perturbation, we can have additional solutions that are not covered by the previous remarks. A trivial example is that of a static solution of the hydrostatic equations, with all fields independent of time. These solutions are clearly not localized, since the pressure field changes only in the vertical direction. Another example is provided by Lamb waves, that satisfy the constraint:

and take the general form:

These waves are clearly not localized in the direction.

This is the point where I felt the algebra was getting a bit too involved, and I left the problem. There are still a few things that may be interesting to investigate. In particular, it would be nice to derive the explicit evolution of a wave packet, say:

with vanishing density and pressure perturbations (to simplify the algebra a little bit). One should compute the “modified” Fourier transform Equation 20 and express the coefficients in terms of the five eigenvectors derived above. Perturbations like this will in general give rise to a combination of acoustic and gravity waves, depending on the direction of the wave vector and on the ratio of the vertical and horizontal scales of the perturbation.

Landau, Lev Davidovich, and Evgenii Mikhailovich Lifshitz. 2013. *Fluid Mechanics: Landau and Lifshitz: Course of Theoretical Physics, Volume 6*. Vol. 6. Elsevier.

In order to see this, we simply write as a function of and and take the material derivative.↩︎

From the pure mathematical point of view, this is not a strict consequence of Equation 9, which are in fact consistent with any static density configuration, as long as is satisfied. The physical reason is, of course, that we’re neglecting thermal conductivity, which allows for an arbitrary temperature gradient to persist forever in the absence of motion.↩︎

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Gravity Waves in an Ideal Fluid},
date = {2024-02-22},
url = {https://vgherard.github.io/posts/2024-02-22-gravity-waves-in-an-ideal-fluid/gravity-waves-in-an-ideal-fluid.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “Gravity Waves in an Ideal Fluid.”
February 22, 2024. https://vgherard.github.io/posts/2024-02-22-gravity-waves-in-an-ideal-fluid/gravity-waves-in-an-ideal-fluid.html.

Notice that the representation is unique for all outside of a countable subset of the unit interval.^{1}

The cool theorem proved below is that a random variable is uniformly distributed on the unit interval if and only if all of its binary digits are independent, each being 0 or 1 with probability 1/2. That is to say, the binary representation of a uniform random variable amounts to a sequence of independent fair coin tosses.

We fix and decompose the unit interval as follows:

Each interval corresponds to a specific set of values for the first digits , that is if and only if for some that depend on the interval . Therefore:

Now, is uniformly distributed if and only if the left hand side of this equation equals for all and ^{2}. Furthermore, the possible values of correspond to the possible values of in the right hand side. Therefore, is uniform if and only if:

for all . More generally, this implies that, for any we have:

where the second equality follows from the special case . This is equivalent to saying that all ’s are independent, each having .

That is, the set of numbers that have a *finite* expansion terminating after a finite number of digits (the dyadic rationals). These numbers also have an equivalent infinite expansion, ending with an infinite string of 1s. For these numbers we can make the convention of using the first (finite) representation.↩︎

That this is sufficient follows from the fact that any interval of the real line can be obtained by taking countable unions and intersections of intervals of this form.↩︎

BibTeX citation:

```
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Binary Digits of Uniform Random Variables},
date = {2024-01-29},
url = {https://vgherard.github.io/posts/2024-01-29-binary-digits-of-uniform-random-variables/binary-digits-of-uniform-random-variables.html},
langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2024. “Binary Digits of Uniform Random
Variables.” January 29, 2024. https://vgherard.github.io/posts/2024-01-29-binary-digits-of-uniform-random-variables/binary-digits-of-uniform-random-variables.html.

During the last few months, I’ve been working on a machine learning algorithm with applications in Forensic Science, a.k.a. Criminalistics. In this field, one common task for the data analyst is to present the *trier-of-fact* (the person or people who determine the facts in a legal proceeding) with a numerical assessment of the strength of the evidence provided by the available data towards different hypotheses. In more familiar terms, the forensic expert is responsible for computing the likelihoods (or likelihood ratios) of data under competing hypotheses, which are then used by the trier-of-fact to produce Bayesian posterior probabilities for the hypotheses in question^{1}.

In relation to this, forensic scientists have developed a bunch of techniques to evaluate the performance of a likelihood ratio model in discriminating between two alternative hypotheses. In particular, I have come across the so-called *Likelihood Ratio Cost*, usually defined as:

where we assume we have data consisting of independent identically distributed observations , with binary ; and stand for the number of positive () and negative () cases; and is a model for the likelihood ratio .

The main reason for writing this note was to understand a bit better what it means to optimize Equation 1, which does not look immediately obvious to me from its definition^{2}. In particular: is the population minimizer of Equation 1 the actual likelihood ratio? And in what sense is a model with a lower cost better than one with a correspondingly higher value?

The short answers to these questions are: yes; and: optimization seeks the model with the best predictive performance in a Bayesian inference setting with an uninformative prior on the hypotheses, assuming that this prior actually reflects reality (*i.e.* positive and negative cases are equally probable). The mathematical details are given in the rest of the post.
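For concreteness, here is the Likelihood Ratio Cost of Equation 1 in its usual form (a Python sketch of my own, with made-up likelihood ratios). A well-known sanity check is that a system always outputting LR = 1 - carrying no information - has a cost of exactly 1:

```python
import math

def cllr(lr_pos, lr_neg):
    """Likelihood Ratio Cost: average base-2 log-penalties over the
    positive (y = 1) and negative (y = 0) cases, equally weighted."""
    term_p = sum(math.log2(1 + 1 / lr) for lr in lr_pos) / len(lr_pos)
    term_n = sum(math.log2(1 + lr) for lr in lr_neg) / len(lr_neg)
    return 0.5 * (term_p + term_n)

print(cllr([1.0] * 10, [1.0] * 10))        # uninformative system: exactly 1
print(cllr([10.0, 8.0, 4.0], [0.1, 0.2]))  # informative, calibrated LRs: below 1
```

A model is penalized both for lack of discrimination (LRs near 1) and for confident mistakes (large LRs for negatives, small ones for positives), consistently with the predictive reading given below.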

We start with a mathematical digression, which will turn out to be useful for further developments. Let be independent draws from a joint distribution, with binary . Given a function that is symmetric in its arguments^{3}, we define the random functional:

where is any function satisfying for all , and we let for any number . Notice that for , this is just the usual cross-entropy loss.

We now look for the population minimizer of Equation 2, *i.e.* the function that minimizes the functional ^{4}. Writing the expectation as:

we can easily see that is a convex functional with a unique minimum given by:

The corresponding expected loss is:

where is the entropy of a binary random variable with probability (the index in the previous expression can be any index, since data points are assumed to be identically distributed).
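For concreteness, the binary entropy appearing here can be computed with a small helper (in nats; the name and interface are mine):

```r
# Entropy of a Bernoulli(p) variable, in nats; the degenerate cases
# p = 0 and p = 1 are taken to have zero entropy.
binary_entropy <- function(p) {
  ifelse(p <= 0 | p >= 1, 0, -p * log(p) - (1 - p) * log(1 - p))
}

binary_entropy(0.5)  # log(2), the maximum
```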

Before looking at values of other than , we observe that the previous expectation can be succinctly expressed as:

where

and is the conditional entropy of with respect to a *different* probability measure , defined by:

where is fixed by the requirement^{5}:

In terms of , the population minimizers and in Equation 3 can be simply expressed as:

If now is an arbitrary function, we have:

where , and is the Kullback-Leibler divergence between the measure and the measure defined by:

(notice that by definition). Finally, suppose that for some random variable , and define the corresponding functional:

Then . If is the population minimizer of , it follows that .

Putting everything together, we can decompose the expected loss for a function , where , in the following suggestive way:

where is defined in Equation 4. In the equation for we introduced the conditional mutual information (with respect to the measure ), which satisfies (Cover and Thomas 2006):

The three components in Equation 8 can be interpreted as follows: represents the minimum expected loss achievable, given the data available ; accounts for the information lost in the processing transformation ; finally is due to misspecification, *i.e.* the fact that the model for the true posterior probability is an approximation.

All the information-theoretic quantities (and their corresponding operative interpretations hinted at in the previous paragraph) make reference to the measure defined by Equation 5 and Equation 6. This is merely the result of altering the proportion of positive () and negative () examples in the - joint distribution by a factor dictated by the weight function , while keeping conditional distributions such as unchanged.

For , the functional coincides with the usual cross-entropy loss^{6}:

From Equation 6 we see that the measure coincides with the original , so that by Equation 3 the population minimizer of Equation 9 is (independently of sample size). Since (*cf.* Equation 4), the decomposition Equation 8 reads:

where conditional entropy , mutual information and relative entropy now simply refer to the original measure .

The quantity defined in Equation 1 can be put in the general form Equation 2, if we let and^{7}:

In what follows, I will consider a slight modification of the usual , defined by the weight function:

This yields Equation 1 multiplied by , which I will keep denoting as , with a slight abuse of notation.

We can easily compute^{8}:

so that, by Equation 3, the population minimizer of is:

where denotes the *likelihood-ratio* of , schematically:

The constant in Equation 4 is:

The general decomposition Equation 8 becomes:

where is now given by Equation 11.

The table below provides a comparison between cross-entropy and likelihood-ratio cost, summarizing the results from previous sections.

| | Cross-entropy | Likelihood Ratio Cost |
|---|---|---|
| Population minimizer | Posterior odds ratio | Likelihood ratio |
| Minimum Loss | | |
| Processing Loss | | |
| Misspecification Loss | | |
| Reference measure | Original measure | Balanced measure |

The objective of is found to be the likelihood ratio, as the terminology suggests. The interpretation of model selection via minimization turns out to be slightly more involved than for cross-entropy, which we review first.

Suppose we are given a set of predictive models , each of which consists of a processing transformation, , and an estimate of the posterior probability . When the sample size , cross-entropy minimization will almost certainly select the model that minimizes . Following standard Information Theory arguments, we can interpret this model as the statistically optimal compression algorithm for , assuming to be available at both the encoding and decoding ends.

The previous argument carries over *mutatis mutandis* to minimization, with an important qualification: optimal average compression is now achieved for data distributed according to a different probability measure . In particular, according to , the likelihood ratio coincides with the posterior odds ratio, and coincides with the posterior probability, which clarifies why we can measure differences from the true likelihood ratio through the Kullback-Leibler divergence.

The measure is not just an abstruse mathematical construct: it is the result of balanced sampling from the original distribution, *i.e.* taking an equal number of positive and negative cases^{9}. If the distribution is already balanced, either by design or because of some underlying symmetry in the data generating process, our analysis implies that likelihood-ratio cost and cross-entropy minimization are essentially equivalent for . In general, with , this is not the case^{10}.
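The balanced-sampling construction just described can be sketched in a few lines of R (a hypothetical helper, downsampling the majority class; names are mine):

```r
# Downsample the majority class of a data frame with binary y, so that
# positives and negatives are equally represented -- an empirical
# approximation of the balanced measure discussed in the text.
balance <- function(data) {
  idx1 <- which(data$y == 1)
  idx0 <- which(data$y == 0)
  n <- min(length(idx1), length(idx0))
  rbind(data[sample(idx1, n), ], data[sample(idx0, n), ])
}
```

As the analysis above suggests, a cross-entropy fit on `balance(data)` should, for large samples, approximate a likelihood-ratio cost fit on the original data.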

The fact that seeks optimal predictors according to the balanced measure is, one could argue, not completely crazy from the point of view of forensic science, where "" often stands for a sort of verdict (guilty *vs.* not guilty, say). Indeed, optimizing with respect to means that our predictions are designed to be optimal in a world in which the verdict could *a priori* be or with equal probability - which is what an unbiased trier-of-fact should ideally assume. By minimizing , we guard ourselves against any bias that may be implicit in the training dataset, extraneous to the - relation and not explicitly modeled - a feature that may be regarded as desirable from a legal standpoint.

In general, the posterior odds ratio and the likelihood ratio differ only by a constant, so it is reasonable to fit the same functional form to both of them. Let us illustrate the differences between cross-entropy and optimization mentioned in the previous Section with a simulated example of this kind.

Suppose that has conditional density: and has marginal probability . The true likelihood-ratio and posterior odds ratio are respectively given by:

where we have defined:
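The missing expressions can be reconstructed from the ratio of the two Gaussian densities; consistently with the `llr_true()` helper in the code below, they read:

```latex
\begin{gathered}
\log \mathrm{LR}(x) = a x^2 + b x + c, \qquad
\log \mathrm{O}(x) = \log \mathrm{LR}(x) + \log\frac{\pi}{1-\pi}, \\
a = \frac{\sigma_1^2 - \sigma_0^2}{2\,\sigma_1^2 \sigma_0^2}, \qquad
b = \frac{\mu_1}{\sigma_1^2} - \frac{\mu_0}{\sigma_0^2}, \qquad
c = \frac{1}{2}\left(\frac{\mu_0^2}{\sigma_0^2} - \frac{\mu_1^2}{\sigma_1^2}\right) + \log\frac{\sigma_0}{\sigma_1}.
\end{gathered}
```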

Suppose that we fit an exponential function to by likelihood-ratio cost minimization, and similarly to by cross-entropy minimization^{11}. In light of the previous discussion, one could reasonably expect the results of the two procedures to differ in some way, which is demonstrated below by simulation.

The chunk of R code below defines the functions and data used for the simulation. In particular, I'm considering a heavily unbalanced case () in which negative cases give rise to a sharply localized peak around (, ), while the few positive cases give rise to a broader signal centered at (, ).

```
# Tidyverse facilities for plotting
library(dplyr)
library(ggplot2)

# Loss functions
weighted_loss <- function(par, data, w) {
  m <- par[[1]]
  q <- par[[2]]
  x <- data$x
  y <- data$y
  z <- m * x + q
  p <- 1 / (1 + exp(-z))
  -mean(y * w(y) * log(p) + (1 - y) * w(1 - y) * log(1 - p))
}

cross_entropy <- function(par, data)
  weighted_loss(par, data, w = \(y) 1)

cllr <- function(par, data)
  weighted_loss(par, data, w = \(y) mean(1 - y))

# Data generating process
rxy <- function(n, pi = .001, mu1 = 1, mu0 = 0, sd1 = 1, sd0 = 0.25) {
  y <- runif(n) < pi
  x <- rnorm(n, mean = y * mu1 + (1 - y) * mu0, sd = y * sd1 + (1 - y) * sd0)
  data.frame(x = x, y = y)
}
pi <- formals(rxy)$pi

# Simulation
set.seed(840)
data <- rxy(n = 1e6)
par_cllr <- optim(c(1, 0), cllr, data = data)$par
par_cross_entropy <- optim(c(1, 0), cross_entropy, data = data)$par
par_cross_entropy[2] <- par_cross_entropy[2] - log(pi / (1 - pi))

# Helpers to extract LLRs from models
llr <- function(x, par)
  par[1] * x + par[2]

llr_true <- function(x) {
  mu1 <- formals(rxy)$mu1
  mu0 <- formals(rxy)$mu0
  sd1 <- formals(rxy)$sd1
  sd0 <- formals(rxy)$sd0
  a <- 0.5 * (sd1^2 - sd0^2) / (sd1^2 * sd0^2)
  b <- mu1 / (sd1^2) - mu0 / (sd0^2)
  c <- 0.5 * (mu0^2 / (sd0^2) - mu1^2 / (sd1^2)) + log(sd0 / sd1)
  a * x * x + b * x + c
}
```

So, what do our best estimates look like? The plot below shows the best-fit lines for the log-likelihood ratio from minimization (in solid red) and from cross-entropy minimization (in solid blue). The true log-likelihood ratio parabola is the black line. Also shown are the line (in dashed red) and the line (in dashed blue), which are the appropriate Bayes thresholds for classifying a data point as positive (), assuming the data come from a balanced or an unbalanced distribution, respectively.

```
ggplot() +
  geom_function(fun = \(x) llr(x, par_cllr), color = "red") +
  geom_function(fun = \(x) llr(x, par_cross_entropy), color = "blue") +
  geom_function(fun = \(x) llr_true(x), color = "black") +
  geom_hline(aes(yintercept = 0), linetype = "dashed", color = "red") +
  geom_hline(
    aes(yintercept = -log(pi / (1 - pi))),
    linetype = "dashed", color = "blue"
  ) +
  ylim(c(-10, 10)) + xlim(c(-1, 2)) +
  xlab("X") + ylab("Log-Likelihood Ratio")
```

The reason why the lines differ is that they are designed to solve different predictive problems: as we've argued above, minimizing looks for the best conditional probability estimate according to the balanced measure , whereas cross-entropy minimization does the same for the original measure . This is what the data look like under the two measures (the histograms are stacked - in the unbalanced case, positive examples are invisible on the linear scale of the plot):

```
test_data <- bind_rows(
  rxy(n = 1e6, pi = 0.5) |> mutate(type = "Balanced", llr_thresh = 0),
  rxy(n = 1e6) |> mutate(type = "Unbalanced", llr_thresh = -log(pi / (1 - pi)))
)

test_data |>
  ggplot(aes(x = x, fill = y)) +
  geom_histogram(bins = 100) +
  facet_grid(type ~ ., scales = "free_y") +
  xlim(c(-2, 4))
```

These differences are reflected in the misclassification rates of the resulting classifiers defined by , where the appropriate threshold is zero in the balanced case and in the unbalanced one. In line with intuition, we see that the optimizer beats the cross-entropy optimizer on the balanced sample, while performing significantly worse on the unbalanced one.

```
test_data |>
  mutate(
    llr_cllr = llr(x, par_cllr),
    llr_cross_entropy = llr(x, par_cross_entropy),
    llr_true = llr_true(x)
  ) |>
  group_by(type) |>
  summarise(
    cllr = 1 - mean((llr_cllr > llr_thresh) == y),
    cross_entropy = 1 - mean((llr_cross_entropy > llr_thresh) == y),
    true_llr = 1 - mean((llr_true > llr_thresh) == y)
  )
```

```
# A tibble: 2 × 4
  type           cllr cross_entropy true_llr
  <chr>         <dbl>         <dbl>    <dbl>
1 Balanced   0.166          0.185    0.140
2 Unbalanced 0.000994       0.000637 0.000518
```

Our main conclusion, in a nutshell, is that minimization is equivalent, *in the infinite sample limit*, to cross-entropy minimization on a balanced version of the original distribution. We haven't discussed what happens for finite samples, where variance starts to play a role, affecting the *efficiency* of loss functions as model optimization and selection criteria. For instance, for a well-specified model of the likelihood ratio, how do the convergence properties of and cross-entropy estimators compare to each other? I expect that answering questions like this would require a much more in-depth study than the one performed here (likely, with simulation playing a central role).

Brümmer, Niko, and Johan du Preez. 2006. “Application-Independent Evaluation of Speaker Detection.” *Computer Speech & Language* 20 (2): 230–75. https://doi.org/10.1016/j.csl.2005.08.001.

Cover, Thomas M., and Joy A. Thomas. 2006. *Elements of Information Theory*. 2nd ed. Wiley-Interscience.

This is how I understood things should *theoretically* work, from discussions with friends who are actually working in this field. I have no idea how close day-to-day practice comes to this mathematical ideal, or whether there exist alternative frameworks to the one I have just described.↩︎

The Likelihood Ratio Cost was introduced in (Brümmer and du Preez 2006). The reference looks very complete, but I find its notation and terminology so unfamiliar that I decided to do my own investigation and leave this reading for a later moment.↩︎

That is to say, for any permutation of the set .↩︎

*Nota bene:* the function is here assumed to be fixed, whereas the randomness in the quantity only comes from the paired observations .↩︎

Notice that, due to symmetry, , which might be easier to compute.↩︎

Here and below I relax a bit the notation, as most details should be clear from context.↩︎

The quantity is not defined when all ’s are zero, as is the right-hand side of Equation 1 itself. In this case, we adopt the convention .↩︎

For the original loss in Equation 1, without the modification discussed above, the result would have been ↩︎

Formally, given an i.i.d. stochastic process , we can define a new stochastic process such that if , and (not defined) otherwise. Discarding values, we obtain an i.i.d. stochastic process whose individual observations are distributed according to .↩︎

There is another case in which and cross-entropy minimization converge to the same answer as : when they are used for model selection within a class of models for the likelihood or posterior odds ratio that contains the correct functional form.↩︎

This is just logistic regression. It could be a reasonable approximation if , which however I will assume below to be badly violated.↩︎

BibTeX citation:

```
@online{gherardi2023,
  author = {Gherardi, Valerio},
  title = {Interpreting the {Likelihood} {Ratio} Cost},
  date = {2023-11-15},
  url = {https://vgherard.github.io/posts/2023-11-15-interpreting-the-likelihood-ratio-cost/interpreting-the-likelihood-ratio-cost.html},
  langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2023. “Interpreting the Likelihood Ratio
Cost.” November 15, 2023. https://vgherard.github.io/posts/2023-11-15-interpreting-the-likelihood-ratio-cost/interpreting-the-likelihood-ratio-cost.html.

for all . A measurable function is integrable with respect to if and only if is integrable with respect to , in which case^{1}:

Now, given an arbitrary event , define . Then is a measure on , clearly dominated by , so that there exists a Radon-Nikodym derivative . We define the conditional probability of event with respect to the random variable as the random variable:

The intuition behind this definition comes from the tautology (given the definition in terms of Radon-Nikodym derivative):
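Reconstructing the formula from the surrounding definitions (with $\mu_X$ the distribution of $X$ and $B$ ranging over the measurable sets of the target space), the tautology in question is:

```latex
P\left( A \cap \{X \in B\} \right)
  = \int_B P(A \mid X = x) \, \mu_X(\mathrm{d}x).
```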

On one hand, from elementary probability theory, one would expect any sensible definition of conditional probability to satisfy this identity. On the other hand, the identity uniquely determines as the Radon-Nikodym derivative , modulo a set of measure zero.

It is fairly easy to verify the following properties of conditional probability:

- *Countable additivity*. For any finite or countable family of disjoint events, , we have: for almost all .
- *Positivity*. For any event , we have for almost all .
- *Normalization*. for almost all .

This, however, does not generally imply that is a probability measure for almost all ^{2}. Functions such that is a measure for all , and is -measurable for all are called *random measures*. If satisfies (or, equivalently, if is a version of ) for all , is called a *regular conditional probability* for the random variable . If the space is regular enough (*e.g.* if it is a Borel space) one can prove that a regular conditional probability exists for any random variable , see *e.g.* (Kallenberg 1997).

If , where has positive probability , we can easily compute:

In particular, agrees with the usual elementary definition of conditional probability.

More generally, if , where the target space is equipped with a sub-σ-algebra , we have:

which is sometimes taken as the definition of conditional probability with respect to a sub-σ-algebra. When is the σ-algebra generated by a finite or countable partition of such that for all , we find:
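For this countable-partition case, the reconstructed formula reads (with $\{B_i\}$ the partition elements and $\mathbf{1}_{B_i}$ their indicator functions):

```latex
P(A \mid \mathcal{G})(\omega)
  = \sum_i \frac{P(A \cap B_i)}{P(B_i)} \, \mathbf{1}_{B_i}(\omega),
```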

again in agreement with elementary definitions.

Finally, if is a real-valued random variable, where is equipped with the Borel -algebra, coincides with the Stieltjes measure generated by the cumulative distribution function of . Denoting , we may write:

and, in particular:

Kallenberg, Olav. 1997. *Foundations of Modern Probability*. Vol. 2. Springer.

These claims can be proved by a standard argument using approximations by simple functions.↩︎

For instance, denoting by , positivity implies that . However, there’s no guarantee that is also a measure-zero set (and in fact it need not even be measurable, since the union is generally uncountable).↩︎

BibTeX citation:

```
@online{gherardi2023,
  author = {Gherardi, Valerio},
  title = {Conditional {Probability}},
  date = {2023-11-03},
  url = {https://vgherard.github.io/posts/2023-11-03-conditional-probability/conditional-probability.html},
  langid = {en}
}
```

For attribution, please cite this work as:

Gherardi, Valerio. 2023. “Conditional Probability.”
November 3, 2023. https://vgherard.github.io/posts/2023-11-03-conditional-probability/conditional-probability.html.