Machine Learning and Medicine

Induction, Inductive Biases, and Infusing Knowledge into Learned Representations

Mon, 22 Jun 2020 06:00:00 -0400

Note: This post is a modified excerpt from the introduction to my PhD thesis.

Outline:

Inductive Generalization and Inductive Biases
-Philosophical Foundations for the Problem of Induction
-Inductive Biases in Machine Learning
Learned Representations of Data and Knowledge
-Background on Representation Learning
-Infusing Domain Knowledge into Neural Representations

Inductive Generalization and Inductive Biases

Our goal in building machine learning systems is, with rare exceptions, to create algorithms whose utility extends beyond the dataset on which they are trained. In other words, we desire intelligent systems that are capable of generalizing to future data. The process of leveraging observations to draw inferences about the unobserved is the principle of induction.

Terminological note: In a non-technical setting, the term inductive – denoting the inference of general laws from particular instances – is typically contrasted with the adjective deductive, which denotes the inference of particular instances from general laws. This broad definition of induction may be used in machine learning to describe, for example, the model fitting process as the inductive step and the deployment on new data as the deductive step. By the same token, some AI methods such as automated theorem provers are described as deductive. In the setting of current ML research, however, it is much more common for the term ‘inductive’ to refer specifically to methods that are structurally capable of operating on new data points without retraining. In contrast, transductive methods require a fixed or pre-specified dataset, and are used to make internal predictions about missing features or labels. While many ML methods are assumed to be inductive in both senses of the term, this section concerns itself primarily with the broader notion of induction as it relates to learning from observed data. In contrast, Chapters 1 and 2 involve the second use of this term, as I propose new methods that are inductive but whose predecessors were transductive 1.

Philosophical Foundations for the Problem of Induction

Even ancient philosophers appreciated the tenuity of inductive generalization. As early as the second century, the Greek philosopher Sextus Empiricus argued that the very notion of induction was invalid, a conclusion independently argued by the Charvaka school of philosophy in ancient India 2,3. The so-called “problem of induction,” as it is best known today, was formulated by 18th-century philosopher David Hume in his twin works A Treatise of Human Nature and An Enquiry Concerning Human Understanding 4,5. In these works, Hume argues that all inductive inference hinges upon the premise that the future will resemble the past. This premise has since become known as his “Principle of Uniformity of Nature” (or simply, the “Uniformity Principle”), the “Resemblance Principle,” or his “Principle of Extrapolation” 6.

In the Treatise and the Enquiry, Hume examines various arguments – intuitive, demonstrative, sensible, probabilistic – that could be proposed to establish the principle of extrapolation and, having rejected them all, concludes that inductive inference itself is “not determin’d by reason.” Hume thus places induction outside the scope of reason itself, casting it therefore as non-rational if not irrational. In his 1955 work Fact, Fiction, and Forecast, Nelson Goodman extended and reframed Hume’s arguments, proposing “a new riddle of induction” 7. For Goodman, the key challenge was not the validity of induction per se, but rather the recognition that for any set of observations, there are multiple contradictory generalizations that could be used to explain them.

At least among scientists, the best known formal response to the problem of induction comes from the philosopher of science Karl Popper. In Conjectures and Refutations, Popper argues that science may sidestep the problem of induction by relying instead upon scientific conjecture followed by criticism 8. Stated otherwise, according to Popper, the central goal of scientists should be to formulate falsifiable theories which can be provisionally treated as true when they survive repeated attempts to prove them false. Popper’s arguments may be helpful as we frame our evaluation of any specific ML system that has already been trained – and thus instantiated, in a sense, as a “conjecture” that can be refuted. However, the training process of ML systems is itself an act of inductive inference and thus relies on a Uniformity Principle in a way that Popper’s conjecture-refutation framework does not address.

Note: Popper’s framing is frequently used to justify the statistical hypothesis testing frameworks proposed by the likes of Neyman, Pearson, and Fisher. However, the compatibility of Popperian falsification and statistical hypothesis testing is a matter of debate 9,10,11.

This thesis is not a work of philosophy. However, I consider it important to acknowledge that the entire field of machine learning – the branch of AI concerned with constructing computers that learn from experience 12 – is predicated upon a core premise that has, for centuries, been recognized as unprovable and arguably non-rational. To boot, even if the inductive framework is accepted as valid, there are an infinite number of contradictory generalizations that are equally consistent with our training data. While these observations may be philosophical in spirit and may appear impractical, they provide a framing for extremely practical questions:

-Under which circumstances can we reasonably expect the future to resemble the past, as far as our models are concerned?
-Given an infinite number of valid generalizations from our data – most of which are presumably useless or even dangerous – what guiding principles do we leverage to choose between them?
-What are the falsifiable empirical claims that we should be making about our models, and how should we test them?
-If we are to assume that prospective failure of our systems is the most likely outcome, as Popper would, what reasonable standards can be set to nevertheless trust ML in safety-critical settings such as healthcare?

Each of these questions will be repeatedly considered throughout the course of this thesis.

Inductive Biases in Machine Learning

As outlined above, the paradigm of machine learning presupposes the identification – à la Hume – of some set of tasks and environments for which we expect the future to resemble the past. At this point, we are thus forced to determine guiding principles – à la Goodman – that give our models strong a priori preferences for generalizations that we expect to extrapolate well into the future. When such guiding principles are instantiated as design decisions in our models, they are known as inductive biases.

In his 1980 report The Need for Biases in Learning Generalizations, Tom M. Mitchell argues that inductive biases constitute the heart of generalization and indeed a key basis for learning itself:

If consistency with the training instances is taken as the sole determiner of appropriate generalizations, then a program can never make the inductive leap necessary to classify instances beyond those it has observed. Only if the program has other sources of information, or biases for choosing one generalization over the other, can it non-arbitrarily classify instances beyond those in the training set....

The impact of using a biased generalization language is clear: each subset of instances for which there is no expressible generalization is a concept that could be presented to the program, but which the program will be unable to describe and therefore unable to learn. If it is possible to know ahead of time that certain subsets of instances are irrelevant, then it may be useful to leave these out of the generalization language, in order to simplify the learning problem. ...

Although removing all biases from a generalization system may seem to be a desirable goal, in fact the result is nearly useless. An unbiased learning system’s ability to classify new instances is no better than if it simply stored all the training instances and performed a lookup when asked to classify a subsequent instance.
Tom M. Mitchell, The Need for Biases in Learning Generalizations

A key challenge of machine learning, therefore, is to design systems whose inductive biases align with the structure of the problem at hand. The effect of such efforts is not merely to endow the model with the capacity to learn key patterns, but also – somewhat paradoxically – to deliberately hamper the capacity of the model to learn other (presumably less useful) patterns, or at least to drive the model away from learning them. In other words, inductive biases stipulate the properties that we believe our model should have in order to generalize to future data; they thus encode our key assumptions about the problem itself.
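Mitchell’s argument about unbiased learners can be checked by brute force on a toy problem. The sketch below is only an illustration under assumed conditions: three boolean features, a hypothetical training set of five labelled instances, and a made-up "biased" hypothesis space in which the label must equal one of the features. None of these choices come from Mitchell’s report.

```python
from itertools import product

# All 8 possible inputs over three boolean features.
inputs = list(product([0, 1], repeat=3))

# An "unbiased" hypothesis space: every one of the 2**8 = 256 labelings of the inputs.
hypotheses = [dict(zip(inputs, labels)) for labels in product([0, 1], repeat=len(inputs))]

# Hypothetical training set: five labelled instances (the label is the first feature).
train = {x: x[0] for x in inputs[:5]}
unseen = inputs[5:]

# Keep only the hypotheses that agree with every training instance.
consistent = [h for h in hypotheses if all(h[x] == y for x, y in train.items())]

# For each unseen instance, count how many consistent hypotheses predict label 1.
votes = {x: sum(h[x] for h in consistent) for x in unseen}
print(len(consistent))  # 8 survivors: one per possible labeling of the 3 unseen inputs
print(votes)            # every unseen instance receives 4 of 8 votes for label 1

# A biased space (an assumption for illustration): "the label equals one feature".
biased = [{x: x[i] for x in inputs} for i in range(3)]
biased_consistent = [h for h in biased if all(h[x] == y for x, y in train.items())]
print(len(biased_consistent))                      # 1 hypothesis survives
print([biased_consistent[0][x] for x in unseen])   # [1, 1, 1]
```

In the unbiased space the surviving hypotheses split evenly on every unseen instance, so the training data alone cannot favor either label. The biased space commits to a single answer, which is helpful only if its assumption matches the problem.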

The machine learning toolkit has a wide array of methods to induce inductive biases in learning systems. For example, regularization methods such as L1-/L2-penalties 13, dropout 14, or early stopping 15 are a simple yet powerful means to impose Occam’s razor onto the training process. By the same token, the maximum margin loss of support vector machines 16, or model selection based on cross-validation can be described as inductive biases 17,18. Bayesian methods of almost any form induce inductive biases by placing explicit prior probabilities over model parameters. Machine learning systems that build on symbolic logic, such as inductive logic programming 19, encode established knowledge into very strict inductive biases, by forcing algorithms to reason about training examples explicitly in terms of hypotheses derived from pre-specified databases of facts. As nicely synthesized in Battaglia et al., the standard layer types of modern neural networks each have distinct invariances that induce corresponding relational inductive biases; for example, convolutional layers have spatial translational invariance and induce a relational inductive bias of locality, whereas recurrent layers have a temporal invariance that induces the inductive bias of sequentiality 20. Such relational inductive biases are extremely powerful when well-matched to the data on which they are applied.
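As one concrete illustration of regularization acting as a preference for simpler solutions, consider an L2 penalty on a one-parameter linear model without an intercept. Minimizing $\sum_i (y_i - w x_i)^2 + \lambda w^2$ gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$, so larger $\lambda$ pulls the fitted weight toward zero. The data and the function name `ridge_slope` below are invented for this sketch.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form minimiser of sum((y - w*x)**2) + lam * w**2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# The fitted weight shrinks monotonically as the penalty grows.
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

With `lam = 0.0` this reduces to ordinary least squares; the penalty expresses a prior preference for small weights regardless of what the data say.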

In the next section, I will introduce the neural representation learning framework – the dominant paradigm of machine learning today – and discuss inductive biases in this setting, with a special emphasis on recent tools for infusing external knowledge into the inductive biases of our models.

Learned Representations of Data and Knowledge

The performance of most information processing systems, including machine learning systems, typically depends heavily upon the data representations (or features) they employ. Historically, this meant the devotion of significant labor and expertise to feature engineering, the design of data transformations and preprocessing techniques to extract and organize discriminative features from data prior to the application of ML. Representation learning 21,22 is an alternative to feature engineering, and refers to learning representations of data (or knowledge graphs 23) that are optimized for utility in downstream tasks such as prediction or information retrieval.

Background on Representation Learning

Many canonical methods in statistical learning can be considered representation learning methods. For example, low-dimensional data representations with desirable properties are learned by unsupervised methods such as principal components analysis 24, k-means clustering 25, independent components analysis 26, and manifold learning methods such as Isomap 27 and locally linear embedding 28. Within the field of machine learning, the most popular paradigm for representation learning is neural networks 21,22, which provide an extremely flexible framework that can in theory be used to approximate any continuous function 29. Over the past two decades, representation learning with neural networks has steadily outperformed traditional feature engineering methods on a large family of tasks, including speech recognition 30, image processing 31, and natural language processing 32.

A common feature of all the representation learning methods just mentioned is that they are designed to learn data representations that have lower dimensionality than the original data. This basic inductive bias is motivated by the so-called manifold hypothesis, which states that most real world data – images, text, genomes, etc. – are captured and stored in high dimensions but actually consist of some lower-dimensional data manifold embedded in that high-dimensional space.

Another desirable property of learned representations is that they be distributed representations 21,22, composed of multiple elements that can be set separately from each other. Distributed representations are highly expressive: $n$ learned features with $k$ values can represent $k^n$ different concepts, with each feature element representing a degree of meaning along its own axis. This results in a rich similarity space that improves the generalizability of resultant models. The benefits of distributed representations apply to any data type, but are particularly obvious from a conceptual level when considering settings such as natural language processing 33, where the initial data representations are encoded as symbols that lack any relationship with their underlying meaning. For example, the two sentences (or their equivalent triples, in a knowledge graph setting) ‘ibuprofen impairs renal function’ and ‘Advil damages the kidneys’ have zero tokens or n-grams in common. Thus, machine learning programs based only on symbols would be unable to extrapolate from one sentence to the other without relying upon explicit mappings such as ‘ibuprofen has_name Advil’, ‘impairs has_synonym damages’, etc. In contrast, the distributed representations of these sentences should, in principle, be nearly identical, facilitating direct extrapolation.
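The contrast between symbolic and distributed representations can be made concrete with the two example sentences. In the sketch below, the three-dimensional word vectors are hand-picked purely for illustration and do not come from any trained model; averaging word vectors is likewise just one simple, assumed way to build a sentence representation.

```python
import math

s1 = "ibuprofen impairs renal function".lower().split()
s2 = "Advil damages the kidneys".lower().split()

# Symbolic view: the two sentences share no tokens at all.
print(set(s1) & set(s2))  # set()

# Expressiveness: n features with k values each can distinguish k**n concepts.
k, n = 2, 10
print(k ** n)  # 1024

# Hypothetical embeddings (hand-set, not learned): axes loosely mean drug, harm, kidney.
emb = {
    "ibuprofen": [0.9, 0.1, 0.0], "advil": [0.8, 0.2, 0.0],
    "impairs": [0.0, 0.9, 0.1], "damages": [0.1, 0.9, 0.0],
    "renal": [0.0, 0.1, 0.9], "kidneys": [0.0, 0.2, 0.8],
    "function": [0.1, 0.1, 0.3], "the": [0.1, 0.1, 0.1],
}

def mean_vector(tokens):
    """Sentence representation: the element-wise mean of its word vectors."""
    vecs = [emb[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Distributed view: the sentence vectors point in nearly the same direction.
print(cosine(mean_vector(s1), mean_vector(s2)))
```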

Over the past decade, neural networks have established themselves as the de facto approach to representation learning for essentially every ML problem in which their training has been shown feasible 21,22. While some neural architectures – e.g. Word2vec 34 – are designed exclusively to produce embeddings that will be utilized in downstream tasks, the primary appeal of neural networks is that every deep learning architecture serves as a representation learning system. More specifically, the activations of each layer of neurons serve as a distributed representation of the input that is progressively refined in a hierarchical manner to produce representations of increased abstraction with increasing depth 35,21. In this light, a typical supervised neural network architecture of depth $k$, for example, can arguably be best understood as a representation learning architecture of depth $k-1$ followed by a simple linear or logistic regression.

Note: While even neural networks with a single hidden layer can provably approximate any continuous function, this guarantee is of limited practical use because the proof places no bound on the number of hidden nodes required 29. Deep neural networks, in contrast, allow for feature re-use that is exponential in the number of layers, which makes deep networks more expressive and more statistically efficient to train.
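The reading of a depth-$k$ network as a depth-$(k-1)$ representation learner followed by logistic regression can be written out directly. The following is a minimal sketch with hand-set, hypothetical weights (nothing is trained); the function names are invented for illustration.

```python
import math

def relu_layer(x, W, b):
    """One hidden layer: affine map followed by a ReLU nonlinearity."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]

def logistic_regression(h, w, b):
    """Plain logistic regression applied to a feature vector h."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * hi for wi, hi in zip(w, h)) + b)))

# Hypothetical parameters for a depth-3 network: two hidden layers plus an output layer.
W1, b1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.1, -0.2]
W2, b2 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, 0.0]
w_out, b_out = [2.0, -1.0], 0.3

def representation(x):
    """Layers 1 to k-1: the representation of x computed by the hidden layers."""
    return relu_layer(relu_layer(x, W1, b1), W2, b2)

def network(x):
    """Full network: logistic regression on top of the hidden representation."""
    return logistic_regression(representation(x), w_out, b_out)

x = [0.4, 0.7]
print(representation(x))  # the embedding that a downstream model could reuse
print(network(x))         # predicted probability from the final logistic layer
```

Because `representation` is a standalone function, its output can be handed to any other downstream model, which is the sense in which every supervised network doubles as a representation learner.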

Representations learned by neural networks have a number of desirable properties. First, neural representations are low-dimensional, distributed, and hierarchically organized, as described above. Second, neural networks have the ability to learn parameterized mappings that are strongly nonlinear but can still be used to directly compute embeddings for new data points. Finally, Yoshua Bengio and others have extensively argued that neural networks have a higher capacity for generalization than other well-established ML methods such as kernels 36,37 and decision trees 38, specifically because they avoid an excessively strong inductive bias towards smoothness; in other words, when making a new prediction for some new data point $x$, deep representation learning methods do not exclusively rely upon the training points that are immediately nearby $x$ in the original feature space.

Representation learning using neural networks also benefits from being modular, and therefore flexible and extensible to design. For example, given two neural architectures that each create a distributed representation of a unique data modality, these can be straightforwardly combined into a single, fused architecture that creates a composite multi-modal representation (e.g. combining audio embeddings and visual embeddings into composite video embeddings 42). Such an approach is leveraged in Chapter 2. Another example of the power afforded by the modularity of neural architectures is Generative Adversarial Networks (GANs) 43, which learn to generate richly structured data by pitting a data-simulating ‘generator’ model against a jointly-trained ‘discriminator’ model that is optimized to distinguish real from generated data. In Supplemental Chapter 1, I demonstrate this approach using a GAN trained to simulate hip radiographs.

Note: The flexibility of neural networks doesn’t come without a price: In addition to obvious concerns about highly parameterized models and overfitting 39, for example, the ease of implementing complicated DL architectures has arguably produced a research culture focused on ever-larger – and more costly 40 – models that are often poorly characterised and very difficult to reproduce 41.

Taken together, neural architectures can be designed to expressively implement a broad array of inductive biases, while still allowing the network parameters to search over millions of compatible functions.

Infusing Domain Knowledge into Neural Representations

Neural networks have largely absolved the contemporary researcher of the need to hand-engineer features, but this reality has not eliminated the role of external knowledge in designing our models and their inductive biases. In this section, I compare and contrast various approaches to explicitly and implicitly infuse domain knowledge into neural representations.

The first paradigm involves the design of layers and architectures that align the representational capacity of the network with our prior knowledge of the problem domain. For instance, if we know that the data we provide have a particular property (e.g. unordered features), we can enforce corresponding constraints in our architecture (e.g. permutation invariance, as in DeepSet 44 or self-attention 45 without position encodings). This is an example of a relational inductive bias 20. Relatedly, we can manually wire the network in a manner that corresponds with our prior understanding of relationships between variables. Peng et al. 46 adopted this approach by building a feedforward neural network for single cell RNA-Seq data in which the input neurons for each gene were wired according to the Gene Ontology 47; this approach strictly reduces the capacity of the network, but may be useful if we have a strong prior that particular relationships would be confounding, for example. An alternative means to a similar end is to perform graph convolutions over edges that reflect domain knowledge 48.
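To illustrate how an architectural constraint can encode the knowledge that inputs are unordered, the sketch below applies a per-element encoder, sum-pools the results, and then applies a readout. This sum-pooling construction is one common way to obtain permutation invariance and is shown as an assumption, not as the exact design of any cited architecture; `phi`, `rho`, and their fixed weights are hypothetical.

```python
from itertools import permutations

def phi(x):
    """Per-element encoder (a hypothetical fixed feature map)."""
    return [x, x * x]

def rho(z):
    """Readout applied to the pooled representation (hypothetical fixed weights)."""
    return 0.5 * z[0] - 0.25 * z[1]

def set_model(xs):
    """Encode each element, sum-pool across elements, then read out."""
    pooled = [sum(col) for col in zip(*(phi(x) for x in xs))]
    return rho(pooled)

def order_sensitive_model(xs):
    """For contrast: a position-weighted sum, which depends on input order."""
    return sum((i + 1) * x for i, x in enumerate(xs))

xs = [1.0, 2.0, 3.0]

# Collect the distinct outputs over all 6 orderings of the same elements.
print({set_model(list(p)) for p in permutations(xs)})              # a single value
print({order_sensitive_model(list(p)) for p in permutations(xs)})  # several values
```

The invariant model cannot represent any order-dependent function at all, which is exactly the sense in which such a design removes capacity that we believe to be irrelevant.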

Another explicit paradigm for infusing knowledge into neural networks is to augment the architecture with the ability to query external information. For example, models can be augmented with knowledge graphs in the form of fact triples, which they can query using an attention mechanism 49,50. More generally, attention can be used to allow modules to incorporate relevant information from embeddings of any knowledge source or data modality. For example, Xu et al. 51 introduced an architecture in which a language model attends to images to generate image captions. Self-attention, or intra-attention, is an attention mechanism that allows for relating different positions within a single sequence 52,45, image 53, or other instance of input data; this allows representations to better share and synthesize information across features.
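A minimal sketch of how a model might query external knowledge through attention follows. It uses scaled dot-product attention for a single query; the key and value vectors standing in for embedded facts are invented for illustration, and real knowledge-augmented models differ in how these embeddings are produced.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The context vector is the attention-weighted average of the value vectors.
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, context

# Hypothetical embeddings of three facts (keys) and their payloads (values).
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [2.0, 0.0]  # most aligned with the first fact

weights, context = attend(query, keys, values)
print(weights)  # non-negative, sums to 1, largest weight on the first fact
print(context)
```

The model is told which facts are available but not which to use: the weights are computed from the query, so relevance is decided per input.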

Transfer learning 54,55 provides a family of methods to infuse knowledge into a learning algorithm that has been gained from a previous learning task. This is related to, but distinct from, multi-task learning, which seeks to learn several tasks simultaneously under the premise that performance and efficiency can be improved by sharing knowledge between the tasks during learning. While there are many forms of transfer learning 56, the canonical form in the setting of deep learning is pretraining. In pretraining, model weights from a trained neural network are used to initialize some subset of the weights in another network; these parameters can then be either frozen or “fine-tuned” with further training on the target task. Initial transfer learning experiments were conducted using unsupervised pretraining with autoencoders before transferring weights to a supervised model for a downstream task; this technique is an example of inductive semi-supervised learning 60.

Note: Autoencoders 57 learn representations guided by the inductive bias that a good representation should be able to be used to reconstruct its raw input. They are an example of an ‘encoder-decoder’ architecture, which consists of an encoder, which takes the raw input and uses a series of layers to embed it into a low-dimensional space, and a decoder, which takes an embedding from the encoder and tries to construct raw data; this combined architecture is then trained in an end-to-end fashion. When the decoder is trained specifically to reconstruct the exact same input passed into the encoder, this is called an autoencoder. (Alternatively, decoders can be trained to produce related data, a prominent example being Seq2seq models that can, for example, encode a sentence from one language and decode it into another 58.) Variational autoencoders 59 combine autoencoding with stochastic variational inference to build generative models that can be used for sampling entirely new data.

In the past decade, supervised pretraining has become very popular, with the quintessential example being the initialization of an image processing architecture with all but the final layer of a model trained on the ImageNet dataset 61. More recently, self-supervised transfer learning has received significant attention, particularly in natural language processing. In self-supervised learning, subsets of the data or features are masked, and neural networks are trained to predict them from the remaining features. The resulting representations can then be used directly for downstream tasks, such as information retrieval, or be leveraged for transfer learning. Word embeddings 33 are arguably the first widespread instance of self-supervised transfer learning, with more recent methods including language model pretraining 62,63,45.
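The masking step of self-supervised learning can be sketched without any model at all: the supervision signal is manufactured from the data itself. The function below is a hypothetical helper written for this illustration (the mask token, mask rate, and seed are arbitrary assumptions), not the procedure of any specific pretraining method.

```python
import random

def make_masked_example(tokens, mask_rate=0.3, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; return the corrupted input and the targets to predict."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets

tokens = "ibuprofen impairs renal function".split()
corrupted, targets = make_masked_example(tokens)
print(corrupted)  # the input a network would see
print(targets)    # position -> original token, which the network is trained to predict
```

No human labels are involved: any unlabelled corpus yields as many (input, target) pairs as there are ways to mask it.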

Contrastive learning methods learn representations by taking in small sets of examples and optimizing embeddings to bring similar data together while driving dissimilar data apart. This is a form of metric learning. Early methods in this field include Siamese networks and triplet networks 64, which were initially developed to learn deep representations of images. Recent analyses suggest that many methods developed in the past several years have failed to advance beyond triplet networks 65. Contrastive methods have been used in the pretraining step of a semi-supervised framework to achieve the current state-of-the-art in limited data image classification 66. In addition, contrastive optimization can be leveraged using multi-modal data to create aligned representations across modalities 67.
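One widely used contrastive objective over triplets is a hinge on distances, $\max(0,\ d(a,p) - d(a,n) + m)$, where $a$ is an anchor, $p$ a similar example, $n$ a dissimilar example, and $m$ a margin. The sketch below assumes Euclidean distance and hand-picked two-dimensional embeddings; it is shown as a representative example and is not necessarily the exact loss used in the works cited above.

```python
import math

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

# Hypothetical embeddings: the positive sits at distance 0.5 from the anchor.
anchor, positive = [0.0, 0.0], [0.3, 0.4]
near_negative = [0.6, 0.8]   # distance 1.0: violates the margin, so it is penalised
far_negative = [3.0, 4.0]    # distance 5.0: already well separated, so no penalty

print(triplet_loss(anchor, positive, near_negative))  # approximately 0.5
print(triplet_loss(anchor, positive, far_negative))   # 0.0
```

Only triplets that violate the margin contribute gradient, so training effort concentrates on the examples the current embedding still confuses.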

The methods described in this section can be arranged along a spectrum. Hand-engineered architectures are based on strong and specific prior assumptions about the problem domain, and are used to fundamentally alter the representational capacity of the network. In contrast, self-supervised and contrastive architectures make very minimal specific assumptions about the problem domain, and do nothing to alter the representational capacity of the algorithm; instead their innovation lies in devising training schemes and loss functions that will guide the network to learn underlying relationships and find a generalizable solution. In between these two extremes, augmenting networks with access to external knowledge through attention mechanisms often makes the assumption that specific knowledge will be helpful, but allows the model to determine for itself which knowledge to employ. Transfer learning makes the assumption that other specific learning tasks will provide useful knowledge and experience for the target domain, but makes minimal assumptions about precisely what this knowledge would be. Despite (arguably significant) philosophical differences, these and yet other paradigms are not mutually exclusive, and share the common goal of improving generalization and data efficiency by introducing richer domain understanding into the neural networks.

[Figure: Example paradigms for infusing domain knowledge into learned representations.]

Finally, while this section – and indeed several chapters of this thesis – focuses on the design of neural architectures and training curricula, the role of domain knowledge is truly inescapable when it comes to the evaluation of deployable systems. Accordingly, the topic of deployment analysis will also be a major theme of this thesis.

The current draft of my full PhD thesis can be found here.

Bibliography

1.Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., and Murphy, K. (2020). Machine Learning on Graphs: A Model and Comprehensive Taxonomy.
2.Empiricus, S., and Bury, R.G. (1933). Outlines of pyrrhonism. Eng. trans, by RG Bury (Cambridge Mass.: Harvard UP, 1976), II 20, 90–1.
3.Perrett, R.W. (1984). The problem of induction in Indian philosophy. Philosophy East and West 34, 161–174.
4.Hume, D. (1739). A treatise of human nature (Oxford University Press).
5.Hume, D. (1748). An Enquiry Concerning Human Understanding (Oxford University Press).
6.Garrett, D., and Millican, P.J.R. (2011). Reason, Induction, and Causation in Hume’s Philosophy (Institute for Advanced Studies in the Humanities, The University of Edinburgh).
7.Goodman, N. (1955). Fact, fiction, and forecast (Harvard University Press).
8.Popper, K. (2014). Conjectures and refutations: The growth of scientific knowledge (Routledge).
9.Hilborn, R., and Mangel, M. (1997). The ecological detective: confronting models with data (Princeton University Press).
10.Mayo, D.G. (1996). Ducks, rabbits, and normal science: Recasting the Kuhn’s-eye view of Popper’s demarcation of science. The British Journal for the Philosophy of Science 47, 271–290.
11.Queen, J.P., Quinn, G.P., and Keough, M.J. (2002). Experimental design and data analysis for biologists (Cambridge University Press).
12.Mitchell, T.M., and others (1997). Machine learning.
13.Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288.
14.Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1929–1958.
15.Prechelt, L. (1998). Early stopping-but when? In Neural Networks: Tricks of the trade (Springer), pp. 55–69.
16.Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine learning 20, 273–297.
17.Girosi, F., Jones, M., and Poggio, T. (1995). Regularization theory and neural networks architectures. Neural computation 7, 219–269.
18.Mitchell, T.M. (1980). The need for biases in learning generalizations (Department of Computer Science, Laboratory for Computer Science Research …).
19.Muggleton, S. (1991). Inductive logic programming. New generation computing 8, 295–318.
20.Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
21.Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 1798–1828.
22.Goodfellow, I., Bengio, Y., and Courville, A. (2016). Representation learning. Deep Learning, 517–548.
23.Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795.
24.Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572.
25.Forgy, E.W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768–769.
26.Jutten, C., and Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal processing 24, 1–10.
27.Tenenbaum, J.B., De Silva, V., and Langford, J.C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323.
28.Roweis, S.T., and Saul, L.K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326.
29.Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2, 303–314.
30.Dahl, G., Ranzato, M.A., Mohamed, A.-rahman, and Hinton, G.E. (2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in neural information processing systems, pp. 469–477.
31.Hinton, G.E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural computation 18, 1527–1554.
32.Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Černockỳ, J. (2011). Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association.
33.Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.
34.Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
35.Håstad, J., and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity 1, 113–129.
36.Bengio, Y., and Monperrus, M. (2005). Non-local manifold tangent learning. In Advances in Neural Information Processing Systems, pp. 129–136.
37.Bengio, Y., Delalleau, O., and Roux, N.L. (2006). The curse of highly variable functions for local kernel machines. In Advances in neural information processing systems, pp. 107–114.
38.Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence 26, 449–467.
39.Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical learning (Springer series in statistics New York).
40.Lacoste, A., Luccioni, A., Schmidt, V., and Dandres, T. (2019). Quantifying the Carbon Emissions of Machine Learning. arXiv preprint arXiv:1910.09700.
41.Lipton, Z.C., and Steinhardt, J. (2018). Troubling Trends in Machine Learning Scholarship.
42.Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A.Y. (2011). Multimodal deep learning.
43.Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
44.Zhang, Y., Hare, J., and Prugel-Bennett, A. (2019). Deep set prediction networks. In Advances in Neural Information Processing Systems, pp. 3207–3217.
45.Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
46.Peng, J., Wang, X., and Shang, X. (2019). Combining gene ontology with deep neural networks to enhance the clustering of single cell RNA-Seq data. BMC bioinformatics 20, 284.
47.Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene ontology: tool for the unification of biology. Nature genetics 25, 25–29.
48.McDermott, M., Wang, J., Zhao, W.N., Sheridan, S.D., Szolovits, P., Kohane, I., Haggarty, S.J., and Perlis, R.H. (2019). Deep Learning Benchmarks on L1000 Gene Expression Data. IEEE/ACM transactions on computational biology and bioinformatics.
49.Annervaz, K.M., Chowdhury, S.B.R., and Dukkipati, A. (2018). Learning beyond datasets: Knowledge graph augmented neural networks for natural language processing. arXiv preprint arXiv:1802.05930.
50.Kishimoto, Y., Murawaki, Y., and Kurohashi, S. (2018). A knowledge-augmented neural network model for implicit discourse relation classification. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 584–595.
51.Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057.
52.Cheng, J., Dong, L., and Lapata, M. (2016). Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
53.Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. (2018). Image transformer. arXiv preprint arXiv:1802.05751.
54.Yang, Q., and Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5, 597–604.
55.Pan, S.J., and Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 1345–1359.
56.Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q. (2019). A Comprehensive Survey on Transfer Learning. arXiv preprint arXiv:1911.02685.
57.Hinton, G.E., and Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. science 313, 504–507.
58.Sutskever, I., Vinyals, O., and Le, Q.V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112.
59.Kingma, D.P., and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
60.Van Engelen, J.E., and Hoos, H.H. (2020). A survey on semi-supervised learning. Machine Learning 109, 373–440.
61.Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (Ieee), pp. 248–255.
62.Howard, J., and Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
63.Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
64.Hoffer, E., and Ailon, N. (2015). Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition (Springer), pp. 84–92.
65.Musgrave, K., Belongie, S., and Lim, S.-N. (2020). A Metric Learning Reality Check. arXiv preprint arXiv:2003.08505.
66.Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
67.Deng, C., Chen, Z., Liu, X., Gao, X., and Tao, D. (2018). Triplet-based deep hashing network for cross-modal retrieval. IEEE Transactions on Image Processing 27, 3893–3903.

Comments on ML "versus" statistics

Fri, 31 Jan 2020 05:00:00 -0500

Why am I writing this?

Over the last few years, I’ve observed many vigorous debates about “machine learning versus statistics.” Often, these are sparked by some paper/blog post/press release that either (a) involves some use of logistic regression (or some other type of GLM) being described as machine learning, or (b) performs a meta-analysis attempting to pit the fields against each other.

I have allowed myself to get pulled down this rabbit hole far too many times, wasting hours of my time in fruitless debate. As such, I have decided to write this post as a way to inoculate myself against the urge to enter future discussions. The first two sections explain why I consider most ML “versus” Stats debates to be fundamentally flawed, even in their very premise. The following two sections explain why I do understand where people are coming from in having these debates, but still think they (the debates!) are a colossal waste of time.

As time goes on, I might even write a bot to post this on relevant Twitter threads. If I do, I will intentionally code this bot using logistic regression and call it machine learning, just to maximize peskiness.

Outline

-Neglected historical context: The term "machine learning" was not coined to contrast with statistics, but to contrast the field with competing paradigms for building intelligent computer systems.

-Arguments about who "owns" regression miss the point.

-Distinctions in goals have yielded a divergence in methods and cultures, which explains shifting connotations of the term "machine learning."

-Isn't this whole "debate" a massive waste of time?

Neglected historical context: The term “machine learning” was not coined to contrast with statistics, but to contrast the field with competing paradigms for building intelligent computer systems.

Before getting to Machine Learning (ML), a couple of paragraphs on Artificial Intelligence (AI). These days, many people – including me – reflexively wince when they hear the term “AI,” because it is (a) used by slimy buzzword peddlers to such an extent that it is now nearly synonymous with “snake oil,” (b) overloaded with connotations of sentient killer robots, and (c) almost exclusively used to refer to machine learning, anyway. This is all quite unfortunate. However, try to set that aside for just a moment.

Engineers have dreamed of building something “smart” for thousands of years, but the term “artificial intelligence” itself was coined by John McCarthy in preparation for the famous “Dartmouth Conference” of 1956. McCarthy defined artificial intelligence as “the science and engineering of making intelligent machines,” and that’s not too bad for a pithy one-liner. Importantly for this discussion, McCarthy was able to convince his colleagues to adopt this term at Dartmouth in large part because it was vague. At that point in time, computer scientists who were trying to crack intelligence were focused not on data-driven methods, but on things like automata theory, formal logic, and cybernetics. McCarthy wanted to create a term that would capture all of these paradigms (and other ones yet to come) rather than favoring any specific approach.

It was with this context that Arthur Samuel (one of the attendees at the Dartmouth Conference) coined the term “Machine Learning” in 1959, which he defined as:

Field of study that gives computers the ability to learn without being explicitly programmed.

Samuel and his colleagues wanted to help computers become “smart” by equipping them with the capacity to recognize patterns and iteratively improve over time. While this may seem like an obvious approach today, it took decades before this became the dominant mode of AI research (as opposed to, say, building systems that exhibit “intelligence” by applying propositional logic over curated knowledge graphs).

In other words, machine learning was coined to describe a design process for computers that leverages statistical methods to improve performance over time. The term was created by computer scientists, for computer scientists, and designed as a contrast with non-data-driven approaches to building smart machines. It was not designed to be a contrast with statistics, which is focused on using (often overlapping) data-driven methods to inform humans.

Another extremely widely-referenced definition of ML comes from Tom M. Mitchell’s 1997 textbook, which said:

The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience,

and offered the accompanying semi-formal definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

This is all very much in accordance with Arthur Samuel’s definition, and I could pull other more recent definitions with similar verbiage. Another passage from Mitchell that I think gets less circulation than it deserves, however, is the following (taken with a little reformatting from a 2006 article called “The Discipline of Machine Learning”):

Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics. We might say the defining question of Computer Science is “How can we build machines that solve problems, and which problems are inherently tractable/intractable?” The question that largely defines Statistics is “What can be inferred from data plus a set of modeling assumptions, with what reliability?”

The defining question for Machine Learning builds on both, but it is a distinct question. Whereas Computer Science has focused primarily on how to manually program computers, Machine Learning focuses on the question of how to get computers to program themselves (from experience plus some initial structure). Whereas Statistics has focused primarily on what conclusions can be inferred from data, Machine Learning incorporates additional questions about what computational architectures and algorithms can be used to most effectively capture, store, index, retrieve and merge these data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability.

Arguments about who “owns” regression miss the point

Given the history just described, I must admit that I’ve been frustrated at times by the authoritarian tone with which many have tried to enforce a false dichotomization between statistics and ML methods. In particular, there appears to be a strange fixation by specific online personalities on insisting that regression-powered applications must not be described as ML. In reading some of these discussions, one might even come away thinking that there is a conspiracy at play to annex regression away from statistics.

To me, this is only slightly less silly than trying to stir up a turf war to target anyone who uses logistic regression and calls it “econometrics.” For at least 60 years, “machine learning” has been about building the best learning computers that we can, not some weird methods competition with statistics. This is why, when it comes to teaching “ML methods,” almost every introductory machine learning textbook or course that I’ve ever seen – Mitchell, Murphy, Bishop, Ng, etc. – spends much of its effort on teaching GLMs and their variants. It’s also why it’s perfectly sensible for specialized textbooks to include plots like this one, which routinely makes the rounds to much pearl-clutching. Would I expect such a plot to be useful to statisticians? Of course not! But it makes sense in the context of the ML/AI fields, which are concerned with different ways to make programs act “smart”. And to put it bluntly for any [c]rank-y colleagues in statistics: You don’t get to decide which taxonomies another field finds useful in framing its own problems and history.

The great irony with the whole recurring snafu around who “owns” regression – and all of its variants – is that it simultaneously undersells both machine learning and statistics, for many reasons that include the following four:

First, it minimizes – or even defines away – the core role that classic statistical methods continue to play in efforts to build computer programs that learn.
Second, it ignores the impact that ML has had on statistics, when in reality AI and CS have been a massive boon to statistics research. This includes the generation of new statistical paradigms (e.g. Judea Pearl and others’ work on causality, now one of many booming subareas of stats that came from ML) and a wide array of algorithmic and computational tools that have enabled the rise of statistical computing.
Third, a false dichotomy between ML and stats minimizes the wide variation within each purported class, variation that is critical to modeling decisions. It’s also silly. For example, take a simple logistic regression model implemented in PyTorch. Now consider (a) adding 1000 polynomial interaction terms between all the features and a ridge penalty, and (b) adding a second fully connected layer. The dichotomous approach to ML vs stats would say that (b) fundamentally alters the entire class of model from stats to ML, and minimizes the arguably larger impact on modeling induced by (a). By the same token, I’m always shocked when I see meta-analyses that claim they’re going to compare the accuracy of ML “versus” statistics, and then use embarrassingly horrible and out-of-date ML models (like a single decision tree) or bad statistical practices. In short, the spectrum is so broad (and the execution so essential) within each of these purportedly distinct methodological camps that most statements about the whole collections are minimally helpful. It’s all about picking the right tool for a specific job, and then taking the time and effort to use that tool properly!
Fourth, the above dispute also ignores the fact that many top researchers, publication venues, and papers in stats or ML are fully-fledged citizens of both communities.
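To make the logistic regression example from the third point concrete, here is a minimal sketch of the three models side by side. It is written in plain Python rather than PyTorch so that it runs without dependencies, and it is not fitted to any data. The function names, the toy weights, the `tanh` hidden units, and the use of pairwise products as the "interaction terms" are all assumptions made for illustration.

```python
import math
from itertools import combinations


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_regression(x, w, b):
    """Plain logistic regression: sigmoid of a linear function of the features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def with_interactions(x):
    """Variant (a): append every pairwise product x_i * x_j to the feature vector."""
    return list(x) + [xi * xj for xi, xj in combinations(x, 2)]


def ridge_penalty(w, lam):
    """Variant (a), continued: the L2 term that would be added to the training loss."""
    return lam * sum(wi * wi for wi in w)


def two_layer_network(x, W1, b1, w2, b2):
    """Variant (b): one hidden layer feeding the very same sigmoid output unit."""
    hidden = [
        math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
        for row, bi in zip(W1, b1)
    ]
    return sigmoid(sum(wi * hi for wi, hi in zip(w2, hidden)) + b2)


# Toy inputs and weights (illustrative values only, not fitted to anything).
x = [0.5, -1.0, 2.0]
p_plain = logistic_regression(x, w=[0.1, 0.2, 0.3], b=0.0)

x_a = with_interactions(x)  # 3 original features + 3 pairwise products
p_a = logistic_regression(x_a, w=[0.1] * len(x_a), b=0.0)

p_b = two_layer_network(
    x, W1=[[0.1, 0.2, 0.3], [0.3, -0.2, 0.1]], b1=[0.0, 0.0], w2=[0.5, -0.5], b2=0.0
)
```

All three models end in the same sigmoid and would be trained on the same log loss. Variant (a) changes the feature space and the penalty, while variant (b) changes how the features are composed. Neither is obviously the point where "statistics" stops and "ML" starts.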

In my opinion, the careers of Trevor Hastie and Rob Tibshirani highlight the best of what happens when statisticians interact richly with machine learning researchers. Rather than getting caught up in drawing methodological border lines, they have taken tools developed first by machine learning researchers and helped formally situate them within the world of statistics proper. In this light, I enjoy their frequent use of the term “statistical learning” (as in the title of their textbook), which I think nicely emphasizes the fact that their goals are those of statistics, even if many of the methods in the book have been developed by and for people in ML. (I’ll also point out, a bit immaturely, that I’ve never heard a machine learning researcher complain that Hastie and Tibshirani are trying to annex their methods by not using the phrase “machine learning” when describing neural networks, tree-based methods, etc.) Of course, all of the above is to say nothing of the new methods Hastie and Tibs have generated themselves, which have impacted the daily work of both statisticians and machine learning researchers.

All the above being said, I do appreciate that perfectly reasonable people have come to think of ML as a disjoint set of methods from statistics. The following sections elaborate on why I think this has happened, and what I think this means as a takeaway for the overall discussion.

Distinctions in goals have yielded a divergence in methods and cultures, which explains shifting connotations of the term “machine learning.” Disconnects in language doom many “debates” to futility before they begin.

As stated above, the field of machine learning research was founded as computer scientists sought to build and understand intelligent computer systems, and this continues to be the case today. Major ML applications include things like speech recognition, computer vision, robotics/autonomous systems, computational advertising (sigh…), surveillance (sigh…), chat-bots (sigh…), etc. In seeking to solve these problems, machine learning researchers will almost always start by first trying classical statistical methods, including the relevant simple GLM (in fact, this is often considered a mandatory baseline for publication in many applied ML areas). Hence my whole discussion about ML not being predicated on a specific method. However, computer scientists have, of course, also significantly added to this toolkit over the years through the development of additional methods.

As with evolution in any other context, the growing phylogeny of statistical methods used for machine learning has been shaped by selective pressures. Compared to statisticians, machine learning researchers typically care much less about understanding any specific action taken by their algorithms (though it is certainly important, and increasingly a bigger priority). Rather, they usually care most about minimizing model errors on held-out data. As such, it makes sense that methods developed by ML researchers are typically more flexible even at the expense of interpretability, for example. Leo Breiman and others have written about how these cultures have informed methods development, such as random forests. This often-divergent evolution has made it easy to draw (fuzzy) boundaries between ML and statistics research based entirely on methods. To boot, many statisticians are unaware of the history of ML, and have thus, for years, only ever been exposed to the field by means of the methods it periodically emits. It is thus unsurprising that they would be uninterested in defining the field in any other terms, even if it is disappointing.

By the same token, a sharp division based on use (like I advocated for above) is now complicated by the fact that many ML people say they’re doing machine learning even when they’re applying their methods for pure data analysis rather than to drive a computer program. While arguably incorrect in a strict historical sense, I don’t fault people for doing this – probably out of a mixture of habit, cultural affiliation, and/or because it sounds cool.

Taken together, people now use “machine learning” to mean very different things. Sometimes, people use it to mean: “I’m using a statistical method to make my program learn” or “I’m developing a data analysis that I hope to deploy in an automated system.” Other times, they mean: “I’m using a method – perhaps for a statistical data analysis – that was originally developed by the machine learning community, like random forests.” Still other times (maybe most of the time…?), they mean: “I consider myself a machine learning researcher, I’m working with data, and I can call this work whatever I darn well please.”

These different uses of the term aren’t really surprising or problematic, because this is simply how language evolves. But it does make it extremely frustrating when a horde of data scientists (oh no, another hypey term! I use it here as the union of ML and statistics) collectively try to debate whether or not a specific project can be branded as ML or must be branded “just statistics.” Usually, when this happens, people enter the discussion with wildly different assumptions – poorly defined, and seldom articulated – about what the words mean in the first place. And then they rarely take the time to understand where others are coming from or what they are actually trying to say. Instead, they typically just talk past each other, louder instead of clearer.

Isn’t this whole “debate” a massive waste of time?

Finally, let’s lay our cards out on the table w.r.t. a few real problems: There are many machine learning researchers (or at the very least, machine learning hobbyists), who exhibit an inadequate understanding of statistics for people who work with data for a living. At times, I am such a machine learning researcher! (Though I’d wager that many professional statisticians sometimes feel the same way, too.) Relatedly, but more seriously, ML research moves so fast, and is sometimes so culturally disconnected from the field of statistics, that I think that it is all-too-common for even prominent ML researchers to re-discover or re-invent parts of statistics. That’s a problem and a waste. Finally, there is a massive brand dilution in ML, because a large third-party population of applied researchers have essentially co-opted the term “machine learning,” applying it to papers just to make them sound fancy, even when in reality they are doing machine learning neither in the sense of automated system building nor in the sense of using new methods that came from ML.

I feel that the solution to all of these problems is to increase recognition that most of ML’s data methods actually live within statistics. Rather than doubling down on a false partition between the two fields, our priority needs to be the cultivation of a robust understanding of statistical principles, whether they are being used for data analysis or for programming intelligent systems. Endless debates about what to call a lot of this work end up distracting people from essential conversations about how to carry out good work by matching the right specific tool to the right problem. If anything, a fixation on a false dichotomy between stats and ML methods probably drives many people further into the habit of using unnecessarily complex methods, just to feel (whether for pride or for money) like they are doing “real ML” (whatever on earth that means). It also directly feeds the issue that causes people to call their work ML just for the sake of sounding methodologically fancy.

Finally, this golden age of statistical computing is driving these two fields closer than ever. ML research, of course, lives within computer science, and the modern statistician is increasingly dependent upon the algorithms and software stack that have been pioneered by CS departments for decades. Modern statisticians – especially in fields like computational biology – are also increasingly finding use for methods pioneered by ML researchers for, say, regression in high dimensions or at large scale. On the flip side, the ML community is becoming increasingly concerned with topics like interpretability, fairness, certifiable robustness, etc., which is leading many researchers’ priorities to align more directly with the traditional values of statistics. At the very least, even when a system is deployed using the most convoluted architectures possible, it’s pretty universally recognized that classical statistics is necessary to measure and evaluate performance.

In summary:

The whole debate is misguided, the terms are overloaded, the methodological dichotomy is false, ML people care (and increasingly so) about statistics, stats people are increasingly dependent upon CS and ML, and there is no regression annexation conspiracy. There’s a lot of hype out there right now, but that doesn’t change the fact that, often, when people use different terminology than you, that’s because they come from a different background or have different goals in mind, not because they are stupid or dishonest. Let’s just all be friends and strive to do good work together and learn from each other. Kumbaya.

All the DAGs from Hernan and Robins' Causal Inference Book

Wed, 19 Jun 2019 07:50:00 -0400

This is my preliminary attempt to organize and present all the DAGs from Miguel Hernán and Jamie Robins’s excellent Causal Inference Book. So far, I’ve only done Part I.

I love the Causal Inference book, but sometimes I find it easy to lose track of the variables when I read it. Having the variables right alongside the DAG makes it easier for me to remember what’s going on, especially when the book refers back to a DAG from a previous chapter and I don’t want to dig back through the text. Plus, making this was a great exercise!

Again, this page is meant to be fairly raw and only contain the DAGs. If you use it, you might also find it useful to open up this page, which is where I have more traditional notes covering the main concepts from the book. But of course, the text itself has no substitute.

Table of Contents:

Refresher: Visual rules of d-separation.
Refresher: Backdoor criterion
Basics of Causal Diagrams (6.1-6.5)
Effect Modification (6.6)
Confounding (Chapter 7)
Selection Bias (Chapter 8)
Measurement Bias (Chapter 9)

Refresher: Visual rules of d-separation.

Two variables on a DAG are d-separated if all paths between them are blocked. The following four rules define what it means to be “blocked.”

(This is just meant to be a refresher – see the second half of this post or Fine Point 6.1 of the text for more definitions.)

Rule	Example
1. If there are no variables being conditioned on, a path is blocked if and only if two arrowheads on the path collide at some variable on the path.	$L \rightarrow A \rightarrow Y$ is open. $A \rightarrow Y \leftarrow L$ is blocked at $Y$
2. Any path that contains a noncollider that has been conditioned on is blocked.	Conditioning on $B$ blocks the path from $A$ to $Y$.
3. A collider that has been conditioned on does not block a path	The path between $A$ and $Y$ is open after conditioning on $L$.
4. A collider that has a descendant that has been conditioned on does not block a path.	The path between $A$ and $Y$ is open after conditioning on $C$, a descendant of collider $L$.
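The four rules above can be turned into a small checker. The sketch below is only an illustration: the graph encoding (a dict mapping each parent to its children) and the function names are my own assumptions, not anything from the book. It enumerates every path between two nodes, so it is only suitable for small diagrams like the ones on this page.

```python
def descendants(dag, node):
    """All nodes reachable from `node` by following arrows (dag maps parent -> children)."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def simple_paths(dag, start, end):
    """Every path from start to end that ignores arrow direction and never revisits a node."""
    neighbors = {}
    for parent, children in dag.items():
        for child in children:
            neighbors.setdefault(parent, set()).add(child)
            neighbors.setdefault(child, set()).add(parent)

    def walk(path):
        if path[-1] == end:
            yield path
            return
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])


def path_is_blocked(dag, path, conditioned):
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        is_collider = node in dag.get(prev, ()) and node in dag.get(nxt, ())
        if is_collider:
            # Rules 1, 3 and 4: a collider blocks the path unless the collider
            # itself, or one of its descendants, has been conditioned on.
            if node not in conditioned and not (descendants(dag, node) & conditioned):
                return True
        elif node in conditioned:
            # Rule 2: a conditioned-on noncollider blocks the path.
            return True
    return False


def d_separated(dag, x, y, conditioned=()):
    conditioned = set(conditioned)
    return all(path_is_blocked(dag, p, conditioned) for p in simple_paths(dag, x, y))


# The examples from the table, one per rule.
chain = {"L": ["A"], "A": ["Y"]}                            # L -> A -> Y
collider = {"A": ["Y"], "L": ["Y"]}                         # A -> Y <- L
mediator = {"A": ["B"], "B": ["Y"]}                         # A -> B -> Y
collider_l = {"A": ["L"], "Y": ["L"]}                       # A -> L <- Y
collider_with_child = {"A": ["L"], "Y": ["L"], "L": ["C"]}  # A -> L <- Y, L -> C

rule1_open = not d_separated(chain, "L", "Y")
rule1_blocked = d_separated(collider, "A", "L")
rule2_blocked = d_separated(mediator, "A", "Y", {"B"})
rule3_open = not d_separated(collider_l, "A", "Y", {"L"})
rule4_open = not d_separated(collider_with_child, "A", "Y", {"C"})
```

Each of the five flags at the bottom evaluates to `True`, matching the "open" or "blocked" verdict in the corresponding table row.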

Refresher: Backdoor criterion

Assuming positivity and consistency, confounding can be eliminated and causal effects are identifiable in the following two settings:

Rule	Example
1. No common causes of treatment and outcome.	There are no common causes of treatment and outcome. Hence no backdoor paths need to be blocked. No confounding; equivalent to a marginally randomized trial.
2. Common causes are present, but there are enough measured variables to block all backdoor paths. (i.e. No unmeasured confounding.)	Backdoor path through the common cause $L$ can be blocked by conditioning on measured covariates (in this case, $L$ itself) that are non-descendants of treatment. There will be no residual confounding after controlling for $L$; equivalent to a conditionally randomized trial.
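As a worked illustration of the second setting, the sketch below builds a small joint distribution in which a binary $L$ causes both $A$ and $Y$, while $A$ has no effect on $Y$. All probabilities are made-up numbers and the function names are hypothetical. It compares the crude risk difference with the one obtained by standardizing over $L$, i.e. $\sum_l \Pr[Y=1 \mid A=a, L=l]\Pr[L=l]$.

```python
from itertools import product

# Made-up probabilities for a DAG with L -> A and L -> Y, and no arrow A -> Y.
P_L = {0: 0.7, 1: 0.3}
P_A1_GIVEN_L = {0: 0.1, 1: 0.8}
P_Y1_GIVEN_L = {0: 0.05, 1: 0.30}  # Y depends on L only: A has no causal effect


def joint():
    """Joint probability table over (l, a, y) implied by the DAG."""
    table = {}
    for l, a, y in product((0, 1), repeat=3):
        p_a = P_A1_GIVEN_L[l] if a else 1 - P_A1_GIVEN_L[l]
        p_y = P_Y1_GIVEN_L[l] if y else 1 - P_Y1_GIVEN_L[l]
        table[(l, a, y)] = P_L[l] * p_a * p_y
    return table


def prob(table, **fixed):
    """Probability that the named variables (l, a, y) take the given values."""
    names = ("l", "a", "y")
    return sum(
        p
        for key, p in table.items()
        if all(key[names.index(k)] == v for k, v in fixed.items())
    )


def risk(table, a, l=None):
    """Pr[Y=1 | A=a], or Pr[Y=1 | A=a, L=l] when l is given."""
    if l is None:
        return prob(table, a=a, y=1) / prob(table, a=a)
    return prob(table, l=l, a=a, y=1) / prob(table, l=l, a=a)


def standardized_risk(table, a):
    """Sum over l of Pr[Y=1 | A=a, L=l] * Pr[L=l]."""
    return sum(risk(table, a, l) * prob(table, l=l) for l in (0, 1))


table = joint()
crude_rd = risk(table, 1) - risk(table, 0)
adjusted_rd = standardized_risk(table, 1) - standardized_risk(table, 0)
```

With these numbers the crude risk difference is roughly 0.17 even though the treatment does nothing. The standardized difference is zero up to floating-point error, which is what blocking the backdoor path through $L$ should deliver.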

And now we can finally:

Basics of Causal Diagrams (6.1-6.5)

DAG	Example	Notes	Page
	Marginally randomized experiment A: Treatment Y: Outcome	Arrow doesn’t specifically imply protection vs risk, just causal effect. Unconditional exchangeability assumption means that association implies causation and vice versa.	I.70
	Conditionally randomized experiment L: Stratification Variable A: Treatment Y: Outcome	Also equivalent to an Observational Study that assumes A depends on L and on no other causes of Y (else they’d need to be added). Implies conditional exchangeability.	I.69-I.70
	A: Aspirin B: Platelet aggregation Y: Heart Disease	$B$ is a mediator of $A$’s effect on $Y$, but conditioning on $B$ (e.g. by restricting the analysis to people with a specific lab value) blocks the flow of association through the path A $\rightarrow$ B $\rightarrow$ Y. Even though $A$ and $Y$ are marginally associated, they are conditionally independent given $B$. In other words, A $\unicode{x2AEB}$ Y \| B. Thus, knowing aspirin status gives you no more information once platelets are measured, at least according to this graph.	I.73
	L: Smoking status A: Carrying a lighter Y: Lung cancer	Graph says that carrying a lighter (A) has no causal effect on outcome (Y). Math form of this assumption is: $\Pr[Y^{a=1}=1]=\Pr[Y^{a=0}=1]$. However, $A$ will be spuriously associated with $Y$, because path A $\leftarrow$ L $\rightarrow$ Y is open to flow from A to Y: they share a common cause.	I.72
	L: Smoking status A: Carrying a lighter Y: Lung cancer	A $\unicode{x2AEB}$ Y \| L, because the path A $\leftarrow$ L $\rightarrow$ Y is closed by conditioning on L. Thus, restricting the analysis to either smokers or non-smokers (box around L) means that lighter carrying will no longer be associated with lung cancer.	I.74
	A: Genetic predisposition for heart disease Y: Smoking status L: Heart disease	$A$ and $Y$ are not marginally associated, because they share no common causes. (i.e. Genetic risk for heart disease says nothing, in a vacuum, about smoking status.) $L$ here is a collider on the path A $\rightarrow$ L $\leftarrow$ Y, because the two arrows collide on this node. But there is no causal path from $A$ to $Y$.	I.73
	A: Genetic predisposition for heart disease Y: Smoking status L: Heart disease	Conditioning on the collider $L$ opens the path A $\rightarrow$ L $\leftarrow$ Y. Put another way, two causes of a given effect generally become associated once we stratify on the common effect. In the example, knowing someone with heart disease lacks haplotype A makes it more likely that the individual is a smoker, because, in the absence of $A$, it is more likely that some other cause of $L$ is present. Or, conversely, the population of non-smokers with heart disease will be enriched for people with haplotype A. Thus, if one restricts the analysis to people with heart disease, one will find a spurious anti-correlation between the haplotype predictive of heart disease and smoking status.	I.74
	A: Genetic predisposition for heart disease Y: Smoking status L: Heart disease C: Diuretic medication (given after heart disease diagnosis)	Conditioning on variable $C$ downstream from collider $L$ also opens up the path A $\rightarrow$ L $\leftarrow$ Y. Thus, in the example, stratifying on $C$ (diuretic status) will induce a spurious relationship between $A$ (genetic heart disease risk) and $Y$ (smoking status).	I.75
Before matching: After matching:	Matched analysis L: Critical Condition A: Heart Transplant Y: Death S: Selection for inclusion via matching criteria	In this study design, the average causal effect of $A$ on $Y$ is computed after matching on $L$. Before matching, $L$ and $A$ are associated via the path $L \rightarrow A$. Matching is represented in the DAG through the addition of $S$, the selection criteria. The study is obviously restricted to patients that are selected ($S$=1), hence we condition on $S$. d-separation rules say that there are now two open paths between $A$ and $L$ after conditioning on $S$: $L \rightarrow A$ and $L \rightarrow S \leftarrow A$. This seems to indicate an association between $L$ and $A$. However, the point of matching is supposed to be to make sure that $L$ and $A$ are not associated! The resolution comes from the observation that $S$ has been constructed specifically to induce the distribution of $L$ to be the same in the treated ($A$=1) and untreated ($A$=0) population. This means that the association in $L \rightarrow S \leftarrow A$ is of equal magnitude but opposite direction to that of $L \rightarrow A$. Thus there is no net association between $A$ and $L$. This disconnect between the associations visible in the DAG and the associations actually present is an example of unfaithfulness, but here it has been introduced by design.	I.49 and I.79
	R: Compound treatment (see right) A: Vector of treatment versions $A( r )$ (see right) Y: Outcome L and W: unnamed causes U: unmeasured variables	This is the example the book uses of how to encode compound treatments. The example compound treatment is as follows: R=0 corresponds to “exercising less than 30 minutes daily”. R=1 corresponds to “exercising more than 30 minutes daily.” $A$ is a vector corresponding to different versions of the treatment, where $A(r=0)$ can take on values $0,1,2,\dots, 29$ and $A(r=1)$ can take on values $30,31,\dots, \max$. Taken together, we can have a mapping from multiple values $A( r )$ onto a single value $R=r$.	I.78
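The collider rows in the table above (haplotype $A$, smoking $Y$, heart disease $L$) can be checked with a few lines of exact arithmetic. This is a sketch with made-up probabilities and a hypothetical function name. $A$ and $Y$ are independent by construction, so the haplotype prevalence cancels out of every quantity below and is omitted.

```python
# Made-up numbers for the DAG A -> L <- Y, where A and Y are independent causes of L.
P_Y1 = 0.3  # Pr[smoker]
P_L1_GIVEN_AY = {  # Pr[heart disease | haplotype a, smoking y], keyed by (a, y)
    (0, 0): 0.05,
    (1, 0): 0.30,
    (0, 1): 0.30,
    (1, 1): 0.50,
}


def p_smoker(a, l=None):
    """Pr[Y=1 | A=a], or Pr[Y=1 | A=a, L=l] when l is given."""

    def weight(y):
        # Proportional to Pr[Y=y, L=l | A=a]; the L factor is dropped when l is None.
        p_y = P_Y1 if y else 1 - P_Y1
        if l is None:
            return p_y
        p_l1 = P_L1_GIVEN_AY[(a, y)]
        return p_y * (p_l1 if l else 1 - p_l1)

    return weight(1) / (weight(1) + weight(0))


# Marginally, the haplotype carries no information about smoking.
marginal_gap = p_smoker(1) - p_smoker(0)

# Among people with heart disease (L=1), it does.
smoker_given_no_haplotype = p_smoker(0, l=1)
smoker_given_haplotype = p_smoker(1, l=1)
```

Here `marginal_gap` is exactly zero. Restricted to people with heart disease, the probability of being a smoker is 0.72 for those without the haplotype and about 0.42 for those with it, which is the spurious anti-correlation described in the notes column.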

Effect Modification (6.6)

Example	Notes	Page
A: Heart Transplant Y: Outcome M: Quality of Care. High ($M=1$) vs Low ($M=0$)	This DAG reflects the assumption that quality of care influences quality of transplant procedure and thus of outcomes, BUT still assumes random assignment of treatment. Given random assignment, $M$ is not strictly necessary but added if you want to use it to stratify. Causal diagram as such does not distinguish between: 1. Causal effect of treatment $A$ on mortality $Y$ is in the same direction in both strata $M=1$ and $M=0$. 2. The causal effect of $A$ on $Y$ is in the opposite direction in $M=1$ vs $M=0$. 3. Treatment $A$ has a causal effect on $Y$ in one stratum of $M$ but no effect in the other stratum.	I.80
A: Heart Transplant Y: Outcome M: Quality of Care. High ($M=1$) vs Low ($M=0$) N: Therapy Complications	Same example as above, except assumes that other variables along the path of a modifier can also influence outcomes.	I.80
A: Heart Transplant Y: Outcome M: Quality of Care. High ($M=1$) vs Low ($M=0$) S: Cost of treatment	Same example as above, except assumes that the quality of care affects the cost, but that the cost does not influence the outcome. This is an example of an effect modifier that does not have a causal effect on the outcome, but rather stands in as a surrogate effect modifier. An analysis stratifying on $S$ – which is available/objective – might be used to detect effect modification that actually comes from $M$, which is harder to measure.	I.80
A: Heart Transplant Y: Outcome M: Quality of Care. High ($M=1$) vs Low ($M=0$) U: Place of residence P: Passport-defined nationality	Example where the surrogate effect modifier (passport) is not driven by the causal effect modifier (quality of care), but rather both are driven by a common cause (place of residence).	I.80
A: Heart Transplant Y: Outcome M: Quality of Care. High ($M=1$) vs Low ($M=0$) S: Cost of Care W: Use of mineral water vs tap	Example where the surrogate effect modifier (cost) is influenced by both the causal effect modifier (quality) and something spurious. If the study were restricted to low-cost hospitals by conditioning on $S=0$, then use of mineral water would become associated with medical care $M$ and would behave as a surrogate effect modifier. Addendum: How? One example might be that conditioned on a low cost, a zero sum situation may arise in which spending more on fancy water means less is being spent on quality care, which could yield an inverse correlation between mineral water and medical quality.	I.81

Confounding (Chapter 7)

DAG	Example	Notes	Page
	L: Being physically fit A: Working as a firefighter Y: Mortality	The path $A \rightarrow Y$ is a causal path from $A$ to $Y$. $A \leftarrow L \rightarrow Y$ is a backdoor path between $A$ and $Y$, mediated by common cause (confounder) $L$. Conditioning on $L$ will block the backdoor path, induce conditional exchangeability, and allow for causal inference. Note: This is an example of “healthy worker bias.”	I.83
	A: Aspirin Y: Stroke L: Heart Disease U: Atherosclerosis (unmeasured)	This DAG is an example of confounding by indication (or channeling). Aspirin will have a confounded association with stroke, both from heart disease ($L \rightarrow A \rightarrow Y$), and from atherosclerosis ($U \rightarrow L \rightarrow A \rightarrow Y$). Conditioning on unmeasured $U$ is impossible, but there is no unmeasured confounding given $L$, so conditioning on $L$ is sufficient.	I.84
	A: Exercise Y: Death L: Smoking status U: Social Factors (unmeasured) or Subclinical Disease (undetected)	Conditioning on $L$ is again sufficient to block the backdoor path in this case.	I.84
	A: Physical activity Y: Cervical Cancer L: Pap smear U_1: Pre-cancer lesion (unmeasured here) U_2: Health-conscious personality (unmeasured)	Example shows how conditioning on a collider can induce bias. Adjustment for $L$ (e.g. by restricting to negative tests $L=0$) will induce bias by opening a backdoor path between $A$ and $Y$ ($A \leftarrow U_2 \rightarrow L \leftarrow U_1 \rightarrow Y$), previously blocked by the collider. This is a case of selection bias. Thus, after conditioning, association between $A$ and $Y$ would be a mixture of association due to effect of $A$ on $Y$ and backdoor path. In other words, there is no unconditional bias, but there would be a conditional bias for at least one stratum of $L$.	I.88
	(Labels not in book) A: Antacid L: Heartburn Y: Heart attack U: Obesity	A nonconfounding example in which traditional analysis might lead you to adjust for $L$, but doing so would induce a bias.	I.89
	A: Physical activity L: Income Y: Cardiovascular disease U: Socioeconomic status	$L$ (income) is not a confounder, but is a measurable variable that could serve as a surrogate confounder for $U$ (socioeconomic status) and thus could be used to partially adjust for the confounding from $U$. In other words, conditioning on $L$ will result in a partial blockage of the backdoor path $A \leftarrow U \rightarrow Y$.	I.90
Normal DAG: Corresponding SWIG:	A: Aspirin Y: Stroke L: Heart Disease U: Atherosclerosis (unmeasured)	Represents data from a hypothetical intervention in which all individuals receive the same treatment level $a$. Treatment is split into two sides: (a) Left side encodes the values of treatment $A$ that would have been observed in the absence of intervention (the natural value of treatment) (b) Right side encodes the treatment value under the intervention. There is no arrow from $A$ into $a$ because $a$ is the same for everyone. Conditional exchangeability $Y^{a} \unicode{x2AEB} A \| L$ holds because all paths between $Y^{a}$ and $A$ are blocked after conditioning on $L$.	I.91
Normal DAG: Corresponding SWIG:	A: Physical activity Y: Cervical Cancer L: Pap smear U_1: Pre-cancer lesion (unmeasured here) U_2: Health-conscious personality (unmeasured)	Here, marginal exchangeability $Y^{a} \unicode{x2AEB} A$ holds because, on the SWIG, all paths between $Y^{a}$ and $A$ are blocked without conditioning on $L$. Conditional exchangeability $Y^{a} \unicode{x2AEB} A \| L$ does not hold because, on the SWIG, the path $Y^{a} \leftarrow U_1 \rightarrow L \leftarrow U_2 \rightarrow A$ is open when the collider $L$ is conditioned on. Taken together, the marginal $A-Y$ association is causal but the conditional $A-Y$ association given $L$ is not.	I.91
Normal DAG: Corresponding SWIG:	(Example labels not in book) A: Statins Y: Coronary artery disease L: HDL/LDL U: Race	In this example, the SWIG is used to highlight a failure of the DAG to provide conditional exchangeability $Y^{a} \unicode{x2AEB} A \| L$. In the SWIG, the factual variable $L$ is replaced by the counterfactual variable $L^{a}$. In this SWIG, counterfactual exchangeability $Y^{a} \unicode{x2AEB} A \| L^{a}$ holds, since $L^{a}$ blocks the paths from $Y^{a}$ to $A$. But $L$ is not even on the graph, so we can’t conclude that $Y^{a} \unicode{x2AEB} A \| L$ holds. The problem being highlighted here is that $L$ is a descendant of the treatment $A$ blocking the path to $Y$. In contrast, if the arrow from $A$ to $L$ didn’t exist, $L$ would not be a descendant of $A$ and adjusting for $L$ would eliminate all bias, even if $L$ were still in the future of $A$. Thus, confounders are allowed to be in the future of the treatment; they just can’t be descendants of it.	I.92
	A: Aspirin Y: Blood Pressure U: History of heart disease (unmeasured) C: Blood pressure right before treatment (“placebo test” aka “negative outcome control”)	This example was used to show difference-in-difference and negative outcome controls. The idea: We cannot compute the effect of $A$ on $Y$ via standardization or IP weighting because there is unmeasured confounding. Instead, we first measure the (“negative”) outcome $C$ right before treatment. Obviously $A$ has no effect on $C$, but we can assume that $U$ will have the same confounding effect on $C$ that it has on $Y$. As such, we take the effect in the treated to be the effect of $A$ on $Y$ (treatment effect + confounding effect) minus the effect of $A$ on $C$ (confounding effect). This is the difference-in-differences. Negative outcome controls are sometimes used to try to detect confounding.	I.95
	(No example labels in text) A: Aspirin M: Platelet Aggregation Y: Heart Attack U: High Cardiovascular Risk	This example is to demonstrate the frontdoor criterion (see notes or page I.96 for more details). Given this DAG, it is impossible to directly use standardization or IP weighting, because the unmeasured variable $U$ is necessary to block the backdoor path between $A$ and $Y$. However, the frontdoor adjustment can be used because: (i) the effect of $A$ on $M$ can be computed without confounding, and (ii) the effect of $M$ on $Y$ can be computed because conditioning on $A$ blocks the only backdoor path between $M$ and $Y$.	I.95
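
As a reference for the frontdoor row above, the two steps (i) and (ii) combine into the standard frontdoor adjustment formula. The version below is written in this page's notation for a binary outcome; the summation index $a'$ is introduced here only for illustration and ranges over the treatment levels.

$$
Pr[Y^{a}=1] = \sum_{m} Pr[M=m \mid A=a] \sum_{a'} Pr[Y=1 \mid M=m, A=a'] \, Pr[A=a']
$$

The outer sum is step (i), the unconfounded effect of $A$ on $M$. The inner sum is step (ii), the effect of $M$ on $Y$ after standardizing over $A$ to block the backdoor path through $U$.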

Some additional (but structurally redundant) examples of confounding from chapter 7:

Example	Notes	Page
A: Exercise Y: Death L: Smoking status U: Social Factors (unmeasured) or Subclinical Disease (undetected)	Subclinical disease could also result both in lack of exercise $A$ and increased risk of a clinical disease $Y$. This is an example of reverse causation.	I.84
A: Gene being tested Y: Trait L: Different gene in LD with gene A U: Ethnicity	Linkage disequilibrium can drive spurious associations between gene $A$ and trait $Y$ if the true causal gene $L$ is in LD with $A$ in patients with ethnicity $U$.	I.84
A: Airborne particulate matter Y: Coronary artery disease L: Other pollutants U: Weather conditions	Environmental exposures often co-vary with the weather conditions. As such, certain pollutants $A$ may be spuriously associated with outcome $Y$ simply because the weather drives them to co-occur with $L$.	I.84

Selection Bias (Chapter 8)

Note: While randomization eliminates confounding, it does not eliminate selection bias. All of the issues in this section apply just as much to prospective and/or randomized trials as they do to observational studies.

Example	Notes	Page
A: Folic Acid supplements Y: Cardiac Malformation C: Death before birth	In this example, we assume folic acid supplements decrease mortality by reducing non-cardiac malformations, and cardiac malformations increase mortality. The study restricted participants to fetuses who survived until birth ($C=0$). Two sources of association between treatment and outcome: 1. Open path $A \rightarrow Y$, the causal effect. 2. Open path $A \rightarrow C \leftarrow Y$ linking $A$ and $Y$ due to conditioning on common effect (collider) $C$. This is the selection bias, specifically, selection bias under the null. The selection bias eliminates the ability to make causal inference. If the analysis were not conditioned on $C$, causal inference would be valid.	I.97
A: Folic Acid supplements Y: Cardiac Malformation C: Death before birth S: Parental Grief	This example is the same as the above, except we consider what happens if the researchers instead conditioned on an effect of the collider, namely $S$, parental grief. This is still selection bias: the path $A \rightarrow C \leftarrow Y$ linking $A$ and $Y$ is open, and association is not causation.	I.98
(Note: Missing arrow: $A \rightarrow Y$ ) A: Antiretroviral treatment Y: 3-year death C: Censoring from study or Missing Data U: High immunosuppression (unmeasured) L: Symptoms, CD4 count, viral load (unmeasured) ———— W: Lifestyle, personality, educational variables (unmeasured)	Figure 8.3: In this example, individuals with high immunosuppression – in addition to having higher risk of death – manifest worse physical symptoms that mediate censoring from the study. Treatment also worsens side effects, which increases censoring as well. $C$ is conditioned upon, because the uncensored individuals are the only ones who actually contribute data to the study. Per d-separation, $A \rightarrow C \leftarrow L \leftarrow U \rightarrow Y$ is open due to conditioning on $C$, allowing association to flow from $A$ to $Y$ and killing causal inference. Note: This is a transformation of figure 8.1, except instead of $Y$ acting directly on $C$, we have $U$ acting on both $Y$ and $C$. Intuition for the bias: if a treated individual with treatment-induced side effects does not drop out ($C=0$), this implies that he probably didn’t have high immunosuppression $U$, and low immunosuppression means better outcomes. Hence, there is probably an inverse association between $A$ and $U$ among those that don’t drop out. This is an example of selection bias that arises from conditioning on a censoring variable that is a common effect of both treatment $A$ and cause $U$ of the outcome $Y$. ———— Figure 8.5 is the same idea, except it notes that sometimes additional unmeasured variables may contribute to both treatment and censoring.	I.98
(Note: Missing arrow: $A \rightarrow Y$ ) A: Antiretroviral treatment Y: 3-year death C: Censoring from study or Missing Data U: High immunosuppresion (unmeasured) L: Symptoms, CD4 count, viral load (unmeasured) ———— W: Lifestyle, personality, educational variables (unmeasured)	Same example as 8.3/8.5, except we assume that treatment (especially prior treatment) has direct effect on symptoms $L$. Restricting to uncensored individuals still implies conditioning on a common effect $C$ of both $A$ and $U$, introducing an association between treatment and outcome. (Note: Unlike in Figure 8.3/8.5, even if we had access to $L$, stratification is impossible in these DAGs, because while conditioning on $L$ blocks the backdoor path from $C$ to $Y$, it also opens the backdoor path $A \rightarrow L \leftarrow U \rightarrow Y$ because $L$ is a collider on that path. IP-weighting, in contrast, could work here. See page I.108 in section 8.5 for a discussion.)	I.98
A: Physical activity Y: Heart Disease C: Becoming a firefighter L: Parental socioeconomic status U: Interest in physical activities (unmeasured)	The goal of this example is to show that while confounding and selection bias are distinct, they can often become functionally the same; this is why some call selection bias “confounding”. Assume that – unknown to the investigators – $A$ does not cause $Y$. Parental SES $L$ affects becoming a firefighter $C$, and, through childhood diet, heart disease risk $Y$. But we assume that $L$ doesn’t affect $A$. Attraction to physical activity $U$ affects being physically active $A$ and being a firefighter $C$, but not $Y$. Per these assumptions, there is no confounding, because there are no common causes of $A$ and $Y$. However, restricting the study to firefighters ($C=0$) induces a selection bias that can be eliminated by adjusting for $L$. Thus, some economists would call $L$ a “confounder” because adjusting for it eliminates the bias.	I.101
A: Heart Transplant Y_1: Death at time point 1 Y_2: Death at time point 2 U: Protective genetic haplotype (unmeasured)	The purpose of this example is to show the potential for selection bias in time-specific hazard ratios. The example depicts a randomized experiment representing the effect of heart transplant on risk of death at two time points, for which we assume the true causal DAG is figure 8.8. In figure 8.8, we assume that $A$ only directly affects death at the first time point and that $U$ decreases risk of death at all times but doesn’t affect treatment. In this circumstance, the unconditional associational risk ratios are not confounded. In other words, $aRR_{AY_1} = \frac{Pr[Y_{1}=1\|A=1]}{Pr[Y_{1}=1\|A=0]}$ and $aRR_{AY_2} = \frac{Pr[Y_{2}=1\|A=1]}{Pr[Y_{2}=1\|A=0]}$ are unbiased and valid for causal inference. However, trying to compute time-specific hazard ratios is risky. The process is valid at time point 1 ($aRR_{AY_1}$ is the same as above), but the hazard ratio at time point 2 is inherently conditional on having survived at time point 1: $aRR_{AY_2\|Y_{1}=0} = \frac{Pr[Y_{2}=1\|A=1,Y_{1}=0]}{Pr[Y_{2}=1\|A=0,Y_{1}=0]}$. Since $U$ affects survival at time point 1, however, this induces a selection bias that opens a path $A\rightarrow Y_{1} \leftarrow U \rightarrow Y_{2}$ between $A$ and $Y_2$. If we could condition on $U$, then $aRR_{AY_2\|Y_{1}=0,U}$ would be valid for causal inference. But we can’t, so conditioning on $Y_{1}=0$ makes the DAG functionally equivalent to Figure 8.9. This issue is relevant to both observational studies and randomized experiments over time.	I.102
(Note: Missing arrow: $A \rightarrow Y$ ) A: Wasabi consumption (randomized) Y: 1-year death C: Censoring L: Heart Disease U: Atherosclerosis (unmeasured)	This example is of an RCT with censoring. We imagine that there was in reality an equal number of deaths in treatment and control, but there was higher censoring ($C=1$) among patients with heart disease and higher censoring among the wasabi arm. As such, we observe more deaths in the wasabi group than in control. Thus, we see a selection bias due to conditioning on common effect $C$. There are no common causes of $A$ and $Y$ – expected in a marginally randomized experiment – so there is no need to adjust for confounding per se. However, there is a common cause $U$ of both $C$ and $Y$, inducing a backdoor path $C \leftarrow L \leftarrow U \rightarrow Y$. As such, conditioning on non-censored patients $C=0$ means we have a selection bias that turns $U$ functionally into a confounder. $U$ is unmeasured, but the backdoor criterion says that adjusting for $L$ here blocks the backdoor path. The takeaway here is that censoring or other selection changes the causal question, and turns the counterfactual outcome into $Y^{a=1,c=0}$ – the outcome of receiving the treatment and being uncensored. The relevant causal risk ratio, for example, is thus now $\frac{E[Y^{a=1,c=0}]}{E[Y^{a=0,c=0}]}$ – “the risk if everyone had been treated and was uncensored” vs “the risk if everyone were untreated and remained uncensored.” In this sense, censoring is another treatment.	I.105
A: Surgery Y: Death E: Genetic haplotype ———— Death subsplit by causes: (not recorded) Y_A: Death from tumor Y_E: Death from heart attack Y_O: Death from other causes	In this example, Figure 8.12, surgery $A$ and haplotype $E$ are: (i) marginally independent (i.e. haplotype doesn’t affect probability of receiving surgery), and (ii) associated conditionally on $Y$ (i.e. probability of receiving surgery does vary by haplotype within at least one stratum of $Y$). The purpose of this example is to show that despite this fact, situations exist in which $A$ and $E$ remain conditionally independent within some strata of $Y$. The key idea here is to recognize that if you split death into different causes (even if this isn’t recorded), $A$ and $E$ affect different sub-causes in different ways (specifically, $A$ removes tumor, and $E$ prevents heart attack). Arrows from $Y_{A}$, $Y_{E}$, and $Y_{O}$ to $Y$ are deterministic, and $Y=0$ if and only if $Y_{A} = Y_{E}=Y_{O}=0$, so conditioning on $Y=0$ implicitly conditions all of the cause-specific $Y$s to zero. This also blocks the path between $A$ and $E$, since it is conditioning on non-colliders $Y_{A}$, $Y_{E}$, and $Y_{O}$. In contrast, conditioning on $Y=1$ is compatible with any combination of $Y_{A}$, $Y_{E}$, and $Y_{O}$ being equal to 1, so the path between $A$ and $E$ is not blocked. The ability to break the conditional probability of survival down in this way is an example of a multiplicative survival model.	I.105
A: Surgery Y: Death E: Genetic haplotype ———— Death subsplit by causes: (not recorded) Y_A: Death from tumor Y_E: Death from heart attack Y_O: Death from other causes	Same setup as in the examples of Figure 8.12 and 8.13. However, in all of these DAGs, $A$ and $E$ affect survival through a common mechanism, either directly or indirectly. In such cases, $A$ and $E$ are dependent in both strata of $Y$. Taken together with the example above, the point is that conditioning on a collider always induces an association between its causes, but that this association may or may not be restricted to certain levels of the common effect.	I.105
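
To make the collider logic in the folic acid rows concrete, here is a minimal sketch that computes risks exactly rather than by simulation. All probabilities and the function names `p_death` and `risk` are hypothetical, chosen only for illustration. It assumes the causal null (supplements $A$ have no effect on malformation $Y$) and a mortality model in which $A$ lowers and $Y$ raises the probability of death before birth $C$.

```python
from fractions import Fraction

# Hypothetical Pr[Y=1]; under the causal null it is the same for A=0 and A=1.
P_Y = Fraction(1, 5)

def p_death(a, y):
    """Hypothetical Pr[C=1 | A=a, Y=y]: A lowers mortality, Y raises it."""
    return Fraction(1, 10) + Fraction(4, 10) * y - Fraction(1, 20) * a

def risk(a, condition_on_survival):
    """Pr[Y=1 | A=a], optionally also conditioning on survival (C=0)."""
    num = den = Fraction(0)
    for y in (0, 1):
        p = P_Y if y == 1 else 1 - P_Y
        if condition_on_survival:
            p *= 1 - p_death(a, y)  # keep only the mass with C=0
        den += p
        num += p * y
    return num / den

print(risk(1, False), risk(0, False))  # equal: no association without selection
print(risk(1, True), risk(0, True))    # unequal: association induced by selection
```

Without conditioning on $C$, the two risks are identical, as the null requires. Restricting to survivors makes them differ even though $A$ has no effect on $Y$, which is the selection bias under the null described above.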

Some additional (but structurally redundant) examples of selection bias from chapter 8:

Example	Notes	Page
(Note: Missing arrow: $A \rightarrow Y$ ) A: Occupational exposure Y: Mortality C: Being at Work U: True health status L: Blood tests and physical exam ———— W: Exposed jobs are eliminated and workers laid off	(Note: DAGS 8.3/8.5 work just as well, here.) Healthy worker bias: If we restrict a factory cohort study to those individuals who are actually at work, we miss out on the people that are not working due to either: (a) disability caused by exposure, or (b) a common cause of not working and not being exposed.	I.99
(Note: Missing arrow: $A \rightarrow Y$ ) A: Smoking status Y: Coronary heart disease C: Consent to participate U: Family history L: Heart disease awareness ———— W: Lifestyle	(Note: DAGS 8.4/8.6 work just as well, here.) Self-selection bias or Volunteer bias: Under any of the above structures, if the study is restricted to people who volunteer or choose to participate, this can induce a selection bias.	I.100
(Note: Missing arrow: $A \rightarrow Y$ ) A: Smoking status Y: Coronary heart disease C: Consent to participate U: Family history L: Heart disease awareness ———— W: Lifestyle	(Note: DAGS 8.4/8.6 work just as well, here.) Selection affected by treatment received before study entry: Generalization of self-selection bias. Under any of the above structures, if the treatment takes place before the study selection or includes a pre-study component, a selection bias can arise. Particularly high-risk in studies that look at lifetime exposure to something in middle-aged volunteers. Similar issues often arise with confounding if confounders are only measured during the study.	I.100

Measurement Bias (Chapter 9)

Example	Notes	Page
A: True Treatment A*: Measured treatment Y: True Outcome U_A: Measurement error	This DAG is simply to demonstrate how the measured treatment $A^{*}$ (aka “measure” or “indicator”) recorded in the study is different from the true treatment $A$ (aka “construct”). It also introduces $U_{A}$, the measurement error variable, which encodes all the factors other than $A$ that determine $A^{*}$. Note: $U_{A}$ and $A^{*}$ were unnecessary in discussions of confounding or selection bias because they are not a part of a backdoor path and no variables are conditioned on them.	I.111
A: True Treatment A*: Measured treatment Y: True Outcome Y*: Measured outcome U_A: Measurement error for A U_Y: Measurement error for Y	This DAG adds in the notion of imperfect measurement for the outcome as well as the treatment. Note that there is still no confounding or selection bias at play here, so measurement bias or information bias is the only thing that would break the link between association and causation. Figure 9.2 is an example of a DAG with independent nondifferential error.	I.112
A: Drug use A*: Recorded history of drug use Y: Liver toxicity Y*: Liver lab values U_A: Measurement error for A U_Y: Measurement error for Y U_AY: Measurement error affecting A and Y (e.g. memory and language gaps during interview)	In Figure 9.2 above, $U_{A}$ and $U_{Y}$ are independent according to d-separation, because the path between them is blocked by colliders. Independent errors could include EHR data entry errors that occur by chance, technical errors at a lab, etc. In this figure, we add $U_{AY}$ to note the existence of dependent errors. For example, communication errors that take place during an interview with a patient could affect both recorded drug use and previously recorded lab tests. Figure 9.3 is an example of dependent nondifferential error.	I.112
A: Drug use A*: History of drug use per patient interview Y: Dementia Y*: Dementia diagnosis U_A: Measurement error for A U_Y: Measurement error for Y U_AY: Measurement error affecting A and Y	Recall bias is one example of how the true outcome can bias treatment measurement error. In this example, patients with dementia are less able to effectively communicate, so true cases of the disease are more likely to have faulty medical histories. Another example of recall bias could be in a study of the effect of alcohol use during pregnancy $A$ on birth defects $Y$, if the alcohol intake is measured by recall after delivery. Bad medical outcomes, especially ones like complicated births, often affect patient recall and patient reporting. Figure 9.4 is an example of independent differential measurement error. Adding dependent errors such as a faulty interview makes Figure 9.6 an example of dependent differential error.	I.113
A: Drug use A*: Recorded history of drug use Y: Liver toxicity Y*: Liver lab values U_A: Measurement error for A U_Y: Measurement error for Y U_AY: Measurement error affecting A and Y	An example of true treatment affecting the measurement error of the outcome could also arise in the setting of drug use and liver toxicity. For example, if a doctor finds out a patient has a drug problem, he may start monitoring the patient’s liver more frequently, and become more likely to catch aberrant liver lab values and record them in the EHR. Figure 9.5 is an example of independent differential measurement error. Adding dependent errors such as a faulty interview makes Figure 9.7 an example of dependent differential error.	I.113
A: Drug use Y: Liver toxicity L: History of hepatitis L*: Measured history of hepatitis	This example demonstrates mismeasured confounders. Controlling for $L$ in Figure 9.8 would be sufficient to allow for causal inference. However, if $L$ is imperfectly measured – say, because it was retrospectively recorded from memory – then the standardized or IP-weighted risk ratio based on $L^{*}$ will generally differ from the true causal risk ratio. A cool observation is that since noisy measurement of confounding can be thought of as unmeasured confounding, Figure 9.9 is actually equivalent to Figure 7.5: $L^{*}$ is essentially a surrogate confounder (like Figure 7.5’s $L$) for an unmeasured actual confounder (Figure 7.5’s $U$ playing the role of Figure 9.9’s $L$). Hence, controlling for $L^{*}$ will be better than nothing but still flawed.	I.114
A: Aspirin Y: Stroke L: Heart Disease U: Atherosclerosis (unmeasured) L*: Measured history of heart disease	Figure 9.9 is the same idea as Figure 9.8: Even though controlling for $L$ would be sufficient, a mismeasured $L^{*}$ is insufficient to block the backdoor path in general. Another note here is that mismeasurement of confounders can result in apparent effect modification. For example, if all participants who reported a history of heart disease ($L^{*}=1$) and half the participants who reported no such history ($L^{*}=0$) actually had heart disease, then stratifying on $L^{*}=1$ would eliminate all confounding in that stratum, but stratifying on $L^{*}=0$ would fail to do so. Thus one could detect a spurious association in $L^{*}=0$ but not in $L^{*}=1$ and falsely conclude that $L^{*}$ is an effect modifier. (See discussion on I.115.)	I.114
A: Folic Acid supplements Y: Cardiac Malformation C: Death before birth C*: Death records	Conditioning on a mismeasured collider induces a selection bias, because $C^{*}$ is a common effect of treatment $A$ and outcome $Y$.	I.115
Z: Assigned treatment A: Heart Transplant Y: 5-year Mortality (Ignore U here)	Figure 9.11 is an example of an intention-to-treat RCT. ITT RCTs can almost be thought of as RCTs with a potentially misclassified treatment. However, unlike a misclassified treatment, the treatment assignment $Z$ has a causal effect on the outcome $Y$, both (a) by influencing the actual treatment $A$, and (b) by influencing study participants who know what $Z$ is and change their behavior accordingly. Hence, the causal effect of $Z$ on $Y$ depends on the strength of the arrow $Z \rightarrow Y$, the arrow $Z \rightarrow A$, and the arrow $A \rightarrow Y$. Double-blinding attempts to remove $Z \rightarrow Y$ (Figure 9.12).	I.115
Z: Assigned treatment A: Heart Transplant Y: 5-year Mortality U: Illness Severity (unmeasured)	By including $U$, we are considering the fact that in an ITT study, severe illness (or other variables) leads some patients to seek out a different treatment than the one they’ve been assigned. Note that there is a backdoor path $A \leftarrow U \rightarrow Y$ and thus confounding for the effect of $A$ on $Y$, requiring adjustment. However, there is no confounding of $Z$ and $Y$, and thus no need for adjustment. This explains why the intention-to-treat effect is often estimated in lieu of the per-protocol effect. Taken together, the per-protocol effect brings with it unmeasured confounding, and ITT brings risk of misclassification bias. So one needs to trade these off when deciding which to use. (Full discussion below and on I.120)	I.115
Z: Assigned treatment A: Heart Transplant Y: 5-year Mortality U: Illness Severity (unmeasured) L: Measured factors that mediate U	This example is of an as-treated analysis, a type of per-protocol analysis. As-treated includes all patients and compares those treated ($A=1$) vs not treated ($A=0$), independent of their assignment $Z$. As-treated analyses are confounded by $U$, and thus depend entirely on whether they can accurately adjust for measurable factors $L$ to block the backdoor paths between $A$ and $Y$.	I.118
Z: Assigned treatment A: Heart Transplant Y: 5-year Mortality U: Illness Severity (unmeasured) L: Measured factors that mediate U S: Selection filter (A=Z)	This example is of a conventional per-protocol analysis, a second method to measure per-protocol effect. Conventional per-protocol analyses limit the population to those who adhered to the study protocol, subsetting to those for whom $A=Z$. This method induces a selection bias on $A=Z$, and thus still requires adjustment on $L$.	I.118
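
The three trial analyses described in the last few rows differ only in which individuals are compared. The sketch below illustrates this on a handful of hypothetical records (the data and function names are made up for illustration). It deliberately computes crude, unadjusted risk differences; as noted above, the as-treated and per-protocol contrasts would still need adjustment for $L$ before being read causally.

```python
# Hypothetical trial records: (Z = assigned arm, A = treatment received, Y = death).
records = [
    (1, 1, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1),
    (0, 0, 0), (0, 0, 1), (0, 0, 1), (0, 1, 0),
]

def risk(rows):
    """Proportion of rows with Y=1."""
    rows = list(rows)
    return sum(y for _, _, y in rows) / len(rows)

def intention_to_treat(records):
    """Crude risk difference comparing arms by assignment Z."""
    return (risk(r for r in records if r[0] == 1)
            - risk(r for r in records if r[0] == 0))

def as_treated(records):
    """Crude risk difference comparing by treatment received A, ignoring Z."""
    return (risk(r for r in records if r[1] == 1)
            - risk(r for r in records if r[1] == 0))

def per_protocol(records):
    """Crude risk difference among adherers only (A == Z)."""
    adherent = [r for r in records if r[0] == r[1]]
    return (risk(r for r in adherent if r[0] == 1)
            - risk(r for r in adherent if r[0] == 0))

print(intention_to_treat(records), as_treated(records), per_protocol(records))
```

In this toy data the three contrasts disagree, which is the reason the choice between them (and the adjustment each requires) matters.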

Some additional (but structurally redundant) examples of measurement bias from chapter 9:

DAG	Example	Notes	Page
	A: Drug use A*: Recorded history of drug use Y: Liver toxicity Y*: Liver lab values U_A: Measurement error for A U_Y: Measurement error for Y	Reverse causation bias is another example of how the true outcome can bias treatment measurement error. In this example, liver toxicity worsens clearance of drugs from the body, which could affect blood levels of the drugs.	I.112

Causal Inference Book Part I -- Glossary and Notes

Wed, 19 Jun 2019 07:50:00 -0400

This page contains some notes from Miguel Hernán and Jamie Robins’s Causal Inference Book. So far, I’ve only done Part I.

This page only has key terms and concepts. On a separate page, I’ve tried to systematically present all the DAGs in the same book. I imagine that one will be more useful going forward, at least for me.

Table of Contents:

A few common variables
Chapter 1: Definition of Causal Effect
Chapter 2: Randomized Experiments
Chapter 3: Observational Studies
Chapter 4: Effect Modification
Chapter 5: Interaction
Chapter 6: Causal Diagrams
Chapter 7: Confounding
Chapter 8: Selection Bias
Chapter 9: Measurement Bias
Chapter 10: Random Variability
Chapter 11: Why Model?
Chapter 12:

A few common variables

Variable	Meaning
A, E	Treatment
Y	Outcome
Y^(A=a)	Counterfactual outcome under treatment with $a$
Y^(a,e)	Joint counterfactual outcome under treatment with $a$ and $e$
L	Patient variable (often confounder)
U	Patient variable (often unmeasured or background variable)
M	Patient variable (often effect modifier)

Chapter 1: Definition of Causal Effect

Term	Notation or Formula	Notes	Page
Association	Pr[Y=1\|A=1] $\neq$ Pr[Y=1\|A=0]	Example definitions of independence (lack of association): Y $\unicode{x2AEB}$ A or Pr[Y=1\|A=1] - Pr[Y=1\|A=0] = 0 or $\frac{Pr[Y=1\|A=1]}{Pr[Y=1\|A=0]}$ = 1 or $\frac{Pr[Y=1\|A=1]/Pr[Y=0\|A=1]}{Pr[Y=1\|A=0]/Pr[Y=0\|A=0]}$ = 1	I.10
Causation and Causal Effects	Causation: Pr[Y^(a=1)=1] $\neq$ Pr[Y^(a=0)=1] Individual Causal Effects: Y^(a=1) - Y^(a=0) Population Average Causal Effects: E[Y^(a=1)] - E[Y^(a=0)] where Y^(a=1) = Outcome for treatment w/ $a=1$ Y^(a=0) = Outcome for treatment w/ $a=0$	Sharp causal null hypothesis: Y^(a=1) = Y^(a=0) for all individuals in the population. Null hypothesis of no average causal effect: E[Y^(a=1)] = E[Y^(a=0)] Mathematical representations of causal null: Pr[Y^(a=1)=1] - Pr[Y^(a=0)=1] = 0 or $\frac{Pr[Y^{a=1}=1]}{Pr[Y^{a=0}=1]} = 1$ or $\frac{Pr[Y^{a=1}=1]/Pr[Y^{a=1}=0]}{Pr[Y^{a=0}=1]/Pr[Y^{a=0}=0]} = 1$	I.7
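
The three representations of the null in the table (difference, ratio, odds ratio) are all functions of the same two risks. A minimal sketch, with a hypothetical helper name `effect_measures`: pass observed risks Pr[Y=1\|A=a] to get associational measures, or counterfactual risks Pr[Y^(a)=1] to get causal ones.

```python
def effect_measures(risk_treated, risk_untreated):
    """Return (risk difference, risk ratio, odds ratio) for two risks in (0, 1)."""
    def odds(p):
        return p / (1 - p)

    risk_difference = risk_treated - risk_untreated
    risk_ratio = risk_treated / risk_untreated
    odds_ratio = odds(risk_treated) / odds(risk_untreated)
    return risk_difference, risk_ratio, odds_ratio

# Under the null, the difference is 0 and both ratios are 1.
print(effect_measures(0.5, 0.5))
```

Any one of the three null conditions implies the other two, which is why the table lists them as interchangeable.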

Chapter 2: Randomized Experiments

Term	Notes	Page
Marginally randomized experiment	Single unconditional (marginal) randomization probability applied to assign treatments to all individuals in experiment. Produces exchangeability of treated and untreated. Values of counterfactual outcomes are missing completely at random (MCAR).	I.18
Conditionally randomized experiment	Randomized trial where study population is stratified by some variable $L$, with different treatment probabilities for each stratum. Needn’t produce marginal exchangeability, but produces conditional exchangeability. Values of counterfactuals are not MCAR, but are missing at random (MAR) conditional on $L$.	I.18
Standardization	Calculate the marginal counterfactual risk from a conditionally randomized experiment by taking a weighted average over the stratum-specific risks. Standardized mean: $\sum_l E[Y\|L=l,A=a] \times Pr[L=l]$ Causal risk ratio can be computed via standardization as follows: $\frac{Pr[Y^{a=1}=1]}{Pr[Y^{a=0}=1]} = \frac{\sum_l Pr[Y=1\|L=l,A=1]\times Pr[L=l]}{\sum_l Pr[Y=1\|L=l,A=0]\times Pr[L=l]}$	I.19
Inverse probability weighting	Given a conditionally randomized study population, we can invoke an assumption of conditional exchangeability given $L$ to simulate the counterfactual in which everyone had received (or not received) the treatment. The causal risk ratio can then be directly calculated by comparing $Pr[Y^{a=1}=1]/Pr[Y^{a=0}=1]$ (in this example, it’s $\frac{10/20}{10/20}=1$). By the same token, you can effectively double your population and create a hypothetical pseudo-population in which everyone had received both treatments. This process amounts to weighting each individual in the population by the inverse of the conditional probability of receiving the treatment she received (see formula on right above). Hence the name inverse probability (IP) weighting.	I.20
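
To make the two estimators above concrete, here is a minimal Python sketch that computes the standardized risk and the IP-weighted risk from a table of counts. The counts are hypothetical, chosen so that both risks come out to 10/20 as in the example quoted above, and the names `COUNTS`, `stratum_size`, `standardized_risk`, and `ip_weighted_risk` are illustrative rather than taken from the book.

```python
# Hypothetical counts from a conditionally randomized experiment.
# Keys are (l, a); values are (number of individuals, number with Y=1).
COUNTS = {
    (0, 0): (4, 1),
    (0, 1): (4, 1),
    (1, 0): (3, 2),
    (1, 1): (9, 6),
}


def stratum_size(counts, l):
    """Number of individuals with L=l, summed over both treatment arms."""
    return sum(n for (l2, _), (n, _) in counts.items() if l2 == l)


def standardized_risk(counts, a):
    """Standardization: sum_l Pr[Y=1 | L=l, A=a] * Pr[L=l]."""
    total = sum(n for n, _ in counts.values())
    risk = 0.0
    for l in sorted({l for l, _ in counts}):
        n_la, cases_la = counts[(l, a)]
        risk += (cases_la / n_la) * (stratum_size(counts, l) / total)
    return risk


def ip_weighted_risk(counts, a):
    """IP weighting: each individual with A=a is weighted by 1 / Pr[A=a | L=l]."""
    total = sum(n for n, _ in counts.values())
    weighted_cases = 0.0
    for (l, a2), (n, cases) in counts.items():
        if a2 == a:
            pr_a_given_l = n / stratum_size(counts, l)
            weighted_cases += cases / pr_a_given_l
    return weighted_cases / total


for name, estimator in [("standardization", standardized_risk),
                        ("IP weighting", ip_weighted_risk)]:
    risk_ratio = estimator(COUNTS, 1) / estimator(COUNTS, 0)
    print(f"{name}: causal risk ratio = {risk_ratio:.3f}")
```

With these counts, both methods give a risk of 0.5 under each treatment level (up to floating point), which illustrates that standardization and IP weighting are two routes to the same quantity when the conditional probabilities are estimated nonparametrically.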

Chapter 3: Observational Studies

Term	Notation or Formula	English Definition	Notes	Page
Identifiability conditions	See below.	Sufficient conditions for conceptualizing an observational study as a randomized experiment. Consist of: 1. Consistency 2. Exchangeability, and 3. Positivity.		I.25
Consistency	If $A_i$ = $a$, then $Y_{i}^{a}=Y_{i}^{A_i}$ = $Y_i$	“The values of treatment under comparison correspond to well-defined interventions that, in turn, correspond to the versions of treatment in the data.” Has two main components: 1. Precise specification of counterfactual outcomes Y^a, and 2. Linkage of counterfactual outcomes to observed outcomes.	Violated in an ill-defined intervention. Examples: - Study looks at “heart transplant” but doesn’t look at protocols (e.g. which immunosuppressant is used). If effect varies between versions of treatment and protocols are not equally distributed, could cause problems. - Study wants to look at “obesity”, but “non-obesity” lumps together non-obesity from exercise vs cachexia vs genes vs diet. Need to subset population or make assumption that specific source of non-obesity doesn’t impact outcome. (Assumption called treatment-variation irrelevance assumption.) Not a testable assumption, relies on domain expertise.	I.31
Exchangeability (aka exogeneity)	Y^a $\unicode{x2AEB}$ A for all $a$ or Pr[Y^a=1 \| A=1] = Pr[Y^a=1 \| A=0] = Pr[Y^a=1]	“The treated, had they remained untreated, would have experienced the same average outcome as the untreated did, and vice versa.” Essentially, this is the assumption of no unmeasured confounding.	Beware formula: Not the same as Y $\unicode{x2AEB}$ A, which would mean treatment has no effect on outcome.	I.27
Conditional exchangeability	Y^a $\unicode{x2AEB}$ A \| L for all a or Pr[Y^a=1 \| A=1, L=1] = Pr[Y^a=1 \| A=0, L=1] = Pr[Y^a=1 \| L=1]	“The conditional probability of receiving every value of treatment is randomized or depends only on measured covariates”	Think conditional RCT where assignment depends only on $L$. In observational studies, this is an untestable assumption, thus relies on domain expertise.	I.27
Positivity	Pr[A=a \| L=$l$ ] > 0 for all values $l$ with Pr[L=$l$] $\neq$ 0 in the population of interest	“The conditional probability of receiving every value of treatment is greater than zero, i.e. positive.”	Aka “Experimental treatment assumption”. Example of positivity not holding: doctors always give heart transplants to patients in critical condition, eliminating positivity from that stratum of an observational study. Unlike exchangeability, positivity can be empirically verified.	I.30
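
Since positivity is the one identifiability condition that can be checked in the data, here is a small Python sketch of such a check. It is only an illustration: the records are made up to echo the heart transplant example above, and `positivity_violations` is a hypothetical helper name. An empty cell in a finite sample may also be a chance zero rather than a structural violation, so this is a screening step, not a proof.

```python
from collections import Counter


def positivity_violations(records, treatment_values=(0, 1)):
    """Return (l, a) pairs where stratum l appears in the data but nobody in it received a.

    `records` is an iterable of (l, a) pairs, one per individual.
    """
    records = list(records)
    cell_counts = Counter(records)
    strata = {l for l, _ in records}
    return sorted(
        (l, a)
        for l in strata
        for a in treatment_values
        if cell_counts[(l, a)] == 0
    )


# Hypothetical observational data: (condition, received transplant).
records = [
    ("stable", 0), ("stable", 1), ("stable", 0),
    ("critical", 1), ("critical", 1),
]
violations = positivity_violations(records)
print(violations)
```

Here the only flagged cell is the critical-condition stratum with no untreated patients, which is exactly the stratum where Pr[A=0 \| L=critical] is estimated as zero.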

Chapter 4: Effect Modification

Term	Notation or Formula	English Definition	Notes	Page
Effect modification aka effect-measure modification	Additive effect modification: E[Y^(a=1)-Y^(a=0) \| M = 1] $\neq$ E[Y^(a=1)-Y^(a=0) \| M = 0] Multiplicative effect modification: $\frac{E[Y^{a=1} \| M = 1]}{E[Y^{a=0} \| M = 1]}$ $\neq$ $\frac{E[Y^{a=1}\| M = 0]}{E[Y^{a=0}\| M = 0]}$	$M$ is a modifier of the effect of $A$ on $Y$ when the average causal effect of $A$ on $Y$ varies across levels of $M$.	The null hypothesis of no average causal effect does not necessarily imply the absence of effect modification (e.g. equal and opposite effect modifications in men and women could cancel at the population level), but the sharp null hypothesis of no causal effect does imply no effect modification. We only count variables unaffected by treatment as effect modifiers. Similar variables that are affected by treatment are termed mediators.	I.41
Qualitative effect modification		Average causal effects in different subsets of the population go in opposite directions.	In presence of qualitative effect modification, additive effect modification implies multiplicative effect modification, and vice versa. In absence of qualitative effect modification, it’s possible to have only additive or only multiplicative effect modification. Effect modifiers are not necessarily assumed to play a causal role. To make this explicit, sometimes the terms surrogate effect modifier vs causal effect modifier are used, or you can play it even safer and refer to “effect heterogeneity across strata of $M$.” Effect modification is helpful, among other things, for (i) assessing transportability to new populations where $M$ may have different prevalences, (ii) choosing subpopulations that may most benefit from treatment, and (iii) identifying mechanisms leading to outcome if modifiers are mechanistically meaningful (e.g. circumcision for HIV transmission).	I.42
Stratification	Stratified causal risk differences: E[Y^(a=1) \| M = 1] - E[Y^(a=0) \| M = 1] and E[Y^(a=1) \| M = 0] - E[Y^(a=0) \| M = 0]	To identify effect modification by variable $M$, separately compute the causal effect of $A$ on $Y$ for each stratum of the variable $M$.	If study design assumes conditional rather than marginal exchangeability, analysis to estimate effect modification must account for all other variables $L$ required to give exchangeability. This could involve standardization (IP weighting, etc.) by $L$ within each stratum $M$, or just using finer-grained stratification over all pairwise combinations of $M$ and $L$ (see page I.49). By the same token, stratification can be an alternative to standardization techniques such as IP weighting in analysis of any conditionally randomized experiment: instead of estimating an average causal effect over the population while standardizing for $L$, just stratify on $L$ and report separate causal effect estimates for each stratum.	I.43-49
Collapsibility		A characteristic of a population effect measure. Means that the effect measure can be expressed as a weighted average of stratum-specific measures.	Examples of collapsible effect measures: risk ratio and risk difference Example of non-collapsible effect measure: odds ratio. Noncollapsibility can produce counter-intuitive findings like a causal odds ratio that’s smaller in the average population than in any stratum of the population.	I.53
Matching		Construct a subset of the population in which all variables $L$ have the same distribution in both the treated and the untreated.	Under assumption of conditional exchangeability given $L$ in the source population, a matched population will have unconditional exchangeability. Usually, constructed by including all of the smaller group (e.g. the treated) and selecting one member of the larger group (e.g. the untreated) with matching $L$ for each member in the smaller group. Often requires approximate matching.	I.49
Interference		Treatment of one individual affects treatment status of other individuals in the population.	Example: A socially active individual convinces friends to join him while exercising.	I.48
Transportability		Ability to use causal effect estimation from one population in order to inform decisions in another (“target”) population.	Requires that the target population is characterized by comparable patterns of: - Effect modification - Interference, and - Versions of treatment	I.48
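
The non-collapsibility of the odds ratio noted in the table above is easy to check numerically. The sketch below uses made-up counterfactual risks in two equally sized strata of $M$ (all numbers and names are hypothetical). The risk difference averages cleanly across strata, while the marginal odds ratio ends up smaller than the odds ratio in either stratum, even though there is no confounding in this setup.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)


# Hypothetical counterfactual risks Pr[Y^a = 1 | M = m]; keys are (m, a).
RISK = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.8}
PR_M = {0: 0.5, 1: 0.5}  # Pr[M = m]


def marginal_risk(a):
    """Pr[Y^a = 1] as a weighted average over strata of M."""
    return sum(RISK[(m, a)] * PR_M[m] for m in PR_M)


stratum_rd = {m: RISK[(m, 1)] - RISK[(m, 0)] for m in PR_M}
stratum_or = {m: odds(RISK[(m, 1)]) / odds(RISK[(m, 0)]) for m in PR_M}
marginal_rd = marginal_risk(1) - marginal_risk(0)
marginal_or = odds(marginal_risk(1)) / odds(marginal_risk(0))

print("risk difference by stratum:", stratum_rd, "marginal:", marginal_rd)
print("odds ratio by stratum:", stratum_or, "marginal:", marginal_or)
```

In this example the stratum-specific odds ratios are both about 4, while the marginal odds ratio is roughly 3.45; the risk difference is about 0.3 in each stratum and in the whole population.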

Chapter 5: Interaction

Term	Notation or Formula	English Definition	Notes	Page
Joint counterfactual	Y^(a,e)	Counterfactual outcome that would have been observed if we had intervened to set the individual’s values of $A$ (treatment component 1) to $a$ and $E$ (treatment component 2) to $e$.		I.55
Interaction	Interaction on the additive scale: Pr[Y^(a=1,e=1)=1] - Pr[Y^(a=0,e=1)=1] $\neq$ Pr[Y^(a=1,e=0)=1] - Pr[Y^(a=0,e=0)=1] or Pr[Y^(a=1) = 1 \| E=1 ] - Pr[Y^(a=0) = 1 \| E=1 ] $\neq$ Pr[Y^(a=1) = 1 \| E=0 ] - Pr[Y^(a=0) = 1 \| E=0]	The causal effect of $A$ on $Y$ after a joint intervention that set $E$ to 1 differs from the causal effect of $A$ on $Y$ after a joint intervention that set $E$ to 0. (Definition also holds if you swap $A$ and $E$.)	Different from effect modification because an effect modifier $M$ is not considered a treatment or otherwise a variable on which we can intervene. In interaction, interventions $A$ and $E$ have equal status. Note from definition 2 on the left, however, that the mathematical definitions of effect modification and interaction line up. This means that if you randomize an interactor, it becomes equivalent to an effect modifier. Inference over joint counterfactuals requires that the identifying conditions of exchangeability, positivity, and consistency hold for both treatments.	I.55
Counterfactual response type		A characteristic of an individual that refers to how she will respond to a treatment.	For example, an individual may have the same counterfactual outcome regardless of treatment, be helped by the treatment, or be hurt by the treatment. The presence of an interaction between $A$ and $E$ implies that some individuals exist such that their counterfactual outcomes under $A=a$ cannot be determined without knowledge of $E$.	I.58
Sufficient-component causes		A set of variables that are sufficient to determine the outcome for a specific individual.	The minimal set of sufficient causes can be different for distinct individuals in the same study. For example, a patient with background factor $U_1$ might have the same outcome regardless of treatment, whereas another patient’s outcome might be driven by both a treatment $A$ and interactor $E$. Minimal sufficient-component causes are sometimes visualized with pie charts. Contrast between counterfactual outcomes framework and sufficient-component-cause framework: The sufficient-component-cause framework focuses on questions like: “given a particular effect, what are the various events which might have been its cause?” and the counterfactual outcomes framework focuses on questions like: “what would have occurred if a particular factor were intervened upon and set to a different level than it was?”. The sufficient-component-cause framework requires more detailed mechanistic knowledge and is generally more a pedagogical tool than a data analysis tool.	I.61
Sufficient cause interaction		A sufficient cause interaction between $A$ and $E$ exists in a population if $A$ and $E$ occur together in a sufficient cause.	Can be synergistic (e.g. A = 1 and E = 1 present in sufficient cause) or antagonistic (e.g. A = 1 and E = 0 present in sufficient cause).	I.64

Chapter 6: Causal Diagrams

Term	Definition	Page
Path	A path on a DAG is a sequence of edges connecting two variables on the graph, with each edge occurring only once.	I.76
Collider	A collider is a variable in which two arrowheads on a path collide. For example, $Y$ is a collider in the path $A \rightarrow Y \leftarrow L$ in the following DAG:	I.76
Blocked path	A path on a DAG is blocked if and only if: 1. it contains a noncollider that has been conditioned, or 2. it contains a collider that has not been conditioned on and has no descendants that have been conditioned on.	I.76
d-separation	Two variables are d-separated if all paths between them are blocked	I.76
d-connectedness	Two variables are d-connected if they are not d-separated	I.76
Faithfulness	Faithfulness is when all non-null associations implied by a causal diagram exist in the true causal DAG. Unfaithfulness can arise, for example, in certain settings of effect modification, by design as in matching experiments, or in the presence of certain deterministic relations between variables in the graph.	I.77
Positivity (on graphs)	The arrows from the nodes $L$ to the treatment node $A$ are not deterministic. (Concerned with nodes into treatment nodes)	I.75
Consistency (on graphs)	Well-defined intervention criteria: the arrow from treatment $A$ to outcome $Y$ corresponds to a potentially hypothetical but relatively unambiguous intervention. (Concerned with nodes leaving the treatment nodes.)	I.75
Systematic bias	The data are insufficient to identify the causal effect even with an infinite sample size. This occurs when any structural association between treatment and outcome does not arise from the causal effect of treatment on outcome in the population of interest.	I.79
Conditional bias	For average causal effects within levels of $L$: Conditional bias exists whenever the effect measure (e.g. causal risk ratio) and the corresponding association measure (e.g. associational risk ratio) are not equal. Mathematically, this is when: $Pr[Y^{a=1}=1 \| L = l] - Pr[Y^{a=0}=1 \| L = l]$ differs from $Pr[Y=1\|L=l, A = 1] - Pr[Y=1\|L=l, A=0]$ for at least one stratum $l$. For average causal effects in the entire population: Unconditional bias exists whenever $Pr[Y^{a=1}=1] - Pr[Y^{a=0}=1]$ $\neq$ $Pr[Y = 1\| A = 1] - Pr[Y = 1 \| A = 0]$.	I.79
Bias under the null	When the null hypothesis of no causal effect of treatment on the outcome holds, but treatment and outcome are associated in the data. Can be from either confounding, selection bias, or measurement error.	I.79
Confounding	The treatment and outcome share a common cause.	I.79
Selection bias	Conditioning on common effects.	I.79
Surrogate effect modifier	An effect modifier that does not directly influence the outcome but might stand in for a causal effect modifier that does.	I.81
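
The blocking and d-separation definitions above can be checked mechanically. The sketch below is one way to do it in plain Python. Note that instead of enumerating paths, it uses the standard moralized ancestral graph test: restrict the DAG to the variables of interest and their ancestors, connect ("marry") parents that share a child, drop arrow directions, remove the conditioned variables, and see whether the two variables are still connected. This is equivalent to asking whether every path is blocked. The function names and example graphs are illustrative, and the sketch assumes `x` and `y` are not themselves in the conditioning set.

```python
def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen = set()
    stack = list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen


def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated given `given`.

    `parents` maps each node to a list of its parents (nodes without parents may be omitted).
    """
    given = set(given)
    keep = ancestors(parents, {x, y} | given)

    # Build the moral graph of the ancestral subgraph (undirected adjacency sets).
    adj = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)

    # Search from x, never stepping onto a conditioned node.
    seen, stack = set(), [x]
    while stack:
        v = stack.pop()
        if v in seen or v in given:
            continue
        seen.add(v)
        stack.extend(adj[v])
    return y not in seen


# Collider: A -> Y <- L, with C a descendant of the collider Y.
collider_dag = {"Y": ["A", "L"], "C": ["Y"]}
# Common cause: A <- L -> Y (no arrow from A to Y).
confounder_dag = {"A": ["L"], "Y": ["L"]}

print(d_separated(collider_dag, "A", "L"))            # collider not conditioned on
print(d_separated(collider_dag, "A", "L", ["Y"]))     # conditioning on the collider
print(d_separated(collider_dag, "A", "L", ["C"]))     # conditioning on its descendant
print(d_separated(confounder_dag, "A", "Y"))          # open backdoor path
print(d_separated(confounder_dag, "A", "Y", ["L"]))   # backdoor path blocked by L
```

The two example graphs correspond to the two sources of bias defined above: conditioning on a collider (or its descendant) opens a path, as in selection bias, while conditioning on a common cause closes one, as in confounding adjustment.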

Chapter 7: Confounding

Concept	Definition or Notes	Page
Backdoor Path	A noncausal path between treatment and outcome that remains even if all arrows pointing from treatment to other variables (the descendants of treatment) are removed. That is, the path has an arrow pointing into treatment.	I.83
Confounding by indication (or Channeling)	A drug is more likely to be prescribed to individuals with a certain condition that is both an indication for treatment and a risk factor for the disease.	I.84
Channeling	Confounding by indication in which patient-specific risk factors $L$ encourage doctors to use certain drug $A$ within a class of drugs.	I.84
Backdoor Criterion	Assuming consistency and positivity, the backdoor criterion sets the circumstances under which (a) confounding can be eliminated from the analysis, and (b) a causal effect of treatment on outcome can be identified. Criterion is that identifiability exists if all backdoor paths can be blocked by conditioning on variables that are not affected by the treatment. The two settings in which this is possible are: 1. No common causes of treatment and outcome. 2. Common causes but enough measured variables to block all backdoor paths.	I.85
Single-world intervention graphs (SWIG)	A causal diagram that unifies counterfactual and graphical approaches by explicitly including the counterfactual variables on the graph. Depicts variables and causal relations that would be observed in a hypothetical world in which all individuals received treatment level $a$. In other words, is a graph that represents the counterfactual world created by a single intervention, unlike normal DAGs that represent variables and causal relations from the actual world.	I.91
Two categories of methods for confounding adjustment	G-Methods: G-formula, IP weighting, G-estimation. Exploit conditional exchangeability in subsets defined by $L$ to estimate the causal effect of $A$ on $Y$ in the entire population or in any subset of the population. Stratification-based Methods: Stratification, Restriction, Matching. Methods that exploit conditional exchangeability in subsets defined by $L$ to estimate the association between $A$ and $Y$ in those subsets only.	I.93
Difference-in-differences and negative outcome controls	A technique to account for unmeasured confounders under specific conditions. The idea is to measure a “negative outcome control”, which is the same as the main outcome but right before treatment. Then, instead of just reporting the effect of the treatment on the outcome (treatment effect + confounding effect), you subtract out the effect of treatment on the negative outcome (only confounding effect). What’s left is the difference-in-differences. This requires the assumption of additive equi-confounding: $E[Y^{0}\|A=1] - E[Y^{0}\|A=0]$ = $E[C\|A=1] - E[C\|A=0]$. Negative outcome controls are also sometimes used to try to detect confounding. Note: The DAG demonstration (Figure 7.11) is really useful for this one.	I.95
Frontdoor criterion and Frontdoor adjustment	A two-step standardization process to estimate a causal effect in the presence of a confounded causal effect that is mediated by an unconfounded mediator variable. Given a DAG such as $A \rightarrow M \rightarrow Y$ with an unmeasured common cause $U$ of $A$ and $Y$: $Pr[Y^{a}=1] = \sum_{m}Pr[M^{a}=m]Pr[Y^{m}=1]$. Thus, standardization can be applied in two steps: 1. Compute $Pr[M^{a}=m]$ as $Pr[M=m\| A=a]$, and 2. Compute $Pr[Y^{m}=1]$ as $\sum_{a'}Pr[Y=1\|M=m,A=a']Pr[A=a']$ These are then combined with the formula $\sum_{m}Pr[M=m\| A=a]\sum_{a'}Pr[Y=1\|M=m,A=a']Pr[A=a']$ The name frontdoor adjustment comes because it relies on the path from $A$ to $Y$ moving through a descendant $M$ of $A$ that causes $Y$.	I.96
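
As a sanity check on the frontdoor formula above, here is a small Python sketch that builds an exact joint distribution for a hypothetical model with an unmeasured confounder $U$ (structure: $U \rightarrow A$, $A \rightarrow M$, $M \rightarrow Y$, $U \rightarrow Y$), then compares three quantities: the true $Pr[Y^{a}=1]$ computed using $U$, the frontdoor estimate computed only from the observed $(A, M, Y)$ distribution, and the naive $Pr[Y=1\|A=a]$. All probabilities and names are made up for illustration.

```python
from itertools import product

# Hypothetical structural model; U is treated as unmeasured.
PR_U1 = 0.5
PR_A1_GIVEN_U = {0: 0.2, 1: 0.8}
PR_M1_GIVEN_A = {0: 0.1, 1: 0.9}
PR_Y1_GIVEN_MU = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.8}

NAMES = ("u", "a", "m", "y")


def bern(p1, value):
    """Probability that a Bernoulli(p1) variable equals `value`."""
    return p1 if value == 1 else 1 - p1


# Exact joint distribution over (u, a, m, y).
JOINT = {
    (u, a, m, y): bern(PR_U1, u) * bern(PR_A1_GIVEN_U[u], a)
    * bern(PR_M1_GIVEN_A[a], m) * bern(PR_Y1_GIVEN_MU[(m, u)], y)
    for u, a, m, y in product((0, 1), repeat=4)
}


def pr(**fixed):
    """Probability of an event such as pr(a=1, y=1), marginalizing over everything else."""
    return sum(
        p for key, p in JOINT.items()
        if all(key[NAMES.index(k)] == v for k, v in fixed.items())
    )


def frontdoor(a):
    """sum_m Pr[M=m | A=a] * sum_a' Pr[Y=1 | M=m, A=a'] * Pr[A=a'], using only A, M, Y."""
    total = 0.0
    for m in (0, 1):
        pr_m_given_a = pr(a=a, m=m) / pr(a=a)
        pr_ym = sum(
            pr(m=m, a=a2, y=1) / pr(m=m, a=a2) * pr(a=a2) for a2 in (0, 1)
        )
        total += pr_m_given_a * pr_ym
    return total


def truth(a):
    """True Pr[Y^a = 1], computed from the structural model with U known."""
    return sum(
        bern(PR_U1, u) * bern(PR_M1_GIVEN_A[a], m) * PR_Y1_GIVEN_MU[(m, u)]
        for u, m in product((0, 1), repeat=2)
    )


def naive(a):
    """Associational risk Pr[Y=1 | A=a], which is confounded by U."""
    return pr(a=a, y=1) / pr(a=a)


for a in (0, 1):
    print(f"a={a}: truth={truth(a):.3f} frontdoor={frontdoor(a):.3f} naive={naive(a):.3f}")
```

In this toy model the frontdoor estimate matches the true counterfactual risk for both treatment levels, while the naive conditional risk does not, even though $U$ never enters the frontdoor computation.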

Chapter 8: Selection Bias

Note: I have almost no notes in here, because the DAG section contains pretty much all the content I’m interested in noting here.

Concept	Definition or Notes	Page
Competing Event	An event that prevents the outcome of interest from happening. For example, death is a competing event, because once it occurs, no other outcome is possible.	I.108
Multiplicative survival model	A multiplicative survival model is of the form: $Pr[Y=0\|E=e,A=a]=g(e)h(a)$. The data follow such a model when there is no interaction between $A$ and $E$ on a multiplicative scale. This allows, for example, $A$ and $E$ to be conditionally independent given $Y=0$ but conditionally dependent given $Y=1$. See Technical Point 8.2 and the example in Figure 8.13.	I.109
Healthy worker bias	Example of selection bias where people are only included in the study if they are healthy enough, say, to come into work and be tested.	I.99
Self-selection bias	Example of selection bias where people volunteer for enrollment.	I.100

Chapter 9: Measurement Bias

Concept	Definition or Notes	Page
Measurement bias or Information bias	Systematic difference in associational risk and causal risk that arises due to measurement error. Eliminates causal inference even under identifiability conditions of exchangeability, positivity, and consistency.	I.112
Independent measurement error	Independent measurement error takes place when the measurement error of the treatment ($U_{A}$) and the measurement error of the response ($U_{Y}$) are d-separated. Dependent measurement error is when they are d-connected.	I.11
Nondifferential measurement error	Measurement error is nondifferential with respect to the outcome if $U_{A}$ and $Y$ are d-separated. Measurement error is nondifferential with respect to the treatment if $U_{Y}$ and $A$ are d-separated.	I.11
Intention-to-treat effect	The causal effect of randomized treatment assignment $Z$ in an intention-to-treat trial on the outcome $Y$. Depends on the strength of the effect of assigned treatment on outcome ($Z \rightarrow Y$), of assigned treatment on actual treatment received ($Z \rightarrow A$), and of the actual treatment received on outcome ($A \rightarrow Y$). In theory, this does not require adjustment for confounding, has null preservation, and is conservative. See below for comments on the latter two.	I.116
The exclusion restriction	(The goal of double-blinding). The assumption that there is no direct arrow from assigned treatment $Z$ to outcome $Y$ in an intention-to-treat design.	I.117
Null Preservation in an ITT	If treatment $A$ has a null effect on $Y$, then assigned treatment $Z$ also has a null effect on $Y$. Ensures, in theory, that a null effect will be declared when no effect exists. However, it requires that the exclusion restriction holds, which breaks down unless there is perfect double-blinding.	I.119
Conservatism of the ITT vs Per-protocol	The ITT effect is supposed to be closer to the null than the value of the per-protocol effect, because imperfect adherence results in attenuation rather than exaggeration of effect. Thus ITT appears to be a lower bound for per-protocol effect (and is thus conservative). However, there are three issues with this: 1. Argument assumes monotonicity of effects (treatment same direction for all patients). If, say, there is inconsistent adherence and thus inconsistent effects, then this could become anti-conservative. 2. Even given monotonicity, ITT would only be conservative compared to placebos, not necessarily head-to-head trials, where adherence in the second drug might be different. 3. Even if ITT is conservative, this makes it dangerous when goal is evaluating safety, where you arguably want to be more aggressive in finding signal.
Per-protocol effect	The causal effect of randomized treatment that would have been observed if all individuals had adhered to their assigned treatment as specified in the protocol of the experiment. Requires adjustment for confounding.	I.116
As-treated analysis	An analysis to assess for per-protocol effect. Includes all patients and compares those treated ($A=1$) vs not treated ($A=0$), independent of their assignment $Z$. Confounded.	I.118
Conventional per-protocol analysis	An analysis to assess for per-protocol effect. Limits the population to those who adhered to the study protocol, subsetting to those for whom $A=Z$. Induces a selection bias on $A=Z$, and thus still requires adjustment on $L$.	I.118
Tradeoff between ITT and Per-protocol	Summary: Estimating the per-protocol effect adds unmeasured confounding, which needs to be (imperfectly) adjusted for. Intention-to-treat adds a misclassification bias, and does not necessarily deliver on purported guarantees of conservatism. As such, there is a real tradeoff, here.	I.117-I.120

Chapter 10: Random Variability

Sorry, I’m skipping this section, because the key terms are all stats concepts and it’s mostly a pump-up chapter for the rest of the book.

Chapter 11: Why Model?

Concept	Definition or Notes	Page
Saturated Models	Models that do not impose restrictions on the data distribution. Generally, these are models whose number of parameters in a conditional mean model is equal to the number of means. For example, a linear model E[y\|x] ~ b0 + b1x when the population is stratified into only two groups. These are non-parametric models.	II.143
Non-parametric estimator	Estimators that produce estimates from the data without any a priori restrictions on the true function. When using the entire population rather than a sample, these yield the true value of the population parameter.	II.143

Chapter 12:

Concept	Definition or Notes	Page
Stabilized Weights
Marginal Structural Model

DONT MISS THE DOUBLY ROBUST ESTIMATOR in TECHNICAL POINT 13.2

FAQ on Medical Adversarial Attacks Policy Paper

Thu, 21 Mar 2019 00:00:00 -0400

What’s the paper and why this FAQ?
Do you think adversarial attacks are the biggest concern in using machine learning in healthcare? (A: Nowhere close!) Then why write the paper?
There seems to have been something of a pivot between the preprint and the policy forum discussion, with the latter focusing much less on images. Was this intentional?
In the paper, you frame existing examples like upcoding and claims craftsmanship as adversarial attacks, or at least as their precursors. Is that fair?
Isn’t this unrealistic? I mean, would there ever be cases when someone actually uses adversarial examples?
“Adversarial attacks” sounds scary. Do you think people will use these as tools to hurt people by hacking diagnostics, etc?
Are you hoping to stall the development of medical ML because of adversarial attacks?
Small note on the figure

What’s the paper and why this FAQ?

Last spring, some colleagues (chiefly Andy Beam) and I released a preprint on adversarial attacks on medical computer vision systems. This manuscript was targeted at a technical audience. It was written with the goal of explaining why adversarial attacks researchers should consider healthcare applications among their threat models, as well as to provide a few technical examples as a proof of concept. I ended up getting a lot of great feedback/pushback via email and Twitter, which I really appreciated and which informed an update of the preprint on arXiv.

After the article was released, we were also put in touch with Jonathan Zittrain and John Bowers from Harvard Law School as well as Joi Ito of the MIT Media Lab. These are incredibly thoughtful people with a lot of amazing experience. We decided to write a follow-up article targeted more at medical and policy folks, with the intention of examining precedent for adversarial attacks in the healthcare system as it exists today and initiating a conversation about what to do about them going forward. The result is being published today in Science, here. It’s been an absolute pleasure working with these guys.

We really tried hard to be thoughtful and measured. Given the nature of the topic, however, I’ve been fretting a bit that the paper will be misconstrued/taken out of context. At a minimum, I anticipate getting a lot of the same questions I got the first time around on the preprint, and figured it’d be easier to write up answers to these in one place. The paper is short and non-technical enough that it doesn’t really need a blog post/explainer per se, so I opted to go with a “FAQ.” Hope it’s not too obnoxious.

Do you think adversarial attacks are the biggest concern in using machine learning in healthcare? (A: Nowhere close!) Then why write the paper?

Adversarial attacks constitute just one small part of a large taxonomy of potential pitfalls of machine learning (both ML in general and medical ML in particular).

When I think about points of failure of medical machine learning, I think first about things like: dataset shift, accidentally fitting confounders or healthcare dynamics instead of true signal, discriminatory bias, overdiagnosis, or job displacement. Especially given recent challenges in getting ML to generalize to new populations, there are also uncomfortable questions to ask about when and how we can be sure we’re ready to deploy an ML algorithm in a new patient population.

While all of these issues may have general implications for policy, the way I think about them most is in context of how they inform our evaluations of individual ML systems. Each of the above issues demands that specific questions be asked of the systems that we’re evaluating. Questions like: what population was this model fit on, and how does it compare to the population the system will be used in? How could the data I’m feeding this algorithm have changed in the time since the model was developed? Have we thought carefully about the workflow so these algorithms are getting applied to patients with the right priors and the healthcare providers know how to properly act upon positive tests when the time comes?

Our main goal in this work was, in many ways, simply to point out that adversarial attacks at least deserve acknowledgement as one of these potential pitfalls. Questions this reality might prompt us to ask when evaluating a specific system include: Is there a mismatch in incentives between the person developing/hosting the algorithm and the person sending data into that algorithm? If so, are we prepared for the fact that those providing data to the algorithm might try to intentionally craft that data to achieve the results they want? If we decide to try to use models more robust to adversarial attacks, to what extent are we comfortable trading off accuracy in order to do so?

In many application settings, the answer to the incentives question may simply be “no.” But I don’t think that’s necessarily the case for all possible applications of machine learning in healthcare. To boot, we as authors have been slightly disconcerted by the fact that when speaking to high-level decision makers at hospitals, insurance companies, and elsewhere who are investing heavily in ML, they generally aren’t even aware of the existence of adversarial examples. So it’s really that mismatch in awareness relative to other pitfalls of ML that prompted the paper, even if in the grand scheme of things adversarial attacks are just one piece of a very large pie.

Finally – and perhaps most importantly – adversarial examples provide a proof-of-concept for a certain collection of issues with modern machine learning methods. More specifically, adversarial techniques help us assess the worst-case performance against new data distributions, and demonstrate that current models fail to encode key invariants in the classes that we are trying to model. This has implications not just for the susceptibility of these algorithms to manipulation, but more fundamentally for our ability to trust these systems in any safety-critical settings. To boot, it does so in a way that is very tangible for researchers who are trying to design better models that can encode arbitrary invariants and whose behavior aligns exactly with how humans would want/expect them to behave. Aleksander Madry calls this field of research “ML alignment,” which I think is a good phrase. (Addendum: Catherine Olsson has written a great Medium post that makes many of these same points more thoughtfully and with more nuance. I highly recommend it if you’re interested in this topic.)

There seems to have been something of a pivot between the preprint and the policy forum discussion, with the latter focusing much less on images. Was this intentional?

Yes! Our preprint was geared toward a technical audience, and was largely motivated by a desire to get people who work on ML security/robustness research to start thinking about healthcare when considering attacks and defenses, rather than just things more native to the CS world like self-driving cars. At the time, the bulk of high-profile work – both in adversarial attacks and in medical ML – had been done in the computer vision space, so we decided to focus on this for our initial deep dive and in building our three proofs of concept.

As we thought a lot more deeply about the problem, however, we realized that we should probably expand our scope. The bulk of ML happening today in the healthcare industry isn’t in the form of diagnostics algorithms, but is being used internally at insurance companies to process claims directly for first-pass approvals/denials. And the best examples of existing adversarial attack-like behavior take place in context of providers manipulating these claims. These provide a jumping off point to understand a spectrum of emerging motivations for adversarial behavior across all aspects of the healthcare system and across many different forms of ML. (See the next section on this as well.)

In the paper, you frame existing examples like upcoding and claims craftsmanship as adversarial attacks, or at least as their precursors. Is that fair?

I think so. The paper “adversarial classification” from KDD ‘04 even talks specifically about fraud detection along with spam and other applications of adversarial attacks.

For a few years, the adversarial examples community focused really heavily on human-imperceptible changes to images, usually computed using gradient tricks. But more recently, I think the community has (appropriately) returned to defining adversarial attacks as any method employed to craft one’s data to influence the behavior of an ML algorithm that processes it. As Gilmer et al say, “what is important about an adversarial example is that an adversary supplied it, not that the example itself is somehow special.” Such framings of the problem allow even for natural data identified through simple techniques like guess-and-check and grid search to be adversarial examples, so long as they are used with adversarial intent, and indeed some recent papers in major CS venues have employed such techniques.

At present, the adversarial behavior in the context of things like medical claims appears to be limited to providers stumbling upon or essentially guess-and-checking combinations of codes that will provide higher rates of reimbursement/approval without committing overt fraud. (Some studies like this one have suggested a hefty cohort of physicians think that manipulating claims is even necessary in order to provide high-quality care.) In light of the last paragraph, I think you can make a reasonable case that this behavior itself already constitutes an adversarial attack on the ML systems used by insurance companies, though admittedly a fairly boring one from a technical point of view. But it may be getting more interesting. Hospitals invest immense resources in this process – up to $99k per physician per year – and I know for a fact that some providers are already investing heavily in software solutions to more explicitly optimize this stuff. Likewise, insurance companies are doubling down on AI solutions to fraud detection, including processing not just claims but things like medical notes. Now that computer vision algorithms are starting to get FDA approved for medical purposes, I think it’s also likely that payors and regulators will start leveraging this tech as well, which may lead to incentives for computer vision adversarial attacks, a hypothetical scenario at the center of our preprint.

In any event, the real motivation for the claims examples we focus on in the paper is not to call these out as adversarial attacks per se. Rather, it’s to demonstrate how incentives – both positive and negative – already exist in the healthcare system that motivate various players to subtly manipulate their data in order to achieve specific results. This is the soul of the adversarial attacks problem. As both the reach and sophistication of medical machine learning expand across the healthcare system, the techniques used to game these algorithms will likely expand significantly as well.

Isn’t this unrealistic? I mean, would there ever be cases when someone actually uses adversarial examples?

We got some really good and reasonable pushback on this point the first time around, and once again, I really appreciated it. (Partly) as a result, we’ve spent a lot more time the last few months thinking about the range of adversarial behavior in healthcare information exchange. We ended up shifting the focus a bit as a result. In any event, there’s a whole spectrum of threat models at play here.

Without rehashing the information about claims in the question just above this one too much, machine learning is being used pretty extensively (and more so every day, at increasing sophistication) to make first-pass approvals on claims. And while this seems like a purely financial/bureaucratic concern, this process does already have a major impact on healthcare – at least in the U.S. – today. Here is an example of writing from a doctor that explains the level of frustration here, which is reflective of common experiences. What’s more, there is something more subtle here; when I speak with clinicians, most of them feel like they get no formal feedback about what’s happening under the hood at insurance, so they can’t find any real rhyme or reason for which combination of claims is resulting in their denials. To boot, there are often many possible codes that could apply to any given procedure or diagnosis, and it’s a bit of a black box for which will be likely to receive pushback and which will get you the most reimbursement. Currently, most hospitals use extensive teams of human billers to do this process manually, but companies for automated billing exist, and I have personally spoken to physicians who are hoping to find more sophisticated software solutions to more explicitly optimize their billing to avoid these “hurdles.” And since many insurance companies are already starting to use NLP on notes, that will open up a whole new layer of complexity in the process. In light of all this, I actually feel that the dynamics we describe in this paper are not unrealistic at all.

Where we do get (explicitly) hypothetical is when it comes to things like adversarial attacks on imaging systems. I don’t think these are that realistic today, because I can’t find examples of insurance companies or regulators using computer vision algorithms for approvals yet. But in fairness, the first FDA approval for a CV algorithm just happened in 2018 and many more are on the way. Once CV is established as “legit” I think it’s likely that we’ll see these algorithms get more integrated into such decisions. But we aren’t there yet. Of course, even when we do get there, the adversarial imaging threat model also requires users to feel comfortable sending in adversarial attacks but not straight up fake images from other patients. But I think that there are technical and – probably more so – legal and moral reasons why physicians/companies would hesitate to send in overtly fraudulent images to a diagnostic algorithm at an insurance/regulatory body. In contrast, I think that many would be comfortable doing more subtle things like rotations/scalings or even just cherry-picking images that give them the best shot from the many images that are often acquired per patient. Per the recently published Rules of the Game, this type of behavior “counts” as an adversarial attack, at least according to many in the field. To boot, doing this effectively (and surely being robust to it) could entail advanced software even if the modifications themselves are simple. In other words, I continue to think that robustness/adversarial attacks researchers should take healthcare seriously as an area of application.

“Adversarial attacks” sounds scary. Do you think people will use these as tools to hurt people by hacking diagnostics, etc?

While this may be possible in certain circumstances in theory, I don’t think it’s particularly likely. By analogy, pacemaker hacks have been around for more than a decade, but I don’t see many people feeling motivated to execute them.

Are you hoping to stall the development of medical ML because of adversarial attacks?

Nope! Every author on this paper is very bullish on machine learning as a way to achieve positive impact in all aspects of the healthcare system. We explicitly state this in the paper, as well as the fact that we don’t think these concerns should slow things down, just be a part of an ongoing conversation.

Small note on the figure

As will be immediately recognized by anyone familiar with adversarial examples, the design for the top part of Figure 1 was inspired by Figure 1 in Goodfellow et al – though the noise itself was generated using a different attack method (PGD) and applied to different data. As it stands, the figure in our Science paper points to our preprint for details of how the attack was generated, and the Goodfellow et al paper is cited in the preprint. However, the Science paper itself doesn’t explicitly credit Goodfellow et al for the design idea. This wasn’t intentional. I pointed this out to the Science team, which decided against updating with a citation since the paper is cited via the preprint and all the actual content in the figure is either original or CC0. But I still feel bad about this. Sorry!

Deriving probability distributions using the Principle of Maximum Entropy

Thu, 16 Mar 2017 06:00:00 -0400

Introduction
- Maximum Entropy Principle
- Lagrange Multipliers
1. Derivation of maximum entropy probability distribution with no other constraints (uniform distribution)
- Satisfy constraint
- Putting Together
2. Derivation of maximum entropy probability distribution for given fixed mean $\mu$ and variance $\sigma^{2}$ (gaussian distribution)
3. Derivation of maximum entropy probability distribution of half-bounded random variable with fixed mean $\bar{r}$ (exponential distribution)
4. Maximum entropy of random variable over range $R$ with set of constraints $\left\langle f_{n}(x)\right\rangle =\alpha_{n}$ with $n=1\dots N$ and $f_{n}$ is of polynomial order

Introduction

In this post, I derive the uniform, gaussian, exponential, and another funky probability distribution from the first principles of information theory. I originally did it for a class, but I enjoyed it and learned a lot so I am adding it here so I don’t forget about it.

I actually think it’s pretty magical that these common distributions just pop out when you are using the information framework. It feels so much more satisfying/intuitive than it did before.

Maximum Entropy Principle

Recall that information entropy is a mathematical framework for quantifying “uncertainty.” The formula for the information entropy of a random variable is $H(x) = - \int p(x)\ln p(x)dx$ . In statistics/information theory, the maximum entropy probability distribution is (you guessed it!) the distribution that, given any constraints, has maximum entropy. Given a choice of distributions, the “Principle of Maximum Entropy” tells us that the maximum entropy distribution is the best. Here’s a snippet of the idea from the wikipedia page:

The principle of maximum entropy states that, subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with largest entropy. Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. According to this principle, the distribution with maximal information entropy is the proper one. … In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.

Lagrange Multipliers

Given the above, we can use the maximum entropy principle to derive the best probability distribution for a given use. A useful tool in doing so is the Lagrange Multiplier (Khan Acad article, wikipedia), which helps us maximize or minimize a function under a given set of constraints.

For a single variable function $f(x)$ subject to the constraint $g(x) = c$, the lagrangian is of the form: $\mathcal{L}(x,\lambda) = f(x) - \lambda(g(x)- c)$ , which is then differentiated and set to zero to find a solution.

The above can then be extended to additional variables and constraints as:

\[\mathcal{L}(x_{1}\dots x_{n},\lambda_{1}\dots\lambda_{M}) = f(x_{1}\dots x_{n}) - \Sigma_{k=1}^{M}\lambda_{k}g_{k}(x_{1}\dots x_{n})\]

and solving

\[\nabla_{x_{1},\dots,x_{n},\lambda_{1}\dots \lambda_{M}}\mathcal{L}(x_{1}\dots x_{n},\lambda_{1}\dots\lambda_{M})=0\]

or, equivalently, solving

\[\begin{cases} \nabla f(x)-\Sigma_{k=1}^{M}\lambda_{k}\nabla g_{k}(x)=0\\ g_{1}(x)=\dots=g_{M}(x)=0 \end{cases}\]

In this case, since we are deriving probability distributions, the pdf must integrate to one, and as such, every derivation will include the constraint $(\int p(x)dx-1)=0$.
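To make the Lagrange multiplier recipe concrete, here is a minimal Python sketch on a discrete analogue of the problem (my own illustration, not part of the original class derivation). It assumes four outcomes, checks the stationarity condition $\nabla f = \lambda \nabla g$ at the uniform distribution, and compares its entropy against randomly drawn distributions that also satisfy the constraint. The names `entropy`, `entropy_gradient`, and `random_distribution` are hypothetical helpers.

```python
import math
import random

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def entropy_gradient(p):
    """Partial derivative of the entropy with respect to each p_i."""
    return [-(math.log(q) + 1.0) for q in p]

n = 4  # number of outcomes, chosen only for illustration
uniform = [1.0 / n] * n

# The constraint is g(p) = sum(p) - 1, so grad g = (1, ..., 1).
# Stationarity therefore requires every gradient component to equal one lambda.
grad = entropy_gradient(uniform)
lam = grad[0]
is_stationary = all(abs(g - lam) < 1e-12 for g in grad)

# Draw other points on the constraint surface and record their best entropy.
rng = random.Random(0)

def random_distribution(k):
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

best_random = max(entropy(random_distribution(n)) for _ in range(1000))

print("stationary at uniform:", is_stationary)
print("entropy of uniform:", entropy(uniform), "ln(n):", math.log(n))
print("no random draw beats uniform:", best_random <= entropy(uniform))
```

The random search is only a sanity check, not a proof; the derivations below do the real work for the continuous case.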

With all that, we can begin:

1. Derivation of maximum entropy probability distribution with no other constraints (uniform distribution)

First, we solve for the case where the only constraint is that the distribution is a pdf, which we will see is the uniform distribution. To maximize entropy, we want to minimize the following function:

\[J(p)=\int_{a}^{b} p(x)\ln p(x)dx+\lambda_{0}\left(\int_{a}^{b} p(x)dx-1\right)\]

. Taking the derivative with respect to $p(x)$ and setting to zero,

\[\frac{\delta J}{\delta p(x)}=1+\ln p(x)+\lambda_{0}=0\] \[\ln p(x)=-\lambda_{0}-1\] \[p(x)=e^{-\lambda_{0}-1}\]

, which in turn must satisfy

\[\int_{a}^{b} p(x)dx=1=\int_{a}^{b} e^{-\lambda_{0}-1}dx\]

Note: To check if this is a minimum (which would maximize entropy given the way the equation was set up), we also need to see if the second derivative with respect to $p(x)$ is positive here or not, which it clearly always is:

\[\frac{\delta^{2} J}{\delta p(x)^{2}}=\frac{1}{p(x)}\]

Satisfy constraint

\[\int_{a}^{b} p(x)dx=\int_{a}^{b} e^{-\lambda_{0}-1}dx=1\] \[e^{-\lambda_{0}-1} \int_{a}^{b} dx=1\] \[e^{-\lambda_{0}-1} (b-a) = 1\] \[e^{-\lambda_{0}-1} = \frac{1}{b-a}\] \[-\lambda_{0}-1 = \ln\frac{1}{b-a}\] \[\lambda_{0} = -1 -\ln \frac{1}{b-a}\]

Putting Together

Plugging the constraint $\lambda_{0} = -1 -\ln \frac{1}{b-a}$ into the pdf $p(x)=e^{-\lambda_{0}-1}$, we have:

\[p(x)=e^{-\lambda_{0}-1}\] \[p(x)=e^{-(-1 -\ln \frac{1}{b-a})-1}\] \[p(x)=e^{1 + \ln \frac{1}{b-a}-1}\] \[p(x)=e^{\ln \frac{1}{b-a}}\] \[p(x)=\frac{1}{b-a}\]

. Of course, this is only defined in the range between $a$ and $b$, however, so the final function is:

\[p(x)=\begin{cases} \frac{1}{b-a} & a\leq x \leq b\\ 0 & \text{otherwise} \end{cases}\]
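As a rough numerical check of this result, the sketch below estimates $-\int p(x)\ln p(x)dx$ with a midpoint rule for the uniform density and for two other densities supported on the same interval. The interval $[2, 5]$, the triangular and ramp densities, and the helper name `differential_entropy` are all my own assumptions for illustration.

```python
import math

def differential_entropy(pdf, lo, hi, steps=20_000):
    """Midpoint-rule estimate of the differential entropy of pdf on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(lo + (i + 0.5) * dx)
        if p > 0:
            total -= p * math.log(p) * dx
    return total

a, b = 2.0, 5.0  # arbitrary illustrative interval

def uniform_pdf(x):
    return 1.0 / (b - a)

def triangular_pdf(x):
    # Symmetric triangle peaked at the midpoint; integrates to 1 on [a, b].
    mid = 0.5 * (a + b)
    half = 0.5 * (b - a)
    return (1.0 - abs(x - mid) / half) / half

def ramp_pdf(x):
    # Linearly increasing density; integrates to 1 on [a, b].
    return 2.0 * (x - a) / (b - a) ** 2

h_uniform = differential_entropy(uniform_pdf, a, b)
h_triangular = differential_entropy(triangular_pdf, a, b)
h_ramp = differential_entropy(ramp_pdf, a, b)

print("uniform:", h_uniform, "ln(b - a):", math.log(b - a))
print("triangular:", h_triangular)
print("ramp:", h_ramp)
```

The uniform estimate should match $\ln(b-a)$, and the other two should come out smaller, which is what the derivation predicts for any competitor on $[a, b]$.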

2. Derivation of maximum entropy probability distribution for given fixed mean $\mu$ and variance $\sigma^{2}$ (gaussian distribution)

Now, for the case when we have a specified mean and variance, which we will see is the gaussian distribution. To maximize entropy, we want to minimize the following function:

\[J(p)=\int p(x)\ln p(x)dx+\lambda_{0}\left(\int p(x)dx-1\right)+\lambda_{1}\left(\int p(x)(x-\mu)^{2}dx-\sigma^{2}\right)\]

, where the first constraint is the definition of pdf and the second is the definition of the variance (which also gives us the mean for free). Taking the derivative with respect to $p(x)$ and setting to zero,

\[\frac{\delta J}{\delta p(x)}=1+\ln p(x)+\lambda_{0}+\lambda_{1}(x-\mu)^{2}=0\] \[\ln p(x)=-\lambda_{0}-1-\lambda_{1}(x-\mu)^{2}\] \[p(x)=e^{-\lambda_{0}-1-\lambda_{1}(x-\mu)^{2}}\]

, which in turn must satisfy

\[\int p(x)dx=1=\int e^{-\lambda_{0}-1-\lambda_{1}(x-\mu)^{2}}dx\]

and

\[\int p(x)(x-\mu)^{2}dx=\sigma^{2}=\int e^{-\lambda_{0}-1-\lambda_{1}(x-\mu)^{2}}(x-\mu)^{2}dx\]

Again, $\frac{\delta^{2} J}{\delta p(x)^{2}}=\frac{1}{p(x)}$ is always positive, so our solution will be a minimum.

Satisfy first constraint

Substituting $z=x-\mu$,

\[\int p(x)dx=1=\int e^{-\lambda_{0}-1-\lambda_{1}(x-\mu)^{2}}dx\] \[1=\int e^{-\lambda_{0}-1-\lambda_{1}z^{2}}dz\] \[1=e^{-\lambda_{0}-1}\int e^{-\lambda_{1}z^{2}}dz\] \[e^{\lambda_{0}+1}=\int e^{-\lambda_{1}z^{2}}dz\] \[e^{\lambda_{0}+1}=\sqrt{\frac{\pi}{\lambda_{1}}}\]

Satisfy second constraint

\[\int p(x)(x-\mu)^{2}dx=\sigma^{2}=\int e^{-\lambda_{0}-1-\lambda_{1}(x-\mu)^{2}}(x-\mu)^{2}dx\] \[\sigma^{2}=\int e^{-\lambda_{0}-1-\lambda_{1}z^{2}}z^{2}dz\] \[\sigma^{2}e^{\lambda_{0}+1}=\int e^{-\lambda_{1}z^{2}}z^{2}dz\] \[\sigma^{2}e^{\lambda_{0}+1}=\frac{1}{2}\sqrt{\frac{\pi}{\lambda_{1}^{3}}}\] \[\sigma^{2}e^{\lambda_{0}+1}=\frac{1}{2\lambda_{1}}\sqrt{\frac{\pi}{\lambda_{1}}}\] \[2\lambda_{1}\sigma^{2}e^{\lambda_{0}+1}=\sqrt{\frac{\pi}{\lambda_{1}}}\]

Putting together

\[\sqrt{\frac{\pi}{\lambda_{1}}}=e^{\lambda_{0}+1}=2\lambda_{1}\sigma^{2}e^{\lambda_{0}+1}\]

\[e^{\lambda_{0}+1}=2\lambda_{1}\sigma^{2}e^{\lambda_{0}+1}\] \[1=2\lambda_{1}\sigma^{2}\] \[\lambda_{1}=\frac{1}{2\sigma^{2}}\]

. Plugging in for the other lambda,

\[\sqrt{\frac{\pi}{\lambda_{1}}}=e^{\lambda_{0}+1}\] \[\sqrt{2\sigma^{2}\pi}=e^{\lambda_{0}+1}\] \[\ln\sqrt{2\sigma^{2}\pi}=\lambda_{0}+1\] \[\lambda_{0}=\ln\sqrt{2\sigma^{2}\pi}-1\]

Now, we plug back into the first equation

\[p(x)=e^{-\lambda_{0}-1-\lambda_{1}(x-\mu)^{2}}\] \[=e^{-\ln\sqrt{2\sigma^{2}\pi}-\frac{1}{2\sigma^{2}}(x-\mu)^{2}}\] \[=e^{-\ln\sqrt{2\sigma^{2}\pi}}e^{-\frac{1}{2\sigma^{2}}(x-\mu)^{2}}\] \[=\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\]

which we can note is, by definition, the pdf of the Gaussian!
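For a quick sanity check of the claim that the gaussian wins among distributions with a given variance, the sketch below compares standard closed-form differential entropies of three distributions matched to the same $\sigma^{2}$. The value of `sigma` is arbitrary, and the choice of Laplace and uniform as competitors is my own assumption for illustration.

```python
import math

sigma = 1.7  # arbitrary standard deviation shared by all three distributions

# Gaussian with variance sigma^2: H = 1/2 ln(2 pi e sigma^2).
h_gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# Laplace with scale b has variance 2 b^2 and entropy 1 + ln(2 b).
b = sigma / math.sqrt(2)
h_laplace = 1 + math.log(2 * b)

# Uniform of width w has variance w^2 / 12 and entropy ln(w).
w = sigma * math.sqrt(12)
h_uniform = math.log(w)

print("gaussian:", h_gaussian)
print("laplace: ", h_laplace)
print("uniform: ", h_uniform)
```

With the variance held fixed, the gaussian entropy should be the largest of the three, whatever value of `sigma` is chosen.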

3. Derivation of maximum entropy probability distribution of half-bounded random variable with fixed mean $\bar{r}$ (exponential distribution)

Now, constrain on a fixed mean, but no fixed variance, which we will see is the exponential distribution. To maximize entropy, we want to minimize the following function:

\[J(p)=\int_{0}^{\infty}p(x)\ln p(x)dx+\lambda_{0}\left(\int_{0}^{\infty}p(x)dx-1\right)+\lambda_{1}\left(\int_{0}^{\infty}x*p(x)dx-\bar{r}\right)\]

Now take derivative

\[\frac{\delta J}{\delta p(x)}=1+\ln p(x)+\lambda_{0}+\lambda_{1}x\]

To check if this is a minimum of the function, we need to see if the second derivative is positive with respect to $p(x)$, which it is:

\[\frac{\delta^{2} J}{\delta p(x)^{2}}=\frac{1}{p(x)}\]

Setting the first derivative to zero, we have

\[0=1+\ln p(x)+\lambda_{0}+\lambda_{1}x\] \[p(x)=e^{-\lambda_{0}-1-\lambda_{1}x}\]

, which must satisfy the constraints $\int_{0}^{\infty}p(x)dx=1$ and $\int_{0}^{\infty}x*p(x)dx=\bar{r}$.

Satisfying first constraint

\[\int_{0}^{\infty}p(x)dx=1\] \[\int_{0}^{\infty}e^{-\lambda_{0}-1-\lambda_{1}x}dx=1\] \[\int_{0}^{\infty}e^{-\lambda_{1}x}dx=e^{\lambda_{0}+1}\] \[\frac{1}{\lambda_{1}}=e^{\lambda_{0}+1}\] \[\lambda_{1}=e^{-\lambda_{0}-1}\]

Satisfying the second constraint

\[\int_{0}^{\infty}x*e^{-\lambda_{0}-1-\lambda_{1}x}dx=\bar{r}\] \[\int_{0}^{\infty}x*e^{-\lambda_{0}-1}e^{-\lambda_{1}x}dx=\bar{r}\]

substituting in $\lambda_{1}=e^{-\lambda_{0}-1}$ from above

\[\int_{0}^{\infty}x*\lambda_{1}e^{-\lambda_{1}x}dx=\bar{r}\]

Putting together

Rather than evaluating this last integral above, we can simply stop and note that in evaluating our constraints we have stumbled upon the formula for the mean of an exponential random variable with parameter $\lambda_{1}$!

More explicitly:

\[\int_{0}^{\infty}x*\lambda_{1}e^{-\lambda_{1}x}dx=\bar{r}\] \[\int_{0}^{\infty}x*p(x)dx=\bar{r}\]

where $p(x)=\lambda_{1}e^{-\lambda_{1}x}$ is the pdf of the exponential distribution for $x\ge0$. Since the mean of this distribution is $\frac{1}{\lambda_{1}}$, the constraint gives $\lambda_{1}=\frac{1}{\bar{r}}$.

In other words,

\[p(x)=\begin{cases} \frac{1}{\bar{r}}e^{-\frac{x}{\bar{r}}} & x\ge0\\ 0 & x<0 \end{cases}\]
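The final pdf can be checked numerically against both constraints. The sketch below is a minimal illustration, assuming a mean of `r_bar = 2.5` and truncating the half-line at `60 * r_bar` (where the remaining tail mass is negligible); `integrate` is a hypothetical midpoint-rule helper. It also compares the entropy with that of a uniform distribution on $[0, 2\bar{r}]$, which has the same mean.

```python
import math

r_bar = 2.5  # assumed mean, chosen only for illustration

def exponential_pdf(x):
    """The derived maximum entropy pdf for x >= 0 with mean r_bar."""
    return math.exp(-x / r_bar) / r_bar

def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule estimate of the integral of f over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

upper = 60 * r_bar  # truncation point; the tail beyond it is about e^-60

total_mass = integrate(exponential_pdf, 0.0, upper)
mean = integrate(lambda x: x * exponential_pdf(x), 0.0, upper)
h_exponential = integrate(
    lambda x: -exponential_pdf(x) * math.log(exponential_pdf(x)), 0.0, upper
)

# A uniform density on [0, 2 * r_bar] also has mean r_bar; its entropy is ln(2 * r_bar).
h_uniform_same_mean = math.log(2 * r_bar)

print("normalization:", total_mass)
print("mean:", mean)
print("entropy:", h_exponential, "closed form 1 + ln(r_bar):", 1 + math.log(r_bar))
print("uniform with same mean:", h_uniform_same_mean)
```

Both constraints should hold to within the integration error, and the exponential's entropy should exceed that of the uniform competitor.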

4. Maximum entropy of random variable over range $R$ with set of constraints $\left\langle f_{n}(x)\right\rangle =\alpha_{n}$ with $n=1\dots N$ and $f_{n}$ is of polynomial order

$f_{n}$ must be even order for all enforced constraints.

Following the same approach as above:

\[J(p)=-\int p(x)\ln p(x)dx+\lambda_{0}\left(\int p(x)dx-1\right)+\Sigma_{i=1}^{N}\lambda_{i}\left(\int p(x)f_{i}(x)dx-\alpha_{i}\right)\] \[\frac{\delta J}{\delta p(x)}=-1-\ln p(x)+\lambda_{0}+\Sigma_{i=1}^{N}\lambda_{i}f_{i}(x)\] \[0=-1-\ln p(x)+\lambda_{0}+\Sigma_{i=1}^{N}\lambda_{i}f_{i}(x)\] \[p(x)=e^{\lambda_{0}-1+\Sigma_{i=1}^{N}\lambda_{i}f_{i}(x)}\]

where each $f_{i}(x)=\Sigma_{j=1}^{M}b_{j}x^{j}$.

We now consider the conditions in which the random variable can be defined in the entire domain $(-\infty,\infty)$. Looking at the normalization constraint,

\[\int p(x)dx=\int e^{\lambda_{0}-1+\Sigma_{i=1}^{N}\lambda_{i}f_{i}(x)}dx=1\]

we note that we need our exponential function to integrate to 1. In order for this equation to be defined in the entire real domain, we thus will need the exponential function to integrate to a finite value, so that we can provide a normalization constant that will result in integration to 1.

Looking at the function $e^{\lambda_{0}-1+\Sigma_{i=1}^{N}\lambda_{i}f_{i}(x)}$ (which must remain finite for all $x$), we can thus conclude that $\lambda_{0}-1+\Sigma_{i=1}^{N}\lambda_{i}f_{i}(x)$ must not diverge to positive infinity, but may diverge to negative infinity (because that would cause the exponential to converge to zero) or converge to any finite value as $x$ approaches positive or negative infinity. The only components of this function that depend on $x$ are the polynomial constraints of form $f_{i}(x)=\Sigma_{j=1}^{M}b_{j}x^{j}$. As such, these constraints are the only components at risk of forcing the function towards infinity, provided that $\lambda_{0}\neq\infty.$ Therefore, because the $\lambda_{i}$ corresponding to any $f_{i}$ can be positive or negative, the function can be defined so long as each $f_{i}(x)=\Sigma_{j=1}^{M}b_{j}x^{j}$ is either bounded above for all $x$ (it never diverges to $+\infty$) or bounded below for all $x$ (it never diverges to $-\infty$).

Finally, we can consider the conditions for which these criteria for $f_{i}$ will be satisfied. In short, the only way to guarantee that $f_{i}$ remains either positive or negative will be if the dominant component of the polynomial $f_{i}$ is of an EVEN order for all $i$ s.t. $\lambda_{i}\neq0$. If the dominant component is odd, then $f_{i}$ will either move from negative infinity to positive infinity (or, if negated, from positive infinity to negative infinity) as $x$ moves across the domain, which means that no finite and nonzero $\lambda_{i}$ could be chosen to maintain the criteria outlined above.

Deriving the information entropy of the multivariate gaussian

Sat, 11 Mar 2017 05:00:00 -0500

Introduction and Trace Tricks
Derivation

Introduction and Trace Tricks

The pdf of a multivariate gaussian is as follows:

\[p(x) = \frac{1}{(\sqrt{2\pi})^{N}\sqrt{\det\Sigma}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\]

, where

\[\Sigma_{i,j} = E[(x_{i} - \mu_{i})(x_{j} - \mu_{j})]\]

is the covariance matrix, which can be expressed in vector notation as

\[\Sigma = E[(X-E[X])(X-E[X])^{T}] = \int p(x)(x-\mu)(x-\mu)^{T}dx\]

. I might make the derivation of this formula its own post at some point, but it is in Strang’s intro to linear algebra text so I will hold off. Instead, this post derives the entropy of the multivariate gaussian, which is equal to:

\[H=\frac{N}{2}\ln\left(2\pi e\right)+\frac{1}{2}\ln\det\Sigma\]

Part of the reason why I do this is because the second part of the derivation involves a “trace trick” that I want to remember how to use for the future. The key to the “trace trick” is to recognize that a matrix (slash set of multiplied matrices) is 1x1, and that the value of any such matrix is, by definition, equal to its trace. This then allows you to invoke the quasi-commutative property of the trace:

\[\text{tr}(UVW)=\text{tr}(WUV)\]

to push around the matrices however you desire until they become something tidy/useful. The whole thing feels rather devious to me, personally.
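Here is a small pure-Python check of the trick on a concrete example (a sketch of my own, with an arbitrary symmetric matrix `A` standing in for $\Sigma^{-1}$ and an arbitrary column vector `x` standing in for $x-\mu$). The helpers `matmul`, `transpose`, and `trace` are hypothetical and written out by hand to keep the example dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    return [list(row) for row in zip(*m)]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

# Arbitrary symmetric 3 x 3 matrix and 3 x 1 column vector, for illustration only.
A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 4.0]]
x = [[1.0], [-2.0], [0.5]]

# x^T A x is a 1 x 1 matrix, so it equals its own trace.
quadratic_form = matmul(matmul(transpose(x), A), x)
scalar = quadratic_form[0][0]

# Cycling the factors inside the trace leaves the value unchanged.
outer = matmul(x, transpose(x))           # x x^T, a 3 x 3 matrix
cycled_once = trace(matmul(outer, A))     # tr(x x^T A)
cycled_twice = trace(matmul(A, outer))    # tr(A x x^T)

print(scalar, cycled_once, cycled_twice)
```

All three printed values should agree, even though the intermediate matrices have different shapes.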

Derivation

Setup

Beginning with the definition of entropy

\[H(x)=-\int p(x)*\ln p(x)dx\]

substituting in the probability function for the multivariate gaussian in only its second occurrence in the formula,

\[H(x)=-\int p(x)*\ln\left(\frac{1}{(\sqrt{2\pi})^{N}\sqrt{\det\Sigma}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\right)dx\] \[=-\int p(x)*\ln\left(\frac{1}{(\sqrt{2\pi})^{N}\sqrt{\det\Sigma}}\right)dx-\int p(x)\ln\left(e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\right)dx\]

We will now consider these two terms separately.

First term

First, we concern ourselves with the first ln term:

\[-\int p(x)*\ln\left(\frac{1}{(\sqrt{2\pi})^{N}\sqrt{\det\Sigma}}\right)dx\] \[=\int p(x)*\ln\left((\sqrt{2\pi})^{N}\sqrt{\det\Sigma}\right)dx\]

since all the terms other than $p(x)$ form a constant,

\[=\left(\ln\left((\sqrt{2\pi})^{N}\sqrt{\det\Sigma}\right)\right)\int p(x)dx\]

and because $p(x)$ is a PDF, it integrates to 1. Thus, this component of the equation is

\[\ln\left((\sqrt{2\pi})^{N}\sqrt{\det\Sigma}\right)\] \[=\frac{N}{2}\ln\left(2\pi\right)+\frac{1}{2}\ln\det\Sigma\]

Second term (Trace Trick Coming!)

Now we consider the second ln term

\[-\int p(x)\ln\left(e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\right)dx\] \[=\int p(x)\frac{1}{2}\ln\left(e^{(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\right)dx\]

because $(x-\mu)^{T}$ is a 1 x N matrix, $\Sigma^{-1}$ is a N x N matrix, and $(x-\mu)$ is a N x 1 matrix, the matrix product $(x-\mu)^{T}\Sigma^{-1}(x-\mu)$ is a 1 x 1 matrix. Further, because the trace of any 1 x 1 matrix $\text{tr}(A)=\Sigma_{i=1}^{n}A_{i,i}=A_{1,1}=A$, we can conclude that the 1 x 1 matrix $(x-\mu)^{T}\Sigma^{-1}(x-\mu)=\text{tr}((x-\mu)^{T}\Sigma^{-1}(x-\mu))$.

As such, our term becomes

\[=\int p(x)\frac{1}{2}\ln\left(e^{\text{tr}\left[(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right]}\right)dx\] \[=\frac{1}{2}\int p(x)\ln\left(e^{\text{tr}\left[(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right]}\right)dx\]

, which, by the quasi-commutativity property of the trace function, $\text{tr}(UVW)=\text{tr}(WUV)$,

\[=\frac{1}{2}\int p(x)\ln\left(e^{\text{tr}\left[\Sigma^{-1}(x-\mu)(x-\mu)^{T}\right]}\right)dx\]

. Because $p(x)$ is a scalar and the natural logarithm and exponentials may cancel, the properties of the trace function allow us to push the $p(x)$ and the integral inside of the trace, so

\[=\frac{1}{2}\int\ln\left(e^{\text{tr}\left[\Sigma^{-1}p(x)(x-\mu)(x-\mu)^{T}\right]}\right)dx\] \[=\frac{1}{2}\ln\left(e^{\text{tr}\left[\Sigma^{-1}\int p(x)(x-\mu)(x-\mu)^{T}dx\right]}\right)\]

But, $\int p(x)(x-\mu)(x-\mu)^{T}dx=\Sigma$ is just the definition of the covariance matrix! As such,

\[=\frac{1}{2}\ln\left(e^{\text{tr}\left[\Sigma^{-1}\Sigma\right]}\right)\] \[=\frac{1}{2}\ln\left(e^{\text{tr}\left[I_{N}\right]}\right)\] \[=\frac{1}{2}\ln\left(e^{N}\right)\] \[=\frac{N}{2}\ln\left(e\right)\]

Recombining the terms

Bringing the above terms back together, we have

\[H(x)=\frac{N}{2}\ln\left(2\pi\right)+\frac{1}{2}\ln\det\Sigma+\frac{N}{2}\ln\left(e\right)\] \[=\frac{N}{2}\ln\left(2\pi e\right)+\frac{1}{2}\ln\det\Sigma\]

as desired.
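One way to sanity-check the final formula is on a case where the answer is known by other means. The sketch below (my own illustration, restricted to the 2 x 2 case so the determinant can be written by hand) confirms that for a diagonal covariance the joint entropy equals the sum of the two univariate gaussian entropies, and that adding correlation with the same variances lowers the entropy. The function names and covariance values are assumptions.

```python
import math

def det2(m):
    """Determinant of a 2 x 2 matrix stored as a list of rows."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def gaussian_entropy_2d(cov):
    """H = N/2 ln(2 pi e) + 1/2 ln det(cov), specialised to N = 2."""
    n = 2
    return 0.5 * n * math.log(2 * math.pi * math.e) + 0.5 * math.log(det2(cov))

def gaussian_entropy_1d(variance):
    """Entropy of a univariate gaussian: 1/2 ln(2 pi e sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)

# Independent components: the joint entropy should be the sum of the marginals.
diagonal_cov = [[1.5, 0.0],
                [0.0, 0.4]]
h_joint = gaussian_entropy_2d(diagonal_cov)
h_sum_of_marginals = gaussian_entropy_1d(1.5) + gaussian_entropy_1d(0.4)

# Same variances with nonzero covariance (still positive definite).
correlated_cov = [[1.5, 0.6],
                  [0.6, 0.4]]
h_correlated = gaussian_entropy_2d(correlated_cov)

print("joint (diagonal):", h_joint)
print("sum of marginals:", h_sum_of_marginals)
print("joint (correlated):", h_correlated)
```

The correlated case comes out lower because the off-diagonal terms shrink the determinant, which matches the intuition that knowing one coordinate now tells you something about the other.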

Rule	Example
1. If there are no variables being conditioned on, a path is blocked if and only if two arrowheads on the path collide at some variable on the path.	\(L \rightarrow A \rightarrow Y\) is open. \(A \rightarrow Y \leftarrow L\) is blocked at \(Y\)
2. Any path that contains a noncollider that has been conditioned on is blocked.	Conditioning on \(B\) blocks the path from \(A\) to \(Y\).
3. A collider that has been conditioned on does not block a path	The path between \(A\) and \(Y\) is open after conditioning on \(L\).
4. A collider that has a descendant that has been conditioned on does not block a path.	The path between \(A\) and \(Y\) is open after conditioning on \(C\), a descendant of collider \(L\).
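The four rules above can be sketched as a small Python function that checks a single path. This is a minimal illustration under my own assumptions: the DAG is given as a set of `(parent, child)` tuples, a path is a list of adjacent node names, and `descendants` and `is_blocked` are hypothetical helper names. It checks one path at a time and does not enumerate all paths between two variables.

```python
def descendants(node, edges):
    """All nodes reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def is_blocked(path, edges, conditioned=frozenset()):
    """Apply the four rules to one path through a DAG."""
    conditioned = set(conditioned)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            # Rules 1, 3, 4: a collider blocks the path unless the collider
            # itself or one of its descendants has been conditioned on.
            if not ({node} | descendants(node, edges)) & conditioned:
                return True
        elif node in conditioned:
            # Rule 2: a noncollider that has been conditioned on blocks the path.
            return True
    return False

# Rule 1: L -> A -> Y is open; A -> Y <- L is blocked at Y.
rule1_open = is_blocked(["L", "A", "Y"], {("L", "A"), ("A", "Y")})
rule1_blocked = is_blocked(["A", "Y", "L"], {("A", "Y"), ("L", "Y")})

# Rule 2: conditioning on noncollider B blocks A -> B -> Y.
rule2 = is_blocked(["A", "B", "Y"], {("A", "B"), ("B", "Y")}, {"B"})

# Rule 3: conditioning on collider L opens A -> L <- Y.
rule3 = is_blocked(["A", "L", "Y"], {("A", "L"), ("Y", "L")}, {"L"})

# Rule 4: conditioning on C, a descendant of collider L, also opens the path.
rule4 = is_blocked(["A", "L", "Y"], {("A", "L"), ("Y", "L"), ("L", "C")}, {"C"})

print(rule1_open, rule1_blocked, rule2, rule3, rule4)
```

Each call returns `True` when the path is blocked, so the five examples mirror the table rows in order.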

DAG	Example	Notes	Page
	L: Being physically fit A: Working as a firefighter Y: Mortality	The path \(A \rightarrow Y\) is a causal path from \(A\) to \(Y\). \(A \leftarrow L \rightarrow Y\) is a backdoor path between \(A\) and \(Y\), mediated by common cause (confounder) \(L\). Conditioning on \(L\) will block the backdoor path, induce conditional exchangeability, and allow for causal inference. Note: This is an example of “healthy worker bias.”	I.83
	A: Aspirin Y: Stroke L: Heart Disease U: Atherosclerosis (unmeasured)	This DAG is an example of confounding by indication (or channeling). Aspirin will have a confounded association with stroke, both from heart disease (\(L \rightarrow A \rightarrow Y\)), and from atherosclerosis (\(U \rightarrow L \rightarrow A \rightarrow Y\)). Conditioning on unmeasured \(U\) is impossible, but there is no unmeasured confounding given \(L\), so conditioning on \(L\) is sufficient.	I.84
	A: Exercise Y: Death L: Smoking status U: Social Factors (unmeasured) or Subclinical Disease (undetected)	Conditioning on \(L\) is again sufficient to block the backdoor path in this case.	I.84
	A: Physical activity Y: Cervical Cancer L: Pap smear U_1: Pre-cancer lesion (unmeasured here) U_2: Health-conscious personality (unmeasured)	Example shows how conditioning on a collider can induce bias. Adjustment for \(L\) (e.g. by restricting to negative tests \(L=0\)) will induce bias by opening a backdoor path between \(A\) and \(Y\) (\(A \leftarrow U_2 \rightarrow L \leftarrow U_1 \rightarrow Y\)), previously blocked by the collider. This is a case of selection bias. Thus, after conditioning, association between \(A\) and \(Y\) would be a mixture of association due to effect of \(A\) on \(Y\) and backdoor path. In other words, there is no unconditional bias, but there would be a conditional bias for at least one stratum of \(L\).	I.88
	(Labels not in book) A: Antacid L: Heartburn Y: Heart attack U: Obesity	A nonconfounding example in which traditional analysis might lead you to adjust for \(L\), but doing so would induce a bias.	I.89
	A: Physical activity L: Income Y: Cardiovascular disease U: Socioeconomic status	\(L\) (income) is not a confounder, but is a measurable variable that could serve as a surrogate confounder for \(U\) (socioeconomic status) and thus could be used to partially adjust for the confounding from \(U\). In other words, conditioning on \(L\) will result in a partial blockage of the backdoor path \(A \leftarrow U \rightarrow Y\).	I.90
Normal DAG: Corresponding SWIG:	A: Aspirin Y: Stroke L: Heart Disease U: Atherosclerosis (unmeasured)	Represents data from a hypothetical intervention in which all individuals receive the same treatment level \(a\). Treatment is split into two sides: (a) Left side encodes the values of treatment \(A\) that would have been observed in the absence of intervention (the natural value of treatment) (b) Right side encodes the treatment value under the intervention. No arrow enters \(a\) because \(a\) is the same everywhere. Conditional exchangeability \(Y^{a} \unicode{x2AEB} A \| L\) holds because all paths between \(Y^{a}\) and \(A\) are blocked after conditioning on \(L\).	I.91
Normal DAG: Corresponding SWIG:	A: Physical activity Y: Cervical Cancer L: Pap smear U_1: Pre-cancer lesion (unmeasured here) U_2: Health-conscious personality (unmeasured)	Here, marginal exchangeability \(Y^{a} \unicode{x2AEB} A\) holds because, on the SWIG, all paths between \(Y^{a}\) and \(A\) are blocked without conditioning on \(L\). Conditional exchangeability \(Y^{a} \unicode{x2AEB} A \| L\) does not hold because, on the SWIG, the path \(Y^{a} \leftarrow U_1 \rightarrow L \leftarrow U_2 \rightarrow A\) is open when the collider \(L\) is conditioned on. Taken together, marginal \(A-Y\) association is causal but conditional association \(A-Y\) given \(L\) is not.	I.91
Normal DAG: Corresponding SWIG:	(Example labels not in book) A: Statins Y: Coronary artery disease L: HDL/LDL U: Race	In this example, the SWIG is used to highlight a failure of the DAG to provide conditional exchangeability \(Y^{a} \unicode{x2AEB} A \| L\). In the SWIG, the factual variable \(L\) is replaced by the counterfactual variable \(L^{a}\). In this SWIG, counterfactual exchangeability \(Y^{a} \unicode{x2AEB} A \| L^{a}\) holds, since \(L^{a}\) blocks the paths from \(Y^{a}\) to \(A\). But \(L\) is not even on the graph, so we can’t conclude \(Y^{a} \unicode{x2AEB} A \| L\) holds. The problem being highlighted here is that \(L\) is a descendant of the treatment \(A\) blocking the path to \(Y\). In contrast, if the arrow from \(A\) to \(L\) didn’t exist, \(L\) would not be a descendant of \(A\) and adjusting for \(L\) would eliminate all bias, even if \(L\) were still in the future of \(A\). Thus, confounders are allowed to be in the future of the treatment, they just can’t be descendants.	I.92
	A: Aspirin Y: Blood Pressure U: History of heart disease (unmeasured) C: Blood pressure right before treatment (“placebo test” aka “negative outcome control”)	This example was used to show difference-in-difference and negative outcome controls. The idea: We cannot compute the effect of \(A\) on \(Y\) via standardization or IP weighting because there is unmeasured confounding. Instead, we first measure the (“negative”) outcome \(C\) right before treatment. Obviously \(A\) has no effect on \(C\), but we can assume that \(U\) will have the same confounding effect on \(C\) that it has on \(Y\). As such, we take the effect in the treated to be the effect of \(A\) on \(Y\) (treatment effect + confounding effect) minus the effect of \(A\) on \(C\) (confounding effect). This is the difference-in-differences. Negative outcome controls are sometimes used to try to detect confounding.	I.95
	(No example labels in text) A: Aspirin M: Platelet Aggregation Y: Heart Attack U: High Cardiovascular Risk	This example demonstrates the frontdoor criterion (see notes or page I.96 for more details). Given this DAG, it is impossible to directly use standardization or IP weighting, because the unmeasured variable \(U\) is necessary to block the backdoor path between \(A\) and \(Y\). However, the frontdoor adjustment can be used because: (i) the effect of \(A\) on \(M\) can be computed without confounding, and (ii) the effect of \(M\) on \(Y\) can be computed because \(A\) blocks the only backdoor path.	I.95
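
The difference-in-differences idea in the aspirin example can be sketched with a small simulation. Everything below is hypothetical: the effect size, the blood pressure values, the treatment probabilities, and the assumption that \(U\) shifts \(C\) and \(Y\) by the same additive amount (which is what the method relies on). It is an illustration under those assumptions, not an analysis from the book.

```python
import random

random.seed(0)
TRUE_EFFECT = -5.0  # hypothetical effect of aspirin (A) on blood pressure (Y)

rows = []
for _ in range(20000):
    u = random.random() < 0.5                  # U: unmeasured history of heart disease
    a = random.random() < (0.8 if u else 0.2)  # U makes treatment more likely
    c = 120 + 10 * u + random.gauss(0, 5)      # C: pre-treatment BP, affected by U but not A
    y = 120 + 10 * u + TRUE_EFFECT * a + random.gauss(0, 5)  # Y: post-treatment BP
    rows.append((a, c, y))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Association between A and Y: treatment effect + confounding by U.
naive = mean(y for a, c, y in rows if a) - mean(y for a, c, y in rows if not a)
# Association between A and C: confounding by U only, since A cannot affect C.
confounding = mean(c for a, c, y in rows if a) - mean(c for a, c, y in rows if not a)
# Difference-in-differences estimate of the effect in the treated.
did = naive - confounding

print(f"naive: {naive:.2f}, confounding: {confounding:.2f}, DiD: {did:.2f}")
```

In this simulated setup the naive contrast even has the wrong sign, while subtracting the association with the negative outcome control brings the estimate back close to `TRUE_EFFECT`.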

Example	Notes	Page
(Note: Missing arrow: \(A \rightarrow Y\) ) A: Occupational exposure Y: Mortality C: Being at Work U: True health status L: Blood tests and physical exam ———— W: Exposed jobs are eliminated and workers laid off	(Note: DAGs 8.3/8.5 work just as well here.) Healthy worker bias: If we restrict a factory cohort study to those individuals who are actually at work, we miss out on the people who are not working due to either: (a) disability caused by exposure, or (b) a common cause of not working and not being exposed.	I.99
(Note: Missing arrow: \(A \rightarrow Y\) ) A: Smoking status Y: Coronary heart disease C: Consent to participate U: Family history L: Heart disease awareness ———— W: Lifestyle	(Note: DAGs 8.4/8.6 work just as well here.) Self-selection bias or volunteer bias: Under any of the above structures, if the study is restricted to people who volunteer or choose to participate, this can induce a selection bias.	I.100
(Note: Missing arrow: \(A \rightarrow Y\) ) A: Smoking status Y: Coronary heart disease C: Consent to participate U: Family history L: Heart disease awareness ———— W: Lifestyle	(Note: DAGs 8.4/8.6 work just as well here.) Selection affected by treatment received before study entry: A generalization of self-selection bias. Under any of the above structures, if the treatment takes place before the study selection or includes a pre-study component, a selection bias can arise. The risk is particularly high in studies that look at lifetime exposure to something in middle-aged volunteers. Similar issues often arise with confounding if confounders are only measured during the study.	I.100

Term	Notation or Formula	Notes	Page
Association	Pr[Y=1\|A=1] \(\neq\) Pr[Y=1\|A=0]	Example definitions of independence (lack of association): Y \(\unicode{x2AEB}\) A or Pr[Y=1\|A=1] - Pr[Y=1\|A=0] = 0 or \(\frac{Pr[Y=1\|A=1]}{Pr[Y=1\|A=0]}\) = 1 or \(\frac{Pr[Y=1\|A=1]/Pr[Y=0\|A=1]}{Pr[Y=1\|A=0]/Pr[Y=0\|A=0]}\) = 1	I.10
Causation and Causal Effects	Causation: Pr[Y^(a=1)=1] \(\neq\) Pr[Y^(a=0)=1] Individual Causal Effects: Y^(a=1) - Y^(a=0) Population Average Causal Effects: E[Y^(a=1)] - E[Y^(a=0)] where Y^(a=1) = Outcome for treatment w/ \(a=1\) Y^(a=0) = Outcome for treatment w/ \(a=0\)	Sharp causal null hypothesis: Y^(a=1) = Y^(a=0) for all individuals in the population. Null hypothesis of no average causal effect: E[Y^(a=1)] = E[Y^(a=0)] Mathematical representations of causal null: Pr[Y^(a=1)=1] - Pr[Y^(a=0)=1] = 0 or \(\frac{Pr[Y^{a=1}=1]}{Pr[Y^{a=0}=1]} = 1\) or \(\frac{Pr[Y^{a=1}=1]/Pr[Y^{a=1}=0]}{Pr[Y^{a=0}=1]/Pr[Y^{a=0}=0]} = 1\)	I.7
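
The three scales used above (difference, ratio, odds ratio) can be made concrete with a short sketch. The function name and the input risks are hypothetical, chosen only for illustration. The same arithmetic applies to associational risks Pr[Y=1\|A=a] and to counterfactual risks Pr[Y^(a)=1]; only the interpretation differs.

```python
def effect_measures(risk_1, risk_0):
    """Return (risk difference, risk ratio, odds ratio) for two risks.

    risk_1 and risk_0 are the risks under a=1 and a=0 (hypothetical inputs).
    """
    risk_difference = risk_1 - risk_0
    risk_ratio = risk_1 / risk_0
    odds_ratio = (risk_1 / (1 - risk_1)) / (risk_0 / (1 - risk_0))
    return risk_difference, risk_ratio, odds_ratio


# Equal risks: every measure sits at its null value (0 for the difference, 1 for ratios).
null_rd, null_rr, null_or = effect_measures(0.5, 0.5)

# Unequal risks: all three measures move away from their null values together.
rd, rr, odds_r = effect_measures(0.5, 0.25)
print(null_rd, null_rr, null_or)
print(rd, rr, round(odds_r, 3))
```

The three null values are equivalent statements of the same null, which is why any one of them can be used to define independence or the causal null.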

Term	Notation or Formula	English Definition	Notes	Page
Effect modification aka effect-measure modification	Additive effect modification: E[Y^(a=1)-Y^(a=0) \| M = 1] \(\neq\) E[Y^(a=1)-Y^(a=0) \| M = 0] Multiplicative effect modification: \(\frac{E[Y^{a=1} \| M = 1]}{E[Y^{a=0} \| M = 1]}\) \(\neq\) \(\frac{E[Y^{a=1}\| M = 0]}{E[Y^{a=0}\| M = 0]}\)	\(M\) is a modifier of the effect of \(A\) on \(Y\) when the average causal effect of \(A\) on \(Y\) varies across levels of \(M\).	The null hypothesis of no average causal effect does not necessarily imply the absence of effect modification (e.g. equal and opposite effect modifications in men and women could cancel at the population level), but the sharp null hypothesis of no causal effect does imply no effect modification. We only count variables unaffected by treatment as effect modifiers. Similar variables that are affected by treatment are termed mediators.	I.41
Qualitative effect modification		Average causal effects in different subsets of the population go in opposite directions.	In the presence of qualitative effect modification, additive effect modification implies multiplicative effect modification, and vice versa. In the absence of qualitative effect modification, it’s possible to have only additive or only multiplicative effect modification. Effect modifiers are not necessarily assumed to play a causal role. To make this explicit, sometimes the terms surrogate effect modifier vs causal effect modifier are used, or you can play it even safer and refer to “effect heterogeneity across strata of \(M\).” Effect modification is helpful, among other things, for (i) assessing transportability to new populations where \(M\) may have different prevalences, (ii) choosing subpopulations that may most benefit from treatment, and (iii) identifying mechanisms leading to outcome if modifiers are mechanistically meaningful (e.g. circumcision for HIV transmission).	I.42
Stratification	Stratified causal risk differences: E[Y^(a=1) \| M = 1] - E[Y^(a=0) \| M = 1] and E[Y^(a=1) \| M = 0] - E[Y^(a=0) \| M = 0]	To identify effect modification by variable \(M\), separately compute the causal effect of \(A\) on \(Y\) for each stratum of the variable \(M\).	If the study design assumes conditional rather than marginal exchangeability, analysis to estimate effect modification must account for all other variables \(L\) required to give exchangeability. This could involve standardization (IP weighting, etc.) by \(L\) within each stratum \(M\), or just using finer-grained stratification over all pairwise combinations of \(M\) and \(L\) (see page I.49). By the same token, stratification can be an alternative to standardization techniques such as IP weighting in the analysis of any conditionally randomized experiment: instead of estimating an average causal effect over the population while standardizing for \(L\), just stratify on \(L\) and report separate causal effect estimates for each stratum.	I.43-49
Collapsibility		A characteristic of a population effect measure. Means that the effect measure can be expressed as a weighted average of stratum-specific measures.	Examples of collapsible effect measures: risk ratio and risk difference Example of non-collapsible effect measure: odds ratio. Noncollapsibility can produce counter-intuitive findings like a causal odds ratio that’s smaller in the average population than in any stratum of the population.	I.53
Matching		Construct a subset of the population in which all variables \(L\) have the same distribution in both the treated and the untreated.	Under assumption of conditional exchangeability given \(L\) in the source population, a matched population will have unconditional exchangeability. Usually, constructed by including all of the smaller group (e.g. the treated) and selecting one member of the larger group (e.g. the untreated) with matching \(L\) for each member in the smaller group. Often requires approximate matching.	I.49
Interference		Treatment of one individual affects the outcomes of other individuals in the population.	Example: A socially active individual convinces friends to join him while exercising.	I.48
Transportability		Ability to use causal effect estimation from one population in order to inform decisions in another (“target”) population.	Requires that the target population is characterized by comparable patterns of: - Effect modification - Interference, and - Versions of treatment	I.48
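
The collapsibility entry above can be illustrated numerically. The sketch below uses made-up counterfactual risks in two equally sized strata of \(M\). The numbers are hypothetical and chosen so that the stratum-specific odds ratios are equal; they are not taken from the book.

```python
def odds_ratio(risk_treated, risk_untreated):
    """Odds ratio comparing the risk under treatment to the risk under no treatment."""
    odds_treated = risk_treated / (1 - risk_treated)
    odds_untreated = risk_untreated / (1 - risk_untreated)
    return odds_treated / odds_untreated


# Hypothetical counterfactual risks Pr[Y^a = 1 | M = m]; weight is Pr[M = m].
strata = {
    1: {"weight": 0.5, "risk_treated": 0.8, "risk_untreated": 0.5},
    0: {"weight": 0.5, "risk_treated": 0.5, "risk_untreated": 0.2},
}

stratum_or = {m: odds_ratio(s["risk_treated"], s["risk_untreated"]) for m, s in strata.items()}
stratum_rd = {m: s["risk_treated"] - s["risk_untreated"] for m, s in strata.items()}

# Marginal risks are weighted averages of the stratum-specific risks.
marginal_treated = sum(s["weight"] * s["risk_treated"] for s in strata.values())
marginal_untreated = sum(s["weight"] * s["risk_untreated"] for s in strata.values())

marginal_or = odds_ratio(marginal_treated, marginal_untreated)
marginal_rd = marginal_treated - marginal_untreated

print("stratum ORs:", stratum_or, "marginal OR:", round(marginal_or, 3))
print("stratum RDs:", stratum_rd, "marginal RD:", round(marginal_rd, 3))
```

With these numbers the risk difference is about 0.3 in both strata and in the whole population, as expected for a collapsible measure, while the marginal odds ratio (roughly 3.45) is smaller than the odds ratio of about 4 found in each stratum.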

Machine Learning and Medicine

Induction, Inductive Biases, and Infusing Knowledge into Learned Representations

Outline:

Inductive Generalization and Inductive Biases

Philosophical Foundations for the Problem of Induction

Inductive Biases in Machine Learning

Learned Representations of Data and Knowledge

Background on Representation Learning

Infusing Domain Knowledge into Neural Representations

Bibliography

Comments on ML "versus" statistics

Neglected historical context: The term “machine learning” was not coined to contrast with statistics, but to contrast the field with competing paradigms for building intelligent computer systems.

Arguments about who “owns” regression miss the point

Distinctions in goals have yielded a divergence in methods and cultures, which explains shifting connotations of the term “machine learning.” Disconnects in language doom many “debates” to futility before they begin.

Isn’t this whole “debate” a massive waste of time?

In summary:

All the DAGs from Hernan and Robins' Causal Inference Book

Refresher: Visual rules of d-separation.

Refresher: Backdoor criterion

Basics of Causal Diagrams (6.1-6.5)

Effect Modification (6.6)

Confounding (Chapter 7)

Selection Bias (Chapter 8)

Measurement Bias (Chapter 9)

Causal Inference Book Part I -- Glossary and Notes

A few common variables

Chapter 1: Definition of Causal Effect

Chapter 2: Randomized Experiments

Chapter 3: Observational Studies

Chapter 4: Effect Modification

Chapter 5: Interaction

Chapter 6: Causal Diagrams

Chapter 7: Confounding

Chapter 8: Selection Bias

Chapter 9: Measurement Bias

Chapter 10: Random Variability

Chapter 11: Why Model?

Chapter 12:

FAQ on Medical Adversarial Attacks Policy Paper

What’s the paper and why this FAQ?

Do you think adversarial attacks are the biggest concern in using machine learning in healthcare? (A: Nowhere close!) Then why write the paper?

There seems to have been something of a pivot between the preprint and the policy forum discussion, with the latter focusing much less on images. Was this intentional?

In the paper, you frame existing examples like upcoding and claims craftsmanship as adversarial attacks, or at least as their precursors. Is that fair?

Isn’t this unrealistic? I mean, would there ever be cases when someone actually uses adversarial examples?

“Adversarial attacks” sounds scary. Do you think people will use these as tools to hurt people by hacking diagnostics, etc?

Are you hoping to stall the development of medical ML because of adversarial attacks?

Small note on the figure

Deriving probability distributions using the Principle of Maximum Entropy

Introduction

Maximum Entropy Principle

Lagrange Multipliers

1. Derivation of maximum entropy probability distribution with no other constraints (uniform distribution)

Satisfy constraint

Putting Together

2. Derivation of maximum entropy probability distribution for given fixed mean \(\mu\) and variance \(\sigma^{2}\) (gaussian distribution)

Satisfy first constraint

Satisfy second constraint

Putting together

3. Derivation of maximum entropy probability distribution of half-bounded random variable with fixed mean \(\bar{r}\) (exponential distribution)

Satisfying first constraint

Satisfying the second constraint

Putting together

4. Maximum entropy of random variable over range \(R\) with set of constraints \(\left\langle f_{n}(x)\right\rangle =\alpha_{n}\) with \(n=1\dots N\) and \(f_{n}\) is of polynomial order

Deriving the information entropy of the multivariate gaussian

Introduction and Trace Tricks

Derivation

Setup

First term

Second term (Trace Trick Coming!)

Recombining the terms