Pediatrics-Genetics Resident
I am resident physician in Pediatrics and Clinical Genetics at Seattle Children's Hospital and the University of Washington. Previously, I completed my medical degree in the Harvard-MIT HST program, and my PhD under the direction of Isaac Kohane (Harvard DBMI) and Peter Szolovits (MIT CSAIL).
My clinical interests are in the diagnosis and treatment of rare pediatric diseases. My research background is primarily in biomedical applications of machine learning. Recurring technical themes include designing systems that integrate different types of data ('omics, chemical structures, imaging, clinical text, insurance claims, etc.), and trying to rigorously determine the circumstances under which machine learning systems can be clinically trusted. I am an Associate Editor at the journal NEJM AI.
In addition to my research, I am involved in the Hydrocephalus community by way of Team Hydro, a non-profit organization that my family and I started to raise money and awareness for the condition in honor of my sister, Kate. I also do consulting work as a data scientist and deep learning engineer, with clients including tech, biotech, and pharmaceutical companies.
For a more formal account of my academic work, see my Curriculum Vitae or Google Scholar.
Selected Research
-
Simulation of undiagnosed patients with novel genetic conditions
Emily Alsentzer*, Samuel Finlayson*, Michelle M. Li, Undiagnosed Diseases Network, Shilpa N. Kobren & Isaac S. Kohane
Nature Communications, 2023
[Abstract] [Paper] [PDF]Rare Mendelian disorders pose a major diagnostic challenge and collectively affect 300–400 million patients worldwide. Many automated tools aim to uncover causal genes in patients with suspected genetic disorders, but evaluation of these tools is limited due to the lack of comprehensive benchmark datasets that include previously unpublished conditions. Here, we present a computational pipeline that simulates realistic clinical datasets to address this deficit. Our framework jointly simulates complex phenotypes and challenging candidate genes and produces patients with novel genetic conditions. We demonstrate the similarity of our simulated patients to real patients from the Undiagnosed Diseases Network and evaluate common gene prioritization methods on the simulated cohort. These prioritization methods recover known gene-disease associations but perform poorly on diagnosing patients with novel genetic disorders. Our publicly-available dataset and codebase can be utilized by medical genetics researchers to evaluate, compare, and improve tools that aid in the diagnostic process. -
The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model
David M Levine, Rudraksh Tuwani, Benjamin Kompa, Amita Varma, Samuel Finlayson, Ateev Mehrotra, Andrew Beam
Lancet Digital Health, 2024
[Abstract] [Paper]Importance: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis and triage is unknown. Objective: Compare the general-purpose Generative Pre-trained Transformer 3 (GPT-3) AI model’s diagnostic and triage performance to attending physicians and lay adults who use the Internet. Design: We compared the accuracy of GPT-3’s diagnostic and triage ability for 48 validated case vignettes of both common (e.g., viral illness) and severe (e.g., heart attack) conditions to lay people and practicing physicians. Finally, we examined how well calibrated GPT-3’s confidence was for diagnosis and triage. Setting and Participants: The GPT-3 model, a nationally representative sample of lay people, and practicing physicians. Exposure: Validated case vignettes (<60 words; <6th grade reading level). Main Outcomes and Measures: Correct diagnosis, correct triage. Results: Among all cases, GPT-3 replied with the correct diagnosis in its top 3 for 88% (95% CI, 75% to 94%) of cases, compared to 54% (95% CI, 53% to 55%) for lay individuals (p<0.001) and 96% (95% CI, 94% to 97%) for physicians (p=0.0354). GPT-3 triaged (71% correct; 95% CI, 57% to 82%) similarly to lay individuals (74%; 95% CI, 73% to 75%; p=0.73); both were significantly worse than physicians (91%; 95% CI, 89% to 93%; p<0.001). As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well-calibrated for diagnosis (Brier score = 0.18) and triage (Brier score = 0.22). Conclusions and Relevance: A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below physicians and better than lay individuals. The model was performed less well on triage, where its performance was closer to that of lay individuals. -
The Clinician and Dataset Shift in Artificial Intelligence
Samuel Finlayson, Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac Kohane, Suchi Saria
New England Journal of Medicine, 2021
[Paper] [PDF] [Supplement -- Expanded Version of Table w/ Citations] -
Code-free deep learning for multi-modality medical image classification
Edward Korot, Zeyu Guan, ... Samuel Finlayson, ... Pearse A. Keane
Nature Machine Intelligence, 2021
[Abstract] [Paper]A number of large technology companies have created code-free cloud-based platforms that allow researchers and clinicians without coding experience to create deep learning algorithms. In this study, we comprehensively analyse the performance and featureset of six platforms, using four representative cross-sectional and en-face medical imaging datasets to create image classification models. The mean (s.d.) F1 scores across platforms for all model–dataset pairs were as follows: Amazon, 93.9 (5.4); Apple, 72.0 (13.6); Clarifai, 74.2 (7.1); Google, 92.0 (5.4); MedicMind, 90.7 (9.6); Microsoft, 88.6 (5.3). The platforms demonstrated uniformly higher classification performance with the optical coherence tomography modality. Potential use cases given proper validation include research dataset curation, mobile ‘edge models’ for regions without internet access, and baseline models against which to compare and iterate bespoke deep learning approaches. -
Cross-Modal Representation Alignment of Transcriptional Profiles and Small Molecule Therapeutics
Samuel Finlayson, Matthew McDermott, Alex Pickering, Scott Lipnick, Isaac Kohane
Pacific Symposium on Biocomputing, 2021
[Abstract] [Paper]Modeling the relationship between chemical structure and molecular activity is a key goal in drug development. Many benchmark tasks have been proposed for molecular property prediction, but these tasks are generally aimed at specific, isolated biomedical properties. In this work, we propose a new cross-modal small molecule retrieval task, designed to force a model to learn to associate the structure of a small molecule with the transcriptional change it induces. We develop this task formally as multi-view alignment problem, and present a coordinated deep learning approach that jointly optimizes representations of both chemical structure and perturbational gene expression profiles. We benchmark our results against oracle models and principled baselines, and find that cell line variability markedly influences performance in this domain. Our work establishes the feasibility of this new task, elucidates the limitations of current data and systems, and may serve to catalyze future research in small molecule representation learning. -
Subgraph Neural Networks
Emily Alsentzer*, Samuel Finlayson*, Michelle Li, Marinka Zitnik
Neural Information Processing Systems, 2020
*Co-first, listed alphabetically
[Abstract] [Paper]Deep learning methods for graphs achieve remarkable performance on many node-level and graph-level prediction tasks. However, despite the proliferation of the methods and their success, prevailing Graph Neural Networks (GNNs) neglect subgraphs, rendering subgraph prediction tasks challenging to tackle in many impactful applications. Further, subgraph prediction tasks present several unique challenges: subgraphs can have non-trivial internal topology, but also carry a notion of position and external connectivity information relative to the underlying graph in which they exist.
Here, we introduce Sub-GNN, a subgraph neural network to learn disentangled subgraph representations. We propose a novel subgraph routing mechanism that propagates neural messages between the subgraph’s components and randomly sampled anchor patches from the underlying graph, yielding highly accurate subgraph representations. Sub-GNN specifies three channels, each designed to capture a distinct aspect of subgraph topology, and we provide empirical evidence that the channels encode their intended properties. We design a series of new synthetic and real-world subgraph datasets. Empirical results for subgraph classification on eight datasets show that Sub-GNN achieves considerable performance gains, outperforming strong baseline methods, including node-level and graph-level GNNs, by 12.4% over the strongest baseline. Sub-GNN performs well on challenging biomedical datasets when subgraphs have complex topology and even comprise multiple disconnected components. -
Examining the Use of Real-World Evidence in the Regulatory Process
Brett Beaulieu-Jones, Samuel Finlayson, William Yuan, Russ Altman, Isaac Kohane, Vinay Prasad, Kun-Hsing Yu
Clinical Pharmacology and Therapeutics, 2019
[Abstract] [Paper]The 21st Century Cures Act passed by the United States Congress mandates the Food and Drug Administration to develop guidance to evaluate the use of real-world evidence (RWE) to support the regulatory process. RWE has generated important medical discoveries, especially in areas where traditional clinical trials would be unethical or infeasible. However, RWE suffers from several issues that hinder its ability to provide proof of treatment efficacy at a level comparable to randomized controlled trials. In this review article, we summarized the advantages and limitations of RWE, identified the key opportunities for RWE, and pointed the way forward to maximize the potential of RWE for regulatory purposes. -
Adversarial attacks on medical machine learning
Samuel Finlayson, John Bowers, Joi Ito, Jonathan Zittrain, Andrew Beam, Isaac Kohane
Science, 2019
[Abstract] [Paper] [GitHub]
[Technical Preprint]
[FAQ] [Sample press: NYT, New Yorker, IEE, Vox, Axios]With public and academic attention increasingly focused on the new role of machine learning in the health information economy, an unusual and no-longer-esoteric category of vulnerabilities in machine learning systems could prove significant. These vulnerabilities allow a small, carefully-designed change in how inputs are presented to a system to completely alter its output, causing it to confidently arrive at manifestly wrong conclusions. These advanced techniques to subvert otherwise-reliable machine learning systems – so-called adversarial attacks – have, to date, been of interest primarily to computer science researchers. However, the landscape of often-competing interests within healthcare, and billions of dollars at stake in systems’ outputs, implies considerable problems. We outline motivations that various players in the healthcare system may have to employ adversarial attacks, and begin a discussion of what to do about them. Far from discouraging continued innovation with medical machine learning, we call for active engagement of medical, technical, legal, and ethical experts in pursuit of efficient, broadly-available, and effective health care that machine learning will enable. -
Systemic nature of spinal muscular atrophy revealed by studying insurance claims
Scott Lipnick, Denis Agniel, Rahul Aggarwal, Nina Makhortova, Samuel Finlayson... Isaac Kohane , Lee Rubin
Plos One, 2019
[Abstract] [Paper]Objective We investigated the presence of non-neuromuscular phenotypes in patients affected by Spinal Muscular Atrophy (SMA), a disorder caused by a mutation in the Survival of Motor Neuron (SMN) gene, and whether these phenotypes may be clinically detectable prior to clinical signs of neuromuscular degeneration and therefore independent of muscle weakness.
Methods: We utilized a de-identified database of insurance claims to explore the health of 1,038 SMA patients compared to controls. Two analyses were performed: (1) claims from the entire insurance coverage window; and (2) for SMA patients, claims prior to diagnosis of any neuromuscular disease or evidence of major neuromuscular degeneration to increase the chance that phenotypes could be attributed directly to reduced SMN levels. Logistic regression was used to determine whether phenotypes were diagnosed at significantly different rates between SMA patients and controls and to obtain covariate-adjusted odds ratios.
Results: Results from the entire coverage window revealed a broad spectrum of phenotypes that are differentially diagnosed in SMA subjects compared to controls. Moreover, data from SMA patients prior to their first clinical signs of neuromuscular degeneration revealed numerous non-neuromuscular phenotypes including defects within the cardiovascular, gastrointestinal, metabolic, reproductive, and skeletal systems. Furthermore, our data provide evidence of a potential ordering of disease progression beginning with these non-neuromuscular phenotypes.
Conclusions: Our data point to a direct relationship between early, detectable non-neuromuscular symptoms and SMN deficiency. Our findings are particularly important for evaluating the efficacy of SMN-increasing therapies for SMA, comparing the effectiveness of local versus systemically delivered therapeutics, and determining the optimal therapeutic treatment window prior to irreversible neuromuscular damage. -
Towards generative adversarial networks as a new paradigm for radiology education
Samuel Finlayson, Hyunkwang Lee, Isaac Kohane, Luke Oakden-Rayner
Machine Learning for Health (NeurIPS Workshop), 2018
[Abstract] [Paper] [Description/Backstory]Medical students and radiology trainees typically view thousands of images in order to ”train their eye” to detect the subtle visual patterns necessary for diagnosis. Nevertheless, infrastructural and legal constraints often make it difficult to access and quickly query an abundance of images with a user-specified feature set. In this paper, we use a conditional generative adversarial network (GAN) to synthesize 1024 × 1024 pixel pelvic radiographs that can be queried with conditioning on fracture status. We demonstrate that the conditional GAN learns features that distinguish fractures from non-fractures by training a convolutional neural network exclusively on images sampled from the GAN and achieving an AUC of > 0.95 on a held-out set of real images. We conduct additional analysis of the images sampled from the GAN and describe ongoing work to validate educational efficacy.Generative adversarial networks (GANs) are a very cool technique that uses dualing neural networks to try to learn to approximate a data distribution. They have been used to create some beautiful images, but some researchers have current techniques don't appear to learn the full data distribution they're trying to approximate. As a researcher, this makes me nervous about using GAN-sampled images as an data substitutes in high-stakes clinical applications, but I'm still very interested in considering ways we could find productive uses for GANs while we're figuring out how to solve their limitations.
As a medical student on a radiology rotation, I spent many hours staring at websites like this this one, trying to proverbially "train my eye." Note that the above tool--which I found indispensible--only shows 500 *normal* chest x-rays. This produced the mild epiphany that in radiology education, individual tools can be useful even if they just help students see many variants of a thin subset of the data distribution. What I wanted as a student was to pre-specify a combo of medical features, and then cook up an arbitrary number of slightly different images that had them all. In theory, a conditional GAN could do just this, with limited storage and no active link to hospital IT systems.
Back in grad school, I met Hyunkwang Lee during a class with Sasha Rush, and we decided to work on conditional image generation using GANs. We reached out to Luke Oakden-Rayner, who gave us access to an amazing dataset. We played around on this using the StackGAN framework during the class project, and have since extended it into this work using Progressive Growing of GANs. -
Privacy-Preserving Distributed Deep Learning for Clinical Data
Brett Beaulieu-Jones, William Yuan, Samuel Finlayson, Zhiwei Steven Wu
Machine Learning for Health (NeurIPS Workshop), 2018
[Abstract] [Paper]Deep learning with medical data often requires larger samples sizes than are available at single providers. While data sharing among institutions is desirable to train more accurate and sophisticated models, it can lead to severe privacy concerns due the sensitive nature of the data. This problem has motivated a number of studies on distributed training of neural networks that do not require direct sharing of the training data. However, simple distributed training does not offer provable privacy guarantees to satisfy technical safe standards and may reveal information about the underlying patients. We present a method to train neural networks for clinical data in a distributed fashion under differential privacy. We demonstrate these methods on two datasets that include information from multiple independent sites, the eICU collaborative Research Database and The Cancer Genome Atlas. -
Predictability and persistence of prebiotic dietary supplementation in a healthy human cohort
Thomas Gurry, HST Microbiome Consortium, ... Eric Alm
Scientific Reports, 2018
[Abstract] [Paper] [Description/Backstory]Dietary interventions to manipulate the human gut microbiome for improved health have received increasing attention. However, their design has been limited by a lack of understanding of the quantitative impact of diet on a host’s microbiota. We present a highly controlled diet perturbation experiment in a healthy, human cohort in which individual micronutrients are spiked in against a standardized background. We identify strong and predictable responses of specific microbes across participants consuming prebiotic spike-ins, at the level of both strains and functional genes, suggesting fine-scale resource partitioning in the human gut. No predictable responses to non-prebiotic micronutrients were found. Surprisingly, we did not observe decreases in day-to-day variability of the microbiota compared to a complex, varying diet, and instead found evidence of diet-induced stress and an associated loss of biodiversity. Our data offer insights into the effect of a low complexity diet on the gut microbiome, and suggest that effective personalized dietary interventions will rely on functional, strain-level characterization of a patient’s microbiota.This was a really fun project with a long story behind it. In brief, during the first year of medical school, some friends and I began wondering if we could execute a clinical experiment on ourselves. Inspired by the use of Soylent by some of our classmates -- and some reservations that we had about the quality of control arms in microbiome studies -- we came up with an idea: recruit about 50 graduate students, place them on an identical all-liquid diet, and run a seven-arm clinical experiment with 6 arms receiving a spike-in of a single macronutrient for the second half of the study period. Thus the "HST microbiome consortium" was born. As it turned out, Thomas Gurry and Eric Alm (researchers at MIT), had been thinking along similar lines as well. So we teamed up, ran the experiment together, and the result was this paper! We wrote a bit more on the backstory here. -
Potential Adverse Effects of Broad-Spectrum Antimicrobial Exposure in the Intensive Care Unit
Jenna Wiens*, Graham Snyder*, Samuel Finlayson, Monica Majoney, Leo Celi
Open Forum Infectious Diseases, 2018
[Abstract] [Paper]Background: The potential adverse effects of empiric broad-spectrum antimicrobial use among patients with suspected but subsequently excluded infection have not been fully characterized. We sought novel methods to quantify the risk of adverse effects of broad-spectrum antimicrobial exposure among patients admitted to an intensive care unit (ICU).
Methods: Among all adult patients admitted to ICUs at a single institution, we selected patients with negative blood cultures who also received ≥1 broad-spectrum antimicrobials. Broad-spectrum antimicrobials were categorized in ≥1 of 5 categories based on their spectrum of activity against potential pathogens. We performed, in serial, 5 cohort studies to measure the effect of each broad-spectrum category on patient outcomes. Exposed patients were defined as those receiving a specific category of broad-spectrum antimicrobial; nonexposed were all other patients in the cohort. The primary outcome was 30-day mortality. Secondary outcomes included length of hospital and ICU stay and nosocomial acquisition of antimicrobial-resistant bacteria (ARB) or Clostridium difficile within 30 days of admission.
Results: Among the study cohort of 1918 patients, 316 (16.5%) died within 30 days, 821 (42.8%) had either a length of hospital stay >7 days or an ICU length of stay >3 days, and 106 (5.5%) acquired either a nosocomial ARB or C. difficile. The short-term use of broad-spectrum antimicrobials in any of the defined broad-spectrum categories was not significantly associated with either primary or secondary outcomes.
Conclusions: The prompt and brief empiric use of defined categories of broad-spectrum antimicrobials could not be associated with additional patient harm. -
Toward rapid learning in cancer treatment selection: An interactive analytical engine for clinical oncology
Samuel Finlayson, Mia Levy, Sunil Reddy, Daniel Rubin
Journal of Biomedical Informatics, 2016
[Abstract] [Paper] [GitHub]OBJECTIVE: Wide-scale adoption of electronic medical records (EMRs) has created an unprecedented opportunity for the implementation of Rapid Learning Systems (RLSs) that leverage primary clinical data for real-time decision support. In cancer, where large variations among patient features leave gaps in traditional forms of medical evidence, the potential impact of a RLS is particularly promising. We developed the Melanoma Rapid Learning Utility (MRLU), a component of the RLS, providing an analytical engine and user interface that enables physicians to gain clinical insights by rapidly identifying and analyzing cohorts of patients similar to their own.
MATERIALS AND METHODS: A new approach for clinical decision support in Melanoma was developed and implemented, in which patient-centered cohorts are generated from practice-based evidence and used to power on-the-fly stratified survival analyses. A database to underlie the system was generated from clinical, pharmaceutical, and molecular data from 237 patients with metastatic melanoma from two academic medical centers. The system was assessed in two ways: (1) ability to rediscover known knowledge and (2) potential clinical utility and usability through a user study of 13 practicing oncologists.
RESULTS: The MRLU enables physician-driven cohort selection and stratified survival analysis. The system successfully identified several known clinical trends in melanoma, including frequency of BRAF mutations, survival rate of patients with BRAF mutant tumors in response to BRAF inhibitor therapy, and sex-based trends in prevalence and survival. Surveyed physician users expressed great interest in using such on-the-fly evidence systems in practice (mean response from relevant survey questions 4.54/5.0), and generally found the MRLU in particular to be both useful (mean score 4.2/5.0) and useable (4.42/5.0).
DISCUSSION: The MRLU is an RLS analytical engine and user interface for Melanoma treatment planning that presents design principles useful in building RLSs. Further research is necessary to evaluate when and how to best use this functionality within the EMR clinical workflow for guiding clinical decision making.
CONCLUSION: The MRLU is an important component in building a RLS for data driven precision medicine in Melanoma treatment that could be generalized to other clinical disorders. -
Detecting Unplanned Care From Clinician Notes in Electronic Health Records
Suzanne Tamang, Manali Patel, Douglas Blayney, Julie Kuznetsov, Samuel Finlayson, ... Nigam Shah
Journal of Oncology Practice, 2015
[Abstract] [Paper]
Purpose: Reduction in unplanned episodes of care, such as emergency department visits and unplanned hospitalizations, are important quality outcome measures. However, many events are only documented in free-text clinician notes and are labor intensive to detect by manual medical record review.
Methods: We studied 308,096 free-text machine-readable documents linked to individual entries in our electronic health records, representing care for patients with breast, GI, or thoracic cancer, whose treatment was initiated at one academic medical center, Stanford Health Care (SHC). Using a clinical text-mining tool, we detected unplanned episodes documented in clinician notes (for non-SHC visits) or in coded encounter data for SHC-delivered care and the most frequent symptoms documented in emergency department (ED) notes.
Results: Combined reporting increased the identification of patients with one or more unplanned care visits by 32% (15% using coded data; 20% using all the data) among patients with 3 months of follow-up and by 21% (23% using coded data; 28% using all the data) among those with 1 year of follow-up. Based on the textual analysis of SHC ED notes, pain (75%), followed by nausea (54%), vomiting (47%), infection (36%), fever (28%), and anemia (27%), were the most frequent symptoms mentioned. Pain, nausea, and vomiting co-occur in 35% of all ED encounter notes.
Conclusion: The text-mining methods we describe can be applied to automatically review free-text clinician notes to detect unplanned episodes of care mentioned in these notes. These methods have broad application for quality improvement efforts in which events of interest occur outside of a network that allows for patient data sharing. -
Building the Graph of Medicine from Millions of Clinical Narratives
Samuel Finlayson, Paea LePendu, Nigam Shah
Scientific Data, 2014
[Abstract] [Paper] [Data]Electronic health records (EHR) represent a rich and relatively untapped resource for characterizing the true nature of clinical practice and for quantifying the degree of inter-relatedness of medical entities such as drugs, diseases, procedures and devices. We provide a unique set of co-occurrence matrices, quantifying the pairwise mentions of 3 million terms mapped onto 1 million clinical concepts, calculated from the raw text of 20 million clinical notes spanning 19 years of data. Co-frequencies were computed by means of a parallelized annotation, hashing, and counting pipeline that was applied over clinical notes from Stanford Hospitals and Clinics. The co-occurrence matrix quantifies the relatedness among medical concepts which can serve as the basis for many statistical tests, and can be used to directly compute Bayesian conditional probabilities, association rules, as well as a range of test statistics such as relative risks and odds ratios. This dataset can be leveraged to quantitatively assess comorbidity, drug-drug, and drug-disease patterns for a range of clinical, epidemiological, and financial applications.