Researchers often use clinical trials to collect the data needed to evaluate some hypothesis, or produce a classifier. During this process, they have to pay the cost of performing each test. Many studies will run a comprehensive battery of tests on each subject, for as many subjects as their budget will allow -- ie, "round robin" (RR). We consider a more general model, where the researcher can sequentially decide which single test to perform on which specific individual; again subject to spending only the available funds. Our goal here is to use these funds most effectively, to collect the data that allows us to learn the most accurate classifier.
We first explore the simplified "coins version" of this task. After observing that this is NP-hard, we consider a range of heuristic algorithms, both standard and novel, and observe that our "biased robin" approach is both efficient and much more effective than most other approaches, including the standard RR approach. We then apply these ideas to learning a naive-bayes classifier, and see similar behavior. Finally, we consider the most realistic model, where both the researcher gathering data to build the classifier, and the user (eg, physician) applying this classifier to an instance (patient) must pay for the features used --- eg, the researcher has $10,000 to acquire the feature values needed to produce an optimal $30/patient classifier. Again, we see that our novel approaches are almost always much more effective that the standard RR model.
Joint work with Aloak Kapoor, Dan Lizotte and Omid Madani.
Context: The "survival prediction" task requires learning a model that can estimate the time until an event will happen for an instance; this differs from standard regression problems as the training survival dataset may include many "censored instances", which specify only a lower bound on that instance's true survival time. This framework is surprisingly common, as it includes many real-world situations, such as estimating the time until a customer defaults on a loan, until a game player advances to the next level, until a mechanic device breaks, and customer churn. [Note this framework allows that "true time" to be effectively infinite for some instances -- ie, some players may never advance, and some customers might not default, etc.] This presentation focuses the most common situation: estimating the time until a patient dies.
An accurate model of a patient’s individual survival distribution can help determine the appropriate treatment and care of terminal patients. The common practice of estimating such survival distributions uses only population averages for (say) the site and stage of cancer; however, this is not very precise, as it ignores many important individual differences among patients. This presentation describes a novel technique, PSSP (patient-specific survival prediction), for estimating a patient’s individual survival curve, based on the characteristics of that specific patient, using a model that was learned from earlier patients. We describe how PSSP works, and explain how PSSP differs from the more standard tools for survival analysis (Kaplan-Meier, Cox Proportional Hazard, etc). We also show that PSSP is "calibrated", which means that its probabilistic estimates are meaningful. Finally, we demonstrate, over many real-world datasets (various cancers, and liver transplantation), that PSSP provides survival estimates that are helpful for patients, clinicians and researchers.
Joint work with Chun-Nam Yu, Vickie Baracos,
2018 Slides Video Presentation (45m) from 2015 MUCMD Conference (Los Angeles)
Video Presentation (18m) from 2015 Techna Symposium (Toronto)
Recent advances in high-throughput technologies, such as genome-wide SNP analysis and microarray gene expression profiling, have led to a multitude of ranked lists, where the features (SNPs, genes) are sorted based on their individual correlation with a phenotype. Many studies will then return the top k features as ``signatures'' for the phenotypes. Multiple reviews have shown, however, that different studies typically produce very different signatures, even when based on subsampling from a single dataset; this is the ``gene signature anomaly''.
This paper formally investigates the overlap of the top ranked features in two lists whose elements are ranked by their respective Pearson correlation coefficients with the phenotype outcome. We show that our model is able to accurately predict the expected overlap between two ranked lists, based on reasonable assumptions. This finding explains why the overlap in a pair of gene signatures (for same phenotype) should be relatively small, for purely statistical reasons, given today's typical sample size.
Joint work with Babak Damavandi,
Many web recommendation systems direct users to webpages, from a single website, that other similar users have visited. By contrast, our WebIC web recommendation system is designed to locate "information content (IC) pages" --- pages the current user needs to see to complete her task --- from essentially anywhere on the web. WebIC first extracts the "browsing properties" of each word encountered in the user's current click-stream --- eg, how often each word appears in the title of a page in this sequence, or in the "anchor" of a link that was followed, etc. It then uses a user- and site-independent model, learned from a set of annotated web logs acquired in a user study, to determine which of these words is likely to appear in an IC page. We discuss how to use these IC-words to find IC-pages, and demonstrate empirically that this browsing-based approach works effectively.
Joint work with Tingshao Zhu, Gerald Haeubl and Bob Price
Proteome Analyst (PA) is a publicly available, high-throughput, web-based system for predicting various properties of each protein in an entire proteome. Using machine-learned classifiers, PA can predict, for example, the GeneQuiz general function and Gene Ontology (GO) molecular function of a protein. In addition, PA is one of the most accurate and most comprehensive systems for predicting subcellular localization, the location within a cell where a protein performs its main function. Two other capabilities of PA are notable. First, PA can create a custom classifier to predict a new property, without requiring any programming, based on labeled training data (i.e. a set of examples, each with the correct classification label) provided by a user. PA has been used to create custom classifiers for potassium-ion channel proteins and other general function ontologies. Second, PA provides a sophisticated explanation feature that shows why one prediction is chosen over another. The PA system produces a linear classifier, which is amenable to a graphical and interactive approach to explanations for its predictions; transparent predictions increase the user's confidence in, and understanding of, PA. Finally, we also present a similar technique that predicts which proteins will participate in a known (signaling) pathway, and how.
Joint work with
Paul Lu, Duane Szafron, David S. Wishart,
Alona Fyshe, Brandon Pearcy, Brett Poulin, Roman Eisner, Danny Ngo,
Nicholas Lamb, Jordan Patterson, Kurt McMillan, Kevin Jewell
In general, each cell signaling pathway involves many proteins, each with one or more specific roles. As they are essential components of cell activity, iti s important to understand how these proteins work -- and in particular, to determine which of a species' proteins participate in each role. Experimentally determining this mapping of proteins to roles is difficult and time consuming. Fortunately, many pathways are similar across species, so we may be able to use known pathway information of one species to understand the corresponding pathway of another.
We present an automatic approach, Predict Signaling Pathway (PSP), that uses the signaling pathways in well-studied species to predict the roles of proteins in less-studied species. We use a machine learning approach to create a predictor that achieves a generalization F-measure of 78.2% when applied to 11 different pathways across 14 different species. We also show our approach is very effective in predicting the pathways that have not yet been experimentally studied completely.
Joint work with
Babak Bostan, Paul Lu, Duane Szafron,
Bayesian belief nets (BNs) are often used for classification tasks --- typically to return the most likely class label for each specified instance. Many BN-learners, however, attempt to find the BN that maximizes a different objective function --- viz., likelihood, rather than classification accuracy --- typically by first learning an appropriate graphical structure, then finding the maximal likelihood parameters for that structure. As these parameters may not maximize the classification accuracy, ``discriminative learners'' follow the alternative approach of seeking the parameters that maximize conditional likelihood (CL), over the distribution of instances the BN will have to classify. This presentation first formally specifies this task, and shows how it extends standard logistic regression. After analyzing its inherent sample and computational complexity, we present a general algorithm for this task, ELR, that applies to arbitrary BN structures and works effectively even when given incomplete training data. We present empirical evidence that ELR produces better classifiers than are produced by the standard ``generative'' algorithms in a variety of situations, especially in common situations where the given BN-structure is incorrect.
Joint work with Wei Zhou, Xiaoyuan Su and Bin Shen
A Bayesian Belief Network (BN) models a joint distribution over a set of n variables, using a DAG structure to represent the immediate dependencies between the variables, and a set of parameters (aka "CPTables") to represent the local conditional probabilities of a node, given each assignment to its parents. In many situations, these parameters are themselves random variables --- this may reflect the uncertainty of the domain expert, or may come from a training sample used to estimate the parameter values. The distribution over these "CPtable variables" induces a distribution over the response the BN will return to any "What is Pr(Q=q | E=e)?" query. This paper investigates properties of this response: showing first that it is asymptotically normal, then providing, in closed form, its mean and asymptotic variance. We then present an effective general algorithm for computing this variance, which has the same complexity as simply computing (the mean value of) the response itself --- ie, O(n 2w), where w is the effective tree width. Finally, we provide empirical evidence that a Beta approximation works much better than the normal distribution, especially for small sample sizes, and that our algorithm works effectively in practice, over a range of belief net structures, sample sizes and queries.
Joint work with Tim Van Allen, Ajit Singh and Peter Hooper.
Machine Learning (ML) basically involves finding patterns in data. This presentation focusses on patterns that can lead to effective classifiers. It first providing a number of real-world deployed applications based on ML, and motivates why this process is essential for such tasks. It first provides a number of real-world deployed applications based on ML, and motivates why ML is essential for such tasks. It then overviews three simple examples of learnable classifiers (linear separators, artificial neural nets and decision trees), and quickly suggests some of the important statistical foundations that underlie this field.
This presentation introduces the relevant ideas, using real-world medical examples -- starting with a way to help predict which breast cancer patients are likely to suffer a relapse, based on the subcellular location of certain adhesion proteins, as well as the standard "Adjuvant!" features. We use this real example to introduce some basic ML technologies (like linear separators, support vector machines, and decision trees), as well as some important foundational ideas -- including ways to statistically validate the resulting classifier. En route, we explain how this machine learning approach differs from the standard biostatistical framework. This part concludes by explaining the decision tree learned for this task, and showing that it is effective -- capable of correctly predicting the outcome for 80% of the patients.
We then explain how the same methodology can be used to (1) find patterns over a patient's metabolic profiles (extracted from an NMR analysis of his/her urine sample) to predict whether s/he will become cachexic, or is at risk for colon cancer; (2) predict whether a breast cancer patient is ER+ vs ER-, by analyzing a microarray of her tumor biopsy; (3) find patterns over multiple Single Nucleotide Polyomorphisms (SNPs) that can predict breast cancer; (4) locate a tumor within an magnetic resonance image of a brain and predict how that tumor will grow; (5) predict a patient's psychiatric disorder from an fMRI scan; and (6) help patients manage their Type I diabetes. We also discuss some foundational work, providing an effective way (a) to predict how long an specific patient will survive, and (b) to efficiently design experiments over multiple sub-populations.
Joint work with various colleagues in the Cross Cancer Institute, the UofA Medical School and UofA Computing Science, including M Pasdar, J Mackey, S Damaraju, K Graham (and others of the PolyomX project), D Wishart, V Baracos, and A Murtha (and others of the Brain Tumor Analyst Project), M Brown (+ S Dursun, A Greenshaw), E Ryan, N Kneteman, A Montana-Loza, A Andres.
Abstract for shorter version.
Many tasks require building and using a model --- eg, to relate a patient's disease state with the possible symptoms, underlying causes, and effects of various treatments. These relationships are often probabilistic; eg a disease will often, but not always, manifest certain symptoms. Bayesian Belief Nets (BNs), which provide a succinct way to represent such probabilistic models, are in routine use for a wide range of applications, including medicine, bioinformatics, document classification, image processing and decision support systems. This presentation provides a quick overview to BNs: first motivating BNs in general, then describing how BNs exploit "independencies", and finally (if time permits) suggesting ways to learn a BN from data.
Slides: (Intro) (Topics)
Belief Net (PGM) website
This presentation focuses on preparing, and delivering, platform presentations: describing what material to present, and how to show this material effectively, and then some ideas relevant to the presentation itself: both before and during. The main ideas are to view the presentation as telling a well-structured story, that is relevant (and helpful) for the intended audience. If time permits, we will also discuss how to prepare and present posters; we will that many of the same ideas apply here, as well.