
Ingenuity Pathway Analysis (IPA®) can now predict cell types associated with the genes on your network or pathway. The prediction is based on an enrichment calculation for the set of genes on your pathway canvas versus sets of genes that are known to be expressed relatively highly in particular cell types. The underlying cell type expression data comes from The Human Protein Atlas (www.proteinatlas.org/).
Figure 1 shows a screenshot of the new Cells and Tissues overlay applied to a network derived from expression data from a natural killer single cell cluster (from human fetal liver, PMID 31597962). As expected, the overlay indicates that the network is enriched in natural killer cell genes (P value: 2.04E-20).
Figure 1. Enrichment of natural killer enriched genes on a network. An overlay tag (labeled “CT: natural killer cells”) was added to the pathway after the genes on the network were found to be enriched in genes expressed relatively highly in natural killer cells. CT stands for “Cells or Tissues”. The underlying sets of genes that are considered cell-type enriched are defined as genes expressed in one cell type at more than three times the median of expression across all other cell types in the collection from The Human Protein Atlas.
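The definition of a cell-type enriched gene in the caption above can be sketched in a few lines of Python. This is a minimal illustration, not IPA's implementation: the function name, the gene names, and the expression values are all made up, and it assumes one expression value per gene per cell type.

```python
from statistics import median


def cell_type_enriched_genes(expression, cell_type, fold=3.0):
    """Return genes whose expression in `cell_type` is more than `fold`
    times the median of their expression across all other cell types.

    `expression` maps gene -> {cell type -> expression value}.
    """
    enriched = set()
    for gene, by_cell_type in expression.items():
        others = [v for ct, v in by_cell_type.items() if ct != cell_type]
        if cell_type not in by_cell_type or not others:
            continue  # cannot evaluate this gene for this cell type
        if by_cell_type[cell_type] > fold * median(others):
            enriched.add(gene)
    return enriched


# Made-up expression values (arbitrary units), for illustration only.
toy_expression = {
    "GENE_A": {"natural killer cells": 90.0, "hepatocytes": 2.0,
               "T cells": 10.0, "B cells": 4.0},
    "GENE_B": {"natural killer cells": 12.0, "hepatocytes": 10.0,
               "T cells": 11.0, "B cells": 9.0},
    "GENE_C": {"natural killer cells": 1.0, "hepatocytes": 80.0,
               "T cells": 1.0, "B cells": 2.0},
}

nk_genes = cell_type_enriched_genes(toy_expression, "natural killer cells")
print(sorted(nk_genes))
```

In this toy data, only GENE_A passes the threshold for natural killer cells, because 90 exceeds three times the median (4) of its values in the other cell types.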
The cell types are organized into three major branches of the Ingenuity Ontology, namely the physiological system, eukaryotic cells, and gross anatomical part. A specific cell type will typically be found in two or three of those major branches. In the example of Figure 1, natural killer cells are found under the immune system (within the physiological system branch), and as shown in Figure 2, also under blood cells in the eukaryotic cells branch.
Figure 2. Natural killer cells are categorized under the eukaryotic cells branch of the Ingenuity Ontology as well.
With this new capability in IPA, you can set a pattern of activated or inhibited genes on a My Pathway, which IPA can then score by comparing that pattern to the differential expression of the analysis-ready molecules in your dataset. In so doing, IPA can predict whether the My Pathway is activated or inhibited in the context of your dataset. The activation state (red or green) for each node can be set in one of three ways: by overlaying an analysis or a dataset, manually with the red or green paint bucket in the MAP (Molecule Activity Predictor) feature, or by combining the paint buckets with an overlaid analysis or dataset.
Figure 3 shows an example of a My Pathway created in IPA depicting several key epithelial–mesenchymal transition-related genes and biological functions. The gene nodes have been colored with the MAP paint buckets (red for activated and green for inhibited). Once the pathway has been saved and approved for scoring, the pathway can be scored in the context of future Core Analyses.
Figure 3. A custom My Pathway with nodes assigned by the user as activated (red) or inhibited (green). This pathway can be saved and scored in any future Core Analysis. Note that any orange or blue coloring for molecules, and any diseases or functions, are not saved as part of this pathway pattern for scoring purposes.
The scoring is done using a z score algorithm, akin to how Canonical Pathways are scored, accomplished by comparing the up- or downregulated states of the analysis-ready molecules in your dataset to the activity state (red or green color) of matching molecules on each saved My Pathway. Figure 4 shows the My Pathways tab for a Core Analysis of expression data from claudin-low breast cancer cell lines ratioed to luminal cell lines (PMID 20813035).
IPA predicts that the custom EMT “My Pathway” is activated in the aggressive cancer lines, which is the expected result for these cells. The z score is positive because the actual expression direction in the dataset (shown in the fourth column in the table in Figure 4) matches the expected direction assigned in the saved My Pathway (displayed in the seventh column in Figure 4, labeled “Expected”).
Figure 4. Causally scoring a My Pathway. The My Pathway named “EMT key TF” shown in Figure 3 has been scored in a Core Analysis and is indicated with the orange bar above. The orange color indicates the pathway is predicted to be activated in this expression analysis of aggressive breast cancer cell lines. As shown above the table, the z score for the pathway is 2.646.
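The comparison between expected and observed directions can be illustrated with a short Python sketch. It assumes the simple form z = (matches − mismatches) / √N, where N is the number of pathway molecules that have both an assigned state and a nonzero measured change. The text above does not spell out the formula, so treat this as an assumption rather than IPA's exact algorithm. The function name, gene names, and values are hypothetical.

```python
import math


def pathway_z_score(expected, observed):
    """Score a pathway pattern against a dataset.

    expected: gene -> +1 (red, activated) or -1 (green, inhibited)
    observed: gene -> signed expression change (e.g. a log ratio)
    Returns (matches - mismatches) / sqrt(N), or None if nothing overlaps.
    """
    matches = 0
    mismatches = 0
    for gene, direction in expected.items():
        change = observed.get(gene)
        if change is None or change == 0:
            continue  # gene not measured, or no direction to compare
        observed_direction = 1 if change > 0 else -1
        if observed_direction == direction:
            matches += 1
        else:
            mismatches += 1
    n = matches + mismatches
    if n == 0:
        return None
    return (matches - mismatches) / math.sqrt(n)


# Hypothetical pattern and made-up expression changes.
expected_pattern = {"GENE_1": 1, "GENE_2": 1, "GENE_3": 1, "GENE_4": -1}
dataset_changes = {"GENE_1": 2.1, "GENE_2": 1.3, "GENE_3": -0.4, "GENE_4": -1.8}

z = pathway_z_score(expected_pattern, dataset_changes)
print(z)
```

Here three molecules match their expected direction and one does not, so the score is (3 − 1) / √4. A positive value means the dataset mostly agrees with the pattern, and a negative value means it mostly opposes it. As a purely arithmetic aside, seven molecules that all match would give √7 ≈ 2.646 under this simplified formula.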
This new capability provides you with the tools to create any pathway you can imagine and find out how it is impacted in your experimental setting. The genes on the pathway do not need to be connected by relationships. You can also modify a Canonical Pathway or other IPA pathway as your starting point for your My Pathway.
When analyzing a dataset, the most precise definition of the “universe” of genes to use in statistical calculations is the one that most closely matches the set of genes that you measured (or could measure) in your experimental setting. For example, if you are analyzing a panel of 400 genes, then the universe or “reference set” should be those 400 genes (or better yet, the subset of those genes that are measurable in the experimental conditions at hand). It would be statistically incorrect to set the reference set to all genes in the genome if you know you can only measure changes in those 400.
Or, for example, if you are performing whole transcriptome RNA-seq from mouse kidney tissue, then the reference set would ideally be the set of all genes in your experiment that you could reliably measure, for example, those with RPKM values that passed some threshold in at least one sample (e.g., RPKM > 1). That way, the universe is set to “mouse kidney-expressed genes” rather than all possible genes in the genome, some of which are not expressed in mouse kidney.
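As a rough illustration of the RNA-seq case, the reference set could be built before upload with a filter like the one below. This is a sketch under stated assumptions: the function name, gene names, and RPKM values are invented, and the threshold of 1 simply mirrors the example in the text.

```python
def reference_set_from_rpkm(rpkm_by_gene, threshold=1.0):
    """Keep genes whose RPKM exceeds `threshold` in at least one sample."""
    return {
        gene
        for gene, values in rpkm_by_gene.items()
        if any(value > threshold for value in values)
    }


# Made-up RPKM values for three samples per gene.
toy_rpkm = {
    "GENE_A": [0.0, 0.2, 0.1],    # never detected
    "GENE_B": [0.4, 3.5, 0.9],    # passes in one sample
    "GENE_C": [12.0, 15.1, 9.8],  # passes in every sample
    "GENE_D": [1.0, 1.0, 1.0],    # equal to the threshold, so excluded
}

reference_set = reference_set_from_rpkm(toy_rpkm)
print(sorted(reference_set))
```

Only the genes in `reference_set` would then be uploaded as the dataset, so that the universe reflects what could actually be measured in the tissue.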
IPA has always enabled you to upload the entire set of detectable molecules and then, when analyzing the data, set the User Dataset as the reference set. However, it was easy to forget that setting when creating the analysis, which effectively made the entire genome the universe instead. In this release of IPA, you can set the reference set to User Dataset during dataset upload, when you are more likely to remember to set it correctly.
Figure 5 shows the new upload setting.
Figure 5. Setting the reference set to User Dataset during dataset upload.
This new feature should reduce the chance of accidentally using a less-than-ideal reference set in your analyses.
Please remember that you should not use the “User Dataset” reference set option if your dataset contains only the significantly differentially expressed genes from your experiment. In such a case, unless you set even more stringent cutoffs at analysis time, the statistics will be incorrect, because the analysis-ready genes and the reference set are identical. The statistics are designed to look for enrichment among a smaller set of genes drawn from the universe of possible genes.
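The effect of the reference set on an enrichment p-value can be seen in a small Python sketch. It assumes a right-tailed hypergeometric (Fisher's exact) test, which this text does not name explicitly. All counts are made up, and the function name is hypothetical.

```python
from math import comb


def right_tailed_p(universe_size, annotated_in_universe, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from a universe in which `annotated_in_universe` genes belong to the
    gene set of interest."""
    upper = min(annotated_in_universe, selected)
    total = comb(universe_size, selected)
    tail = sum(
        comb(annotated_in_universe, k)
        * comb(universe_size - annotated_in_universe, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 50 analysis-ready genes, 15 of them in the gene set.
# Universe = a 400-gene panel in which 40 genes belong to the gene set.
p_panel = right_tailed_p(400, 40, 50, 15)
# Universe = a 20,000-gene genome in which 200 genes belong to the gene set.
p_genome = right_tailed_p(20000, 200, 50, 15)
print(p_panel, p_genome)

# Degenerate case: the dataset holds only the significant genes, so the
# analysis-ready set and the reference set are the same 150 genes.
p_degenerate = right_tailed_p(150, 30, 150, 30)
print(p_degenerate)
```

With these invented counts, the genome-wide universe yields a far smaller p-value than the panel universe for the same overlap, which is why an oversized reference set overstates significance. In the degenerate case every gene in the universe is selected, the overlap can take only one value, and the p-value is 1 regardless of the biology.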
A legend specific to the Graphical Summary (a tab in Core Analysis) appears in the top right corner of the screen when viewing that tab as shown in Figure 6.
Figure 6. Graphical Summary legend. The legend appears at the top right. A high-resolution copy can be downloaded from the help portal for inclusion in publications.
Links on Gene Views for TARGET (Therapeutically Applicable Research to Generate Effective Treatments) for childhood cancer have been updated to point to the B38 GC33 gene model in Land Explorer, rather than the older B38 data.
If you have further questions, please contact your local QIAGEN representative or contact our Technical Support Center at www.qiagen.com/support/technical-support.