How do Reactome pathways get collapsed to gene sets?

I noticed that GMT files from Reactome (and the Reactome sets in MSigDB) sometimes include identifiers of genes which are negatively regulated by the given pathway. That looks like a bad thing for GSEA and friends. I wonder whether this is intentional. An example is IL18 in the IL-10 signalling pathway (https://reactome.org/content/detail/R-HSA-6783783).

@krassowski AFAIK, most gene sets from pathways are really just *the members* of the pathway, with no consideration for the connections or the directions of action.
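To illustrate the "just the members" point, here is a minimal sketch of reading a GMT file, assuming the usual tab-separated layout (set name, description, then one gene per column). The set name, description and gene list below are toy values for illustration, not a copy of the real Reactome or MSigDB entry. Note that nothing in the format has room for edge direction or sign.

```python
import io


def parse_gmt(handle):
    """Parse GMT lines of the form: name <TAB> description <TAB> gene1 <TAB> gene2 ..."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        # A gene set is only an unordered bag of members: no edges, no signs.
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


# Toy GMT content (illustrative only; description field is a placeholder).
toy_gmt = io.StringIO(
    "REACTOME_INTERLEUKIN_10_SIGNALING\ttoy-description\t"
    "IL10\tIL10RA\tSTAT3\tIL18\tCSF2\n"
)
gene_sets = parse_gmt(toy_gmt)
print(sorted(gene_sets["REACTOME_INTERLEUKIN_10_SIGNALING"]))
```
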

So yeah, hypergeometric testing on the up- and down-regulated genes separately, or GSEA, which is implicitly directional, might not pick it up (unless you use something like limma's roast, which, if I recall correctly, supports testing both directions simultaneously).
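A minimal sketch of why that happens, in pure Python with made-up data (the gene names `g0`..`g19` and the helper names are hypothetical). The toy set has six members, three going up and three going down, so each directional test only sees half of them, while a direction-agnostic test sees all six.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the set."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def directional_ora(gene_set, up, down, universe):
    """Over-representation p-values for the up and down lists tested separately."""
    members = gene_set & universe
    result = {}
    for label, hits in (("up", up), ("down", down)):
        hits = hits & universe
        overlap = len(members & hits)
        result[label] = hypergeom_sf(overlap, len(universe), len(members), len(hits))
    return result


# Toy data: a pathway whose members change in opposite directions.
universe = {f"g{i}" for i in range(20)}
gene_set = {f"g{i}" for i in range(6)}
up = {"g0", "g1", "g2", "g10"}
down = {"g3", "g4", "g5", "g11"}

separate = directional_ora(gene_set, up, down, universe)
changed = up | down
combined = hypergeom_sf(len(gene_set & changed), len(universe), len(gene_set), len(changed))
print(separate, combined)
```

In this toy case the combined test gives a much smaller p-value than either directional test, which is the dilution effect described above.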

@krassowski I think a lot of these issues are part of the push to be able to examine gene expression changes "in the context" of the network and pathway and the connections.

But with signaling, we could have really small expression changes at the bottom of the network, while you are actually more interested in the outputs, which are more likely to be phosphorylation changes.

It's good to be aware of the limitations of the methods, especially depending on the biological system you are investigating.

@krassowski For example, I think WGCNA results can be overlaid on the actual network, or the actual network can even be used as input, to compare the detected gene-gene correlations and see how they relate to it.

There have been other attempts in this area as well.

Always helpful to know biology!

https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/

I also think this is where highly specific GO can be useful, as long as there are enough genes, as they do have "negatively ..." and "positively ..." for some cases.

@rmflight I'm a huge fan of WGCNA and really hope we can solve the problem of topology-based pathway analysis (or maybe it is already solved?), but I'm afraid that this would not work for my data (I think those methods work well for proteome-wide and expression datasets, but it's tricky to apply them to panels of a few hundred proteins).
@rmflight As I refresh my memory of https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3146-1 it looks like the topology-based methods may be in a better place now than I thought :)
A comparative study of topology-based pathway enrichment analysis methods - BMC Bioinformatics

Background: Pathway enrichment extensively used in the analysis of Omics data for gaining biological insights into the functional roles of pre-defined subsets of genes, proteins and metabolites. A large number of methods have been proposed in the literature for this task. The vast majority of these methods use as input expression levels of the biomolecules under study together with their membership in pathways of interest. The latest generation of pathway enrichment methods also leverages information on the topology of the underlying pathways, which as evidence from their evaluation reveals, lead to improved sensitivity and specificity. Nevertheless, a systematic empirical comparison of such methods is still lacking, making selection of the most suitable method for a specific experimental setting challenging. This comparative study of nine network-based methods for pathway enrichment analysis aims to provide a systematic evaluation of their performance based on three real data sets with different number of features (genes/metabolites) and number of samples.

Results: The findings highlight both methodological and empirical differences across the nine methods. In particular, certain methods assess pathway enrichment due to differences both across expression levels and in the strength of the interconnectedness of the members of the pathway, while others only leverage differential expression levels. In the more challenging setting involving a metabolomics data set, the results show that methods that utilize both pieces of information (with NetGSA being a prototypical one) exhibit superior statistical power in detecting pathway enrichment.

Conclusion: The analysis reveals that a number of methods perform equally well when testing large size pathways, which is the case with genomic data. On the other hand, NetGSA that takes into consideration both differential expression of the biomolecules in the pathway, as well as changes in the topology exhibits a superior performance when testing small size pathways, which is usually the case for metabolomics data.

@krassowski OK, do you really expect GSEA (and friends) to work for a pathway with no consideration of direction of edges?

If you want to consider directions of edges, I'd want to start parsing the SBML directly and see if correlation matches the expected relationships.
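A rough sketch of the second half of that idea, assuming the signed edges have already been extracted from the pathway file into `(regulator, target, sign)` tuples (the SBML parsing itself is not shown). The edge list, expression values and function names are all hypothetical, made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def check_edges(edges, expression):
    """For each signed edge, report whether the correlation sign matches the expected sign."""
    report = []
    for source, target, sign in edges:
        r = pearson(expression[source], expression[target])
        consistent = (r > 0) == (sign > 0)
        report.append((source, target, sign, r, consistent))
    return report


# Hypothetical signed edges: +1 = activation, -1 = repression.
edges = [("IL10", "IL18", -1), ("GENE_A", "GENE_B", +1)]
# Made-up expression values across five samples.
expression = {
    "IL10": [1.0, 2.0, 3.0, 4.0, 5.0],
    "IL18": [5.0, 4.0, 3.0, 2.0, 1.5],
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [3.0, 2.5, 2.0, 1.0, 0.5],
}
report = check_edges(edges, expression)
for source, target, sign, r, consistent in report:
    print(source, target, sign, round(r, 2), consistent)
```

With only a handful of samples the sign of a correlation is noisy, so in practice one would probably want a significance threshold before calling an edge inconsistent.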

@rmflight I am well aware of this consideration (common in Gene Ontology) but did not expect this in the MSigDB-provided Reactome sets.

My question still stands: how are the members of a pathway derived when the GMT files get exported? I guess the answer will come from some digging in the source. For example, I see that the IL-10 signaling pathway includes both "CSF2" and "CSF2 gene".
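As a first diagnostic while digging, a small sketch that groups participant names differing only by a trailing " gene". This is purely a guess at one kind of duplication and says nothing about how the Reactome export actually collapses entities. The helper name and the input list are hypothetical.

```python
def collapse_participants(names, suffix=" gene"):
    """Group participant display names by symbol, ignoring a trailing suffix."""
    groups = {}
    for name in names:
        symbol = name[: -len(suffix)] if name.endswith(suffix) else name
        groups.setdefault(symbol, []).append(name)
    return groups


# Hypothetical participant names for one pathway.
participants = ["CSF2", "CSF2 gene", "IL18", "IL10"]
groups = collapse_participants(participants)
duplicated = {symbol: hits for symbol, hits in groups.items() if len(hits) > 1}
print(duplicated)
```
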