Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparecluster analysis is not working for me with same data #754

Open
Sidragull57 opened this issue Jan 20, 2025 · 7 comments
Open

Comparecluster analysis is not working for me with same data #754

Sidragull57 opened this issue Jan 20, 2025 · 7 comments

Comments

@Sidragull57
Copy link

Sidragull57 commented Jan 20, 2025

Hallo,

I want to re-run the GSEA on my same data. I have previously performed the compare cluster analysis for GSEGSEA and it worked very well but now I have updated the R version to 4.4.2. here is the pathway and code with error. Kindly let me know what is the issue now

VLnorm<-read.csv2("Astrocytes VL_Females normalized counts.csv", row.names = 1, dec = ".", na.strings = c("", "NA")) # Read CSV file specifying decimal separator and handling missing values
colNamesList <- strsplit(colnames(VLnorm), "X")
colnames(VLnorm) <- sapply(colNamesList, function(x) paste(x, collapse = ""))
View(VLnorm)
colNamesList <- strsplit(colnames(VLnorm), "X")
colnames(VLnorm) <- sapply(colNamesList, function(x) paste(x, collapse = ""))
time_points <- sub("(\d+)d.*", "\1", colnames(VLnorm)) #this describes my sample names, \1 means 1 position before the d

Create an empty list to store the subsets

subset_list <- list()

str(time_points)
chr [1:25] "0" "0" "0" "0" "0" "1" "1" "1" "1" "1" "3" "3" "3" "3" "3" "5" "5" "5" "5" "5" "10" "10" ...
for (time_point in unique(time_points)) {

  • subset_df <- VLnorm[, grepl(paste0("^", time_point, "d_"), colnames(VLnorm))]
  • subset_list[[time_point]] <- subset_df
  • }

subset_BL_VL <- subset_list[["0"]]
subset_1d_VL<- subset_list[["1"]]
subset_3d_VL <- subset_list[["3"]]
subset_5d_VL<- subset_list[["5"]]
subset_10d_VL <- subset_list[["10"]]
str(subset_BL_VL)
'data.frame': 15664 obs. of 5 variables:
$ 0d_L_F_V4 : num 92.302 3.103 0.776 0 4.654 ...
$ 0d_L_F_V7 : num 72.375 1.744 0 0 0.872 ...
$ 0d_L_F_V8 : num 71.487 1.787 0.894 0 8.042 ...
$ 0d_L_F_V9 : num 93.43 2.67 0 0 3.56 ...
$ 0d_L_F_V10: num 101.03 2.09 0 0 6.27 ...
row_names_match <- all(rownames(subset_BL_VL) == rownames(subset_1d_VL))
row_names_match <- all(rownames(subset_5d_VL) == rownames(subset_5d_VL))
if (row_names_match) {

  • print("Row order is the same.")
  • } else {
  • print("Row order is different.")
  • }
    [1] "Row order is the same."

mean_BL<-rowMeans(subset_BL_VL, na.rm = T)
mean_BL<-as.data.frame(mean_BL)
sd_BL<-apply(subset_BL_VL, 1, sd, na.rm = TRUE)
sd_BL<-as.data.frame(sd_BL)
#1d
mean_1d<-rowMeans(subset_1d_VL, na.rm = T)
mean_1d<-as.data.frame(mean_1d)
sd_1d<-apply(subset_1d_VL, 1, sd, na.rm = TRUE)
sd_1d<-as.data.frame(sd_1d)
#3d
mean_3d<-rowMeans(subset_3d_VL, na.rm = T)
mean_3d<-as.data.frame(mean_3d)
sd_3d<-apply(subset_3d_VL, 1, sd, na.rm = TRUE)
sd_3d<-as.data.frame(sd_3d)
#5d
mean_5d<-rowMeans(subset_5d_VL, na.rm = T)
mean_5d<-as.data.frame(mean_5d)
sd_5d<-apply(subset_5d_VL, 1, sd, na.rm = TRUE)
sd_5d<-as.data.frame(sd_5d)
#10d
mean_10d<-rowMeans(subset_10d_VL, na.rm = T)
mean_10d<-as.data.frame(mean_10d)
sd_10d<-apply(subset_10d_VL, 1, sd, na.rm = TRUE)
sd_10d<-as.data.frame(sd_10d)
row_names_match <- all(rownames(mean_BL) == rownames(mean_1d))
row_names_match <- all(rownames(sd_BL) == rownames(sd_1d))
if (row_names_match) {

  • print("Row order is the same.")
  • } else {
  • print("Row order is different.")
  • }
    [1] "Row order is the same."

tp1d_BL_VL<- (mean_1d - mean_BL) / (sd_1d + sd_BL)
tp1d_BL_VL<-na.omit(tp1d_BL_VL) # remove the NA values

tp3d_BL_VL<- (mean_3d - mean_BL) / (sd_3d + sd_BL)
tp3d_BL_VL<-na.omit(tp3d_BL_VL) # remove the NA values

tp5d_BL_VL<- (mean_5d - mean_BL) / (sd_5d + sd_BL)
tp5d_BL_VL<-na.omit(tp5d_BL_VL) # remove the NA values

tp10d_BL_VL<- (mean_10d - mean_BL) / (sd_10d + sd_BL)
tp10d_BL_VL<-na.omit(tp10d_BL_VL) # remove the NA values

#change the col names
name<-c("SignaltoNoise")
colnames(tp1d_BL_VL)<-name
colnames(tp3d_BL_VL)<-name
colnames(tp5d_BL_VL)<-name
colnames(tp10d_BL_VL)<-name
column_to_order_by <- "SignaltoNoise" # column name
ranked1d_BL <- tp1d_BL_VL %>%arrange(desc(get(column_to_order_by)))
ranked3d_BL<-tp3d_BL_VL %>% arrange(desc(get(column_to_order_by)))
ranked5d_BL<-tp5d_BL_VL %>% arrange(desc(get(column_to_order_by)))
ranked10d_BL<-tp10d_BL_VL %>% arrange(desc(get(column_to_order_by)))
ranked1d_BL_up<-rownames_to_column(ranked1d_BL, var="ENSEMBLEID")
ranked3d_BL_up<-rownames_to_column(ranked3d_BL, var="ENSEMBLEID")
ranked5d_BL_up<-rownames_to_column(ranked5d_BL, var="ENSEMBLEID")
ranked10d_BL_up<-rownames_to_column(ranked10d_BL, var="ENSEMBLEID")
Entrez_genes_1d <- bitr(ranked1d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
'select()' returned 1:many mapping between keys and columns
Warnmeldung:
In bitr(ranked1d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", :
0.56% of input gene IDs are fail to map...
Entrez_genes_3d <-bitr(ranked3d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
'select()' returned 1:many mapping between keys and columns
Warnmeldung:
In bitr(ranked3d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", :
0.61% of input gene IDs are fail to map...
Entrez_genes_5d <- bitr(ranked5d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
'select()' returned 1:many mapping between keys and columns
Warnmeldung:
In bitr(ranked5d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", :
0.61% of input gene IDs are fail to map...
Entrez_genes_10d<-bitr(ranked10d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
'select()' returned 1:many mapping between keys and columns
Warnmeldung:
In bitr(ranked10d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", :
0.58% of input gene IDs are fail to map...

Merge the dataframes based on the common gene identifier

ranked1d_BL_Entrez <- merge(ranked1d_BL_up, Entrez_genes_1d, by.x= "ENSEMBLEID", by.y= "ENSEMBL")
ranked1d_BL_Entrez <- ranked1d_BL_Entrez %>%arrange(desc(get(column_to_order_by))) # rerank
ranked1d_BL_Entrez_fgsea<-subset(ranked1d_BL_Entrez, select = -ENSEMBLEID) # remove the ENSEMBLEID

ranked3d_BL_Entrez <- merge(ranked3d_BL_up, Entrez_genes_3d, by.x= "ENSEMBLEID", by.y= "ENSEMBL")
ranked3d_BL_Entrez <- ranked3d_BL_Entrez %>%arrange(desc(get(column_to_order_by))) # rerank
ranked3d_BL_Entrez_fgsea<-subset(ranked3d_BL_Entrez, select = -ENSEMBLEID) # remove the ENSEMBLEID

ranked5d_BL_Entrez <- merge(ranked5d_BL_up, Entrez_genes_5d, by.x= "ENSEMBLEID", by.y= "ENSEMBL")
ranked5d_BL_Entrez <- ranked5d_BL_Entrez %>%arrange(desc(get(column_to_order_by))) # rerank
ranked5d_BL_Entrez_fgsea<-subset(ranked5d_BL_Entrez, select = -ENSEMBLEID) # remove the ENSEMBLEID

ranked10d_BL_Entrez <- merge(ranked10d_BL_up, Entrez_genes_10d, by.x= "ENSEMBLEID", by.y= "ENSEMBL")
ranked10d_BL_Entrez <- ranked10d_BL_Entrez %>%arrange(desc(get(column_to_order_by))) # rerank
ranked10d_BL_Entrez_fgsea<-subset(ranked10d_BL_Entrez, select = -ENSEMBLEID) # remove th
geneList_1d_fgsea_VL<-ranked1d_BL_Entrez_fgsea[,1] # numeric vector
names(geneList_1d_fgsea_VL)<-as.character(ranked1d_BL_Entrez_fgsea[,2]) # name vector

geneList_3d_fgsea_VL<-ranked3d_BL_Entrez_fgsea[,1] # numeric vector
names(geneList_3d_fgsea_VL)<-as.character(ranked3d_BL_Entrez_fgsea[,2]) # name vector

geneList_5d_fgsea_VL<-ranked5d_BL_Entrez_fgsea[,1] # numeric vector
names(geneList_5d_fgsea_VL)<-as.character(ranked5d_BL_Entrez_fgsea[,2]) # name vector

geneList_10d_fgsea_VL<-ranked10d_BL_Entrez_fgsea[,1] # numeric vector
names(geneList_10d_fgsea_VL)<-as.character(ranked10d_BL_Entrez_fgsea[,2]) # name vector
msigdbr_species() #inteasd of msigdbr_show_species()

A tibble: 20 × 2

species_name species_common_name

1 Anolis carolinensis Carolina anole, green anole
2 Bos taurus bovine, cattle, cow, dairy cow, domestic cattle, domestic cow, ox, oxen
3 Caenorhabditis elegans NA
4 Canis lupus familiaris dog, dogs
5 Danio rerio leopard danio, zebra danio, zebra fish, zebrafish
6 Drosophila melanogaster fruit fly
7 Equus caballus domestic horse, equine, horse
8 Felis catus cat, cats, domestic cat
9 Gallus gallus bantam, chicken, chickens, Gallus domesticus
10 Homo sapiens human
11 Macaca mulatta rhesus macaque, rhesus macaques, Rhesus monkey, rhesus monkeys
12 Monodelphis domestica gray short-tailed opossum
13 Mus musculus house mouse, mouse
14 Ornithorhynchus anatinus duck-billed platypus, duckbill platypus, platypus
15 Pan troglodytes chimpanzee
16 Rattus norvegicus brown rat, Norway rat, rat, rats
17 Saccharomyces cerevisiae baker's yeast, brewer's yeast, S. cerevisiae
18 Schizosaccharomyces pombe 972h- NA
19 Sus scrofa pig, pigs, swine, wild boar
20 Xenopus tropicalis tropical clawed frog, western clawed frog

msigdbr_collections()

A tibble: 23 × 3

gs_cat gs_subcat num_genesets

1 C1 "" 299
2 C2 "CGP" 3384
3 C2 "CP" 29
4 C2 "CP:BIOCARTA" 292
5 C2 "CP:KEGG" 186
6 C2 "CP:PID" 196
7 C2 "CP:REACTOME" 1615
8 C2 "CP:WIKIPATHWAYS" 664
9 C3 "MIR:MIRDB" 2377
10 C3 "MIR:MIR_Legacy" 221

ℹ 13 more rows

ℹ Use print(n = ...) to see more rows

msigdbr_species()

A tibble: 20 × 2

species_name species_common_name

1 Anolis carolinensis Carolina anole, green anole
2 Bos taurus bovine, cattle, cow, dairy cow, domestic cattle, domestic cow, ox, oxen
3 Caenorhabditis elegans NA
4 Canis lupus familiaris dog, dogs
5 Danio rerio leopard danio, zebra danio, zebra fish, zebrafish
6 Drosophila melanogaster fruit fly
7 Equus caballus domestic horse, equine, horse
8 Felis catus cat, cats, domestic cat
9 Gallus gallus bantam, chicken, chickens, Gallus domesticus
10 Homo sapiens human
11 Macaca mulatta rhesus macaque, rhesus macaques, Rhesus monkey, rhesus monkeys
12 Monodelphis domestica gray short-tailed opossum
13 Mus musculus house mouse, mouse
14 Ornithorhynchus anatinus duck-billed platypus, duckbill platypus, platypus
15 Pan troglodytes chimpanzee
16 Rattus norvegicus brown rat, Norway rat, rat, rats
17 Saccharomyces cerevisiae baker's yeast, brewer's yeast, S. cerevisiae
18 Schizosaccharomyces pombe 972h- NA
19 Sus scrofa pig, pigs, swine, wild boar
20 Xenopus tropicalis tropical clawed frog, western clawed frog

hallmark_gs<-msigdbr(species = "Mus musculus" , category = "H") %>% dplyr::select(gs_name, entrez_gene)
head(hallmark_gs)

A tibble: 6 × 2

gs_name entrez_gene

1 HALLMARK_ADIPOGENESIS 11303
2 HALLMARK_ADIPOGENESIS 74610
3 HALLMARK_ADIPOGENESIS 52538
4 HALLMARK_ADIPOGENESIS 11363
5 HALLMARK_ADIPOGENESIS 11364
6 HALLMARK_ADIPOGENESIS 11409

m2 curated all (C2)

m2_gs<-msigdbr(species = "Mus musculus" , category = "C2") %>% dplyr::select(gs_name, entrez_gene)

m2 curated: WikiPathways (C2: WP)

m2_WP<-msigdbr(species = "Mus musculus" , category = "C2", subcategory = "CP:WIKIPATHWAYS") %>% dplyr::select(gs_name, entrez_gene)

m2 curated: KEGG (C2:KEGG)

geneList_TP_VL <- list(

  • d1 = geneList_1d_fgsea_VL, # Gene list for 1 day
  • d3 = geneList_3d_fgsea_VL, # Gene list for 3d
  • d5 = geneList_5d_fgsea_VL, # Gene list for 5 days
  • d10 = geneList_10d_fgsea_VL # Gene list for 10 days
  • )

set.seed(1)

ck_TP_VL_HM<-compareCluster(geneClusters = geneList_TP_VL, fun="GSEA",TERM2GENE = hallmark_gs,

  •                         pAdjustMethod="fdr", 
    
  •                         seed=TRUE, by="fgsea")
    

Fehler in .stopf("Duplicate values in %s not allowed", universeArg) :
Duplicate values in names(stats) not allowed
Zusätzlich: Warnmeldung:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (10.73% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

@guidohooiveld
Copy link

guidohooiveld commented Jan 21, 2025

Because of its length, lack of formatting and reproducibility I did not bother to read your full post....

Yet, did you see the error in the last few lines?

Fehler in .stopf("Duplicate values in %s not allowed", universeArg) :
Duplicate values in names(stats) not allowed

This error is thrown by fgsea, and indicates that there are duplicated values present in your input. In other words, some genes in the ranked input list(s) are present multiple times.

Also note the warning on the ties in the list(s) of preranked stats!

BTW: what is GSEGSEA?

@Sidragull57
Copy link
Author

Sidragull57 commented Jan 21, 2025

Hallo,

Sorry it was GSEA analysis on the basis of signal to noise (SNR) ratio , I have checked my data fur duplications, and there is no duplicate in gene name ( Entrez ID) however in SNR values there are duplicates. I have performed GSEA analysis on the same date in September 2024 and now when I updated my R version to 4.4.2 and have re-runthe same code I got this error
......Ck_TP_VL_HM<-compareCluster(geneClusters = geneList_TP_VL, fun="GSEA",TERM2GENE = hallmark_gs,

                    pAdjustMethod="fdr", 
                    seed=TRUE, by="fgsea")

Fehler in .stopf("Duplicate values in %s not allowed", universeArg) :
Duplicate values in names(stats) not allowed
Zusätzlich: Warnmeldung:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (10.73% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

@guidohooiveld
Copy link

Again, the error is due to duplicates in your input.

fgsea code throwing this error:
https://github.com/ctlab/fgsea/blob/da4b3b8ed313caeed3ea305cfb3333439ae52eea/R/util.R#L23-L26

If you would like to get some further assistance, please provide some reproducible code, or share/attach (part of ) your input (i.e. object geneList_TP_VL).

@guidohooiveld
Copy link

guidohooiveld commented Jan 21, 2025

Also, did you check the input (geneList_TP_VL) is really a list?

Thus, what is the output of str(geneList_TP_VL)?

@Sidragull57
Copy link
Author

yes I checked . It is a gene list of 4 clusters. I have run that in R and paste it below

) > set.seed(1) > > ck_TP_VL_HM<-compareCluster(geneClusters = geneList_TP_VL, fun="GSEA",TERM2GENE = hallmark_gs, + pAdjustMethod="fdr", + seed=TRUE, by="fgsea") Fehler in .stopf("Duplicate values in %s not allowed", universeArg) : Duplicate values in names(stats) not allowed Zusätzlich: Warnmeldung: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (10.73% of the list). The order of those tied genes will be arbitrary, which may produce unexpected results.

str(geneList_TP_VL) List of 4 $ d1 : Named num [1:14813] 5.46 2.05 1.89 1.85 1.84 ... ..- attr(, "names")= chr [1:14813] "236930" "22042" "11815" "12380" ... $ d3 : Named num [1:14787] 2.28 1.79 1.73 1.73 1.73 ... ..- attr(, "names")= chr [1:14787] "208111" "244416" "76055" "224344" ... $ d5 : Named num [1:14800] 1.58 1.56 1.45 1.42 1.42 ... ..- attr(, "names")= chr [1:14800] "59002" "209027" "353169" "15366" ... $ d10: Named num [1:14759] 2.29 2.13 2 1.88 1.83 ... ..- attr(, "names")= chr [1:14759] "16449" "73316" "11898" "232201" ...
--
 
| >

Should I need to send my data. I am very surprised, on the same data the code worked few months before but now it provided thiis error.

kindly help me in this regard.

best
Sidra

@guidohooiveld
Copy link

guidohooiveld commented Jan 21, 2025

From your latest post I noticed the length of the input lists is not the same! Yet, this is not causing a problem (see code below in which some example data is used).

I also noted in your first post that you convert your lists of genes used as input to ENTREZID (from ENSEMBL), and that some of these conversions likely resulted in the presence of duplicated ENTREZID.
I conclude this because of this line: 'select()' returned 1:many mapping between keys and columns.
So I think this is likely the source of the error....!

From your code in first post

Entrez_genes_1d <- bitr(ranked1d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
'select()' returned 1:many mapping between keys and columns
Warnmeldung:
In bitr(ranked1d_BL_up$ENSEMBLEID, fromType = "ENSEMBL", toType = "ENTREZID", :
0.56% of input gene IDs are fail to map...

Considering your use case, since your data set is based on ensemblids, my suggestion would be to download the MSigDB hallmark gene sets as ensembl_gene rather than entrez_gene. Then you will avoid this conversion problem.

Code chunk to show you can download the MSigDB hallmark gene sets as ensembl_gene:

> ## load libraries
> library(msigdbr)
> library(clusterProfiler)
> 
> ## check MSigDb Hallmar gene sets
> ## note columns entrez_gene and ensembl_gene!
> hallmark_gs <- msigdbr(species = "Mus musculus" , category = "H")
> str(hallmark_gs)
tibble [7,384 × 18] (S3: tbl_df/tbl/data.frame)
 $ gs_cat              : chr [1:7384] "H" "H" "H" "H" ...
 $ gs_subcat           : chr [1:7384] "" "" "" "" ...
 $ gs_name             : chr [1:7384] "HALLMARK_ADIPOGENESIS" "HALLMARK_ADIPOGENESIS" "HALLMARK_ADIPOGENESIS" "HALLMARK_ADIPOGENESIS" ...
 $ gene_symbol         : chr [1:7384] "Abca1" "Abcb8" "Acaa2" "Acadl" ...
 $ entrez_gene         : int [1:7384] 11303 74610 52538 11363 11364 11409 104112 11429 11430 11512 ...
 $ ensembl_gene        : chr [1:7384] "ENSMUSG00000015243" "ENSMUSG00000028973" "ENSMUSG00000036880" "ENSMUSG00000026003" ...
 $ human_gene_symbol   : chr [1:7384] "ABCA1" "ABCB8" "ACAA2" "ACADL" ...
 $ human_entrez_gene   : int [1:7384] 19 11194 10449 33 34 35 47 50 51 112 ...
 $ human_ensembl_gene  : chr [1:7384] "ENSG00000165029" "ENSG00000197150" "ENSG00000167315" "ENSG00000115361" ...
 $ gs_id               : chr [1:7384] "M5905" "M5905" "M5905" "M5905" ...
 $ gs_pmid             : chr [1:7384] "26771021" "26771021" "26771021" "26771021" ...
 $ gs_geoid            : chr [1:7384] "" "" "" "" ...
 $ gs_exact_source     : chr [1:7384] "" "" "" "" ...
 $ gs_url              : chr [1:7384] "" "" "" "" ...
 $ gs_description      : chr [1:7384] "Genes up-regulated during adipocyte differentiation (adipogenesis)." "Genes up-regulated during adipocyte differentiation (adipogenesis)." "Genes up-regulated during adipocyte differentiation (adipogenesis)." "Genes up-regulated during adipocyte differentiation (adipogenesis)." ...
 $ taxon_id            : int [1:7384] 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 ...
 $ ortholog_sources    : chr [1:7384] "EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam" "EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam" "EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|Treefam" "EggNOG|Ensembl|HGNC|HomoloGene|Inparanoid|NCBI|OMA|OrthoDB|OrthoMCL|Panther|PhylomeDB|Treefam" ...
 $ num_ortholog_sources: num [1:7384] 11 12 11 12 12 11 11 12 12 12 ...
> 
> 
> ## entrez_gene
> hallmark_gs.ENTREZ <- msigdbr(species = "Mus musculus" , category = "H") %>%
+                dplyr::select(gs_name, entrez_gene)
> head(hallmark_gs.ENTREZ)
# A tibble: 6 × 2
  gs_name               entrez_gene
  <chr>                       <int>
1 HALLMARK_ADIPOGENESIS       11303
2 HALLMARK_ADIPOGENESIS       74610
3 HALLMARK_ADIPOGENESIS       52538
4 HALLMARK_ADIPOGENESIS       11363
5 HALLMARK_ADIPOGENESIS       11364
6 HALLMARK_ADIPOGENESIS       11409
> 
> 
> 
> ## ensembl_gene
> hallmark_gs.ENSEMBL <- msigdbr(species = "Mus musculus" , category = "H") %>%
+                dplyr::select(gs_name, ensembl_gene)
> head(hallmark_gs.ENSEMBL)
# A tibble: 6 × 2
  gs_name               ensembl_gene      
  <chr>                 <chr>             
1 HALLMARK_ADIPOGENESIS ENSMUSG00000015243
2 HALLMARK_ADIPOGENESIS ENSMUSG00000028973
3 HALLMARK_ADIPOGENESIS ENSMUSG00000036880
4 HALLMARK_ADIPOGENESIS ENSMUSG00000026003
5 HALLMARK_ADIPOGENESIS ENSMUSG00000062908
6 HALLMARK_ADIPOGENESIS ENSMUSG00000029545
> 

Code chunk to show that size of ranked lists do not need to be equal.
Note that the human example dataset is used!

> #### STEP 1: download gene sets
> ## using the code from the biomedical knowledge mining book.
> ## retrieve all *human* MSigDb gene sets for category H ("hallmark gene sets")
> ## https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp
> ## note that code below also works with 'species = "Mus musculus"'
> hallmark_gs.entrez <- msigdbr(species = "Homo sapiens" , category = "H") %>%
+                dplyr::select(gs_name, entrez_gene)
> 
> head(hallmark_gs.entrez)
# A tibble: 6 × 2
  gs_name               entrez_gene
  <chr>                       <int>
1 HALLMARK_ADIPOGENESIS          19
2 HALLMARK_ADIPOGENESIS       11194
3 HALLMARK_ADIPOGENESIS       10449
4 HALLMARK_ADIPOGENESIS          33
5 HALLMARK_ADIPOGENESIS          34
6 HALLMARK_ADIPOGENESIS          35
> 
> hallmark_gs.descr <- msigdbr(species = "Homo sapiens" , category = "H") %>%
+                dplyr::select(gs_name, gs_description) %>% dplyr::distinct()
> 
> head(hallmark_gs.descr)
# A tibble: 6 × 2
  gs_name                      gs_description                                   
  <chr>                        <chr>                                            
1 HALLMARK_ADIPOGENESIS        Genes up-regulated during adipocyte differentiat…
2 HALLMARK_ALLOGRAFT_REJECTION Genes up-regulated during transplant rejection.  
3 HALLMARK_ANDROGEN_RESPONSE   Genes defining response to androgens.            
4 HALLMARK_ANGIOGENESIS        Genes up-regulated during formation of blood ves…
5 HALLMARK_APICAL_JUNCTION     Genes encoding components of apical junction com…
6 HALLMARK_APICAL_SURFACE      Genes encoding proteins over-represented on the …
> 
> 
> #### STEP 2: generate GSEA input data
> ## note that OP uses lists as input that differ in length!
> 
> data(geneList,package = "DOSE")
> length(geneList)
[1] 12495
> 
> inputList <- list(List1 = geneList[1:length(geneList)],# all genes
+                   List2 = geneList[1:(length(geneList)-150)],# 150 less genes
+                   List3 = geneList[126:length(geneList)] )# 125 less genes
> 
> str(inputList)
List of 3
 $ List1: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
 $ List2: Named num [1:12345] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12345] "4312" "8318" "10874" "55143" ...
 $ List3: Named num [1:12370] 1.91 1.9 1.9 1.9 1.89 ...
  ..- attr(*, "names")= chr [1:12370] "1164" "3161" "3595" "11004" ...
> 
> ## run GSEA through compareCluster
> ## note: no sign. cutoffs are applied
> MSigDb.gsea <- compareCluster(geneClusters=inputList, fun = "GSEA", TERM2GENE=hallmark_gs.entrez,
+                              TERM2NAME=hallmark_gs.descr, pvalueCutoff = 1, pAdjustMethod = "none",
+                              eps=0, minGSSize = 10, maxGSSize = 500)
> 
> MSigDb.gsea <- setReadable(MSigDb.gsea, 'org.Hs.eg.db', 'ENTREZID')
> 
> MSigDb.gsea
#
# Result of Comparing 3 gene clusters 
#
#.. @fun         GSEA 
#.. @geneClusters       List of 3
 $ List1: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
 $ List2: Named num [1:12345] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12345] "4312" "8318" "10874" "55143" ...
 $ List3: Named num [1:12370] 1.91 1.9 1.9 1.9 1.89 ...
  ..- attr(*, "names")= chr [1:12370] "1164" "3161" "3595" "11004" ...
#...Result      'data.frame':   150 obs. of  12 variables:
 $ Cluster        : Factor w/ 3 levels "List1","List2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID             : chr  "HALLMARK_E2F_TARGETS" "HALLMARK_G2M_CHECKPOINT" "HALLMARK_MYC_TARGETS_V1" "HALLMARK_MTORC1_SIGNALING" ...
 $ Description    : chr  "Genes encoding cell cycle related targets of E2F transcription factors." "Genes involved in the G2/M checkpoint, as in progression through the cell division cycle." "A subgroup of genes regulated by MYC - version 1 (v1)." "Genes up-regulated through activation of mTORC1 complex." ...
 $ setSize        : int  187 197 193 196 177 199 195 200 83 180 ...
 $ enrichmentScore: num  0.774 0.752 0.623 0.577 0.589 ...
 $ NES            : num  3.41 3.35 2.77 2.56 2.57 ...
 $ pvalue         : num  2.29e-49 6.31e-47 1.14e-22 7.16e-18 1.45e-17 ...
 $ p.adjust       : num  2.29e-49 6.31e-47 1.14e-22 7.16e-18 1.45e-17 ...
 $ qvalue         : num  4.57e-48 6.31e-46 7.58e-22 3.58e-17 5.79e-17 ...
 $ rank           : num  1691 997 3058 2057 2454 ...
 $ leading_edge   : chr  "tags=64%, list=14%, signal=56%" "tags=46%, list=8%, signal=43%" "tags=57%, list=24%, signal=44%" "tags=47%, list=16%, signal=40%" ...
 $ core_enrichment: chr  "CDCA8/CDC20/CENPE/MYBL2/MELK/CCNB2/TOP2A/E2F8/RRM2/DLGAP5/CDCA3/TACC3/CENPM/RAD51AP1/UBE2S/CDK1/MAD2L1/GINS1/BI"| __truncated__ "CDC45/CDC20/KIF23/CENPE/MYBL2/CCNB2/NDC80/TOP2A/UBE2C/PBK/NUSAP1/TPX2/TACC3/NEK2/SLC7A5/UBE2S/CCNA2/CDK1/MAD2L1"| __truncated__ "CDC45/CDC20/CCNA2/MAD2L1/MCM5/MCM2/CTPS1/MCM6/MCM4/PRDX4/SNRPA1/CCT5/RFC4/TYMS/KPNA2/H2AZ1/MYC/PCNA/PSMA7/PSMD3"| __truncated__ "RRM2/SLC7A5/FADS2/PDK1/AURKA/PSAT1/MCM2/CDC25A/PLK1/PHGDH/WARS1/TFRC/DHCR7/MCM4/CTSC/MTHFD2/ME1/ASNS/GMPS/BUB1/"| __truncated__ ...
#.. number of enriched terms found for each gene cluster:
#..   List1: 50 
#..   List2: 50 
#..   List3: 50 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> 
> 
> 

@Sidragull57
Copy link
Author

Hallo,

Thank you very musch. yes it worked for hallmark but for GSEA-Kegg, we need Entrez ID, from ChatGPT I got the following code

..geneList_fgsea <- geneList_fgsea[!duplicated(names(geneList_fgsea))]
geneList_fgsea <- sort(geneList_fgsea, decreasing = TRUE)

to remove duplicates and it resolved the issue.

Thank you very much for your quick response and support.

Regards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants