PROCEDURE
Notice
The web version of PhosMap is intended only as a quick start for visualization, because the R Shiny demo server is single-threaded and runs on modest hardware. We recommend analyzing only small data sets on the demo server. Larger data sets need more capable hardware, in proportion to their computational cost, and for these we recommend the local Docker version of PhosMap.
Introduction of example data
Here, we reanalyze the (phospho)proteomic profiles of WiDr colorectal cancer cells harbouring the BRAF(V600E) mutation after treatment with vemurafenib over a time course of 0, 2, 6, 24, and 48 hours [1]. The raw files were deposited in the ProteomeXchange Consortium (PXD007740).
The raw data were processed in Firmiana, a one-stop proteomic cloud platform[2], to obtain quantitative peptide and protein files.
You can download the example data from https://github.com/liuzan-info/PhosMap/tree/master/examplefile/mascot and https://github.com/liuzan-info/PhosMap/tree/master/examplefile/maxquant.
Preprocessing for Maxquant data
Import MaxQuant data
How to import your MaxQuant data
- Go to the ‘Import data’ tab.
- Choose 'Maxquant' to start with data from Maxquant.
- Click ‘Browse’ to upload the phosphoproteomics experimental design file in .txt format and the Phospho (STY)Sites.txt file. The proteomics experimental design file is optional.
- Uploaded data will be shown in the 'Data Overview' secondary tab.
- You can also choose ‘load example data’ to use the example files.
Quality control and merging
Function
Generate merged phosphoproteomics data frame based on peptides files.
How to get analysis results
- Go to the ‘Preprocessing’ tab.
- Modify the parameters in Step1 according to your needs.
- Click the run button in Step1 and the file will appear on the right.
Parameter Selection Explanation
- Minimum Score: The minimum score for a peptide to be considered credible. The default is 40; raise it for more confidence or lower it to retain more data.
- Minimum Localization Probability: The minimum confidence that the modification site is correctly localized. It is typically set to 0.75, i.e. 75% confidence.
- Minimum Detection Frequency: The minimum number of samples in which a p-site must be detected, i.e. the number of samples minus the number of '0' values.
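The three thresholds above can be read as one combined filter. The following is a minimal Python sketch of that reading, not PhosMap's own code: the record layout, the field names (`score`, `loc_prob`, `intensities`) and the inclusive comparisons are assumptions made for illustration.

```python
def detection_frequency(intensities):
    """Number of samples with a non-zero intensity."""
    return sum(1 for value in intensities if value != 0)


def passes_qc(site, min_score=40.0, min_loc_prob=0.75, min_detection=1):
    """Return True when a site record meets all three thresholds.

    Inclusive comparisons (>=) are an assumption of this sketch.
    """
    return (
        site["score"] >= min_score
        and site["loc_prob"] >= min_loc_prob
        and detection_frequency(site["intensities"]) >= min_detection
    )


# Hypothetical records; real values would come from the MaxQuant site table.
sites = [
    {"id": "site_A", "score": 85.2, "loc_prob": 0.99, "intensities": [1200.0, 0.0, 950.0]},
    {"id": "site_B", "score": 35.0, "loc_prob": 0.98, "intensities": [400.0, 380.0, 410.0]},
    {"id": "site_C", "score": 60.1, "loc_prob": 0.60, "intensities": [0.0, 0.0, 75.0]},
    {"id": "site_D", "score": 52.4, "loc_prob": 0.80, "intensities": [0.0, 0.0, 310.0]},
]

kept = [s["id"] for s in sites if passes_qc(s, min_detection=2)]
print(kept)  # ['site_A']
```

In this toy input, site_B fails on score, site_C on localization probability, and site_D on detection frequency (one non-zero sample out of three).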
Interpretation of analysis results
We performed quality control on the identified phosphopeptides using PhosMap. Phosphopeptides were kept if they met 1% FDR at the peptide level, had an ion score greater than 20, and carried the highest-confidence p-site probabilities computed by Mascot. We then merged the phosphopeptide lists with the quantitative values from all experiments to generate a matrix for analysis.
Data normalization
Function
PhosMap provides two kinds of normalization: total sum scaling, and normalization of phosphoproteomics data against proteomics data.
How to get analysis results
- Go to the ‘Preprocessing’ tab.
- Modify the parameters according to your needs.
- Click the run button in Step2 and the p-site data normalized by total sum scaling will appear on the right.
- Click the run button in Step3 and the p-site data normalized against proteomics data will appear on the right.
Parameter Selection Explanation
- Normalization Method: The approach used to correct for unequal loading of phosphopeptides. 'Global' and 'median' refer to scaling phosphopeptide intensities globally or by the median value.
- Imputation Method: The strategy for replacing missing values. 'Globally' and 'by group' refer to imputing missing values across the entire dataset or within each experimental group.
  - '0': Assigns a value of zero to missing data points.
  - 'Minimum': Uses the smallest value in the dataset for imputation.
  - 'Minimum/10': Imputes missing values with the smallest value divided by 10.
  - 'BPCA': Uses Bayesian Principal Component Analysis.
  - 'LLS': Uses Local Least Squares.
  - 'KNNMethod': Applies k-Nearest Neighbors for imputation.
  - 'RowMedian': Replaces missing data with the row's median value.
  - 'ImpSeq': Sequential imputation method.
  - 'ImpSeqProb': Probabilistic approach to imputation.
  - 'ColMedian': Fills in missing values with the column's median.
- Top: For each p-site, compute the row maximum, sort the row maxima in decreasing order, and keep the top N (as a percentage). The larger this value, the more phosphopeptides are retained.
- With Proteomics Data: Select this option if you need normalization against protein profiling data.
- Intensity Type: The options are 'Intensity' for raw abundance measurements, 'iBAQ' for intensity-Based Absolute Quantification, which estimates absolute protein amounts, and 'LFQ intensity' for Label-Free Quantification, which allows protein abundances to be compared across samples without labels.
- Minimum Unique Peptide: The minimum number of unique peptides that must be detected for a protein's quantitation to be considered reliable. A minimum of 1 means that even proteins identified by a single unique peptide are included.
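To make the 'Global' and 'median' options more concrete, here is a small Python sketch of per-sample scaling. It is one plausible reading of total sum scaling (divide each sample by its column total) and of median scaling (divide by the column median); the function name and the absence of any extra scale factor are assumptions, and PhosMap's internal implementation may differ.

```python
from statistics import median


def normalize_columns(matrix, method="global"):
    """Scale every sample (column) by its total or by its median."""
    n_cols = len(matrix[0])
    factors = []
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        factors.append(sum(column) if method == "global" else median(column))
    return [[row[j] / factors[j] for j in range(n_cols)] for row in matrix]


# Hypothetical intensities: rows are p-sites, columns are samples.
# Sample 2 was loaded at twice the amount of sample 1.
raw = [
    [10.0, 20.0],
    [30.0, 60.0],
    [60.0, 120.0],
]

for row in normalize_columns(raw, "global"):
    print(row)
```

After either kind of scaling, the two columns of this toy matrix agree, which is the intended effect of correcting for unequal loading.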
Preprocessing for Firmiana data
Import Firmiana data
How to import your Firmiana data
- Go to the ‘Import data’ tab.
- Choose 'Firmiana' to start with data from Firmiana.
- Click ‘Browse’ to upload phosphoproteomics experimental design file in .txt format.
- Zip your Mascot xml files and phosphoproteomics peptide files, then upload them. The folder tree is shown below. The file names of the Mascot xml files and phosphoproteomics peptide files must be consistent with the ‘Experiment_Code’ column of the phosphoproteomics experimental design file.
- Proteomics data is optional. Click ‘Browse’ to upload the proteomics experimental design file in .txt format. Zip your Profiling_gene_txt files and upload them. The folder tree is shown below. The file names of the Profiling_gene_txt files must be consistent with the ‘Experiment_Code’ column of the proteomics experimental design file.
- Uploaded data will be shown in the 'Data Overview' secondary tab.
- You can also choose ‘load example data’ to use the example files.
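Because the uploaded file names must be consistent with Experiment_Code, it can help to check them before zipping. The sketch below assumes that "consistent" means the file name without its extension equals the code exactly; that rule, the function name and the example names are all assumptions, so adapt the comparison to your actual naming scheme.

```python
from pathlib import PurePath


def check_file_names(experiment_codes, file_names):
    """Compare file stems with the design file's Experiment_Code values."""
    stems = {PurePath(name).stem for name in file_names}
    codes = set(experiment_codes)
    return {
        "missing_files": sorted(codes - stems),
        "unexpected_files": sorted(stems - codes),
    }


# Hypothetical codes and file names, for illustration only.
codes = ["Exp001", "Exp002", "Exp003"]
files = ["Exp001.xml", "Exp002.xml", "Exp004.xml"]

print(check_file_names(codes, files))
# {'missing_files': ['Exp003'], 'unexpected_files': ['Exp004']}
```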
Parser
Function
If you start with .xml files from Mascot results, you can run this step to parse them into site score files, from which .csv files of phosphorylation sites with confidence scores will be generated.
How to get analysis results
- Go to the 'Preprocessing' tab.
- Click the run button in Step1 and the file will appear on the right.
Quality control and merging
Function
Generate merged phosphoproteomics data frame based on peptides files.
How to get analysis results
- Go to the ‘Preprocessing’ tab.
- Modify the parameters according to your needs.
- Click the run button in Step2 and the file will appear on the right.
Parameter Selection Explanation
- Minimum Score: The minimum score for a peptide to be considered credible. The default is 20; raise it for more confidence or lower it to retain more data.
- Minimum FDR: The FDR threshold for credible peptides. The default is 0.01, which means that about 1% of the retained peptide identifications are expected to be false positives. Set this according to the specificity and sensitivity your analysis requires.
Interpretation of analysis results
We performed quality control on the identified phosphopeptides using PhosMap. Phosphopeptides were kept if they met 1% FDR at the peptide level, had an ion score greater than 20, and carried the highest-confidence p-site probabilities computed by Mascot. We then merged the phosphopeptide lists with the quantitative values from all experiments to generate a matrix for analysis.
Mapping p-sites to protein
Function
Maps protein GI numbers to gene symbols and outputs an expression profile matrix with gene symbols.
Constructs a data frame with a unique phosphorylation site for each protein sequence.
How to get analysis results
- Go to the ‘Preprocessing’ tab.
- Modify the parameters according to your needs.
- Click the run button in Step3 and the file will appear on the right.
Parameter Selection Explanation
- Species: Select 'human', 'mouse' or 'rattus' to map peptide sequences against the protein database of the corresponding species.
- ID type: Offers multiple identifier options for proteins:
  - 'RefSeq_Protein_GI': Uses NCBI's Reference Sequence (RefSeq) GenInfo Identifier numbers. Example: GI 4502027.
  - 'RefSeq_Protein_Accession': Uses RefSeq accession numbers. Example: NP_001101.1.
  - 'Uniprot_Protein_Accession': Uses accession numbers from the Universal Protein Resource (UniProt). Example: P60709.
  - 'GeneID': Refers to the numeric gene identifier used by NCBI. Example: 60 (the gene ACTB).
- fasta type: Indicates the source database of the protein sequences. Options are 'refseq' for NCBI's RefSeq database or 'uniprot' for the UniProt database. The selection should match your sequence data and the chosen ID type to keep the mapping results consistent.
Interpretation of analysis results
By combining the phosphopeptide sequence, the modification position, the attached protein ID and the built-in human protein reference database of PhosMap, all p-sites were mapped to the corresponding protein sequence and represented by a unique p-site identifier (upsID) consisting of a protein GI number/accession, a gene symbol and the location of the p-site in the protein sequence. In addition, proteome data matched to the phosphoproteome were collected at each time point in the study by Ressa et al. Finally, 3,649 unique p-sites were obtained and their quantitative values were normalized by the matched protein profiling data using PhosMap.
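The mapping idea can be illustrated with a short Python sketch: locate the peptide in the protein sequence, then offset by the position of the modified residue within the peptide. The function names, the underscore-separated identifier layout, the placeholder accession and the toy sequences are all assumptions for illustration; PhosMap's actual upsID format is not specified here beyond its three components.

```python
def map_psite(peptide, site_in_peptide, protein_seq):
    """Return the 1-based protein position of a p-site, or None if unmapped.

    site_in_peptide is the 1-based position of the modified residue
    within the unmodified peptide sequence.
    """
    start = protein_seq.find(peptide)  # 0-based index of the first match
    if start == -1:
        return None
    return start + site_in_peptide


def make_ups_id(protein_id, gene_symbol, residue, position):
    """Join the three components; the separator is an assumption."""
    return f"{protein_id}_{gene_symbol}_{residue}{position}"


# Toy protein and peptide, not real sequences or identifiers.
protein = "MKTAYIAKQRSPSLEDK"
peptide = "QRSPSLEDK"

pos = map_psite(peptide, 3, protein)
print(pos)  # 11
print(make_ups_id("NP_000000.0", "GENE1", protein[pos - 1], pos))
# NP_000000.0_GENE1_S11
```

This sketch uses only the first occurrence of the peptide; a peptide that occurs more than once in a protein would need an explicit rule.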
Data normalization
Function
PhosMap provides two kinds of normalization: total sum scaling, and normalization of phosphoproteomics data against proteomics data.
How to get analysis results
- Go to the ‘Preprocessing’ tab.
- Modify the parameters according to your needs.
- Click the run button in Step4 and the p-site data normalized by total sum scaling will appear on the right.
- Click the run button in Step5 and the p-site data normalized against proteomics data will appear on the right.
Parameter Selection Explanation
- Minimum Detection Frequency: The minimum number of samples in which a p-site must be detected, i.e. the number of samples minus the number of '0' values.
- Normalization Method: The approach used to correct for unequal loading of phosphopeptides. 'Global' and 'median' refer to scaling phosphopeptide intensities globally or by the median value.
- Imputation Method: The strategy for replacing missing values. 'Globally' and 'by group' refer to imputing missing values across the entire dataset or within each experimental group.
  - '0': Assigns a value of zero to missing data points.
  - 'Minimum': Uses the smallest value in the dataset for imputation.
  - 'Minimum/10': Imputes missing values with the smallest value divided by 10.
  - 'BPCA': Uses Bayesian Principal Component Analysis.
  - 'LLS': Uses Local Least Squares.
  - 'KNNMethod': Applies k-Nearest Neighbors for imputation.
  - 'RowMedian': Replaces missing data with the row's median value.
  - 'ImpSeq': Sequential imputation method.
  - 'ImpSeqProb': Probabilistic approach to imputation.
  - 'ColMedian': Fills in missing values with the column's median.
- Top: For each p-site, compute the row maximum, sort the row maxima in decreasing order, and keep the top N (as a percentage). The larger this value, the more phosphopeptides are retained.
- Control Label: Allows selection of a reference label for comparative analysis.
- US Cutoff: A user-set threshold for inclusion based on abundance or statistical significance.
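The simpler imputation options in the list above can be sketched in a few lines of Python. This sketch covers only '0', 'Minimum', 'Minimum/10' and 'RowMedian' applied globally, represents missing values as `None`, and falls back to the global minimum for rows with no observed value; those choices and the function name are assumptions, and the model-based methods (BPCA, LLS, KNN, ImpSeq) are not reproduced.

```python
from statistics import median


def impute(matrix, method="Minimum"):
    """Fill None entries using one of the simple strategies."""
    observed = [v for row in matrix for v in row if v is not None]
    global_min = min(observed)
    filled = []
    for row in matrix:
        row_observed = [v for v in row if v is not None]
        if method == "0":
            fill = 0.0
        elif method == "Minimum":
            fill = global_min
        elif method == "Minimum/10":
            fill = global_min / 10
        elif method == "RowMedian":
            # Fallback for fully missing rows is an assumption of this sketch.
            fill = median(row_observed) if row_observed else global_min
        else:
            raise ValueError(f"method not covered by this sketch: {method}")
        filled.append([fill if v is None else v for v in row])
    return filled


# Hypothetical matrix: rows are p-sites, columns are samples.
data = [
    [4.0, None, 8.0],
    [None, 2.0, 6.0],
]

print(impute(data, "Minimum"))    # [[4.0, 2.0, 8.0], [2.0, 2.0, 6.0]]
print(impute(data, "RowMedian"))  # [[4.0, 6.0, 8.0], [4.0, 2.0, 6.0]]
```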
Preprocessing for Spectronaut data
Import Spectronaut data
How to import your Spectronaut data
- Go to the ‘Import data’ tab.
- Choose 'Spectronaut' under 'DIA' to start with data from Spectronaut.
- Click ‘Browse’ to upload phosphoproteomics experimental design file in .txt format.
- Click ‘Browse’ to upload Report file generated by Spectronaut in xls format.
Parser & p-site Quality Control
Function
If you start with .xls files from Spectronaut results, you can run this step to parse them into site score files, from which .csv files of phosphorylation sites with confidence scores will be generated.
How to get analysis results
- Go to the 'Preprocessing' tab.
- Click the run button in Step1 and the file will appear on the right.
Parameter Selection Explanation
- Minimum Detection Frequency: The minimum number of samples in which a p-site must be detected, i.e. the number of samples minus the number of '0' values.
Normalization & Imputation & Filtering
Function
PhosMap performs a total sum scaling normalization and imputation for missing values with various methods.
How to get analysis results
- Go to the 'Preprocessing' tab.
- Click the run button in Step2 and the file will appear on the right.
Parameter Selection Explanation
- Normalization Method: The approach used to correct for unequal loading of phosphopeptides. 'Global' and 'median' refer to scaling phosphopeptide intensities globally or by the median value.
- Imputation Method: The strategy for replacing missing values. 'Globally' and 'by group' refer to imputing missing values across the entire dataset or within each experimental group.
  - '0': Assigns a value of zero to missing data points.
  - 'Minimum': Uses the smallest value in the dataset for imputation.
  - 'Minimum/10': Imputes missing values with the smallest value divided by 10.
  - 'BPCA': Uses Bayesian Principal Component Analysis.
  - 'LLS': Uses Local Least Squares.
  - 'KNNMethod': Applies k-Nearest Neighbors for imputation.
  - 'RowMedian': Replaces missing data with the row's median value.
  - 'ImpSeq': Sequential imputation method.
  - 'ImpSeqProb': Probabilistic approach to imputation.
  - 'ColMedian': Fills in missing values with the column's median.
- Top: For each p-site, compute the row maximum, sort the row maxima in decreasing order, and keep the top N (as a percentage). The larger this value, the more phosphopeptides are retained.
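The 'Top' filter described above can be sketched as follows. The sketch assumes the value is given as a fraction between 0 and 1 and that the number of rows to keep is rounded up; both are assumptions, as is the function name, and the tool's own rounding rule may differ.

```python
import math


def keep_top(matrix, top=0.9):
    """Keep the rows whose maximum ranks within the top fraction `top`."""
    n_keep = math.ceil(len(matrix) * top)  # rounding up is an assumption
    ranked = sorted(range(len(matrix)), key=lambda i: max(matrix[i]), reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original row order
    return [matrix[i] for i in kept]


# Hypothetical intensities: rows are p-sites, columns are samples.
intensities = [
    [1.0, 2.0],   # row maximum 2
    [9.0, 3.0],   # row maximum 9
    [4.0, 4.0],   # row maximum 4
    [0.5, 7.0],   # row maximum 7
]

print(keep_top(intensities, top=0.5))  # [[9.0, 3.0], [0.5, 7.0]]
```

With `top=0.5`, the two rows with the largest maxima (9 and 7) are retained; with `top=1.0`, nothing is removed.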
Preprocessing for Dia-NN data
Import Dia-NN data
How to import your Dia-NN data
- Go to the ‘Import data’ tab.
- Choose 'Dia-NN' under 'DIA' to start with data from Dia-NN.
- Click ‘Browse’ to upload phosphoproteomics experimental design file in .txt format.
- Click ‘Browse’ to upload Report file generated by Dia-NN in tsv format.
Parser & p-site Quality Control
Function
If you start with .tsv files from Dia-NN results, you can run this step to parse them into site score files, from which .csv files of phosphorylation sites with confidence scores will be generated.
How to get analysis results
- Go to the 'Preprocessing' tab.
- Click the run button in Step1 and the file will appear on the right.
Parameter Selection Explanation
- Minimum Detection Frequency: The minimum number of samples in which a p-site must be detected, i.e. the number of samples minus the number of '0' values.
- PTM.Q.Value threshold: The Q-value threshold for credible peptides. The default is 0.01; lower it for more confidence or raise it to retain more data.
- Species: Select 'human', 'mouse' or 'rattus' to map peptide sequences against the protein database of the corresponding species.
Normalization & Imputation & Filtering
Function
PhosMap performs a total sum scaling normalization and imputation for missing values with various methods.
How to get analysis results
- Go to the 'Preprocessing' tab.
- Click the run button in Step2 and the file will appear on the right.
Parameter Selection Explanation
- Normalization Method: The approach used to correct for unequal loading of phosphopeptides. 'Global' and 'median' refer to scaling phosphopeptide intensities globally or by the median value.
- Imputation Method: The strategy for replacing missing values. 'Globally' and 'by group' refer to imputing missing values across the entire dataset or within each experimental group.
  - '0': Assigns a value of zero to missing data points.
  - 'Minimum': Uses the smallest value in the dataset for imputation.
  - 'Minimum/10': Imputes missing values with the smallest value divided by 10.
  - 'BPCA': Uses Bayesian Principal Component Analysis.
  - 'LLS': Uses Local Least Squares.
  - 'KNNMethod': Applies k-Nearest Neighbors for imputation.
  - 'RowMedian': Replaces missing data with the row's median value.
  - 'ImpSeq': Sequential imputation method.
  - 'ImpSeqProb': Probabilistic approach to imputation.
  - 'ColMedian': Fills in missing values with the column's median.
- Top: For each p-site, compute the row maximum, sort the row maxima in decreasing order, and keep the top N (as a percentage). The larger this value, the more phosphopeptides are retained.
Analysis and visualization
PhosMap incorporates six analysis modules: dimension reduction analysis, differential expression analysis, time course analysis, kinase activity prediction, phosphorylation motif enrichment analysis and survival analysis.
Upload data
Function
In this step, you can upload your preprocessed data to PhosMap, such as the phosphorylation dataframe. If you have not preprocessed your data, you must preprocess it with PhosMap (go to the ‘Preprocessing’ tab) or do it yourself.
How to upload your data
- Go to the 'Upload data' under 'Analysis' tab.
- Choose 'Load example data' or follow the prompts to upload your own corresponding four files. If you have not preprocessed your data, you can click 'Go to preprocessing' to preprocess it.
Dimension reduction analysis
Function
In PhosMap, dimension reduction analysis supports PCA, t-SNE and UMAP.
The meaning of the parameters
- ‘Title’ refers to the main title of the plot.
- ‘Legend title’ refers to the title of the legend in the plot.
- ‘Random seed’ is a parameter for t-SNE that sets the seed for the random number generator. This can be used to ensure reproducibility of results.
- ‘Perplexity’ is a numerical value for t-SNE, with a default value of 2. It balances the focus between preserving the local and global structure of the data.
- ‘Neighbors’ is a parameter for UMAP that refers to the size of the local neighborhood (in terms of the number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved.
How to get analysis results
- Go to the 'Dimension reduction analysis' under ‘Analysis’ tab.
- Modify the parameters according to your needs.
- Click the ‘Analysis’ button.
- The PCA, t-SNE and UMAP plot after running will appear on the right.
- Click the download button to download the plot file.
Parameter Selection Explanation
Interpretation of analysis results
To get an overview of the effect of the different treatment durations, we performed PCA in the downstream analysis module of PhosMap. The phosphorylation profiles of colorectal cancer cells after longer vemurafenib treatment (24 h and 48 h) were clearly different from those after short treatment (2 h and 6 h). In addition, principal component 1 (PC1) accounts for 31.77% of the variance, compared with 20% in the original literature, which suggests that the phosphorylation profile normalized by matched proteomics data better represents the variation over time in the BRAFi-treated samples.
Differential expression analysis
Function
In PhosMap, differential expression analysis supports limma, SAM and ANOVA.
The meaning of the parameters
- ‘Control’ refers to the control group in the experiment.
- ‘Experiment’ refers to the experimental group in the experiment.
- ‘P-value threshold’ is the threshold for determining statistical significance based on the p-value.
- ‘P-value adjust method’ is the method used to adjust p-values for multiple comparisons.
- ‘FC threshold’ is the fold change threshold for determining significant changes in phosphorylation levels.
- ‘nperms’ is a parameter for the SAM method that specifies the number of permutations to perform.
- ‘Minimum FDR’ is the minimum false discovery rate threshold for determining statistical significance.
- ‘Clustering distance rows’ is a parameter for heatmap generation that specifies the distance metric used for clustering rows.
- ‘Clustering method’ is a parameter for heatmap generation that specifies the clustering method used to cluster rows and columns.
How to get analysis results
- Go to the 'Differential Expression Analysis' under ‘Analysis’ tab.
- Go to the 'limma', 'SAM' or 'ANOVA' secondary tab.
- Choose Control and Experiment used for differential Expression Analysis.
- Choose 'Interactive mode' and click the 'Analysis’ button. The interactive plot after running will appear on the right.
- Choose 'Static mode' and click the 'Analysis’ button. The static plot after running will appear on the right.
- Click the ‘Plot Heatmap’ button. The heatmap will appear in the pop-up window.
- Click the download button to download the plot file.
Parameter Selection Explanation
- limma:
  - Control: Identifies the control group in the experiment. Choose the group that represents the normal or untreated condition.
  - Experiment: Specifies the group subjected to the experimental condition or treatment.
  - p-value Threshold: Sets the cutoff for statistical significance, typically 0.05. Lower this threshold for stricter criteria.
  - p-value Adjust Method: Selects the correction method for multiple comparisons. Available methods include:
    - 'none': No adjustment for multiple comparisons.
    - 'holm': Holm's sequential Bonferroni method.
    - 'hochberg': Hochberg's step-up procedure.
    - 'hommel': Hommel's correction.
    - 'bonferroni': Bonferroni correction; very stringent, and increasingly so as the number of tests grows.
    - 'BH': Benjamini-Hochberg method, controls the false discovery rate.
    - 'BY': Benjamini-Yekutieli method, controls the false discovery rate when tests are dependent.
    - 'fdr': Equivalent to 'BH' (the two names are aliases in R's p.adjust).
  - FC Threshold: Sets the minimum fold change for a p-site to be considered significantly different, commonly 2.
  - Title: Title of the plot.
  - X Axis Label: The label for the X-axis, usually 'log2FC' for fold change.
  - Y Axis Label: The label for the Y-axis, often '-log10(p-value)' to represent statistical significance.
  - "UP" Colour: The color representing up-regulated p-sites.
  - "DOWN" Colour: The color representing down-regulated p-sites.
  - "NOT" Colour: The color for p-sites without significant change.
- SAM:
  - Control: Identifies the control group in the experiment. Choose the group that represents the normal or untreated condition.
  - Experiment: Specifies the group subjected to the experimental condition or treatment.
  - nperms: Number of permutations to perform during the SAM analysis. More permutations give a more accurate estimate of the false discovery rate but increase computation time.
  - Minimum FDR: The threshold for the False Discovery Rate, the expected proportion of false positives among the tests declared significant. A common threshold is 0.05, indicating a 5% expected rate of false discoveries.
- ANOVA:
  - FC threshold: Sets the minimum fold change for a p-site to be considered significantly different, commonly 2.
  - p-value Threshold: Sets the cutoff for statistical significance, typically 0.05. Lower this threshold for stricter criteria.
  - p-value adjust method: Selects the correction method for multiple comparisons. Available methods include:
    - 'none': No adjustment for multiple comparisons.
    - 'holm': Holm's sequential Bonferroni method.
    - 'hochberg': Hochberg's step-up procedure.
    - 'hommel': Hommel's correction.
    - 'bonferroni': Bonferroni correction; very stringent, and increasingly so as the number of tests grows.
    - 'BH': Benjamini-Hochberg method, controls the false discovery rate.
    - 'BY': Benjamini-Yekutieli method, controls the false discovery rate when tests are dependent.
    - 'fdr': Equivalent to 'BH' (the two names are aliases in R's p.adjust).
- Heatmap:
  - Scale: Adjusts the data for heatmap visualization, affecting how patterns and gradients are displayed. Options include:
    - 'none': No scaling is performed, displaying the raw values.
    - 'row': Normalizes the data within each row.
    - 'column': Normalizes the data within each column.
  - Cluster by Row: If selected, rows are grouped by similarity, which is useful for identifying patterns in the data.
  - Clustering Distance Rows: Selects the metric used to calculate the distance between rows in the heatmap, which affects how rows are clustered.
    - 'euclidean': A standard distance measure, the square root of the sum of squared differences.
    - 'correlation': Measures the degree to which rows are correlated. This is useful when the pattern of change matters more than its magnitude.
  - Clustering Method: The algorithm used to cluster rows or columns in the heatmap, which influences how similar patterns are grouped.
    - 'ward.D2': Ward's method; minimizes the sum of squared differences within all clusters.
    - 'ward.D': A variant of Ward's method in which the dissimilarities are not squared before cluster updating.
    - 'single': Uses the minimum of the distances between all observations of the two sets.
    - 'complete': Uses the maximum of the distances between all observations of the two sets.
    - 'average': Uses the average of the distances between all observations of the two sets.
    - 'mcquitty': Similar to average, but uses the WPGMA linkage (weighted pair group method with averaging).
    - 'median': Similar to centroid, but gives the two merged clusters equal weight when the new centroid is computed.
    - 'centroid': Uses the distance between the centroids of the clusters.
  - Miss Row Name: If selected, row names are not displayed in the heatmap.
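Since 'BH' is a frequently used choice among the p-value adjustment options, a small worked example may help. The Python sketch below implements the standard Benjamini-Hochberg step-up adjustment, which is the same calculation that R's `p.adjust(method = "BH")` performs; the function name and the input p-values are made up for illustration.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value
    # is capped by the adjusted value of the next larger p-value.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four p-sites.
raw_p = [0.01, 0.04, 0.03, 0.005]
print([round(v, 4) for v in bh_adjust(raw_p)])  # [0.02, 0.04, 0.04, 0.02]
```

Each raw p-value is multiplied by n/rank, and the running minimum keeps the adjusted values monotone in the rank order.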
Interpretation of analysis results
To illustrate differential expression analysis between two experimental conditions, we used the limma method integrated into the differential expression analysis module of PhosMap to identify 128 significantly differentially expressed p-sites (DEPs) between samples treated with BRAFi for two hours and control samples (P value < 0.05 and fold change > 2). 139 p-sites were up-regulated in the BRAFi-treated samples. The largest difference was observed for DAP_S51, whose phosphoserine is related to the MTOR pathway. 99 p-sites were down-regulated in the BRAFi-treated samples.
For the multiple experimental conditions, we leveraged the embedded ANOVA analysis of PhosMap and identified 548 DEPs among the five time points (P value < 0.1 and fold change > 2).
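The combined cutoff used above (a p-value threshold together with a fold change threshold) can be expressed as a small classification rule. The Python sketch below labels each p-site as 'UP', 'DOWN' or 'NOT', matching the three colour categories of the volcano plot; the function name, the strict inequalities and the example values are assumptions for illustration.

```python
import math


def classify(log2_fc, p_value, fc_threshold=2.0, p_threshold=0.05):
    """Label a p-site from its log2 fold change and p-value."""
    if p_value >= p_threshold:
        return "NOT"
    cutoff = math.log2(fc_threshold)  # fold change 2 -> |log2FC| of 1
    if log2_fc > cutoff:
        return "UP"
    if log2_fc < -cutoff:
        return "DOWN"
    return "NOT"


# Hypothetical (log2FC, p-value) pairs, not results from the example data.
examples = {
    "site_1": (1.8, 0.003),
    "site_2": (-2.1, 0.020),
    "site_3": (0.4, 0.001),
    "site_4": (2.5, 0.200),
}

labels = {name: classify(fc, p) for name, (fc, p) in examples.items()}
print(labels)
# {'site_1': 'UP', 'site_2': 'DOWN', 'site_3': 'NOT', 'site_4': 'NOT'}
```

site_3 is significant but changes too little, and site_4 changes strongly but is not significant, so both fall into 'NOT'.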
Time Course Analysis
Function
PhosMap applies fuzzy clustering to time course data to discover patterns associated with time points. For each cluster, it also draws the corresponding line chart, combined with the membership values.
The meaning of the parameters
- ‘Minimum membership value’ is a threshold for determining the minimum membership value for a data point to be included in a cluster.
- ‘Iteration’ is the number of iterations to perform in the clustering algorithm.
- ‘Number of clusters’ is the number of clusters to generate in the clustering algorithm.
How to get analysis results
- Go to the ‘Time course Analysis (fuzzy clustering)’ under ‘Analysis’ tab.
- Modify the parameters according to your needs.
- Click the ‘Analysis’ button. The plot after running will appear on the right.
- Click the download button to download the plot file.
Interpretation of analysis results
These 548 DEPs were used as inputs in the time course analysis module of PhosMap, then 9 strong expression patterns were generated. Two major clusters show significant downregulation at the phosphoproteomics signalling level upon BRAFi treatment in line with the original literature. Cluster 1 responds within 2 hours, an early treatment response. Cluster 2 responds within 24 hours, a late treatment response.
Parameter Selection Explanation
- p-value Threshold: Sets the cutoff for statistical significance, typically 0.05. Lower this threshold for stricter criteria.
- p-value Adjust Method: Selects the correction method for multiple comparisons. Available methods include:
  - 'none': No adjustment for multiple comparisons.
  - 'holm': Holm's sequential Bonferroni method.
  - 'hochberg': Hochberg's step-up procedure.
  - 'hommel': Hommel's correction.
  - 'bonferroni': Bonferroni correction; very stringent, and increasingly so as the number of tests grows.
  - 'BH': Benjamini-Hochberg method, controls the false discovery rate.
  - 'BY': Benjamini-Yekutieli method, controls the false discovery rate when tests are dependent.
  - 'fdr': Equivalent to 'BH' (the two names are aliases in R's p.adjust).
- FC threshold: Sets the minimum fold change for a p-site to be considered significantly different, commonly 2.
- Minimum Membership Value: Determines how strictly data points are assigned to clusters in fuzzy clustering. A higher value means a data point must be more strongly associated with a cluster to be included, which makes clusters more distinct. A lower value allows more overlap between clusters, which may reflect subtle gradations in the data.
- Iteration: Specifies how many iterations the clustering algorithm runs to refine the clusters. More iterations can give more accurate clustering but take longer to compute; fewer iterations are quicker but may give less precise clusters.
- Number of Clusters: The predefined number of clusters into which the data are divided.
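To show how a minimum membership value acts on fuzzy clusters, the Python sketch below computes memberships with the standard fuzzy c-means formula and then applies the threshold. The fuzzifier `m = 2`, the fixed cluster centers, the function names and the toy profiles are assumptions; the sketch does not fit the centers and is not PhosMap's clustering code.

```python
import math


def memberships(profile, centers, m=2.0):
    """Fuzzy c-means membership of one profile in each cluster center."""
    distances = [math.dist(profile, c) for c in centers]
    zeros = distances.count(0.0)
    if zeros:  # the profile sits exactly on one or more centers
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


def assign(profile, centers, min_membership=0.7):
    """Return the index of the best cluster, or None if below the threshold."""
    u = memberships(profile, centers)
    best = max(range(len(u)), key=lambda k: u[k])
    return best if u[best] >= min_membership else None


# Hypothetical standardized profiles over three time points.
centers = [(1.0, 0.0, -1.0), (-1.0, 0.0, 1.0)]   # decreasing, increasing
clear_profile = (0.8, 0.1, -0.9)
ambiguous_profile = (0.0, 0.0, 0.0)

print(assign(clear_profile, centers))      # 0
print(assign(ambiguous_profile, centers))  # None
```

The ambiguous profile is equally far from both centers, so its memberships are 0.5 each and it is dropped at a threshold of 0.7.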
Kinase activity prediction (KSEA)
Function
In PhosMap, KSEA was used to predict kinase activity.
The meaning of the parameters
- ‘Control’ refers to the control group in the experiment.
- ‘Experiment’ refers to the experimental group in the experiment.
- ‘Species’ refers to the species of the organism being studied.
- ‘Scale’ is a parameter for scaling the data before generating the heatmap.
- ‘Clustering distance rows’ is a parameter for heatmap generation that specifies the distance metric used for clustering rows.
- ‘Clustering method’ is a parameter for heatmap generation that specifies the clustering method used to cluster rows and columns.
How to get analysis results
- Go to the ‘Kinase-Substrate Enrichment Analysis’ under ‘Analysis’ tab.
- Select ‘Multiple groups’ or ‘Two groups’ according to the number of groups of your data.
- Click the first 'Analysis' button. If ‘Multiple groups’ is selected, the plot will appear on the right after running. Click ‘view result’ to view and download the kinase prediction time course result. If ‘Two groups’ is selected, only the phosphorylation dataframe will appear on the right.
- Select a cluster if ‘Multiple groups’ is selected. Click the second ‘Analysis’ button. After running, the heatmap will appear on the right.
- Click the download button to download the plot file.
Parameter Selection Explanation
- Select a Cluster: Determines which cluster from previous time course analyses to use for enrichment analysis. Different clusters may yield different insights into kinase-substrate relationships.
- Species: Select 'human', 'mouse' or 'rattus' to map peptide sequences against the protein database of the corresponding species.
- Scale: Influences how the data are represented. Without scaling ('none'), the raw data are shown; with scaling, differences in expression levels may be emphasized ('row') or standardized across samples ('column').
- Clustering Distance Rows: The metric for measuring distances between data points in clustering. 'Euclidean' distance measures actual distances, whereas a metric like 'correlation' considers similarity in patterns of expression.
- Clustering Method: The algorithm used to cluster rows or columns in the heatmap, which influences how similar patterns are grouped.
  - 'ward.D2': Ward's method; minimizes the sum of squared differences within all clusters.
  - 'ward.D': A variant of Ward's method in which the dissimilarities are not squared before cluster updating.
  - 'single': Uses the minimum of the distances between all observations of the two sets.
  - 'complete': Uses the maximum of the distances between all observations of the two sets.
  - 'average': Uses the average of the distances between all observations of the two sets.
  - 'mcquitty': Similar to average, but uses the WPGMA linkage (weighted pair group method with averaging).
  - 'median': Similar to centroid, but gives the two merged clusters equal weight when the new centroid is computed.
  - 'centroid': Uses the distance between the centroids of the clusters.
- Title: The title of the plot.
Interpretation of analysis results
Afterwards, the substrates from the two clusters are imported into the KSEA module of PhosMap to infer kinase activities. The results indicate that CDK1/2, MAPK1/3 and AKT1 are suppressed during BRAFi treatment.
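For orientation, kinase activity in KSEA-style analyses is often summarized as a z-score that compares the mean log2 fold change of a kinase's known substrates with the mean of all quantified sites. The sketch below shows this commonly used scoring scheme under that assumption; PhosMap's own implementation may differ, and the site names, kinase names and values are hypothetical.

```python
import math
import statistics

def ksea_zscore(log2fc_by_site, substrates):
    """z = (mean of substrates - mean of all sites) * sqrt(m) / sd of all sites."""
    all_values = list(log2fc_by_site.values())
    hits = [log2fc_by_site[s] for s in substrates if s in log2fc_by_site]
    if not hits:
        return None  # no quantified substrate for this kinase
    mean_all = statistics.mean(all_values)
    sd_all = statistics.stdev(all_values)
    return (statistics.mean(hits) - mean_all) * math.sqrt(len(hits)) / sd_all

# Hypothetical log2 fold changes (treated vs. control) per phosphosite.
log2fc = {
    "siteA": -2.0, "siteB": -1.5, "siteC": 0.1,
    "siteD": 0.3, "siteE": -0.2, "siteF": 0.4,
}
# Hypothetical kinase-substrate annotation.
kinase_substrates = {
    "KIN1": ["siteA", "siteB"],
    "KIN2": ["siteC", "siteD", "siteF"],
}

scores = {k: ksea_zscore(log2fc, s) for k, s in kinase_substrates.items()}
```

A negative score (here for the hypothetical KIN1) indicates that the kinase's substrates are, on average, less phosphorylated than the background, which is read as reduced kinase activity.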
Motif enrichment analysis
Function
PhosMap performs motif enrichment analysis (MEA) on user-defined phosphopeptide lists to provide clues for finding candidate kinases that are not present in the database.
The meaning of parameters
- ‘Fasta type’ refers to the type of fasta file used as input for the analysis.
- ‘Selected row number for plotting motif logo’ is the number of rows to be selected for generating the motif logo plot.
- ‘Matched seqs threshold’ is the threshold for determining the minimum number of matched sequences required for a motif to be considered significant.
- ‘Scale’ is a parameter for scaling the data before generating the heatmap.
- ‘Distance metric’ is a parameter for heatmap generation that specifies the distance metric used for clustering rows.
- ‘Clustering method’ is a parameter for heatmap generation that specifies the clustering method used to cluster rows and columns.
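The roles of the p-value threshold and the ‘Matched seqs threshold’ can be sketched with a simplified enrichment test. The example below counts how many foreground sequence windows match a motif and compares this with the background using a hypergeometric test. This is only an illustration of the general idea, not the algorithm PhosMap runs; the motif, the 7-residue windows and all function names are hypothetical.

```python
import re
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n sequences are drawn from N, of which K match the motif."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def motif_enrichment(motif, foreground, background, min_matched=2):
    """Return match count and p-value, or None if too few foreground matches."""
    pattern = re.compile(motif)
    fg_hits = [s for s in foreground if pattern.fullmatch(s)]
    if len(fg_hits) < min_matched:
        return None  # filtered out by the matched-sequences threshold
    universe = foreground + background
    K = sum(1 for s in universe if pattern.fullmatch(s))
    p = hypergeom_upper_tail(len(fg_hits), K, len(foreground), len(universe))
    return {"matched": len(fg_hits), "pvalue": p}

# Hypothetical windows centered on the phosphorylated residue (position 4).
foreground = ["AAASPAA", "GKLTPRR", "RRESPKK", "LLDSPEE", "AAASAAA"]
background = [
    "KKRSLDD", "GGGTAGG", "LLLSPLL", "DDDSEEE", "QQQTQQQ",
    "EEEYEEE", "RRRSRRR", "VVVSVVV", "NNNTNNN", "HHHYHHH",
]

# Proline-directed motif: S or T at the center followed by P.
result = motif_enrichment(r"...[ST]P..", foreground, background, min_matched=2)
is_significant = result is not None and result["pvalue"] < 0.05
```

Raising `min_matched` removes motifs supported by only a few sequences before any p-value is considered, which mirrors the intent of the ‘Matched seqs threshold’ parameter.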
How to get analysis results
- Go to ‘Motif Enrichment Analysis’ under the ‘Analysis’ tab.
- Modify the parameters according to your needs.
- Click the ‘Analysis’ button.
- The foreground dataframe mapped to motifs is shown on the right after running.
- Select row number for plotting logo.
- Click the first ‘Plot’ button, and the logo will appear on the right.
- Modify the parameters below.
- Click the second ‘Plot’ button.
- The heatmap will appear on the right.
- Click the download button to download the plot file.
Parameter Selection Explanation
- species: Select 'human', 'mouse' or 'rattus' so that peptide sequences are mapped against the protein database of the corresponding species.
- fasta type: Indicates the database format for protein sequences. Options are 'refseq' for NCBI's RefSeq database or 'uniprot' for the UniProt database. Selection should align with the user's sequence data and the ID type chosen to ensure consistency in mapping results.
- pvalue threshold: Sets the cutoff for statistical significance, typically at 0.05. Lower this threshold for stricter criteria.
- Matched Seqs Threshold: Determines the minimum number of sequences that must match a motif for it to be included in the heatmap. A higher threshold may focus the analysis on more prevalent motifs, while a lower threshold allows inclusion of less common motifs.
- Scale: Controls whether and how values are standardized before plotting. 'None' means no standardization; 'row' centers and scales each row, which emphasizes how each motif varies across samples; 'column' centers and scales each column, which makes samples comparable with each other.
- Distance Metric: Specifies how the similarity between rows is measured. 'Euclidean' measures straight-line distance, which emphasizes absolute differences in values.
- Clustering Method: The algorithm used to cluster rows or columns in the heatmap, which influences how the patterns of similarity are grouped.
- 'ward.D2': Ward's minimum-variance method; at each step it merges the two clusters that give the smallest increase in the within-cluster sum of squares.
- 'ward.D': An older variant of Ward's method that matches Ward's criterion only when the input distances are squared.
- 'single': Uses the minimum distance between observations of the two clusters.
- 'complete': Uses the maximum distance between observations of the two clusters.
- 'average': Uses the average distance between all pairs of observations from the two clusters (UPGMA).
- 'mcquitty': The WPGMA method (weighted pair group method with averaging); like 'average', but the two merged clusters are weighted equally regardless of their sizes.
- 'median': Uses the distance between cluster centers, where the center of a merged cluster is the midpoint of the centers of its two parts (WPGMC).
- 'centroid': Uses the distance between the centroids of the two clusters (UPGMC).
- Title: The title of the plot.
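The difference between the linkage options listed above can be made concrete with a small sketch that computes the distance between two clusters under four of the criteria. It uses the plain geometric definitions on hypothetical 2-D points; the function names are illustrative, and the R clustering routine itself updates distances incrementally rather than recomputing them this way.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under a given linkage criterion."""
    pairwise = [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)                    # closest pair
    if method == "complete":
        return max(pairwise)                    # farthest pair
    if method == "average":
        return sum(pairwise) / len(pairwise)    # mean over all pairs
    if method == "centroid":
        return euclidean(centroid(cluster_a), centroid(cluster_b))
    raise ValueError(f"unsupported method: {method}")

# Two hypothetical clusters of observations.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(4.0, 0.0), (4.0, 4.0)]

distances = {
    m: linkage_distance(cluster_a, cluster_b, m)
    for m in ("single", "complete", "average", "centroid")
}
```

Because 'single' looks only at the closest pair, it tends to chain clusters together, while 'complete' looks at the farthest pair and tends to produce compact clusters; 'average' and 'centroid' fall between the two.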
Interpretation of analysis results
The 3,649 identified phosphopeptides are used as foreground sequences for MEA in PhosMap, and the results further strengthen the evidence of CDK and MAPK pathway deactivation in BRAF mutant CRC cells in response to BRAFi treatment.
Survival analysis
Function
This module is used to identify phosphorylation sites or kinases associated with the clinical outcomes of patients. Taking kinase or phosphorylation site files and patients’ survival information as input matrices, the coxph function from the survival R package is used to calculate the hazard ratio (HR) and P-value.
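As a rough illustration of what the Cox model computes, the sketch below fits a single-covariate proportional hazards model in plain Python by Newton-Raphson and returns the hazard ratio with a Wald P-value. It is not PhosMap's code: PhosMap calls coxph in R, which by default handles tied event times with the Efron method, whereas this sketch uses the simpler Breslow form. The patient data and all names are hypothetical.

```python
import math

def cox_single_covariate(times, events, x, max_iter=50, tol=1e-10):
    """Fit h(t) = h0(t) * exp(beta * x); return (hazard_ratio, wald_p_value)."""
    event_idx = [i for i, e in enumerate(events) if e == 1]
    beta = 0.0
    info = 0.0
    for _ in range(max_iter):
        grad, info = 0.0, 0.0
        for i in event_idx:
            # Risk set: patients still under observation at this event time.
            risk = [j for j in range(len(times)) if times[j] >= times[i]]
            w = [math.exp(beta * x[j]) for j in risk]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, risk))
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
            grad += x[i] - s1 / s0                 # score contribution
            info += s2 / s0 - (s1 / s0) ** 2       # information contribution
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided Wald test
    return math.exp(beta), p_value

# Hypothetical cohort: follow-up in months, event flag (1 = death, 0 = censored)
# and the abundance of one phosphosite per patient.
times = [5, 8, 12, 15, 20, 24, 30, 36]
events = [1, 1, 1, 0, 1, 1, 0, 1]
abundance = [2.1, 1.8, 0.5, 1.9, 1.2, 0.4, 0.9, 0.3]

hazard_ratio, p_value = cox_single_covariate(times, events, abundance)
```

A hazard ratio above 1 means that higher abundance is associated with a higher risk of the event, and a ratio below 1 with a lower risk. Repeating the fit for every site or kinase yields one P-value per feature, which is why a multiple-testing adjustment is applied afterwards.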
How to get analysis results
- Go to ‘Survival Analysis’ under the ‘Analysis’ tab.
- Modify the parameters according to your needs.
- Click the ‘Analysis’ button.
- The summary dataframe list will appear on the right.
- Click the ‘Plot’ button.
- The plot will appear on the right after running.
- Click the download button to download the plot file.
Parameter Selection Explanation
- pvalue adjust method: Selects the correction method for multiple comparisons. Available methods include:
- 'none': No adjustment for multiple comparisons.
- 'holm': Holm's sequential Bonferroni method.
- 'hochberg': Hochberg's step-up procedure.
- 'hommel': Hommel's correction.
- 'bonferroni': Bonferroni correction; very stringent, and its stringency increases with the number of tests.
- 'BH': Benjamini-Hochberg method, controls the false discovery rate.
- 'BY': Benjamini-Yekutieli method; controls the false discovery rate under arbitrary dependence between tests.
- 'fdr': Alias for 'BH', as in R's p.adjust function.
- pvalue threshold: Sets the cutoff for statistical significance, typically at 0.05. Lower this threshold for stricter criteria.
- "high" colour: The color coding for higher values or significant results in the output. The choice of color can affect the interpretability of the visualization.
- "low" colour: The color coding for lower values or non-significant results in the output. Similar to the "high" color, the choice here can influence visual contrast and readability.
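To show how the adjustment method changes the reported P-values, the sketch below reimplements three of the listed corrections ('bonferroni', 'holm' and 'BH'/'fdr') in plain Python. It follows the standard definitions of these procedures and is meant for illustration only; the function name and the example P-values are hypothetical.

```python
def p_adjust(pvalues, method="BH"):
    """Adjust a list of P-values for multiple testing; input order is preserved."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest P first
    adjusted = [0.0] * n
    if method == "none":
        return list(pvalues)
    if method == "bonferroni":
        return [min(1.0, p * n) for p in pvalues]
    if method == "holm":
        running = 0.0
        for rank, i in enumerate(order):
            # Step-down: multiply by the number of hypotheses not yet rejected.
            running = max(running, (n - rank) * pvalues[i])
            adjusted[i] = min(1.0, running)
        return adjusted
    if method in ("BH", "fdr"):
        running = 1.0
        for rank in range(n - 1, -1, -1):
            i = order[rank]
            # Step-up: scale by n / rank, then enforce monotonicity.
            running = min(running, pvalues[i] * n / (rank + 1))
            adjusted[i] = running
        return adjusted
    raise ValueError(f"unsupported method: {method}")

# Hypothetical raw P-values for four phosphosites.
raw = [0.01, 0.04, 0.03, 0.005]
bonferroni = p_adjust(raw, "bonferroni")
holm = p_adjust(raw, "holm")
bh = p_adjust(raw, "BH")

# Sites that pass a 0.05 threshold after each correction.
passing = {
    name: [i for i, p in enumerate(adj) if p < 0.05]
    for name, adj in (("bonferroni", bonferroni), ("holm", holm), ("BH", bh))
}
```

With these values, all four sites pass the 0.05 threshold under 'BH' but only two pass under 'bonferroni' or 'holm', which illustrates why the false discovery rate methods are usually preferred when many sites are tested.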
References
- Ressa, A., Bosdriesz, E., De Ligt, J., Mainardi, S., Maddalo, G., Prahallad, A., Jager, M., De La Fonteijne, L., Fitzpatrick, M. and Groten, S. (2018) A system-wide approach to monitor responses to synergistic BRAF and EGFR inhibition in colorectal cancer cells. Molecular & Cellular Proteomics, 17, 1892-1908.
- Feng, J., Ding, C., Qiu, N., Ni, X., Zhan, D., Liu, W., Xia, X., Li, P., Lu, B. and Zhao, Q. (2017) Firmiana: towards a one-stop proteomic cloud platform for data processing and analysis. Nature Biotechnology, 35, 409-412.