PREDIPATH predictions using the predipath-DB presence/absence data
1. Preparation and description of the data
The data were provided as a table of 65 rows by 65 columns. We kept only the 58 genomes (not 59!) annotated as NPA-NP, PA-NP or PA-P (NPA: non plant associated, PA: plant associated, NP: non pathogenic, P: pathogenic). The columns were originally named cluster_1, cluster_2, ... We shortened and normalized these names to CL01, CL02, ...
Here is a short description of the columns (SUM is the number of genomes in which the cluster is present):
Name  SUM
CL01    1
CL02    1
CL03    4
CL04    1
CL05   58
CL06    1
CL07    1
CL08    1
CL09    1
CL10    1
CL11   57
CL12    1
CL13   10
CL14    5
CL15   22
CL16    6
CL17    2
CL18   44
CL19    1
CL20    1
CL21   55
CL22    1
CL23    1
CL24   14
CL25   30
CL26   32
CL27   30
CL28   33
CL29   33
CL30   29
CL31   33
CL32   33
CL33   37
CL34   39
CL35   40
CL36   39
CL37   34
CL38   33
CL39   37
CL40   32
CL41   42
CL42   37
CL43   38
CL44   37
CL45   41
CL46   37
CL47   37
CL48   36
CL49   39
CL50   40
CL51    2
CL52    2
CL53    1
CL54    2
CL55    1
CL56    1
CL57    1
CL58    1
CL59    1
CL60    1
CL61    1
CL62    1
CL63    1
CL64    1
CL65   39

With a small R script, it was easy to get rid of the 1 constant column, named CL05.
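The column screening in this section can be illustrated from the SUM values alone. The sketch below is in Python and is not the authors' R script. It flags a column as constant when the cluster is present in none or all of the 58 genomes. For the near-zero-variance step described further down, the source does not say which criterion was used; the sketch assumes a frequency-ratio rule with a 95/5 cutoff, similar in spirit to the default of caret's nearZeroVar. The names `SUMS`, `FREQ_CUT`, `is_constant` and `frequency_ratio` are illustrative. `SUMS` holds CL05 plus the 29 columns that remain once the redundant columns are removed.

```python
N_GENOMES = 58
FREQ_CUT = 95 / 5  # assumed cutoff (not stated in the source)

# Presence counts copied from the SUM table: CL05 plus the 29 non-redundant columns.
SUMS = {
    "CL01": 1, "CL03": 4, "CL05": 58, "CL11": 57, "CL13": 10, "CL14": 5,
    "CL15": 22, "CL16": 6, "CL17": 2, "CL18": 44, "CL19": 1, "CL21": 55,
    "CL22": 1, "CL24": 14, "CL25": 30, "CL26": 32, "CL27": 30, "CL28": 33,
    "CL30": 29, "CL33": 37, "CL34": 39, "CL35": 40, "CL37": 34, "CL41": 42,
    "CL43": 38, "CL45": 41, "CL48": 36, "CL51": 2, "CL54": 2, "CL56": 1,
}


def is_constant(count, n=N_GENOMES):
    """A 0/1 column is constant when it is all zeros or all ones."""
    return count == 0 or count == n


def frequency_ratio(count, n=N_GENOMES):
    """Ratio of the most frequent value to the least frequent one."""
    most = max(count, n - count)
    least = min(count, n - count)
    return float("inf") if least == 0 else most / least


constant = [name for name, count in SUMS.items() if is_constant(count)]
near_zero = [
    name
    for name, count in SUMS.items()
    if not is_constant(count) and frequency_ratio(count) > FREQ_CUT
]

print("constant:", constant)
print("near-zero variance:", near_zero)
```

Under this assumed cutoff the sketch flags CL05 as constant and the same 8 columns that the report lists as having near-zero variance, which leaves 21 columns.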
With another small R script, it was easy to get rid of 35 equivalent or redundant columns:

    Column  Equal
 1  CL01    CL02 CL04 CL06 CL07 CL08 CL09 CL10 CL12 CL23 CL53 CL55
 3  CL03
10  CL11
12  CL13
13  CL14
14  CL15
15  CL16
16  CL17    CL52
17  CL18
18  CL19    CL20
20  CL21
21  CL22
23  CL24
24  CL25
25  CL26    CL40
26  CL27
27  CL28    CL29 CL31 CL32 CL38
29  CL30
32  CL33    CL39 CL42 CL44 CL46 CL47
33  CL34    CL36 CL49 CL65
34  CL35    CL50
36  CL37
40  CL41
42  CL43
44  CL45
47  CL48
50  CL51
53  CL54
55  CL56    CL57 CL58 CL59 CL60 CL61 CL62 CL63 CL64

For the remaining 29 columns, we removed 8 columns with a near zero variance:
     names2rm
[1,] "CL01"
[2,] "CL11"
[3,] "CL17"
[4,] "CL19"
[5,] "CL22"
[6,] "CL51"
[7,] "CL54"
[8,] "CL56"

So for the remaining part of the analysis, we had only 21 columns and 58 genomes in three classes:
Description of the 3 classes for the 58 genomes
===============================================
        Num  Count  Percentage
NPA-NP    0      7        12 %
PA-NP     1      9        16 %
PA-P      2     42        72 %

Counts per class for the 21 presence/absence data columns in predipath-db
=========================================================================
      NPA-NP  PA-NP  PA-P  total
CL03       2      1     1      4
CL13       4      5     1     10
CL14       3      2     0      5
CL15       7      6     9     22
CL16       4      2     0      6
CL18       3      2    39     44
CL21       6      9    40     55
CL24       0      0    14     14
CL25       0      0    30     30
CL26       0      0    32     32
CL27       0      0    30     30
CL28       0      0    33     33
CL30       0      0    29     29
CL33       0      0    37     37
CL34       0      1    38     39
CL35       0      1    39     40
CL37       0      0    34     34
CL41       0      1    41     42
CL43       0      0    38     38
CL45       0      1    40     41
CL48       0      0    36     36

Percentages per class
=====================
      NPA-NP  PA-NP  PA-P
CL03    29 %   11 %   2 %
CL13    57 %   56 %   2 %
CL14    43 %   22 %   0 %
CL15   100 %   67 %  21 %
CL16    57 %   22 %   0 %
CL18    43 %   22 %  93 %
CL21    86 %  100 %  95 %
CL24     0 %    0 %  33 %
CL25     0 %    0 %  71 %
CL26     0 %    0 %  76 %
CL27     0 %    0 %  71 %
CL28     0 %    0 %  79 %
CL30     0 %    0 %  69 %
CL33     0 %    0 %  88 %
CL34     0 %   11 %  90 %
CL35     0 %   11 %  93 %
CL37     0 %    0 %  81 %
CL41     0 %   11 %  98 %
CL43     0 %    0 %  90 %
CL45     0 %   11 %  95 %
CL48     0 %    0 %  86 %

Some rows were identical but belonged to distinct classes, which makes it impossible to discriminate between those classes for these profiles. 33 duplicate rows were removed, so only 25 distinct row profiles remain, with 21 columns of data.
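The duplicate-row check can be sketched as follows: group genomes by their 21-column profile, then flag any profile shared by genomes of different classes. This is an illustrative Python sketch, not the authors' R script. It uses only the eight genomes that appear in the contradiction tables of this section, with classes coded 0 = NPA-NP, 1 = PA-NP, 2 = PA-P. The names `GENOMES`, `group_profiles` and `find_contradictions` are hypothetical.

```python
from collections import defaultdict

COLUMNS = [
    "CL03", "CL13", "CL14", "CL15", "CL16", "CL18", "CL21", "CL24", "CL25",
    "CL26", "CL27", "CL28", "CL30", "CL33", "CL34", "CL35", "CL37", "CL41",
    "CL43", "CL45", "CL48",
]

TAIL = "0" * 14  # CL24..CL48 are all absent in these eight genomes

# (genome, class, presence/absence profile over COLUMNS)
GENOMES = [
    ("000773975.1", 0, "0001011" + TAIL),
    ("002865965.1", 0, "0001011" + TAIL),
    ("001422605.1", 1, "0001011" + TAIL),
    ("001267535.1", 0, "0111101" + TAIL),
    ("001267545.1", 0, "0111101" + TAIL),
    ("000336255.1", 1, "0111101" + TAIL),
    ("002752575.1", 0, "1101011" + TAIL),
    ("001571305.1", 2, "1101011" + TAIL),
]


def group_profiles(genomes):
    """Map each distinct profile to the (genome, class) pairs that share it."""
    groups = defaultdict(list)
    for name, cls, profile in genomes:
        groups[profile].append((name, cls))
    return dict(groups)


def find_contradictions(groups):
    """Keep only the profiles whose genomes belong to more than one class."""
    return {
        profile: members
        for profile, members in groups.items()
        if len({cls for _, cls in members}) > 1
    }


groups = group_profiles(GENOMES)
conflicts = find_contradictions(groups)

for profile, members in conflicts.items():
    present = [col for col, bit in zip(COLUMNS, profile) if bit == "1"]
    print("present:", present, "->", members)
```

On this subset, every one of the three profiles is shared by genomes of two different classes, so no classifier based on these 21 columns alone can separate them.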
Equal lines
===========
    Line         Equals
 1  000773975.1  6 14
 2  001267535.1  3 10
 4  001484765.1
 5  002752575.1  38
 7  900068895.1
 8  000026185.1
 9  000196615.1  16
11  000745075.1  12
13  001269445.1
15  001517405.1
17  000026985.1  19 58
18  000027205.1  20 25 26 28 31 32 33 35 43 56
21  000165815.1
22  000240705.2  39 41 42 44 45 50 51 53 54 55
23  000367545.1
24  000367565.1
27  000367625.1
29  000367665.1
30  000404125.1  36
34  000590885.1
37  001050515.1
40  002732175.1  46 47 49
48  002732335.1
52  002732425.1
57  002803865.1

Classes and contradictions between classes
==========================================
      Line         Class  Equal.Lines  Classes
 [1,] 000773975.1      0            3  0 0 1
 [2,] 001267535.1      0            3  0 0 1
 [3,] 001484765.1      0            1  0
 [4,] 002752575.1      0            2  0 2
 [5,] 900068895.1      0            1  0
 [6,] 000026185.1      1            1  1
 [7,] 000196615.1      1            2  1 1
 [8,] 000745075.1      1            2  1 1
 [9,] 001269445.1      1            1  1
[10,] 001517405.1      1            1  1
[11,] 000026985.1      2            3  2 2 2
[12,] 000027205.1      2           11  2 2 2 2 2 2 2 2 2 2 2
[13,] 000165815.1      2            1  2
[14,] 000240705.2      2           11  2 2 2 2 2 2 2 2 2 2 2
[15,] 000367545.1      2            1  2
[16,] 000367565.1      2            1  2
[17,] 000367625.1      2            1  2
[18,] 000367665.1      2            1  2
[19,] 000404125.1      2            2  2 2
[20,] 000590885.1      2            1  2
[21,] 001050515.1      2            1  2
[22,] 002732175.1      2            4  2 2 2 2
[23,] 002732335.1      2            1  2
[24,] 002732425.1      2            1  2
[25,] 002803865.1      2            1  2

Here are the genomes that cause a problem:
Contradiction 1
                                          Class CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
000773975.1-Erwinia_typographi|M043b          0    0    0    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
002865965.1-Erwinia_sp.|B116                  0    0    0    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
001422605.1-Erwinia_sp.|Leaf53                1    0    0    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Contradiction 2
                                          Class CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
001267535.1-Erwinia_iniecta|B120              0    0    1    1    1    1    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
001267545.1-Erwinia_iniecta|B149              0    0    1    1    1    1    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
000336255.1-Erwinia_toletana|DAPP-PG735       1    0    1    1    1    1    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Contradiction 3
                                          Class CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
002752575.1-Erwinia_sp.|OLMDLW33              0    1    1    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
001571305.1-Erwinia_persicina|NBRC102418      2    1    1    0    1    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0

2. Prediction for the P and NP classes
With only 7 genomes in the NPA-NP class and 9 genomes in the PA-NP class, it is difficult to obtain any robust result. Moreover, since the same profile can be found in both the NPA-NP class and the PA-NP class, we merged them into a single NP class and tried to predict the two classes NP and P using the 58 genomes.
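The merge is simple arithmetic on the class counts given earlier; a minimal Python sketch follows. The names `CLASS_COUNTS`, `MERGE` and `merge_counts` are illustrative, and the per-column example uses the CL41 row of the counts table.

```python
# Class sizes from the 3-class description table.
CLASS_COUNTS = {"NPA-NP": 7, "PA-NP": 9, "PA-P": 42}

# Both non-pathogenic classes are folded into NP; PA-P becomes P.
MERGE = {"NPA-NP": "NP", "PA-NP": "NP", "PA-P": "P"}


def merge_counts(counts, mapping=MERGE):
    """Sum the counts of the original classes into their merged class."""
    merged = {}
    for label, n in counts.items():
        target = mapping[label]
        merged[target] = merged.get(target, 0) + n
    return merged


merged = merge_counts(CLASS_COUNTS)
total = sum(merged.values())
for label, n in merged.items():
    print(f"{label}: {n} genomes ({round(100 * n / total)} %)")

# The same function merges one row of the per-column counts table, here CL41.
cl41 = merge_counts({"NPA-NP": 0, "PA-NP": 1, "PA-P": 41})
print("CL41 present in:", cl41)
```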
Description of the 2 classes for the 58 genomes
===============================================
     Num  Count  Percentage
NP     1     16        28 %
P      2     42        72 %

Below are the counts and percentages for these two classes, which we also tried to sort:
And the data can be found here (class 1 is NP, class 2 is P):
No single column separates the two classes perfectly, so we used logistic regression to try to predict the class.
A simple binary logistic regression with only one variable at a time gives a first idea of the importance of each variable for the prediction of pathogenicity:
column   AUROC
CL41    0.9568
CL43    0.9524
CL45    0.9449
CL33    0.9405
CL35    0.9330
CL48    0.9286
CL34    0.9211
CL37    0.9048
CL28    0.8929
CL26    0.8810
CL25    0.8571
CL27    0.8571
CL30    0.8452
CL18    0.8080
CL15    0.7991
CL13    0.7693
CL16    0.6875
CL24    0.6667
CL14    0.6562
CL03    0.5818
CL21    0.5074

A multiple binary logistic regression is able to predict the class using 5 variables only:
classe ~ CL16 + CL21 + CL34 + CL35 + CL43

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.45904   0.00000   0.00002   0.00002   2.14597

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     20.86    14530.08    0.001     0.999
CL16           -41.15    21211.30   -0.002     0.998
CL21           -23.06    14530.09   -0.002     0.999
CL34           -45.13    68159.64   -0.001     0.999
CL35            24.76    48196.14    0.001     1.000
CL43            45.13    48826.18    0.001     0.999

    Null deviance: 68.3243  on 57  degrees of freedom
Residual deviance:  6.5017  on 52  degrees of freedom
AIC: 18.502

Number of Fisher Scoring iterations: 21

Auroc : 0.9933

Using this multiple binary logistic regression, only one genome is misclassified: 001571305.1 Erwinia_persicina|NBRC102418. This was expected, since it has the same profile as the NP genome 002752575.1 (Contradiction 3).
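Two of the results above can be checked by hand; the Python sketch below does both. It is an illustration under stated assumptions, not the authors' code.

First, for a single 0/1 predictor the ROC curve has one interior point (FPR, TPR), so the area under it is 0.5 + |TPR - FPR| / 2. The absolute value reflects that a fitted model orients the score toward the better direction. Presence counts are taken from the per-class counts table, with NP = NPA-NP + PA-NP.

Second, the fitted coefficients are applied to the profile shared by 002752575.1 (NP) and 001571305.1 (P). This assumes the model gives the probability of class P, which the signs of the coefficients suggest. The names `PRESENT`, `COEF`, `single_column_auroc` and `prob_pathogenic` are hypothetical.

```python
import math

N_NP, N_P = 16, 42

# (present in NP, present in P), from the per-class counts table.
PRESENT = {
    "CL41": (1, 41),
    "CL43": (0, 38),
    "CL18": (5, 39),
    "CL15": (13, 9),
    "CL21": (15, 40),
}


def single_column_auroc(np_present, p_present):
    """AUROC of a one-variable model built on a binary predictor."""
    fpr = np_present / N_NP  # NP genomes carrying the cluster
    tpr = p_present / N_P    # P genomes carrying the cluster
    return 0.5 + abs(tpr - fpr) / 2


for column, (np_present, p_present) in PRESENT.items():
    print(column, round(single_column_auroc(np_present, p_present), 4))

# Rounded coefficients copied from the model summary above.
COEF = {
    "(Intercept)": 20.86, "CL16": -41.15, "CL21": -23.06,
    "CL34": -45.13, "CL35": 24.76, "CL43": 45.13,
}


def prob_pathogenic(profile):
    """Inverse logit of the linear predictor (assumed to model class P)."""
    eta = COEF["(Intercept)"] + sum(COEF[c] * v for c, v in profile.items())
    return 1.0 / (1.0 + math.exp(-eta))


# Profile shared by 002752575.1 (NP) and 001571305.1 (P).
shared = {"CL16": 0, "CL21": 1, "CL34": 0, "CL35": 0, "CL43": 0}
p = prob_pathogenic(shared)
print("P(pathogenic) for the shared profile:", round(p, 3))
print("deviance residual if truly P: ", round(math.sqrt(-2 * math.log(p)), 2))
print("deviance residual if truly NP:", round(-math.sqrt(-2 * math.log(1 - p)), 2))
```

With the rounded coefficients, the shared profile gets a probability of about 0.10 of being pathogenic, so both genomes are predicted NP and the P genome is the one misclassified. The two deviance residuals come out close to the Max and Min values reported in the summary.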
Here is a graphical representation of the 5 variables:
But it may be easier to read with three distinct graphics:
3. Concluding remarks
Although a statistical solution exists, it may be possible to define a simple profile based on the 5 variables to predict pathogenicity.