predipath-DB presence/absence data

PREDIPATH predictions using the predipath-DB presence/absence data

Table of contents

 1. Preparation and description of the data

 2. Prediction for the P and NP classes

 3. Concluding remarks

1. Preparation and description of the data
The data were provided as a table of 65 rows by 65 columns. We kept only the 58 genomes (not 59!) that were annotated NPA-NP, PA-NP or PA-P (NPA: non-plant-associated, PA: plant-associated, NP: non-pathogenic, P: pathogenic). The columns were originally named cluster_1, cluster_2...; we shortened and normalized these names to CL01, CL02...

Here is a short description of the columns, where SUM is the column total, i.e. the number of genomes in which the cluster is present:
 Name SUM
 CL01 1
 CL02 1
 CL03 4
 CL04 1
 CL05 58
 CL06 1
 CL07 1
 CL08 1
 CL09 1
 CL10 1
 CL11 57
 CL12 1
 CL13 10
 CL14 5
 CL15 22
 CL16 6
 CL17 2
 CL18 44
 CL19 1
 CL20 1
 CL21 55
 CL22 1
 CL23 1
 CL24 14
 CL25 30
 CL26 32
 CL27 30
 CL28 33
 CL29 33
 CL30 29
 CL31 33
 CL32 33
 CL33 37
 CL34 39
 CL35 40
 CL36 39
 CL37 34
 CL38 33
 CL39 37
 CL40 32
 CL41 42
 CL42 37
 CL43 38
 CL44 37
 CL45 41
 CL46 37
 CL47 37
 CL48 36
 CL49 39
 CL50 40
 CL51 2
 CL52 2
 CL53 1
 CL54 2
 CL55 1
 CL56 1
 CL57 1
 CL58 1
 CL59 1
 CL60 1
 CL61 1
 CL62 1
 CL63 1
 CL64 1
 CL65 39
 
A small R script removed the single constant column, CL05 (present in all 58 genomes).
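The R script itself is not shown in the source. As a minimal sketch of the same idea in Python, assuming the table is held as one list of 0/1 values per column (the toy values and the helper name `drop_constant_columns` are hypothetical), a constant column is simply one that takes a single distinct value:

```python
# Toy presence/absence table (hypothetical values): one list of 0/1 per column.
table = {
    "CL03": [1, 0, 0, 1],
    "CL05": [1, 1, 1, 1],  # present in every genome, hence constant
    "CL13": [0, 1, 0, 0],
}


def drop_constant_columns(columns):
    """Keep only the columns that take at least two distinct values."""
    return {name: values for name, values in columns.items()
            if len(set(values)) > 1}


filtered = drop_constant_columns(table)
print(sorted(filtered))
```

A constant column carries no information about the classes, which is why it can be dropped before any modelling.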

Another small R script removed 35 equivalent or redundant columns. Each row below lists a retained column, followed by the columns equal to it:
 Column Equal
 1 CL01 CL02 CL04 CL06 CL07 CL08 CL09 CL10 CL12 CL23 CL53 CL55
 3 CL03
 10 CL11
 12 CL13
 13 CL14
 14 CL15
 15 CL16
 16 CL17 CL52
 17 CL18
 18 CL19 CL20
 20 CL21
 21 CL22
 23 CL24
 24 CL25
 25 CL26 CL40
 26 CL27
 27 CL28 CL29 CL31 CL32 CL38
 29 CL30
 32 CL33 CL39 CL42 CL44 CL46 CL47
 33 CL34 CL36 CL49 CL65
 34 CL35 CL50
 36 CL37
 40 CL41
 42 CL43
 44 CL45
 47 CL48
 50 CL51
 53 CL54
 55 CL56 CL57 CL58 CL59 CL60 CL61 CL62 CL63 CL64
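The grouping shown in this table can be sketched as follows. This is an illustrative Python version, not the R script used for the analysis: the toy data and the function name `group_identical_columns` are assumptions. Each column is keyed by its full 0/1 profile, the first column seen with a given profile is retained, and later identical columns are recorded as its duplicates.

```python
def group_identical_columns(columns):
    """Map each retained column to the later columns identical to it."""
    first_seen = {}  # profile (tuple of 0/1) -> first column with that profile
    groups = {}      # retained column -> list of duplicate columns
    for name, values in columns.items():
        key = tuple(values)
        if key in first_seen:
            groups[first_seen[key]].append(name)
        else:
            first_seen[key] = name
            groups[name] = []
    return groups


# Toy data (hypothetical values), columns listed in their original order.
toy = {
    "CL01": [1, 0, 0, 0],
    "CL02": [1, 0, 0, 0],
    "CL03": [0, 1, 1, 0],
    "CL04": [1, 0, 0, 0],
}
groups = group_identical_columns(toy)
for kept, duplicates in groups.items():
    print(kept, *duplicates)
```

Only the keys of `groups` need to be kept for the rest of the analysis; the duplicates can be restored later when interpreting a selected column.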
 
Among the remaining 29 columns, we removed 8 columns with near-zero variance:
 names2rm
 [1,] "CL01"
 [2,] "CL11"
 [3,] "CL17"
 [4,] "CL19"
 [5,] "CL22"
 [6,] "CL51"
 [7,] "CL54"
 [8,] "CL56"
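The source does not state which criterion was used to flag near-zero variance. One common rule, assumed here, compares the frequency of the most common value to that of the least common one and flags a column when this ratio exceeds 95/5. Applied to the SUM values of the 29 remaining columns, this assumed rule flags the same 8 columns as listed above; the sketch below is in Python and all names in it are illustrative.

```python
N_GENOMES = 58
FREQ_CUT = 95 / 5  # assumed cutoff; the source does not give the criterion

# Number of genomes with presence (SUM) for the 29 remaining columns.
presence_counts = {
    "CL01": 1, "CL03": 4, "CL11": 57, "CL13": 10, "CL14": 5, "CL15": 22,
    "CL16": 6, "CL17": 2, "CL18": 44, "CL19": 1, "CL21": 55, "CL22": 1,
    "CL24": 14, "CL25": 30, "CL26": 32, "CL27": 30, "CL28": 33, "CL30": 29,
    "CL33": 37, "CL34": 39, "CL35": 40, "CL37": 34, "CL41": 42, "CL43": 38,
    "CL45": 41, "CL48": 36, "CL51": 2, "CL54": 2, "CL56": 1,
}


def frequency_ratio(present, total):
    """Ratio of the most frequent value count to the least frequent one."""
    common = max(present, total - present)
    rare = min(present, total - present)
    return float("inf") if rare == 0 else common / rare


near_zero = [name for name, present in presence_counts.items()
             if frequency_ratio(present, N_GENOMES) > FREQ_CUT]
print(near_zero)
```

With this cutoff, CL21 (55 presences, ratio 55/3) and CL03 (4 presences, ratio 54/4) stay just below the threshold, which is consistent with their being kept.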
 
The rest of the analysis therefore uses only 21 columns and 58 genomes in three classes:
 
 Description of the 3 classes for the 58 genomes
 ===============================================
 
 Num Count Percentage
 NPA-NP 0 7 12 %
 PA-NP 1 9 16 %
 PA-P 2 42 72 %
 
 Counts per class for the 21 presence/absence columns in predipath-db
 =====================================================================
 
 NPA-NP PA-NP PA-P total
 CL03 2 1 1 4
 CL13 4 5 1 10
 CL14 3 2 0 5
 CL15 7 6 9 22
 CL16 4 2 0 6
 CL18 3 2 39 44
 CL21 6 9 40 55
 CL24 0 0 14 14
 CL25 0 0 30 30
 CL26 0 0 32 32
 CL27 0 0 30 30
 CL28 0 0 33 33
 CL30 0 0 29 29
 CL33 0 0 37 37
 CL34 0 1 38 39
 CL35 0 1 39 40
 CL37 0 0 34 34
 CL41 0 1 41 42
 CL43 0 0 38 38
 CL45 0 1 40 41
 CL48 0 0 36 36
 
 Percentages per class
 =====================
 
 NPA-NP PA-NP PA-P
 CL03 29 % 11 % 2 %
 CL13 57 % 56 % 2 %
 CL14 43 % 22 % 0 %
 CL15 100 % 67 % 21 %
 CL16 57 % 22 % 0 %
 CL18 43 % 22 % 93 %
 CL21 86 % 100 % 95 %
 CL24 0 % 0 % 33 %
 CL25 0 % 0 % 71 %
 CL26 0 % 0 % 76 %
 CL27 0 % 0 % 71 %
 CL28 0 % 0 % 79 %
 CL30 0 % 0 % 69 %
 CL33 0 % 0 % 88 %
 CL34 0 % 11 % 90 %
 CL35 0 % 11 % 93 %
 CL37 0 % 0 % 81 %
 CL41 0 % 11 % 98 %
 CL43 0 % 0 % 90 %
 CL45 0 % 11 % 95 %
 CL48 0 % 0 % 86 %
 
Some rows were identical. When identical rows belong to distinct classes, they are contradictory, since nothing in the data can discriminate between these classes. After removing the 33 duplicate rows, only 25 distinct row profiles remain for the 21 columns of data.
 Equal lines
 ===========
 
 Line Equals
 1 000773975.1 6 14
 2 001267535.1 3 10
 4 001484765.1
 5 002752575.1 38
 7 900068895.1
 8 000026185.1
 9 000196615.1 16
 11 000745075.1 12
 13 001269445.1
 15 001517405.1
 17 000026985.1 19 58
 18 000027205.1 20 25 26 28 31 32 33 35 43 56
 21 000165815.1
 22 000240705.2 39 41 42 44 45 50 51 53 54 55
 23 000367545.1
 24 000367565.1
 27 000367625.1
 29 000367665.1
 30 000404125.1 36
 34 000590885.1
 37 001050515.1
 40 002732175.1 46 47 49
 48 002732335.1
 52 002732425.1
 57 002803865.1
 
 Classes and contradictions between classes
 ==========================================
 
 Line Class Equal.Lines Classes
 [1,] 000773975.1 0 3 0 0 1
 [2,] 001267535.1 0 3 0 0 1
 [3,] 001484765.1 0 1 0
 [4,] 002752575.1 0 2 0 2
 [5,] 900068895.1 0 1 0
 [6,] 000026185.1 1 1 1
 [7,] 000196615.1 1 2 1 1
 [8,] 000745075.1 1 2 1 1
 [9,] 001269445.1 1 1 1
 [10,] 001517405.1 1 1 1
 [11,] 000026985.1 2 3 2 2 2
 [12,] 000027205.1 2 11 2 2 2 2 2 2 2 2 2 2 2
 [13,] 000165815.1 2 1 2
 [14,] 000240705.2 2 11 2 2 2 2 2 2 2 2 2 2 2
 [15,] 000367545.1 2 1 2
 [16,] 000367565.1 2 1 2
 [17,] 000367625.1 2 1 2
 [18,] 000367665.1 2 1 2
 [19,] 000404125.1 2 2 2 2
 [20,] 000590885.1 2 1 2
 [21,] 001050515.1 2 1 2
 [22,] 002732175.1 2 4 2 2 2 2
 [23,] 002732335.1 2 1 2
 [24,] 002732425.1 2 1 2
 [25,] 002803865.1 2 1 2
 
Here are the genomes involved in these contradictions:
 Contradiction 1
 Class CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
 000773975.1-Erwinia_typographi|M043b 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 002865965.1-Erwinia_sp.|B116 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 001422605.1-Erwinia_sp.|Leaf53 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 
 Contradiction 2
 Class CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
 001267535.1-Erwinia_iniecta|B120 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 001267545.1-Erwinia_iniecta|B149 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 000336255.1-Erwinia_toletana|DAPP-PG735 1 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 
 Contradiction 3
 Class CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
 002752575.1-Erwinia_sp.|OLMDLW33 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 001571305.1-Erwinia_persicina|NBRC102418 2 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
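The detection of identical rows and of contradictions can be sketched as below. This is a Python illustration under stated assumptions, not the script used for the analysis: the function name `find_contradictions` is hypothetical, and the input is limited to the eight genomes listed in the three contradictions above. Rows are grouped by their 21-column profile, and a group is contradictory when it contains more than one class.

```python
from collections import defaultdict

# Classes: 0 = NPA-NP, 1 = PA-NP, 2 = PA-P.
# For these eight genomes the last 14 columns (CL24..CL48) are all 0.
TAIL = (0,) * 14

rows = [
    ("Erwinia_typographi|M043b", 0, (0, 0, 0, 1, 0, 1, 1) + TAIL),
    ("Erwinia_sp.|B116", 0, (0, 0, 0, 1, 0, 1, 1) + TAIL),
    ("Erwinia_sp.|Leaf53", 1, (0, 0, 0, 1, 0, 1, 1) + TAIL),
    ("Erwinia_iniecta|B120", 0, (0, 1, 1, 1, 1, 0, 1) + TAIL),
    ("Erwinia_iniecta|B149", 0, (0, 1, 1, 1, 1, 0, 1) + TAIL),
    ("Erwinia_toletana|DAPP-PG735", 1, (0, 1, 1, 1, 1, 0, 1) + TAIL),
    ("Erwinia_sp.|OLMDLW33", 0, (1, 1, 0, 1, 0, 1, 1) + TAIL),
    ("Erwinia_persicina|NBRC102418", 2, (1, 1, 0, 1, 0, 1, 1) + TAIL),
]


def find_contradictions(rows):
    """Group rows by profile; return the number of distinct profiles and
    the groups whose members do not all share the same class."""
    by_profile = defaultdict(list)
    for genome, label, profile in rows:
        by_profile[profile].append((genome, label))
    contradictions = [members for members in by_profile.values()
                      if len({label for _, label in members}) > 1]
    return len(by_profile), contradictions


n_profiles, contradictions = find_contradictions(rows)
print(n_profiles, "distinct profiles,", len(contradictions), "contradictions")
```

No classifier based on these 21 columns alone can separate the members of a contradictory group, whatever the method.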
 
2. Prediction for the P and NP classes
With only 7 genomes in the NPA-NP class and 9 genomes in the PA-NP class, it is difficult to obtain any robust result. Since, in addition, the same profile is found in both the NPA-NP and PA-NP classes, we merged them into a single NP class and tried to predict the two classes NP and P using the 58 genomes.
 Description of the 2 classes for the 58 genomes
 ===============================================
 
 Num Count Percentage
 NP 1 16 28 %
 P 2 42 72 %
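The merge is a simple recoding of the class labels. A minimal Python sketch, using the class sizes reported earlier (the names `MERGE` and `two_class_counts` are illustrative), recovers the counts and rounded percentages of this table:

```python
from collections import Counter

# Class sizes reported for the three original classes.
three_class_counts = {"NPA-NP": 7, "PA-NP": 9, "PA-P": 42}

# Both non-pathogenic classes are merged into NP; PA-P becomes P.
MERGE = {"NPA-NP": "NP", "PA-NP": "NP", "PA-P": "P"}

two_class_counts = Counter()
for label, count in three_class_counts.items():
    two_class_counts[MERGE[label]] += count

total = sum(two_class_counts.values())
percentages = {label: round(100 * count / total)
               for label, count in two_class_counts.items()}
print(dict(two_class_counts), percentages)
```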
 
Below are the counts and percentages for these two classes, sorted in decreasing order:

 Decreasing counts for the 21 presence/absence data in predipath-db
 ====================================================================

 NP P total
 CL21 15 40 55
 CL18 5 39 44
 CL41 1 41 42
 CL45 1 40 41
 CL35 1 39 40
 CL34 1 38 39
 CL43 0 38 38
 CL33 0 37 37
 CL48 0 36 36
 CL37 0 34 34
 CL28 0 33 33
 CL26 0 32 32
 CL25 0 30 30
 CL27 0 30 30
 CL30 0 29 29
 CL15 13 9 22
 CL24 0 14 14
 CL13 9 1 10
 CL16 6 0 6
 CL14 5 0 5
 CL03 3 1 4

 Decreasing percentages for the 21 columns
 =========================================

 NP P
 CL21 94 % 95 %
 CL15 81 % 21 %
 CL13 56 % 2 %
 CL16 38 % 0 %
 CL14 31 % 0 %
 CL18 31 % 93 %
 CL03 19 % 2 %
 CL34 6 % 90 %
 CL35 6 % 93 %
 CL45 6 % 95 %
 CL41 6 % 98 %
 CL24 0 % 33 %
 CL30 0 % 69 %
 CL25 0 % 71 %
 CL27 0 % 71 %
 CL26 0 % 76 %
 CL28 0 % 79 %
 CL37 0 % 81 %
 CL48 0 % 86 %
 CL33 0 % 88 %
 CL43 0 % 90 %

The data are listed below (class 1 is NP, class 2 is P):

 Class Species CL03 CL13 CL14 CL15 CL16 CL18 CL21 CL24 CL25 CL26 CL27 CL28 CL30 CL33 CL34 CL35 CL37 CL41 CL43 CL45 CL48
 1 Erwinia_sp.|9145 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_sp.|B116 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_sp.|ErVv1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_sp.|Leaf53 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_iniecta|B120 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_iniecta|B149 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_sp.|OLMDLW33 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_gerundensis|NA 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_billingiae|Eb661 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_oleae|DAPP-PG531 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_typographi|M043b 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_billingiae|MYb121 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_billingiae|OSU19-1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_tasmaniensis|Et1_99 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 1 0
 1 Erwinia_toletana|DAPP-PG735 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 Erwinia_teleogrylli|SCU-B244 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 2 Erwinia_amylovora|E-2 0 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|OR6 0 0 0 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|OR1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|CA3R 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|EA110 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|Ea266 0 0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|Ea356 0 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|Ea644 0 0 0 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1
 2 Erwinia_amylovora|LA092 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|LA637 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|LA636 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|LA635 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|UT5P4 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|UPN527 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|CTBT1-1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|CTBT3-1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|MAGFLF2 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1
 2 Erwinia_pyrifoliae|Ejp617 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 2 Erwinia_pyrifoliae|Ep1_96 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 0 1 1 1 1
 2 Erwinia_amylovora|01SFR-BO 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|ACW56400 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|CFBP1232 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 2 Erwinia_amylovora|CFBP2585 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|CFPB1430 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|CTMF03-1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|CTST01-1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|MANB02-1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|NHSB01-1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|NHWL02-2 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|VTBL01-1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|VTDMSF02 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_pyrifoliae|EpK1_15 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 0 1 1 1 1
 2 Erwinia_tracheiphila|PSU-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
 2 Erwinia_amylovora|ATCC49946 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|NBRC12687 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_amylovora|WSDA87-73 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 Erwinia_pyrifoliae|DSM12163 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 0 1 1 1 1
 2 Erwinia_tracheiphila|BuffGH 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
 2 Erwinia_amylovora|RISTBO01-2 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1
 2 Erwinia_mallotivora|BT-MARDI 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0
 2 Erwinia_persicina|NBRC102418 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 2 Erwinia_piriflorinigrans|CFBP5888 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0

There is no simple solution to our problem, so we use logistic regression to try to predict the class. A simple binary logistic regression with only one variable at a time gives a first idea of the importance of each variable for the prediction of pathogenicity:

 column AUROC
 CL41 0.9568
 CL43 0.9524
 CL45 0.9449
 CL33 0.9405
 CL35 0.9330
 CL48 0.9286
 CL34 0.9211
 CL37 0.9048
 CL28 0.8929
 CL26 0.8810
 CL25 0.8571
 CL27 0.8571
 CL30 0.8452
 CL18 0.8080
 CL15 0.7991
 CL13 0.7693
 CL16 0.6875
 CL24 0.6667
 CL14 0.6562
 CL03 0.5818
 CL21 0.5074

A multiple binary logistic regression is able to predict the class using only 5 variables:

 classe ~ CL16 + CL21 + CL34 + CL35 + CL43

 Deviance Residuals:
      Min        1Q    Median        3Q       Max
 -0.45904   0.00000   0.00002   0.00002   2.14597

 Coefficients:
             Estimate Std. Error z value Pr(>|z|)
 (Intercept)    20.86   14530.08   0.001    0.999
 CL16          -41.15   21211.30  -0.002    0.998
 CL21          -23.06   14530.09  -0.002    0.999
 CL34          -45.13   68159.64  -0.001    0.999
 CL35           24.76   48196.14   0.001    1.000
 CL43           45.13   48826.18   0.001    0.999

 Null deviance: 68.3243 on 57 degrees of freedom
 Residual deviance: 6.5017 on 52 degrees of freedom
 AIC: 18.502

 Number of Fisher Scoring iterations: 21

 Auroc : 0.9933

Using this multiple binary logistic regression, only one genome is misclassified, 001571305.1 Erwinia_persicina|NBRC102418, as expected, since its profile is identical to that of the non-pathogenic genome Erwinia_sp.|OLMDLW33 (Contradiction 3).
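For a single presence/absence column, the AUROC can be obtained directly from the NP and P counts, without fitting anything. The Python sketch below is an illustration, not the code used for the analysis, and the function name `single_column_auroc` is hypothetical. It counts the (P, NP) genome pairs in which the P genome has the cluster and the NP genome does not, counts tied pairs as one half, and takes the larger of the result and its complement. That last step assumes, as a fitted one-variable logistic regression does, that the score is oriented towards whichever of presence or absence is more frequent in class P.

```python
def single_column_auroc(np_present, np_total, p_present, p_total):
    """AUROC of one binary column for separating class P from class NP."""
    np_absent = np_total - np_present
    p_absent = p_total - p_present
    # Pairs (P genome, NP genome) ranked correctly when presence signals P.
    concordant = p_present * np_absent
    # Pairs in which both genomes have the same value: counted as one half.
    tied = p_present * np_present + p_absent * np_absent
    auc = (concordant + 0.5 * tied) / (np_total * p_total)
    # The regression can use absence as the signal if that separates better.
    return max(auc, 1 - auc)


# (NP count, P count) taken from the count table of section 2.
counts = {"CL41": (1, 41), "CL43": (0, 38), "CL15": (13, 9), "CL21": (15, 40)}
for name, (np_present, p_present) in counts.items():
    print(name, round(single_column_auroc(np_present, 16, p_present, 42), 4))
```

For these four columns the computation agrees with the table above to four decimals, for instance 643/672 for CL41.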
Here is a graphical representation of the 5 variables:

 [Figure: ../tables2gilles/pre01-06tout.png]

It may be easier to read as three separate plots:

 [Figure: ../tables2gilles/pre01-06cl0.png]
 [Figure: ../tables2gilles/pre01-06cl1.png]
 [Figure: ../tables2gilles/pre01-06cl2.png]

3. Concluding remarks
Although a statistical solution exists, a profile could perhaps be defined using the 5 variables to predict pathogenicity.
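As an illustration of how the five variables combine, the coefficients reported for the multiple logistic regression can be applied to a presence/absence profile. This Python sketch rests on two assumptions: that the fitted response is the probability of class P, and that the coefficients are used as printed, that is, rounded to two decimals. The function name `probability_pathogenic` is hypothetical. Given the very large standard errors, the coefficients are best read as a scoring rule rather than as precise estimates.

```python
import math

# Coefficients as printed in the regression output (rounded values).
INTERCEPT = 20.86
COEF = {"CL16": -41.15, "CL21": -23.06, "CL34": -45.13,
        "CL35": 24.76, "CL43": 45.13}


def probability_pathogenic(profile):
    """Logistic probability of class P for a dict of 0/1 values."""
    eta = INTERCEPT + sum(COEF[name] * profile[name] for name in COEF)
    return 1.0 / (1.0 + math.exp(-eta))


# Profiles read from the data table of section 2.
profiles = {
    "Erwinia_amylovora|ATCC49946":
        {"CL16": 0, "CL21": 1, "CL34": 1, "CL35": 1, "CL43": 1},
    "Erwinia_tasmaniensis|Et1_99":
        {"CL16": 0, "CL21": 1, "CL34": 1, "CL35": 1, "CL43": 0},
    "Erwinia_persicina|NBRC102418":
        {"CL16": 0, "CL21": 1, "CL34": 0, "CL35": 0, "CL43": 0},
}
for genome, profile in profiles.items():
    p = probability_pathogenic(profile)
    print(genome, round(p, 4), "P" if p > 0.5 else "NP")
```

With these values, Erwinia_persicina|NBRC102418 (class P) receives a probability of about 0.1 and is therefore predicted NP, which is the single misclassification mentioned in section 2.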
